23.6 Release


    Microsoft File Connector

    Installation and Configuration

    Copyright ©

    Mindbreeze GmbH, A-4020 Linz, 2023.

    All rights reserved. All hardware and software names used are registered trade names and/or registered trademarks of the respective manufacturers.

    These documents are highly confidential. No rights to our software or our professional services, or results of our professional services, or other protected rights can be based on the handing over and presentation of these documents.

    Distribution, publication or duplication is not permitted.

    The term ‘user‘ is used in a gender-neutral sense throughout the document.

    Video Tutorial „Set up a basic Microsoft File Connector”

    This video describes how to configure the Microsoft File Connector. It covers the necessary preconditions and the index configuration, as well as Active Directory based authentication, LDAP, and how to analyze crawled documents and crawl runs in app.telemetry.

    https://www.youtube.com/watch?v=S2JCrM98W30

    Configuration of Mindbreeze

    Click on the “Indices” tab and then on the “Add new index” symbol to create a new index.

    Enter the index path, e.g. “/data/indices/filesystem”. Change the Display Name of the Index Service and the related Filter Service if necessary.

    Add a new data source with the symbol “Add new custom source” at the bottom right.

    • „Ignore Category Instance”: When multiple file crawlers are configured on an index, the search is not restricted to specific category instances.
    • „Authorization Service“: Currently no Authorization Service is provided for Microsoft File.

    Configuration of Data Source

    Caching Principal Resolution Service

    To use the Caching Principal Resolution Service, select CachingLdapPrincipalResolution. It is then used to resolve the AD group membership of a user during search.

    For more details, see Caching Principal Resolution Service.

    Sources

    Root Directories (UNC Path)

    In this option you can specify which directories should be crawled.

    Notes:

    • Enter one directory per line (maximum 24 directories).
    • If Azure File Shares are crawled, the Kerberos authentication method must be selected in the “Authentication Type” option.
    • When crawling on Linux with the “Content Location Optimization” option, make sure that the root paths are mounted.

    Supports SMBv2/v3

    If disabled, only SMBv1 protocol is used.

    If enabled, SMBv2/v3 protocols are also used.

    Disable SMB Packet Signing
    (Advanced Settings)

    If enabled, no signature is generated for sent SMB packets and the signature is not verified for received packets.

    Encrypt Data

    (Advanced Setting)

    Enables data encryption. Ensure that “Maximum SMB2 Dialect” is either Auto or one of the following SMB2 dialects: 3.0.0, 3.0.2, 3.1.1.

    Disable SMB2 Multi-Protocol Negotiate
    (Advanced Settings)

    If enabled, this can result in better error messages if the server only supports SMBv1.

    Minimum SMB2 Dialect

    (Advanced Setting)

    Supported SMB2 dialects are 2.0.2, 2.1.0, 3.0.0, 3.0.2 and 3.1.1. This value should be less than or equal to the “Maximum SMB2 Dialect”.

    The actual SMB2 dialect used is determined by the result of SMB2 Protocol Negotiation with the file share server.

    Maximum SMB2 Dialect

    (Advanced Setting)

    Supported SMB2 dialects are 2.0.2, 2.1.0, 3.0.0, 3.0.2 and 3.1.1. This value should be greater than or equal to “Minimum SMB2 Dialect”.

    The default value is Auto. For Azure file shares, the value is set to 3.1.1. For all other file shares, it is set to 3.0.2.

    The effective SMB2 dialect is determined by the result of the SMB2 Protocol Negotiation with the file share server.
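
    The dialect rules above can be sketched as a small consistency check. This is only an illustration under stated assumptions: the class SmbDialectSettings and its methods are hypothetical and not part of the connector, and the check covers only what the option descriptions state (minimum must not exceed maximum, "Auto" resolves to 3.1.1 for Azure file shares and 3.0.2 otherwise, and "Encrypt Data" needs a maximum of 3.0.0 or higher).

```java
import java.util.List;

/** Hypothetical helper, not part of the connector: checks SMB2 dialect settings. */
final class SmbDialectSettings {
    // Supported SMB2 dialects in ascending order, as listed in the option descriptions.
    static final List<String> DIALECTS = List.of("2.0.2", "2.1.0", "3.0.0", "3.0.2", "3.1.1");

    /** Resolves "Auto" to the documented default: 3.1.1 for Azure file shares, 3.0.2 otherwise. */
    static String resolveMaximum(String maximum, boolean azureFileShare) {
        if ("Auto".equals(maximum)) {
            return azureFileShare ? "3.1.1" : "3.0.2";
        }
        return maximum;
    }

    /** True if minimum <= maximum and, with encryption enabled, the maximum is an SMB 3.x dialect. */
    static boolean isValid(String minimum, String maximum, boolean azureFileShare, boolean encryptData) {
        int min = DIALECTS.indexOf(minimum);
        int max = DIALECTS.indexOf(resolveMaximum(maximum, azureFileShare));
        if (min < 0 || max < 0) {
            return false; // unknown dialect string
        }
        boolean encryptionOk = !encryptData || max >= DIALECTS.indexOf("3.0.0");
        return min <= max && encryptionOk;
    }
}
```

    For example, a minimum of 2.0.2 with a maximum of Auto passes the check even with encryption enabled, while a minimum of 3.0.2 with a maximum of 2.1.0 does not. The dialect that is actually used still depends on the negotiation with the file share server.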

    SMB Client Transaction Timeout
    (Advanced Settings)

    Here you can specify the thread timeout (in seconds) for SMB connections.

    SMB Client Socket Timeout
    (Advanced Settings)

    Here you can specify the socket timeout (in seconds) for SMB connections.

    Crawl Last Modified Directory Files First
    (Advanced Settings)

    If enabled, while traversing a directory, the files and subdirectories are sorted by modification date.

    This causes the most recently changed files and directories to be crawled first.

    Root Traversal Threads Count

    Here you can set the number of threads that traverse the directories from the "Root Directories" field in parallel.

    Documents Dispatcher Threads Count

    Here you can define the number of threads that send the directories and their documents that are in the "Documents Dispatcher Queue" to the index in parallel.

    Documents Dispatcher Queue Size

    Here you can specify the maximum number of directories and their documents that should be in the queue before they are removed from the queue by "Document Dispatcher Threads" and sent to Index.

    Directory Files Lister Threads Count

    Here you can define the number of threads that retrieve the files, subdirectories and the ACLs of a directory from the filesystem share via SMB. The subdirectories are stored in the "Directory Files Lister Queue". The directories and their files are stored in the "Document Dispatcher Queue".

    Directory Files Lister Queue Size

    Here you can specify the maximum number of queued directories whose files, subdirectories and ACLs have not yet been retrieved from the filesystem share.
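
    The interplay of a bounded queue and a fixed number of worker threads can be illustrated with a generic producer/consumer sketch. This is not the connector's implementation: the class DispatcherSketch, the stop marker and the "sending" step are hypothetical stand-ins. Only the idea is taken from the options above, namely that a queue with a maximum size is drained by a configurable number of parallel threads.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch of a bounded dispatcher queue; not the connector's implementation. */
final class DispatcherSketch {
    // Unique sentinel instance that tells a dispatcher thread to stop (compared by identity).
    private static final List<String> STOP = new ArrayList<>();

    /** Queues each directory's documents and returns how many documents were "sent". */
    static int dispatchAll(List<List<String>> directories, int queueSize, int dispatcherThreads) {
        BlockingQueue<List<String>> queue = new ArrayBlockingQueue<>(queueSize);
        AtomicInteger sent = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(dispatcherThreads);
        for (int i = 0; i < dispatcherThreads; i++) {
            pool.submit(() -> {
                try {
                    for (List<String> dir = queue.take(); dir != STOP; dir = queue.take()) {
                        sent.addAndGet(dir.size()); // stand-in for sending documents to the index
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        try {
            for (List<String> dir : directories) {
                queue.put(dir); // blocks while the queue is full
            }
            for (int i = 0; i < dispatcherThreads; i++) {
                queue.put(STOP); // one stop marker per dispatcher thread
            }
            pool.shutdown();
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("dispatchers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return sent.get();
    }
}
```

    In this sketch a small queue size slows the producer down when the dispatcher threads cannot keep up, which is the usual trade-off between memory use and throughput when tuning queue sizes and thread counts.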

    Document Size Limit (MB)

    Here you can set the maximum document size. Documents larger than this value will be ignored.

    Note: If this value is changed, the "Document Size Limit (MB)" and "Filter RPC Timeout (non-streamed)" options in the Filter Service should also be adjusted.

    Maximum Crawled Content Length in MB.

    If documents exceed the size (in MB) specified in this option, they will be sent to the filter with empty content.
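
    The difference between the two size options can be sketched as a simple decision. The sketch rests on assumptions that the text above does not confirm: 1 MB is taken as 1024 × 1024 bytes, both comparisons are strict, and the document size limit is checked first. The class SizeLimits and its names are hypothetical.

```java
/** Hypothetical sketch of the two size options; names and units are illustrative. */
final class SizeLimits {
    enum Action { IGNORE, SEND_WITHOUT_CONTENT, SEND_WITH_CONTENT }

    private static final long BYTES_PER_MB = 1024L * 1024L; // assumption: binary megabytes

    static Action decide(long fileSizeBytes, long documentSizeLimitMb, long maxCrawledContentLengthMb) {
        if (fileSizeBytes > documentSizeLimitMb * BYTES_PER_MB) {
            return Action.IGNORE; // larger than "Document Size Limit (MB)": document is ignored
        }
        if (fileSizeBytes > maxCrawledContentLengthMb * BYTES_PER_MB) {
            return Action.SEND_WITHOUT_CONTENT; // sent to the filter with empty content
        }
        return Action.SEND_WITH_CONTENT;
    }
}
```

    With a document size limit of 50 MB and a maximum crawled content length of 10 MB, a 20 MB file would be sent with empty content under these assumptions, while a 60 MB file would be ignored.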

    Includes (Regexp)

    If this option is configured, only those files and directories are indexed which match the specified pattern (regular expression).

    Excludes have higher priority than includes (i.e. if a document is both included and excluded, it will not be indexed).

    Excludes (Regexp)

    If this option is configured, those files and directories that match the specified pattern (regular expression) will be ignored.

    Excludes have higher priority than includes (i.e. if a document is both included and excluded, it will not be indexed).
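
    The precedence rule can be sketched as follows. The class IncludeExcludeFilter is hypothetical, and the sketch assumes that a pattern counts as matching when it is found anywhere in the path; how the connector applies the patterns internally is not stated here.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch: excludes take priority over includes. */
final class IncludeExcludeFilter {
    private final Pattern include; // null means "Includes (Regexp)" is not configured
    private final Pattern exclude; // null means "Excludes (Regexp)" is not configured

    IncludeExcludeFilter(String includeRegex, String excludeRegex) {
        this.include = includeRegex == null ? null : Pattern.compile(includeRegex);
        this.exclude = excludeRegex == null ? null : Pattern.compile(excludeRegex);
    }

    boolean isIndexed(String path) {
        // Assumption: a pattern matches if it is found anywhere in the path.
        if (exclude != null && exclude.matcher(path).find()) {
            return false; // excluded, even if an include pattern also matches
        }
        return include == null || include.matcher(path).find();
    }
}
```

    With the include pattern \.(docx|pdf)$ and the exclude pattern /archive/, a PDF inside an archive directory is not indexed in this sketch, because the exclude wins.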

    Include Patterns
    (Advanced Settings)

    Only those files and directories are indexed which match the specified pattern (regular expression).

    In contrast to the "Includes (Regexp)" field, here you can define case-insensitive patterns (regular expressions) with the "regexpIgnoreCase:" prefix and case-sensitive patterns with the "regexp:" prefix. You can also comment out a pattern by putting the "#" character at the beginning of the line.

    Exclude Patterns
    (Advanced Settings)

    Those files and directories which match the specified pattern (regular expression) are ignored.

    In contrast to the "Excludes (Regexp)" field, here you can define case-insensitive patterns (regular expressions) with the "regexpIgnoreCase:" prefix and case-sensitive patterns with the "regexp:" prefix. You can also comment out a pattern by putting the "#" character at the beginning of the line.
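
    A minimal sketch of how such a pattern field could be read, one pattern per line. It assumes that "regexpIgnoreCase:" marks a case-insensitive pattern, that "regexp:" marks a case-sensitive one, and that a line without a prefix is treated as case-sensitive; the last point in particular is a guess. The class PatternLines is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical sketch: parses a multi-line pattern field with prefixes and "#" comments. */
final class PatternLines {
    private static final String IGNORE_CASE_PREFIX = "regexpIgnoreCase:";
    private static final String CASE_SENSITIVE_PREFIX = "regexp:";

    static List<Pattern> parse(String fieldValue) {
        List<Pattern> patterns = new ArrayList<>();
        for (String raw : fieldValue.split("\\R")) { // \R matches any line break
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // blank line or commented-out pattern
            }
            if (line.startsWith(IGNORE_CASE_PREFIX)) {
                String regex = line.substring(IGNORE_CASE_PREFIX.length());
                patterns.add(Pattern.compile(regex, Pattern.CASE_INSENSITIVE));
            } else if (line.startsWith(CASE_SENSITIVE_PREFIX)) {
                patterns.add(Pattern.compile(line.substring(CASE_SENSITIVE_PREFIX.length())));
            } else {
                patterns.add(Pattern.compile(line)); // assumption: no prefix means case-sensitive
            }
        }
        return patterns;
    }
}
```

    Given the three lines regexp:\.PDF$, # regexp:\.tmp$ and regexpIgnoreCase:\.docx$, the sketch yields two patterns: the first does not match a.pdf because of the case difference, and the second matches A.DOCX.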

    Exclude Directories

    If enabled, directories are not indexed.

    Full Traversal Interval (Hours)

    Here you can define the interval (in hours) between two full traversals of all documents in the file share. With the default setting (-1), every crawl run is a full traversal of all documents, which is sufficient for most use cases. For very large file shares it can be useful to configure an interval so that the crawl runs in between are faster incremental traversals. Incremental traversals run at the "Crawler Interval" and index modified documents, but documents deleted from the file share are not removed from the index. These documents are removed from the index at the end of the next full traversal.

    Remove Deleted Documents From Index

    If enabled, the documents deleted from the fileshare will be deleted from the index at the end of a full traversal.
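
    The relationship between the two options can be sketched as follows. The class TraversalPlanner is hypothetical, and the comparison with the elapsed time is an assumption about how a scheduler could apply the interval; only the meaning of -1 and the cleanup at the end of a full traversal are taken from the descriptions above.

```java
/** Hypothetical sketch of how the traversal options relate; not the connector's scheduler. */
final class TraversalPlanner {
    /** With the default -1 every run is a full traversal; otherwise only once the interval has elapsed. */
    static boolean isFullTraversal(long fullTraversalIntervalHours, double hoursSinceLastFullTraversal) {
        if (fullTraversalIntervalHours < 0) {
            return true;
        }
        return hoursSinceLastFullTraversal >= fullTraversalIntervalHours;
    }

    /** Deleted documents leave the index only at the end of a full traversal, if the option is enabled. */
    static boolean removesDeletedDocuments(boolean fullTraversal, boolean removeDeletedDocumentsFromIndex) {
        return fullTraversal && removeDeletedDocumentsFromIndex;
    }
}
```

    With an interval of 24 hours, a run that starts 3 hours after the last full traversal would be incremental in this sketch and would therefore leave deleted documents in the index until the next full traversal.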

    Content Location Optimization

    This option is described in the section "Content Location Optimization".

    Security Rights Settings

    ACL Security Level

    File: The ACLs are calculated per document. Share rights are not included.

    Directory: All documents get only the ACLs of the corresponding directory. Share rights are not included.

    Share: All documents get only the ACLs of the share. To read the share rights, the service user must be a member of the following local (Share Server) groups: Administrator, Power User, Print Operator or Server Operator.

    None: Documents do not get ACLs. May only be configured together with the "Unrestricted Public Access" option of the index.

    Trustee: ACLs are calculated from the Trustee info file.

    Normalize ACLs

    (Advanced Settings)

    If this checkbox is activated, all ACLs are saved in “Distinguished Name” format. If it is not activated, the ACLs remain in SID format. In the latter case, it is important to configure objectsid in the “User Alias Name LDAP Attribute” and “Group Alias Name LDAP Attribute” fields of the selected LDAP principal resolution service.

    Resolve Local Group Members

    (Advanced Settings)

    The ACLs of documents sometimes contain local groups. To resolve the domain users or domain groups inside these local groups, access to the LSA (Local Security Authority) and the SAM (Security Account Manager) via the RPC-SMB protocol is needed. Disabling this option is generally not recommended and should be done only in exceptional cases.

    LSA/SAM Desired Access

    (Advanced Settings)

    The preferred access permission of the crawler service user for LSA and SAM: Maximum allowed, Generic all, Generic execute, Generic Read or Read Control. For crawling a NetApp share, Read Control may need to be selected. If access with the selected permission is not successful, the other access permissions are tried.

    Resolve All Domains

    (Advanced Settings)

    To correctly assign the file permissions (ACLs) of different domains, select the Resolve All Domains option. For this, the LDAP servers of these domains must either be configured directly under "LDAP Server" or be resolvable via DNS SRV records from AD using LDAP. The domains should therefore be configured in the Network tab under LDAP Setting. If "Resolve All Domains" is not selected, only the ACLs from the File Share Server domain are resolved correctly.

    Trustee Information Settings

    • “Trustee Information File Path”: Path to the Trustee information file, which can reside in a local folder or on a UNC path.
    • “Trustee Volume Path”: The volume path of the root folder in the Trustee information file. Can be omitted if the root folder is equal to the volume path.

    Extensions (Index File Lister)

    These are plugins that Mindbreeze can provide to cover special use cases. Instead of indexing files by classical "browsing" through the file trees, the connector is bound to a file, a database or a similar source that contains a list of the files to be indexed. Only the files in these lists are indexed, rather than everything found by "browsing" through all trees. This mechanism is similar to Sitemaps in the Web Connector.

    The Microsoft File Connector provides the interface IndexFileListerPlugin (index-filelister-spi.jar) to list documents together with additional properties from an index file for crawling.

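    import java.util.Collection;
    import java.util.Map;
    import java.util.Properties;

    // ReadonlyFile, FilesystemContext and TypesProtos come from the Mindbreeze JARs named below.
    public interface IndexFileListerPlugin {

        // Tells whether the given file is an index file that lists documents to crawl.
        boolean isIndexFile(ReadonlyFile file);

        // Initializes the plugin with its configured properties.
        void init(Properties properties);

        // Lists the documents of an index file together with their additional properties.
        Collection<Map.Entry<ReadonlyFile, TypesProtos.Item>> listIndexFile(FilesystemContext context, ReadonlyFile indexFile);

    }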

    To implement the IndexFileListerPlugin, you need messdk-generated.jar and protobuf-java-3.0.0.jar from the Java service API together with index-filelister-spi.jar. After implementing the plugin, configure it as follows: provide the path of the JAR file containing the implementation in the „Index File Lister Plugin“ field. The optional „Index File Lister Plugin Property“ fields define the properties with which the plugin is initialized.

    During directory traversal, index files are placed in a queue („Queue Size“) and are then processed by parallel threads („Thread Count”). The option „Skip unchanged Index File Listing during Incremental Traversal“ should be selected only if the option „Full Traversal Interval” is also configured. Index files that have not changed are then ignored during incremental traversals (“Crawler Interval”).

    The Microsoft File Connector contains a preconfigured content mapping file (XML) which provides the rules to be applied to documents according to their content type. Sometimes it is necessary to change these rules and save the mapping file in a separate location. To use the modified mapping file, configure its location in “Content Type Mapping Description File”.

    Content Location Optimization

    For crawling large files, for example Outlook PST files, it is beneficial to use Content Location Optimization.

    Configure the mount point with the following options:

    • “Root Directory (UNC Path)”: Use the same root path you used in the source configuration above.
    • “Root Directory (Mount Path)”: The local path to which the UNC path is mounted.
    • “Files Pattern (Regex)”: A regex pattern matching those files which should be indexed using Content Location Optimization.

    The Content Location Optimization feature requires that the UNC Path is mounted locally. This can be configured using the System Configuration in the Management Center:

    1. Create the local folder using Filemin.
    2. Grant permissions to the Mindbreeze user (mes).
    3. Add a CIFS mount using the “Disk and Network Filesystems” module.
    4. Configure the mount.
    5. After you press “Create”, the network filesystem is mounted and ready for use.
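
    The relationship between the UNC root, the mount path and the files pattern can be sketched as a path translation. This is an illustration only: the class ContentLocationMapper is hypothetical, the server name and paths in the usage note are made up, and the sketch assumes that the pattern counts as matching when it is found anywhere in the UNC path.

```java
import java.util.Optional;
import java.util.regex.Pattern;

/** Hypothetical sketch: maps a UNC path to its locally mounted path. */
final class ContentLocationMapper {
    private final String uncRoot;
    private final String mountRoot;
    private final Pattern filesPattern;

    ContentLocationMapper(String uncRoot, String mountRoot, String filesPatternRegex) {
        this.uncRoot = stripTrailingSeparators(uncRoot);
        this.mountRoot = stripTrailingSeparators(mountRoot);
        this.filesPattern = Pattern.compile(filesPatternRegex);
    }

    /** Returns the local path, or empty if the file is outside the root or does not match the pattern. */
    Optional<String> toLocalPath(String uncPath) {
        if (!uncPath.startsWith(uncRoot) || !filesPattern.matcher(uncPath).find()) {
            return Optional.empty();
        }
        String relative = uncPath.substring(uncRoot.length());
        if (!relative.isEmpty() && relative.charAt(0) != '\\' && relative.charAt(0) != '/') {
            return Optional.empty(); // e.g. \\server\share2 must not match the root \\server\share
        }
        return Optional.of(mountRoot + relative.replace('\\', '/'));
    }

    private static String stripTrailingSeparators(String path) {
        int end = path.length();
        while (end > 0 && (path.charAt(end - 1) == '\\' || path.charAt(end - 1) == '/')) {
            end--;
        }
        return path.substring(0, end);
    }
}
```

    With the made-up UNC root \\fileserver\share, the mount path /mnt/share and the files pattern \.pst$, the file \\fileserver\share\mail\archive.pst maps to /mnt/share/mail/archive.pst in this sketch, while a .docx file in the same directory yields no local path.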

    Crawling Outlook PST Files

    In addition to the File Crawler configuration above, you also need to add an Outlook PST data source to crawl PST files. Remove “Default” from the Category Instance field.

    Finally, ensure that a Filter Plugin is enabled for the .pst extension.

    Credentials

    The user must have read permissions for the shared directory that is to be crawled. The credentials for this can be configured in the following "Credentials" area.

    • Username: The username of the user.
    • Domain: The user's domain name.
    • Password: The user's domain password.
    • Authentication Type (Advanced Settings): Here you can specify which authentication method should be used.

    NTLM authentication is used by default. It requires "Username", "Domain" and "Password" to be configured.

    If Kerberos authentication is selected, a Kerberos keytab and principal must be selected for the crawler in the “Authentication” tab. More information can be found here.
    Alternatively, "Username", "Domain" and "Password" can be configured for Kerberos as well, but this is not recommended.
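
    The credential rules can be summarized in a small check. The class CredentialCheck and its parameter names are hypothetical. The rules themselves follow the descriptions in this document: NTLM needs username, domain and password, Kerberos needs a keytab and principal (or, not recommended, username, domain and password), and Azure File Shares need Kerberos.

```java
/** Hypothetical sketch: checks whether the credential settings are complete. */
final class CredentialCheck {
    enum AuthType { NTLM, KERBEROS }

    static boolean isComplete(AuthType type, String username, String domain, String password,
                              boolean keytabAndPrincipalSelected, boolean azureFileShare) {
        if (azureFileShare && type != AuthType.KERBEROS) {
            return false; // Azure File Shares require Kerberos
        }
        boolean userPassConfigured = notBlank(username) && notBlank(domain) && notBlank(password);
        if (type == AuthType.NTLM) {
            return userPassConfigured;
        }
        // Kerberos: keytab and principal preferred; username, domain and password as a fallback.
        return keytabAndPrincipalSelected || userPassConfigured;
    }

    private static boolean notBlank(String value) {
        return value != null && !value.trim().isEmpty();
    }
}
```

    In this sketch an NTLM configuration for an Azure File Share is reported as incomplete even when username, domain and password are set.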

    Additional Settings

    • “Always Update Files Matching Regex”: Documents matching this regex will always be sent to filter service no matter if they were changed or not.
    • “Ignore Content of Documents without Extension”: If this checkbox is selected, the automatic MIME type detection is deactivated for documents without an extension. The contents of these documents are not indexed.
    • “Disable Default Extension”: If selected, documents that have no extension and for which the automatic MIME type detection fails are not assigned a default extension.
    • “Fetch Preview Content from Datasource”: To provide a PDF preview for PDF documents, the binary content of PDF documents is stored in the index. If this option is selected, the binary content is fetched directly from the data source instead. Storing the PDF content in the index can then be disabled in the Filter Configuration, which reduces the disk usage of the index.
    • “Enable Heap Dump On OutOfMemory”: If the crawler needs more memory than configured in Plugins.xml (<vm_arg>), a heap dump is generated in the log directory.
    • “Max. Retry Duration by Filter Connection Problems”: Maximum amount of time the crawler is allowed to retry sending a document to the filter service during connection problems.
    • “Retry Interval during Repository Connection Problems”: The amount of time the crawler waits before retrying to connect to the data source during connection problems.
    • “Max. Retry Duration during Repository Connection Problems”: Maximum amount of time the crawler is allowed to retry connecting to the data source during connection problems.
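
    The interplay of a retry interval and a maximum retry duration can be illustrated with a generic retry loop. This is not the crawler's code: the class RetrySketch is hypothetical, and the unit (milliseconds) is chosen only for the example, since the units of the options are not stated here.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical sketch of retrying with a fixed interval and a maximum total duration. */
final class RetrySketch {
    /** Returns true as soon as an attempt succeeds, false once the maximum duration is used up. */
    static boolean retry(BooleanSupplier attempt, long retryIntervalMillis, long maxRetryDurationMillis) {
        long start = System.nanoTime();
        long maxNanos = maxRetryDurationMillis * 1_000_000L;
        long intervalNanos = retryIntervalMillis * 1_000_000L;
        while (true) {
            if (attempt.getAsBoolean()) {
                return true;
            }
            long elapsed = System.nanoTime() - start;
            if (elapsed + intervalNanos > maxNanos) {
                return false; // another wait would exceed the maximum retry duration
            }
            try {
                Thread.sleep(retryIntervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

    A short interval reacts faster when the connection comes back, while the maximum duration bounds how long a document or a connection attempt can hold up the crawl.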

    Open Search Results

    Search results from a Microsoft File datasource (Microsoft Word, Microsoft Excel and Microsoft Powerpoint) are opened on Windows 10 directly in the respective program if the current user is signed in to the respective fileserver and Microsoft Office 2019 is installed.
