Microsoft File Connector

Installation and Configuration

Copyright ©

Mindbreeze GmbH, A-4020 Linz, 2018.

All rights reserved. All hardware and software names used are registered trade names and/or registered trademarks of the respective manufacturers.

These documents are highly confidential. No rights to our software or our professional services, or results of our professional services, or other protected rights can be based on the handing over and presentation of these documents.

Distribution, publication or duplication is not permitted.

The term ‘user’ is used in a gender-neutral sense throughout the document.

Configuration of Mindbreeze

Click on the “Indices” tab and then on the “Add new index” symbol to create a new index.

Enter the index path, e.g. “/data/indices/filesystem”. Change the Display Name of the Index Service and the related Filter Service if necessary.

Add a new data source with the symbol “Add new custom source” at the bottom right.

Configuration of Data Source

Caching Principal Resolution Service

To use the Caching Principal Resolution Service, select CachingLdapPrincipalResolution. It is then used to resolve a user’s AD group memberships during search.

For more details, see the Caching Principal Resolution Service documentation.

Sources

  • “Root Paths”: The root path must be a UNC path.
  • “Supports SMB 2”: If selected, the crawler uses the SMB2 protocol.
  • “Disable SMB Packet Signing”: SMB packets are sent without a signature, and the signatures of received SMB packets are not verified.
  • “Thread Count”: Both the traversal of directories and the retrieval of documents are performed in parallel with this number of threads.
  • “Batch Size”: The size of the queue in which documents are queued before their properties and content are retrieved.
  • “Includes”: Only files and directories whose paths match this pattern are crawled (case sensitive).
  • “Document Size Limit (MB)”: The crawler ignores documents larger than this size. If this limit is changed, the limit and the rpc-timeout of the filter service should be adapted as well.
  • “Include Patterns”: The prefix “regexpIgnoreCase:” allows case-insensitive regular expressions.
  • “Excludes”: Files and directories whose paths match this pattern are not crawled (case sensitive).
  • “Exclude Patterns”: The prefix “regexpIgnoreCase:” allows case-insensitive regular expressions.
  • “Exclude Directories”: If selected, directories are not crawled.
  • “Always Use Directory Rights”: If selected, security permissions set directly on files are ignored.
  • “Full Traversal Interval (Hours)”: The interval between two full traversals of all documents in the file share. Modified documents are crawled during the incremental traversal after each “Crawler Interval”.
  • “Remove Deleted Documents from Index”: If selected, documents that have been deleted from the file share are deleted from the index at the end of a full traversal.
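The “regexpIgnoreCase:” prefix is interpreted by the connector itself; the following self-contained Java sketch merely illustrates the matching semantics we assume (the compileIncludePattern helper is hypothetical and not part of the connector):

```java
import java.util.regex.Pattern;

public class IncludePatternDemo {
    // Hypothetical helper: strips the assumed "regexpIgnoreCase:" prefix and
    // compiles the remainder case-insensitively; plain patterns stay case sensitive.
    static Pattern compileIncludePattern(String pattern) {
        String prefix = "regexpIgnoreCase:";
        if (pattern.startsWith(prefix)) {
            return Pattern.compile(pattern.substring(prefix.length()), Pattern.CASE_INSENSITIVE);
        }
        return Pattern.compile(pattern);
    }

    public static void main(String[] args) {
        Pattern caseSensitive = compileIncludePattern(".*\\.pdf");
        Pattern caseInsensitive = compileIncludePattern("regexpIgnoreCase:.*\\.pdf");

        // A plain pattern does not match an upper-case extension...
        System.out.println(caseSensitive.matcher("\\\\server\\share\\Report.PDF").matches());
        // ...while the prefixed pattern does.
        System.out.println(caseInsensitive.matcher("\\\\server\\share\\Report.PDF").matches());
    }
}
```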

Extensions (Index File Lister)

Microsoft File Connector provides the interface IndexFileListerPlugin (index-filelister-spi.jar) for listing documents, together with additional properties, from an index file for crawling.

public interface IndexFileListerPlugin {

  boolean isIndexFile(ReadonlyFile file);

  void init(Properties properties);

  Collection<Map.Entry<ReadonlyFile, TypesProtos.Item>> listIndexFile(FilesystemContext context, ReadonlyFile indexFile);

}


messdk-generated.jar and protobuf-java-3.0.0.jar from the Java service API, together with index-filelister-spi.jar, are needed to implement the IndexFileListerPlugin. After implementing the plugin, configure it as follows: provide the path of the JAR file containing the implementation in the “Index File Lister Plugin” field. The optional “Index File Lister Plugin Property” fields define the properties with which the plugin is initialized.
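As a rough illustration of what a listIndexFile implementation might do, the following self-contained sketch parses a purely hypothetical index-file format (“path|key=value;key=value”) into document entries with properties. The real SPI types (ReadonlyFile, TypesProtos.Item, FilesystemContext) from index-filelister-spi.jar are not reproduced here:

```java
import java.util.*;

public class IndexFileSketch {

    // Hypothetical format: one document per line, "path|key=value;key=value".
    // In a real plugin this logic would build TypesProtos.Item instances
    // keyed by ReadonlyFile; here plain maps stand in for those SPI types.
    static Map<String, Map<String, String>> parseIndexFile(List<String> lines) {
        Map<String, Map<String, String>> entries = new LinkedHashMap<>();
        for (String line : lines) {
            if (line.isEmpty() || line.startsWith("#")) continue; // skip comments
            String[] parts = line.split("\\|", 2);
            Map<String, String> props = new LinkedHashMap<>();
            if (parts.length == 2 && !parts[1].isEmpty()) {
                for (String kv : parts[1].split(";")) {
                    String[] pair = kv.split("=", 2);
                    props.put(pair[0], pair.length > 1 ? pair[1] : "");
                }
            }
            entries.put(parts[0], props);
        }
        return entries;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "# hypothetical index file",
            "\\\\server\\share\\doc1.pdf|author=smith;department=sales",
            "\\\\server\\share\\doc2.pdf|");
        System.out.println(parseIndexFile(lines).keySet());
    }
}
```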

Content Location Optimization

For crawling large files, such as Outlook PST files, it is beneficial to use Content Location Optimization.

Configure the mount point accordingly. The following configuration options are needed:

  • “Root Directory (UNC Path)”: Use the same root path as in the source configuration above.
  • “Root Directory (Mount Path)”: The local path to which the UNC path is mounted.
  • “Files Pattern (Regex)”: A regex pattern matching the files that should be indexed using Content Location Optimization.

The Content Location Optimization feature requires that the UNC Path is mounted locally. This can be configured using the System Configuration in the Management Center:

  1. Create the local folder using Filemin.
  2. Grant permissions to the Mindbreeze user (mes).
  3. Add a CIFS mount using the “Disk and Network Filesystems” module.
  4. Configure the mount.
  5. After you press “Create”, the network filesystem is mounted and ready for use.
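The Management Center steps above result in an ordinary CIFS mount. As a rough sketch, the equivalent /etc/fstab entry would look like this (server name, share, mount path, and credentials file are placeholders; the exact options depend on your environment):

```
//fileserver.example.com/share  /mnt/fileshare  cifs  credentials=/etc/cifs-credentials,uid=mes  0  0
```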

Crawling Outlook PST Files

In addition to the File Crawler configuration above, you also need to add an Outlook PST data source to crawl PST files. Remove “Default” from the “Category Instance” field.

Finally, ensure that a filter plugin is enabled for the .pst extension.

Credentials

The user must have read permission on the share in order to crawl it.

  • “Username”: The name of the user used to access the share.
  • “Domain”: The domain of this user.
  • “Password”: The password of this user.
  • “LDAP Server”: Provide this only if you want to override the LDAP setting configured under the “Network Setting” tab.

Additional Settings

  • “Always Update Files Matching Regex”: Documents matching this regex are always sent to the filter service, whether or not they have changed.
  • “Enable Heap Dump On OutOfMemory”: If the crawler needs more memory than configured in the Plugins.xml <vm_arg>, a heap dump is generated in the log directory.
  • “Max. Retry Duration by Filter Connection Problems”: The maximum amount of time the crawler retries sending a document to the filter service during connection problems.
  • “Retry Interval during Repository Connection Problems”: The amount of time the crawler waits before retrying to connect to the data source during connection problems.
  • “Max. Retry Duration during Repository Connection Problems”: The maximum amount of time the crawler retries connecting to the data source during connection problems.
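The interval and maximum-duration settings interact as “retry every interval until the maximum duration elapses”. The following self-contained Java sketch shows that policy as we read it; it is an illustration, not the connector’s actual retry code (the clock is injected so the loop is testable without sleeping):

```java
import java.util.function.Supplier;

public class RetryDemo {
    // Sketch of the assumed retry policy: retry a failing connection attempt
    // every retryIntervalMs until maxRetryDurationMs has elapsed.
    // "clock" holds a simulated time in milliseconds so no real waiting occurs.
    static boolean retry(Supplier<Boolean> attempt, long retryIntervalMs,
                         long maxRetryDurationMs, long[] clock) {
        long start = clock[0];
        while (true) {
            if (attempt.get()) return true;                    // connected
            if (clock[0] - start + retryIntervalMs > maxRetryDurationMs)
                return false;                                  // max duration reached
            clock[0] += retryIntervalMs;                       // simulated wait
        }
    }

    public static void main(String[] args) {
        long[] clock = {0};
        int[] calls = {0};
        // Fails 3 times, then succeeds; interval 10 s, max duration 60 s.
        boolean ok = retry(() -> ++calls[0] > 3, 10_000, 60_000, clock);
        System.out.println(ok + " after " + calls[0] + " attempts");
    }
}
```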

Open Search Results

To open search results from a Microsoft File data source, the current user has to be signed in to the corresponding file server.