Copyright ©
Mindbreeze GmbH, A-4020 Linz, 2024.
All rights reserved. All hardware and software names used are registered trade names and/or registered trademarks of the respective manufacturers.
These documents are highly confidential. No rights to our software or our professional services, or results of our professional services, or other protected rights can be based on the handing over and presentation of these documents.
Distribution, publication or duplication is not permitted.
The term "user" is used in a gender-neutral sense throughout the document.
In our tutorial video you will find all necessary steps to set up the Microsoft SharePoint Connector:
https://www.youtube.com/watch?v=yzTyTz1SpXo
Before installing the Microsoft SharePoint Connector, ensure that the Mindbreeze Server is already installed and that this connector is included in the Mindbreeze license.
The Microsoft SharePoint Connector allows you to index and search Microsoft SharePoint items and objects.
The following requirements must be met before configuring a Microsoft SharePoint data source:
Adding a user to the SharePoint site administrators can be done as follows:
Selecting authentication policy for Web Applications can be done as follows:
Navigate to Central Administration > Manage service applications > User Profile Service Application.
The services for SharePoint must be installed as follows:
Save the SharePoint SSL certificate to a file, for example c:\temp\sharepointserver.cer.
Installation:
<jre_home>/bin/keytool -import -noprompt -trustcacerts -alias sharepointserver -file c:\temp\sharepointserver.cer -keystore ../lib/security/cacerts -storepass changeit
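To verify that the certificate was imported successfully, it can be listed again (a sketch, assuming the same alias and keystore path as above):
<jre_home>/bin/keytool -list -alias sharepointserver -keystore ../lib/security/cacerts -storepass changeit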
Click on the “Indices” tab and then on the “Add new index” symbol to create a new index.
Enter the index path, e.g. “/data/indices/sharepoint”. Change the Display Name of the Index Service and the related Filter Service if necessary.
Add a new data source with the symbol “Add new custom source” at the bottom right.
This information is only needed when basic authentication is used:
If the SharePoint Principal Cache is used, it is possible to configure credential information in the Network tab (section Endpoints).
https://msdn.microsoft.com/en-us/library/bb498017.aspx
You can select one of the following three caching principal resolution services.
CachingLdapPrincipalResolution: If selected, it is used to resolve a user's AD group membership when searching. However, the SharePoint groups in the ACLs must be resolved while crawling. To do this, select "Resolve SharePoint Groups". Do not select "Use ACLs References". "Normalize ACLs" can be selected. For details on configuring the caching principal resolution service, see Caching Principal Resolution Service.
SharePointPrincipalResolutionCache: If selected, it is used to resolve a user’s SharePoint group membership when searching. This service also resolves the user’s AD group membership. Therefore, it is no longer necessary to select "Resolve SharePoint Groups". Do not select "Use ACLs References” in this case. “Normalize ACLs” can be selected. (Also see section Configuration of SharePointPrincipalCache)
SharePointACLReferenceCache: When selected, the URLs from the SharePoint site, SharePoint list, and folder of the document are saved as ACLs during crawling to speed up the crawl. "Use ACLs References" must be selected in this case. "Resolve SharePoint Groups" and "Normalize ACLs" must not be selected. (Also see section Configuration of SharepointACLReferenceCache)
The SharePoint crawler initially detects all SharePoint sites of the SharePoint server configured in "SharePoint Server URL". Alternatively, you can enter the path of a CSV file in the field "Include Sites File" that lists only those sites (URLs) that should be indexed. You can also limit crawling to specific pages (URLs) by restricting them with a regular expression in the field "Included URL", or exclude certain pages (URLs) from crawling with a regular expression in the field "Excluded URL". A regular expression must have a "regexp:" or "regexpIgnoreCase:" prefix.
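For illustration, assuming a hypothetical SharePoint server under https://sharepoint.mycompany.com, the fields could be filled as follows:
Included URL: regexp:https://sharepoint\.mycompany\.com/sites/.*
Excluded URL: regexpIgnoreCase:.*archive.*
An "Include Sites File" (CSV) could simply list one site URL per line:
https://sharepoint.mycompany.com/sites/marketing
https://sharepoint.mycompany.com/sites/sales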
For crawling user profiles, "Crawl User Profile" must be selected and the "MySite URL" and "Collection Name for User Profiles" must be configured accordingly.
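A hypothetical example configuration for user profile crawling (all values are illustrative only):
Crawl User Profile: selected
MySite URL: https://mysite.mycompany.com
Collection Name for User Profiles: Profiles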
Please note that the "Include URL" and "Exclude URL" restrictions are applied afterwards: a site URL that is excluded by the "Exclude URL" rule will not be crawled even if it is listed in the "Include Sites File".
The option "Use ACLs References" should only be selected if "SharePointACLReferenceCache" is selected as the "Principal Resolution Service Cache" (see: Configuration of SharepointACLReferenceCache).
Moving documents from one directory to another also changes the URLs of these documents. To update these changes in the index, select the “Track Document URL Changes” option.
If the option "Track Only Effective ACL Changes of Web Application Policy" is not selected, any change to the permissions of a web application policy (for example, changing a user's permission from Full Read to Full Control) that does not effectively change the granted permissions in Mindbreeze will still cause recrawling and rechecking of the ACLs of all documents in all sites of that web application.
The option "Resolve SharePoint Groups" should not be selected if "SharePointPrincipalCache" is selected as the "Principal Resolution Service Cache" (see: Configuration of SharePointPrincipalCache). By configuring “Normalize ACLs,” all AD users and groups are converted to ACLs in “Distinguished Name” format. To crawl SharePoint pages with anonymous access rights, select "Include Documents without ACLs". If you want to exclude SharePoint pages from crawling by activating certain features, it is necessary to enter the ID (GUID) of these features in the field "Exclude Documents From Sites With These Features".
To provide documents with open URLs according to the "Alias URLs" configuration, "Rewrite Open URL" must be selected. If the service user does not have access to the internal download URLs of documents, these URLs can be rewritten using the URLs configured in the "Alias URLs" configuration.
The external URLs in SharePoint Alternative Access Mapping configuration should be in FQDN format.
To crawl content types that are not crawled by default, enter a regular expression pattern matching these additional content types in the field "Additional Content Types (regex)".
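For example, to additionally crawl hypothetical content types named "Contract" and "Invoice", a pattern such as the following could be entered:
(Contract|Invoice).*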
To crawl documents in unpublished state, select "Enabled" from the "Include Unpublished Documents" dropdown list, or "Last Major Version" to crawl the last major version of unpublished documents. Make sure that at least version 20.3 of the prerequisites (includes MesLists.asmx) is installed on the SharePoint Server.
The SharePoint Connector contains a preconfigured content mapping file (XML) that provides the rules applied to documents according to their content type. If these rules need to be changed, save the modified mapping file in a separate location and enter its path in "Content Type Mapping Description File". One of the important rule types includes or excludes documents with specific content types. If "Delete Ignored Documents from Index" is selected, documents that were already crawled with different mapping rules are deleted from the index if they are no longer included.
To extract metadata from HTML content, the following configuration is needed.
Contains settings for diagnostic purposes.
When opening Office documents from the search result in Internet Explorer, the opened documents can be edited and saved in SharePoint. This requires write permissions to the document. When using other browsers, the documents are opened read-only.
Windows:
If the installation is made on a Microsoft Windows Server, the Kerberos authentication of the current Mindbreeze Service user can also be used for the Microsoft SharePoint Crawler. In this case, the Service user must be authorized to access the Microsoft SharePoint Web Services.
Linux:
Find the documentation for Linux here: Configuration - Kerberos Authentication
The following chapters explain the configuration of the SharePointPrincipalCache and the SharePointACLReferenceCache. For additional configuration options, including how to create a cache and perform the basic configuration of a cache for a Principal Resolution Service, see Installation & Configuration - Caching Principal Resolution Service.
Use LDAP Principals Cache Service | If this option is enabled, the group memberships from the parent cache are calculated first and the results are passed to the child cache. This allows the current cache to use the results of the parent cache for lookups. |
LDAP Principals Cache Service Port | The port used for the "Use LDAP Principals Cache Service" option if enabled. |
Identity Encryption Credential | This option allows you to display the user identity in encrypted form in app.telemetry. |
Cache In Memory Items Size | Number of items stored in the cache. Depends on the available memory of the JVM. |
Database Directory Path | Defines the directory path for the cache, e.g. /data/principal_resolution_cache. If a Mindbreeze Enterprise product is used, a path must be set. If a Mindbreeze InSpire product is used, the path must not be set. If the directory path is not defined, the following path is used under Linux: /data/currentservices/<server name>/data. |
Group Members Resolution And Inversion Threads | This option determines the number of threads that will resolve group members at the same time and invert those groups. Values less than 1 are assumed to be 1. |
In-Memory Containers Inversion Threshold (Advanced Setting) | This option sets the maximum number of groups. If this number is exceeded, further RAM consumption during inversion is avoided by using hard drives. |
Generally, if you are having trouble indexing a SharePoint data source, you should first look at the Mindbreeze log folders.
The Mindbreeze base log folder also contains a sub-folder for the SharePoint crawler, whose name is similar to the following example:
C:\logs\current\log-mescrawler_launchedservice-Microsoft_SharePoint_Sharepoint+2007
This folder will contain several date-based sub-folders each containing two main log files:
If the file mes-pusher.csv does not appear, basic configuration or permission problems may be preventing the crawler from retrieving documents from SharePoint; these should be recorded in the base log file mentioned above.
Problem Cause:
The crawler does not retrieve any documents from SharePoint and therefore does not create the log file mes-pusher.csv.
The log file log-mescrawler_launchedservice.log may contain error messages similar to the following:
com.mindbreeze.enterprisesearch.gsabase.crawler.InitializationException: Invalid connector config: message Cannot connect to the given SharePoint Site URL with the supplied Domain/Username/Password.Reason:(401)Unauthorized
Or:
com.mindbreeze.enterprisesearch.gsabase.crawler.InitializationException: Unable to set connector config, response message: Cannot connect to the Services for SharePoint on the given Crawl URL with the supplied Domain/Username/Password.Reason:(401)Unauthorized, status message:null, status code:5223 (INVALID_CONNECTOR_CONFIG)
Or:
enterprise.connector.sharepoint.wsclient.soap.GSBulkAuthorizationWS INTERNALWARNING: Can not connect to GSBulkAuthorization web service. cause:(401)Unauthorized
Problem description and solution:
The service user in use is not allowed to obtain the file listings from SharePoint, either because the login fails or because its permissions inside SharePoint are insufficient.
The following issues have to be checked:
Problem Cause:
The crawler does not retrieve any documents from SharePoint and therefore does not create the log file mes-pusher.csv.
The log file log-mescrawler_launchedservice.log may contain an error message similar to the following:
com.mindbreeze.enterprisesearch.gsabase.crawler.InitializationException: Unable to set connector config, response message: The SharePoint Site URL must contain a fully qualified domain name., status message:null, status code:5223 (INVALID_CONNECTOR_CONFIG)
Problem description and solution:
In order to use the Mindbreeze SharePoint Connector, the target SharePoint server must be accessed using its fully qualified domain name (FQDN), e.g. spserver2007.testlab.mindbreeze.fabagl.fabasoft.com rather than just spserver2007.
Testing SharePoint Web Services with SOAP-Calls and curl
To analyze and solve permission problems or other issues regarding the SharePoint web services, you can use the command line tool curl to perform simple SOAP calls.
The command line tool curl is already present on Mindbreeze InSpire (for Microsoft Windows) and is located in the following folder: C:\setup\tools\curl\bin. For more convenient use, add this folder path to the Microsoft Windows environment variable PATH.
The procedure for preparing the SOAP calls is quite similar for every test case and is explained here based on the following example: CheckConnectivity from GSSiteDiscovery.asmx.
The first step is to open the desired SharePoint web service in a web browser window and follow the link to the desired action method to get the interface description and the template for the content to be sent later on.
For simplicity, we take the interface description based on SOAP 1.2 and copy the XML-content of the first block (request part) into a file in a local temporary folder (e.g. C:\Temp\sp-site-check.xml).
Based on the interface definition, some property values must be replaced by custom values from your own SharePoint infrastructure.
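A minimal sketch of such a request file for CheckConnectivity (note: the element namespace below is an assumption modeled on the GetAclForUrls example further down; copy the actual namespace from the service description page):
<?xml version="1.0" encoding="utf-8"?>
<soap12:Envelope xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:soap12="http://www.w3.org/2003/05/soap-envelope">
<soap12:Body>
<CheckConnectivity xmlns="gssitediscovery.generated.sharepoint.connector.enterprise.google.com" />
</soap12:Body>
</soap12:Envelope>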
Based on the previous example, we are now going to test the SOAP calls using curl in a command line window.
Switch to the file system folder containing the prepared XML content file and run the curl-command similar to the following example: (<Values in angle brackets> have to be replaced with own values)
C:\Temp>curl --ntlm --user <testlab\domainsrv>:<MYPASSWORD> --header "Content-Type: application/soap+xml;charset=utf-8" --data @<sp-site-check.xml> http://<spserver2007.testlab...>/_vti_bin/GSSiteDiscovery.asmx
The output is displayed directly, or it can be redirected to an output file for easier reading: > out.xml
The following SharePoint web services and methods are quite useful for detecting problems:
http://<spserver2007.testlab>/_vti_bin/GSSiteDiscovery.asmx
CheckConnectivity: should return success
GetAllSiteCollectionFromAllWebApps: requires a SharePoint admin account!
http://<spserver2007.testlab>/_vti_bin/GSBulkAuthorization.asmx
CheckConnectivity: should return success
http://<spserver2007.testlab>/Docs/_vti_bin/GssAcl.asmx (this test should be invoked on the subdirectory URL containing the SharePoint-documents - e.g.: /Docs)
CheckConnectivity: should return success
GetAclForUrls: this is the first test that requires changing the content XML file (see below). You can specify the URL of the basic document overview page (e.g. AllItems.aspx) or the SharePoint URL of a chosen document. This test should return all permitted user accounts for the chosen documents.
GetAclForUrls Content-XML:
<?xml version="1.0" encoding="utf-8"?>
<soap12:Envelope xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:soap12="http://www.w3.org/2003/05/soap-envelope">
<soap12:Body>
<GetAclForUrls xmlns="gssAcl.generated.sharepoint.connector.enterprise.google.com">
<urls>
<string>http://spserver2007.testlab.mindbreeze.fabagl.fabasoft.com/Docs/Documents/Forms/AllItems.aspx</string>
<string>http://spserver2007.testlab.mindbreeze.fabagl.fabasoft.com/Docs/Documents/testdoc2_server2007.rtf</string>
</urls>
</GetAclForUrls>
</soap12:Body>
</soap12:Envelope>
SOAP-Call with curl:
C:\Temp>curl --ntlm --user <testlab\domainsrv>:<MYPASSWORD> --header "Content-Type: application/soap+xml;charset=utf-8" --data @data.xml http://spserver2007.testlab.mindbreeze.fabagl.fabasoft.com/Docs/_vti_bin/GssAcl.asmx > out.xml
The result shows all SharePoint-permissions for the specified URLs:
If the documents are retrieved correctly from SharePoint by the crawler (as listed in the main log file) but are still not inserted into the index, you should check the log file mes-pusher.csv.
If the column ActionType contains the value "IGNORED", there is another column called Message showing the cause why the document was ignored.
Possible causes and solutions:
The following examples show how rules in the file ConnectorMetadataMapping.xml can be used to generate metadata from existing metadata.
<ConversionRule class="HTMLContentRule">
<Arg>//*[@id='ArticleContent'] </Arg> <!-- include XPath -->
<Arg>//*[starts-with(@id, 'ECBItems_']</Arg> <!-- exclude XPath -->
</ConversionRule>
<Metadatum join="true">
<SrcName>srcName</SrcName> <!—srcName should be item ID -->
<MappedName>mappedRef</MappedName>
<ConversionRule class="SharePointKeyReferenceRule">
<Arg>http://site/list/AllItems.aspx|%s</Arg>
</ConversionRule>
</Metadatum>
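With this rule, a source item ID of, say, 42 would yield the mapped value http://site/list/AllItems.aspx|42 (the %s placeholder in the argument is replaced by the source value).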
Joining Metadata:
<Metadatum join="true">
<SrcName>srcName1,srcName2</SrcName> <!-- join values with '|' -->
<MappedName>mappedName</MappedName>
<ConversionRule class="FormatStringRule">
<Arg>%s|%s</Arg>
</ConversionRule>
</Metadatum>
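For example, if srcName1 contains "2024" and srcName2 contains "Report", mappedName becomes 2024|Report.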
Splitting Metadata:
<Metadatum split="true">
<SrcName>srcName</SrcName>
<MappedName>mapped1,mapped2</MappedName> <!-- split srcName value -->
<ConversionRule class="SplitStringRule">
<Arg>:</Arg>
</ConversionRule>
</Metadatum>
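For example, a source value of "2024:Report" is split at the colon into mapped1 = 2024 and mapped2 = Report.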
Generation of Metadata Using Regular Expressions:
<Metadatum>
<SrcName>srcName</SrcName>
<MappedName>mappedName</MappedName>
<ConversionRule class="StringReplaceRule">
<Arg>.*src="([^"]*)".*</Arg> <!—regex pattern-->
<Arg>http://mycompany.com$1</Arg> <!-- replacement -->
</ConversionRule>
</Metadatum>
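For example, a source value containing src="/images/logo.png" yields the mapped value http://mycompany.com/images/logo.png.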