Microsoft SharePoint Connector

Installation and Configuration

Copyright ©

Mindbreeze GmbH, A-4020 Linz, 2018.

All rights reserved. All hardware and software names used are registered trade names and/or registered trademarks of the respective manufacturers.

These documents are highly confidential. No rights to our software or our professional services, or results of our professional services, or other protected rights can be based on the handing over and presentation of these documents.

Distribution, publication or duplication is not permitted.

The term ‘user’ is used in a gender-neutral sense throughout the document.

Installation

Before installing the Microsoft SharePoint Connector, ensure that the Mindbreeze Server is already installed and that this connector is included in the Mindbreeze license.

Extending Fabasoft Mindbreeze Enterprise for Use with the Microsoft SharePoint Connector

The Microsoft SharePoint Connector is available as a ZIP file. This file must be registered with the Fabasoft Mindbreeze Enterprise Server via mesextension.exe as follows:

mesextension --interface=plugin --type=archive --file=MicrosoftSharePointConnector<version>.zip install

PLEASE NOTE: The connector can be updated by running the same mesextension command with the new archive. Fabasoft Mindbreeze Enterprise will automatically carry out the required update.
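For example, an update to a newer connector version can be sketched as follows (the exact archive file name depends on the version you received):

mesextension --interface=plugin --type=archive --file=MicrosoftSharePointConnector<newversion>.zip install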

Required Rights for the Crawling User

The Microsoft SharePoint Connector allows you to index and search in Microsoft SharePoint items and objects.

The following requirements must be met before configuring a Microsoft SharePoint data source:

  • The Microsoft SharePoint version must be SharePoint 2013, SharePoint 2010 or SharePoint 2007.
  • For Kerberos authentication, the service user on the Fabasoft Mindbreeze Enterprise node with the SharePoint data source must have at least Full Read permissions on the SharePoint Web Applications. Kerberos must be selected as the authentication policy for these Web Applications.
  • For Basic authentication, the username and password of an account with Full Read permission on the SharePoint Web Applications must be provided in the Mindbreeze Manager configuration. Basic authentication must be selected as the authentication policy for these Web Applications.

Adding a user to the SharePoint site administrators can be done as follows:

  • Navigate to Central Administration -> Application Management and then click on Manage web applications
  • Select Web Application and then click on User Policy (see screenshot below)
  • Give the service user “Full Read” permission.

Selecting authentication policy for Web Applications can be done as follows:

  • Navigate to Central Administration -> Application Management and then click on Manage web applications
  • Select Web Application and then click on Authentication Providers (see screenshot below)
  • Choose the desired authentication policy.

  • If NTLM or Basic authentication is selected, the username and password should be provided in Mindbreeze configuration. (See 2.1.1)

  • In order to crawl user profiles in SharePoint 2013, the service user must be in the list of search crawlers of the User Profile Service Application.

Navigate to Central Administration -> Manage service applications -> User Profile Service Application.

Installation of Services for SharePoint

The services for SharePoint must be installed as follows:

  1. Log in to the SharePoint server whose sites are to be crawled by the connector.
  2. Go to the ISAPI directory of SharePoint. With a standard default installation, the path of this directory is C:\Program Files\Common Files\Microsoft Shared\web server extensions\14\ISAPI (SharePoint 2010) or C:\Program Files\Common Files\Microsoft Shared\web server extensions\15\ISAPI (SharePoint 2013).
  3. Copy the following files from the Prerequisites folder into the ISAPI folder specified in step 2:
    • GSBulkAuthorization.asmx
    • GSBulkAuthorizationdisco.aspx
    • GSBulkAuthorizationwsdl.aspx
    • GSSiteDiscovery.asmx
    • GSSiteDiscoverydisco.aspx
    • GSSiteDiscoverywsdl.aspx
    • GssAcl.asmx
    • GssAcldisco.aspx
    • GssAclwsdl.aspx
    • MesAcl.asmx
    • MesAcldisco.aspx
    • MesAclwsdl.aspx

  4. The connectivity of the web services can be verified using the following URLs:
    http://mycomp.com/_vti_bin/GSBulkAuthorization.asmx
    http://mycomp.com/_vti_bin/GSSiteDiscovery.asmx

    http://mycomp.com/_vti_bin/GssAcl.asmx

    Where http://mycomp.com is the SharePoint site URL. After opening the above URL(s), you should see all the web methods exposed by the web service. Click the "Service Description" link at the top to view the WSDL description.
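If no web browser is available on the server, the same connectivity check can be sketched from the command line with curl (assuming NTLM authentication and a hypothetical service account; with Basic authentication use --basic instead of --ntlm):

curl --ntlm --user <DOMAIN\serviceuser>:<PASSWORD> http://mycomp.com/_vti_bin/GSSiteDiscovery.asmx

A successful call returns the HTML page listing the web methods of the service.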

Installation of the SharePoint SSL Certificate for Java

Save the SharePoint SSL certificate to a file, for example c:\temp\sharepointserver.cer.

Installation:

<jre_home>/bin/keytool -import -noprompt -trustcacerts -alias sharepointserver -file c:\temp\sharepointserver.cer -keystore ../lib/security/cacerts -storepass changeit
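To verify that the certificate was imported, the keystore entry can be listed (same alias and keystore as above):

<jre_home>/bin/keytool -list -keystore ../lib/security/cacerts -storepass changeit -alias sharepointserver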

Configuration of Mindbreeze

Select the “Advanced” installation method:

Click on the “Indices” tab and then on the “Add new index” symbol to create a new index.

Enter the index path, e.g. “/data/indices/sharepoint”. Change the Display Name of the Index Service and the related Filter Service if necessary.

Add a new data source with the symbol “Add new custom source” at the bottom right.

Configuration of Data Source

Microsoft SharePoint Connection

This information is only needed when basic authentication is used:

  • “SharePoint Server URL”: To crawl all SharePoint sites, this URL can be given without port and site path, which causes all SharePoint sites to be crawled. For example, “http://mycompany.com” causes all SharePoint sites with a URL of the form “http://mycompany.com:<any port>/<any site>” to be crawled. The required credentials must be configured in the Network tab under Endpoints. The “Location” field of the Endpoint and the “SharePoint Server URL” must be identical.
  • Logon Account For Principal Resolution, Domain and Password: These fields should not be configured if a “Principal Resolution Cache Service” is selected or if Kerberos authentication is used.

If the SharePoint Principal Cache is used, it is possible to configure the credential information in the Network tab (section Endpoints).
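A minimal sketch of matching values (hypothetical; apart from the “Location” field mentioned above, the exact credential field names depend on your Mindbreeze version):

Network > Endpoints:
  Location: http://mycompany.com
  User: MYDOMAIN\svc-sharepoint-crawl
  Password: ********

Data source:
  SharePoint Server URL: http://mycompany.com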

Caching Principal Resolution Service

You can select one of the following three caching principal resolution services to be used.

CachingLdapPrincipalResolution: If selected, it is used to resolve a user’s AD group membership when searching. However, the SharePoint groups in the ACLs must be resolved while crawling. To do this, select “Resolve SharePoint Groups”. Do not select “Use ACLs References”. “Normalize ACLs” can be selected.

SharePointPrincipalResolutionCache: If selected, it is used to resolve a user’s SharePoint group membership when searching. This service also resolves the user’s AD group membership. Therefore it is no longer necessary to select “Resolve SharePoint Groups”. Do not select “Use ACLs References” in this case. “Normalize ACLs” can be selected.

SharePointACLReferenceCache: When selected, the URLs from the SharePoint site, SharePoint list, and folder of the document are saved as ACLs during crawling to speed up the crawl. “Use ACLs References” must be selected in this case. “Resolve SharePoint Groups” and “Normalize ACLs” may not be selected.
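In short, the option combinations described above are:

  • CachingLdapPrincipalResolution: select “Resolve SharePoint Groups”; do not select “Use ACLs References”; “Normalize ACLs” is optional.
  • SharePointPrincipalResolutionCache: do not select “Resolve SharePoint Groups” or “Use ACLs References”; “Normalize ACLs” is optional.
  • SharePointACLReferenceCache: select “Use ACLs References”; do not select “Resolve SharePoint Groups” or “Normalize ACLs”.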

For details on configuring the caching principal resolution service, see Caching Principal Resolution Service.

Crawl URLs

The SharePoint crawler initially detects all SharePoint sites of the SharePoint server configured in “SharePoint Server URL”. Alternatively, you can enter the path of a CSV file in the field “Include Sites File”, which lists only those sites (URLs) that should be indexed. It is also possible to limit the data to be crawled to specific pages (URLs) by restricting them with a regular expression in the field “Included URL”. Likewise, certain pages (URLs) can be excluded from crawling by restricting them with a regular expression in the field “Excluded URL”. A regular expression must have a “regexp:” or “regexpIgnoreCase:” prefix.
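For example (hypothetical patterns; adapt host and paths to your environment):

Included URL: regexp:http://mycompany\.com/sites/.*
Excluded URL: regexpIgnoreCase:.*/archive/.*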

For crawling user profiles, "Crawl User Profile" must be selected and the "MySite URL" and "Collection Name for User Profiles" must be configured accordingly.

Site Restrictions

Please note that the following restrictions are applied after “Include URL” and “Exclude URL”. This means that a site URL excluded by the "Exclude URL" rule will not be crawled even if it is listed in the "Include Sites File".

  • All Sites File: The path to a CSV file containing the site URLs that are to be crawled. If this field is empty, all sites are detected by the SiteDiscovery service.
  • Include Sites File: The path to a CSV file containing the site URLs that are to be crawled without applying the congruence class calculation (which could otherwise exclude a site). If this field is empty, only those sites are crawled that correspond to the congruence class of this crawler and are not listed in the "Exclude Sites File".
  • Exclude Sites File: The path to a CSV file containing the site URLs that are not crawled. If this field is empty, the sites that correspond to the congruence class of this crawler or exist in the "Include Sites File" are crawled.
  • Congruence Modulus: The maximum number of crawlers that distribute all sites among themselves.
  • Congruence Class: Only sites with this congruence class (CRC of the site URL modulo maximum number of crawlers) are crawled.
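A minimal sketch of such a sites CSV file (for “All Sites File”, “Include Sites File” or “Exclude Sites File”; assuming one site URL per line, which may differ from the exact format expected by your installation):

http://mycompany.com/sites/hr
http://mycompany.com/sites/marketing
http://mycompany.com/sites/engineering

For example, with a Congruence Modulus of 2, a crawler with Congruence Class 0 only crawls sites whose CRC of the site URL modulo 2 equals 0, while a second crawler with Congruence Class 1 crawls the remaining sites.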

Security settings

The option "Use ACLs References" should only be selected if "SharePointACLReferenceCache" is selected as the "Principal Resolution Service Cache" (see: Caching Principal Resolution Service).

Moving documents from one directory to another also changes the URLs of these documents. To update these changes in the index, select the “Track Document URL Changes” option.

The option "Resolve SharePoint Groups" should not be selected if "SharePointPrincipalCache" is selected as the "Principal Resolution Service Cache" (see: Caching Principal Resolution Service). By configuring “Normalize ACLs,” all AD users and groups are converted to ACLs in “Distinguished Name” format. To crawl SharePoint pages with anonymous access rights, select "Include Documents without ACLs". If you want to exclude SharePoint pages from crawling by activating certain features, it is necessary to enter the ID (GUID) of these features in the field "Exclude Documents From Sites With These Features".

Alias URLs Mapping

In order to provide documents with open URLs according to the “Alias URLs” configuration, “Rewrite Open URL” must be selected. If the service user does not have access to the internal download URLs of documents, these URLs can be rewritten with the URLs configured in the “Alias URLs” configuration.

Content Type Settings

A regular expression pattern matching additional content types must be entered in the “Additional Content Types (regex)” field in order to crawl content types that are not crawled by default.
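For example (hypothetical content type names; use the content types defined in your SharePoint installation):

Additional Content Types (regex): Wiki Page|Announcement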

To crawl documents in an unpublished state, select “Include Unpublished Documents”.

The SharePoint Connector contains a preconfigured content mapping file (XML) which provides the rules to be applied to documents according to their content type. Sometimes it is necessary to change these rules and save the mapping file in a separate location. To use such a modified mapping file, enter its location in “Content Type Mapping Description File”. One of the important rules in this mapping file is to include or exclude documents with specific content types. If “Delete Ignored Documents from Index” is selected, documents that were already crawled with different mapping rules are deleted from the index if they are no longer included.

Synchronization Settings

  • Synchronize with Index on Startup: The crawler stores its status from the last run locally. This avoids matching individual documents in the index with those on the SharePoint server. Sometimes this status can deviate from the index due to transport or filter problems. To correct this deviation, select the “Synchronize with Index on Startup” option.
  • Synchronization Timeout (Hours): Specifies a number of hours after which synchronization is aborted and the stored state is used.
  • Reset Connector State if it is not consistent with index: If the crawler status is not consistent with the index status, it is deleted and a full indexing run is started. If this option is disabled, the status will not be deleted.
  • Include Documents Only From Keys File: The path to a CSV file with the keys that are to be indexed again. This means only these documents from SharePoint are crawled and indexed. We recommend backing up the “Connector State” directory beforehand.
  • Connector State Directory Path: The path to a directory in which the crawler persists the status of the documents already indexed, which is used after a crawl run or restart of the crawler. If this field is empty, a directory is created in /tmp.
  • Delete HTTP Response Codes: At the end of a crawl run, all sites and lists that supply these HTTP response codes after HTTP access (connectivity check) are also deleted from the index.

Crawler Performance Settings

  • Batch size: Defines the number of documents that are retrieved from the SharePoint server before they are sent to the index.
  • Number of threads: The number of threads that send the collected documents to the index simultaneously.
  • Document size limit (MB): This value must correspond to “Maximum input size (MB)” from the filter service.
  • Disable webpage thumbnails: If selected, no thumbnails are generated for these pages.
  • Retry duration if connection problems (seconds): The maximum number of seconds that the system will attempt to resend a document to the filter/index service in the event of connection problems or during syncdelta.

Content Metadata Extract Settings

Metadata that should be extracted from content.

  • “Name”: The name defined for the metadata.
  • “XPath”: XPath expression locating the metadata value in the content.
  • “Format”: String, URL, Path, Number, Signature or Date.
  • “Format Options”:
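A minimal example (hypothetical values; the XPath must match the structure of the crawled content):

Name: author
XPath: //meta[@name='author']/@content
Format: String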

Editing Microsoft Office Documents in SharePoint

It is possible to directly edit a Microsoft Office document from the search results and save it back to SharePoint. Write permissions are required to use this feature.

Configuring the Integrated Authentication of the Microsoft SharePoint Crawler

Windows:

If the installation is made on a Microsoft Windows Server, the Kerberos authentication of the current Mindbreeze Service user can also be used for the Microsoft SharePoint Crawler. In this case the Service user must be authorized to access the Microsoft SharePoint Web Services.

Linux:

Find the documentation for Linux here: Configuration - Kerberos Authentication

Troubleshooting

Generally, if you are having trouble indexing a SharePoint data source, you should first take a look at the Mindbreeze log folders.

Inside the Mindbreeze base log folder there is also a sub-folder for the SharePoint crawler, named similar to the following example:

C:\logs\current\log-mescrawler_launchedservice-Microsoft_SharePoint_Sharepoint+2007

This folder contains several date-based sub-folders, each containing two main log files:

  • log-mescrawler_launchedservice.log: the basic log file containing relevant information about the crawl run as well as any error messages that occurred while crawling the data source.
  • mes-pusher.csv: a CSV file containing the SharePoint URLs that have been crawled, including status information about success or errors.

If the file mes-pusher.csv does not appear, there may be basic configuration or permission problems preventing the crawler from retrieving documents from SharePoint; these should be recorded in the base log file mentioned above.

Crawling User Unauthorized

Problem Cause:

The crawler does not retrieve any documents from SharePoint and therefore does not create the log file mes-pusher.csv.

The log file log-mescrawler_launchedservice.log may contain error messages similar to the following:

com.mindbreeze.enterprisesearch.gsabase.crawler.InitializationException: Invalid connector config: message Cannot connect to the given SharePoint Site URL with the supplied Domain/Username/Password.Reason:(401)Unauthorized

or:

com.mindbreeze.enterprisesearch.gsabase.crawler.InitializationException: Unable to set connector config, response message: Cannot connect to the  Services for SharePoint on the given Crawl URL with the supplied Domain/Username/Password.Reason:(401)Unauthorized, status message:null, status code:5223 (INVALID_CONNECTOR_CONFIG)

or:

enterprise.connector.sharepoint.wsclient.soap.GSBulkAuthorizationWS INTERNALWARNING: Can not connect to GSBulkAuthorization web service. cause:(401)Unauthorized

Problem description and solution:

The service user used is not allowed to obtain the file listings from SharePoint, either because the login fails or because the permissions inside SharePoint are insufficient.

The following issues have to be checked:

  • Check the used authentication method configured inside SharePoint/IIS:
    • If you are using Integrated/Kerberos authentication, the Mindbreeze Node service must be configured to run as the service user.
    • For NTLM/Basic authentication, the service user must be entered in the Mindbreeze configuration of the SharePoint data source.
  • Check the permissions of the service user inside SharePoint.
  • Test the web services GSSiteDiscovery.asmx and GSBulkAuthorization.asmx (for details see below).
  • You should also verify that SharePoint document pages or content documents can be opened in a web browser on the Mindbreeze server using the service account.

SharePoint URL – FQDN

Problem Cause:

The crawler does not retrieve any documents from SharePoint and therefore does not create the log file mes-pusher.csv.

The log file log-mescrawler_launchedservice.log may contain an error message similar to the following:

com.mindbreeze.enterprisesearch.gsabase.crawler.InitializationException: Unable to set connector config, response message: The SharePoint Site URL must contain a fully qualified domain name., status message:null, status code:5223 (INVALID_CONNECTOR_CONFIG)

Problem description and solution:

In order to use the Mindbreeze SharePoint Connector, it is important that the target SharePoint server is accessed using the FQDN hostname.

  • In the SharePoint configuration, the external URL must be configured correctly with the FQDN hostname (see SharePoint “Operations” > group “Global Configuration” > “Alternate access mappings”).

  • Also, in the Mindbreeze configuration the SharePoint crawling root must be defined using the FQDN hostname in the URL.

Testing SharePoint Web Services with SOAP-Calls and curl

In order to analyze and solve permission problems or other issues regarding the SharePoint web services, you can use the command line tool curl to perform simple SOAP calls.

The command line tool curl is already present on Mindbreeze InSpire (for Microsoft Windows) and is located in the following folder: C:\setup\tools\curl\bin. For more convenient use, you can add this folder path to the Microsoft Windows PATH environment variable.

Preparing the SOAP-Calls

The procedure for preparing the SOAP calls is quite similar for every test case and will be explained based on the following example: CheckConnectivity from GSSiteDiscovery.asmx

The first step is to open the desired SharePoint web service in a web browser window and follow the link to the desired action method to get the interface description and the template for the content to be sent later on.

For simplicity we take the interface description based on SOAP 1.2 and copy the XML-content of the first block (request part) into a file in a local temporary folder (e.g. C:\Temp\sp-site-check.xml).

Based on the interface definition, some property values must be replaced with custom values from your own SharePoint infrastructure.
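A sketch of what such a prepared request file could look like for CheckConnectivity (the element name and namespace must be taken from the service description page of your installation; the namespace below is a placeholder):

<?xml version="1.0" encoding="utf-8"?>
<soap12:Envelope xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:soap12="http://www.w3.org/2003/05/soap-envelope">
  <soap12:Body>
    <CheckConnectivity xmlns="<namespace from the service description>" />
  </soap12:Body>
</soap12:Envelope>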

Testing SOAP-Calls

Based on the previous example, we are now going to test the SOAP calls using curl in a command line window.

Switch to the file system folder containing the prepared XML content file and run a curl command similar to the following example (<values in angle brackets> have to be replaced with your own values):

C:\Temp>curl --ntlm --user <testlab\domainsrv>:<MYPASSWORD> --header "Content-Type: application/soap+xml;charset=utf-8" --data @<sp-site-check.xml> http://<spserver2007.testlab...>/_vti_bin/GSSiteDiscovery.asmx

The output is displayed directly or can be redirected to an output file for easier reading: > out.xml

The following SharePoint web services and methods are quite useful for detecting problems:

http://<spserver2007.testlab>/_vti_bin/GSSiteDiscovery.asmx

CheckConnectivity: should return success

GetAllSiteCollectionFromAllWebApps: requires a SharePoint admin account!

http://<spserver2007.testlab>/_vti_bin/GSBulkAuthorization.asmx

CheckConnectivity: should return success

http://<spserver2007.testlab>/Docs/_vti_bin/GssAcl.asmx (this test should be invoked on the subdirectory URL containing the SharePoint-documents - e.g.: /Docs)

CheckConnectivity: should return success

GetAclForUrls: this is the first test that requires changing the content XML file (see below). You can specify the URL of the basic documents overview page, e.g. AllItems.aspx, or the SharePoint URL of a chosen document. This test should return all permitted user accounts for the chosen documents.

GetAclForUrls Content-XML:

<?xml version="1.0" encoding="utf-8"?>
<soap12:Envelope xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:soap12="http://www.w3.org/2003/05/soap-envelope">
  <soap12:Body>
    <GetAclForUrls xmlns="gssAcl.generated.sharepoint.connector.enterprise.google.com">
      <urls>
        <string>http://spserver2007.testlab.mindbreeze.fabagl.fabasoft.com/Docs/Documents/Forms/AllItems.aspx</string>
        <string>http://spserver2007.testlab.mindbreeze.fabagl.fabasoft.com/Docs/Documents/testdoc2_server2007.rtf</string>
      </urls>
    </GetAclForUrls>
  </soap12:Body>
</soap12:Envelope>

SOAP-Call with curl:

C:\Temp>curl --ntlm --user <testlab\domainsrv>:<MYPASSWORD> --header "Content-Type: application/soap+xml;charset=utf-8" --data @data.xml http://spserver2007.testlab.mindbreeze.fabagl.fabasoft.com/Docs/_vti_bin/GssAcl.asmx > out.xml

The result shows all SharePoint-permissions for the specified URLs:

Documents IGNORED by Crawler

If the documents are retrieved correctly from SharePoint by the crawler (as listed in the main log file) but are still not inserted into the index, you should check the log file mes-pusher.csv.

If the column ActionType contains the value “IGNORED”, another column called Message shows the reason why the document was ignored.

Possible causes and solutions:

  • IGNORED, property ContentType with value null not matched pattern …
    • Some basic document content types are already predefined in the standard SharePoint connector. However, your SharePoint installation may use other content types for documents you also want to be indexed. You can extend the list of indexed document types by defining your own list of content types in the following property of the Mindbreeze configuration: “Additional Content Types”.

  • Unable to generate SecurityToken from acl null
    • If the crawler is not able to obtain the current ACLs for a given document from SharePoint, this document will be ignored and not sent to the index for further processing. In this case you have to check whether the permissions of the service user are sufficient; you can also test the SharePoint web service GssAcl.asmx on behalf of the service user (as already described above).

Configuration of Metadata Conversion Rules in the File: ConnectorMetadataMapping.xml

The following examples show how rules in the file ConnectorMetadataMapping.xml can be used to generate metadata from existing metadata.

Content XPath Configuration

        <ConversionRule class="HTMLContentRule">
            <Arg>//*[@id='ArticleContent']</Arg> <!-- include XPath -->
            <Arg>//*[starts-with(@id, 'ECBItems_')]</Arg> <!-- exclude XPath -->
        </ConversionRule>

References

        <Metadatum join="true">
            <SrcName>srcName</SrcName> <!-- srcName should be the item ID -->
            <MappedName>mappedRef</MappedName>
            <ConversionRule class="SharePointKeyReferenceRule">
                <Arg>http://site/list/AllItems.aspx|%s</Arg>
            </ConversionRule>
        </Metadatum>

String Formatting

Joining Metadata:

        <Metadatum join="true">
            <SrcName>srcName1,srcName2</SrcName> <!-- join values with '|' -->
            <MappedName>mappedName</MappedName>
            <ConversionRule class="FormatStringRule">
                <Arg>%s|%s</Arg>
            </ConversionRule>
        </Metadatum>

Splitting Metadata:

        <Metadatum split="true">
            <SrcName>srcName</SrcName>
            <MappedName>mapped1,mapped2</MappedName> <!-- split srcName value -->
            <ConversionRule class="SplitStringRule">
                <Arg>:</Arg>
            </ConversionRule>
        </Metadatum>

Generation of Metadata using regular Expressions:

        <Metadatum>
            <SrcName>srcName</SrcName>
            <MappedName>mappedName</MappedName>
            <ConversionRule class="StringReplaceRule">
                <Arg>.*src=&quot;([^&quot;]*)&quot;.*</Arg> <!-- regex pattern -->
                <Arg>http://mycompany.com$1</Arg> <!-- replacement -->
            </ConversionRule>
        </Metadatum>
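For example, if srcName contains a value such as <img src="/images/logo.png">, the pattern captures /images/logo.png and the rule produces the mapped value http://mycompany.com/images/logo.png.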

Uninstalling the Microsoft SharePoint Connector

To uninstall the Microsoft SharePoint Connector, first delete all Microsoft SharePoint Crawlers and then carry out the following command:

mesextension --interface=plugin --type=archive --file=MicrosoftSharePointConnector<version>.zip uninstall