Microsoft SharePoint Connector

Installation and Configuration

Copyright ©

Mindbreeze GmbH, A-4020 Linz, 2017.

All rights reserved. All hardware and software names used are registered trade names and/or registered trademarks of the respective manufacturers.

These documents are highly confidential. No rights to our software or our professional services, or results of our professional services, or other protected rights can be based on the handing over and presentation of these documents.

Distribution, publication or duplication is not permitted.

The term ‘user’ is used in a gender-neutral sense throughout the document.

Installation

Before installing the Microsoft SharePoint Connector, ensure that the Mindbreeze Server is already installed and that this connector is included in your Mindbreeze license.

Extending Fabasoft Mindbreeze Enterprise for Use with the Microsoft SharePoint Connector

The Microsoft SharePoint Connector is available as a ZIP file. This file must be registered with the Fabasoft Mindbreeze Enterprise Server via mesextension.exe as follows:

mesextension --interface=plugin --type=archive --file=MicrosoftSharePointConnector<version>.zip install

PLEASE NOTE: The connector can be updated by running the same mesextension command. Fabasoft Mindbreeze Enterprise will automatically carry out the required update.

Required Rights for the Crawling User

The Microsoft SharePoint Connector allows you to index and search in Microsoft SharePoint items and objects.

The following requirements must be met before configuring a Microsoft SharePoint data source:

  • The Microsoft SharePoint version must be SharePoint 2013, SharePoint 2010 or SharePoint 2007.
  • For Kerberos Authentication the service user on the Fabasoft Mindbreeze Enterprise node hosting the SharePoint data source must have at least Full Read permission on the SharePoint Web Applications. Kerberos must be selected as the authentication policy for these Web Applications.
  • For Basic Authentication the username and password of an account with Full Read permission on the SharePoint Web Applications must be provided in the Mindbreeze configuration. Basic Authentication must be selected as the authentication policy for these Web Applications.

Adding the service user to the web application user policy with Full Read permission can be done as follows:

  • Navigate to Central Administration -> Application Management and then click on Manage web applications
  • Select Web Application and then click on User Policy (see screenshot below)
  • Give the service user “Full Read” permission.

Selecting authentication policy for Web Applications can be done as follows:

  • Navigate to Central Administration -> Application Management and then click on Manage web applications
  • Select Web Application and then click on Authentication Providers (see screenshot below)
  • Choose desired authentication policy

  • If NTLM or Basic authentication is selected, the username and password should be provided in Mindbreeze configuration. (See 2.1.1)

  • In order to crawl user profiles in SharePoint 2013, the service user must be in the list of search crawlers of the User Profile Service Application.

Navigate to Central Administration -> Manage service applications -> User Profile Service Application.

Installation of Services for SharePoint

The services for SharePoint must be installed as follows:

  1. Log in to the SharePoint server whose sites are to be crawled by the connector.
  2. Go to the ISAPI directory of SharePoint. With a standard default installation, the path of this directory is C:\Program Files\Common Files\Microsoft Shared\web server extensions\14\ISAPI (SharePoint 2010) or C:\Program Files\Common Files\Microsoft Shared\web server extensions\15\ISAPI (SharePoint 2013).
  3. Copy the following files from the Prerequisites folder into the ISAPI folder specified in step 2:
    • GSBulkAuthorization.asmx
    • GSBulkAuthorizationdisco.aspx
    • GSBulkAuthorizationwsdl.aspx
    • GSSiteDiscovery.asmx
    • GSSiteDiscoverydisco.aspx
    • GSSiteDiscoverywsdl.aspx
    • GssAcl.asmx
    • GssAcldisco.aspx
    • GssAclwsdl.aspx
    • MesAcl.asmx
    • MesAcldisco.aspx
    • MesAclwsdl.aspx

  4. The connectivity of the web services can be verified using the following URLs:
    http://mycomp.com/_vti_bin/GSBulkAuthorization.asmx
    http://mycomp.com/_vti_bin/GSSiteDiscovery.asmx

    http://mycomp.com/_vti_bin/GssAcl.asmx

    Where http://mycomp.com is the SharePoint site URL. After opening the above URLs, you should be able to see all web methods exposed by the respective web service. Click the "Service Description" link at the top to view the WSDL description.

Installation of the SharePoint SSL Certificate for Java

Save the SharePoint SSL certificate to a file, for example c:\temp\sharepointserver.cer.

Installation:

<jre_home>/bin/keytool -import -noprompt -trustcacerts -alias sharepointserver -file c:\temp\sharepointserver.cer -keystore <jre_home>/lib/security/cacerts -storepass changeit

Configuration of Mindbreeze

Select the “Advanced” installation method:

Click on the “Indices” tab and then on the “Add new index” symbol to create a new index.

Enter the index path, e.g. “/data/indices/sharepoint”. Change the Display Name of the Index Service and the related Filter Service if necessary.

Add a new data source with the symbol “Add new custom source” at the bottom right.

Configuration of Data Source

Microsoft SharePoint Connection

This information is only needed when basic authentication is used:

  • “SharePoint Server URL”: To crawl all SharePoint sites, this URL can be specified without port and site path, which causes all SharePoint sites to be crawled. For example, “http://mycompany.com” causes all SharePoint sites with a URL of the form “http://mycompany.com:<any port>/<any site>” to be crawled. The required credentials must be configured in the Network tab under Endpoints. The “Location” field of the endpoint and the “SharePoint Server URL” must be identical.
  • Logon Account For Principal Resolution, Domain and Password: These fields should not be configured if a “Principal Resolution Cache Service” is selected or if Kerberos authentication is used.

If the SharePoint Principal Cache is used, it is possible to configure the credential information in the Network tab (section Endpoints).

Crawl URLs

It is possible to limit the data that is crawled, for instance to particular sites. To do so, the sites to be crawled must match regular expressions entered in the field “Included URL”. The regular expression must have the prefix “regexp:”. It is also possible to exclude sites from crawling: these sites must match regular expressions entered in the field “Excluded URL”. Case-insensitive regular expressions must have the prefix “regexpIgnoreCase:”.
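
For example, assuming all sites below http://mycompany.com/sales should be indexed while archive sites should be skipped, the fields could be filled as follows (hypothetical URLs):

Included URL: regexp:http://mycompany\.com/sales/.*
Excluded URL: regexpIgnoreCase:.*/archive/.*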

For crawling user profiles, the option “Crawl User Profile” must be set to “Yes”, and the “MySite URL” and “Collection Name for User Profiles” options should be configured appropriately.

For crawling sites with anonymous access enabled, select “Include Documents without ACLs”.

Security Settings

“Track Document URL Changes” allows tracking of URL modifications of documents, for example when a document is moved from one folder to another.

“Resolve Sharepoint Groups” should not be selected if a “Sharepoint Principal Resolution Service” is used. By choosing “Normalize ACLs”, all Active Directory users and groups in ACLs are stored in distinguished name format. “Allow Documents Without ACLs” enables crawling of SharePoint sites with anonymous access rights. To exclude documents of sites with certain active features from crawling, add the feature ID (GUID) to the “Exclude Documents From Sites With these Features” field.

Alias URLs Mapping

In order to provide documents with open URLs according to the “Alias URLs” configuration, “Rewrite Open URL” must be selected. If the service user does not have access to the internal download URLs of documents, these URLs can also be rewritten using the URLs configured in the “Alias URLs” configuration.
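
For example (hypothetical URLs): if documents are downloaded internally via http://sp-internal:8080 but users should open them via https://sharepoint.mycompany.com, an alias mapping from the internal URL to the public URL can be entered in the “Alias URLs” configuration and “Rewrite Open URL” selected, so that search results open with the public URL.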

Synchronization Settings

During delta indexing the crawler uses a local state that enables it to detect only the changes on the SharePoint server. Sometimes documents are not sent to the index correctly because of transport or filter errors. “Synchronize with Index on Startup” allows the crawler to synchronize with the index on startup and to resume from its local state after the synchronization is done.

“Synchronization Timeout (Hours)”: The synchronization is aborted after the configured number of hours is reached and the persisted state is loaded.

“Reset Connector State if it is not consistent with Index”: if the local crawler state is not consistent with the index it is deleted and a full crawl run is started. If this option is disabled, no deletion occurs.

Content Type Settings

To crawl content types that are not crawled by default, a regular expression pattern matching the additional content types must be entered in the “Additional Content Types (regex)” field.
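
For example, to additionally crawl a custom content type named “Project Document” as well as all content types starting with “Contract”, a pattern like the following could be entered (hypothetical content type names):

Additional Content Types (regex): Project Document|Contract.*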

For crawling documents in an unpublished state, select “Include Unpublished Documents”.

The SharePoint Connector contains a preconfigured content mapping file (XML) that provides the rules to be applied to documents according to their content type. Sometimes it is necessary to change these rules and save the modified mapping file in a separate location. To use such a modified mapping file, enter its location in “Content Type Mapping Description File”. One of the important rules in this mapping file is to include or exclude documents with specific content types. By selecting “Delete Ignored Documents from Index”, documents that were already crawled with different mapping rules are deleted from the index if they are no longer included.
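
For example (hypothetical path), after copying and adapting the preconfigured mapping file, the field could be set as follows:

Content Type Mapping Description File: /data/config/ContentTypeMapping.xml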

Crawler Performance Settings

“Batch Size” defines how many documents the crawler fetches before sending them to the index. These documents are sent to the index concurrently by “Number of Threads” threads. It is important that the “Document Size Limit (MB)” corresponds to the “Maximum Input Size (MB)” of the Filter Service.
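
For example (hypothetical values): with “Batch Size” 500 and “Number of Threads” 4, the crawler fetches 500 documents at a time and sends them to the index using 4 parallel threads; a “Document Size Limit (MB)” of 50 then requires a “Maximum Input Size (MB)” of at least 50 on the Filter Service.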

Content Metadata Extract Settings

These settings define metadata that should be extracted from the document content:

  • “Name”: Defined name for the extracted metadatum.
  • “XPath”: XPath expression locating the metadata value in the content.
  • “Format”: String, URL, Path, Number, Signature or Date.
  • “Format Options”: Additional formatting options for the selected format.
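
A minimal example (hypothetical values, assuming the document content is HTML and the author is stored in a meta tag):

Name: author
XPath: //meta[@name='author']/@content
Format: String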

Editing Microsoft Office Documents in SharePoint

It is possible to edit a Microsoft Office document directly from the search results and save it back to SharePoint. Write permissions are required to use this feature.

Configuring the integrated Authentication of the Microsoft SharePoint Crawler

Windows:

If the installation is made on a Microsoft Windows Server, the Kerberos authentication of the current Mindbreeze Service user can also be used for the Microsoft SharePoint Crawler. In this case the Service user must be authorized to access the Microsoft SharePoint Web Services.

Linux:

For installations under Linux, the following steps must be taken:

  • Create a keytab for the privileged user with ktutil:
    • Start ktutil on the command line and carry out these commands in the ktutil shell:
      • addent -password -p <principal>@<REALM> -k 0 -e DES-CBC-MD5
      • (for example: addent -password -p crawler_user@MYDOMAIN.COM -k 0 -e DES-CBC-MD5)
      • Enter the user password.
      • wkt <keytab_path>
  • Upload the keytab:

  • Configure the keytab and the contained principal (in the authentication tab):

IMPORTANT: The keytab must contain the key of the abovementioned user. The keytab for the Client Service cannot be used here.

Troubleshooting

Generally, if you are having trouble indexing a SharePoint data source, you should first take a look at the Mindbreeze log folders.

Inside the Mindbreeze base log folder there is also a sub-folder for the SharePoint crawler, whose name is similar to the following example:

C:\logs\current\log-mescrawler_launchedservice-Microsoft_SharePoint_Sharepoint+2007

This folder will contain several date-based sub-folders each containing 2 main log files:

  • log-mescrawler_launchedservice.log: basic log file containing relevant information about what is going on as well as error messages that occurred while crawling the data source.
  • mes-pusher.csv: CSV file containing the SharePoint URLs that have been crawled, including status information about success or errors.

If the file mes-pusher.csv does not appear, there may be basic configuration or permission problems preventing the crawler from retrieving documents from SharePoint; these should be recorded in the base log file mentioned above.

Crawling User Unauthorized

Problem Cause:

The crawler does not retrieve any documents from SharePoint and therefore does not create the log file mes-pusher.csv.

The log file log-mescrawler_launchedservice.log may contain error messages similar to the following:

com.mindbreeze.enterprisesearch.gsabase.crawler.InitializationException: Invalid connector config: message Cannot connect to the given SharePoint Site URL with the supplied Domain/Username/Password.Reason:(401)Unauthorized

or:

com.mindbreeze.enterprisesearch.gsabase.crawler.InitializationException: Unable to set connector config, response message: Cannot connect to the  Services for SharePoint on the given Crawl URL with the supplied Domain/Username/Password.Reason:(401)Unauthorized, status message:null, status code:5223 (INVALID_CONNECTOR_CONFIG)

or:

enterprise.connector.sharepoint.wsclient.soap.GSBulkAuthorizationWS INTERNALWARNING: Can not connect to GSBulkAuthorization web service. cause:(401)Unauthorized

Problem description and solution:

The configured service user is not allowed to obtain the file listings from SharePoint, either because the login fails or because the permissions inside SharePoint are insufficient.

The following issues have to be checked:

  • Check the authentication method configured inside SharePoint/IIS:
    • If you are using Integrated/Kerberos authentication, the Mindbreeze Node service must be configured to run as the service user.
    • For NTLM/Basic authentication the service user must be entered in the Mindbreeze Configuration UI of the SharePoint data source.
  • Check the permissions of the service user inside SharePoint.
  • Test the web services GSSiteDiscovery.asmx and GSBulkAuthorization.asmx (for details see below).
  • You should also verify that SharePoint document pages or content documents can be opened in a web browser on the Mindbreeze server using the service account.

SharePoint URL – FQDN

Problem Cause:

The crawler does not retrieve any documents from SharePoint and therefore does not create the log file mes-pusher.csv.

The log file log-mescrawler_launchedservice.log may contain an error message similar to the following:

com.mindbreeze.enterprisesearch.gsabase.crawler.InitializationException: Unable to set connector config, response message: The SharePoint Site URL must contain a fully qualified domain name., status message:null, status code:5223 (INVALID_CONNECTOR_CONFIG)

Problem description and solution:

In order to use the Mindbreeze SharePoint Connector it is important that the target SharePoint server is accessed using the FQDN hostname.

  • In the SharePoint configuration the external URL must be configured correctly to the FQDN hostname (see SharePoint “Operations” > group “Global Configuration” > “Alternate access mappings”).

  • In the Mindbreeze configuration the SharePoint crawling root must also be defined using the FQDN hostname in the URL.

Testing SharePoint Web Services with SOAP-Calls and curl

In order to analyze and solve permission problems or other issues regarding the SharePoint web services, you can use the command line tool curl to perform simple SOAP calls.

The command line tool curl is already present on Mindbreeze InSpire (for Microsoft Windows) and is located in the following folder: C:\setup\tools\curl\bin. For more convenient use, you can add this folder to the Microsoft Windows PATH environment variable.
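
For example, the folder can be added to PATH for the current command prompt session as follows (a permanent change can be made via the Windows system settings):

set PATH=%PATH%;C:\setup\tools\curl\bin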

Preparing the SOAP-Calls

The procedure of preparing the SOAP calls is quite similar for every test case and will be explained based on the following example: CheckConnectivity from GSSiteDiscovery.asmx.

The first step is to open the desired SharePoint web service in a web browser window and follow the link to the desired action method to get the interface description and the template for the content to be sent later on.

For simplicity we take the interface description based on SOAP 1.2 and copy the XML-content of the first block (request part) into a file in a local temporary folder (e.g. C:\Temp\sp-site-check.xml).

Based on the interface definition, some property values must be replaced with values from your own SharePoint infrastructure.
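
A prepared request file for CheckConnectivity could look like the following sketch. The element name and namespace must be taken from the SOAP 1.2 template on the service description page; the namespace shown here is only an assumption modeled on the GssAcl example further below:

<?xml version="1.0" encoding="utf-8"?>
<soap12:Envelope xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:soap12="http://www.w3.org/2003/05/soap-envelope">
  <soap12:Body>
    <!-- namespace is an assumption: copy the exact value from the service description page -->
    <CheckConnectivity xmlns="gssitediscovery.generated.sharepoint.connector.enterprise.google.com" />
  </soap12:Body>
</soap12:Envelope>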

Testing SOAP-Calls

Based on the previous example, we are now going to test the SOAP calls using curl in a command line window.

Switch to the file system folder containing the prepared XML content file and run a curl command similar to the following example (<values in angle brackets> have to be replaced with your own values):

C:\Temp>curl --ntlm --user <testlab\domainsrv>:<MYPASSWORD> --header "Content-Type: application/soap+xml;charset=utf-8" --data @<sp-site-check.xml> http://<spserver2007.testlab...>/_vti_bin/GSSiteDiscovery.asmx

The output is displayed directly, or it can be redirected into an output file for easier reading: > out.xml

The following SharePoint web services and methods are quite useful for detecting problems:

http://<spserver2007.testlab>/_vti_bin/GSSiteDiscovery.asmx

CheckConnectivity: should return success

GetAllSiteCollectionFromAllWebApps: requires a SharePoint admin account!

http://<spserver2007.testlab>/_vti_bin/GSBulkAuthorization.asmx

CheckConnectivity: should return success

http://<spserver2007.testlab>/Docs/_vti_bin/GssAcl.asmx (this test should be invoked on the subdirectory URL containing the SharePoint-documents - e.g.: /Docs)

CheckConnectivity: should return success

GetAclForUrls: this is the first test that requires changing the content XML file (see below). You can specify the URL of the basic document overview page, e.g. AllItems.aspx, or the SharePoint URL of a chosen document. This test should return all permitted user accounts for the chosen documents.

GetAclForUrls Content-XML:

<?xml version="1.0" encoding="utf-8"?>
<soap12:Envelope xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:soap12="http://www.w3.org/2003/05/soap-envelope">
  <soap12:Body>
    <GetAclForUrls xmlns="gssAcl.generated.sharepoint.connector.enterprise.google.com">
      <urls>
        <string>http://spserver2007.testlab.mindbreeze.fabagl.fabasoft.com/Docs/Documents/Forms/AllItems.aspx</string>
        <string>http://spserver2007.testlab.mindbreeze.fabagl.fabasoft.com/Docs/Documents/testdoc2_server2007.rtf</string>
      </urls>
    </GetAclForUrls>
  </soap12:Body>
</soap12:Envelope>

SOAP-Call with curl:

C:\Temp>curl --ntlm --user <testlab\domainsrv>:<MYPASSWORD> --header "Content-Type: application/soap+xml;charset=utf-8" --data @data.xml http://spserver2007.testlab.mindbreeze.fabagl.fabasoft.com/Docs/_vti_bin/GssAcl.asmx > out.xml

The result shows all SharePoint permissions for the specified URLs.

Documents IGNORED by Crawler

If the documents are retrieved correctly from SharePoint by the crawler (as listed in the main log file) but are still not inserted into the index, you should check the log file mes-pusher.csv.

If the column ActionType contains the value “IGNORED”, the column Message shows the cause why the document was ignored.

Possible causes and solutions:

  • IGNORED, property ContentType with value null not matched pattern …
    • Some basic document content types are already predefined in the standard SharePoint connector, but your SharePoint installation may use other content types for documents that you also want to be indexed. You can extend the list of indexed document types by defining your own list of content types in the “Additional Content Types” property of the Mindbreeze configuration.

  • Unable to generate SecurityToken from acl null
    • If the crawler is not able to obtain the current ACLs for a given document from SharePoint, this document is ignored and not sent to the index for further processing. In this case you have to check whether the permissions of the service user are sufficient; you can also test the SharePoint web service GssAcl.asmx on behalf of the service user (as described above).

Configuration of Metadata Conversion Rules in the File: ConnectorMetadataMapping.xml

The following examples show how rules in the file ConnectorMetadataMapping.xml can be used to generate metadata from existing metadata.

Content XPath Configuration

        <ConversionRule class="HTMLContentRule">
            <Arg>//*[@id='ArticleContent']</Arg> <!-- include XPath -->
            <Arg>//*[starts-with(@id, 'ECBItems_')]</Arg> <!-- exclude XPath -->
        </ConversionRule>

References

        <Metadatum join="true">
            <SrcName>srcName</SrcName> <!-- srcName should be the item ID -->
            <MappedName>mappedRef</MappedName>
            <ConversionRule class="SharePointKeyReferenceRule">
                <Arg>http://site/list/AllItems.aspx|%s</Arg>
            </ConversionRule>
        </Metadatum>

String Formatting

Joining Metadata:

        <Metadatum join="true">
            <SrcName>srcName1,srcName2</SrcName> <!-- join values with '|' -->
            <MappedName>mappedName</MappedName>
            <ConversionRule class="FormatStringRule">
                <Arg>%s|%s</Arg>
            </ConversionRule>
        </Metadatum>

Splitting Metadata:

        <Metadatum split="true">
            <SrcName>srcName</SrcName>
            <MappedName>mapped1,mapped2</MappedName> <!-- split srcName value -->
            <ConversionRule class="SplitStringRule">
                <Arg>:</Arg>
            </ConversionRule>
        </Metadatum>

Generation of Metadata using regular Expressions:

        <Metadatum>
            <SrcName>srcName</SrcName>
            <MappedName>mappedName</MappedName>
            <ConversionRule class="StringReplaceRule">
                <Arg>.*src=&quot;([^&quot;]*)&quot;.*</Arg> <!-- regex pattern -->
                <Arg></Arg> <!-- replacement -->
            </ConversionRule>
        </Metadatum>

Uninstalling the Microsoft SharePoint Connector

To uninstall the Microsoft SharePoint Connector, first delete all Microsoft SharePoint Crawlers and then carry out the following command:

mesextension --interface=plugin --type=archive --file=MicrosoftSharePointConnector<version>.zip uninstall