Copyright ©
Mindbreeze GmbH, A-4020 Linz, 2021.
All rights reserved. All hardware and software names used are brand names and/or trademarks of their respective manufacturers.
These documents are strictly confidential. The submission and presentation of these documents does not confer any rights to our software, our services and service outcomes, or any other protected rights. The dissemination, publication, or reproduction hereof is prohibited.
For ease of readability, gender differentiation has been waived. Corresponding terms and definitions apply within the meaning and intent of the equal treatment principle for both sexes.
Before installing the SharePoint Online connector, make sure that the Mindbreeze server is installed and the SharePoint Online connector is included in the license. Use the Mindbreeze Management Center to install or update the connector.
To install the plug-in, open the Mindbreeze Management Center. Select “Configuration” from the menu pane on the left-hand side. Then navigate to the “Plugins” tab. Under “Plugin Management,” select the appropriate zip file and upload it by clicking “Upload.” This automatically installs or updates the connector, as the case may be. In the process, the Mindbreeze services are restarted.
Select the “Advanced” installation method for configuration.
To create a new index, navigate to the “Indices” tab and click the “Add new index” icon in the upper right corner.
Enter the path to the index and change the display name as necessary.
Add a new data source by clicking the “Add new custom source” icon at the top right. Select the category “Microsoft SharePoint Online” and configure the data source according to your needs.
In the "Sharepoint Online" area you can define your Microsoft SharePoint Online installation that is to be indexed. The following options are available:
"Server URL" | The URL of the Sharepoint Online instance, e.g.: https://mycompany.sharepoint.com |
“Admin Server URL“ | The admin URL of the SharePoint Online instance. Often this is just the server URL with the suffix -admin. e.g: https://mycompany-admin.sharepoint.com |
"Site Relative URL. | The relative paths to the sites to be crawled, starting with a slash, e.g.: /sites/mysite. Each line can contain a path. If this is left empty all detected sites are crawled. If Sites are specified here only this sites and their subsites are crawled |
„Site Discovery Schedule“ | A Cron Expression that specifies when to run Site Discovery. The results are then used in the next crawl runs. This means that the potentially time-consuming site discovery does not have to be repeated with each crawl run. The official documentation for Cron Expressions can be found here, a generator for creating simple Cron Expressions can be found here. |
“Background Site Discovery Parallel Request Count” | Limits the number of parallel HTTP requests sent by the Site Discovery. |
"Site Discovery Strategy" | The strategy by which the site discovery is performed. |
"Do Not Crawl Root Site" | If this option is activated, the root site of the SharePoint Online instance (e.g. mycompany.sharepoint.com) and its subsites are not crawled. |
"Included Sites URL (regex)" | Regular expression that specifies which subsites are to be crawled. If this option is left empty, all subsites are crawled. The regex matches relative URLs, e.g. /sites/mysite. |
"Excluded Sites URL (regex)" | Regular expression that specifies which subsites are to be excluded. The regex matches relative URLs, e.g. /sites/mysite. |
"Included Lists/Files/Folders URL (regex)" | Regular expression that specifies which lists, files, and folders should be included. It is matched against the "url" metadata (the absolute URL). If this option is left empty, everything is included. Note: to include or exclude complete subsites, use the options "Included Sites URL (regex)" or "Excluded Sites URL (regex)" instead. |
"Excluded Lists/Files/Folders URL (regex)" | Regular expression that specifies which lists, files, and folders should be excluded. It is matched against the "url" metadata (the absolute URL). For example, if you find a document in the Mindbreeze search that you want to exclude, you can copy its URL from the "Open" action and use it in this option. |
"Included Metadata Names (regex)" | Regular expression that includes generic metadata by metadata name. If nothing is specified, all metadata is included. The regex is applied to the name of the metadata (without the sp_ prefix). |
"Excluded Metadata Names (regex)" | Regular expression that excludes generic metadata by metadata name. If nothing is specified, all metadata is included. The regex is applied to the name of the metadata (without the sp_ prefix). |
"Included Content Types (regex)" | Regular expression that includes content types (e.g. file, folder) by the name of the content type. If nothing is specified, all content types are included. The content type of an object can be found in its contenttype metadata. |
"Excluded Content Types (regex)" | Regular expression that excludes content types (e.g. file, folder) by the name of the content type. If nothing is specified, all content types are included. The content type of an object can be found in its contenttype metadata. |
"[Deprecated] Use delta key format" | This option is deprecated and should always be left enabled so that the full functionality of the crawler can be used. If set, a different format is used for the keys. Certain functions of the delta crawl (e.g. renaming lists, deleting attachment files) do not work without this option. If this option is changed, the index should be cleaned and re-indexed. |
"Enable Delta Crawl" | If set, the SharePoint Online API only fetches changes to files instead of crawling the whole SharePoint Online instance. For full functionality, "Use delta key format" should be set. |
"Crawl hidden lists" | If set, lists that are defined as hidden are also indexed. |
"Crawl lists with property 'NoCrawl'" | If this option is set, lists that have the "NoCrawl" property in Microsoft SharePoint Online are also indexed. |
"Max Change Count Per Site" | Number of changes that are processed in the delta crawl per site before the next site is processed. The remaining changes are processed at the next crawl run. |
"Trust all SSL Certificates" | Allows the use of unsecured connections, for example for test systems. Must not be activated in production systems. |
"Parallel Request Count" | Limits the number of parallel HTTP requests sent by the crawler. |
"Page Size" | Maximum number of objects received per request to the API. A high value results in higher speed but higher memory consumption during the crawl run; a small value results in less memory consumption but reduces the speed. If the value is set to 0, no paging is used, meaning that the crawler attempts to fetch all objects at once with a single request. |
"Max Content Length (MB)" | Limits the maximum document size. If a document is larger than this limit, the content of the document is not downloaded (the metadata is retained). The default value is 50 megabytes. |
"[Deprecated] Send User Agent" | This option is deprecated and should always remain enabled, so that the User-Agent header is always sent with the request. Adjust the "User Agent" option instead. If set, the header configured with the "User Agent" option is sent with every HTTP request. |
"User Agent" | The specified value is sent with the HTTP request in the User-Agent header if the option "Send User Agent" is set. |
"Thumbnail Generation for Web Content" (Advanced Setting) | If set, thumbnails are generated for web documents. Enabling this feature is not recommended, as it only works for public pages with anonymous access, which have already been discontinued. |
“Dump Change Responses” | If set, the Sharepoint API responses are written to a log file during delta crawling. |
“Log All HTTP Requests” | If set, all HTTP requests sent by the crawler during the crawl run are written to a .csv file (sp-request-log.csv). |
"Update Crawler State During Crawlrun" | If set, the crawler state is continuously updated during the crawl run (instead of only at the beginning of the crawl run), which can prevent multiple downloads of the same file in certain situations. This option is only active for delta crawl runs. |
"Custom Delta CSV Path" | With this option, a path to a .csv file can be specified with which custom delta starting points can be set. Each line must contain two entries separated by a semicolon: first the Site Relative URL and then a time in the format yyyy-MM-ddTHH:mm:ssZ. Example: /sites/MySite; 2019-10-02T10:00:00Z. This state is not adopted if a state already exists. If the old state is to be overwritten, it must be deleted from the DeltaState file. |
"Update all ACLs" | If set, the crawler updates all ACLs of SharePoint Online objects. This option should only be used for delta crawl runs after a plug-in update in which ACL handling has been improved. Because it may take a long time, this option should be turned off again after one crawl run. |
"Check for Role Update within past Days" | With this setting, the crawler checks for role updates of all sites within the configured number of past days. Due to limitations of the SharePoint Online getChanges API, the maximum value is 59 days. |
"Dry Run Role Update Check" | If set, the check for role updates since the date configured above is only logged to a role-update.csv file instead of being fully processed. |
"Ignore Sharepoint ACLs" (Advanced Setting) | If set, no access permissions for lists or documents are fetched from SharePoint. This option can only be set if at least one Site ACL is configured. |
"Site ACL" | With this option you can set your own ACLs. The "Site URL Pattern" is a regular expression that determines for which sites this principal is configured, "Access Check Action" selects whether it is a grant or a deny, and "Principal" specifies the group/user to which the entry applies (e.g. everyone or max.mustermann@mycompany.onmicrosoft.com). |
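For illustration, here are example values for some of the options above (all site names, paths, and patterns are hypothetical; the Cron syntax is assumed to follow the common Quartz-style format with a seconds field):

Site Discovery Schedule:                   0 0 2 * * ?   (run site discovery daily at 02:00)
Included Sites URL (regex):                /sites/(hr|finance).*
Excluded Lists/Files/Folders URL (regex):  .*/Lists/Archive/.*

A "Custom Delta CSV Path" file with one entry per site could look like this:

/sites/MySite; 2019-10-02T10:00:00Z
/sites/OtherSite; 2020-01-15T08:30:00Z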
Only enter the URL for the Azure ACS endpoint in the “Azure ACS endpoint” field if your SharePoint environment is hosted in a special environment (such as Germany).
The following environments require special URLs:
Germany | |
China | |
US Government | |
A complete list for Azure ACS endpoints can also be found at https://docs.microsoft.com/en-us/sharepoint/dev/solution-guidance/extending-sharepoint-online-for-germany-china-usgovernment-environments.
Configure the options as follows:
"Use App-Only authentication" | When this option is selected, app-only authentication is used instead of user-based authentication. If this option is selected, "Client ID" and "Client secret" also need to be configured. In addition, you need to perform all the "App Registration in SharePoint" steps below. |
"Client ID" | The client ID that is generated as described below. |
"Client secret" | The client secret that is generated as described below. |
There are two ways to register a new app: either directly in SharePoint Online or in Azure. The client secret of an app created in SharePoint Online expires after one year; after that, the secret has to be renewed via PowerShell. In Azure, you can set whether the secret should expire after one year, after two years, or not at all.
In SharePoint Online:
To generate a client ID and a client secret, enter the following URL in the browser:
<Server URL>/_layouts/15/appregnew.aspx
(e.g. https://mycompany.sharepoint.com/_layouts/15/appregnew.aspx)
Click the two "Generate" buttons (one for "Client Id," one for "Client Secret") and enter the other information as follows:
Then click “Create."
Then enter the client ID and the client secret into the Mindbreeze InSpire configuration right away; you will not be able to access the client secret later.
In Azure:
To create an app in Azure, first go to portal.azure.com.
There, register a new app under the "Azure Active Directory" tab and then under the "App registrations" tab:
In the newly created app you will find the application ID, and under the "Certificates & Secrets" tab you can generate a key.
You can use this App Id and Secret in step 2 to authorize the app for SharePoint Online.
App Registration in Sharepoint: Step 2
Now enter the following URL in the browser:
<Server URL>/_layouts/15/appinv.aspx
(e.g. https://mycompany.sharepoint.com/_layouts/15/appinv.aspx)
Enter the client id in the “App Id” field and click “Lookup.” “Title,” “App Domain,” and “Redirect URL” will be filled in automatically. Then enter the following in the “Permission Request XML” field:
<AppPermissionRequests AllowAppOnlyPolicy="true">
  <AppPermissionRequest
    Scope="http://sharepoint/content/sitecollection/web"
    Right="FullControl" />
</AppPermissionRequests>
Note: "FullControl" is required so that Mindbreeze InSpire has access to the access rights of the SharePoint documents to be indexed in order to map the authorizations in Mindbreeze InSpire.
Then click “Create."
App Registration in Sharepoint: Step 3
Additional rights are required so that the ACL information on the users and groups required by the Principal Resolution Service can also be downloaded from SharePoint Online.
Now enter the following URL in the browser:
<Admin Site URL>/_layouts/15/appinv.aspx
(e.g. https://mycompany-admin.sharepoint.com/_layouts/15/appinv.aspx)
ATTENTION: Make sure that you are on the admin page. For example, if the URL is https://mycompany.sharepoint.com, then the admin page is usually https://mycompany-admin.sharepoint.com.
Enter the client ID in the "App Id" field and click "Lookup." "Title," "App Domain," and "Redirect URL" will be filled in automatically. Then enter the following in the "Permission Request XML" field:
<AppPermissionRequests AllowAppOnlyPolicy="true">
  <AppPermissionRequest
    Scope="http://sharepoint/content/tenant"
    Right="FullControl" />
</AppPermissionRequests>
Then click "Create."
"Do Not Request File Author Metadata" | If active, no author information is requested for lists, items, or files. This may help resolve the following error: "HTTP 500: User cannot be found". |
"Get Author From File Properties" | If active, the author metadata is retrieved from the File/Properties object instead of the File/Author object. This prevents errors during crawl runs, for example if the user has since been deleted. A complete re-index is required for this option to take effect correctly. |
"List All Content Types" | If active, an all-content-types.csv file is created in the log directory at the beginning of a crawl run, containing all content types of all lists of all configured sites. |
"Include Unpublished Documents" | If active, all documents/items are always indexed in their most recent version, regardless of whether they have already been published. If this option is disabled, only published documents and only major versions (1.0, 2.0, etc., usually created with each publish) are indexed. |
"ASPX Content Metadata" | With this option you can set your own metadata as content for .aspx files. By default, "WikiField" and "CanvasContent1" are used. If metadata is set with this option, it is used instead, if available. For this option to be applied correctly, a full re-index is required. |
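For example (the field name is only an assumption about your environment): if the content of your .aspx pages is stored in a field called PublishingPageContent, enter that name in "ASPX Content Metadata" so that this field is indexed as the document content instead of "WikiField"/"CanvasContent1".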
Select “Advanced Settings” to configure the following settings.
Enable the option “Enforce ACL Evaluation.”
Add a new service under “Services” by clicking on “Add new service.” Select “SharepointOnlinePrincipalCache” and assign a display name.
"Server URL" | The URL of the Sharepoint Online instance, e.g.: https://mycompany.sharepoint.com Should be configured the same way as in the crawler. |
“Admin Server URL“ | The admin URL of the SharePoint Online instance. Often this is just the server URL with the suffix -admin. e.g: https://mycompany-admin.sharepoint.com Should be configured the same way as in the crawler. |
"Site Relative URL. | The relative paths to the sites to be crawled, starting with a slash, e.g.: /sites/mysite. Each line can contain a path. If this is left empty all detected sites are crawled. If Sites are specified here only this sites and their subsites are crawled Should be configured the same way as in the crawler. |
„Site Discovery Schedule“ | A Cron Expression that specifies when to run Site Discovery. The results are then used in the next crawl runs. This means that the potentially time-consuming site discovery does not have to be repeated with each crawl run. The official documentation for Cron Expressions can be found here, a generator for creating simple Cron Expressions can be found here. |
“Background Site Discovery Parallel Request Count” | Limits the number of parallel HTTP requests sent by the Site Discovery. |
"Site Discovery Strategy" | The strategy by which the site discovery is performed. |
"Do Not Crawl Root Site" | If this option is activated, the root site of the SharePoint Online instance (e.g. mycompany.sharepoint.com) and its subsites are not crawled. Should be configured the same way as in the crawler. |
"Included Sites URL (regex)" | Regular expression that specifies which subsites are to be crawled. If this option is left empty, all subsites are crawled. The regex matches relative URLs, e.g. /sites/mysite. Should be configured the same way as in the crawler. |
"Excluded Sites URL (regex)" | Regular expression that specifies which subsites are to be excluded. The regex matches relative URLs, e.g. /sites/mysite. Should be configured the same way as in the crawler. |
"Enable Delta Update" | If enabled, only the changes to the groups are fetched from SharePoint Online after the first cache creation, instead of fetching all groups each time. This is especially recommended for very large SharePoint instances, as a regular cache update can otherwise take a long time. Delta updating with user-based authentication is not supported; if delta updating is required, app-only authentication must be used. |
"[Deprecated] Send User Agent" | This option is deprecated and should always remain enabled, so that the User-Agent header is always sent with the request. Adjust the "User Agent" option instead. If set, the header configured with the "User Agent" option is sent with every HTTP request. |
"User Agent" | The specified value is sent with the HTTP request in the User-Agent header. |
"Dump Change Responses" | If activated, the changes received from SharePoint Online during the delta update are dumped to a file. This is very helpful for troubleshooting. |
"Log All HTTP Requests" | If set, all HTTP requests sent by the Principal Resolution Service during the cache update are written to a .csv file (sp-request-log.csv). |
"Regex for your organization" | Regular expression that defines whether a user belongs to your organization or not. This is used to resolve the principal "everyone_except_external". The regular expression can refer to the e-mail address, the ObjectSID, or the ObjectGUID from LDAP. |
"Parallel Request Count" | Defines how many HTTP requests are sent simultaneously by the crawler. The higher the value, the faster the crawl run should be, but too high a value can also lead to many "Too Many Requests" errors on the part of SharePoint. A value above 30 is not recommended. |
"Page Size" | Maximum number of objects received per request to the API. A high value results in higher speed but higher memory consumption during the cache update; a small value results in less memory consumption but reduces the speed. If the value is set to 0, no paging is used, meaning that the crawler attempts to fetch all objects at once with a single request. |
"Trust all SSL Certificates" | Allows the use of unsecured connections, for example for test systems. Must not be activated in production systems. |
"Heap Dump On Out Of Memory" | When activated, the Principal Resolution Service writes a heap dump if it runs out of memory. |
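As an illustration for "Regex for your organization" (the domain is a hypothetical placeholder): a pattern such as

.*@mycompany\.onmicrosoft\.com

matches all users whose e-mail address belongs to your tenant; all other users are then treated as external when resolving "everyone_except_external".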
This is only necessary if you have also configured app-only authentication for the data source.
The following configuration options are deprecated. Please use the Microsoft AAD Graph Principal Resolution Service instead. This makes cache updates faster and the cache is smaller.
If you have not set up the "AD Connect" function in Azure Active Directory, or if you crawl SharePoint sites created via Microsoft Teams, select "Resolve Groups over Graph" and fill in the fields "Tenant Context ID", "Application ID", "Generated Key", and "Protected Resource Hostname". The corresponding values can be found in the Azure Portal.
The Graph Page Size defines the maximum number of objects received per request to the Graph API. Unlike SharePoint Online, paging in Graph cannot be disabled, so this value must always be set between 1 and 999.
The option "Resolve Site Owner over Graph" is necessary if "Grant Site Owner" is enabled in the crawler and the Site Owners are Graph groups (as is usually the case in SharePoint Online). Then this group is resolved for all pages.
The creation of a new app was already shown in step 1 of the chapter "App-Only Authentication". Just repeat this step (steps 2 and 3 are not necessary).
In the section "API permissions" you must then give the application the necessary permissions. Required here are "Directory.Read.All" permissions for "Azure Active Directory Graph":
If AD Connect is set up in your Azure Active Directory, do not enable the “AD Connect is NOT configured” option.
The following table lists the protected resource hostnames for different cloud environments:
Global Service | graph.microsoft.com |
Germany | graph.microsoft.de |
China | microsoftgraph.chinacloudapi.cn |
US Government | graph.microsoft.us |
A complete list of protected resource hostnames can also be found at https://developer.microsoft.com/en-us/graph/docs/concepts/deployments
An LDAP cache is required to resolve users from Active Directory. How to set up a Caching Principal Resolution Service is described at: https://help.mindbreeze.com/de/index.php?topic=doc/Installation--Konfiguration---Caching-Principal-Resolution-Service/index.htm
The following values should be entered under "User Alias Name LDAP Attributes" in the LDAP cache:
cn
objectGUID
objectSID
Enter the information about the LDAP cache under “LDAP Settings.” Enable the option “Use LDAP Principal Cache Service” and enter the corresponding port of your LDAP principal cache.
Under “Cache Settings,” configure where you want the database for the cache to be located and set the desired interval for the updates.
Under “Service Settings,” enter a free port to be used for the principal cache and enable the “Lowercase Principals” option so that the SharePoint groups can be resolved correctly.
Only enter the URLs for Azure AD Endpoint and Azure ACS Endpoint in the “Azure AD Endpoint” and “Azure ACS endpoint” fields if your SharePoint environment is hosted in a special environment (such as Germany).
The following environments require special URLs for Azure AD Endpoint:
Germany | |
China | |
US Government | |
The following environments require special URLs for Azure ACS Endpoint:
Germany | |
China | |
US Government | |
A complete list for Azure ACS endpoints can also be found at https://docs.microsoft.com/en-us/sharepoint/dev/solution-guidance/extending-sharepoint-online-for-germany-china-usgovernment-environments.
These config options are described in the documentation for the Caching Principal Resolution Service: https://help.mindbreeze.com/en/index.php?topic=doc/Installation--Configuration---Caching-Principal-Resolution-Service/index.htm
When the SharePoint Online groups are resolved in the SharePoint Online Principal Resolution Cache, these groups may contain Office 365 groups. In this case, those groups must be resolved via the Graph API, and the Microsoft AAD Graph Principal Resolution Service is required. Whether your SharePoint Online groups contain Office 365 groups is often not easily visible in the SharePoint Online UI; if you are not sure, you should set up the Microsoft AAD Graph Principal Resolution Service to be safe and configure it as a parent service for the SharePoint Online Principal Resolution Service.
The creation of a new app was already shown in step 1 of the chapter "App-Only Authentication". Just repeat this step (steps 2 and 3 are not necessary).
In the area "API permissions" you have to give the application the necessary permissions. Required here are "Directory.Read.All" permissions for "Azure Active Directory Graph":
Add a new service in the "Services" section by clicking on "Add Service". Select "MicrosoftGraphPrincipalCache" and assign a display name.
"Tenant Context ID" | The tenant name of the Microsoft instance, e.g. mycompany.onmicrosoft.com |
"Application ID" | The application ID of the app created in the Azure Portal |
"Generated Key" | The secret of the app created in the Azure Portal |
"Protected Resource Hostname" | If you are not using a specific national cloud deployment, the protected resource hostname is graph.windows.net. If you are using a national cloud deployment, please refer to the table below for the resource hostname. |
"Graph Page Size" | The number of Graph groups fetched per request to the Graph API |
"Log All HTTP Requests" | If set, all HTTP requests sent by the cache during the update are written to a .csv file (sp-request-log.csv). |
"User Agent" | The specified value is sent in the User-Agent header with every HTTP request |
"Heap Dump On Out Of Memory" | If this option is activated, the Principal Resolution Service writes a heap dump if it runs out of memory |
The following table shows the "Protected Resource Hostnames" for different cloud environments:
Global Service | graph.microsoft.com |
Germany | graph.microsoft.de |
China | microsoftgraph.chinacloudapi.cn |
US Government | graph.azure.us |
A complete list of "Protected Resource Hostnames" can also be found at: https://developer.microsoft.com/en-us/graph/docs/concepts/deployments
The LDAP, Cache, Health Check and Service Settings are equivalent to the SharePoint Online Principal Resolution Service.
The SharePoint Online Authorization Service makes a request to SharePoint Online for each object to find out whether the respective user actually has access to it. Since this can take a lot of time, the Authorization Service should only be used for testing purposes; normally, all checks of the Authorization Service should be positive anyway. However, it makes it easy to find out whether there are problems with the permissions of certain objects.
To use the Authorization Service in the search, set the option "Approved Hits Reauthorize" in the index Advanced Settings to "External Authorizer", and select the created Authorization Service as "Authorization Service" in the crawler.
"Server URL" | The URL of the SharePoint Online instance, e.g. https://mycompany.sharepoint.com. Should be configured the same way as in the crawler. |
"SharePoint Online Email Regex" | Regex pattern for e-mail addresses used in SharePoint Online. This pattern is used to decide which principal is used to check the permissions in SharePoint Online. |
"Parallel Request Count" | Defines how many HTTP requests are sent simultaneously by the Authorization Service. The higher the value, the faster it should be, but too high a value can also lead to many "Too Many Requests" errors on the part of SharePoint. A value above 30 is not recommended. |
"[Deprecated] Send User Agent" | This option is deprecated and should always remain enabled, so that the User-Agent header is always sent with the request. Adjust the "User Agent" option instead. If set, the header configured with the "User Agent" option is sent with every HTTP request. |
"User Agent" | The specified value is sent with the HTTP request in the User-Agent header if the option "Send User Agent" is set. |
Only enter the URL for the Azure ACS endpoint in the “Azure ACS endpoint” field if your SharePoint environment is hosted in a special environment (such as Germany).
The following environments require special URLs:
Germany | |
China | |
US Government | |
A complete list for Azure ACS endpoints can also be found at https://docs.microsoft.com/en-us/sharepoint/dev/solution-guidance/extending-sharepoint-online-for-germany-china-usgovernment-environments.
Configure the options as follows:
“Client ID" | The client ID that is generated as described below. |
“Client secret“ | The client secret that is generated as described below. |
Only app-only authentication can be used for the Authorization Service.
If you are using app-only authentication, this section is NOT applicable to you. Otherwise, proceed as follows:
Navigate to the “Network” tab and add a new credential for Microsoft SharePoint Online under “Credentials” by clicking “Add Credential.”
Enter the credentials for the user you want to use for indexing and assign a name for the credential. Select a user with adequate permissions to read all relevant pages and permissions.
Then add a new endpoint for the credential you just created by clicking on “Add Endpoint” under “Endpoints.” Enter the server URL of your Microsoft SharePoint Online installation as the location and select the credential you just created.
With the help of the SharePoint Online connector, OneDrive sites can also be crawled. Some points must be taken into account in the configuration:
The following SharePoint Online API endpoints are used by the SharePoint Online Crawler and the SharePoint Online Principal Resolution Service.
URL | HTTP Method | Description |
<Azure Endpoint>/GetUserRealm.srf | POST | If the user is an ADFS user, the authentication server is fetched with this endpoint. |
<ADFS Authentication Server> | POST | A login request is made against the server that was fetched with the GetUserRealm call. |
<Azure Endpoint>/rst2.srf | POST | A login token with the entered username/password credentials is retrieved. |
<ServerURL>/_vti_bin/idcrl.svc/ <AdminServerURL>/_vti_bin/idcrl.svc/ | GET | Login cookies are retrieved with the previously retrieved token. The cookies of the admin URL are required for SiteDiscovery. |
<AdminServerURL>/_vti_bin/sites.asmx | POST | With the previously retrieved cookies, a digest hash is retrieved, which is required for SiteDiscovery. |
<ServerURL>/_vti_bin/client.svc <AdminServerURL>/_vti_bin/client.svc | GET | This endpoint is used to get information about the tenant, e.g. the TenantId. |
<Azure Endpoint>/<Tenant Id>/tokens/OAuth/2 | POST | This endpoint is used to generate an Access Token for AppOnly Authorization. |
<AdminServerURL>/_api/ProcessQuery | POST | This endpoint finds all sites on the Sharepoint Online instance when AppOnly authentication is used and the Site Discovery Strategy is set to Auto or when the Site Discovery Strategy is set to Admin API. |
<ServerURL>/_api/search/query | GET | This endpoint finds all sites on the Sharepoint Online instance to which the user has access if User Based Authentication is used and the Site Discovery Strategy is set to Auto or if the Site Discovery Strategy is set to Search. |
<ServerURL><SiteRelativeUrl>/_api/web/siteusers | GET | With this endpoint, all users are fetched to find out which are Site Collection Administrators. |
<ServerURL><SiteRelativeUrl>/_api/Web/getChanges | POST | With this endpoint all changes of a Site are fetched during the delta crawl. |
<ServerURL><SiteRelativeUrl>/_api/Site/getChanges | POST | With this endpoint all Group Membership changes of a Site are fetched during the Delta Update. |
<ServerURL><SiteRelativeUrl>/_api/Web/webs | GET | This endpoint is used to fetch the direct Subsites of Sites. |
<ServerURL><SiteRelativeUrl>/_api/Web/RegionalSettings/TimeZone | GET | This endpoint is used to fetch the set time zone of the Site. |
<ServerURL><SiteRelativeUrl>/_api/Web | GET | This endpoint is used to fetch the metadata of a Site. |
<ServerURL><SiteRelativeUrl>/_api/web/lists | GET | With this endpoint all lists of a Site and some additional metadata are fetched, among others also the "RoleAssignments" field, for which "Enumerate Permissions" permissions are needed. |
<ServerURL><SiteRelativeUrl>/_api/web/lists(guid'<ListId>') | GET | This endpoint is used to fetch a single list. |
<ServerURL><SiteRelativeUrl>/_api/web/lists(guid'<ListId>')/Items | GET | With this endpoint, all items of a list and some additional metadata are fetched, among others the "RoleAssignments" field, for which "Enumerate Permissions" permissions are required. |
<ServerURL><SiteRelativeUrl>/_api/web/lists(guid'<ListId>')/Items(<ItemId>) | GET | This endpoint is used to fetch a single item. |
<ServerURL><SiteRelativeUrl>/_api/web/GetFileByServerRelativeUrl('<FileRelativeUrl>') <ServerURL><SiteRelativeUrl>/_api/web/GetFileById('<FileId>') <ServerURL><SiteRelativeUrl>/_api/web/GetFileById('<FileId>')/ListItemAllFields <ServerURL><SiteRelativeUrl>/_api/web/GetFileById('<FileId>')/ListItemAllFields/Versions(<VersionId>) | GET | These endpoints are used to fetch the metadata of a file. |
<Direct link to a file>/$value | GET | This endpoint is used to download the contents of a file. |
<ServerURL><SiteRelativeUrl>/_api/Site | GET | This endpoint is used to fetch the Site Collection metadata. |
<ServerURL><SiteRelativeUrl>/_api/Web/GetUserById(<Id>) | GET | This endpoint is used to fetch a user's metadata. |
<ServerURL><SiteRelativeUrl>/_api/web/sitegroups <ServerURL><SiteRelativeUrl>/_api/web/sitegroups(<GroupId>)/users | GET | With these endpoints, all groups of a site and all users in these groups are fetched. "Enumerate Permissions" permissions are required for this endpoint. |
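To illustrate how such endpoints are typically called, here is a minimal sketch in Python, assuming the classic ACS app-only flow with the client ID/secret registered above. All tenant names, IDs, and the site path are hypothetical placeholders; accounts.accesscontrol.windows.net is the global Azure ACS endpoint (use the special endpoint for national cloud environments as described above). The sketch fetches a token via the <Azure Endpoint>/<Tenant Id>/tokens/OAuth/2 endpoint and then queries the lists of a site:

import requests

# Hypothetical placeholder values - replace with your own tenant and app registration.
TENANT = "mycompany"
TENANT_ID = "00000000-0000-0000-0000-000000000000"   # tenant/realm GUID
CLIENT_ID = "11111111-1111-1111-1111-111111111111"
CLIENT_SECRET = "<client secret>"
SHAREPOINT_PRINCIPAL = "00000003-0000-0ff1-ce00-000000000000"  # well-known SharePoint principal ID

# Step 1: request an app-only access token (<Azure Endpoint>/<Tenant Id>/tokens/OAuth/2).
token_response = requests.post(
    f"https://accounts.accesscontrol.windows.net/{TENANT_ID}/tokens/OAuth/2",
    data={
        "grant_type": "client_credentials",
        "client_id": f"{CLIENT_ID}@{TENANT_ID}",
        "client_secret": CLIENT_SECRET,
        "resource": f"{SHAREPOINT_PRINCIPAL}/{TENANT}.sharepoint.com@{TENANT_ID}",
    },
)
access_token = token_response.json()["access_token"]

# Step 2: call one of the endpoints from the table, e.g. fetching all lists of a site.
lists_response = requests.get(
    f"https://{TENANT}.sharepoint.com/sites/mysite/_api/web/lists",
    headers={
        "Authorization": f"Bearer {access_token}",
        "Accept": "application/json;odata=verbose",
    },
)
print(lists_response.status_code)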
There are two categories of possible authentication methods for the SharePoint Online Crawler to authenticate and crawl a SharePoint Online instance:
App-only authentication is the recommended approach for SharePoint Online authentication, as it is quick to set up, requires no fine-tuning of permissions or settings, and supports delta crawling and delta updates.
If the full functionality of the crawler is required, the crawler needs to be given Full Control permissions. The Full Control permissions need to be granted on both the site collection level and the tenant level.
If ACLs are not required (for example, when only crawling public sites), crawling can be performed with write permissions (if site discovery is required) or read permissions (if site discovery is not required).
There are currently two available user-based authentication methods, NTLM and ADFS. There is effectively no difference in configuration and permission settings between these two methods.
The SharePoint Online Crawler automatically determines whether the configured crawling user comes from an ADFS system or is an NTLM user, and then logs in either via ADFS or directly on SharePoint Online (for NTLM).
For both of these authentication methods, credentials need to be added to the Network tab, with an endpoint pointing towards the SharePoint Online instance.
If the full functionality of the crawler is required, the user needs to have at least the "Enumerate Permissions" and "Use Remote Interfaces" permissions on every site, list, and item that should be crawled. Unless inheritance is broken somewhere within the SharePoint hierarchy, simply assigning the user those permissions on every site should be sufficient. If inheritance is broken somewhere within the file/folder/subsite hierarchy of the SharePoint Online server, special care needs to be taken when indexing these specific areas: the user needs to be given specific permissions for each of these inheritance exceptions. If this is not done correctly, some items might not be indexed or the crawl could abort.
Delta Update (Cache) is not supported with user-based authentication.
Since user-based authentication uses the permissions of the user, authorization problems can occur during crawling if the user does not have the necessary permissions. This section should help you find and fix possible causes of problems (e.g. 403 Forbidden responses during crawling) that may occur even though the user has already been given permissions for all sites to be crawled.
At least "Enumerate Permissions" permissions are required for the calls used in the crawler and principal cache. Unfortunately, these are not specified in any default Permission Level (Read, Write, Manage, Full Control are the Default Levels - Enumerate Permissions is only included in Full Control), so it would be best to specify a separate Permission Level and give it to the user on each page to be crawled.
This can be done by clicking on "Permission Levels" on the Permissions page of the site and then creating a new permission level with "Enumerate Permissions" and "Use Remote Interfaces" (if "Enumerate Permissions" is selected, the permissions it depends on are selected automatically).
By default, all objects in SharePoint Online (i.e. all pages, lists, items, files etc.) inherit the permission settings of their parent. However, this inheritance can be broken, so that permission changes of the parent do not affect the child object. In this case, the user must be given separate permissions for these objects.
If there are objects that break the inheritance, you will see a warning on the Permission Settings page.
Hidden lists are usually lists that are not used by users, but by SharePoint Online itself for various purposes. These lists usually do not need to be indexed, because they do not contain any data that is interesting for users. The crawling of these lists can be enabled and disabled with the option "Crawl Hidden Lists".
However, if you want to crawl these lists, you have to make sure that none of them break the inheritance, or, if they do, you have to give the user permissions for these lists. Since these lists are not displayed in the GUI, SharePoint Online will not warn you if the inheritance is broken. In this case, the SharePoint API must be used to check whether these lists break the inheritance.
During a full crawl run, the ListItemCount.csv log file lists the number of items in all lists of the indexed sites.
The file contains the Site Relative URL, List ID, List Name, List URL, and the number of items in the list.
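A hedged example of what a line in ListItemCount.csv might look like (the column order follows the description above; the exact separator, header names, and all values are assumptions):

/sites/mysite;4f4c1309-28be-4f4c-9762-da63fa7e7a0e;Documents;/sites/mysite/Shared Documents;1234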