Copyright ©
Mindbreeze GmbH, A-4020 Linz, 2024.
All rights reserved. All hardware and software names used are brand names and/or trademarks of their respective manufacturers.
These documents are strictly confidential. The submission and presentation of these documents does not confer any rights to our software, our services and service outcomes, or any other protected rights. The dissemination, publication, or reproduction hereof is prohibited.
For ease of readability, gender differentiation has been waived. Corresponding terms and definitions apply within the meaning and intent of the equal treatment principle for both sexes.
Before installing the SharePoint Online connector, make sure that the Mindbreeze server is installed and the SharePoint Online connector is included in the license. Use the Mindbreeze Management Center to install or update the connector.
To install the plug-in, open the Mindbreeze Management Center. Select “Configuration” from the menu pane on the left-hand side. Then navigate to the “Plugins” tab. Under “Plugin Management,” select the appropriate zip file and upload it by clicking “Upload.” This automatically installs or updates the connector, as the case may be. In the process, the Mindbreeze services are restarted.
For a default configuration of the SharePoint Online Connector, the following plugins must be configured:
Select the “Advanced” installation method for configuration.
To create a new index, navigate to the “Indices” tab and click the “Add new index” icon in the upper right corner.
Enter the path to the index and change the display name as necessary.
Add a new data source by clicking the “Add new custom source” icon at the top right. Select the category “Microsoft SharePoint Online” and configure the data source according to your needs.
In the "Sharepoint Online" area you can define your Microsoft SharePoint Online installation that is to be indexed. The following options are available:
"Server URL" | The URL of the Sharepoint Online instance, e.g.: https://mycompany.sharepoint.com |
“Admin Server URL“ | The admin URL of the SharePoint Online instance. Often this is just the server URL with the suffix -admin. e.g: https://mycompany-admin.sharepoint.com |
“Site Relative URL“ | The relative paths to the sites to be crawled, starting with a slash, e.g.: /sites/mysite. Each line can contain a path. If this field is left empty, all detected sites are crawled. When sites are specified in this field, only specified sites and their subsites are crawled. |
“Detect Yammer Attachment Sites“ | When this setting is enabled, additionally to your currently configured Site Discovery Settings, the Crawler tries to find every SharePoint Site which was automatically created by Yammer to store attachments. When this setting is enabled and no Site Relative URLs are defined, only Yammer Attachment Sites will be crawled. |
“Site Discovery Schedule“ | An extended cron expression that specifies when to run site discovery. The results are then used in the next crawl runs. This means that the potentially time-consuming site discovery does not have to be repeated with each crawl run. Documentation and examples of cron expressions can be found here. |
“Background Site Discovery Parallel Request Count” | Limits the number of parallel HTTP requests sent by the Site Discovery. |
“Site Discovery Strategy” | The strategy by which the site discovery is to be performed. |
“Ignore Site Discovery Errors” | If this option is enabled, all site discovery errors will only be logged and otherwise ignored. You can use this option if you have a large and constantly changing number of sites and unexpected errors occur. |
“Do Not Crawl Root Site” | If this option is activated, the root page of the SharePoint Online instance (e.g. mycompany.sharepoint.com) and its subpages will not be crawled. |
"Included Sites URL (regex)" | Regular expression that can be used to specify which subsites are to be crawled. If this option is left empty, all subsites will be crawled. The regex matches relative URLs. e.g /sites/mysite |
"Excluded Sites URL (regex)" | Regular expression that can be used to specify which subsites are to be excluded. The regex matches relative URLs. e.g /sites/mysite |
"Included Lists/Files/Folders URL (regex)" | Regular Expression, which can be used to specify which lists, files and folders should be included. The metadata "url" (absolute url) is compared. If this option is left empty, everything is included. Note: If you want to include/exclude complete subsites, please use the option "Included Sites URL (regex)" or "Excluded Sites URL (regex)" |
"Excluded Lists/Files/Folders URL (regex)" | Regular Expression, which can be used to specify which lists, files and folders should be excluded. The metadata "url" (absolute url) is compared. For example, if you find a document in the Mindbreeze search that you want to exclude, you can copy the URL from the "Open" action and use it in the "Excluded Lists/Files/Folders URL (regex)" option |
“Included Metadata Names (regex)” | Regular expression used to include generic metadata with the name of the metadata. If nothing is specified, all metadata is included. The regex is applied to the name of the metadata (without the sp_ prefix). |
“Excluded Metadata Names (regex)” | Regular expression that excludes generic metadata with the name of the metadata. If nothing is specified, all metadata is included. The regex is applied to the name of the metadata (without the sp_ prefix). |
“Index Complex Metadata Types (regex)” | Regular expression that can be used to include complex metadata for indexing. By default, complex metadata (e.g. metadata entries that themselves have multiple metadata) are not indexed. This pattern refers to the "type" of the metadata, for example SP.FieldUrlValue for Link Fields or SP.Taxonomy.TaxonomyFieldValue for Managed Metadata. |
“Included Content Types (regex)” | Regular expression that includes content types (e.g. file, folder) via the name of the content type. If nothing is specified, all content types are included. The content type of objects can be found in the contenttype metadata. |
“Excluded Content Types (regex)” | Regular expression that excludes content types (e.g. file, folder) via the name of the content type. If nothing is specified, all content types are included. The content type of objects can be found in the contenttype metadata. |
“[Deprecated] Use delta key format” | This option is deprecated and should always be left enabled so that the full functionality of the crawler can be used. If set, a different format is used for the keys. Certain functions of the delta crawl (e.g. renaming lists, deleting attachment files) do not work without this option. If this option is changed, the index should be cleaned and re-indexed. |
“Enable Delta Crawl” | If set, the SharePoint Online API is used to fetch only changes to files instead of crawling the whole SharePoint Online instance. For full functionality, “Use delta key format” should be set. |
“Skip Delta Errors” | If set, errors are skipped during the delta crawl. Otherwise, the document where the error occurred will be indexed again during the next crawl run. Warning: If this option is enabled, there may be differences between the Mindbreeze index and SharePoint Online. Enable this option only temporarily if persistent errors occur during delta crawling that do not resolve after several crawl runs. |
“Index Sites As Document” | If set, all SharePoint Online sites are also indexed as Mindbreeze documents. The content of these documents is the Welcome Page of the site. When you deactivate this setting, a complete re-index is required. |
"Crawl hidden lists" | If set, lists that are defined as hidden are also indexed |
"Crawl lists with property 'NoCrawl'" | If this option is set, those lists are also indexed that have the "NoCrawl" property in Microsoft SharePoint Online |
“Max Change Count Per Site” | Number of changes that are processed in the delta crawl per page before the next page is processed. The remaining changes are processed at the next crawl run. |
“Trust all SSL Certificates” | Allows the use of unsecured connections, for example for test systems. Must not be activated in production systems. |
“Parallel Request Count” | Limits the number of parallel HTTP requests sent by the crawler. |
“Page Size” | Maximum number of objects received per request to the API. A high value results in higher speed but higher memory consumption during the crawl run; a small value results in less memory consumption but reduces speed. If the value is set to 0, no paging is used, meaning that the crawler attempts to fetch all objects at once in a single request. |
"Max Content Length (MB)" | Limits the maximum document size. If a document is larger than this limit, the content of the document is not downloaded (the metadata is retained). The default value is 50 megabytes |
„[Deprecated] Send User Agent“ | This option is deprecated and should always remain enabled. The User Agent header should always be sent with the request. Adjust the "User Agent" option instead. If set, the header configured with the "User Agent" option is sent with every HTTP request. |
„User Agent“ | The specified value is sent with the HTTP request in the User-Agent header, it the option “Send User Agent” is set. |
„Thumbnail Generation for Web Content“ (Advanced Setting) | If set, thumbnails are generated for web documents. It is not recommended to enable this feature as it only works for public pages with anonymous access which have already been discontinued. |
“Dump Change Responses” | If set, the SharePoint API responses are written to a log file during delta crawling. |
“Log All HTTP Requests” | If set, all HTTP requests sent by the crawler during the crawl run are written to a .csv file (sp-request-log.csv). |
“Update Crawler State During Crawlrun” | If set, the crawler state is continuously updated during the crawl run (instead of only at the beginning of the crawl run), which can prevent multiple downloads of the same file in certain situations. This option is only active for Delta Crawlruns. |
“Recrawl CSV File Path” | If set, the specified file is monitored, and changes to it during the crawl can trigger a re-indexing of specific sites, lists, or items/files. The CSV file must begin with the expected header line. Each subsequent line can contain the following values; only the Site Relative URL is required, the rest are optional (see the example after this table).
Example lines: /sites/Marketing;Documents;Schedule.docx;true: the file “Schedule.docx” in the list “Documents” in the site “/sites/Marketing” is re-indexed. /sites/Marketing;SalesData;;: the list "SalesData" in the site "/sites/Marketing" and all of its items are re-indexed if they have not been indexed yet, or if their modification date or permission information differs from the currently indexed version. After the CSV file has been processed, it is renamed to <filename>.old; a new file with further objects to be indexed can then be created. For performance reasons, it is recommended to place the CSV file in its own folder, without other frequently edited files. For this, the mes user needs at least Read, Write, and Execute permissions on the folder and Read and Write permissions on the file. |
Custom Delta CSV Path | With this option, a path to a .csv file can be specified with which custom delta points can be set. Each line must contain two entries separated by a semicolon: first the Site Relative URL, then a time in the format yyyy-MM-ddTHH:mm:ssZ, e.g. /sites/MySite; 2019-10-02T10:00:00Z (see the example after this table). This state is not adopted if a state already exists; if the old state is to be overwritten, it must be deleted from the DeltaState file. |
Update all ACLs | If set, the crawler updates all ACLs of SharePoint Online objects. This option should only be used for delta crawl runs after a plug-in update where ACL handling has been improved. This option should be turned off again after a crawl run because it may take a long time. |
Check for Role Update within past Days | With this setting, the crawler checks for role updates of all sites within the configured number of past days. Due to limitations of the SharePoint Online getChanges API, the maximum value is 59 days. |
Dry Run Role Update Check | If set, the check for Role Updates since the date configured above is only logged into a role-update.csv file instead of being fully processed. |
“Ignore Sharepoint ACLs” (Advanced Setting) | If set, no access permissions for lists or documents are fetched from SharePoint. This option can only be set if at least one Site ACL is configured. |
“Site ACL” | With this option you can set your own ACLs. The Site URL Pattern is a regular expression that determines for which sites this principal is configured, Access Check Action selects whether it is a grant or a deny, and Principal specifies the group/user to which the entry applies (e.g. everyone or max.mustermann@mycompany.onmicrosoft.com). |
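To illustrate the two CSV-based options above: the following is a minimal “Recrawl CSV File” sketch built from the example lines in the table (the required header line is not reproduced in this document and must precede these lines):
/sites/Marketing;Documents;Schedule.docx;true
/sites/Marketing;SalesData;;
A corresponding one-line “Custom Delta CSV” sketch that sets a delta point for a hypothetical site looks like this:
/sites/MySite; 2019-10-02T10:00:00Z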
Only enter the URL for the Azure AD endpoint in the “Azure AD endpoint” field if you use a SharePoint National Cloud.
The following environments require special URLs:
China | |
US Government |
A complete list of Azure ACS endpoints can also be found at https://learn.microsoft.com/en-us/azure/active-directory/develop/authentication-national-cloud#azure-ad-authentication-endpoints.
Configure the options as follows:
“Use App-Only authentication” | When this option is selected, app-only authentication is used instead of user-based authentication. If this option is selected, “Client ID” and “Client secret” also need to be configured. In addition, you need to perform all the “App Registration in Sharepoint” steps below. |
“App Credential” | The credential that was created in the Network tab and contains the generated Client ID and Secret, as described below. |
“[Deprecated] Client ID” | This option is deprecated and should no longer be used. Use the setting “App Credential” instead. The client ID that is generated as described below. |
“[Deprecated] Client secret” | This option is deprecated and should no longer be used. Use the setting “App Credential” instead. The client secret that is generated as described below. |
There are two ways to register a new app: either directly in SharePoint Online or in Azure. The client secret of an app created in SharePoint Online expires after one year; it then has to be renewed via PowerShell (see the sketch below). In Azure, you can set whether the secret should expire after one year, after two years, or not at all.
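The renewal itself is not part of the Mindbreeze configuration. As a rough sketch (assuming the MSOnline PowerShell module and a placeholder <client-id>), replacing an expiring secret could look like this:
# Sketch: renew the client secret of an app registered in SharePoint Online.
# Assumes the MSOnline module; <client-id> is a placeholder for the App/Client ID from appregnew.aspx.
Connect-MsolService
$clientId = "<client-id>"
# Generate a new random secret value
$bytes = New-Object Byte[] 32
[System.Security.Cryptography.RandomNumberGenerator]::Create().GetBytes($bytes)
$newSecret = [System.Convert]::ToBase64String($bytes)
# Register the new secret for one more year and print it (store it immediately)
New-MsolServicePrincipalCredential -AppPrincipalId $clientId -Type Symmetric -Usage Verify -Value $newSecret -StartDate (Get-Date) -EndDate (Get-Date).AddYears(1)
$newSecret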
In SharePoint Online:
To generate a client ID and a client secret, enter the following URL in the browser:
<Server URL>/_layouts/15/appregnew.aspx
(e.g. https://mycompany.sharepoint.com/_layouts/15/appregnew.aspx)
Click the two “Generate” buttons (for “Client Id” and for “Client Secret”) and enter the other information as follows:
Then click “Create”.
Copy the Client ID and the Client Secret directly into the Mindbreeze InSpire configuration; you will not be able to access the client secret later.
To do this, create a new credential of the type “OAuth 2” in the “Network” tab. You only need to enter the Client ID and Client Secret in this credential.
In Azure:
To create an app in Azure, first go to portal.azure.com.
There, register a new app under "Azure Active Directory" and then under "App registrations":
In the newly created app you will find the Application ID, and under "Certificates & Secrets" you can generate a key.
Copy the Client ID and the Client Secret directly into the Mindbreeze InSpire configuration; you will not be able to access the client secret later.
To do this, create a new credential of the type “OAuth 2” in the “Network” tab. You only need to enter the Client ID and Client Secret in this credential.
You can use this App Id and Secret in step 2 to authorize the app for SharePoint Online.
Granting permissions in SharePoint: Step 2
There are two ways to assign permissions for the app in SharePoint Online. Assigning tenant permissions is recommended, because the app then simply receives the permissions for all necessary content and can find and index it without problems.
Alternatively, you can assign permissions individually for each site to be indexed. This limits the app's permissions to the minimum, but if you want to index a large amount of content or frequently changing content, this variant can require considerable maintenance.
Preparation: Enabling Custom App Authentication for New Instances
If you have a relatively new SharePoint Online instance (created around January 2022 or later), Custom App Authentication may be disabled by default. In this case, you need to run the following commands in PowerShell, as described in this Microsoft forum:
# Install the SharePoint Online management module (required only once)
Install-Module -Name Microsoft.Online.SharePoint.PowerShell
# The administrator account and the organization name
$adminUPN="<the full email address of a SharePoint administrator account, example: jdoe@contosotoycompany.onmicrosoft.com>"
$orgName="<name of your Office 365 organization, example: contosotoycompany>"
# Connect to the SharePoint Online admin endpoint
$userCredential = Get-Credential -UserName $adminUPN -Message "Type the password."
Connect-SPOService -Url https://$orgName-admin.sharepoint.com -Credential $userCredential
# Re-enable custom app authentication for the tenant
Set-SPOTenant -DisableCustomAppAuthentication $false
Granting Tenant Permissions
Enter the following URL in the browser:
<Admin Site URL>/_layouts/15/appinv.aspx
(e.g. https://mycompany-admin.sharepoint.com/_layouts/15/appinv.aspx)
ATTENTION: Make sure that you are on the admin page. For example, if the URL is https://mycompany.sharepoint.com, then the admin page is usually https://mycompany-admin.sharepoint.com.
Enter the Client Id in the "App Id" field and click the "Lookup" button. "Title", "App Domain" and "Redirect URL" will be filled in automatically. Then enter the following in the "Permission Request XML" field:
<AppPermissionRequests AllowAppOnlyPolicy="true">
<AppPermissionRequest
Scope="http://sharepoint/content/tenant"
Right="FullControl" />
</AppPermissionRequests>
Note: "FullControl" is required so that Mindbreeze InSpire has access to the access rights of the SharePoint documents to be indexed in order to map the authorizations in Mindbreeze InSpire.
Then click "Create".
Alternative: Assigning permissions per Site
You have to repeat the following steps for each Site to be indexed.
In the browser, enter the URL:
<Server URL>/<Site Relative URL>/_layouts/15/appinv.aspx
(e.g. https://mycompany.sharepoint.com/_layouts/15/appinv.aspx)
Enter the client id in the “App Id” field and click “Lookup.” “Title,” “App Domain,” and “Redirect URL” will be filled in automatically. Then enter the following in the “Permission Request XML” field:
<AppPermissionRequests AllowAppOnlyPolicy="true">
<AppPermissionRequest
Scope="http://sharepoint/content/sitecollection"
Right="FullControl"
/>
</AppPermissionRequests>
Note: "FullControl" is required so that Mindbreeze InSpire has access to the access rights of the SharePoint documents to be indexed in order to map the authorizations in Mindbreeze InSpire.
Then click “Create”.
“Do Not Request File Author Metadata” | If active, no author information is requested for lists, items, or files. This may help resolve the following error: “HTTP 500: User cannot be found”. |
“Get Author From File Properties” | If active, the author metadata is retrieved from the File/Properties object instead of the File/Author object. This prevents errors during crawl runs, for example, if the user has since been deleted. A complete re-index is required so that this option can be used correctly. |
“List All Content Types“ | If active, an all-content-types.csv file is created in the log directory at the beginning of a crawl run, which contains all content types of all lists of all configured pages. |
“Analyse Completeness“ | When enabled, a list-based completeness analysis of the index is performed during the crawl, which is written to a CompletenessAnalysis.csv file in the log directory. Note: The Completeness Analysis does not check all sites in SharePoint Online, but only those that are configured for indexing. |
“Include Unpublished Documents“ | If active, all documents/items are always indexed in their most recent version, regardless of whether they have already been published. If this option is disabled, only published sites and only major versions (1.0, 2.0 etc., usually created with each publish) will be indexed. |
“ASPX Content Metadata” | With this option you can set your own metadata as content for .aspx files. By default, "WikiField" and "CanvasContent1" are used. If metadata is set with this option, it will be used instead, if available. For this option to be applied correctly, a full re-index is required. |
“Download OneNote Files as HTML via Graph API” | If enabled, all OneNote documents (.one files) are downloaded as HTML documents via the Microsoft Graph API. This allows for a more accurate interpretation of the document contents. The documents will still be displayed as .one files. Note: To use this function, the following points must be taken into account: |
“Graph Service Root” | The endpoint/URL of the Microsoft Graph API. By default, "https://graph.microsoft.com". Change this setting only if you are using a national (non-international) Microsoft Cloud. A list of all available national cloud endpoints can be found further below. |
“Graph Tenant ID“ | The Tenant ID of your Microsoft 365 instance. You can find this on the Overview page of the app you created in Azure. |
“Graph App ID” | The Application (Client) ID of the app created in Azure. |
“Graph Client Secret” | The credential created in the Network tab, which contains the created Client Secret. |
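To verify that the configured Graph Tenant ID, App ID, and Client Secret are valid, you can, for example, request a token manually via the OAuth 2.0 client credentials flow. The following is a sketch for the global cloud (national clouds use different login endpoints); all values in angle brackets are placeholders:
# Sketch: request a Microsoft Graph access token with app credentials (global cloud)
$tenantId = "<graph-tenant-id>"
$body = @{
    client_id     = "<graph-app-id>"
    client_secret = "<graph-client-secret>"
    scope         = "https://graph.microsoft.com/.default"
    grant_type    = "client_credentials"
}
$response = Invoke-RestMethod -Method Post -Uri "https://login.microsoftonline.com/$tenantId/oauth2/v2.0/token" -Body $body
# A token is returned only if the app registration and secret are valid
$response.access_token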
Global service | https://graph.microsoft.com |
China | https://microsoftgraph.chinacloudapi.cn |
US Government L4 | https://graph.microsoft.us |
US Government L5 (DOD) | https://dod-graph.microsoft.us |
You can find a complete list of national Cloud Endpoints here:
https://learn.microsoft.com/en-us/graph/deployments#microsoft-graph-and-graph-explorer-service-root-endpoints
In order for the SharePoint Online Connector to access Microsoft Graph, a new app must first be created that has the permissions to read Microsoft Graph. This app can be created on portal.azure.com.
To create or register the app, navigate to "Azure Active Directory" -> "App registrations" and click on the "New registration" button:
After the app has been created, a secret must be generated so that the crawler can actually log in. This is normally requested automatically after the app has been created. Otherwise, click on the desired app under "App registrations" -> "Owned applications" and then create the secret under "Certificates & secrets" -> "New client secret".
When creating the secret, you can set the expiration time. We recommend a lifetime of 6-12 months, so that the secret is rotated regularly.
Note: You must copy the created secret so that you can enter it directly in the Mindbreeze configuration. You can add the secret in the Network tab under the "Credentials" area by clicking on the "Add Credential" button.
Once you leave the page, you can no longer view the secret.
Now you have to give the app the required permissions. To do this, navigate to "API permissions". The Microsoft Graph Crawler needs the following Application Permissions in Microsoft Graph:
After you have given the app the permission, you have to grant "admin consent". Use the button "Grant admin consent for <MyInstance>" for this:
In the new or existing service, select the option "SharepointOnlinePrincipalCache" in the "Service" setting. For more information about additional configuration options, creating a cache, and the basic configuration of a cache for a Principal Resolution Service, see Installation & Configuration - Caching Principal Resolution Service.
"Server URL" | The URL of the Sharepoint Online instance, e.g.: https://mycompany.sharepoint.com Should be configured the same way as in the crawler. |
“Admin Server URL“ | The admin URL of the SharePoint Online instance. Often this is just the server URL with the suffix -admin. e.g: https://mycompany-admin.sharepoint.com Should be configured the same way as in the crawler. |
„Regex for your organization“ | Regular expression that defines whether a user belongs to the organization or not. This is used to resolve the principal “everyone_except_external”. Caution: This is a security relevant setting! The regular expression can refer to the e-mail address, the ObjectSID or the ObjectGUID from LDAP. The default value contains no users. |
"Site Relative URL“ | The relative paths to the sites to be crawled, starting with a slash, e.g.: /sites/mysite. Each line can contain a path. If this is left empty all detected sites are crawled. If Sites are specified here only this sites and their subsites are crawled Should be configured the same way as in the crawler. |
„Site Discovery Schedule“ | An extended cron expression that specifies when to run site discovery. The results are then used in the next crawl runs. This means that the potentially time-consuming site discovery does not have to be repeated with each crawl run. Documentation and examples of cron expressions can be found here. |
“Background Site Discovery Parallel Request Count” | Limits the number of parallel HTTP requests sent by the Site Discovery. |
“Site Discovery Strategy” | The strategy by which the site discovery is to be performed. |
“Ignore Site Discovery Errors” | If this option is enabled, all site discovery errors will only be logged and otherwise ignored. You can use this option if you have a large and constantly changing number of sites and unexpected errors occur. |
“Do Not Crawl Root Site” | If this option is activated, the root page of the SharePoint Online instance (e.g. mycompany.sharepoint.com) and its subpages will not be crawled. Should be configured the same way as in the crawler. |
"Included Sites URL (regex)" | Regular expression that can be used to specify which subsites are to be crawled. If this option is left empty, all subsites will be crawled. The regex matches relative URLs, e.g. /sites/mysite. Should be configured the same way as in the crawler. |
"Excluded Sites URL (regex)" | Regular expression that can be used to specify which subsites are to be excluded. The regex matches relative URLs, e.g. /sites/mysite. Should be configured the same way as in the crawler. |
“Enable Delta Update” | If enabled, only the changes to the groups are fetched from SharePoint Online after the first cache creation, instead of fetching all groups each time. This is especially recommended for very large SharePoint instances, as a regular cache update can otherwise take a long time. Delta updating with user-based authentication is not supported; if delta updating is required, app-only authentication must be used. |
“Skip Cache Empty Check” | If enabled, the check whether the cache is empty (in order to make a full update in this case) is skipped. Attention: This option may only be activated if the cache has already been built completely and successfully once. Do not delete the cache directory manually if this option is enabled, as the cache will not be completely rebuilt in this case! This option should only be enabled if there are performance problems with very large caches. |
“[Deprecated] Send User Agent” | This option is deprecated and should always remain enabled. The User Agent header should always be sent with the request. Adjust the "User Agent" option instead. If set, the header configured with the "User Agent" option is sent with every HTTP request. |
“User Agent” | The specified value is sent with the HTTP request in the User-Agent header. |
“Dump Change Responses” | If activated, the changes received from SharePoint Online during the delta update are dumped into a file. This is very helpful for troubleshooting. |
“Log All HTTP Requests” | If set, all HTTP requests sent by the Principal Resolution Service during the cache update are written to a .csv file (sp-request-log.csv). |
“Parallel Request Count” | You can use this option to define how many HTTP requests are sent simultaneously by the crawler. The higher the value, the faster the crawl run should be, but too high a value can also lead to many "Too Many Requests" errors on the part of SharePoint. A value above 30 is not recommended. |
“Page Size” | Maximum number of objects received per request to the API. A high value results in higher speed but higher memory consumption during the cache update; a small value results in less memory consumption but reduces speed. If the value is set to 0, no paging is used, meaning that the crawler attempts to fetch all objects at once in a single request. |
“Trust all SSL Certificates” | Allows the use of unsecured connections, for example for test systems. Must not be activated in production systems. |
“Heap Dump On Out Of Memory” | When activated, the Principal Service creates a heap dump if it runs out of memory. |
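As an illustration of the security-relevant option “Regex for your organization”: a hypothetical expression that treats all users with an e-mail address in the (assumed) domain mycompany.com as members of the organization could look like this; adapt it carefully to your own tenant before use:
.*@mycompany\.com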
This is only necessary if you have also configured app-only authentication for the data source.
The following configuration options are deprecated. Please use the Microsoft Azure Principal Resolution Service instead. This makes cache updates faster and the cache is smaller.
If you have not set up the "AD Connect" function in Azure Active Directory, or you crawl SharePoint sites created via Microsoft Teams, select "Resolve Groups over Graph" and fill in the fields "Tenant Context ID", "Application ID", "Generated Key" and "Protected Resource Hostname". The corresponding values can be found in the Azure Portal.
The Graph Page Size defines the maximum number of objects received per request to the Graph API. Unlike SharePoint Online, paging in Graph cannot be disabled, so this value must always be set between 1 and 999.
The option "Resolve Site Owner over Graph" is necessary if "Grant Site Owner" is enabled in the crawler and the Site Owners are Graph groups (as is usually the case in SharePoint Online). Then this group is resolved for all pages.
The creation of a new app was already shown in step 1 of the chapter "App-Only Authentication". Just repeat this step (steps 2 and 3 are not necessary).
In the section "API permissions" you must then give the application the necessary permissions. Required here are "Directory.Read.All" permissions for "Azure Active Directory Graph":
If AD Connect is set up in your Azure Active Directory, do not enable the “AD Connect is NOT configured” option.
If you are using a National Cloud, you must also fill in "Graph Azure AD endpoint" and "Protected Resource Hostname".
In the following table you can find the "Graph Azure AD endpoints".
China | |
US Government |
You can find a complete list of Azure AD endpoints here:
https://learn.microsoft.com/en-us/azure/active-directory/develop/authentication-national-cloud#azure-ad-authentication-endpoints.
The following table lists the “Protected Resource Hostnames” for different cloud environments:
Global service | graph.microsoft.com |
China | graph.chinacloudapi.cn |
US Government L4 | graph.microsoftazure.us |
US Government L5 (DOD) | graph.microsoftazure.us |
Germany | graph.cloudapi.de |
You can find a complete list of “Protected Resource Hostnames” here:
https://learn.microsoft.com/de-de/graph/migrate-azure-ad-graph-request-differences#basic-requests (column “Azure AD Graph”)
If an LDAP cache is to be used as a parent cache, the following values should be entered in the LDAP cache under “User Alias Name LDAP Attributes” or “Group Alias Name LDAP Attributes”:
cn
objectGUID
objectSID
For more information about Parent Cache Settings, see Installation & Configuration - Caching Principal Resolution Service - Parent Cache Settings.
For more information about the LDAP cache and the LDAP connector, see Configuration - LDAP Connector.
Only enter the URL for the Azure AD endpoint in the “SharePoint Azure AD Endpoint” field if you use a SharePoint National Cloud.
The following environments require special URLs:
China | |
US Government |
A complete list for Azure AD endpoints can also be found at https://learn.microsoft.com/en-us/azure/active-directory/develop/authentication-national-cloud#azure-ad-authentication-endpoints.
The Microsoft AAD Graph Principal Resolution Service has been deprecated. The API used for this service will soon be discontinued by Microsoft (see here). Use the Microsoft Azure Principal Resolution Service instead. (For completeness, the following sections describe the configuration of the deprecated Microsoft AAD Graph Principal Resolution Service).
When the SharePoint Online groups are resolved in the SharePoint Online Principal Resolution Cache, it can happen that these groups contain Office 365 groups. In this case these groups must be resolved via the Graph API and the Microsoft AAD Graph Principal Resolution Service is required. If you are not sure whether your SharePoint Online groups contain Office 365 groups, because this is often not easily visible through the SharePoint Online UI, you should set up the Microsoft AAD Graph Principal Resolution Service for safety reasons and configure it as a Parent Service for the SharePoint Online Principal Resolution Service.
The creation of a new app was already shown in step 1 of the chapter "App-Only Authentication". Just repeat this step (steps 2 and 3 are not necessary).
In the area "API permissions" you have to give the application the necessary permissions. Required here are "Directory.Read.All" permissions for "Azure Active Directory Graph":
Add a new service in the "Services" section by clicking on "Add Service". Select "MicrosoftGraphPrincipalCache" and assign a display name.
“Tenant Context ID” | The tenant name of the Microsoft instance, e.g. mycompany.onmicrosoft.com |
“Application ID” | The application ID of the app created in the Azure Portal |
“Generated Key” | The secret of the app created in the Azure Portal |
“Protected Resource Hostname” | If you are not using a specific national cloud deployment, the Protected Resource Hostname is graph.windows.net. If you are using a national cloud deployment, please refer to the table below for the resource hostname. |
“Graph Page Size” | The number of Graph groups fetched per request to the Graph API |
“Log All HTTP Requests” | If set, all HTTP requests sent from the cache during the update are written to a .csv file (sp-request-log.csv). |
“User Agent” | The specified value is sent in the User-Agent header with the HTTP request |
“Heap Dump On Out Of Memory” | If this option is activated, the Principal Service makes a heap dump if it runs out of memory |
The following table shows the "Protected Resource Hostnames" for different cloud environments:
Global service | graph.microsoft.com |
Germany | graph.microsoft.de |
China | microsoftgraph.chinacloudapi.cn |
US Government | graph.azure.us |
A complete list of "Protected Resource Hostnames" can also be found at: https://developer.microsoft.com/en-us/graph/docs/concepts/deployments
The LDAP, Cache, Health Check, Service and Consumer Services Settings are equivalent to the SharePoint Online Principal Resolution Service.
The SharePoint Online Authorization Service makes requests to SharePoint Online for each object to find out whether the respective user actually has access to this object. Since this can take a lot of time, the Authorization Service should only be used for testing purposes; normally, all checks performed by the Authorization Service should succeed anyway. This makes it easy to find out whether there are problems with the permissions of certain objects.
In order to use the Authorization Service in the search, you have to set the option "Approved Hits Reauthorize" in the index Advanced Settings to "External Authorizer" and in the crawler the Authorization Service created has to be selected as "Authorization Service".
“Server URL” | The URL of the SharePoint Online instance, e.g. https://mycompany.sharepoint.com. Should be configured the same way as in the crawler. |
“SharePoint Online Email Regex” | Regex pattern for email addresses used in SharePoint Online. This pattern is used to decide which principal is used to check the permissions in SharePoint Online. |
“Parallel Request Count” | You can use this option to define how many HTTP requests are sent simultaneously by the Authorization Service. The higher the value, the faster it should be, but too high a value can also lead to many "Too Many Requests" errors on the part of SharePoint. A value above 30 is not recommended. |
“[Deprecated] Send User Agent” | This option is deprecated and should always remain enabled. The User Agent header should always be sent with the request. Adjust the "User Agent" option instead. If set, the header configured with the "User Agent" option is sent with every HTTP request. |
“User Agent” | The specified value is sent with the HTTP request in the User-Agent header, if the option “Send User Agent” is set. |
Only enter the URL for the Azure AD endpoint in the “Azure AD endpoint” field if you use a SharePoint National Cloud.
The following environments require special URLs:
China | |
US Government |
A complete list for Azure ACS endpoints can also be found at https://learn.microsoft.com/en-us/azure/active-directory/develop/authentication-national-cloud#azure-ad-authentication-endpoints.
Configure the options as follows:
“App Credential” | The credential that was created in the Network tab and contains the generated Client ID and Secret, as described below. |
“[Deprecated] Client ID” | This option is deprecated and should no longer be used. Use the setting “App Credential” instead. The client ID that is generated as described below. |
“[Deprecated] Client secret” | This option is deprecated and should no longer be used. Use the setting “App Credential” instead. The client secret that is generated as described below. |
Only app-only authentication can be used for the Authenticator.
If you are using app-only authentication, this section is NOT applicable to you. Otherwise, proceed as follows:
Navigate to the “Network” tab and add a new credential for Microsoft SharePoint Online under “Credentials” by clicking “Add Credential.”
Enter the credentials for the user you want to use for indexing and assign a name for the credential. Select a user with sufficient permissions to read all relevant sites and their permission information.
Then add a new endpoint for the credential you just created by clicking on “Add Endpoint” under “Endpoints.” Enter the server URL of your Microsoft SharePoint Online installation as the location and select the credential you just created.
With the help of the SharePoint Online Connector, OneDrive pages can also be crawled.
For this, the following points must be considered in the configuration:
The following SharePoint Online API endpoints are used by the SharePoint Online Crawler and the SharePoint Online Principal Resolution Service.
Endpoint | HTTP Method | Description |
<Azure Endpoint>/GetUserRealm.srf | POST | If the user is an ADFS user, the authentication server is fetched with this endpoint. |
<ADFS Authenticationserver> | POST | A login request is made against the server that was fetched with the GetUserRealm call. |
<Azure Endpoint>/rst2.srf | POST | A login token with the entered username/password credentials is retrieved. |
<ServerURL>/_vti_bin/idcrl.svc/ <AdminServerURL>/_vti_bin/idcrl.svc/ | GET | Login cookies are retrieved with the previously retrieved token. The cookies of the admin URL are required for SiteDiscovery. |
<AdminServerURL>/_vti_bin/sites.asmx | POST | With the previously retrieved cookies, a digest hash is retrieved, which is required for SiteDiscovery. |
<ServerURL>/_vti_bin/client.svc <AdminServerURL>/_vti_bin/client.svc | GET | This endpoint is used to get information about the tenant, e.g. the TenantId. |
<Azure Endpoint>/<Tenant Id>/tokens/OAuth/2 | POST | This endpoint is used to generate an Access Token for AppOnly Authorization. |
<AdminServerURL>/_api/ProcessQuery | POST | This endpoint finds all sites on the SharePoint Online instance when app-only authentication is used and the Site Discovery Strategy is set to Auto, or when the Site Discovery Strategy is set to Admin API. |
<ServerURL>/_api/search/query | GET | This endpoint finds all sites on the SharePoint Online instance to which the user has access when user-based authentication is used and the Site Discovery Strategy is set to Auto, or when the Site Discovery Strategy is set to Search. |
<ServerURL><SiteRelativeUrl>/_api/web/siteusers | GET | With this endpoint, all users are fetched to find out which are Site Collection Administrators. |
<ServerURL><SiteRelativeUrl>/_api/Web/getChanges | POST | With this endpoint all changes of a Site are fetched during the delta crawl. |
<ServerURL><SiteRelativeUrl>/_api/Site/getChanges | POST | With this endpoint all Group Membership changes of a Site are fetched during the Delta Update. |
<ServerURL><SiteRelativeUrl>/_api/Web/webs | GET | This endpoint is used to fetch the direct Subsites of Sites. |
<ServerURL><SiteRelativeUrl>/_api/Web/RegionalSettings/TimeZone | GET | This endpoint is used to fetch the set time zone of the Site. |
<ServerURL><SiteRelativeUrl>/_api/Web | GET | This endpoint is used to fetch the metadata of a Site. |
<ServerURL><SiteRelativeUrl>/_api/web/lists | GET | With this endpoint all lists of a Site and some additional metadata are fetched, among others also the "RoleAssignments" field, for which "Enumerate Permissions" permissions are needed. |
<ServerURL><SiteRelativeUrl>/_api/web/lists(guid'<ListId>') | GET | This endpoint is used to fetch a single list. |
<ServerURL><SiteRelativeUrl>/_api/web/lists(guid'<ListId>')/Items | GET | With this endpoint, all items of a list and some additional metadata are fetched, among others the "RoleAssignments" field, for which "Enumerate Permissions" permissions are required. |
<ServerURL><SiteRelativeUrl>/_api/web/lists(guid'<ListId>')/Items(<ItemId>) | GET | This endpoint is used to fetch a single item. |
<ServerURL><SiteRelativeUrl>/_api/web/GetFileByServerRelativeUrl('<FileRelativeUrl>') <ServerURL><SiteRelativeUrl>/_api/web/GetFileById('<FileId>') <ServerURL><SiteRelativeUrl>/_api/web/GetFileById('<FileId>')/ListItemAllFields <ServerURL><SiteRelativeUrl>/_api/web/GetFileById('<FileId>')/ListItemAllFields/Versions(<VersionId>) | GET | These endpoints are used to fetch the metadata of a file. |
<Direct link to a file>/$value | GET | This endpoint is used to download the contents of a file. |
<ServerURL><SiteRelativeUrl>/_api/Site | GET | This endpoint is used to fetch the Site Collection metadata. |
<ServerURL><SiteRelativeUrl>/_api/Web/GetUserById(<Id>) | GET | This endpoint is used to fetch a user's metadata. |
<ServerURL><SiteRelativeUrl>/_api/web/sitegroups <ServerURL><SiteRelativeUrl>/_api/web/sitegroups(<GroupId>)/users | GET | With these endpoints, all groups of a site and all users in these groups are fetched. "Enumerate Permissions" permissions are required for these endpoints. |
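To illustrate the table above, the following sketch shows a single request of the kind the crawler sends, here against the lists endpoint. The site URL and bearer token are placeholders, and the Accept header requests a JSON response:
# Sketch: fetch all lists of a site, assuming a valid app-only access token
$site  = "https://mycompany.sharepoint.com/sites/mysite"
$token = "<access-token>"
Invoke-RestMethod -Method Get -Uri "$site/_api/web/lists" -Headers @{ Authorization = "Bearer $token"; Accept = "application/json;odata=verbose" }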
There are two categories of authentication methods that the SharePoint Online Crawler can use to authenticate against and crawl a SharePoint Online instance:
App Only Authentication is the recommended approach for SharePoint Online authentication, as it is quick to set up, requires no fine tuning with permissions or settings, and supports Delta Crawling/Update.
If full functionality of the crawler is required, i.e.:
the crawler needs to be given Full Control permissions. The full control permissions need to be granted on both the site collection level and the tenant level.
If ACLs are not required (for example, only crawling public sites), it would be possible to perform crawling with write permissions (if site discovery is required) or read permissions (if site discovery is not required).
There are currently two user-based authentication methods available: NTLM and ADFS. There is effectively no difference in configuration and permission settings between these two methods.
The SharePoint Online Crawler automatically determines whether the configured crawling user is an ADFS user or an NTLM user. The crawler then automatically logs in either via ADFS or directly on SharePoint Online (in the NTLM case).
For both of these authentication methods, credentials need to be added to the Network tab, with an endpoint pointing towards the SharePoint Online instance.
If full functionality of the crawler is required, i.e.:
the user needs to have at least the “Enumerate Permissions” and “Use Remote Interfaces” permissions on every site, list, and item which should be crawled. Unless inheritance is broken somewhere within the SharePoint hierarchy, simply assigning the user those permissions on every site should be sufficient. If inheritance is broken somewhere within the file/folder/sub-site hierarchy of the SharePoint online server, then special care needs to be taken when indexing these specific areas. The user needs to be given specific permissions for each of these inheritance exceptions. If this is not done correctly, some items might not be indexed or the crawl could abort.
Delta Update (Cache) is not supported with user-based authentication.
Since User Based Authentication uses the permissions of the user, authorization problems can occur at some points during crawling if the user does not have the necessary permissions. This section should help you to find and fix possible causes for problems (e.g. 403 Forbidden Responses during crawling) that may occur although the user has already been given permissions to all pages to be crawled.
At least "Enumerate Permissions" permissions are required for the calls used in the crawler and principal cache. Unfortunately, these are not specified in any default Permission Level (Read, Write, Manage, Full Control are the Default Levels - Enumerate Permissions is only included in Full Control), so it would be best to specify a separate Permission Level and give it to the user on each page to be crawled.
This can be done by clicking on "Permission Levels" on the Permissions page of the site and then creating a new Permission Level with "Enumerate Permissions" and "Use Remote Interfaces" (if "Enumerate Permissions" is selected, the permissions it depends on are selected automatically):
By default, all objects in SharePoint Online (i.e. all pages, lists, items, files etc.) inherit the permission settings of their parent. However, this inheritance can be broken, so that permission changes of the parent do not affect the child object. In this case, the user must be given separate permissions for these objects.
If there are objects that break the inheritance, you will get a warning on the Permission Settings page.
Hidden Lists are usually lists that are not used by users, but by SharePoint Online itself for various things. These lists usually do not need to be indexed, because they do not contain any data that is interesting for users. The crawling of these lists can be activated and deactivated with the option "Crawl Hidden Lists".
However, if you want to crawl these lists, you have to make sure that none of them break the inheritance; if they do, you have to give the user separate permissions for these lists. Since these lists are not displayed in the GUI, SharePoint Online will not warn you if the inheritance is broken. In this case, the SharePoint API must be used to check whether these lists break the inheritance (see the sketch below).
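One possible way to perform this check (a sketch, not an official Mindbreeze procedure) is to query the HasUniqueRoleAssignments property of all lists via the SharePoint REST API and filter for hidden lists; $site and $token are placeholders:
# Sketch: find hidden lists that break permission inheritance
$site  = "https://mycompany.sharepoint.com/sites/mysite"
$token = "<access-token>"
$uri   = "$site/_api/web/lists?`$select=Title,Hidden,HasUniqueRoleAssignments"
$lists = Invoke-RestMethod -Method Get -Uri $uri -Headers @{ Authorization = "Bearer $token"; Accept = "application/json;odata=nometadata" }
$lists.value | Where-Object { $_.Hidden -and $_.HasUniqueRoleAssignments } | Select-Object Title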
In the ListItemCount.csv log file, the number of items in all lists of the indexed sites is listed during a full crawl run.
The file contains the Site Relative URL, List Id, List Name, List URL and number of items in the list.
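Assuming the column order described above and the semicolon delimiter used by the connector's other CSV files, a line in this log file could, for example, look like this (all values are hypothetical):
/sites/Marketing;<list-id>;Documents;/sites/Marketing/Shared Documents;1342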
In the documents-excluded.csv log file, all documents that were excluded by the configuration settings are listed.
Each entry in the file contains:
If you suspect that documents that have already been deleted in SharePoint Online are still in the Mindbreeze index, you can use the Invalid Document Remover JMX Bean. With this you can check all SharePoint Online documents in the index and delete them if necessary.
The "JMX Terminal" tool allows you to query and execute JMX beans. The tool can be found under /opt/mindbreeze/tools/jmxterm-uber.jar or under C:\Program Files\Mindbreeze\Enterprise Search\Server\Tools\jmxterm-uber.jar. The tool contains an interactive shell with integrated help.
To start the tool, the following must be executed:
cd /opt/mindbreeze/tools/
java -jar jmxterm-uber.jar
The interactive shell then starts.
Note: Under Windows, make sure that the Java JDK is used.
To start the Invalid Document Remover, you first need the process ID of the Microsoft SharePoint Online Crawler. This can be found in the current log directory of the crawler as a .pid file.
You can connect to the bean using the following commands:
# starting the tool
cd /opt/mindbreeze/tools/
java -jar jmxterm-uber.jar
#Welcome to JMX terminal. Type "help" for available commands.
$>open <PID>
#Connection to <PID> is opened
$>bean com.mindbreeze.enterprisesearch.connectors.sharepointonline.beans:type=InvalidDocumentRemoverMBean
#bean is set to com.mindbreeze.enterprisesearch.connectors.sharepointonline.beans:type=InvalidDocumentRemoverMBean
Now you can execute the following commands using the run command:
If you execute run start, the Invalid Document Remover is started. This checks all SharePoint Online documents in the index in the background and removes documents that no longer exist in SharePoint Online. The process runs until all documents have been processed or it has been stopped with cancel. When the crawler is restarted, the Invalid Document Remover continues where it left off.
You can use the status command to track the progress and find out whether the process was successful or failed.
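Based on the description above, such a session could look like this (assuming the bean has already been selected as shown earlier and the operations are invoked via the run command):
# start the background check of all SharePoint Online documents in the index
$>run start
# query the progress and whether the run succeeded or failed
$>run status
# stop the check prematurely
$>run cancel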
The result of the process can be found in the log directory of the crawler in the invalid-document-remover.csv file.