Atlassian Confluence Connector

Installation and Configuration

Copyright ©

Mindbreeze GmbH, A-4020 Linz, 2018.

All rights reserved. All hardware and software names are brand names and/or trademarks of their respective manufacturers.

These documents are strictly confidential. The submission and presentation of these documents does not confer any rights to our software, our services and service outcomes or other protected rights.

The dissemination, publication or reproduction hereof is prohibited.

For ease of readability, gender differentiation has been waived. Corresponding terms and definitions apply within the meaning and intent of the equal treatment principle for both sexes.


Installation

Before you install the Atlassian Confluence Connector plugin, you need to ensure that the Mindbreeze server is installed and that this connector is also included in the Mindbreeze license. The Atlassian Confluence Connector is installed by default on the Mindbreeze InSpire server. If you want to install or update the connector manually, use the Mindbreeze Management Center.

Plugin Installation via Mindbreeze Management Center

To install or update Mindbreeze plugin files, open the Mindbreeze Management Center. Navigate to the "Plugins" tab under the menu item "Configuration". Select the ZIP file under the "Plugin management" section and use the "Upload" button to upload it. This will automatically install or update the connector. Mindbreeze services are restarted after a plugin installation.

Mindbreeze Configuration

Index and Crawler Configuration

When prompted to choose an installation method, select "Advanced".

Click on the "Indices" tab and then on the "Add new index" icon to create a new index.

Enter the index path, e.g. "/data/indices/confluence". If necessary, adjust the Display Name of the Index Service and the associated Filter Service.

Add a new data source using the symbol "Add new custom source" on the lower right.

If not already selected, select "Atlassian Confluence" using the "Category" button.

With the "Crawler Interval" setting, you configure the amount of time that elapses between two indexing runs.

Web Page

The “Crawling Root” field allows you to specify a URL through which an Atlassian Confluence sitemap is accessible. If the Mindbreeze Sitemap Generator add-on is installed on your Atlassian Confluence server and a sitemap has been generated, enter the URL <Atlassian Confluence URL>/plugins/servlet/sitemapservlet?jobbased=true.

The "URL Regex" field lets you define a regular expression that the links to be indexed must match.

If certain URLs should be excluded from the crawl, they can be configured using a regular expression under "URL Exclude Pattern".
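
How the two patterns could interact is sketched below in Python. This is only an illustration: the host name and both patterns are hypothetical, and the matching semantics (the include pattern is checked first, the exclude pattern takes precedence) are an assumption, not documented connector behavior.

```python
import re

# Hypothetical example patterns; replace them with your own.
URL_REGEX = re.compile(r"^https://confluence\.example\.com/display/.*$")
URL_EXCLUDE_PATTERN = re.compile(r".*/display/ARCHIVE/.*")


def should_index(url: str) -> bool:
    """Return True if the URL matches the include pattern but not the exclude pattern.

    Assumption: exclusion wins over inclusion. The connector may behave differently.
    """
    if not URL_REGEX.match(url):
        return False
    return URL_EXCLUDE_PATTERN.match(url) is None
```

Testing patterns in this way before entering them in the configuration helps to avoid a crawl that silently skips the pages you expected.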

With the option "Convert URL-s to lower case", all located URLs are converted to lower case.

If the DNS resolution of certain Web servers doesn’t work due to a problem with the network, you can specify the IPs using "Additional Hosts File".

Use the "Accept Headers" setting if you want to add specific HTTP headers (for example, Accept-Language).

If Confluence sitemaps are crawled, you can use the option “Use Rest API for Page Content” to obtain content from pages in a performance-friendly way and without running macros.

In order to actually prevent macros from being run, HTML thumbnailing should also be disabled. If the option “Disable Web Page Thumbnail Generation” is enabled, the metadata “htmlfilter:skipthumbnailgeneration” is set on all documents. Additional options have to be configured in the filter service (see the description of “Disable Thumbnails Metadata Pattern” below).

The “Max Retries” option determines how often the connector retries the download of a document when temporary errors (e.g. socket timeouts) occur. The default value is 0 (no further download attempts). If you are crawling across an unstable network that causes timeouts, this value should be increased, for example to 10. If the timeouts are caused by an overloaded data source, the value should be left at 0 so that the data source is not burdened even further.

The “Retry Delay Seconds” option determines the waiting time (in seconds) between download attempts (see "Max Retries"). The default value is 1.
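
The interplay of “Max Retries” and “Retry Delay Seconds” can be pictured with the following Python sketch. It is not the connector's implementation. The function name, the `fetch` callable and the choice of `TimeoutError` as the "temporary error" are assumptions made for illustration.

```python
import time
from typing import Callable


def download_with_retries(
    fetch: Callable[[], bytes],
    max_retries: int = 0,              # mirrors "Max Retries" (default 0)
    retry_delay_seconds: float = 1.0,  # mirrors "Retry Delay Seconds" (default 1)
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Call fetch() and repeat it up to max_retries times on a timeout."""
    retries_done = 0
    while True:
        try:
            return fetch()
        except TimeoutError:
            # With max_retries = 0 the first failure is final.
            if retries_done >= max_retries:
                raise
            retries_done += 1
            sleep(retry_delay_seconds)
```

With the defaults, a single timeout ends the download attempt. With a value of 10, a document is requested at most eleven times in total.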

To disable HTML thumbnailing, set the option “Disable Thumbnails Metadata Pattern” to “htmlfilter:skipthumbnailgeneration” in the filter service for the filter plugin “JerichoWithThumbnails”. HTML documents on which the metadata “htmlfilter:skipthumbnailgeneration” is set are then indexed without thumbnails.

Sitemap-based Crawling

To process Confluence sitemaps, activate "Delta Crawling" and enter the Confluence sitemap URL as the crawling root.

In this mode, the connector reads the web pages solely from the sitemaps. The lastmod and changefreq properties of the pages in the sitemap are compared with the indexed pages. With a precise sitemap, this allows very high-frequency indexing strategies.

For the "Sitemap-based Delta Crawling" mode, two options are available:

  • "Sitemap Based Incomplete": the URLs of the configured sitemaps are indexed; documents which have already been indexed and are not included in the sitemaps remain in the index.
  • "Sitemap Based Complete": the URLs of the configured sitemaps are indexed; documents which have already been indexed and are not included in the sitemaps are deleted.
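
The difference between the two options can be sketched in Python as follows. The function and its data structures are hypothetical. The sketch compares only lastmod and ignores changefreq, and the exact comparison the connector performs is an assumption.

```python
from datetime import datetime
from typing import Dict, Set, Tuple


def plan_delta_crawl(
    sitemap: Dict[str, datetime],  # URL -> lastmod taken from the sitemap
    index: Dict[str, datetime],    # URL -> modification date of the indexed copy
    complete: bool,                # True: "Complete", False: "Incomplete"
) -> Tuple[Set[str], Set[str]]:
    """Return (URLs to index, URLs to delete) for one delta crawl run."""
    # New URLs and URLs whose lastmod is newer than the indexed copy are (re)indexed.
    to_index = {
        url for url, lastmod in sitemap.items()
        if url not in index or lastmod > index[url]
    }
    # Only the "Complete" option removes documents missing from the sitemap.
    to_delete = set(index) - set(sitemap) if complete else set()
    return to_index, to_delete
```

In this sketch, a page that has disappeared from the sitemap stays searchable with "Sitemap Based Incomplete" and is removed with "Sitemap Based Complete".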

The "Use Stream Parser" option uses a stream parser for processing the sitemap. This option is suitable for sitemaps with a lot of URLs.

Resource Parameters

In this section (available only when "Advanced Settings" is selected), the crawl speed can be adjusted.

Under "Number Of Crawler Threads", you can define how many threads fetch pages from the web server simultaneously.

"Request Interval" defines the number of milliseconds the crawler (thread) waits between each single request. However, a "crawl-delay" robot command is always taken into consideration and will override this value.

Proxy

You can enter a proxy server in the "Network" tab if your infrastructure so requires.

Confluence Login

This chapter describes the authentication methods that the Atlassian Confluence Connector can use to index content located behind a login.

Form Based Login

If the Atlassian Confluence sitemap is accessible via HTTP form authentication, the login parameters in the "Form Based Login" section can be configured as follows:

  • Login URL: the Atlassian Confluence URL to which the login form is sent, e.g. http://<confluence_url>/dologin.action
  • Form Elements: an element with the name “os_username” needs to be added here. Its value (“Value”) should be the user name of a user who is authorized to download the sitemap.
  • Form Password Elements: an element with the name “os_password” needs to be added here. Its value (“Value”) should be the password of the previously specified user.
  • Include Matching Cookies (Regular Expression): if the HTTP request becomes too large because all cookies are included, you can use this option to select only the required cookies.
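
The following Python sketch shows roughly what such a login request looks like on the wire. The field names “os_username” and “os_password” come from the settings above. The helper functions, the host name and the assumption that the cookie pattern is matched against the whole cookie name are illustrative only.

```python
import re
from typing import Dict
from urllib.parse import urlencode
from urllib.request import Request


def build_login_request(login_url: str, username: str, password: str) -> Request:
    """Build (but do not send) the form POST for the Confluence login."""
    form = {"os_username": username, "os_password": password}
    body = urlencode(form).encode("utf-8")
    return Request(
        login_url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


def select_cookies(cookies: Dict[str, str], pattern: str) -> Dict[str, str]:
    """Keep only cookies whose name matches the pattern (assumed semantics)."""
    regex = re.compile(pattern)
    return {name: value for name, value in cookies.items() if regex.fullmatch(name)}
```

For example, `select_cookies(cookies, r"JSESSIONID")` would keep only the session cookie and drop everything else from subsequent requests.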

Complex form-based authentication

If the previous scenario is not sufficient, the following settings can be used:

  • Session initialization URL: This URL is opened at the beginning so that it can then be dynamically redirected. The cookies received in the process are retained for the session.
  • Login form parameters: If hidden fields are set in the login form, they can be listed here. They are extracted and sent along with the login request. A typical example of this is the dynamically generated FormID, which is returned as a hidden parameter from the Web server.
  • Login URL patterns: All redirects that correspond to the regular expressions specified here are tracked during the login process.
  • Login post URL patterns: When tracking the redirects that correspond to the regular expressions specified here, all collected form parameters are sent using an HTTP POST request.
  • Logged in URL patterns: If you are redirected to a URL that matches the regular expressions specified here, the login process was successful.
  • Maximum allowed count of redirects: This can be used to set the maximum depth of the tracked redirects.

NTLM

To use NTLM authentication, the user, the password, and the domain first need to be configured as credentials in the "Network" tab.

This credential then has to be selected in the Atlassian Confluence Connector in the “NTLM Credential” setting.

In the field “Mindbreeze InSpire Fully Qualified Domain Name,” the “Fully Qualified Domain Name” of the Mindbreeze InSpire server must be entered.

Note: If NTLM authentication is used, the thumbnails in Mindbreeze InSpire do not work.

Configuring “Access Check Rules”

An access check rule consists of:

  • “Access Check Principal”: user names can be in the format “username@domain”, “domain\username”, or “distinguished name”. Group names can only be in the distinguished name format. In addition, a reference to a capture group in the selection pattern can be used here.
  • “Access Check Action”: grant or deny.
  • “Metadata Key for Selection”: a metadata name; can be empty (all documents are selected).
  • “Selection Pattern”: a regular expression; can be empty (all documents are selected).
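
How the four parts of a rule could work together is sketched below in Python. The class, the function and in particular the placeholder syntax “{0}” for the capture group reference are hypothetical. The actual reference syntax and matching behavior of the connector are not described here.

```python
import re
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class AccessCheckRule:
    principal: str               # may contain "{0}", "{1}", ... (illustrative syntax)
    action: str                  # "grant" or "deny"
    metadata_key: str = ""       # empty: all documents are selected
    selection_pattern: str = ""  # empty: all documents are selected


def apply_rule(rule: AccessCheckRule,
               metadata: Dict[str, str]) -> Optional[Tuple[str, str]]:
    """Return (action, principal) if the rule selects the document, else None."""
    if not rule.metadata_key or not rule.selection_pattern:
        return rule.action, rule.principal
    value = metadata.get(rule.metadata_key)
    if value is None:
        return None
    match = re.fullmatch(rule.selection_pattern, value)
    if match is None:
        return None
    # Substitute the captured groups into the principal template.
    return rule.action, rule.principal.format(*match.groups())
```

In this sketch, a rule with the hypothetical metadata key "space", the selection pattern `([A-Z]+)` and the principal `cn={0}-readers,ou=groups,dc=example,dc=com` would grant a document in the space "DOC" to the group `cn=DOC-readers,ou=groups,dc=example,dc=com`.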

Atlassian Confluence Principal Resolution

Add the Caching Confluence Principal Resolution Service. (Note: the plugin must first be installed in the "Plugins" tab.)

Enter the "Confluence Server URL".

The necessary login information for accessing the "Confluence Server URL" needs to be configured in the "Network" tab and mapped to the "Confluence Server URL" endpoint.

Specify the directory path for the cache in the “Database Directory Path” field and change the “Cache In Memory Items Size” if necessary, depending on the memory available to the JVM. In the “Cache Update Interval” field, specify the time (in minutes) that should elapse before the cache is updated. This time interval is ignored the first time the service is started. The next time the service is started, this time will be taken into account. The settings “Health Check Interval”, “Health Check max. Retries On Failure” and “Health Check Request Timeout” allow this service to be restarted if, for instance, there are persistent connection problems.

The service is available at the specified “Webservice Port”. If multiple principal resolution services are configured, make sure that their “Webservice Port” parameters are different and that the configured ports are available.

The option "Lowercase Principals" allows all principals from the cache to be delivered in lower case.

If a user cannot be resolved from the cache for a search query, a request is sent directly to Confluence unless the option "Suppress Confluence Service Calls" is enabled. For performance reasons, however, it is recommended that you enable this option so that no live requests are made to Confluence.
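
The lookup order described above can be sketched in Python. The function, its parameters and the in-memory dictionary standing in for the cache are hypothetical. Whether "Lowercase Principals" also applies to live results is an assumption of this sketch.

```python
from typing import Callable, Dict, List


def resolve_principals(
    user: str,
    cache: Dict[str, List[str]],
    live_lookup: Callable[[str], List[str]],        # stands in for a call to Confluence
    suppress_confluence_service_calls: bool = True,
    lowercase_principals: bool = False,
) -> List[str]:
    """Resolve a user's principals from the cache, with an optional live fallback."""
    principals = cache.get(user)
    if principals is None:
        if suppress_confluence_service_calls:
            principals = []  # no live request; the user stays unresolved
        else:
            principals = live_lookup(user)
    if lowercase_principals:
        principals = [p.lower() for p in principals]
    return principals
```

With the recommended setting, a cache miss costs nothing at query time, but a user who is not yet in the cache is not resolved until the next cache update.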

To test the caching principal resolution service, you can use the Principal Resolution Service REST API.