Web Connector

Installation and Configuration

Copyright ©

Mindbreeze GmbH, A-4020 Linz, Austria.

All rights reserved. All hardware and software names used are registered trade names and/or registered trademarks of the respective manufacturers.

These documents are highly confidential. No rights to our software or our professional services, or results of our professional services, or other protected rights can be based on the handing over and presentation of these documents.

Distribution, publication or duplication is not permitted.

The term 'user' is used in a gender-neutral sense throughout the document.

Configuration of Mindbreeze

Configuration of Index and Crawler

Select the “Advanced” installation method:

Click on the “Indices” tab and then on the “Add new index” symbol to create a new index.

Enter the index path, e.g. “C:\Index”. Adapt the display name of the Index Service and the related Filter Service if necessary.

Add a new data source with the symbol “Add new custom source” at the bottom right.

If necessary, choose “Web” in the “Category” field. With the “Crawler Interval” setting you are able to configure the interval between two crawl runs.

Web Page

You can specify a regular expression for the links to follow with the field “URL Regex”. If you leave the field empty, all pages with the same host and domain parts as the “Crawling Root” will be indexed (e.g. de.wikipedia.org when the “Crawling Root” is http://de.wikipedia.org).

You can also specify a pattern for the URLs that need to be excluded using the “URL Exclude Pattern” field. The URLs matched by this pattern will not be crawled and hence not be used for further link extraction.
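The interplay of “URL Regex”, “URL Exclude Pattern” and the crawling root's host can be sketched as follows. This is an illustrative sketch of the documented behavior, not the connector's actual code; the function name and signature are my own.

```python
import re
from urllib.parse import urlparse

def should_crawl(url, crawling_root, url_regex=None, exclude_pattern=None):
    """Decide whether a discovered link should be crawled.

    Mirrors the documented behavior: the exclude pattern always wins;
    with no "URL Regex", only URLs on the same host as the crawling
    root are followed.
    """
    if exclude_pattern and re.search(exclude_pattern, url):
        return False
    if url_regex:
        return re.search(url_regex, url) is not None
    return urlparse(url).netloc == urlparse(crawling_root).netloc

print(should_crawl("http://de.wikipedia.org/wiki/Linz",
                   "http://de.wikipedia.org"))   # True  (same host)
print(should_crawl("http://en.wikipedia.org/wiki/Linz",
                   "http://de.wikipedia.org"))   # False (different host)
```

Note that an excluded URL is not fetched at all, so links found on it are never extracted.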

With the option “Convert URL-s to lowercase” set, all URLs found by the crawler are converted to lowercase.

You can add an arbitrary number of crawling roots by editing the “Crawling Root” field and pressing the “Add” button. The added crawling roots are displayed in the list above (e.g. Crawling Root[1]). You can remove an existing crawling root by clicking the “Remove” button beside it.

With the “Maximum Link Depth” field you can set the maximum count of hops from the crawling roots for the URLs that are crawled. URLs having a higher hop count will be ignored.
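The hop-count limit corresponds to a breadth-first traversal from the crawling roots, which can be sketched as follows (an illustrative model, not the connector's implementation; `links` here maps a URL to the URLs found on that page):

```python
from collections import deque

def crawl_order(links, roots, max_link_depth):
    """Breadth-first traversal that ignores URLs more than
    `max_link_depth` hops away from any crawling root."""
    queue = deque((root, 0) for root in roots)
    seen = {}
    while queue:
        url, depth = queue.popleft()
        if url in seen or depth > max_link_depth:
            continue
        seen[url] = depth  # record the URL with its hop count
        for child in links.get(url, []):
            queue.append((child, depth + 1))
    return seen

links = {"A": ["B"], "B": ["C"], "C": ["D"]}
print(crawl_order(links, ["A"], max_link_depth=2))  # {'A': 0, 'B': 1, 'C': 2}
```

“D” is three hops from the root and is therefore ignored with a maximum link depth of 2.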

If you want to set additional HTTP headers while crawling (e.g. for setting the Accept-Language header), do so using the Accept Headers parameter.

With the option “Incomplete Delta Crawl Runs” enabled, pages that are not reachable from the current “Crawling Root” are not deleted from the index at the end of the crawl run. To minimize the load of subsequent crawl runs on your site, you can provide a crawling root with links to updated pages only.

IMPORTANT: the option “Incomplete Delta Crawl Runs” must not be used with sitemap-based delta crawling. For this see section “Sitemap Crawling Strategy”.

With the option “Cleanup non matching URLs from index” enabled, all documents that do not match the “URL Regex” and “URL Exclude Pattern” rules are deleted from the index.

With the option “Delete URL-s that are redirecting to excluded URL-s” enabled, all documents that redirect to documents excluded by the “URL Exclude Pattern” are deleted.

With the option “Delete URL-s that are no longer available” enabled, all documents with HTTP status 401, 403 or 404 are also removed from the index. If these documents were reached via redirect, the documents that redirected to them are also removed from the index.

If a regular expression is set in the “Enforce extension from URL if matches” parameter, the extension is derived from the URL instead of from the “Content-Type” HTTP header for documents with matching URLs.

The “Inherit Crawl Root Query Parameter Pattern” option lets you inherit URL query parameters from the crawling root to child URLs. The use case is web pages that provide different content depending on query parameters: the crawling roots https://mysite.com/events?location=us and https://mysite.com/events?location=en provide different content, and likewise their child pages https://mysite.com/events/sponsored?location=us and https://mysite.com/events/sponsored?location=en. So that the query parameter location, which comes from the crawling root, also applies to the child pages, the option “Inherit Crawl Root Query Parameter Pattern” must be set to the value location. The value can be any regular expression matched against the query parameter name. If a child page already has query parameters of the same name, they are overwritten by the crawling root query parameter.
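The inheritance rule can be sketched like this (an illustrative model under the stated assumptions; the function name is my own, not part of the connector):

```python
import re
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

def inherit_query_params(root_url, child_url, pattern):
    """Copy query parameters whose name matches `pattern` from the
    crawling root onto a child URL; root values overwrite child
    parameters of the same name, as documented."""
    root_params = [(k, v) for k, v in parse_qsl(urlparse(root_url).query)
                   if re.fullmatch(pattern, k)]
    child = urlparse(child_url)
    merged = dict(parse_qsl(child.query))
    merged.update(root_params)  # root parameters win
    return urlunparse(child._replace(query=urlencode(merged)))

print(inherit_query_params(
    "https://mysite.com/events?location=us",
    "https://mysite.com/events/sponsored",
    "location"))
# https://mysite.com/events/sponsored?location=us
```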

The “Max Retries” option determines how often the connector retries downloading a document when temporary errors (e.g. socket timeouts) occur. The default value is 0 (no further download attempts). If you are crawling across an unstable network that causes timeouts, this value should be increased, for example to 10. If the timeouts are caused by an overloaded data source, the value should be left at 0 so that the data source is not loaded even further.

The “Retry Delay Seconds” option determines the waiting time (in seconds) between download attempts (see "Max Retries"). The default value is 1.
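The combination of both options can be sketched as follows (an illustrative sketch; `fetch` stands for any download callable that raises on a temporary error):

```python
import time

def fetch_with_retries(fetch, url, max_retries=0, retry_delay_seconds=1):
    """Retry a download on temporary errors, in the spirit of
    "Max Retries" and "Retry Delay Seconds" described above."""
    attempts = max_retries + 1  # the first attempt plus max_retries retries
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception:
            if attempt == attempts - 1:
                raise  # retry limit reached, give up
            time.sleep(retry_delay_seconds)
```

With `max_retries=0` (the default) a temporary error is raised immediately, which is the desired behavior for an already overloaded data source.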

Sitemap Crawling Strategy

In order to use sitemaps according to the Sitemaps.org protocol, check “Delta Crawling” and set the site’s root sitemap as the crawling root.

In this scenario the crawler retrieves exclusively the web pages that are listed in the sitemap. The lastmod property of a sitemap URL entry is compared with the modification date of the already indexed web page (if it exists). Furthermore, the changefreq property is interpreted between crawl runs. With a precise sitemap, a high-frequency recrawling strategy can be employed.

Two “Delta Crawling” options are available:

  • “Sitemap-based Incomplete”: with this option enabled, the URL entries from the configured sitemaps are crawled, and already indexed URLs that are not found in the sitemaps are left in the index.
  • “Sitemap-based Complete”: with this option enabled, the URL entries from the configured sitemaps are crawled, and already indexed URLs that are not found in the sitemaps are deleted from the index.

If the “Pass Sitemap ACL and Metadata to Redirect Target URLs” option is enabled and HTTP redirects are allowed in root URLs, the sitemap metadata and ACLs are also applied to the redirect target URLs.

  • The option “Use Stream Sitemap Parser” enables stream-based parsing of the sitemaps. This is more memory-efficient for large sitemap XML files but is also less tolerant of XML errors in the sitemaps.
  • The option “Sitemap Metadata Prefix” adds the configured prefix to each metadata item extracted from the sitemap.

Default Content Type

The option “Default Content Type” can be used to set the MIME-Type for those documents for which the MIME-Type cannot be extracted from the HTTP-Header.

Resource Parameters

This section is visible only when the “Advanced Settings” mode is activated on the “Indices” tab.

The “Number of Crawler Threads” setting defines the number of parallel threads that fetch web pages. By default, 5 threads are configured.

In the “Request Interval” field you can set the minimum delay between consecutive requests of a crawler thread in milliseconds. The default value is 250. If a “Crawl-Delay” setting is used in a website’s robots.txt, that value overrules the “Request Interval” value.
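The precedence rule can be stated in a few lines (an illustrative sketch; the function name is my own, and note that “Crawl-Delay” is given in seconds while “Request Interval” is in milliseconds):

```python
def effective_request_interval_ms(request_interval_ms, crawl_delay_seconds=None):
    """Return the delay a crawler thread honours between requests:
    a robots.txt "Crawl-Delay" (seconds) overrules the configured
    "Request Interval" (milliseconds)."""
    if crawl_delay_seconds is not None:
        return int(crawl_delay_seconds * 1000)
    return request_interval_ms

print(effective_request_interval_ms(250))                         # 250
print(effective_request_interval_ms(250, crawl_delay_seconds=2))  # 2000
```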

Proxy

In the “Proxy” section you can enter a proxy server if required in your infrastructure. To do this, enter the computer name and the port of the proxy server in “Proxy Host” and “Proxy Port”.

The Web Connector can authenticate to the proxy server with HTTP Basic authentication. Enter the user in the field “Proxy User” and the password in “Proxy Password” if the connections are to be made via an authentication-enabled proxy.

Authentication

This chapter describes the various authentication methods for the Web Connector. The methods that can be used to index content that is located behind a login are also discussed.

Form-based authentication

This section deals with form-based login, a mechanism that allows you to perform a login using a login form and to manage user sessions using HTTP cookies.

Form-based login simulates the user and browser behavior required to automate such logins.

In this chapter, two scenarios are described. Both scenarios are based on the settings shown in the figure below.

Static form-based login with session management

In this scenario, a POST request is sent to a specific URL to trigger the authentication. The URL to be used for this purpose is entered under Login URL. For instance, this URL can be determined using the debugging functions of the web browser. The various options are explained below:

  • Session initialization URL

In some cases it is necessary to retrieve a dynamically generated cookie from a specific URL and send it along already with the form-based login. An HTTP GET request is executed on the URL entered here and the cookies generated from this are sent along with the actual login.

  • Include matching cookies (regular expression)

This can be used to restrict which cookies are to be stored for session management. This field must contain a regular expression that applies to the names of the cookies that are to be enabled and used for the session.

  • Form and password elements

The names and values of the elements that are used in the HTTP POST request on the login URL have to be specified in this setting. In doing this, the name of the HTML form field should be entered. All password fields have to be entered under Password Elements.

  • Follow redirects for login post

If this option is enabled, all redirects after the HTTP POST request to the login URL are followed, and all cookies are collected until no further redirect occurs or the authentication is successful.

No further settings are required for this scenario.
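The “Include matching cookies (regular expression)” setting above can be sketched as a simple filter over cookie names (an illustrative sketch; whether the connector uses full-match or partial-match semantics is an assumption here):

```python
import re

def session_cookies(cookies, include_pattern):
    """Keep only the cookies whose *name* matches the configured
    regular expression; only these are stored for the session.
    `cookies` maps cookie names to values."""
    return {name: value for name, value in cookies.items()
            if re.fullmatch(include_pattern, name)}

cookies = {"JSESSIONID": "abc", "tracking_id": "xyz"}
print(session_cookies(cookies, r"JSESSIONID|auth_.*"))  # {'JSESSIONID': 'abc'}
```

The tracking cookie is dropped; only the session cookie is sent along with subsequent requests.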

Complex form-based authentication

If the previous scenario is not sufficient, the following settings can be used:

  • Session initialization URL

This URL is opened at the beginning so that it can then be dynamically redirected. The cookies received in the process are retained for the session.

  • Login form parameters

If hidden fields are set in the login form, they can be listed here. They are extracted and sent along with the login request. A typical example of this is the dynamically generated FormID, which is returned as a hidden parameter from the Web server.

  • Login URL patterns

All redirects that match the regular expressions specified here are followed during the login process.

  • Login post URL patterns

When following redirects that match the regular expressions specified here, all collected form parameters are sent using an HTTP POST request.

  • Logged in URL patterns

If you are forwarded to a URL that matches the regular expressions specified here, the login process was successful.

  • Maximum allowed count of redirects

This can be used to set the maximum depth of the tracked redirects.

  • Post to configured login URL

If this option is enabled, redirects to a "Login Post URL" are replaced by an HTTP POST request to the URL configured under "Session Initialization URL".

  • Force Session Renewal After Expiration

If this option is set, the login process is always re-executed if the session is older than the configured maximum session age (maximum session age in seconds). This option only works if “Post to configured login URL” is enabled.

  • Maximum Session Age in Seconds

Maximum session age in seconds.

NTLM

To use NTLM authentication, the user, the password, and the domain need to be configured as credentials in the Network tab:

After this, the credential has to be selected in the Web Connector in the “NTLM Credential” setting:

Authorization basic header

Basic authentication following RFC 2617 is the most common type of HTTP authentication. The web server requests authentication using

WWW-Authenticate: Basic realm="RealmName"

where RealmName is a description of the protected area. The browser then searches for the username/password for this URL and queries the user if necessary. The browser sends the authentication Base64-encoded, in the form username:password, to the server using the Authorization header.


Authorization: Basic d2lraTpwZWRpYQ==

To set the header specified in the example above, it has to be configured in the HTTP Request Header option as shown in the following screenshot:
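The example header above can be reproduced in a few lines; the credentials behind it are “wiki” / “pedia” (a minimal sketch of the RFC 2617 encoding, not part of the connector):

```python
import base64

def basic_auth_header(username, password):
    """Build the Authorization header for HTTP Basic authentication:
    "username:password", Base64-encoded, prefixed with "Basic "."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return f"Basic {token}"

print(basic_auth_header("wiki", "pedia"))  # Basic d2lraTpwZWRpYQ==
```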

Filter Configuration

The filter service can be configured using several environment variables, described below.

Cache Settings for Thumbnail Generation

The variable MES_THUMBNAIL_CACHE_LOCATION specifies a directory for the network cache used during thumbnail generation. The maximum cache size can be defined using the variable MES_THUMBNAIL_CACHE_SIZE_MB. A cache is created and used only if both variables are set.

Example (Linux):

export MES_THUMBNAIL_CACHE_LOCATION=/tmp/thumbcache
export MES_THUMBNAIL_CACHE_SIZE_MB=512


On Windows, the variables can be defined in the Control Panel.

Timeout Settings for Thumbnail Generation

Using the variable MES_THUMBNAIL_TIMEOUT, the timeout value (in seconds) for thumbnail generation can be redefined. Otherwise, the default value of 50 seconds is used.

Example (Linux):

export MES_THUMBNAIL_TIMEOUT=100
On Windows, the variables can be defined in the Control Panel.

Extract Main Content with Alternative Filter Mode

Crawling e.g. news sites often indexes useless content like menus or footers. The HTML filter can be switched to an alternative mode that indexes only the meaningful content, using heuristics.

There are several ways to configure this:

Filter Plugin Properties

Click on the “Filters” tab and activate “Advanced Settings”.

In the section “Global Filter Plugin Properties”, select “FilterPlugin.JerichoWithThumbnails(…)” and click “Add”.

Expand the new entry “FilterPlugin.JerichoWithThumbnails” and set the “Use Boilerpipe Extractor” option to the value “Article”.

Finally, re-index.

Datasource XPath Metadata

In the “Indices” tab, activate “Advanced Settings”.

Expand the affected index. In the section “Data Source”, subsection “Extract Metadata”, click on the plus symbol “Add Composite Property”.

In the new section “Extract Metadata”, enter the name htmlfilter:extractor and the XPath "Article" (important: with double quotes).

Finally, re-index.

Ignore empty HTML Elements

If documents with empty HTML elements appear in the index, you can define a regular expression to remove these elements during filtering.

There are several ways to configure this:

Filter Plugin Properties

Click on the “Filters” tab and activate “Advanced Settings”.

In the section “Global Filter Plugin Properties”, select “FilterPlugin.JerichoWithThumbnails(…)” and click “Add”.

Expand the new entry “FilterPlugin.JerichoWithThumbnails” and set the “Ignore Empty Tags Pattern” option, e.g. to the value “^(ul|li|a|div)$”. This means that the HTML elements ul, li, a and div will be removed if they are empty.

Finally, re-index.

Datasource XPath Metadata

In the “Indices” tab, activate “Advanced Settings”.

Expand the affected index. In the section “Data Source”, subsection “Extract Metadata”, click on the plus symbol “Add Composite Property”.

In the new section “Extract Metadata”, enter the name htmlfilter:ignoreEmptyCharactersElementTagsPattern and the XPath "^(ul|li|a|div)$" (important: with double quotes). This means that the HTML elements ul, li, a and div will be removed if they are empty.

Finally, re-index.

Using Googleon/Googleoff tags

Google GSA defines a mechanism for marking certain sections within a single HTML page as “non-searchable”. These designated sections are not indexed, although the rest of the page is. The marks take the form of HTML comments that are set in pairs.

The following tags are supported:

fish <!--googleoff: index-->shark <!--googleon: index-->dog

“fish” and “dog” are indexed; “shark” is not.

fish <!--googleoff: snippet-->shark <!--googleon: snippet-->dog

“shark” is not used in result snippets; all three terms are indexed.

fish <!--googleoff: all-->shark <!--googleon: all-->dog

“shark” is excluded from the index, from snippets and from anchor text.

<!--googleoff: anchor--><A href=subsite.html>shark </A>dog <!--googleon: anchor-->

“dog” is indexed; the anchor text “shark” is not associated with the link target.

There are several options for configuration.

System-wide use with global filter plugin properties

To enable this function, click on the "Filters" tab and then click "Advanced Settings".

Under "Global Filter Plugin Properties", select "FilterPlugin.JerichoWithThumbnails(…)" and click "Add".

Expand the new entry "FilterPlugin.JerichoWithThumbnails" and tick the "Apply googleon/googleoff tags" setting. Then re-index.

Use in Web Connector

In the Web Connector settings under “Content Extraction“, check the "Apply googleon/googleoff Tags" setting. Then re-index.

Crawler-specific use with Datasource XPath Metadata

Enable "Advanced Settings" in the “Indices“ tab.

Expand the relevant index. In the "Data Source" section, "Extract Metadata" sub-section, click the "Add Composite Property" plus-sign icon.

In the new section "Extract Metadata", enter the name: htmlfilter:applygoogleonoff and set the XPath to: "true" (important: in quotation marks).

Then re-index.

Authorization

Configuring authorization settings is only possible through the “AuthorizedWeb” category. In order to configure these settings, the category of the data source must be changed from “Web” to “AuthorizedWeb”. The remaining settings in the “AuthorizedWeb” category are similar to those described in the previous chapters.

Configuring Access Rules

An Access Rule is defined by:

  • “Access Check Principal”: should be either in username@domain, domain\username or distinguished name format for users, or only in distinguished name format for user groups. Additionally, a capture group from the selection pattern can be referenced here (see Access Rules[3]).
  • “Access Check Action”: controls the access type (Grant or Deny).
  • “Metadata Key for Selection”: should be a metadata name used by the access rule, or empty (selects all documents).
  • “Selection Pattern”: should be a regular expression, or empty (matches all documents).
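How the four fields interact can be sketched as follows. This is an illustrative model under stated assumptions, not the connector's implementation; the dictionary keys and function name are my own, and the capture-group syntax (`\1`) is assumed to follow standard regex backreference conventions.

```python
import re

def apply_access_rule(document_metadata, rule):
    """Evaluate one access rule against a document.

    `rule` holds the four documented fields. A returned tuple
    (principal, action) means the rule applies; None means it does not.
    Capture-group references such as r"\1-readers" in the principal are
    resolved from the selection pattern match.
    """
    key = rule.get("metadata_key")           # "Metadata Key for Selection"
    pattern = rule.get("selection_pattern")  # "Selection Pattern"
    value = document_metadata.get(key, "") if key else ""
    if pattern:
        match = re.search(pattern, value)
        if not match:
            return None                      # document not selected
        principal = match.expand(rule["principal"])
    else:
        principal = rule["principal"]        # empty pattern matches all
    return principal, rule["action"]         # "Grant" or "Deny"

rule = {"metadata_key": "department", "selection_pattern": r"(\w+)",
        "principal": r"\1-readers", "action": "Grant"}
print(apply_access_rule({"department": "sales"}, rule))
# ('sales-readers', 'Grant')
```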

Parallel Processing of URLs

The option “Use hashing queue assignment policy” enables hash-based distribution of input URLs to processing queues. The number of queues is determined by the “Parallel Queue Count” setting.

Without the option “Use hashing queue assignment policy”, URLs are distributed by hostname.

High priority removal of documents

At the end of a crawl run, inaccessible documents are removed from the index. Additionally, the option “Invalid document deletion schedule” can be used to define a schedule for removing inaccessible documents from the index independently of crawl runs.

The example schedule: “0 */45 * * * ?” means a deletion run every 45 minutes.

This schedule is only active if the crawler schedule permits it.

The following documents will be removed:

  • Documents that are not found (HTTP status 404, 410)
  • Redirections to those documents (e.g. HTTP status 301, 307)

If “Cleanup non matching URL-s from Index” is activated, the following documents will be removed additionally:

  • Documents excluded by the “URL Exclude Pattern”
  • Redirections to those documents (e.g. HTTP status 301, 307)

Mindbreeze Sitemap Extensions

If “Sitemap-based” delta crawling is used, crawling root URLs are processed as sitemaps. The Mindbreeze Web Connector supports sitemap extensions which allow the definition of ACL information and metadata.

Sitemaps with Access Control Lists (ACL)

ACL information can be defined for all <url> elements of a sitemap.

Note: ACLs from sitemaps are not compatible with access check rules.

For example:

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:wstxns1="tag:mindbreeze.com,2008:/indexing/interface">
  <url>
    <loc>http://myserver.mycompany.com</loc>
    <wstxns1:grant>User1</wstxns1:grant>
    <wstxns1:deny>User2</wstxns1:deny>
  </url>
</urlset>

Access is allowed for User1 but denied for User2.

Role “everyone” automatically includes every user.

Sitemaps with Metadata

It is also possible to define metadata for <url> elements. For example:

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:wstxns1="tag:mindbreeze.com,2008:/indexing/interface">
  <url>
    <loc>http://myserver.mycompany.com</loc>
    <wstxns1:meta key="title">
      <wstxns1:value>Page Title</wstxns1:value>
    </wstxns1:meta>
  </url>
</urlset>

Each “meta” element can have multiple “value” sub-elements if the metadata has a list of values:

    <wstxns1:meta key="telefonnummer">
      <wstxns1:value>1234234245</wstxns1:value>
      <wstxns1:value>1234234344</wstxns1:value>
    </wstxns1:meta>

Appendix A

Heritrix Status Codes

Each crawled URI gets a status code. This code (or number) indicates the result of a URI fetch in Heritrix. Codes ranging from 200 to 599 are standard HTTP response codes; other Heritrix status codes are listed below.

1: Successful DNS lookup

0: Fetch never tried (perhaps protocol unsupported or illegal URI)

-1: DNS lookup failed

-2: HTTP connect failed

-3: HTTP connect broken

-4: HTTP timeout

-5: Unexpected runtime exception. See runtime-errors.log.

-6: Prerequisite domain-lookup failed, precluding fetch attempt.

-7: URI recognized as unsupported or illegal.

-8: Multiple retries failed, retry limit reached.

-50: Temporary status assigned to URIs awaiting preconditions. Appearance in logs may be a bug.

-60: URIs assigned a failure status. They could not be queued by the Frontier and may be unfetchable.

-61: Prerequisite robots.txt fetch failed, precluding a fetch attempt.

-62: Some other prerequisite failed, precluding a fetch attempt.

-63: A prerequisite (of any type) could not be scheduled, precluding a fetch attempt.

-404: Empty HTTP response interpreted as a 404.

-3000: Severe Java error condition occurred, such as OutOfMemoryError or StackOverflowError, during URI processing.

-4000: "Chaff" detection of traps/content with negligible value applied.

-4001: The URI is too many link hops away from the seed.

-4002: The URI is too many embed/transitive hops away from the last URI in scope.

-5000: The URI is out of scope upon reexamination. This only happens if the scope changes during the crawl.

-5001: Blocked from fetch by user setting.

-5002: Blocked by a custom processor.

-5003: Blocked due to exceeding an established quota.

-5004: Blocked due to exceeding an established runtime.

-6000: Deleted from Frontier by user.

-7000: Processing thread was killed by the operator. This could happen if a thread is in a non-responsive condition.

-9998: Robots.txt rules precluded fetch.