Mindbreeze GmbH, A-4020 Linz.
All rights reserved. All hardware and software names used are registered trade names and/or registered trademarks of the respective manufacturers.
These documents are highly confidential. No rights to our software or our professional services, or results of our professional services, or other protected rights can be based on the handing over and presentation of these documents.
Distribution, publication or duplication is not permitted.
The term ‘user’ is used in a gender-neutral sense throughout the document.
Click on the “Indices” tab and then on the “Add new index” symbol to create a new index.
Enter the index path, e.g. “C:\Index”. Adapt the Display Name of the Index Service and the related Filter Service if necessary.
Add a new data source with the symbol “Add new custom source” at the bottom right.
If necessary, choose “Web” in the “Category” field. With the “Crawler Interval” setting you are able to configure the interval between two crawl runs.
You can specify a regular expression for the links to follow in the “URL Regex” field. If you leave the field empty, all pages with the same host and domain parts as the “Crawling Root” are indexed (e.g. all pages on de.wikipedia.org when the “Crawling Root” points to a page on de.wikipedia.org).
You can also specify a pattern for the URLs that need to be excluded using the “URL Exclude Pattern” field. The URLs matched by this pattern will not be crawled and hence not be used for further link extraction. The pattern has to match the whole URL (including URL parameters).
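Because the pattern has to match the whole URL (including URL parameters), it behaves like a full match rather than a substring search. A minimal sketch of this semantics (the pattern and URLs are illustrative, not part of the product):

```python
import re

# Hypothetical exclude pattern: skip everything under /private/.
# The pattern must cover the WHOLE URL, including query parameters.
exclude_pattern = re.compile(r"https://example\.com/private/.*")

def is_excluded(url: str) -> bool:
    # fullmatch mirrors the whole-URL semantics described above
    return exclude_pattern.fullmatch(url) is not None

print(is_excluded("https://example.com/private/report?id=1"))  # True
print(is_excluded("https://example.com/public/private/x"))     # False
```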
With the option "Include URL By Metadata" or "Exclude URL by Metadata", certain pages can be excluded (when crawling sitemaps) based on the metadata in the sitemap. The Metadata Name field specifies the metadata name and the Pattern field specifies the regular expression against which the metadata value is matched.
If "URL Regex", "URL Exclude Pattern" and "Include/Exclude URL by Metadata" are used simultaneously, "URL Regex" is applied first, then the pages are excluded with "URL Exclude Pattern" and finally the remaining pages are filtered with "Include/Exclude URL by Metadata".
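The documented order can be sketched as a three-stage pipeline (names and URLs are illustrative; the metadata check stands in for the sitemap-based include/exclude):

```python
import re

def filter_urls(urls, url_regex, exclude_pattern, metadata_ok):
    """Sketch of the documented order: "URL Regex" is applied first,
    then "URL Exclude Pattern", then the metadata-based filter."""
    kept = [u for u in urls if re.fullmatch(url_regex, u)]
    kept = [u for u in kept if not re.fullmatch(exclude_pattern, u)]
    return [u for u in kept if metadata_ok(u)]

urls = ["https://a.example/doc", "https://a.example/tmp/doc", "https://b.example/doc"]
result = filter_urls(
    urls,
    url_regex=r"https://a\.example/.*",  # keep only host a.example
    exclude_pattern=r".*/tmp/.*",        # drop temporary pages
    metadata_ok=lambda u: True,          # stand-in for sitemap metadata check
)
print(result)  # ['https://a.example/doc']
```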
With the option “Convert Document Keys to Lower Case” set, the document keys (header/mes:key) of the documents are converted to lower case.
You can add an arbitrary number of crawling roots by editing the “Crawling Root” field and pressing the “Add” button. The added crawling roots are displayed in the list above. You can remove existing crawling roots by clicking on the “Remove” button beside them.
With the “Maximum Link Depth” field you can set the maximum count of hops from the crawling roots for the URLs that are crawled. URLs having a higher hop count will be ignored.
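Conceptually, the hop count is the link distance from the nearest crawling root, as in a breadth-first traversal. A simplified sketch (link extraction, politeness and deduplication details are omitted):

```python
from collections import deque

def crawl_order(roots, links, max_link_depth):
    """Sketch of hop counting: URLs more than max_link_depth hops
    away from any crawling root are ignored. `links` maps a URL to
    its outgoing links."""
    seen = set(roots)
    queue = deque((r, 0) for r in roots)
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_link_depth:
            continue  # outlinks would exceed the allowed hop count
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return order

links = {"root": ["a"], "a": ["b"], "b": ["c"]}
order = crawl_order(["root"], links, max_link_depth=2)
print(order)  # ['root', 'a', 'b']  (c is 3 hops away and ignored)
```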
If you want to set additional HTTP headers while crawling (e.g. for setting the Accept-Language header), do so using the Accept Headers parameter.
With the option “Incomplete Delta Crawl Runs” enabled, pages that are not reachable from the current “Crawling Root” are not deleted from the index at the end of the crawl run. To minimize the load of subsequent crawl runs on your site, you can provide a crawling root with links to updated pages only.
IMPORTANT: the option “Incomplete Delta Crawl Runs” must not be used with sitemap-based delta crawling. For this see section “Sitemap Crawling Strategy”.
If a regular expression is set in the “Enforce extension from URL if matches” parameter, the file extension is derived from the URL instead of from the “Content-Type” HTTP header for documents with matching URLs.
If the "Enable Default ACLs" option is active (active by default), ACLs are set for Web documents if they do not have any explicitly defined ACLs (e.g. via sitemaps with <mes:acl>). Which ACLs are set in these cases is determined with the option "Default ACL Principals". Several "Default ACL Principals" can be specified separated by line breaks. If this field is left empty and "Enable Default ACLs" is active, "everyone" is used by default as "Default ACL Principals".
The “Inherit Crawling Root Query Parameter Pattern” option lets you pass URL query parameters from the crawling root down to child URLs. The use case is web pages that serve different content depending on query parameters. For example, the crawling roots https://mysite.com/events?location=us and https://mysite.com/events?location=en provide different content, and so do the child pages https://mysite.com/events/sponsored?location=us and https://mysite.com/events/sponsored?location=en. For the query parameter location from the crawling root to also apply to the child pages, the option “Inherit Crawling Root Query Parameter Pattern” must be set to the value location. The value can be any regular expression matched against the query parameter name. If a child page already has query parameters of the same name, these are overwritten by the crawling root query parameter.
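The inheritance logic can be sketched with Python's standard URL utilities (the connector's actual implementation may differ; the URLs repeat the example above):

```python
import re
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def inherit_params(root_url, child_url, name_pattern):
    """Sketch: copy query parameters whose names match `name_pattern`
    from the crawling root to a child URL, overwriting same-named
    parameters on the child."""
    root_params = dict(parse_qsl(urlsplit(root_url).query))
    inherited = {k: v for k, v in root_params.items()
                 if re.fullmatch(name_pattern, k)}
    parts = urlsplit(child_url)
    child_params = dict(parse_qsl(parts.query))
    child_params.update(inherited)  # root parameters win on conflict
    return urlunsplit(parts._replace(query=urlencode(child_params)))

merged = inherit_params("https://mysite.com/events?location=us",
                        "https://mysite.com/events/sponsored",
                        r"location")
print(merged)  # https://mysite.com/events/sponsored?location=us
```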
The “Max Retries” option determines how often the connector retries downloading a document when temporary errors (e.g. socket timeouts) occur. The default value is 0 (no further download attempts). If you are crawling across an unstable network that causes timeouts, this value should be increased, for example to 10. If the timeouts are caused by an overloaded data source, the value should be left at 0 so as not to put additional load on the data source.
The option “Max Document Size (MB)” sets the maximum allowed file size of documents that are downloaded. The default value is 50. If a document is larger than this value, the document is truncated. A value of 0 means that no maximum file size is set.
With the option "Process Canonical Link" you can define a regular expression that determines for which URLs the crawler should try to read the URL from the "canonical" tag and set it as index key and URL metadata.
The option “Encode Canonical Links” determines whether URLs extracted from the “canonical” tag should be encoded before being stored in the index. To ensure that this setting is applied correctly, we recommend that you clean and re-index the index. This is useful because the URLs are then stored in the index in the same format as when you copy them from the browser, which usually automatically encodes the URLs.
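Encoding here means percent-encoding, as browsers apply it when a URL is copied from the address bar. An illustrative sketch (the URL is hypothetical):

```python
from urllib.parse import quote

# Hypothetical canonical URL containing a space and an umlaut,
# as it might appear verbatim in a <link rel="canonical"> tag.
raw = "https://example.com/über uns"
# Percent-encode everything except characters with URL syntax meaning.
encoded = quote(raw, safe=":/?&=#")
print(encoded)  # https://example.com/%C3%BCber%20uns
```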
In order to use sitemaps according to the Sitemaps.org protocol, check “Delta Crawling” and enter the site’s root sitemap as the crawling root.
In this scenario, the crawler retrieves only the web pages that are listed in the sitemap. The lastmod property of a sitemap URL entry is compared with the modification date of the already indexed web page (if one exists). Furthermore, the changefreq property is taken into account between crawl runs. With a precise sitemap, a high-frequency recrawling strategy can be employed.
Two “Delta Crawling” options are available:
If the “Pass Sitemap ACL and Metadata to Redirect Target URLs” option is enabled and http redirects are allowed in root URLs, the sitemap metadata and ACLs are also applied to the redirect target URLs.
Sitemaps from the local filesystem are also supported if a Delta Crawling mode is selected. Enter the file URL as the crawling root. Only file URLs pointing to the data directory are permitted, e.g. file:///data/sitemap.xml.
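A minimal sitemap following the Sitemaps.org protocol might look like this (the URL and date are placeholders):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/page.html</loc>
    <lastmod>2024-01-01</lastmod>
    <changefreq>daily</changefreq>
  </url>
</urlset>
```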
To define additional metadata and ACLs, there is a Mindbreeze extension. The following XML tags are additionally available:
<mes:meta>: Can be defined multiple times within an <url> tag to define metadata. The following attributes are available:
<mes:value>: Can be defined multiple times within a <mes:meta> tag to define one or more values for a metadata entry.
<mes:acl>: Can be used to define ACLs. Note that this tag is only considered if the option "Enable Default ACLs" is activated (active by default). Please note that ACLs from sitemaps are not compatible with “Access Check Rules”.
<mes:grant>: Can be defined multiple times within a <mes:acl> tag to grant access for a principal.
<mes:deny>: Can be defined multiple times within a <mes:acl> tag to deny access for a principal.
<mes:require>: Can be defined multiple times within a <mes:acl> tag to define which groups a user must be in. <mes:require> generally does not grant access, but denies access if the user is not in the group. Thus, after the last <mes:require> tag an additional <mes:grant> tag is necessary to grant access.
<!-- additional metadata (illustrative value) -->
<mes:meta key="keywords" aggregatable="true">
  <mes:value>example</mes:value>
  <!-- more -->
</mes:meta>
<!-- ACL (grants access to everyone) -->
<mes:acl>
  <mes:grant>everyone</mes:grant>
</mes:acl>
The option “Default Content Type” can be used to set the MIME type for those documents for which the MIME type cannot be extracted from the HTTP header.
This section is visible only when the “Advanced Settings” mode is activated on the “Indices” tab.
The “Number of Crawler Threads” setting defines the number of parallel threads that fetch web pages. By default, 5 threads are configured.
In the “Request Interval” field you can set the minimum delay between consecutive requests of a crawler thread in milliseconds. The default value is 250. If there is a “Crawl-Delay” setting in a web site’s robots.txt, that value overrides the “Request Interval” value.
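For reference, a robots.txt entry with a "Crawl-Delay" directive might look like this (the value is illustrative and is interpreted in seconds, so 2 corresponds to 2000 ms and would override a 250 ms request interval):

```
User-agent: *
Crawl-Delay: 2
```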
If external network requests are absolutely necessary to display the content of the web page, it is possible to define security exceptions using the "Additional Network Resources Hosts" setting (advanced setting). This setting defines a list of hostnames that are always allowed. In the example above, you can set the value ajax.googleapis.com for "Additional Network Resources Hosts". This allows network requests such as https://ajax.googleapis.com/ajax/libs/angularjs.
Activate "Advanced Settings" to display all settings
Enable Verbose Logging
Enables advanced logging for diagnostic purposes (default: disabled).
Additional Network Resources Hosts
List of hostnames to which network requests are allowed (See description above).
Ignore SSL/Certificate Errors
If active, HTTPS SSL or certificate errors are ignored. For security reasons, this setting may only be enabled in test systems.
Width of the generated thumbnails in pixels (default value: 100).
Height of the generated thumbnails in pixels (default value: 75).
Page Load Strategy
Determines with which strategy web pages are loaded. This value is for internal use and should not be changed. (Default value: "Eager")
Page Ready State
Specifies how to determine whether a page has finished loading. This value is for internal use and should not be changed (default value: “Complete Or Interactive“).
Page Load Timeout
Browser Recycle Threshold
Number of websites that are processed with a single internal browser instance before the instance is automatically stopped and restarted to conserve system resources. (Default value: 1000)
This value is for internal use and should not be changed. (Default value: empty)
In the “Proxy” section you can enter a proxy server if required in your infrastructure. To do this, enter the computer name and the port of the proxy server in “Proxy Host” and “Proxy Port”.
The Web Connector can authenticate to the proxy server using HTTP Basic authentication. If the connections are to be made via an authenticating proxy, enter the user in the “Proxy User” field and the password in “Proxy Password”.
This chapter describes the various authentication methods for the Web Connector. The methods that can be used to index content that is located behind a login are also discussed.
This section deals with the mechanism of the form-based login, which is essentially a mechanism that allows you to perform a login using a login form and to manage user sessions using HTTP cookies.
Form-based login simulates the user behavior and browsing behavior required to automate such logins.
In this chapter, two scenarios are described. Both scenarios are based on the settings shown in the figure below.
In this scenario, a POST request is sent to a specific URL to trigger the authentication. The URL to be used for this purpose is entered under Login URL. For instance, this URL can be determined using the debugging functions of the web browser. The various options are explained below:
In some cases it is necessary to retrieve a dynamically generated cookie from a specific URL and send it along already with the form-based login. An HTTP GET request is executed on the URL entered here and the cookies generated from this are sent along with the actual login.
This can be used to restrict which cookies are to be stored for session management. This field must contain a regular expression that applies to the names of the cookies that are to be enabled and used for the session.
The names and values of the elements that are used in the HTTP POST request on the login URL have to be specified in this setting. In doing this, the name of the HTML form field should be entered. All password fields have to be entered under Password Elements.
If this option is enabled, all redirects after the HTTP POST request to the login URL are followed, and all cookies are collected until no further redirect is requested or the authentication is successful.
No further settings are required for this scenario.
If the previous scenario is not sufficient, the following settings can be used:
This URL is opened at the beginning so that it can then be dynamically redirected. The cookies received in the process are retained for the session.
If hidden fields are set in the login form, they can be listed here. They are extracted and sent along with the login request. A typical example of this is the dynamically generated FormID, which is returned as a hidden parameter from the Web server.
All redirects that match the regular expressions specified here are followed during the login process.
When following the redirects that match the regular expressions specified here, all collected form parameters are sent using an HTTP POST request.
If you are redirected to a URL that matches the regular expressions specified here, the login process was successful.
This can be used to set the maximum depth of the followed redirects.
If this option is enabled, redirects to a "Login Post URL" are replaced by an HTTP POST request to the URL configured under "Session Initialization URL".
If this option is set, old session cookies are not used to create a new session when a session expires.
If this option is set, the login process is always re-executed if the session is older than the configured maximum session age (maximum session age in seconds). This option only works if “Post to configured login URL” is enabled.
Maximum session age in seconds.
To use NTLM authentication, the user, the password, and the domain need to be configured as credentials in the Network tab:
After this, this credential has to be selected in the Web Connector in the "NTLM Credential” setting:
To use OAuth authentication, a credential of type OAuth2 must be created in the Network tab:
For the Grant-Type Client Credentials it is sufficient to configure Realm, Client ID and Client Secret. For the Grant-Type Password Credentials you also have to configure the username and password.
Basic authentication according to RFC 2617 is the most common type of HTTP authentication. The web server requests authentication using
WWW-Authenticate: Basic realm="RealmName"
where RealmName is a description of the protected area. The browser then looks up the username/password for this URL, prompting the user if necessary, and sends the credentials Base64-encoded in the form Username:Password to the server in the Authorization header.
Authorization: Basic d2lraTpwZWRpYQ==
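The header value can be reproduced in a few lines; the credentials wiki/pedia below are those encoded in the example header:

```python
import base64

# Basic authentication per RFC 2617: Base64 of "Username:Password".
credentials = "wiki:pedia"
token = base64.b64encode(credentials.encode("ascii")).decode("ascii")
print("Authorization: Basic " + token)  # Authorization: Basic d2lraTpwZWRpYQ==
```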
To set the header specified in the example above, it has to be configured in the HTTP Request Header option as shown in the following screenshot:
Kerberos authentication uses the Negotiate protocol to authenticate HTTP requests. The Web connector is thus able to index websites that can only be accessed with Kerberos authentication. The following steps are required to activate Kerberos authentication:
Note: Kerberos does not currently support web thumbnails; they are automatically disabled.
With the option "URLs Excluded from Filtering", already found pages can be excluded from indexing by means of a regular expression. A possible use case is, for example, if you want to index certain pages that can only be reached via detours and you do not want to index the detours themselves. In this case, you can use "URLs Excluded from Filtering" to specify a regular expression that excludes the detours.
Here is an overview of the options that influence the crawling direction:
There are multiple environment variables that influence thumbnail generation.
The variable MES_THUMBNAIL_CACHE_LOCATION specifies a directory for the network cache used during thumbnail generation. The maximal cache size can be defined using the variable MES_THUMBNAIL_CACHE_SIZE_MB. Only if both variables are set, a cache is created and used.
On Windows, the variables can be defined in the Control Panel.
Using the variable MES_THUMBNAIL_TIMEOUT a timeout value for thumbnail generation can be redefined. Otherwise the default value of 50 seconds will be used.
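On Linux, the variables can be set as environment variables for the service, for example as follows (the path and values are illustrative; the timeout is assumed to be in seconds):

```shell
# Both cache variables must be set, otherwise no cache is created.
export MES_THUMBNAIL_CACHE_LOCATION=/var/cache/mes-thumbnails
export MES_THUMBNAIL_CACHE_SIZE_MB=512
# Optional: override the default 50-second thumbnail timeout.
export MES_THUMBNAIL_TIMEOUT=120
```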
Crawling news sites, for example, often indexes useless content such as menus or footers. The HTML filter can be switched to an alternative mode that uses heuristics to index only meaningful content.
There are several ways to configure this:
Click on the “Filters”-Tab and activate “Advanced Settings”.
In the section “Global Filter Plugin Properties” select “FilterPlugin.JerichoWithThumbnails(…)” and click on “Add”.
Expand the new entry “FilterPlugin.JerichoWithThumbnails” and set “Use Boilerpipe Extractor” to the value “Article”.
In the Tab “Indices” activate “Advanced Settings”.
Expand the affected index. In the “Data Source” section, subsection “Extract Metadata”, click on the plus symbol “Add Composite Property”.
In the new “Extract Metadata” section, enter htmlfilter:extractor as the name and "Article" as the XPath (important: with double quotes).
If documents with empty HTML elements appear in the index, you can define a regular expression to remove these elements during filtering.
There are several ways to configure this:
Click on the “Filters” Tab and activate “Advanced Settings”.
In Section “Global Filter Plugin Properties” select “FilterPlugin.JerichoWithThumbnails(…)” and click on “Add”.
Expand the new entry “FilterPlugin.JerichoWithThumbnails” and set “Ignore Empty Tags Pattern” to a value such as “^(ul|li|a|div)$”. With this value, the HTML elements ul, li, a and div are removed if they are empty.
In the Tab “Indices” activate “Advanced Settings”.
Expand the affected index. In the “Data Source” section, subsection “Extract Metadata”, click on the plus symbol “Add Composite Property”.
In the new “Extract Metadata” section, enter htmlfilter:ignoreEmptyCharactersElementTagsPattern as the name and "^(ul|li|a|div)$" as the XPath (important: with double quotes). With this value, the HTML elements ul, li, a and div are removed if they are empty.
Google GSA defines a mechanism for marking certain sections within a single HTML page as "non-searchable". These designated sections are not indexed, although the rest of the page is. The marks take the form of HTML comments that are set in pairs.
The following tags are supported:
fish <!--googleoff: snippet-->shark <!--googleon: snippet-->dog
Result: “fish“ and “dog“ are indexed; “shark“ is not.

fish <!--googleoff: all-->shark <!--googleon: all-->dog
Result: “dog“ is indexed; “shark“ is not.
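The effect of these tags can be sketched with a small filter (a simplified illustration, not the connector's actual implementation):

```python
import re

def strip_googleoff(html: str, kind: str = "all") -> str:
    """Sketch: drop content between a googleoff/googleon comment pair,
    mimicking how marked sections are kept out of the index."""
    pattern = re.compile(
        r"<!--googleoff:\s*" + kind + r"-->.*?<!--googleon:\s*" + kind + r"-->",
        re.DOTALL,
    )
    return pattern.sub("", html)

cleaned = strip_googleoff("fish <!--googleoff: all-->shark <!--googleon: all-->dog")
print(cleaned)  # fish dog
```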
There are several options for configuration.
To enable this function, click on the "Filters" tab and then click "Advanced Settings".
Under "Global Filter Plugin Properties", select "FilterPlugin.JerichoWithThumbnails (...)" and click "Add".
Expand the new entry "FilterPlugin.JerichoWithThumbnails" and tick the "Apply googleon/googleoff tags" setting. Then re-index.
In the Web Connector settings under “Content Extraction“, check the "Apply googleon/googleoff Tags" setting. Then re-index.
Enable "Advanced Settings" in the “Indices“ tab.
Expand the relevant index. In the "Data Source" section, "Extract Metadata" sub-section, click the "Add Composite Property" plus-sign icon.
In the new section "Extract Metadata", enter the name: htmlfilter:applygoogleonoff and set the XPath to: "true" (important: in quotation marks).
With the "Enable Character Normalization" option, special characters such as ü or â are transformed into a normalized form (compatibility decomposition), which facilitates searching. This option can produce better results if the client service option "Query Expansion for Diacritic Term Variants" does not deliver the desired quality.
Configuring authorization settings is only possible via the “AuthorizedWeb” category. To configure these settings, change the category of the data source from “Web” to “AuthorizedWeb”. The remaining settings in the “AuthorizedWeb” category are similar to those described in the previous chapters.
An Access Rule is defined by:
“Access Check Principal”, which should be either in username@domain or domain\username or distinguished name format for users or only in distinguished name format for user groups. Additionally a capture group from the selection pattern can be referenced here (See Access Rules).
“Access Check Action”, which controls access type (Grant or Deny).
“Metadata Key for Selection”, which should be a metadata name used by the access rule, or empty (selects all documents).
“Selection Pattern”, which should be a regular expression, or empty (matches all documents).
Access Check Rules are only applied if there are no ACLs defined in the Sitemap (If Sitemaps are used).
If there are Access Check Rules that refer to metadata other than ‘url’, then all documents that may include ACL changes may be crawled again in a delta run.
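The selection step of an Access Check Rule can be sketched like this (the field names are illustrative, not the product's internal representation):

```python
import re

def rule_applies(rule, doc_metadata):
    """Sketch of rule selection: a rule selects a document when its
    selection pattern matches the value of the configured metadata
    key. An empty key or pattern selects all documents."""
    key = rule.get("selection_key")
    pattern = rule.get("selection_pattern")
    if not key or not pattern:
        return True
    value = doc_metadata.get(key, "")
    # the pattern must match the whole metadata value
    return re.fullmatch(pattern, value) is not None

rule = {"principal": "group@example.com", "action": "Grant",
        "selection_key": "url", "selection_pattern": r"https://intranet\..*"}
print(rule_applies(rule, {"url": "https://intranet.example.com/a"}))  # True
```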
The option “Use hashing queue assignment policy” enables hash-based distribution of input URLs to processing queues. The number of queues is determined by the “Parallel Queue Count” setting.
Without the option “Use hashing queue assignment policy”, URLs are distributed by hostname.
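The effect of the two policies can be sketched as follows (the actual hash function is internal to the product; SHA-1 here is only an illustration):

```python
import hashlib
from urllib.parse import urlsplit

def assign_queue_hashing(url: str, parallel_queue_count: int) -> int:
    """Hashing policy sketch: a stable hash of the full URL, modulo
    the queue count, selects the processing queue."""
    digest = hashlib.sha1(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % parallel_queue_count

def assign_queue_by_host(url: str, parallel_queue_count: int) -> int:
    """Default policy sketch: all URLs of one hostname share a queue."""
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % parallel_queue_count

# The same URL is always assigned to the same queue; with the default
# policy, all URLs of one host land on the same queue.
q = assign_queue_hashing("https://example.com/a", 4)
print(q)
```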
At the end of a crawl-run inaccessible documents are removed from the index. This deletes all documents that were not successfully downloaded and indexed.
If the option "Incomplete Delta Crawl Runs" is active, no documents will be deleted at the end of the crawl run.
Additionally, the option “Invalid document deletion schedule” can be used to define a schedule which is used to remove inaccessible documents from the index independent of crawl runs.
The example schedule “0 */45 * * * ?” (Quartz cron syntax) triggers a deletion run at minute 0 and minute 45 of every hour.
This schedule is only active if the crawler schedule permits it.
The following documents are deleted during the crawl run:
If “Cleanup non matching URL-s from Index” is activated, the following documents will be removed additionally:
If the option "Cleanup non matching URL-s from Index" is active and no "Invalid document deletion schedule" has been defined, the deletion process is started with each crawl run. If it is a delta crawl run, only documents ignored according to the "URL Exclude Pattern" will be deleted, otherwise inaccessible documents will also be deleted (HTTP status 404, 410, 301, 307). It is a delta crawl run if the option "Incomplete Delta Crawl Runs" is active or if "Sitemap-based Incomplete" is selected for the "Delta Crawling" option.
Each crawled URI gets a status code. This code (or number) indicates the result of a URI fetch in Heritrix. Codes ranging from 200 to 599 are standard HTTP response codes. Other Heritrix status codes are listed below.
Successful DNS lookup
Fetch never tried (perhaps protocol unsupported or illegal URI)
DNS lookup failed
HTTP connect failed
HTTP connect broken
Unexpected runtime exception. See runtime-errors.log.
Prerequisite domain-lookup failed, precluding fetch attempt.
URI recognized as unsupported or illegal.
Multiple retries failed, retry limit reached.
Temporary status assigned to URIs awaiting preconditions. Appearance in logs may be a bug.
URIs assigned a failure status. They could not be queued by the Frontier and may be unfetchable.
Prerequisite robots.txt fetch failed, precluding a fetch attempt.
Some other prerequisite failed, precluding a fetch attempt.
A prerequisite (of any type) could not be scheduled, precluding a fetch attempt.
Empty HTTP response interpreted as a 404.
A severe Java error condition such as OutOfMemoryError or StackOverflowError occurred during URI processing.
"Chaff" detection of traps/content with negligible value applied.
The URI is too many link hops away from the seed.
The URI is too many embed/transitive hops away from the last URI in scope.
The URI is out of scope upon reexamination. This only happens if the scope changes during the crawl.
Blocked from fetch by user setting.
Blocked by a custom processor.
Blocked due to exceeding an established quota.
Blocked due to exceeding an established runtime limit.
Deleted from Frontier by user.
Processing thread was killed by the operator. This could happen if a thread is in a non-responsive condition.
Robots.txt rules precluded fetch.
The Mindbreeze WebConnector supports documents that are compressed with the following compression types (content encoding):