Web Connector

Installation and Configuration

Copyright ©

Mindbreeze GmbH, A-4020 Linz.

All rights reserved. All hardware and software names used are registered trade names and/or registered trademarks of the respective manufacturers.

These documents are highly confidential. No rights to our software or our professional services, or results of our professional services, or other protected rights can be based on the handing over and presentation of these documents.

Distribution, publication or duplication is not permitted.

The term ‘user’ is used in a gender-neutral sense throughout the document.

Installation

Before installing the Web Connector, ensure that the Mindbreeze Server is already installed and that this connector is included in the Mindbreeze license.

Extending Mindbreeze for use with the Web Connector

The Web Connector is available as a ZIP file. This file must be registered with the Fabasoft Mindbreeze Enterprise Server via mesextension as follows:

mesextension --interface=plugin --type=archive --file=WebConnector<version>.zip install

Uninstalling the Web Connector

To uninstall the Web Connector, first delete all Web Crawlers and then carry out the following command:

mesextension --interface=plugin --type=archive --file=WebConnector<version>.zip uninstall

Configuration of Mindbreeze

Configuration of Index and Crawler

Select the “Advanced” installation method:

Click on the “Indices” tab and then on the “Add new index” symbol to create a new index.

Enter the index path, e.g. “C:\Index”. Adapt the Display Name of the Index Service and the related Filter Service if necessary.

Add a new data source with the symbol “Add new custom source” at the bottom right.

If necessary, choose “Web” in the “Category” field. With the “Crawler Interval” setting you can configure the interval between two crawl runs.

Web Page

You can specify a regular expression for the links to follow with the field “URL Regex”. If you leave the field empty, all pages with the same host and domain parts as the “Crawling Root” will be indexed (e.g. de.wikipedia.org when the “Crawling Root” is http://de.wikipedia.org).
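
For example, to follow only links to article pages of the German Wikipedia, a pattern such as the following could be used (an illustrative value, not a recommendation):

https?://de\.wikipedia\.org/wiki/.*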

You can also specify a pattern for the URLs that need to be excluded using the “URL Exclude Pattern” field. The URLs matched by this pattern will not be crawled and hence not be used for further link extraction.
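
For example, an exclude pattern such as the following (illustrative only) would keep the crawler away from image, stylesheet and script URLs:

.*\.(jpg|png|gif|css|js)$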

With the option “Convert URLs to lowercase” set, all URLs found by the crawler are converted to lowercase.

You can add an arbitrary number of crawling roots by editing the “Crawling Root” field and pressing the “Add” button. The added crawling roots are displayed in the list above (e.g. Crawling Root[1]). You can remove an existing crawling root by clicking on the “Remove” button beside it.

With the “Maximum Link Depth” field you can set the maximum number of hops from the crawling roots for the URLs that are crawled. URLs with a higher hop count will be ignored.

If you want to set additional HTTP headers while crawling (e.g. for setting the Accept-Language header), do so using the Accept Headers parameter.
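
For example, to request German content, a header such as the following could be configured (an illustrative value; the exact notation expected by the parameter may differ in your installation):

Accept-Language: de-DE,de;q=0.9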

To minimize the load that subsequent crawl runs put on your site, you can provide a sitemap containing only the updated pages. If you do so, you have to check the “Incomplete Delta Crawl Runs” option. With this option enabled, pages that are not reachable from the current “Crawling Root” will remain in the index.

With the option “Cleanup non matching URLs from index” enabled, all documents that do not match the rules of “URL Regex” and “URL Exclude Pattern” are deleted from the index.

With the option “Delete URLs that are redirecting to excluded URLs” enabled, all documents that redirect to documents excluded by the “URL Exclude Pattern” will be deleted.

With the option “Delete URLs that are no longer available” enabled, all documents with HTTP status 401, 403 or 404 are also removed from the index. If these documents were reached via redirect, the documents that redirected to them are also removed from the index.

With the “Robots Honoring Policy” field you can configure how the rules in robots.txt files are handled by the crawler. Three options are available:

  • Obey all robots.txt rules for the configured user agent;
  • Crawl URIs if robots.txt allows any user agent to crawl it;
  • Ignore all robots.txt rules.
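
For example, given a robots.txt containing only the following rule (with a hypothetical user agent name), the first policy would skip /private/ if the crawler is configured as MyCrawler, the second policy would still crawl it because other user agents are not disallowed, and the third policy would ignore the rule entirely:

User-agent: MyCrawler
Disallow: /private/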

Sitemap Crawling Strategy

In order to use sitemaps according to the Sitemaps.org protocol, check “Delta Crawling” and enter the site’s root sitemap as the crawling root.

In this scenario the crawler retrieves only the web pages listed in the sitemap. The lastmod property of a sitemap URL entry is compared with the modification date of the already indexed web page (if one exists). Furthermore, the changefreq property is taken into account between crawl runs. With a precise sitemap, a high-frequency recrawling strategy can be employed.
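
A sitemap URL entry using these properties might look like this (illustrative values, following the Sitemaps.org protocol):

<url>
  <loc>http://myserver.mycompany.com/news/article1.html</loc>
  <lastmod>2016-02-11T13:11:14.07Z</lastmod>
  <changefreq>hourly</changefreq>
</url>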

Two “Delta Crawling” options are available:

  • “Sitemap-based Incomplete”: with this option enabled, the URL entries from the configured sitemaps are crawled and already indexed URLs that are not found in the sitemaps are left in the index.
  • “Sitemap-based Complete”: with this option enabled, the URL entries from the configured sitemaps are crawled and already indexed URLs that are not found in the sitemaps are deleted from the index.
  • The option “Use Stream Sitemap Parser” enables stream-based parsing of the sitemaps. This is more memory-efficient for large sitemap XMLs but is also less tolerant of XML errors in the sitemaps.
  • The option “Sitemap Metadata Prefix” adds the configured prefix to each metadata item extracted from the sitemap.

Default Content Type

The option “Default Content Type” can be used to set the MIME type for documents whose MIME type cannot be extracted from the HTTP header.
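
For example, text/html could be entered here as a fallback if most documents delivered without a Content-Type header are HTML pages (an illustrative choice, not a default).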

Resource Parameters

This section is visible only when the “Advanced Settings” mode is activated on the “Indices” tab.

The “Number of Crawler Threads” setting defines the number of parallel threads that fetch web pages. By default, 5 threads are configured.

In the “Request Interval” field you can set the minimum delay between consecutive requests of a crawler thread in milliseconds. The default value is 250. If a “Crawl-Delay” setting is used in a web site’s robots.txt, that value overrules the “Request Interval” value.
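
For example, a robots.txt entry such as the following (illustrative; the Crawl-Delay value is conventionally interpreted in seconds) would take precedence over the configured “Request Interval”:

User-agent: *
Crawl-delay: 2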

Proxy

In the “Proxy” section you can enter a proxy server if required in your infrastructure. To do this, enter the computer name and the port of the proxy server in “Proxy Host” and “Proxy Port”.

The Web Connector can authenticate against the proxy server using HTTP Basic authentication. Enter the user in the “Proxy User” field and the password in “Proxy Password” if connections have to be made through an authenticating proxy.

Filter Configuration

The filter can be configured using several environment variables, which are described in the following sections.

Cache Settings for Thumbnail Generation

The variable MES_THUMBNAIL_CACHE_LOCATION specifies a directory for the network cache used during thumbnail generation. The maximum cache size can be defined using the variable MES_THUMBNAIL_CACHE_SIZE_MB. A cache is only created and used if both variables are set.

Example (Linux):

export MES_THUMBNAIL_CACHE_LOCATION=/tmp/thumbcache

export MES_THUMBNAIL_CACHE_SIZE_MB=20

On Windows, the variables can be defined in the Control Panel.

Timeout Settings for Thumbnail Generation

Using the variable MES_THUMBNAIL_TIMEOUT, the timeout for thumbnail generation can be redefined. Otherwise, the default value of 50 seconds is used.

Example (Linux):

export MES_THUMBNAIL_TIMEOUT=10

On Windows, the variables can be defined in the Control Panel.

Extract Main Content with an Alternative Filter Mode

Crawling sites such as news portals often indexes useless content like menus or footers. The HTML filter can be switched to an alternative mode that uses heuristics to index only the meaningful main content.

There are multiple ways to configure this:

Filter Plugin Properties

Click on the “Filters” tab and activate “Advanced Settings”.

In the section “Global Filter Plugin Properties” select “FilterPlugin.JerichoWithThumbnails(…)” and click on “Add”.

Expand the new entry “FilterPlugin.JerichoWithThumbnails” and set the value of “Use Boilerpipe Extractor” to “Article”.

Finally reindex.

Datasource XPath Metadata

In the “Indices” tab, activate “Advanced Settings”.

Expand the affected index. In the section “Data Source”, subsection “Extract Metadata”, click on the plus symbol “Add Composite Property”.

Within the new section “Extract Metadata”, enter htmlfilter:extractor as the name and "Article" as the XPath (important: with double quotes).

Finally reindex.

Ignore empty HTML Elements

If documents with empty HTML elements appear in the index, you can define a regular expression to remove these elements during filtering.

There are multiple ways to configure this:

Filter Plugin Properties

Click on the “Filters” tab and activate “Advanced Settings”.

In the section “Global Filter Plugin Properties”, select “FilterPlugin.JerichoWithThumbnails(…)” and click on “Add”.

Expand the new entry “FilterPlugin.JerichoWithThumbnails” and set the value of “Ignore Empty Tags Pattern” to e.g. “^(ul|li|a|div)$”. This means that the HTML elements ul, li, a and div will be removed if they are empty.

Finally reindex.

Datasource XPath Metadata

In the “Indices” tab, activate “Advanced Settings”.

Expand the affected index. In the section “Data Source”, subsection “Extract Metadata”, click on the plus symbol “Add Composite Property”.

In the new section “Extract Metadata”, enter htmlfilter:ignoreEmptyCharactersElementTagsPattern as the name and "^(ul|li|a|div)$" as the XPath (important: with double quotes). This means that the HTML elements ul, li, a and div will be removed if they are empty.

Finally reindex.

Authorization

Configuring authorization settings is only possible with the “AuthorizedWeb” category. In order to configure these settings, the category of the data source must be changed from “Web” to “AuthorizedWeb”. The remaining settings in the “AuthorizedWeb” category are similar to those described in the previous chapters.

Configuring Access Rules

An Access Rule is defined by:

“Access Check Principal”, which should be either in username@domain or domain\username or distinguished name format for users or only in distinguished name format for user groups. Additionally a capture group from the selection pattern can be referenced here (See Access Rules[3]).

“Access Check Action”, which controls access type (Grant or Deny).

“Metadata Key for Selection”, which should be the name of a metadata item used by the access rule, or empty (selects all documents).

“Selection Pattern”, which should be a regular expression, or empty (matches all documents).
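
For example, a rule granting a group access to all documents of a certain department might look like this (all values are illustrative; the group and the metadata key are hypothetical and must exist in your environment):

Access Check Principal: CN=SalesTeam,OU=Groups,DC=mycompany,DC=com
Access Check Action: Grant
Metadata Key for Selection: department
Selection Pattern: ^Sales$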

Parallel Processing of URLs

The option “Use hashing queue assignment policy” enables hash-based distribution of input URLs to processing queues. The number of queues is determined by the “Parallel Queue Count” setting.

Without the option “Use hashing queue assignment policy”, URLs are distributed by hostname.

High priority removal of documents

At the end of a crawl run, inaccessible documents are removed from the index. Additionally, the option “Invalid document deletion schedule” can be used to define a schedule for removing inaccessible documents from the index independently of crawl runs.

For example, the schedule “0 */45 * * * ?” means a deletion run every 45 minutes.
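
Assuming a Quartz-style cron expression, the fields of this example are read as follows:

  • Seconds: 0
  • Minutes: */45
  • Hours: *
  • Day of month: *
  • Month: *
  • Day of week: ? (no specific value)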

This schedule is only active if the crawler schedule permits it.

The following documents will be removed:

  • Not found documents (HTTP Status 404, 310)
  • Redirections to those documents (e.g. HTTP Status 301, 307)

If “Cleanup non matching URLs from index” is activated, the following documents will additionally be removed:

  • Documents excluded by the “URL Exclude Pattern”
  • Redirections to those documents (e.g. HTTP Status 301, 307)

Mindbreeze Sitemap Extensions

If “Sitemap-based” delta crawling is used, crawling root URLs are processed as sitemaps. The Mindbreeze Web Connector supports sitemap extensions that allow the definition of ACL information and metadata.

Sitemaps with Access Control Lists (ACL)

ACL information can be defined for all <url> elements of a sitemap.

ACLs from sitemaps are not compatible with access check rules.
For example:

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:wstxns1="tag:mindbreeze.com,2008:/indexing/interface">
  <url xmlns:ns3="http://www.google.com/schemas/sitemap-news/0.9">
    <loc>http://myserver.mycompany.com</loc>
    <lastmod>2016-02-11T13:11:14.07Z</lastmod>
    <priority>0.0</priority>
    <wstxns1:acl>
      <wstxns1:grant>User1</wstxns1:grant>
      <wstxns1:deny>User2</wstxns1:deny>
    </wstxns1:acl>
  </url>
</urlset>

Access is allowed for User1 but denied for User2.

The role “everyone” automatically includes every user.
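
A minimal sketch of an ACL that grants access to all users, assuming the “everyone” role is referenced like a regular principal in a grant element:

    <wstxns1:acl>
      <wstxns1:grant>everyone</wstxns1:grant>
    </wstxns1:acl>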

Sitemaps with Metadata

It is also possible to define metadata for <url> elements. For example:

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:wstxns1="tag:mindbreeze.com,2008:/indexing/interface">
  <url xmlns:ns3="http://www.google.com/schemas/sitemap-news/0.9">
    <loc>http://myserver.mycompany.com</loc>
    <lastmod>2016-02-11T13:11:14.07Z</lastmod>
    <priority>0.0</priority>
    <wstxns1:meta key="title">
      <wstxns1:value>Page Title</wstxns1:value>
    </wstxns1:meta>
  </url>
</urlset>

Each “meta” element can have multiple “value” sub-elements if the metadata has a list of values:

    <wstxns1:meta key="telefonnummer">
      <wstxns1:value>1234234245</wstxns1:value>
      <wstxns1:value>1234234344</wstxns1:value>
    </wstxns1:meta>


Appendix A

Heritrix Status Codes

Each crawled URI gets a status code.  This code (or number) indicates the result of a URI fetch in Heritrix. Codes ranging from 200 to 599 are standard HTTP response codes. Other Heritrix status codes are listed below.

Code      Meaning
1         Successful DNS lookup
0         Fetch never tried (perhaps protocol unsupported or illegal URI)
-1        DNS lookup failed
-2        HTTP connect failed
-3        HTTP connect broken
-4        HTTP timeout
-5        Unexpected runtime exception. See runtime-errors.log.
-6        Prerequisite domain-lookup failed, precluding fetch attempt.
-7        URI recognized as unsupported or illegal.
-8        Multiple retries failed, retry limit reached.
-50       Temporary status assigned to URIs awaiting preconditions. Appearance in logs may be a bug.
-60       URIs assigned a failure status; they could not be queued by the Frontier and may be unfetchable.
-61       Prerequisite robots.txt fetch failed, precluding a fetch attempt.
-62       Some other prerequisite failed, precluding a fetch attempt.
-63       A prerequisite (of any type) could not be scheduled, precluding a fetch attempt.
-404      Empty HTTP response interpreted as a 404.
-3000     Severe Java error condition occurred during URI processing, such as OutOfMemoryError or StackOverflowError.
-4000     "Chaff" detection of traps/content with negligible value applied.
-4001     The URI is too many link hops away from the seed.
-4002     The URI is too many embed/transitive hops away from the last URI in scope.
-5000     The URI is out of scope upon reexamination. This only happens if the scope changes during the crawl.
-5001     Blocked from fetch by user setting.
-5002     Blocked by a custom processor.
-5003     Blocked due to exceeding an established quota.
-5004     Blocked due to exceeding an established runtime.
-6000     Deleted from Frontier by user.
-7000     Processing thread was killed by the operator. This could happen if a thread is in a non-responsive condition.
-9998     Robots.txt rules precluded fetch.