Google Search Appliance Feed Indexing with Mindbreeze InSpire

Configuration and Indexing

Copyright ©

Mindbreeze GmbH, A-4020 Linz, 2018.

All rights reserved. All hardware and software names are brand names and/or trademarks of their respective manufacturers.

These documents are strictly confidential. The submission and presentation of these documents does not confer any rights to our software, our services and service outcomes or other protected rights.

The dissemination, publication or reproduction hereof is prohibited.

For ease of readability, gender differentiation has been waived. Corresponding terms and definitions apply within the meaning and intent of the equal treatment principle for both sexes.


Google Search Appliance feeds

The Mindbreeze InSpire GSA feed adapter makes it possible to index Google Search Appliance feeds with Mindbreeze InSpire.

The feed is an XML file that contains URLs. A feed can also include the contents of the documents, metadata and additional information such as the date of last modification. The XML file must conform to the schema defined by gsafeed.dtd. This file is located on the Google Search Appliance at http://<APPLIANCE-Host-Name>:7800/gsafeed.dtd.
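
The structure of such a feed can be sketched with Python's standard library. This is a minimal illustration, not a complete feed: the data source name "example_source" and the URL are assumptions, and a real feed may additionally need the XML declaration and the DOCTYPE reference to gsafeed.dtd, which are omitted here.

```python
import xml.etree.ElementTree as ET


def build_feed(datasource, feedtype, urls):
    """Build a minimal GSA-style feed document that lists the given URLs."""
    root = ET.Element("gsafeed")
    header = ET.SubElement(root, "header")
    ET.SubElement(header, "datasource").text = datasource
    ET.SubElement(header, "feedtype").text = feedtype
    group = ET.SubElement(root, "group")
    for url in urls:
        # One record per URL; the mimetype is an illustrative default.
        ET.SubElement(group, "record", {"url": url, "mimetype": "text/html"})
    return ET.tostring(root, encoding="unicode")


# "example_source" and the URL are placeholders chosen for this sketch.
feed_xml = build_feed("example_source", "metadata-and-url",
                      ["http://www.example.com/index.html"])
print(feed_xml)
```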

The GSA feed XML documents should be sent to the GSA feed adapter service port with an HTTP POST request. If the feed has been received and processed successfully, the service responds with HTTP status 200 and the text "Success".

An example of a POST request with curl:

curl -X POST -F "data=@<feed_xml_path>" http://<mindbreeze_inspire_server>:19900/xmlfeed
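
The same upload can be sketched in Python using only the standard library. The form field name "data" and the port are taken from the curl call above; the host name, the file name "feed.xml" and the helper names are assumptions made for this example.

```python
import urllib.request
import uuid


def build_multipart(field_name, filename, payload):
    """Encode one file field as multipart/form-data, similar to curl -F."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; '
        f'filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def send_feed(endpoint, feed_path):
    """POST the feed file and return (HTTP status, response text)."""
    with open(feed_path, "rb") as handle:
        body, content_type = build_multipart("data", "feed.xml", handle.read())
    request = urllib.request.Request(
        endpoint, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return response.status, response.read().decode("utf-8")


# Usage (needs a reachable adapter service; the host name is a placeholder):
# status, text = send_feed("http://mindbreeze.example.com:19900/xmlfeed", "feed.xml")
```

According to the description above, a successful upload should return status 200 and the text "Success".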

Storing feeds

The GSA feeds received can be stored for a configured time interval. To do this, enable the “Enable Feed Storage” option in the “Feed Storage Settings” section. This option should be enabled by default.

If no directory is configured for feed storage using “Override Feed Storage Directory”, the feeds are stored in /data/messervicedata/<serviceuid>/.

You can use “Schedule for cleaning up old feeds” to determine how often the outdated feeds should be deleted. A Quartz Cron expression must be used for the configuration.
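
A Quartz cron expression consists of six mandatory fields (seconds, minutes, hours, day of month, month, day of week) and an optional year field. The following sketch only names the fields of an expression and checks the field count; it is not a full validator, and the helper name is an assumption made for this example.

```python
QUARTZ_FIELDS = ["seconds", "minutes", "hours", "day_of_month",
                 "month", "day_of_week", "year"]


def describe_quartz_cron(expression):
    """Map each field of a Quartz cron expression to its field name."""
    parts = expression.split()
    if len(parts) not in (6, 7):
        raise ValueError("a Quartz cron expression has 6 or 7 fields")
    return dict(zip(QUARTZ_FIELDS, parts))


# "0 0 3 * * ?" fires every day at 03:00:00.
print(describe_quartz_cron("0 0 3 * * ?"))
```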

Basic configuration of the GSA Feed Adapter Service

Open the “Indices” tab in the Mindbreeze configuration and add a new service with the “Add new Service” icon.

Set the “Display Name” and select the service type “GSAFeedAdapter”.

Service Settings:

  • GSA Feed Adapter Service Port: The HTTP port to which the feed documents can be sent
  • “Accept Trusted Requests Only”: If enabled, feeds will only be accepted from IP addresses that are configured in “Trusted Addresses”.
  • “Trusted Addresses”: Contains a list of trusted IP addresses. Wildcards are also supported, for example: 192.168.116.*
  • “Following Patterns”: Link patterns that are to be followed
  • “Do Not Follow Patterns”: Link patterns that should not be followed
  • Document Dispatcher Thread Count: Number of threads that process the downloaded documents and forward them to the Mindbreeze indices.
  • Web Crawler Thread Count: Number of threads that visit URLs and download documents.
  • Web Crawler Queue Size: Size of the web crawler document queue
  • User Agent: A user agent can be configured here. The configured user agent is used for all HTTP requests.
  • Ignore robots.txt Rules for Matching URLs: A regular expression that determines the URLs for which the robots.txt rules are ignored.
  • Minimum Delay in Milliseconds Between Consecutive HTTP Requests: Minimum number of milliseconds between consecutive HTTP requests.
  • Maximum Delay in Milliseconds Between Consecutive HTTP Requests: Maximum number of milliseconds between consecutive HTTP requests.
  • Try to Parse Record Metadata as Date Values: If enabled, the service attempts to extract date values from the feed metadata using the Java date formats configured in “Parsable Date Formats (Ordered)”. The locale used for parsing can be set in “Locale String for Date Parsing” as an IETF BCP 47 locale.
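
The effect of a wildcard in “Trusted Addresses” can be illustrated with a short sketch. How the service evaluates wildcards internally is not specified here; the example assumes simple shell-style matching, and the function name is hypothetical.

```python
from fnmatch import fnmatchcase


def is_trusted(remote_ip, trusted_patterns):
    """Return True if the address matches any trusted pattern.

    Assumption: "*" matches any sequence of characters (shell-style).
    """
    return any(fnmatchcase(remote_ip, pattern) for pattern in trusted_patterns)


trusted = ["192.168.116.*", "10.0.0.5"]
print(is_trusted("192.168.116.42", trusted))
print(is_trusted("192.168.117.42", trusted))
```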

The “Following Patterns” and “Do Not Follow Patterns” are defined with the syntax of Google URL patterns.
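
The interplay of the two pattern lists can be sketched as follows. The matcher covers only a small subset of the Google URL pattern syntax (plain substrings, "^" and "$" anchors, and the "regexp:" and "contains:" prefixes). The rule that a “Do Not Follow” match takes precedence is an assumption made for this example, as are the function names.

```python
import re


def matches(pattern, url):
    """Check a URL against one pattern (subset of Google URL patterns)."""
    if pattern.startswith("regexp:"):
        return re.search(pattern[len("regexp:"):], url) is not None
    if pattern.startswith("contains:"):
        return pattern[len("contains:"):] in url
    anchored_start = pattern.startswith("^")
    anchored_end = pattern.endswith("$")
    literal = pattern[(1 if anchored_start else 0):(-1 if anchored_end else None)]
    if anchored_start and anchored_end:
        return url == literal
    if anchored_start:
        return url.startswith(literal)
    if anchored_end:
        return url.endswith(literal)
    return literal in url


def should_follow(url, follow_patterns, do_not_follow_patterns):
    """Assumption: an exclusion match wins over an inclusion match."""
    if any(matches(p, url) for p in do_not_follow_patterns):
        return False
    return any(matches(p, url) for p in follow_patterns)


follow = ["^http://www.example.com/"]
do_not_follow = [".pdf$"]
print(should_follow("http://www.example.com/docs/a.html", follow, do_not_follow))
print(should_follow("http://www.example.com/docs/a.pdf", follow, do_not_follow))
```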

Collections and destination mappings

You can define URL groups using URL patterns in the “Collections” section of the GSA feed adapter service configuration. A document can belong to several collections, but is indexed only once per category and category instance.

The names of all collections containing a document are stored in the “collections” metadata.

If a regular expression is set as the “Enforce Extension from URL if Matches” parameter, the extension for documents with matching URLs is derived from the URL instead of from the “Content-Type” HTTP header.

If a large number of collections or additional collection properties are required, the collections can also be defined with a CSV-formatted configuration. Collection configuration can be input from a file or configured directly as text.

The collection configuration must contain a CSV header with the properties “collectionname” and “collectionpattern”. Further properties can also be defined, for example “property1”, “property2” and “property3”.

The CSV lines are grouped by “collectionname”. To define a collection with several URL patterns, add one line per pattern, each with the same collection name.
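
The grouping by “collectionname” can be sketched as follows. The semicolon delimiter, the collection names, the URL patterns and the property values are assumptions made for this example; only the header names “collectionname”, “collectionpattern” and “property1” come from the description above.

```python
import csv
import io
from collections import defaultdict

# Hypothetical configuration; delimiter and values are illustrative only.
CONFIG = """collectionname;collectionpattern;property1
docs;http://www.example.com/docs/;internal
docs;http://docs.example.com/;internal
news;http://www.example.com/news/;public
"""


def load_collections(text, delimiter=";"):
    """Group URL patterns and additional properties by collection name."""
    collections = defaultdict(lambda: {"patterns": [], "properties": {}})
    for row in csv.DictReader(io.StringIO(text), delimiter=delimiter):
        entry = collections[row["collectionname"]]
        entry["patterns"].append(row["collectionpattern"])
        for key, value in row.items():
            if key not in ("collectionname", "collectionpattern") and value:
                # Keep the first value seen for each property.
                entry["properties"].setdefault(key, value)
    return dict(collections)


collections = load_collections(CONFIG)
print(collections["docs"]["patterns"])
```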




To be able to index URLs in the collections, at least one destination mapping must be defined. To do this, click the “Add Composite Property” icon in the “Destination Mapping” section.

Reference one or more collections in the newly added destination mapping (“Collection Pattern”). You can specify a regular expression that matches the desired collection name.

In addition, the category used for indexing and the category instance properties have to be defined here. For example, for web content, the category is “Web”. The “Category Instance” can be freely chosen. The category instance can contain references to defined collection properties.

Lastly, specify the index URL and the filter URL to which the data should be sent.

With “Mindbreeze Dispatcher Thread Count” you can determine the number of threads that send documents to the configured index and filter services.

Metadata extraction

If documents are indexed with the GSA facade service, it is possible to define user-customized metadata for documents in several ways:

  • metadata defined in the feed,
  • metadata added by the HTTP header,
  • in the case of HTML documents, user-customized metadata from the content,
  • robots meta tags.

Metadata and URL feeds

The metadata defined in the URL records is automatically extracted and indexed. In this example, the metadata “meta-keywords”, “targetgroup” and “group” are indexed.

<record url="" action="add" mimetype="text/html" lock="true" crawl-immediately="true">
  <metadata>
    <meta name="meta-keywords" content="Newsletter, My Company"/>
    <meta name="targetgroup" content="fachkunde"/>
    <meta name="group" content="public"/>
  </metadata>
</record>

The metadata of the records is only indexed for the record URLs and not for the subpages.

HTTP headers

In addition to metadata from the URL records, metadata is extracted from the X-gsa-external-metadata HTTP header for all URLs. The header contains a comma-separated list of values, each of the form meta-name=meta-value. The "meta-name" and "meta-value" elements are URL encoded.
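
The header format described above can be decoded as in the following sketch. The header value and the function name are made up for this example. The result is a list of pairs rather than a dictionary so that a metadata name may occur more than once.

```python
from urllib.parse import unquote


def parse_external_metadata(header_value):
    """Parse an X-gsa-external-metadata header into (name, value) pairs."""
    pairs = []
    for item in header_value.split(","):
        item = item.strip()
        if not item:
            continue
        # Split on the first "=" only; both sides are URL encoded.
        name, _, value = item.partition("=")
        pairs.append((unquote(name), unquote(value)))
    return pairs


header = "department=Sales%20%26%20Marketing,keywords=news%2Cpress"
print(parse_external_metadata(header))
```

Because commas and equals signs inside names and values are URL encoded, splitting on the literal characters before decoding is safe.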

ACL

The Mindbreeze InSpire GSA Feed Adapter supports ACLs defined in feeds with the following restrictions:

  • ACLs have to be set per record.
  • ACL inheritance is not supported.
  • ACLs from the X-google-acl headers are not supported.

Metadata extraction from the content

It is also possible to extract user-customized metadata from the content for HTML documents, similar to the Mindbreeze Web Connector.

As with metadata mapping, it is also possible to define "content extractors" and "metadata extractors" for URL collections.

A content extractor has a collection pattern in which a regular expression can be configured. The rules for content and title extraction are applied to all URLs of all matching collections.

Metadata extractors can also be defined for the collections. Here it is possible to extract user-customized metadata with different formats and formatting options.

The metadata extractors use XPath expressions for extracting textual content. The extracted text can then be processed according to its format and interpreted, for example, as a date.
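
The idea of extracting text with an XPath expression and interpreting it as a date can be sketched as follows. The page markup, the XPath expression and the function name are assumptions made for this example. Python's ElementTree supports only a subset of XPath and requires well-formed markup, and Python's date format syntax differs from the Java date formats used in the service configuration.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Hypothetical, well-formed page used only for illustration.
PAGE = """<html>
  <head><title>Newsletter</title></head>
  <body>
    <div id="content">
      <span class="published">2018-03-15</span>
      <p>Body text</p>
    </div>
  </body>
</html>"""


def extract_date(page, xpath, date_format):
    """Select one element by XPath and parse its text as a date."""
    node = ET.fromstring(page).find(xpath)
    if node is None or node.text is None:
        return None
    return datetime.strptime(node.text.strip(), date_format)


published = extract_date(PAGE, ".//span[@class='published']", "%Y-%m-%d")
print(published)
```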

Collection metadata

For each collection, you can define metadata that is set for all associated documents. The metadata values can contain references to defined collection properties; for example, the value of “meta2” can be set to the value of the collection property “property2”. A collection metadata entry also has a collection pattern in which a regular expression can be configured. The metadata is set on all documents of all matching collections.

The metadata can also contain references to the following URL components:

  • Hostname: {{urlhost}}
  • Port: {{urlport}}
  • Path: {{urlpath}}
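
How such references could resolve for a given document URL is sketched below. The fallback to the default port of the scheme when the URL has no explicit port is an assumption, as is the function name; the actual behavior of the service may differ.

```python
import re
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def expand_url_references(template, url):
    """Replace {{urlhost}}, {{urlport}} and {{urlpath}} in a template."""
    parts = urlsplit(url)
    values = {
        "urlhost": parts.hostname or "",
        # Assumption: use the scheme's default port if none is given.
        "urlport": str(parts.port or DEFAULT_PORTS.get(parts.scheme, "")),
        "urlpath": parts.path,
    }
    # Unknown references are left untouched.
    return re.sub(r"\{\{(\w+)\}\}",
                  lambda m: values.get(m.group(1), m.group(0)), template)


print(expand_url_references("{{urlhost}}:{{urlport}}{{urlpath}}",
                            "https://www.example.com/docs/a.html"))
```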

Collection ACL

Like collection metadata, it is possible to define ACL entries on the basis of a collection. The ACL principals can also contain references to collection properties. The ACL entries also have a “Collection Pattern” property which allows you to define the collections for which the ACL entries should be defined. Collection ACLs are only used if no feed ACL has been defined for the documents.

The ACL entries can contain references to the following URL components:

  • Hostname: {{urlhost}}
  • Port: {{urlport}}
  • Path: {{urlpath}}

URLs with multiple collections

If a document belongs to several collections using Collection Configuration, the collection metadata and collection ACL elements of the matching collections are merged.

It is also possible to define the “Category Instance” of the document according to the collection assignment or URL. For this purpose, the “Category Instance” property in the “Destination Mapping” configuration can contain references to collection properties and URL components.

Login settings

Form login and session administration with cookies can be defined for given URL patterns using a configuration in CSV format. The login configuration can be input from a file or configured directly as text.

The login configuration must begin with the following header:

urlpattern;actiontype;actionurl;logindata;followredirects;sessionlifetime
The login configuration lines contain login action definitions grouped by the “urlpattern” property.

As defined in the header, a login action has the following properties:

  • urlpattern: A Google URL pattern that specifies the URLs to which the action is applied.
  • actiontype: The login action type. Supported values are: GET, POST, AWS_SIGN.
  • actionurl: The URL to which the login request is sent (for GET and POST).
  • logindata: Additional login data (form content for POST or application credentials for AWS_SIGN).
  • followredirects: “true” or “false”. Determines whether HTTP redirects are followed automatically.
  • sessionlifetime: The session lifetime in seconds (the first value per urlpattern applies).

The supported login action types are:

  • POST: HTTP POST request to a URL with defined form content. The form content must be URL-form encoded. Example:
    ;POST; /dologin.action;os_username=user&os_password=mypassword&login=Anmelden&os_destination=%2Findex.action;false;60
  • GET: HTTP GET request to a URL. Example:
    ;GET; /sessionvalidator;;false;60
  • AWS_SIGN: Amazon Web Services Signature Version 4 for Amazon REST URLs. Example:
    ;AWS_SIGN;;eu-central-1:<Access Key ID>:<Secret Key>;false;0

If you want to define multiple login actions for one URL pattern, set the same “urlpattern” for each of these login actions:

urlpattern;actiontype;actionurl;logindata;followredirects;sessionlifetime
;POST; /dologin.action;username=user&password=mypassword;false;60
;GET; /sessionvalidator;;false;60
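
The grouping of login actions by “urlpattern” can be sketched as follows. The host name "intranet.example.com" and the complete URLs are placeholders invented for this example; the header and the semicolon delimiter come from the configuration format described above.

```python
import csv
import io

# Hypothetical configuration; host name and URLs are placeholders.
LOGIN_CONFIG = """urlpattern;actiontype;actionurl;logindata;followredirects;sessionlifetime
http://intranet.example.com/;POST;http://intranet.example.com/dologin.action;username=user&password=mypassword;false;60
http://intranet.example.com/;GET;http://intranet.example.com/sessionvalidator;;false;60
"""


def load_login_actions(text):
    """Group login actions by urlpattern, keeping their configured order."""
    sessions = {}
    for row in csv.DictReader(io.StringIO(text), delimiter=";"):
        session = sessions.setdefault(row["urlpattern"], {
            # Only the first sessionlifetime per urlpattern is kept.
            "sessionlifetime": int(row["sessionlifetime"]),
            "actions": [],
        })
        session["actions"].append({
            "type": row["actiontype"],
            "url": row["actionurl"].strip(),
            "data": row["logindata"],
            "followredirects": row["followredirects"].strip().lower() == "true",
        })
    return sessions


sessions = load_login_actions(LOGIN_CONFIG)
print([action["type"] for action in sessions["http://intranet.example.com/"]["actions"]])
```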

Robots meta tag

The robots meta tag allows a detailed, page-specific approach to determine how a particular page should be indexed and displayed to users in the search results.

The robots meta tag is placed in the <head> section of the corresponding page:

<!DOCTYPE html>
<html>
<head>
<meta name="robots" content="noindex" />
</head>
<body>
</body>
</html>

The Mindbreeze InSpire GSA feed adapter service considers the following robots meta tag values:

  • noindex: This page is not indexed.
  • nofollow: The links on this page are not followed.
  • none: Equivalent to noindex, nofollow.
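
The evaluation of these three values can be sketched with Python's built-in HTML parser. The class and function names are assumptions made for this example, and the sketch does not claim to mirror the internal implementation of the service.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of all <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def robots_policy(html):
    """Return whether a page may be indexed and its links followed."""
    parser = RobotsMetaParser()
    parser.feed(html)
    found = parser.directives
    return {
        "index": not ({"noindex", "none"} & found),
        "follow": not ({"nofollow", "none"} & found),
    }


page = '<html><head><meta name="robots" content="noindex" /></head></html>'
print(robots_policy(page))
```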

Configuration of the index services

Click on the "Indices" tab and then click on the "Add new service" symbol to create an index (optional).

Enter the index path (in "Index Path"). If necessary, adjust the display name (in "Display Name") of the index service, the index parameters, and the associated filter service.

To create data sources for an index, under the section "Data Sources", click on "Add new custom source".

A data source should be configured here for all categories that are assigned to this index in the GSA feed adapter service (see Section 1.2). Since the data sources are only used for the search, the crawler should be disabled. To do this, activate the "Advanced Settings" mode and select the option "Disable Crawler" for the configured data sources.

GSA transformer

The GSA transformer enables the client service to understand Google Search Appliance XML queries and provide XML responses that are compatible with Google Search Appliance.

You can find more details under Google Search Appliance: Search protocol reference

Configuring the GSA transformer

The GSA transformer is configured in the client service. Here you can define the metadata that should always be delivered.

The plugin is first added under “API V2 Search Request Response Transformer” in the client service tab.

Features of the GSA XML search query

In addition, the GSA transformer supports the following new features of the GSA XML search queries:

  • Request fields
  • start
  • num
  • getfields
  • requiredfields
  • query operators
  • filter
  • paging
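
How the listed request fields combine into a query string can be sketched as follows. The parameter names follow the Google Search Appliance search protocol; the function name and the field names in the example are assumptions, and the host and path of the client service are deliberately left out because they depend on the installation.

```python
from urllib.parse import urlencode


def build_search_query(query, start=0, num=10, getfields=None, requiredfields=None):
    """Build the query string of a GSA-style XML search request."""
    params = {"q": query, "start": start, "num": num}
    if getfields:
        # Several metadata names are separated by "." in getfields.
        params["getfields"] = ".".join(getfields)
    if requiredfields:
        params["requiredfields"] = requiredfields
    return urlencode(params)


# Second page with five results per page: start = page index * num.
print(build_search_query("newsletter", start=5, num=5,
                         getfields=["targetgroup", "group"]))
```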