Mindbreeze GmbH, A-4020 Linz, 2018.
All rights reserved. All hardware and software names are brand names and/or trademarks of their respective manufacturers.
These documents are strictly confidential. The submission and presentation of these documents does not confer any rights to our software, our services and service outcomes or other protected rights.
The dissemination, publication or reproduction hereof is prohibited.
For ease of readability, gender differentiation has been waived. Corresponding terms and definitions apply within the meaning and intent of the equal treatment principle for both sexes.
The Mindbreeze InSpire GSA feed adapter makes it possible to index Google Search Appliance feeds with Mindbreeze InSpire.
The feed is an XML file that contains URLs. A feed can also include the contents of the documents, metadata, and additional information such as the date of last modification. The XML file must conform to the pattern defined by gsafeed.dtd. This file is located on the Google Search Appliance at http://<APPLIANCE-Host-Name>:7800/gsafeed.dtd.
The GSA feed XML documents are sent by an HTTP POST request to the GSA feed adapter service port. If the feed has been received and processed successfully, the service responds with the text "Success" and HTTP status 200.
An example of a POST request with curl:
curl -X POST -F "data=@<feed_xml_path>" http://<mindbreeze_inspire_server>:19900/xmlfeed
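For reference, a minimal incremental feed following the gsafeed.dtd structure might look like this (the data-source name and the record URL are illustrative):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE gsafeed PUBLIC "-//Google//DTD GSA Feeds//EN" "gsafeed.dtd">
<gsafeed>
  <header>
    <datasource>intranet</datasource>
    <feedtype>incremental</feedtype>
  </header>
  <group>
    <record url="http://website.mycompany.com/newsletter" action="add" mimetype="text/html"/>
  </group>
</gsafeed>
```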
The received GSA feeds can be stored for a configured time interval. To do this, enable the "Enable Feed Storage" option in the "Feed Storage Settings" section. This option is enabled by default.
If no directory is configured for feed storage using “Override Feed Storage Directory”, the feeds will be stored in /data/messervicedata/<serviceuid>/
You can use “Schedule for cleaning up old feeds” to determine how often the outdated feeds should be deleted. A Quartz Cron expression must be used for the configuration.
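A Quartz cron expression has six or seven fields (seconds, minutes, hours, day of month, month, day of week, and optionally year). For example, the following expression (an illustrative value, not a default) would run the cleanup every day at 03:00:

```
0 0 3 * * ?
```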
Open the “Indices” tab in the Mindbreeze configuration and add a new service with the “Add new Service” icon.
Set the “Display Name” and select the service type “GSAFeedAdapter”.
You can define URL groups using URL patterns in the “Collections” section of the GSA feed adapter service configuration. A document can belong to several collections, but is indexed only once per category and category instance.
The names of all collections containing a document are stored in the “collections” metadata.
If a regular expression is set in the "Enforce Extension from URL if Matches" parameter, the extension for documents with matching URLs is derived from the URL instead of from the "Content-Type" HTTP header.
If a large number of collections or additional collection properties are required, the collections can also be defined with a CSV-formatted configuration. Collection configuration can be input from a file or configured directly as text.
The collection configuration must contain a CSV header with the properties "collectionname" and "collectionpattern". Further properties can also be defined, as in our example: "property1", "property2" and "property3".
The CSV lines are grouped by "collectionname". If you want to define a collection with several URL patterns, you can use the following syntax:
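As a sketch (all collection names, URL patterns and property values are illustrative), a collection with several URL patterns is defined by repeating its "collectionname" in consecutive lines:

```
collectionname,collectionpattern,property1,property2,property3
news,http://website\.mycompany\.com/news/.*,val1,val2,val3
news,http://website\.mycompany\.com/press/.*,val1,val2,val3
docs,http://website\.mycompany\.com/docs/.*,valA,valB,valC
```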
To be able to index URLs in the collections, at least one destination mapping must be defined. To do this, click the “Add Composite Property” icon in the “Destination Mapping” section.
Reference one or more collections in the newly added destination mapping (“Collection Pattern”). You can specify a regular expression that matches the desired collection name.
In addition, the category used for indexing and the category instance properties have to be defined here. For example, for web content, the category is “Web”. The “Category Instance” can be freely chosen. The category instance can contain references to defined collection properties, parts of the URL and feed parameters.
Lastly, an index URL and a filter URL must be defined, to which the data is sent.
With “Mindbreeze Dispatcher Thread Count” you can determine the number of threads that send documents to the configured index and filter services.
Full feed destination mappings are similar to collection-based destination mappings. If full content feeds should be indexed, the feed data source must be defined as a category instance in a Mindbreeze index.
The GSA full content feeds contain all documents from a data source, defined with the feed “data source” property. All documents that are not in the feed and were previously indexed in this data source are deleted.
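The data source of a feed is declared in the feed header; a full content feed for an assumed data source "intranet" would declare:

```xml
<header>
  <datasource>intranet</datasource>
  <feedtype>full</feedtype>
</header>
```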
A full feed destination mapping has the following attributes:
If no full feed destination mappings are defined, all feeds are treated as incremental.
If documents are indexed with the GSA facade service, it is possible to define user-customized metadata for documents in several ways:
metadata defined in the feed,
metadata added by the HTTP header,
in the case of HTML documents, user-customized metadata from the content,
robots meta tags.
The metadata defined in the URL records is automatically extracted and indexed. In this example, the metadata "meta-keywords", "targetgroup" and "group" are indexed.
<record url="http://website.mycompany.com/newsletter" action="add" mimetype="text/html" lock="true" crawl-immediately="true">
  <metadata>
    <meta name="meta-keywords" content="Newsletter, My Company"/>
    <meta name="targetgroup" content="fachkunde"/>
    <meta name="group" content="public"/>
  </metadata>
</record>
The metadata of the records is only indexed for the record URLs and not for the subpages.
In addition to the metadata from the feed records, metadata is extracted from the X-gsa-external-metadata HTTP header for all URLs. The header contains a comma-separated list of values of the form meta-name=meta-value. The "meta-name" and "meta-value" elements are URL-encoded.
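The URL encoding of the header values can be sketched in Python; the helper function name is illustrative and not part of the adapter:

```python
from urllib.parse import quote

def encode_gsa_metadata(pairs):
    """Build an X-gsa-external-metadata header value from (name, value)
    pairs: each name and value is URL-encoded, joined with "=", and the
    entries are comma-separated."""
    return ",".join(
        "%s=%s" % (quote(name, safe=""), quote(value, safe=""))
        for name, value in pairs
    )

header = encode_gsa_metadata([("group", "public"), ("target group", "fachkunde")])
print(header)  # group=public,target%20group=fachkunde
```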
The Mindbreeze InSpire GSA feed adapter supports ACLs from feeds with the following constraints:
ACLs are extracted from X-Gsa-Doc-Controls HTTP headers. Only ACLs set by URL are supported here.
Please note: Documents that have inherited ACLs in X-Gsa-Doc-Controls headers are not indexed by default. If these documents should also be indexed, the configuration option "Index Documents with Partial Header ACL" must be enabled.
It is also possible to extract user-customized metadata from the content for HTML documents, similar to the Mindbreeze Web Connector.
As with metadata mapping, it is also possible to define "content extractors" and "metadata extractors" for URL collections.
A content extractor has a collection pattern in which a regular expression can be configured. The rules for content and title extraction are applied to all URLs of all matching collections.
Metadata extractors can also be defined for the collections. Here it is possible to extract user-customized metadata with different formats and formatting options.
The metadata extractors use XPath expressions to extract textual content. The extracted content can then be processed in a format-specific way and interpreted, for example, as a date.
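As an illustration of the principle (not the adapter's actual configuration syntax), the following Python sketch selects a meta element with an XPath-style expression and interprets its value as a date; the element name, attribute names and date format are assumptions:

```python
import xml.etree.ElementTree as ET
from datetime import datetime

html = (
    "<html><head>"
    "<meta name='published' content='2018-03-01'/>"
    "</head><body/></html>"
)

# Locate the meta element with an XPath-style expression
# and read its "content" attribute.
root = ET.fromstring(html)
value = root.find(".//meta[@name='published']").get("content")

# Interpret the extracted text as a date (the format is an assumption).
published = datetime.strptime(value, "%Y-%m-%d")
print(value, published.year)
```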
For each collection, you can define metadata that is set for all associated documents. The metadata values can contain references to the defined collection properties. In the following example, the value for "meta2" is set to the value of the property "property2" of the collection. A collection metadata entry also has a collection pattern in which a regular expression can be configured. The metadata is set on all documents of all matching collections.
The metadata can also contain references to the following URL component and feed parameter:
Like collection metadata, it is possible to define ACL entries on the basis of a collection. The ACL principals can also contain references to collection properties. The ACL entries also have a “Collection Pattern” property which allows you to define the collections for which the ACL entries should be defined. Collection ACLs are only used if no feed ACL has been defined for the documents.
The ACL entries can contain references to the following URL component and feed parameter:
If a document belongs to several collections using Collection Configuration, the collection metadata and collection ACL elements of the matching collections are merged.
It is also possible to define the “Category Instance” of the document according to the collection assignment or URL. For the Category Instance property in the Destination Mapping configuration, it is also possible to use the references to the collection properties and URL components, as shown in this example:
Form login and session administration with cookies can be defined for given URL patterns using a configuration in CSV format. The login configuration can be input from a file or configured directly as text.
The login configuration must begin with the following header:
The login configuration lines contain login action definitions grouped with the “urlpattern” property.
As defined in the header, a login action has the following properties:
The supported login action types are:
If you want to define multiple login actions for one URL pattern, you have to set the same “loginpattern” for the login actions.
The robots meta tag is placed in the <head> section of the corresponding page:
<meta name="robots" content="noindex" />
The Mindbreeze InSpire GSA feed adapter service considers the following robots meta tag values:
noindex: The page is not indexed.
nofollow: The links on this page are not followed.
none: Equivalent to noindex, nofollow.
At https://<host>:8443/cache/<port>/collections you can retrieve statistics for the collections. "host" stands for the host name of the Mindbreeze InSpire server; "port" can be configured in the GSA feed adapter service in the "Service Settings" using the option "Collection Statistics Port" – the default value is 23850 (if the text field is left empty).
If you call up the URL listed above in the browser, you will receive the following user interface in the browser.
This triggers a calculation of the statistics. If you click the "Try download" button, an attempt is made to download the statistics (the button links to the same https://<host>:8443/cache/<port>/collections URL). If the calculation is not yet complete, you instead receive a status report on its progress. During the calculation, the so-called crawler status is retrieved for all documents – depending on the number of documents in the index, this can take several minutes or even hours. The statistics are then compiled; the total number of documents to be processed (in the example shown, 4988) is only an estimate. To track the current status of the calculation, the "Try download" button can be clicked again as often as you like.
When the calculation is finished and you click the “Try download” button, the CSV statistics will be downloaded. The CSV file can then be downloaded as often as you want. The CSV file contains one line for each collection. The downloaded file contains the following columns:
Name of the collection
Category (configured in the corresponding destination mapping)
Category instance (configured in the corresponding destination mapping)
Index URL (configured in the corresponding destination mapping)
Filter URL (configured in the corresponding destination mapping)
Number of documents in the collection
Timestamp (format in ISO 8601) of the time/date from which the data originated
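A downloaded statistics file could look like the following sketch (the column header names and all values are illustrative):

```
collection,category,categoryinstance,indexurl,filterurl,documents,timestamp
news,Web,intranet-news,http://localhost:23000,http://localhost:23100,1234,2018-03-01T10:15:00Z
docs,Web,intranet-docs,http://localhost:23000,http://localhost:23100,3754,2018-03-01T10:15:00Z
```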
Clicking “Recalculate” will trigger a recalculation of the statistics. However, if a calculation is in progress, “Recalculate” has no effect. Using the “Try download” button, the CSV file can be downloaded as usual or, if the calculation is not yet finished, the status of the calculation can be queried.
The configured link for the “Recalculate” button is https://<host>:8443/cache/<port>/collections?datetime=. The parameter “datetime” can be used to specify a date (in ISO 8601 format) that determines how old the last statistic can be at most. If the last calculated statistic is newer than the specified date, the statistic will be downloaded. If the last calculated statistic is older than the specified date, a recalculation of the statistic will be triggered. If the parameter is left empty (i.e. /collections?datetime=), the current date is used and thus a recalculation of the statistic is always triggered.
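For example, the following request (host and port placeholders as above) would only trigger a recalculation if the last statistics are older than March 1, 2018; the -k flag skips certificate verification and may be needed for self-signed certificates:

```
curl -k "https://<host>:8443/cache/<port>/collections?datetime=2018-03-01T00:00:00Z"
```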
TIP: If a statistic has already been created and you re-enter the URL in the browser, the statistic will be downloaded directly. As a result, the user interface with the “Recalculate” and “Try download” buttons is not displayed. In order to be able to run a recalculation of the statistics anyway, you can use the datetime parameter as described above.
Click on the "Indices" tab and then click on the "Add new service" symbol to create an index (optional).
Enter the index path (in "Index Path"). If necessary, adjust the display name (in "Display Name") of the index service, the index parameters, and the associated filter service.
To create data sources for an index, under the section "Data Sources", click on "Add new custom source".
A data source should be configured here for each category that is assigned to this index in the GSA feed adapter service (see Section 1.2). Since the data sources are only used for the search, the crawler should be disabled. To do this, activate the "Advanced Settings" mode and select the "Disable Crawler" option for the configured data sources:
The GSA transformer enables the client service to understand Google Search Appliance XML queries and provide XML responses that are compatible with Google Search Appliance.
Requests can be sent to: http://appliance/plugin/GSA/search?q=<query>
The GSA transformer is configured in the client service. Here you can define the metadata that should always be delivered.
The plugin is first added under “API V2 Search Request Response Transformer” in the client service tab.
Regular expressions (regex) can be applied to the query string in order to set query constraints. The query constraints are themselves regular expressions, and back references are supported.
A possible use case is searching for documents with author IDs that are written with differing syntax, which you want to restrict further using constraints.
The following documents exist:
Document 1: author:abc_12,def_45 authorid:abc_12
Document 2: author:abc_12,def_45 authorid:abc/12
Document 3: author:abc_12,def_45 authorid:def_45
The following queries are sent:
Query 1: author:abc_12
Query 2: author:abc/12
Despite the differing syntax, both queries should only contain the following two documents:
Document 1: author:abc_12,def_45 authorid:abc_12
Document 2: author:abc_12,def_45 authorid:abc/12
The idea is to work with regex that use underscores or slashes as separators.
This requires you to configure three settings:
Set "Query Contraints Label" to authorid. Note: The metadata must be "regex matchable".
Set “Query Pattern” to author:(\S+)[_/](\S+)
Set “Regex Constraint” to \^\\Q$1\\E[_/]\\Q$2\\E\$
You can also search for non-existent metadata, e.g.
Query 1: writer:abc_1
Normally, this query does not return any results, since no document has the metadata writer. The plugin can also be configured to manipulate the query itself. To do this, the "Replace Matching Parts from Query" setting needs to be enabled and the setting "Replace Matching Parts from Query Value" must be set to ALL. This transforms the query as follows:
Query 1': ALL
Since the constraints are set as they were before, the correct documents are now delivered.
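The mechanics of the back-reference substitution can be sketched in Python (for illustration only – the actual matching is done by the plugin, the \Q…\E quoting is interpreted by the Mindbreeze regex engine, and the configured constraint additionally escapes ^ and $ as described below):

```python
import re

# Java-style query pattern and Mindbreeze-style constraint template
# from the example configuration above.
query_pattern = r"author:(\S+)[_/](\S+)"
constraint_template = r"^\Q$1\E[_/]\Q$2\E$"

def build_constraint(query):
    """Substitute the matched groups for $1, $2, ... in the template."""
    m = re.match(query_pattern, query)
    if m is None:
        return None
    constraint = constraint_template
    for i, group in enumerate(m.groups(), start=1):
        constraint = constraint.replace("$%d" % i, group)
    return constraint

print(build_constraint("author:abc/12"))  # ^\Qabc\E[_/]\Q12\E$
```

Both "author:abc_12" and "author:abc/12" produce the same constraint, so documents with either separator in authorid are matched.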
“Query Contraints Label”
Name for the query expression label (name of the metadata to be filtered). Note: The metadata must be “regex matchable.” The property must be defined in the category descriptor or in the aggregated metadata keys, otherwise the constraint will not work.
“Replace Matching Parts from Query”
If active, parts of the query that match will be replaced by a string. Default: inactive
"Replace Matching Parts from Query Value"
The value that replaces the matching parts. Default: empty. E.g. ALL
"Query Pattern"
Regular expression (Java) with which the query is matched. Groups may also be used. For instance: myLabel:(\S+)[_/](\S+)
"Regex Constraint"
Regular expression (Mindbreeze) for the query constraint. References to matched groups are possible with $1, $2, ... The stand-alone special character $ must be escaped using \. For example: \^\\Q$1\\E[_/]\\Q$2\\E\$
In the “Metadata” section you can configure the metadata to be requested by default.
Defines the metadata mode. "Disable" requests no metadata by default. "Send Only Configured Metadata" requests the configured metadata. "Send Default Client Metadata When no Metadata is Configured" requests the default metadata of the data source if no metadata has been configured, otherwise it requests the configured metadata.
The name of the metadata.
The format of the metadata: "VALUE" or "HTML". "HTML" is the recommended setting for search applications, since this format can be displayed well. "VALUE" returns the raw value of the metadata.
In addition, the GSA transformer supports the following new features of the GSA XML search queries:
The getfields parameter determines which metadata is requested. If the getfields parameter is not used or if the getfields value is "*", then the configuration of the metadata (see the above section) determines which metadata is requested.
If metadata is explicitly requested with the getfields parameter, for example with getfields=Author.Description.Revision, then only this metadata is requested, regardless of the configured metadata. This metadata is requested by default in "HTML" format.
However, the format of the requested metadata can be changed with the configured metadata. For instance, if the metadata Description is configured with the format "VALUE", then getfields=Author.Description.Revision requests metadata in the following formats: Author in HTML, Description in VALUE, Revision in HTML.
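Combined with the request endpoint shown above, a query that explicitly requests three metadata fields could look like this (the host name and field names are illustrative):

```
http://appliance/plugin/GSA/search?q=test&getfields=Author.Description.Revision
```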
The query can contain special characters that separate the terms and carry logical meaning. These characters can be parsed and used to transform the query into Mindbreeze-compatible logical expressions. Examples of possible transformations are:
tree OR house
tree NOT garden
tree AND garden
This behavior can be disabled by deactivating the configuration option "Parse Query Term Separator Special Characters". The original query will then be used for the search.