Google Search Appliance Feed Indexing with Mindbreeze InSpire

Configuration and Indexing

Copyright ©

Mindbreeze GmbH, A-4020 Linz, 2017.

All rights reserved. All hardware and software names are brand names and/or trademarks of their respective manufacturers.

These documents are strictly confidential. The submission and presentation of these documents does not confer any rights to our software, our services and service outcomes or other protected rights.

The dissemination, publication or reproduction hereof is prohibited.

For ease of readability, gender differentiation has been waived. Corresponding terms and definitions apply within the meaning and intent of the equal treatment principle for both sexes.


Google Search Appliance feeds

The Mindbreeze InSpire GSA feed adapter makes it possible to index Google Search Appliance feeds with Mindbreeze InSpire.

The feed is an XML file that contains URLs. A feed can also include the contents of the documents, metadata and additional information such as the date of last modification. The XML file must conform to the document type definition gsafeed.dtd. This file is located on the Google Search Appliance at http://<APPLIANCE-HOSTNAME>:7800/gsafeed.dtd.

ACL information is not evaluated in the current version. Use the Mindbreeze InSpire GSA feed adapter exclusively for public information.

The GSA feed XML documents must be sent to the GSA feed adapter service port in an HTTP POST request. If the feed has been received and processed successfully, the service responds with HTTP status 200 and the text "Success".
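The submission step described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a documented client: the function names are hypothetical, and sending the feed XML directly as the request body with a text/xml content type is an assumption, so check the adapter configuration for the actual port and expected encoding. The element names (gsafeed, header, datasource, feedtype, group, record) follow gsafeed.dtd.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_feed(datasource, urls):
    """Build a minimal metadata-and-url feed and return it as UTF-8 bytes."""
    gsafeed = ET.Element("gsafeed")
    header = ET.SubElement(gsafeed, "header")
    ET.SubElement(header, "datasource").text = datasource
    ET.SubElement(header, "feedtype").text = "metadata-and-url"
    group = ET.SubElement(gsafeed, "group")
    for url in urls:
        ET.SubElement(group, "record", url=url, action="add", mimetype="text/html")
    return ET.tostring(gsafeed, encoding="utf-8")


def build_request(adapter_url, feed_bytes):
    """Wrap the feed in an HTTP POST request (raw XML body is an assumption)."""
    return urllib.request.Request(
        adapter_url,
        data=feed_bytes,
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


def send_feed(adapter_url, feed_bytes, timeout=30):
    """Send the feed; return True if the service answers 200 with "Success"."""
    request = build_request(adapter_url, feed_bytes)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = response.read().decode("utf-8", errors="replace")
        return response.status == 200 and "Success" in body
```

A call such as send_feed("http://<adapter-host>:<service-port>/", build_feed("example", ["http://website.mycompany.com/newsletter"])) would then submit one URL record; the host and port placeholders must be replaced with the values of the configured service.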

Basic configuration of the GSA feed adapter service

In the Mindbreeze configuration, open the "Indices" tab. Add a new service by clicking the "Add New Service" icon.

Enter a "Display Name" and select the service type "GSAFeedAdapter". The following options are available under "Service Settings":

  • "GSA Feed Adapter Service Port": HTTP port to which the feed documents can be sent.
  • "Following Patterns": URL patterns for links that are to be followed.
  • "Do Not Follow Patterns": URL patterns for links that are not to be followed.
  • "Document Dispatcher Thread Count": Number of threads that process the downloaded documents and forward them to the Mindbreeze indices.
  • "Web Crawler Thread Count": Number of threads that visit URLs to download documents.

The "Following Patterns" and "Do Not Follow Patterns" are defined using the Google URL pattern syntax:
https://www.google.com/support/enterprise/static/gsa/docs/admin/72/gsa_doc_set/admin_crawl/url_patterns.html
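To illustrate how follow and do-not-follow patterns might interact, here is a simplified Python sketch. It covers only a small subset of the Google URL pattern syntax (the regexp: and contains: prefixes, the ^ and $ anchors, and plain substrings), and the rule that a do-not-follow match takes precedence is an assumption for illustration rather than documented adapter behavior. The function names are hypothetical.

```python
import re


def matches(pattern, url):
    """Check a URL against one pattern (simplified subset of the syntax)."""
    if pattern.startswith("regexp:"):
        return re.search(pattern[len("regexp:"):], url) is not None
    if pattern.startswith("contains:"):
        return pattern[len("contains:"):] in url
    anchored_start = pattern.startswith("^")
    anchored_end = pattern.endswith("$")
    begin = 1 if anchored_start else 0
    end = len(pattern) - 1 if anchored_end else len(pattern)
    core = pattern[begin:end]
    if anchored_start and anchored_end:
        return url == core
    if anchored_start:
        return url.startswith(core)
    if anchored_end:
        return url.endswith(core)
    return core in url


def should_follow(url, follow_patterns, do_not_follow_patterns):
    """Follow a link only if it matches a follow pattern and no exclusion."""
    if any(matches(p, url) for p in do_not_follow_patterns):
        return False
    return any(matches(p, url) for p in follow_patterns)
```

With the follow pattern "http://website.mycompany.com/" and the do-not-follow pattern "contains:/private/", a link to http://website.mycompany.com/newsletter would be followed in this sketch, while a link below /private/ would not.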

Collections and destination mappings

In the "Collections" section of the GSA feed adapter service configuration, you can define URL groups based on URL patterns. A document can belong to multiple collections, but it is indexed only once per category and category instance.

The names of all collections that contain a document are stored in the metadata "collections".

At least one destination mapping must be defined so that URLs can be indexed in the collections. Click on the icon "Add Composite Property" in the field "Destination Mapping".

Reference one or more collections in the added destination mapping ("Collection Pattern"). You can specify a regular expression here that matches the desired collection names.

In addition, define the category and category instance to be used for indexing. For Web content, for example, the category is "Web". The "Category Instance" can be freely chosen.

Lastly, specify the index URL and the filter URL to which the data should be sent.
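The relationship between collections and destination mappings can be modeled roughly as follows. This is an illustrative sketch only: the class and function names are hypothetical, collection URL patterns are reduced to plain substring checks, and matching the "Collection Pattern" against the whole collection name is an assumption. It shows how a document that belongs to several collections still ends up at most once per category and category instance.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class DestinationMapping:
    collection_pattern: str  # regular expression over collection names
    category: str
    category_instance: str
    index_url: str           # hypothetical values in the examples
    filter_url: str


def collections_for(url, collections):
    """Return the sorted names of all collections whose patterns match."""
    return sorted(
        name
        for name, patterns in collections.items()
        if any(pattern in url for pattern in patterns)
    )


def destinations_for(url, collections, mappings):
    """Pick destinations, at most one per (category, category instance)."""
    names = collections_for(url, collections)
    seen = set()
    destinations = []
    for mapping in mappings:
        key = (mapping.category, mapping.category_instance)
        if key in seen:
            continue
        if any(re.fullmatch(mapping.collection_pattern, n) for n in names):
            seen.add(key)
            destinations.append(mapping)
    return names, destinations
```

The list returned in names corresponds to what would be stored in the "collections" metadata of the document.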

Metadata extraction

If documents are indexed with the GSA facade service, it is possible to define user-customized metadata for documents in several ways:

  • metadata defined in the feed,
  • metadata added by the HTTP header,
  • in the case of HTML documents, user-customized metadata from the content,
  • robots meta tags.

Metadata and URL feeds

The metadata defined in the URL records is automatically extracted and indexed. In the following example, the metadata "meta-keywords", "targetgroup" and "group" are indexed.

<record url="http://website.mycompany.com/newsletter" action="add" mimetype="text/html" lock="true" crawl-immediately="true">
    <metadata>
        <meta name="meta-keywords" content="Newsletter, My Company"/>
        <meta name="targetgroup" content="fachkunde"/>
        <meta name="group" content="public"/>
    </metadata>
</record>

The metadata of the records is only indexed for the record URLs and not for the subpages.
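As a rough illustration of how the metadata of a record can be read, the following Python sketch parses the record shown above with the standard library. The function name record_metadata and the dictionary keyed by record URL are hypothetical; they only demonstrate that the metadata is tied to the record URL itself and is not found for other URLs such as subpages.

```python
import xml.etree.ElementTree as ET

RECORD_XML = """
<record url="http://website.mycompany.com/newsletter" action="add"
        mimetype="text/html" lock="true" crawl-immediately="true">
    <metadata>
        <meta name="meta-keywords" content="Newsletter, My Company"/>
        <meta name="targetgroup" content="fachkunde"/>
        <meta name="group" content="public"/>
    </metadata>
</record>
"""


def record_metadata(record):
    """Collect the name/content pairs of all <meta> elements of a record."""
    return {
        meta.get("name"): meta.get("content")
        for meta in record.findall("metadata/meta")
    }


record = ET.fromstring(RECORD_XML)
# Metadata is stored per record URL; subpages are not keys in this mapping.
metadata_by_url = {record.get("url"): record_metadata(record)}

subpage = "http://website.mycompany.com/newsletter/archive"
print(metadata_by_url["http://website.mycompany.com/newsletter"])
print(metadata_by_url.get(subpage, {}))
```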

HTTP headers

In addition to the metadata from the feed records, metadata is extracted from the X-gsa-external-metadata HTTP header for all URLs. The header contains a comma-separated list of values. Each value has the form meta-name=meta-value. The "meta-name" and "meta-value" elements are URL-encoded (http://www.ietf.org/rfc/rfc3986.txt, Section 2).
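The header format can be illustrated with a short Python sketch that encodes and decodes such a value using percent-encoding from the standard library. The function names are hypothetical, and skipping entries without an equals sign is an assumption made for robustness, not documented behavior.

```python
from urllib.parse import quote, unquote


def format_external_metadata(pairs):
    """Encode (name, value) pairs as an X-gsa-external-metadata header value."""
    return ",".join(
        f"{quote(name, safe='')}={quote(value, safe='')}" for name, value in pairs
    )


def parse_external_metadata(header_value):
    """Decode the header value into a list of (name, value) pairs."""
    pairs = []
    for item in header_value.split(","):
        item = item.strip()
        name, separator, value = item.partition("=")
        if not item or not separator:
            continue  # skip empty or malformed entries (assumption)
        pairs.append((unquote(name), unquote(value)))
    return pairs


header = format_external_metadata(
    [("meta-keywords", "Newsletter, My Company"), ("group", "public")]
)
print(header)
print(parse_external_metadata(header))
```

Because commas and equals signs inside names and values are percent-encoded, splitting on the literal separators is unambiguous.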

Metadata extraction from the content

It is also possible to extract user-customized metadata from the content for HTML documents, similar to the Mindbreeze Web Connector.

As with metadata mapping, it is also possible to define "content extractors" and "metadata extractors" for URL collections.

A content extractor has one collection pattern, which is configured as a regular expression. The rules for content and title extraction are applied to all URLs from all matching collections.

Metadata extractors can also be defined for the collections. Here it is possible to extract user-customized metadata with different formats and formatting options.

The metadata extractors use XPath expressions to extract textual content. The extracted text can then be processed according to the configured format and interpreted, for example, as a date.
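The idea of XPath-based extraction followed by format-specific interpretation can be sketched in Python. This is only an approximation of the concept: the standard library's ElementTree supports a limited XPath subset and requires well-formed markup, and the sample page, the XPath expressions and the date format are hypothetical.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

PAGE = """<html><head><title>Newsletter</title></head>
<body><div id="main"><span class="published">2017-03-15</span>
<p>Text</p></div></body></html>"""


def extract_text(root, xpath):
    """Return the stripped text of the first node matching the expression."""
    node = root.find(xpath)
    if node is None:
        return None
    return "".join(node.itertext()).strip()


def extract_date(root, xpath, date_format):
    """Extract text via XPath and interpret it as a date."""
    text = extract_text(root, xpath)
    if not text:
        return None
    return datetime.strptime(text, date_format).date()


root = ET.fromstring(PAGE)
title = extract_text(root, "./head/title")
published = extract_date(root, ".//span[@class='published']", "%Y-%m-%d")
print(title, published)
```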

Robots meta tag

The robots meta tag allows a detailed, page-specific approach to controlling how a particular page is indexed and displayed to users in the search results (https://developers.google.com/webmasters/control-crawl-index/docs/robots_meta_tag?hl=de).

The robots meta tag is placed in the <head> section of the corresponding page:

<!DOCTYPE html>
<html>
<head>
<meta name="robots" content="noindex" />
(…)
</head>
<body>(…)</body>
</html>

The Mindbreeze InSpire GSA feed adapter service considers the following robots meta tag values:

  • noindex: This page is not indexed.
  • nofollow: The links on this page are not followed.
  • none: Equivalent to noindex, nofollow.
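How these three values translate into an index and follow decision can be sketched with Python's standard HTML parser. The class and function names are hypothetical, and treating the values as a case-insensitive, comma-separated list is an assumption based on common robots meta tag usage.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of all <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        for value in (attributes.get("content") or "").split(","):
            value = value.strip().lower()
            if value:
                self.directives.add(value)


def robots_policy(html):
    """Return (index_page, follow_links) for the given HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    directives = parser.directives
    index_page = not ({"noindex", "none"} & directives)
    follow_links = not ({"nofollow", "none"} & directives)
    return index_page, follow_links


page = '<html><head><meta name="robots" content="noindex" /></head><body></body></html>'
print(robots_policy(page))
```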

Configuration of the index services

If required, create a new index: click on the "Indices" tab and then on the "Add new service" icon.

Enter the index path (in "Index Path"). If necessary, adjust the display name (in "Display Name") of the index service, the index parameters, and the associated filter service.

To create data sources for an index, under the section "Data Sources", click on "Add new custom source".

A data source should be configured here for every category that is assigned to this index in the GSA feed adapter service (see Section 1.2). Since the data sources are only used for the search, the crawler should be disabled. To do this, activate the "Advanced Settings" mode and select the option "Disable Crawler" for the configured data sources.