Operations
Index Operating Concepts

Introduction

This document describes the index operating concepts of Mindbreeze InSpire. They apply both to standalone operation (a single appliance) and to distributed operation (several appliances).

Glossary

Aggregatable

If a metadatum is aggregatable, it is automatically also regexmatchable, with the added property that the metadatum is available as a facet (filter). A distinction is made between:

Static Aggregatable: defined globally per metadatum for the whole index in the index schema. Changing this definition is an index schema change and therefore requires a re-inversion of the index.
Dynamic Aggregatable: defined per metadatum and per document. Because this is not defined in the index schema, no re-inversion is necessary, so metadata for individual documents can be made aggregatable very flexibly.

Aggregated Metadata Keys

The "Aggregated Metadata Keys" option can be configured per index; "Advanced Settings" must be activated for the option to be visible. It allows metadata to be marked as aggregatable. Changes to this option result in a change to the index schema.

Built-In Metadata Keys

The following metadata keys are reserved for Built-In metadata:

Name	Type
mes:docid	Integer
mes:key	String
mes:size	Integer
category	String
fqcategory	String
categoryclass	String
categoryscope	String
mes:date	String
title	String
datasource/mes:key	String
datasource/category	String
datasource/fqcategory	String
extension	String
mes:boost	Float
mes:uniformdocid	Integer

Regexmatchable

Regexmatchable metadata can be searched with regular expressions (relevant for custom search clients, see api.v2.search).
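
The idea of matching a metadatum against a regular expression can be illustrated with a minimal Python sketch. It does not call api.v2.search; the document list, the helper `regex_match` and the full-match semantics are assumptions made purely for illustration.

```python
import re

# Hypothetical in-memory documents; in a real system the metadata lives in the index.
documents = [
    {"title": "Invoice 2023-001", "extension": "pdf"},
    {"title": "Meeting notes", "extension": "docx"},
    {"title": "Invoice 2024-017", "extension": "pdf"},
]


def regex_match(docs, key, pattern):
    """Return the documents whose metadatum `key` fully matches `pattern`.

    Full-match semantics are an assumption of this sketch.
    """
    compiled = re.compile(pattern)
    return [d for d in docs if compiled.fullmatch(str(d.get(key, "")))]


matches = regex_match(documents, "title", r"Invoice \d{4}-\d{3}")
print([d["title"] for d in matches])
```

Only metadata marked as regexmatchable (or aggregatable) would be eligible for such a search; the sketch simply treats every key as eligible.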

Category / Category Instance / Fully Qualified Category

The "Category", "Category Instance" and "Fully Qualified Category" are described in the table below:

Name	Metadata	Description
Category	datasource/category	Documents that are indexed by a particular crawler always have the same category. This is therefore not configurable.
Category Instance	datasource/categoryinstance	The Category Instance can be configured for most crawlers so that they set the Category Instance for their crawled documents.
Fully Qualified Category	datasource/fqcategory	The Fully Qualified Category is generated by combining the Category and Category Instance (with a colon in the middle, e.g. Web:Default). This must be unique for each crawler if each crawler in the search client is to receive its own filter value for the Source filter.
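
The composition rule in the table can be sketched in a few lines of Python. The function names and the crawler list are hypothetical; only the "Category:Category Instance" format and the uniqueness requirement come from the description above.

```python
from collections import Counter


def fully_qualified_category(category, category_instance):
    """Combine Category and Category Instance with a colon, e.g. Web:Default."""
    return f"{category}:{category_instance}"


def duplicate_fqcategories(crawlers):
    """Return Fully Qualified Categories that are used by more than one crawler."""
    counts = Counter(fully_qualified_category(c, i) for c, i in crawlers)
    return sorted(fq for fq, n in counts.items() if n > 1)


# Hypothetical crawler configuration: (Category, Category Instance) pairs.
crawlers = [("Web", "Default"), ("Web", "Intranet"), ("Web", "Default")]
print(duplicate_fqcategories(crawlers))
```

Crawlers reported by `duplicate_fqcategories` would share one value in the Source filter; giving each its own Category Instance resolves this.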

Index Document Info

The part of the index that is available in memory for analysis is called the Document Info. The Document Info Zones (properties) can be controlled using the Category Descriptor, the Semantic Pipeline or the Aggregated Metadata Keys.

Index Document Info Schema (Index Schema)

The definition of which properties are available via the Document Info is called the Document Info Schema (Index Schema for short).

Index Configuration

The index configuration includes everything that configures the index. The index configuration is stored in the index file system.

Index schema change

A schema change results in a re-inversion of the Document Info. The following are examples of changes that cause a schema change:

Changes in Aggregated Metadata Keys
Changes in Category Descriptor (related to aggregatable and regexmatchable)
Precomputed Synthesized Metadata (if aggregatable)
Entity recognition

Index Inversion / Re-Inversion

After a filtered document is stored in the index, it is inverted so that it becomes searchable ("index inversion"). In addition, documents are enriched with metadata during inversion (described in the Semantic Pipeline).

When the schema is changed, the Document Info of the index is automatically re-inverted.

Full Re-Inversion

Full Re-Inversion not only re-inverts the document info but rebuilds the whole inverted index.

A Full Re-Inversion can be triggered using the script /opt/mindbreeze/scripts/move_inverted_index.sh.
The script moves the inverted index to a specified backup directory. The inverted index is then rebuilt on the next index startup.
The index must be stopped before the script is run. After the index is started again, it becomes available only once the re-inversion has finished.

./move_inverted_index.sh
    --basedir INDEX_DIRECTORY
    --destdir BACKUP_DIRECTORY
    [--category CATEGORY]
    [--bucket BUCKET_NR]
    [--overwrite]
    | --help | -h

If neither category nor bucket is specified, the inverted index of all categories in all buckets is moved.

The parameter category restricts this to a specific category.

The parameter bucket restricts this to a specific bucket.
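
For scripted maintenance, the call can be assembled programmatically. The following Python sketch only builds the argument list from the options listed above; the helper name `build_move_command`, the directories and the category value are hypothetical placeholders, and the sketch does not stop or start the index.

```python
import shlex

SCRIPT = "/opt/mindbreeze/scripts/move_inverted_index.sh"


def build_move_command(basedir, destdir, category=None, bucket=None, overwrite=False):
    """Build the argument list for move_inverted_index.sh from the documented options."""
    cmd = [SCRIPT, "--basedir", basedir, "--destdir", destdir]
    if category is not None:
        cmd += ["--category", category]  # restrict to one category
    if bucket is not None:
        cmd += ["--bucket", str(bucket)]  # restrict to one bucket
    if overwrite:
        cmd.append("--overwrite")
    return cmd


# Placeholder paths and category; replace them with the values of the actual index.
cmd = build_move_command("/data/index", "/data/index-backup", category="Web", bucket=0)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on the appliance once the index has been stopped.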

Multi Index Layout

The Multi Index Layout is a special form of index structure. By default, it is used for all indexes, which is especially important for distributed operation with multiple Mindbreeze InSpire appliances. See also Handbook - Distributed Operation (G7) - Index Layout.

Semantic Pipeline

Documents delivered by the crawler or pusher are processed in the semantic pipeline and then indexed. The following steps are performed:

Filter / Content Filter

Depending on the file type, the filter forwards documents to the respective content filters. The filtered documents are returned to the filter so that, if necessary, they can be forwarded to further content filters. An example of this is ZIP documents, which must first be unpacked by one content filter before their contents are processed by other content filters. Filters can be configured in the Mindbreeze Management Center under "Configuration" in the tab "Filter" and selected in the tab "Indices" for the respective indices.
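
The round trip between filter and content filters can be pictured as a recursive dispatch. The Python sketch below is a simplified illustration, not the Mindbreeze filter implementation: it assumes dispatch by file extension, knows only a ZIP and a plain-text "content filter", and uses a hypothetical depth limit to guard against deeply nested archives.

```python
import io
import zipfile


def filter_document(name, data, depth=0, max_depth=5):
    """Return (name, extracted_text) pairs for a document, unpacking ZIPs recursively."""
    if depth > max_depth:
        return []
    extension = name.rsplit(".", 1)[-1].lower() if "." in name else ""
    if extension == "zip":
        results = []
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            for member in archive.namelist():
                if member.endswith("/"):
                    continue  # skip directory entries
                # Hand each unpacked file back to the filter for further dispatch.
                results += filter_document(member, archive.read(member), depth + 1, max_depth)
        return results
    if extension == "txt":
        return [(name, data.decode("utf-8", errors="replace"))]
    return []  # no content filter available for this type in this sketch


# Build a small ZIP in memory to demonstrate the round trip.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("readme.txt", "hello")
    archive.writestr("image.png", b"\x89PNG")

filtered = filter_document("bundle.zip", buffer.getvalue())
print(filtered)
```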

Post Filter

Using Post Filter, the content of already filtered documents can be processed and modified before the document is sent to the index.

Precomputed Synthesized Metadata

Precomputed Synthesized Metadata can be used to generate new metadata based on other metadata. The point in the semantic pipeline at which this metadata is generated can be set using the "Transformation Pipeline Slot" option. Detailed documentation can be found here.

Entity Recognition

Entity Recognition can be used to generate metadata by recognizing certain patterns in a text (using regular expressions). For example, dates, UNC paths, etc. can be recognized. Detailed documentation can be found here.
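
As a rough illustration of pattern-based recognition, the following Python sketch extracts ISO-style dates and UNC paths from a text. The two patterns and the metadata names `date` and `unc_path` are assumptions chosen for this example; they are not the rules shipped with Mindbreeze InSpire.

```python
import re

# Illustrative patterns only; real entity recognition rules are configured in the product.
PATTERNS = {
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "unc_path": re.compile(r"\\\\[A-Za-z0-9._-]+(?:\\[^\\\s]+)+"),
}


def recognize_entities(text):
    """Return a mapping of metadata name to all matches found in `text`."""
    return {name: pattern.findall(text) for name, pattern in PATTERNS.items()}


text = (
    r"The report dated 2024-03-15 is stored at \\fileserver\reports\q1.pdf "
    r"and was updated 2024-04-01"
)
entities = recognize_entities(text)
print(entities)
```

Each key of the returned dictionary corresponds to a metadatum that would be attached to the document.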

CSV Transformation

The "CSV Transformation" can also be used to generate metadata. It is possible to compare a value of a metadatum with a value of a certain column in the CSV. If the metadatum value matches the value from the column, you can write the value of another column from the same row into a new metadatum and append it to the result. More information can be found in the CSV-Transformation documentation.
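
The lookup described above can be sketched in Python as follows. The function `csv_transform`, its parameters and the sample CSV are hypothetical; the sketch also assumes that only the first matching row is used, which the description does not specify.

```python
import csv
import io


def csv_transform(document, csv_text, source_key, match_column, value_column, target_key):
    """Copy `value_column` of the first CSV row whose `match_column` equals the
    document's `source_key` value into a new metadatum `target_key`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        if row[match_column] == document.get(source_key):
            enriched = dict(document)  # leave the input document unchanged
            enriched[target_key] = row[value_column]
            return enriched
    return dict(document)  # no match: metadata stays as it was


# Hypothetical mapping table and document.
CSV_TEXT = "department_id,department_name\n10,Sales\n20,Engineering\n"
doc = {"title": "Roadmap", "department": "20"}

result = csv_transform(
    doc, CSV_TEXT, "department", "department_id", "department_name", "department_name"
)
print(result)
```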

Item Transformation

Item transformers are another way to enrich documents with metadata. Mindbreeze InSpire offers various item transformers, such as the LanguageDetector Plugin.

Language Detection & Named Entity Recognition

With the help of the "Language Detection" integrated in the index, the language of a document can be recognized without an additional plugin.

The subsequent "Named Entity Recognition (NER)" can identify and classify named entities both in the content and in the metadata of a document. Detailed documentation can be found here.
