Copyright ©
Mindbreeze GmbH, A-4020 Linz, 2024.
All rights reserved. All hardware and software names used are trade names and/or trademarks of their respective manufacturers.
These documents are strictly confidential. The transmission and presentation of these documents alone does not establish any rights to our software, our services and service results, or other protected rights. Distribution, publication or reproduction is not permitted.
For ease of reading, gender-specific differentiation (e.g. "Benutzer/-innen") is not used. In the interest of equal treatment, the corresponding terms apply to both genders.
This document describes the concepts of Mindbreeze InSpire. These concepts apply both to standalone operation (with a single appliance) and to distributed operation (with several appliances).
If a metadatum is aggregatable, it is automatically also regexmatchable, with the added property that the metadatum is available as a facet (filter). A distinction is made between:
The "Aggregated Metadata Keys" can be configured per index. The "Advanced Settings" must be activated for this option to be visible. This option makes it possible to mark metadata as "aggregatable". Changes to this option entail a change in the index schema.
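To illustrate what "available as a facet (filter)" means in practice, the following is a minimal sketch of aggregating one metadata key across documents. It is a generic illustration and not the Mindbreeze implementation; the function name `facet_counts` and the sample documents are assumptions made for this example.

```python
from collections import Counter


def facet_counts(documents, key):
    """Count how many documents carry each value of the given metadata key."""
    counts = Counter()
    for doc in documents:
        value = doc.get(key)
        if value is not None:
            counts[value] += 1
    return counts


# Hypothetical documents, each represented as a dict of metadata.
documents = [
    {"title": "Report", "extension": "pdf"},
    {"title": "Notes", "extension": "docx"},
    {"title": "Slides", "extension": "pdf"},
]

# One facet entry per distinct value, with the number of matching documents.
facets = facet_counts(documents, "extension")
```

Each distinct value becomes one selectable filter entry, which is why the values of a key have to be aggregated over the index before the facet can be shown.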
The following metadata keys are reserved for Built-In metadata:
Name | Type |
mes:docid | Integer |
mes:key | String |
mes:size | Integer |
category | String |
fqcategory | String |
categoryclass | String |
categoryscope | String |
mes:date | String |
title | String |
datasource/mes:key | String |
datasource/category | String |
datasource/fqcategory | String |
extension | String |
mes:boost | Float |
mes:uniformdocid | Integer |
The "Category", "Category Instance" and "Fully Qualified Category" are described in the table below:
Name | Metadata | Description |
Category | datasource/category | Documents that are indexed by a particular crawler always have the same category. This is therefore not configurable. |
Category Instance | datasource/categoryinstance | The Category Instance can be configured for most crawlers so that they set the Category Instance for their crawled documents. |
Fully Qualified Category | datasource/fqcategory | The Fully Qualified Category is generated by combining the Category and Category Instance (with a colon in the middle, e.g. Web:Default). This must be unique for each crawler if each crawler in the search client is to receive its own filter value for the Source filter. |
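The composition rule in the table can be sketched as follows. This is a minimal illustration, assuming the Category and Category Instance are plain strings; the helper names and the instance name "Intranet" are hypothetical and not part of the product.

```python
from collections import Counter


def fully_qualified_category(category, category_instance):
    """Join Category and Category Instance with a colon, e.g. Web:Default."""
    return f"{category}:{category_instance}"


def duplicate_fqcategories(crawlers):
    """Return the Fully Qualified Categories shared by more than one crawler."""
    counts = Counter(fully_qualified_category(c, i) for c, i in crawlers)
    return sorted(name for name, n in counts.items() if n > 1)


# Hypothetical crawler setup as (Category, Category Instance) pairs.
crawlers = [("Web", "Default"), ("Web", "Intranet"), ("Web", "Default")]
duplicates = duplicate_fqcategories(crawlers)
```

In this sketch the two crawlers configured as Web:Default would share one value in the Source filter, while Web:Intranet would get its own.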
The part of the index that is available in memory for analysis is called the Document Info. The Document Info Zones (properties) can be controlled using the Category Descriptor, the Semantic Pipeline or the Aggregated Metadata Keys.
The set of properties that are available via the Document Info is also called the Document Info Schema.
The index configuration includes everything that configures the index. The index configuration is stored in the index file system.
A schema change results in a document info reinversion. The following list contains examples that cause a schema change:
After a filtered document is stored in the index, it is inverted so that it becomes searchable ("index inversion"). In addition, documents are enriched with metadata during inversion (described in the Semantic Pipeline).
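As a rough illustration of what inversion means in general, the sketch below builds a textbook inverted index that maps each term to the documents containing it. It is not the Mindbreeze index format; the tokenization, function names and sample documents are assumptions made for this example.

```python
import re
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercase term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in re.findall(r"\w+", text.lower()):
            index[term].add(doc_id)
    return index


def search(index, term):
    """Return the sorted IDs of all documents containing the term."""
    return sorted(index.get(term.lower(), set()))


# Hypothetical filtered documents: ID -> extracted text.
documents = {
    1: "Quarterly sales report",
    2: "Sales meeting notes",
    3: "Travel policy",
}
index = build_inverted_index(documents)
```

Because a lookup goes from term to documents, a query does not have to scan every stored document, which is what makes the documents searchable after inversion.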
When the schema is changed, the document info of the index is automatically re-inverted.
Full Re-Inversion not only re-inverts the document info but rebuilds the whole inverted index.
This can be triggered using the script /opt/mindbreeze/scripts/move_inverted_index.sh.
It moves the inverted index to a specified backup directory. The inverted index will be rebuilt on the next index startup.
The index has to be stopped when using this script. After starting the index it is only available after the re-inversion has finished.
./move_inverted_index.sh
    --basedir INDEX_DIRECTORY
    --destdir BACKUP_DIRECTORY
    [--category CATEGORY]
    [--bucket BUCKET_NR]
    [--overwrite]
    | --help | -h
If neither category nor bucket is specified, the inverted index of all categories in all buckets is moved.
The parameter category restricts this to a specific category.
The parameter bucket restricts this to a specific bucket.
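The following sketch shows how the options from the synopsis above combine into one command line. It only assembles the argument list and does not run the script; the function name `build_move_command` and the directories /data/index and /data/backup are hypothetical placeholders.

```python
def build_move_command(basedir, destdir, category=None, bucket=None, overwrite=False):
    """Assemble the argument list for move_inverted_index.sh."""
    cmd = ["./move_inverted_index.sh", "--basedir", basedir, "--destdir", destdir]
    if category is not None:
        cmd += ["--category", category]  # restrict to one category
    if bucket is not None:
        cmd += ["--bucket", str(bucket)]  # restrict to one bucket
    if overwrite:
        cmd.append("--overwrite")
    return cmd


# Hypothetical paths: move the inverted index of all categories and buckets.
move_all = build_move_command("/data/index", "/data/backup")

# Hypothetical values: restrict the move to category "Web" in bucket 0.
move_one = build_move_command("/data/index", "/data/backup", category="Web", bucket=0)
```

A list built this way could be passed to `subprocess.run(cmd, cwd="/opt/mindbreeze/scripts", check=True)` once the index has been stopped.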
The "Multi Index Layout" is a special form of index structure. It is used by default for all indexes, which is especially important for distributed operation with multiple Mindbreeze InSpire appliances. See also Handbook - Distributed Operation (G7) - Index Layout.
Documents are processed by the crawler or pusher in the semantic pipeline and then indexed. The following steps are performed:
Depending on the file type, the filter forwards documents to the respective content filters. Filtered documents are returned to the filter so that they can, if necessary, be forwarded to further content filters. An example is a ZIP archive, which must first be unpacked by one content filter before its contents are processed by other content filters. Filters can be configured in the Mindbreeze Management Center under "Configuration" in the "Filter" tab and selected for the respective indices in the "Indices" tab.
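The round trip through the filter can be sketched as a recursive dispatch. This is a simplified illustration, assuming only two content filters (ZIP and plain text) that are selected by file extension; the function names and sample files are hypothetical and do not reflect the actual filter plugins.

```python
import io
import zipfile


def filter_document(name, data):
    """Return a list of (name, extracted_text) pairs for one document."""
    lowered = name.lower()
    if lowered.endswith(".zip"):
        results = []
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            for member in archive.namelist():
                if member.endswith("/"):
                    continue  # skip directory entries
                # Send each unpacked member back through the filter.
                results.extend(filter_document(f"{name}/{member}", archive.read(member)))
        return results
    if lowered.endswith(".txt"):
        return [(name, data.decode("utf-8"))]
    return []  # no content filter for this type in this sketch


def make_zip(files):
    """Build an in-memory ZIP archive from a name -> content mapping."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for member, content in files.items():
            archive.writestr(member, content)
    return buffer.getvalue()


# Hypothetical input: a ZIP archive that contains a text file and another ZIP.
inner = make_zip({"b.txt": "nested"})
outer = make_zip({"a.txt": "hello", "inner.zip": inner})
filtered = filter_document("outer.zip", outer)
```

The nested archive is unpacked by the same ZIP branch on its second pass, which mirrors how filtered output can be handed to further content filters.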
Using Post Filter, the content of already filtered documents can be processed and modified before the document is sent to the index.
Precomputed Synthesized Metadata can be used to generate new metadata based on other metadata. The point in the semantic pipeline at which this metadata is generated can be set using the "Transformation Pipeline Slot" option. Detailed documentation is available separately.
Entity Recognition can be used to generate metadata by recognizing certain patterns in a text (using regular expressions). For example, dates, UNC paths, etc. can be recognized. Detailed documentation is available separately.
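The idea of pattern-based recognition can be sketched with two simple regular expressions. The patterns below are assumptions chosen for illustration (ISO-style dates and basic UNC paths) and are not the rules shipped with the product; the names `PATTERNS` and `recognize_entities` are hypothetical.

```python
import re

# Illustrative patterns only; real rules would cover more formats.
PATTERNS = {
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "unc_path": re.compile(r"\\\\[\w.-]+(?:\\[\w.$-]+)+"),
}


def recognize_entities(text):
    """Return all matches per entity name as candidate metadata values."""
    return {name: pattern.findall(text) for name, pattern in PATTERNS.items()}


text = r"Report from 2024-03-15 is stored at \\fileserver\reports\q1 for review."
entities = recognize_entities(text)
```

Each key of the result corresponds to one generated metadatum, and each list holds the values found in the text.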
The "CSV Transformation" can also be used to generate metadata. It is possible to compare a value of a metadatum with a value of a certain column in the CSV. If the metadatum value matches the value from the column, you can write the value of another column from the same row into a new metadatum and append it to the result. More information can be found in the CSV-Transformation documentation.
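The lookup described above can be sketched as follows. This is a minimal illustration, assuming the CSV has a header row and that the first matching row wins; all column names, metadata keys and the function name `csv_transform` are hypothetical.

```python
import csv
import io


def csv_transform(document, csv_text, source_key, match_column, value_column, target_key):
    """Copy value_column into target_key when match_column equals the source metadatum."""
    enriched = dict(document)
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row[match_column] == document.get(source_key):
            enriched[target_key] = row[value_column]
            break  # assumption: the first matching row is used
    return enriched


# Hypothetical CSV with a header row.
CSV_TEXT = "department_id,department_name\n10,Sales\n20,Engineering\n"

doc = {"title": "Budget", "department_id": "20"}
result = csv_transform(
    doc, CSV_TEXT, "department_id", "department_id", "department_name", "department"
)
```

Documents whose metadatum value does not occur in the match column are returned unchanged in this sketch.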
Item transformers are another way to enrich documents with metadata. Mindbreeze InSpire offers various item transformers, such as the LanguageDetector Plugin.
With the help of the "Language Detection" integrated in the index, the language of a document can be recognized without an additional plugin.
The subsequent "Named Entity Recognition (NER)" can identify and classify named entities both in the content and in the metadata of a document. Detailed documentation is available separately.