Query Expression Transformation

Mindbreeze Query Transformer Plugins

Copyright ©

Mindbreeze GmbH, A-4020 Linz, .

All rights reserved. All hardware and software names used are registered trade names and/or registered trademarks of the respective manufacturers.

These documents are highly confidential. No rights to our software or our professional services, or results of our professional services, or other protected rights can be based on the handing over and presentation of these documents.

Distribution, publication or duplication is not permitted.

The term ‘user’ is used in a gender-neutral sense throughout the document.

Mindbreeze Query Transformation

Mindbreeze provides several query transformation services that automatically modify search queries to deliver better search results.

On the one hand, there are plugin-based extension points that can be loaded into a Mindbreeze installation on demand:

  • Synonym Transformer
  • Replacement Transformer

On the other hand, there are integrated product features that make it easier to find the desired results (e.g. by enriching indexed documents with additional metadata):

  • “Did you mean?”
  • Entity Recognition
  • CSV Transformation

Query Transformation Plugins

To use a query transformation service, install it by loading the corresponding plugin into your Mindbreeze installation. The plugins are delivered in the “Mindbreeze Query Transformation Plugins.zip” package.

The plugin also needs to be included in your Mindbreeze license.

Synonym Transformer Plugin

The SynonymTransformer plugin allows you to find search results by also searching for synonyms of a word. To do this, the query is transformed so that it searches for every term in the matching synonym list.

Usage: The synonyms are defined in a CSV file. Each line contains one group of synonyms, separated by semicolons (;).

Example of a small synonym.csv file:

car;vehicle;automobile

plane;airplane;aeroplane

Example 1: a search for car sends the transformed query: car OR vehicle OR automobile

Example 2: a search for plane sends the transformed query: plane OR airplane OR aeroplane

Note: The term in the first column is the one matched against your query. Only single words without spaces are supported in the first column.
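The transformation described above can be illustrated with a minimal Python sketch. It is not the plugin's implementation: the function names `load_synonyms` and `expand_term` are hypothetical, and the case-insensitive lookup is an assumption that the text above does not confirm.

```python
import csv
import io

# Contents of the small synonym.csv example from above.
SYNONYM_CSV = """car;vehicle;automobile
plane;airplane;aeroplane
"""


def load_synonyms(text):
    """Map the first-column term of each row to the full list of terms in that row."""
    table = {}
    for row in csv.reader(io.StringIO(text), delimiter=";"):
        terms = [t.strip() for t in row if t.strip()]
        if terms:
            # Assumption: matching is case-insensitive.
            table[terms[0].lower()] = terms
    return table


def expand_term(term, table):
    """Return an OR-combined query for a known term, or the term unchanged."""
    synonyms = table.get(term.lower())
    if not synonyms:
        return term
    return " OR ".join(synonyms)


table = load_synonyms(SYNONYM_CSV)
print(expand_term("car", table))    # car OR vehicle OR automobile
print(expand_term("plane", table))  # plane OR airplane OR aeroplane
```

Because only the first column is used as the lookup key, a search for vehicle is left unchanged in this sketch, which mirrors the note above.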

Installation

  • Install the plugin with the Manager UI
  • Activate the plugin for every Index you want (with the Manager UI)
    • Switch to the “Indices” tab and activate “Advanced Settings”
    • Scroll down to the section “Query Transformation Services”
    • Select the “SynonymTransformer” plugin and click “Add”
  • Add the path to the CSV file containing the synonym definitions under “Custom Plugin Properties”
    • Add a new property with the name “SYNONYM_CSV_FILE_PATH”
    • Assign the path to the CSV file as its value (either a local file system path or a network path, as appropriate for the operating system in use)

Example 1: SYNONYM_CSV_FILE_PATH = C:\data\synonyms.csv

Example 2: SYNONYM_CSV_FILE_PATH = \\fileserver.mydomain.com\mes-config\synonyms.csv

Finally, save the configuration changes and restart the Mindbreeze node to propagate all changes.

Note: Any change to the synonym CSV file is applied immediately and takes effect with the next search.

Replacement Transformer Plugin

The ReplacementTransformer plugin is often used to replace unsuitable search terms with better ones, or even to disallow search terms.

The main difference from the Synonym Transformer plugin is that the original query is actually replaced by a new one and will not appear in the reporting of search terms. The Replacement Transformer can therefore be used to hide search results from users and show something else instead (e.g. to hide a legacy page and show the new version).

Usage: The replacement terms are defined in a CSV file. The first column defines the search term to be replaced; the following columns are used as disjunctive (OR-combined) replacement values. If there are no replacement values, the term will not be searched for.
Each search term to be replaced must be written on its own line, and the columns must be separated by semicolons (;).

Example of a small replacement.csv file:

car;mercedes;bmw;audi

party

Example 1: a search for car sends the transformed query: mercedes OR bmw OR audi

Example 2: a search for party will not find any results as it is replaced by an “empty” search
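As an illustration of these two examples, the following Python sketch shows one way the replacement logic could be modeled. The names `load_replacements` and `replace_term` are hypothetical, the case-insensitive match is an assumption, and `None` is used here only as a stand-in for the “empty” search.

```python
import csv
import io

# Contents of the small replacement.csv example from above.
REPLACEMENT_CSV = """car;mercedes;bmw;audi
party
"""


def load_replacements(text):
    """Map each first-column term to its (possibly empty) list of replacement values."""
    table = {}
    for row in csv.reader(io.StringIO(text), delimiter=";"):
        if not row or not row[0].strip():
            continue  # skip blank lines
        # Assumption: matching is case-insensitive.
        table[row[0].strip().lower()] = [v.strip() for v in row[1:] if v.strip()]
    return table


def replace_term(term, table):
    """Return the replacement query, None for an emptied search, or the term unchanged."""
    key = term.lower()
    if key not in table:
        return term
    values = table[key]
    return " OR ".join(values) if values else None


table = load_replacements(REPLACEMENT_CSV)
print(replace_term("car", table))    # mercedes OR bmw OR audi
print(replace_term("party", table))  # None
```

Unlike the synonym expansion, the original term (car) is not part of the resulting query.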

Installation

  • Install the plugin with the Manager UI
  • Activate the plugin for every Index you want (with the Manager UI)
    • Switch to the “Indices” tab and activate “Advanced Settings”
    • Scroll down to the section “Query Transformation Services”
    • Select the “ReplacementTransformer” plugin and click “Add”
  • Add the path to the CSV file containing the replacement definitions under “Custom Plugin Properties”
    • Add a new property with the name “REPLACEMENT_CSV_FILE_PATH”
    • Assign the path to the CSV file as its value (either a local file system path or a network path, as appropriate for the operating system in use)

Example 1: REPLACEMENT_CSV_FILE_PATH = C:\data\replacements.csv

Example 2: REPLACEMENT_CSV_FILE_PATH = \\fileserver.x.y\config\replacements.csv

Finally, save the configuration changes and restart the Mindbreeze node to propagate all changes.

Note: Any change to the replacement CSV file is applied immediately and takes effect with the next search.

General Notes on Transformer Plugins (Replacement/Synonym)

Note: If you are using both plugins (Synonym-Transformer and Replacement-Transformer) the Replacement-Transformer is applied first!

The following screenshot displays the configuration of both plugins within the Mindbreeze Manager Interface.


Stemmer transformer plugin

The stemmer transformer plugin allows you to find search results by searching for different stems of a word based on linguistic characteristics of the defined language.

Use: The basic algorithm to find suitable word stems is implemented in the supplied plugin. An additional dictionary with vocabularies of a specific language is available for the most common languages and is used to improve the search results.

In addition, so-called transliterations can also be carried out with the help of the stemmer transformer. In the process, characters are rewritten using rules. Both the original term and the rewritten term are then taken into account in the search.

Example:

A search for leaf will find matches like leaf and leaves.

Installation/configuration

  • Install the plugin (if not already installed)

  • Enable the plugin for each desired index using the Manager UI:
    • Go to the “Indices” tab and enable “Advanced Settings”
    • Scroll down to the section “Query Transformation Services”
    • Select the “StemmerTransformer” plugin and click “Add”

  • Configure the properties (depending on use):

Languages: The languages of the stemmer. One or more languages are permitted. The languages must be separated by commas or line breaks.

Path to vocabulary: A local path on the appliance to a vocabulary file. With a vocabulary, the query is not only reduced to stems but can also be expanded to other word forms with the same stem (e.g. a search for “tree” should also find “trees”).

Stemmer enabled: If checked, the stemmer is used.

Case sensitive: If this option is checked, the reduction of the stems is carried out taking upper and lower case into account (case-sensitive). This can produce more precise – but also fewer – stems. Note: The stem extension vocabulary is always used with no regard to upper and lower case (case-insensitive).

Auto detect language from query: The stemmer tries to derive the language from the search query.

Transliterate all variants: This option allows the stemmer to expand the query to include all matching transliterations.

TransliterationRule: Rules for rewriting strings in terms. The following rules can be used: http://icu-project.org/apiref/icu4j/com/ibm/icu/text/RuleBasedTransliterator.html

Then save the changes and restart the Mindbreeze node so that the changes take effect.

Use case: multilingual stemming

If Mindbreeze is used with multiple languages, it makes sense to configure the stemmer transformer plugin for multiple languages to deliver matching search results for all languages used.

The configuration option "Languages" can be used to configure several languages. The stemmer will then attempt to find stem forms in a search query for each configured language. All stem forms found for all configured languages are then used for the transformation.

If different stem forms of different languages are used together, the search may become too fuzzy and deliver irrelevant search results. To counteract this behavior, you can use the configuration option "Auto detect language from query". If this option is active, a heuristic will be used to determine the language of the search query. Note: The heuristic only determines languages that are configured via the configuration option "Languages". The languages determined are then used for stemming. This means that only the specific language of a search query is used for stemming.

The stemmer vocabulary must be adapted so that expanding the stem forms also works correctly with multiple languages. The stemmer vocabulary ("Path to Vocabulary") is an unsorted text file containing words and has one word in each line. The stemmer plugin reads this text file and creates stem forms for every single word and links the information about which words have the same stem form. This information is used in a search to expand the search term. For example, a search for “tree” should also find “trees.” The language used by the stemmer to find the stem forms in the vocabulary follows the same rules as those used to find the stem forms for a search term. All configured languages are used, or, if the configuration option "Auto detect language from query" is enabled, a heuristic is used to determine the language of a word in the vocabulary. We recommend expanding the vocabulary text file for each configured language. This can be done by simple concatenation – the words do not have to be sorted.
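The way a vocabulary links words with the same stem form can be sketched as follows. This is only an illustration under strong assumptions: `toy_stem` is a hypothetical stand-in that strips a trailing “s” and has nothing to do with the stemming algorithm actually shipped in the plugin, and language detection is omitted entirely.

```python
from collections import defaultdict


def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strips a single trailing 's'."""
    word = word.lower()
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def build_stem_groups(vocabulary_lines):
    """Group vocabulary words (one per line, unsorted) by their stem form."""
    groups = defaultdict(set)
    for line in vocabulary_lines:
        word = line.strip()
        if word:
            # The vocabulary is treated case-insensitively, as noted for "Case sensitive".
            groups[toy_stem(word)].add(word.lower())
    return groups


def expand_term(term, groups):
    """Return the term plus all vocabulary words that share its stem form."""
    variants = set(groups.get(toy_stem(term), ()))
    variants.add(term.lower())
    return sorted(variants)


# Vocabularies for several languages can simply be concatenated.
vocabulary = ["trees", "tree", "houses", "house", "Baum"]
groups = build_stem_groups(vocabulary)
print(expand_term("tree", groups))  # ['tree', 'trees']
```

The sketch also shows why sorting is unnecessary: the grouping depends only on the stem of each word, not on its position in the file.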

Limitations of the stemmer transformer plugin

Stem forms vs. synonyms

The stemmer uses a primitive algorithm to find stem forms of a word and expands the search query additionally with a vocabulary. However, this only covers minor variations of a word (a few changed letters). This functionality is very useful for the majority of search queries, but may not be sufficient in special cases.

If the expansion of a word (e.g. “tree” to “trees”) is not working correctly, you can take the following measures:

  • If no vocabulary is being used, a vocabulary should be configured.
  • If an extensive vocabulary is already in use, we recommend including the corresponding word with synonyms in a synonym transformer. If the vocabulary were to be expanded, there would be no guarantee of success, since the existing vocabulary is usually very extensive and the stemmer uses a naive algorithm. If, however, you add a new synonym, you will be able to achieve the desired effect.

Known words that are difficult to stem

There are some words for which the stemmer transformer cannot correctly determine the respective stem forms. Known examples in German are “Autos,” “Nudeln,” and “Kiwis.” If these words affect the search quality, it is advisable to use a synonym transformer.

Term2DocumentBoost transformer plugin

The Term2DocumentBoost plugin enables relevance tuning for search queries. It supports the following use cases:

  1. Increase the relevance of particular documents for certain search queries. For example, a search for “help” can be tailored so that documents with the keyword “documentation,” for instance, are assigned a higher relevance in this search.
  2. Generally increase the relevance of certain documents. For instance, all documents with the keyword “Mindbreeze” can be assigned a higher relevance.
  3. Increase the relevance for matching metadata. For example, if you search for any person (search term: “John Smith”), documents by this person (metadata: “Author”) can receive a higher relevance.
  4. Generally influence the entire relevance model. For instance, change the relevance factor “Term Frequency” to change the priority of the frequency of search hits in the document.

Installation

  • Install the plugin using the Manager UI
  • Enable the plugin for each desired index using the Manager UI:
    • Go to the “Indices” tab and enable “Advanced Settings”
    • Scroll down to the section “Query Transformation Services”
    • Select the “Term2DocumentBoost” plugin and click “Add”
  • The plugin is configured via two files:
    • The "Term to Document Boost CSV File" is required for use cases 1, 2, and 3.
    • The "Default Relevance Options JSON File" is required for use case 4.
  • Configure the settings:
    • “Term to Document Boost CSV File Path”: path of the CSV file
    • “Default Relevance Options JSON File Path”: path of the JSON file

Then save the changes and restart the Mindbreeze node so that the changes take effect.

Configuration

General description of the Term to Document Boost CSV file format

The CSV file contains one row for each boosting, which in turn contains the following columns:

  • Term: the search term
  • Metadata key: the name of the metadata property to which the boosting is to be applied
  • Pattern: a pattern that determines the value to be boosted
  • Boost: the boost factor
  • Query: Optional. Expanded configuration. See the use case on advanced configuration with query below.

Only DocumentInfo metadata (i.e. data that is either aggregatable or regexmatchable) can be used as property here. A list of these properties is available in the designer under "Filter".

If several rules match at the same time, the rule with the largest boost factor is used. However, this behavior could change in future versions.

Note: Any change in the CSV file is applied immediately and will be reflected in the next search.

You can easily edit the CSV file in the Management Center under the menu item "Search Experience," submenu "Query Boostings".

Use case: increase the relevance of particular documents for certain search queries

Example for a CSV file:

Term;Metadata Key;Pattern;Boost

help;title;portal help|intranet help;5

When a user performs a search for help, documents containing the terms portal help or intranet help in the title will be boosted by a factor of 5.

Use case: increase the relevance of particular documents

Term;Metadata Key;Pattern;Boost

;extension;.*pdf;10

Leave the "Term" column empty. The document is boosted regardless of the user’s search query. For example, any document with the extension “pdf” can be boosted up or down.
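The two CSV examples above, together with the rule that the largest boost factor wins when several rules match, can be modeled with a short Python sketch. It is an illustration only: `load_rules` and `boost_for` are hypothetical names, and treating the pattern as a case-insensitive regular expression searched within the metadata value is an assumption about the matching semantics.

```python
import csv
import io
import re

# Both example rules from above, combined in one file.
BOOST_CSV = """Term;Metadata Key;Pattern;Boost
help;title;portal help|intranet help;5
;extension;.*pdf;10
"""


def load_rules(text):
    """Read the boost rules; the header row provides the column names."""
    return list(csv.DictReader(io.StringIO(text), delimiter=";"))


def boost_for(search_term, metadata, rules):
    """Return the largest boost among all matching rules, or 1.0 if none match."""
    matched = []
    for rule in rules:
        term = rule["Term"].strip()
        # An empty Term column applies to every search query.
        if term and term.lower() != search_term.lower():
            continue
        value = metadata.get(rule["Metadata Key"], "")
        # Assumption: the pattern is a regular expression, matched case-insensitively.
        if re.search(rule["Pattern"], value, re.IGNORECASE):
            matched.append(float(rule["Boost"]))
    return max(matched) if matched else 1.0


rules = load_rules(BOOST_CSV)
print(boost_for("help", {"title": "Intranet Help Center", "extension": "html"}, rules))  # 5.0
print(boost_for("help", {"title": "Portal help", "extension": "pdf"}, rules))            # 10.0
```

In the second call both rules match, so the larger factor (10) is used, in line with the behavior described for the current version.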

Introduction to the Mindbreeze relevance model

The Mindbreeze relevance model calculates a relevance count or rank for each result. This is also visible as metadata in Mindbreeze Export:

This rank or relevance count is calculated using the following parameters. The higher the count, the more important the result.

Recency

The more recent a result is, the higher the relevance count will be.

Term frequency

The more often the searched term is matched in the current hit, the higher the relevance ranking will be.

Term proximity

The smaller the distance between the matched terms in a result, the higher the relevance count will be.

Term inverse zone frequency

If two documents have the same number of matches but one of them contains many more other terms, the document with fewer other terms gets a higher rank.

Common misunderstandings and misinterpretations

It is important to note that boosting does not replace the relevance count; it only increases it multiplicatively. If the relevance count of a document is 20 and it is boosted by a factor of 2, the relevance is then 40. This can result in the following phenomenon. You want Result 2 to be in position 1:

Result 1: Rank = 2000

Result 2: Rank = 20

If you boost Result 2 by 10, it will still be in position 2 just like before boosting:

Result 1: Rank = 2000

Result 2: Rank = 200

You therefore need to boost Result 2 by a factor greater than 100 (for example, 101) in order to put it in the first position:

Result 2: Rank = 2020

Result 1: Rank = 2000
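The arithmetic behind this example can be checked with a few lines of Python. The helper names are hypothetical; the sketch only assumes what is stated above, namely that a boost multiplies the existing rank.

```python
def boosted_rank(rank, factor):
    """Boosting multiplies the existing rank; it does not replace it."""
    return rank * factor


def overtake_threshold(own_rank, competitor_rank):
    """Any boost factor strictly greater than this value moves the result ahead."""
    return competitor_rank / own_rank


print(boosted_rank(20, 10))          # 200, still below 2000
print(overtake_threshold(20, 2000))  # 100.0
print(boosted_rank(20, 101))         # 2020, now above 2000
```

The required factor depends on the ratio of the two ranks, so it has to be re-evaluated whenever the underlying ranks change.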

Use case: increasing relevance for matching metadata/advanced configuration with query

To achieve more flexibility with boosting, you can also add an additional "Query" column. Here you can specify a query directly with the Mindbreeze InSpire Query Language, which determines the documents to be boosted.

Note: If you use the "Query" column, the "Metadata Key" and "Pattern" columns will be ignored.

Example for a CSV file:

Term;Metadata Key;Pattern;Boost;Query

help;;;3;"datasource/mes:key:""http://myweb.com/help-index.html"""

When a user performs a search for help, documents found with the query datasource/mes:key:"http://myweb.com/help-index.html" will be boosted by a factor of 3. Please note the correct use of the special characters: the Query field is enclosed in double quotes, and each double quote inside it is doubled.

You can also use the placeholder {{query}} in the query. This placeholder is substituted dynamically by the search query during a search.

Note: If you use {{query}}, the "Term" column will also be ignored.

Term;Metadata Key;Pattern;Boost;Query

;;;7;"Author:""{{query}}"""

If the searched term is an exact author name, these documents are boosted by a factor of 7. For instance, when a user performs a search for the term John Smith, documents found with the query Author: “John Smith” will be boosted by a factor of 7.
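The CSV quoting and the placeholder substitution can be illustrated with the following Python sketch. The function names `parse_rule` and `resolve_query` are hypothetical, and the plain string replacement is an assumption; how the plugin treats special characters inside the user's search query is not covered here.

```python
import csv
import io

# The rule line from the example above, exactly as written in the CSV file.
CSV_LINE = ';;;7;"Author:""{{query}}"""'


def parse_rule(line):
    """Split one CSV line into its five columns, undoing the doubled quotes."""
    term, metadata_key, pattern, boost, query = next(
        csv.reader(io.StringIO(line), delimiter=";")
    )
    return {"term": term, "boost": float(boost), "query": query}


def resolve_query(template, user_query):
    """Substitute the {{query}} placeholder with the user's search query."""
    return template.replace("{{query}}", user_query)


rule = parse_rule(CSV_LINE)
print(rule["query"])                               # Author:"{{query}}"
print(resolve_query(rule["query"], "John Smith"))  # Author:"John Smith"
```

Parsing the line with a CSV reader makes the role of the doubled quotes visible: after parsing, the Query column contains single double quotes around the placeholder.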

Use case: general influence of the relevance model

You can generally adjust all parameters of the relevance model. This is done via the Default Relevance Options JSON file.

It is not advisable to edit this JSON file manually. Instead, use the item "Relevance" under the menu item "Search Experience" in the Management Center.

Note: These parameters are a fundamental part of the relevance model; small changes can have a major impact on the order of the search results. It is possible that the boosting factors in the CSV will have to be adjusted at a later time.

The following sections describe which parameters can be adjusted.

For more information, see:

  • Mindbreeze InSpire Configuration Manual, Indices tab
  • Manual api.v2.search Interface Description

Relevance factors (term frequency, document frequency)

  • The individual entries can be used to determine how the relevance parameters influence the relevance ranking. The relative share of the individual factors is the percentage share of this parameter.

Serial

The influence of recency (document date mes:date) on the relevance. Documents from the last two years (25 months) are considered “recent”. Anything older than two years is generally treated as not recent.

Term frequency

Absolute frequency of words

Doc frequency

Relative frequency of words in the document – TF-IDF

Term proximity

Distance between the hit terms in the text

Term inverse zone frequency

Maximum relative frequency of words in individual zones – max TF-IZF

Zone boost exponent

Influence of document property boosting on relevance ranking (0 means it will be ignored)

Term boost exponent

Influence of search term boosting on relevance ranking (0 means it will be ignored)

Doc boost exponent

Influence of mes:boost property on relevance ranking (0 means it will be ignored)

Term match exponent

Influence of the matching of terms (interesting for the OR function) on relevance ranking (0 means it will be ignored)

Constant

A constant component. This is particularly useful if Term boosting/Document boosting/Zone boosting is used exclusively and you do not want to use the remaining components (e.g. Term proximity, Serial).

Term boost IDF exponent

IDF = Inverse document frequency. The frequency of the occurrence of a term in many documents should have an effect on the calculation of the term boost. A high exponent means: less frequent words are weighted more strongly. A low exponent means: frequent words are weighted more weakly. 0 means that this option will be ignored.

Zone boosting (metadata boosting)

Zone boosting is another way to change the order of the search results. Boost factors can be configured for so-called zones. A zone is nothing more than a piece of document metadata. If you want documents that are found based on a certain metadata to be ranked higher in the search results, you can define a boost factor for this metadata (= zone). In the above example, documents found on the basis of the metadata “Author” are classified as more relevant by a factor of 1.05. Valid values of the boost factor are real numbers greater than or equal to one with a decimal separator “.” (≥ 1.0).

Document boosting (alternative to Term to Document Boost CSV)

Using “Document boosting,” you can also change the relevance of certain documents. The relevance of documents that are found based on a search query can be changed by the “Boost factor” for all documents that match the “Query Expr”. In the above example, documents found that originate from the author “Legend User” are rated more relevant by a factor of 1.1.

Valid values of the Boost factor are:

  • To decrease weighting: real numbers greater than zero and less than one (> 0.0 ∧ < 1.0) with decimal separator “.”
  • To increase weighting: real numbers greater than one (> 1) with decimal separator “.”
  • The Boost factor 1 has no impact

Term boosting (term and Ngram boosts)

Term boost factor

Boost factor for exact matches (1.0)

Ngram boost factor

Boost factor for partial word matches (1.0). This option is only relevant if the following setting is enabled in the Management Center under “Configuration” -> “Client Services” -> “Enable Character NGRAMs” (“Advanced Settings” must be enabled). This setting is enabled by default.

Congruence boost factor

Boost factor for character congruence (e.g. “a” vs. “ä”). This option is only relevant if the following setting is enabled in the Management Center under “Configuration” -> “Client Services” -> “Query Expansion for Diacritic Term Variants” (“Advanced Settings” must be enabled). This setting is enabled by default.

Distance boost reduction

Boost decrease for each change = Edit distance (e.g. “Mindbreze” vs. “Mindbreeze”). This option is only relevant if the following setting is enabled in the Management Center under “Configuration” -> “Client Services” -> “Enable Query Expansion for Similar Term” (“Advanced Settings” must be enabled). This setting is enabled by default.
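To make the term “edit distance” concrete, the following Python sketch computes the classic Levenshtein distance (insertions, deletions, and substitutions each count as one change). Whether Mindbreeze uses exactly this definition, and how the reduction is applied per change, is not stated here; the sketch only illustrates the concept.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


print(edit_distance("Mindbreze", "Mindbreeze"))  # 1
```

The misspelling from the example above is one change away from the correct term, so the reduction would be applied once for that match.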

Additional Features

Did you mean?

If a search returns no results because a word in the search term was misspelled, Mindbreeze suggests an alternative search term (based on internal index statistics and analysis) that would find better results. This feature is called “Did you mean?”.

Entity Recognition

Entity recognition can be used to extract metadata from the document content or from other metadata properties of the documents which may be used for more efficient searches afterwards.

This topic is described in detail in “Documentation – Mindbreeze Inspire”. For details please read the documentation on “Indices tab”.

CSV Transformation

The CSV transformation extends indexed documents with additional metadata so that results are easier to find. It maps well-defined values from the documents to other value columns stored in a CSV file.

This feature can be quite helpful to extend your index with technical terms, abbreviations, topics or even short descriptions for your documents in special use cases.

Example: a city ZIP code directory

ZIP;City;Province

4020;Linz;Central Upper Austria
1020;Vienna;Capital City of Austria
9861;Krems;Forest Quarter
4400;Steyr;Traun Quarter

The first line of this sample CSV is the header row, which defines the column names used to map the data. The other lines contain the values for each mapping column. So if you search for the term “quarter”, you will find search results for the two cities Steyr and Krems.

Another example would be the mapping of technical product data stored in a CSV file to the base articles on your web site. The mapping could be accomplished using the product ID extracted from the product web site and the CSV file contains a set of columns describing the article (product ID, category, price, dimensions, etc.).

Configuration

As this feature is part of the Mindbreeze base product, you do not have to install any additional plugins; you only have to configure it.

  • Switch to the “Indices” tab and activate “Advanced Settings”
  • Scroll down to the section “CSV Transformation”
  • Specify the path to the CSV file containing the data mappings (either a local file system path or a network path, as appropriate for the operating system in use)
  • Example 1: CSV File Path = C:\data\csv-mappings.csv
  • Example 2: CSV File Path = \\fileserver.x.y\config\csv-mappings.csv

For every metadata property (column) you want to extract from the CSV file, add a new metadata definition with the following property settings:

For every metadata property (column) you want to extract from the CSV file add a new metadata definition with following property settings:

  • If Expression Matches: {{ZIP}} … this is the name of the mapping column in the CSV file (header name of the column containing the keys to map the documents)
  • In Property: customer_zipcode … this is the source document metadata property from the indexed document used to map the results (this could also be mes:key or any other property)
  • Name: City … this is the desired metadata name of the new property to extract (will be available for searching and, if listed in the categoryDescriptor, also visible in the results)
  • Value: {{City}} … this is the name of the desired target column in the CSV file (header name of the column to be extracted)
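The effect of such a metadata definition can be sketched in Python using the ZIP code example from above. This is a simplified model, not the product's implementation: `load_mapping` and `enrich` are hypothetical names, documents are represented as plain dictionaries, and the key is assumed to match the CSV value exactly.

```python
import csv
import io

# The city ZIP code directory from the example above.
MAPPING_CSV = """ZIP;City;Province
4020;Linz;Central Upper Austria
1020;Vienna;Capital City of Austria
9861;Krems;Forest Quarter
4400;Steyr;Traun Quarter
"""


def load_mapping(text, key_column):
    """Index the CSV rows by the mapping column (here: ZIP)."""
    reader = csv.DictReader(io.StringIO(text), delimiter=";")
    return {row[key_column]: row for row in reader}


def enrich(document, mapping, in_property, name, value_column):
    """Return a copy of the document with one extra property if its key is found."""
    enriched = dict(document)
    row = mapping.get(document.get(in_property))
    if row is not None:
        enriched[name] = row[value_column]
    return enriched


mapping = load_mapping(MAPPING_CSV, key_column="ZIP")
document = {"customer_zipcode": "4020"}
print(enrich(document, mapping, "customer_zipcode", "City", "City"))
# {'customer_zipcode': '4020', 'City': 'Linz'}
```

The four arguments correspond to the settings listed above: the mapping column (“If Expression Matches”), the source property (“In Property”), the new metadata name (“Name”), and the target column (“Value”). A second definition with the value column Province would add the province in the same way.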