Copyright © Mindbreeze GmbH, A-4020 Linz, 2024.
All rights reserved. All hardware and software names are brand names and/or trademarks of their respective manufacturers.
These documents are strictly confidential. The submission and presentation of these documents does not confer any rights to our software, our services and service outcomes, or other protected rights. The dissemination, publication, or reproduction hereof is prohibited.
For ease of readability, gender differentiation has been waived. Corresponding terms and definitions apply within the meaning and intent of the equal treatment principle for both sexes.
This document deals with the concept, setup, and troubleshooting methods for configuring entity recognition.
In this chapter, the concept of entity recognition is explained using a simple example.
Follow these steps to set up:
Configuring the Entity Recognition Parameters enables the index service to extract metadata from document contents. The following settings are available:
Setting | Description |
Pattern Rules | Defines the set of rules that are applied during metadata extraction. Each rule is defined with a regex pattern. Note that regular expressions must be enclosed in "/" characters. |
Pattern Add Region Annotations | Adds the value of the setting "Use Link HREF pattern" to the annotation. |
Process HTML Attributes | Enables Entity Recognition to also search inside HTML attribute values such as link references, for example <a href="link to be searchable">link</a> in the HTML source text. |
HTML Attribute Name Pattern | Defines which attribute names are searched. It is defined with a regex pattern. In most cases, “href” is enough. Further names can be added with “|” (“OR”), for example “href|link|…”. |
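As a rough illustration of what “Process HTML Attributes” and “HTML Attribute Name Pattern” describe, the following Python sketch collects the values of attributes whose names match a pattern such as href|link. It is not Mindbreeze code: the class name, the sample HTML, and the decision to match the whole attribute name are assumptions made for this example.

```python
import re
from html.parser import HTMLParser

# Hypothetical stand-in for the "HTML Attribute Name Pattern" setting.
ATTRIBUTE_NAME_PATTERN = re.compile(r"href|link")


class AttributeCollector(HTMLParser):
    """Collects the values of attributes whose names match the pattern."""

    def __init__(self):
        super().__init__()
        self.values = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # Assumption: the whole attribute name has to match the pattern.
            if value is not None and ATTRIBUTE_NAME_PATTERN.fullmatch(name):
                self.values.append(value)


def collect_attribute_values(html):
    collector = AttributeCollector()
    collector.feed(html)
    return collector.values


sample = '<p>See <a href="https://www.example.com/item?id=42" title="x">the item</a>.</p>'
print(collect_attribute_values(sample))
```

The collected values are the additional text in which the pattern rules could then search, next to the visible content.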
In the section “Add Metadata Definition” rules can be defined for each metadata. The following settings are available:
Setting | Description |
If Rule Matches | The rule that defines the range of the content from which metadata is extracted. Enter the name of a rule that is defined in the setting “Pattern Rules”. |
Name | The name of the metadata to be added to a document when the defined rule matches. |
Value | A rule that defines the value of the metadata. The value can reference a rule, for example {{month}}, and can be plain text or a composite of text and rule references. |
Scope | The name of an entity recognition rule that selects one or more areas in which the extraction rules are applied. In contrast to “Value”, the rule name is entered without “{{}}”. |
Format | Enables the extraction of typed metadata, such as a date, from a string. The known types are “String”, “Date”, and “Number”. Only “Date” needs the extra parameters “Format Options” and “Locale”. |
Format Options | Mandatory for the format “Date”. Sets the formatting of the output. The exact definition can be found here: https://github.com/unicode-org/icu/blob/main/docs/userguide/format_parse/datetime/index.md#datetime-format-syntax. The pattern defines the order and the fields to output, for example "yyyy.MMMM.dd HH:mm" to print 2024.July.05 11:33. |
Locale | Only used for the format “Date”. Set the locale if the machine locale and the user locale differ, for example ja_JP to display the Japanese default date format. See https://github.com/unicode-org/icu/blob/main/docs/userguide/format_parse/datetime/index.md#datetimepatterngenerator. |
In Existing Metadata | Defines to which metadata these rules should apply. For example: content, title, datasource/mes:key, <ownmetadatum>, etc. |
Aggregatable | If checked, the generated metadatum will be static aggregatable. |
Use Value for Sentence Embeddings | If this setting is activated, the recognized entities can be found with a Sentence Similarity Search (NLQA). |
Annotate As | Defines how the entity is added to the metadata. The settings below refer to the options “Link”, “Entity”, and “Entity And Link”. |
Add Link With URL Pattern | Defines a pattern for the annotation link, if the setting “Annotate As” is set to “Link” or “Entity And Link”. It can use the regex definitions from the setting "Pattern Rules", which can be configured the same way as the setting “Value”. Used for something like: www.mindbreeze.com/link_to_item?item={{RuleName}} |
Entity Label | Name of the created entity, if the setting “Annotate As” is set to “Entity” or “Entity And Link”. |
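The {{RuleName}} placeholders used in “Value” and “Add Link With URL Pattern” can be pictured as a simple template substitution. The following Python sketch shows the idea under stated assumptions: the function name, the placeholder regex, and the sample rule matches are hypothetical and do not describe the actual Mindbreeze implementation.

```python
import re

# Matches placeholders of the form {{RuleName}}.
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def expand_template(template, rule_matches):
    """Replace each {{RuleName}} with the text matched by that rule."""
    def substitute(match):
        return rule_matches[match.group(1)]
    return PLACEHOLDER.sub(substitute, template)


# Hypothetical rule matches for one document.
matches = {"RuleName": "A-1234", "month": "07"}

print(expand_template("www.mindbreeze.com/link_to_item?item={{RuleName}}", matches))
print(expand_template("Month {{month}}", matches))
```

The second call corresponds to a composite value: plain text combined with a rule reference.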
This chapter uses a simple example to explain entity recognition and its setup with Mindbreeze.
First the rules for the extraction have to be created:
host=/[^\\]+/.
share=/[^\\]+/.
directory=/[^\\]+/.
UNCPath="\\\\" host "\\" share "\\" directory "\\".
If rule matches: UNCPath
Name: Laufwerk
Value: {{share}}
In existing metadata: datasource/mes:key
If rule matches: UNCPath
Name: Projektpfad
Value: {{directory}}
In existing metadata: datasource/mes:key
Aggregated metadata keys (; separated)
Laufwerk;Projektpfad
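To see what the UNCPath rule extracts, the rules can be approximated with an ordinary regular expression in which named groups stand in for the sub-rules. This Python sketch is only an approximation of the rule syntax; the function name and the sample key (including the file name) are assumptions.

```python
import re

# Python approximation of the example rules; named groups replace sub-rules.
UNC_PATH = re.compile(
    r"\\\\(?P<host>[^\\]+)"     # "\\\\" host
    r"\\(?P<share>[^\\]+)"      # "\\" share
    r"\\(?P<directory>[^\\]+)"  # "\\" directory
    r"\\"                       # trailing "\\"
)


def extract_metadata(key):
    """Return the two metadata values defined above for a mes:key value."""
    match = UNC_PATH.search(key)
    if match is None:
        return {}
    return {
        "Laufwerk": match.group("share"),
        "Projektpfad": match.group("directory"),
    }


# Hypothetical datasource/mes:key value.
key = r"\\fileserver.mycompany.com\qa\fstest\projekte\plan.docx"
print(extract_metadata(key))
```

For this key, the share is the second path component and the directory is the third, so Laufwerk receives "qa" and Projektpfad receives "fstest".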
Date formats for entity recognition are based on the ICU patterns (e.g. locale … de_AT)
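ICU patterns use letter runs such as yyyy, MMMM, and dd to select the fields to output. The following Python sketch illustrates this by translating a small subset of ICU fields to strftime directives. It is an illustration only: Python does not use ICU here, the mapping table covers just the fields listed, literal percent signs are not handled, and the month name depends on the process locale (English in Python's default "C" locale).

```python
import re
from datetime import datetime

# Assumption: only this small subset of ICU pattern fields is handled.
ICU_TO_STRFTIME = {
    "yyyy": "%Y",  # four-digit year
    "MMMM": "%B",  # full month name
    "MM": "%m",    # two-digit month
    "dd": "%d",    # two-digit day
    "HH": "%H",    # hour, 00-23
    "mm": "%M",    # minute
    "ss": "%S",    # second
}
# Longer fields first so that "MMMM" is not read as two "MM" fields.
TOKEN = re.compile("|".join(sorted(ICU_TO_STRFTIME, key=len, reverse=True)))


def format_icu_subset(value, pattern):
    """Format a datetime with a pattern built from the ICU subset above."""
    strftime_pattern = TOKEN.sub(lambda m: ICU_TO_STRFTIME[m.group(0)], pattern)
    return value.strftime(strftime_pattern)


print(format_icu_subset(datetime(2024, 7, 5, 11, 33), "yyyy.MMMM.dd HH:mm"))
```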
Even a complex case with ambiguous rules can be handled by defining alternative rules, ordering them by name, and putting the multiple metadata extraction definitions in the correct order. The path metadatum is lower-case, which makes it easier to use for CSV mapping.
An OR (|) operator between sub-rules does not work!
> Simple solution without exception:
Pattern rules:
LWPath=/\\\\[^\\]+\\[^\\]+\\[^\\]+\\[^\\]+/.
FilePath=/[^\\]+/.
FullPath=LWPath "\\" FilePath.
> Solution with an exception (data\it):
Pattern rules:
ASpecialPath="data\\it".
OtherPath=/[^\\]+/.
BaseShare=/\\\\[^\\]+\\[^\\]+\\[^\\]+/.
LWPathA= BaseShare "\\" ASpecialPath.
LWPathOther= BaseShare "\\" OtherPath.
FilePathA=/[^\\].*/.
FilePathOther=/[^\\].*/.
FullPathA=LWPathA "\\" FilePathA.
FullPathOther=LWPathOther "\\" FilePathOther.
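The exception works because the more specific rule is tried first: as noted in the troubleshooting chapter, rules are analyzed in alphabetical order and the first complete match wins, so FullPathA is tried before FullPathOther. The Python sketch below mimics that behavior with ordinary regular expressions. The dictionary, the function name, and the sample keys are assumptions for illustration.

```python
import re

# Python form of BaseShare: \\host\component\component
BASE_SHARE = r"\\\\[^\\]+\\[^\\]+\\[^\\]+"

# Rule name -> Python approximation; names mirror the rules above.
RULES = {
    "FullPathOther": re.compile(
        r"(?P<lw>" + BASE_SHARE + r"\\[^\\]+)\\(?P<file>[^\\].*)"
    ),
    "FullPathA": re.compile(
        r"(?P<lw>" + BASE_SHARE + r"\\data\\it)\\(?P<file>[^\\].*)"
    ),
}


def first_match(key):
    """Try the rules in alphabetical order; the first match wins."""
    for name in sorted(RULES):
        match = RULES[name].match(key)
        if match:
            return name, match.group("lw")
    return None


special = r"\\fileserver.mycompany.com\qa\fstest\data\it\report.txt"
regular = r"\\fileserver.mycompany.com\qa\fstest\projekte\plan.docx"
print(first_match(special))
print(first_match(regular))
```

The key below data\it is resolved by FullPathA with the longer base path, while every other key falls through to FullPathOther.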
The following screenshot demonstrates the configuration of the rules.
CSV transformation: the extracted value (file share) is compared case-sensitively, so its case must match the entries in the CSV file. That way the path can be used as the source metadatum.
fileshare;letter
\\fileserver.mycompany.com\qa\fstest\projekte;U:
\\fileserver.mycompany.com\qa\fstest\vorlagen;T:
\\fileserver.mycompany.com\qa\fstest\allgemein;G:
\\fileserver.mycompany.com\qa\fstest\spezial;M:
\\fileserver.mycompany.com\qa\fstest\data\it;H:
\\fileserver.mycompany.com\qa\fstest\data;H:
\\fileserver.mycompany.com\qa\fstest\data-services;H:
\\fileserver.mycompany.com\qa\fstest\allgemein-retail;G:
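The effect of the case-sensitive comparison can be reproduced with a plain dictionary lookup. The following Python sketch loads two of the rows above and looks up an extracted file share. It only illustrates the matching behavior and is not the CSV transformation service itself; the function and variable names are assumptions.

```python
import csv
import io

# Two rows from the mapping above, in the same semicolon-separated format.
CSV_TEXT = r"""fileshare;letter
\\fileserver.mycompany.com\qa\fstest\projekte;U:
\\fileserver.mycompany.com\qa\fstest\data\it;H:
"""


def load_mapping(text):
    """Read the fileshare -> letter mapping from semicolon-separated text."""
    reader = csv.DictReader(io.StringIO(text), delimiter=";")
    return {row["fileshare"]: row["letter"] for row in reader}


mapping = load_mapping(CSV_TEXT)
extracted = r"\\fileserver.mycompany.com\qa\fstest\projekte"

print(mapping.get(extracted))          # same case as in the CSV file
print(mapping.get(extracted.upper()))  # different case: no entry found
```

Because the lookup is exact, extracting the path in lower case and keeping the CSV entries in lower case avoids missed mappings.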
Matching with mes:key is only possible in CSV transformation (as well as in ER rules) with: In Property = datasource/mes:key.
Please note: the /documents servlet does not return values that are only created by an index re-invert!
This chapter deals with troubleshooting the entity recognition rules.
Check the index status at http://localhost:8443/index/<Indexport>/statistics
Entity Recognition rules are usually greedy. In the following example, the selected rows are matched:
Rule
R1=/ (?s)(test)(?P<line>.+)\s+(.*Page) /.
Match:
If greedy matching is deactivated with (?U), however, not everything is matched. Instead, only those blocks that start with “test” and end with “Page” are matched:
Rule:
(?U)(?s)(test)(?P<line>.+)\s+(.*Page)(?U)
Match:
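The difference between greedy and non-greedy matching can be tried out with the following Python sketch. Python's built-in re module has no (?U) flag, so the non-greedy variant is written with lazy quantifiers (+? and *?) instead; this substitution and the sample text are assumptions made for this example.

```python
import re

# Hypothetical document text with two "test ... Page" blocks.
text = "test one\nPage\nsome other text\ntest two\nPage\n"

# Greedy: ".+" runs as far as possible, so one match spans both blocks.
GREEDY = re.compile(r"(?s)(test)(?P<line>.+)\s+(.*Page)")

# Lazy quantifiers stand in for the (?U) flag used in the rule above.
LAZY = re.compile(r"(?s)(test)(?P<line>.+?)\s+?(.*?Page)")

greedy_matches = [m.group(0) for m in GREEDY.finditer(text)]
lazy_matches = [m.group(0) for m in LAZY.finditer(text)]

print(len(greedy_matches))
print(lazy_matches)
```

The greedy pattern yields a single match from the first “test” to the last “Page”, while the lazy pattern yields one match per block.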
An error with the following message can occur while parsing the ER rules:
“MesQuery::Text::RE2Tokenizer ERROR: Matched empty (epsilon) token, pattern is”
For instance, a "\" at the end of a regex is not supported: LWPath=/\\\\[^\\]+\\/. causes an error. Use LWPath=/\\\\[^\\]+/ "\\". instead.
Patterns such as “.*” in rules can also cause problems.
Entity recognition rules are analyzed in alphabetical order and the first complete match wins.
In regex rules for German words, \w does not match all characters (umlauts, etc.). Instead, you can use \pL to match all Unicode letters.
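The effect can be reproduced in Python under two assumptions: the re.ASCII flag is used to imitate an engine whose \w only covers ASCII characters, and the class [^\W\d_] is used as a stand-in for \pL, because Python's built-in re module does not support \pL. The sample words are hypothetical.

```python
import re

text = "Größe Straße Übung"

# ASCII-only \w: umlauts and ß break the words apart.
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)

# Unicode letters only (stand-in for \pL): word characters minus digits and "_".
letter_runs = re.findall(r"[^\W\d_]+", text)

print(ascii_words)
print(letter_runs)
```

With the ASCII-only class the words are split into fragments, whereas the letter class returns each word in full.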
If Entity Recognition should be applied on Content, either by setting Name to “Content” or “.*”, then Content has to be manually added as aggregatable. This can be done by one of the two methods:
\d{4}(\s|\.|\-)\d{6}
Example
1237 010180
1237.010180
1237-010180
(\+)([\s.\(\)]*\d{1}){8,13}(-)?(\d{1,5})
Example
+43 732 606162-0
+43 732 606162-609
+49(732)606162-609
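A pattern like this can be checked against the examples before it is entered as a rule. The following Python sketch does this with the standard re module; the rule engine may differ from Python in details, and the counter-example without a leading “+” is an assumption added for contrast.

```python
import re

# The phone number pattern from above, unchanged.
PHONE = re.compile(r"(\+)([\s.\(\)]*\d{1}){8,13}(-)?(\d{1,5})")

examples = [
    "+43 732 606162-0",
    "+43 732 606162-609",
    "+49(732)606162-609",
]

for example in examples:
    print(example, bool(PHONE.fullmatch(example)))

# A number without the leading "+" is not matched.
print(bool(PHONE.fullmatch("0732 606162-0")))
```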
RegEx
z1=/\d/.
z2=/\d/.
(…)
Dlmtr=/[\s\-_.:]?/.
z1 Dlmtr z2 Dlmtr z3 Dlmtr z4 Dlmtr z5 Dlmtr z6.
Example
12-34567
12 34 56-7
1-2 3456.7
((\d{1,3}(\.(\d){3})*)|\d*)(,\d{1,2})
Example
0,84
100.000,49
100.000,00
1.000.000.000.000,00
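The pattern above expects a period as the thousands separator and a comma as the decimal separator. The following Python sketch checks it against values in that format and, for contrast, against the same kind of value written with English separators. It uses Python's re module as a stand-in for the rule engine, and the sample values are assumptions.

```python
import re

# The number pattern from above, unchanged.
NUMBER = re.compile(r"((\d{1,3}(\.(\d){3})*)|\d*)(,\d{1,2})")

# Period as thousands separator, comma as decimal separator.
matching = ["0,84", "100.000,49", "1.000.000.000.000,00"]
# English-style separators are not matched as a whole.
not_matching = ["0.84", "100,000.49"]

for value in matching + not_matching:
    print(value, bool(NUMBER.fullmatch(value)))
```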
Handbook for date formats: http://userguide.icu-project.org/formatparse/datetime
(([0-1]?[0-9])|([2][0-3])):([0-5]?[0-9])(:([0-5]?[0-9]))?
Example
11:00:23
12:30
([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})
Example
david.porter@inspire.mindbreeze.com
AT\d{18}
Example
AT002105017000123456
In this example, a list of entries separated by commas is interpreted as a list in Mindbreeze InSpire as well.
Input: List of word, word,…
value=/[^\s,][^,]*[^,\s]?/.
rule=/\s*/value/\s*(,\s*|$)/.
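To show which values the two rules pick out, they can be combined into a single Python regular expression in which a named group takes the place of the value sub-rule. This is a sketch under that assumption; the sample input is hypothetical.

```python
import re

# "value" becomes a named group; the surrounding parts mirror "rule".
ENTRY = re.compile(r"\s*(?P<value>[^\s,][^,]*[^,\s]?)\s*(?:,\s*|$)")


def split_entries(text):
    """Return the list entries found in a comma-separated string."""
    return [match.group("value") for match in ENTRY.finditer(text)]


print(split_entries("word one, word two,word three"))
```

Each entry ends at the next comma or at the end of the input, and whitespace after a comma is skipped.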