Metadata Enrichment

Configuration

Copyright ©

Mindbreeze GmbH, A-4020 Linz, 2018.

All rights reserved. All hardware and software names are brand names and/or trademarks of their respective manufacturers.

These documents are strictly confidential. The submission and presentation of these documents does not confer any rights to our software, our services and service outcomes or other protected rights. The dissemination, publication or reproduction hereof is prohibited.

For ease of readability, gender differentiation has been waived. Corresponding terms and definitions apply within the meaning and intent of the equal treatment principle for both sexes.

General Information

For the enrichment processes to take effect, the new metadata must be added to the aggregated metadata keys of the index.

Introduction

This chapter deals with the concept, setup, and troubleshooting methods for configuring entity recognition.

Entity recognition configuration

In this chapter, the concept of entity recognition is explained using a simple example.
Follow these steps to set it up:

  • Connect to the Management Center.
  • Navigate to the index that you want to configure with entity recognition.
  • Activate and then open the advanced settings.
  • Search for the “Entity recognition parameters” setting in the Management Center.
  • In the pattern-rules field, define your entity recognition rules, which should match your metadatum.
  • Regular expressions in rules use the following syntax: https://github.com/google/re2/wiki/Syntax
  • In our concrete example:

    rule=/\// digits /\//. 
    digits=/\d+/.


    Explanation

    The first rule defines that all numbers between two slashes should match (regex):
    Example: test/1234test1234/test/543/test (543 is extracted)
  • Now add a new metadata definition to apply the rules to metadata.
  • In this example, Mindbreeze searches the string of the existing metadatum (the “full string”) for numbers between two slashes. If there are any, Mindbreeze takes the part of the match defined by the sub-rule “digits” and writes it as a string to the new metadatum “myextractedVal”.

    Example: 
    Full string: xyz/1234/herbert543/345test
    Match of the rule “rule”: /1234/
    Value of the rule “digits”: 1234
    Value of the metadatum myextractedVal==1234
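
The behaviour of the two rules can be approximated outside Mindbreeze with an ordinary regular expression. The following Python sketch is only an illustration and not Mindbreeze code: it assumes that the sub-rule “digits” behaves like a named capture group and that the first match in the string is used. The function name extract_digits is hypothetical.

```python
import re

# Approximation of: rule=/\// digits /\//.   digits=/\d+/.
# The named group plays the role of the sub-rule "digits" (assumption).
RULE = re.compile(r"/(?P<digits>\d+)/")


def extract_digits(full_string):
    """Return the value of "digits" for the first match of "rule", or None."""
    match = RULE.search(full_string)
    return match.group("digits") if match else None


print(extract_digits("xyz/1234/herbert543/345test"))
print(extract_digits("test/1234test1234/test/543/test"))
```

With the two example strings from this chapter, the sketch yields the same values as described above, since only digits enclosed by a slash on both sides qualify.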

Notes on configuration in the Management Center

When configuring a metadatum in Mindbreeze InSpire, the following fields must be filled in:

  • If rule matches (name of the rule)
  • Name (name of the metadatum)
  • Value (value of the rule, e.g. {{Month}}; can be plain text or a composite, for example: “Date {{Day}}.{{Month}}.{{Year}}”)
  • Format (format of the value: “String”, “Date”, “Number”)
  • Format options (format options, especially for dates, as with SimpleDateFormat)
  • In existing metadata (the area where the rule is applied, for example: content, title, datasource/mes:key, <ownmetadatum>, etc.)
  • Scope: With the Scope setting it is possible to select an area or several areas with one entity recognition rule, in which the rules for extraction are to be applied. For this purpose, the name of the rule for selecting the area(s) is entered in the scope field. In contrast to value extraction, you have to enter the name without {{}}.

Entity recognition (example: file system)

This chapter uses a simple example to explain entity recognition and its setup with Mindbreeze.

Configuration of entity recognition for a file system:

First the rules for the extraction have to be created:

host=/[^\\]+/.

share=/[^\\]+/.

directory=/[^\\]+/.

UNCPath="\\\\" host "\\" share "\\" directory "\\".

If rule matches: UNCPath

Name: Laufwerk

Value: {{share}}

In existing metadata: datasource/mes:key

If rule matches: UNCPath

Name: Projektpfad

Value: {{directory}}

In existing metadata: datasource/mes:key

Aggregated metadata keys (; separated)

Laufwerk;Projektpfad
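
To check such path rules before entering them, the composite rule can be approximated with a single regular expression. The following Python sketch is an illustration under assumptions: the quoted string "\\" in a rule is taken to stand for one literal backslash, the sub-rules are modelled as named groups, and the sample key is hypothetical.

```python
import re

# Approximation of:
#   UNCPath="\\\\" host "\\" share "\\" directory "\\".
# with host, share and directory each defined as /[^\\]+/.
UNC_PATH = re.compile(
    r"\\\\(?P<host>[^\\]+)\\(?P<share>[^\\]+)\\(?P<directory>[^\\]+)\\"
)


def extract_unc_metadata(mes_key):
    """Return the two metadata configured above, or {} if UNCPath does not match."""
    match = UNC_PATH.search(mes_key)
    if match is None:
        return {}
    return {
        "Laufwerk": match.group("share"),         # Value: {{share}}
        "Projektpfad": match.group("directory"),  # Value: {{directory}}
    }


# Hypothetical key value, used only for illustration.
key = r"\\fileserver\projects\alpha\report.docx"
print(extract_unc_metadata(key))
```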

Date formats for entity recognition are based on the ICU patterns (e.g. locale … de_AT)

Configuration for entity recognition for file system paths (variant 2) – with exceptions:

Even complex cases in which the rules are ambiguous can be handled by using alternative rules, ordering the rules by name, and ordering the multiple metadata extractions correctly. The path metadatum is lower-case, which makes it better suited for CSV mapping.

An OR (|) operator between sub-rules does not work!

> Simple solution without exception:

Pattern rules:

LWPath=/\\\\[^\\]+\\[^\\]+\\[^\\]+\\[^\\]+/.

FilePath=/[^\\]+/.

FullPath=LWPath "\\" FilePath.

> Solution with an exception (data\it):

Pattern rules:

ASpecialPath="data\\it".

OtherPath=/[^\\]+/.

BaseShare=/\\\\[^\\]+\\[^\\]+\\[^\\]+/.

LWPathA= BaseShare "\\" ASpecialPath.

LWPathOther= BaseShare "\\" OtherPath.

FilePathA=/[^\\].*/.

FilePathOther=/[^\\].*/.

FullPathA=LWPathA "\\" FilePathA.

FullPathOther=LWPathOther "\\" FilePathOther.

The following screenshot demonstrates the configuration of the rules.

CSV transform: the extracted value (file share) is case-sensitive, so upper and lower case must match exactly. That way the path can be used as the source metadatum.

fileshare;letter

\\fileserver.mycompany.com\qa\fstest\projekte;U:

\\fileserver.mycompany.com\qa\fstest\vorlagen;T:

\\fileserver.mycompany.com\qa\fstest\allgemein;G:

\\fileserver.mycompany.com\qa\fstest\spezial;M:

\\fileserver.mycompany.com\qa\fstest\data\it;H:

\\fileserver.mycompany.com\qa\fstest\data;H:

\\fileserver.mycompany.com\qa\fstest\data-services;H:

\\fileserver.mycompany.com\qa\fstest\allgemein-retail;G:

A match on mes:key works in the CSV transformation (as well as in ER rules) only with: In Property = datasource/mes:key.

Please note: the /documents servlet does not return values that only arise through an index re-invert!

Troubleshooting entity recognition

This chapter deals with troubleshooting the entity recognition rules.

Important information

  1. In Mindbreeze InSpire, regular expressions are enclosed in “/” characters.
  2. Each rule entry must be terminated by a period.
  3. Rule names may not contain “_”.
  4. Rules are “greedy”, meaning they match as much as possible (be careful with “.*” or “.+” configurations).
  5. Rules are processed alphabetically (case-sensitive!). First in line are uppercase letters from A to Z, then lowercase letters from a to z.
  6. If a rule matches an entity, no second rule can match. Example: if the words “managing board” match both the rule “committee” and the rule “keyword”, only the metadatum of the rule “committee” contains the words “managing board”.
  7. Entity recognition rules can only be created per index, that is, across all data sources within the index.
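
Points 5 and 6 explain why the exception rule in the file share example is named so that it sorts first. The following Python sketch illustrates the idea under assumptions: rule names are ordered by character code (which places uppercase before lowercase), the composite rules FullPathA and FullPathOther are rewritten by hand as plain regular expressions, and the function name first_matching_rule is hypothetical.

```python
import re

# Hand-written regex equivalents of the composite rules FullPathA and
# FullPathOther from the file share example (assumption, for illustration).
BASE_SHARE = r"\\\\[^\\]+\\[^\\]+\\[^\\]+"
RULES = {
    "FullPathOther": BASE_SHARE + r"\\[^\\]+\\[^\\].*",
    "FullPathA": BASE_SHARE + r"\\data\\it\\[^\\].*",
}


def first_matching_rule(text):
    """Try rules in character-code order of their names; the first full match wins."""
    for name in sorted(RULES):
        if re.fullmatch(RULES[name], text):
            return name
    return None


# Uppercase names sort before lowercase names.
print(sorted(["committee", "Keyword", "ASpecialPath", "keyword"]))

# Both rules could match the first path; the alphabetically first one wins.
print(first_matching_rule(r"\\fileserver.mycompany.com\qa\fstest\data\it\doc.txt"))
print(first_matching_rule(r"\\fileserver.mycompany.com\qa\fstest\projekte\doc.txt"))
```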

Index

Check the index status at http://localhost:8443/index/<Indexport>/statistics

Privileged servlets:

  • Connect to the Management Center
  • Navigate to the index
  • Activate the advanced settings
  • Open the index for which you want to test entity recognition
  • Deactivate the “Disable Unrestricted Privileged Servlets” checkbox
  • Then save the settings and restart the services
  • After the services are restarted:
    • Open https://yourappliance:8443/index/<Indexport>/processitems (in our example, with index port 23101: https://yourappliance:8443/index/23101/processitems)
    • On this page, you can test the rules (pattern rules) with a specific query (e.g. ALL)
    • After filling in the fields, click on process. If the syntax of the rules is correct, more test options appear after you press the button.
    • Select the rule that you want to match and configure the values of the rule(s).
    • Then click on process to start testing the rule(s).


Deactivating the greedy strategy of the entity recognition rules

Entity recognition rules are usually greedy. In the following example, the selected rows are matched:

Rule

R1=/ (?s)(test)(?P<line>.+)\s+(.*Page) /.

Match:

If greedy is deactivated, however, not everything is matched, but instead, only those blocks that start with test and end with Page:

Rule:

(?U)(?s)(test)(?P<line>.+)\s+(.*Page)(?U)

Match:
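
The difference between the two variants can be reproduced with a small Python sketch. Python's re module has no (?U) flag, so the sketch uses lazy quantifiers (.+? and .*?) as a stand-in, which is an assumption about equivalent behaviour. The sample text is hypothetical.

```python
import re

# Hypothetical sample text, for illustration only.
TEXT = "test line one\nfirst Page\ntest line two\nsecond Page\n"

# Greedy variant, as in rule R1 (without the surrounding spaces).
GREEDY = re.compile(r"(?s)(test)(?P<line>.+)\s+(.*Page)")
# Lazy quantifiers stand in for the (?U) flag, which Python does not support.
LAZY = re.compile(r"(?s)(test)(?P<line>.+?)\s+(.*?Page)")

greedy_matches = [m.group(0) for m in GREEDY.finditer(TEXT)]
lazy_matches = [m.group(0) for m in LAZY.finditer(TEXT)]

print(len(greedy_matches), "greedy match(es)")
print(len(lazy_matches), "lazy match(es)")
```

The greedy pattern produces one match that spans from the first “test” to the last “Page”, while the lazy pattern produces one match per block.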

Common error sources

An error with the following error message occurred while parsing the ER rules:

“MesQuery::Text::RE2Tokenizer ERROR: Matched empty (epsilon) token, pattern is”

For instance, a “\” at the end of a regex is not supported: LWPath=/\\\\[^\\]+\\/. causes an error. Better: LWPath=/\\\\[^\\]+/ "\\".

Problems can also arise with “.*” in rules.

Entity recognition rules are analyzed in alphabetical order and the first complete match wins.

Regex rules for German words do not match all characters (umlauts, etc.) with \w. Instead, you can use \pL to match all Unicode letters.

Typical use cases

Personal information

Social security number

RegEx

\d{4}(\s|\.|\-)\d{6}

Example

1237 010180

1237.010180

1237-010180

Telephone number

RegEx

(\+)([\s.\(\)]*\d{1}){8,13}(-)?(\d{1,5})

Example

+43 732 606162-0

+43 732 606162-609

+49(732)606162-609
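
Patterns like the social security number and telephone number expressions above can be checked against their examples before they are entered as rules. The following Python sketch is one way to do this. It uses Python's re module as a stand-in for RE2, which is an assumption (the two engines agree on these simple patterns but are not identical in general). The helper name matches_completely is hypothetical.

```python
import re

SOCIAL_SECURITY_NUMBER = re.compile(r"\d{4}(\s|\.|\-)\d{6}")
TELEPHONE_NUMBER = re.compile(r"(\+)([\s.\(\)]*\d{1}){8,13}(-)?(\d{1,5})")


def matches_completely(pattern, text):
    """True if the whole text is matched by the pattern."""
    return pattern.fullmatch(text) is not None


ssn_examples = ["1237 010180", "1237.010180", "1237-010180"]
phone_examples = ["+43 732 606162-0", "+43 732 606162-609", "+49(732)606162-609"]

for example in ssn_examples:
    print(example, matches_completely(SOCIAL_SECURITY_NUMBER, example))
for example in phone_examples:
    print(example, matches_completely(TELEPHONE_NUMBER, example))
```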

Number (with delimiters)

RegEx

z1=/\d/.z2=/\d/. (…)Dlmtr=/[\s\-_.:]?/.

z1 Dlmtr z2 Dlmtr z3 Dlmtr z4 Dlmtr z5 Dlmtr z6.

Example

12-34567

12 34 56-7

1-2 3456.7

Amount

RegEx

((\d{1,3}(\.(\d){3})*)|\d*)(,\d{1,2})

Example

0,84

100.000,49

100.000,00

1.000.000.000.000,00

Date

Handbook for date formats: http://userguide.icu-project.org/formatparse/datetime

  • dd(.|-|/)MM(.|-|/)yyyy
    • RegEx
      ((0[1-9])|[1-9]|([1-3][0-9]))(\.|\/|-)((0[1-9])|[1-9]|10|11|12)(\.|\/|-)((19|20)\d{2})
    • Example
      11.03.2014
      11.3.2014
      3.3.2014
      03.2.2010
      11/03/2014
      11/3/2014
      3/3/2014
      03/2/2010
      11-03-2014
      11-3-2014
      3-3-2014
      03-2-2010
  • dd. MMM yyyy
    • RegEx
      ((0[1-9])|[1-9]|([1-3][0-9]))\..(January|February|March|April|May|June|July|August|September|October|November|December).((19|20)\d{2})
    • Example
      3. January 2014
      4. February 2012
      30. November 2013
  • MMM yyyy
    • RegEx
      (January|February|March|April|May|June|July|August|September|October|November|December).((19|20)\d{2})
    • Example
      February 2014
      September 2014
  • MM(.|-|/)yyyy
    • RegEx
      (January|February|March|April|May|June|July|August|September|October|November|December).((19|20)\d{2})|((0[1-9])|[1-9]|10|11|12)(\.|\/|-)((19|20)\d{2})
    • Example
      03-2014
      03.2014
      03/2014
  • yyyy(.|-|/)MM(.|-|/)dd
    • RegEx
      ((19|20)\d{2})(\.|\/|-)((0[1-9])|[1-9]|10|11|12)(\.|\/|-)((([1-3][0-9]|0[1-9])|[1-9]))
    • Example
      2014-03-21

  • Date-Regex total
    ((0[1-9])|[1-9]|([1-3][0-9]))(\.|\/|-)((0[1-9])|[1-9]|10|11|12)(\.|\/|-)((19|20)\d{2})|((0[1-9])|[1-9]|([1-3][0-9]))\..(January|February|March|April|May|June|July|August|September|October|November|December).((19|20)\d{2})|(January|February|March|April|May|June|July|August|September|October|November|December).((19|20)\d{2})|((0[1-9])|[1-9]|10|11|12)(\.|\/|-)((19|20)\d{2})|((19|20)\d{2})(\.|\/|-)((0[1-9])|[1-9]|10|11|12)(\.|\/|-)((([1-3][0-9]|0[1-9])|[1-9]))
  • Date-Regex total II
    ((((0?[1-9]|[12]\d|3[01])[\.\-\/](0?[13578]|1[02])[\.\-\/]((1[6-9]|[2-9]\d)?\d{2}))|((0?[1-9]|[12]\d|30)[\.\-\/](0?[13456789]|1[012])[\.\-\/]((1[6-9]|[2-9]\d)?\d{2}))|((0?[1-9]|1\d|2[0-8])[\.\-\/]0?2[\.\-\/]((1[6-9]|[2-9]\d)?\d{2}))|(29[\.\-\/]0?2[\.\-\/]((1[6-9]|[2-9]\d)?(0[48]|[2468][048]|[13579][26])|((16|[2468][048]|[3579][26])00)|00)))|(((0[1-9]|[12]\d|3[01])(0[13578]|1[02])((1[6-9]|[2-9]\d)?\d{2}))|((0[1-9]|[12]\d|30)(0[13456789]|1[012])((1[6-9]|[2-9]\d)?\d{2}))|((0[1-9]|1\d|2[0-8])02((1[6-9]|[2-9]\d)?\d{2}))|(2902((1[6-9]|[2-9]\d)?(0[48]|[2468][048]|[13579][26])|((16|[2468][048]|[3579][26])00)|00))))
  • Example
    31.12.2005
    12.12.12
    1.2.2003
    1.3.98
    04-05-2004

Time

RegEx

(([0-1]?[0-9])|([2][0-3])):([0-5]?[0-9])(:([0-5]?[0-9]))?

Example

11:00:23

12:30

E-mail

RegEx

([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})

Example

david.porter@inspire.mindbreeze.com

egov@mindbreeze.com

IBAN

RegEx

AT\d{18}

Example

AT002105017000123456

Split List by “,” or other symbols

In this example, a list of entries separated by commas is also interpreted as a list in Mindbreeze InSpire.

Input: List of word, word,…

value=/[^\s,][^,]*[^,\s]?/.

rule= /\s*/ value /\s*(,\s*|$)/.
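
The effect of the two rules can be tried out with a short Python sketch. It is an approximation under assumptions: the sub-rule “value” is modelled as a named group, every match of “rule” contributes one list entry, and the function name split_list is hypothetical.

```python
import re

# Approximation of:
#   value=/[^\s,][^,]*[^,\s]?/.
#   rule= /\s*/ value /\s*(,\s*|$)/.
RULE = re.compile(r"\s*(?P<value>[^\s,][^,]*[^,\s]?)\s*(,\s*|$)")


def split_list(text):
    """Return one entry per match of the rule."""
    return [match.group("value") for match in RULE.finditer(text)]


print(split_list("red, green,blue"))
print(split_list("new york, paris"))
```

To split on another symbol, the comma would be replaced in both rules, for example by a semicolon.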

CSV Transformation

This section focuses on metadata enrichment using a CSV file. In doing this, it is possible to compare a value in a metadatum (i.e. piece of metadata) with a value of a particular column in the CSV. If the value from the metadatum matches the value from the column, you can write the value of another column from the same row to a new metadatum and attach it to the result.

Setting up the CSV transformation

This chapter uses a concrete example to illustrate setting up CSV transformation. The following steps must be performed for the configuration:

Connect to the Management Center (default: https://yourappliance:8443).

Navigate to the Indices tab, enable the Advanced Settings, and then expand your Index.

Search for the CSV transformation setting and set the function as shown in the example below.

 Example:

CSV File Path: Path on the server of the CSV file copied by you

If Expression Matches: The name of the column in the CSV, which must match the value of a metadatum for the transformation

In Property: The existing metadatum that should be compared with the value from the column in the CSV

Name: The name of the metadatum that should contain the new enriched value

Value: The name of the column whose value is to be written to the new metadatum "Name" in order to enrich the result

Copy any CSV file to the /data/ directory on your Mindbreeze InSpire appliance.

How does it work?

The value of the existing metadatum (medication) is compared with the values in the Medication column. If a row is found in which the two values are equal, the value from the ATC_CODE column of that row is extracted and written to the metadatum ATC_CODE.
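
The lookup described above can be sketched in Python as follows. This is an illustration and not the Mindbreeze implementation: the semicolon delimiter, the use of the first matching row, the placeholder rows, and the function name enrich are assumptions. The parameter names mirror the configuration fields.

```python
import csv
import io

# Hypothetical CSV content; the column names follow the example above,
# the rows are placeholders.
CSV_TEXT = "Medication;ATC_CODE\nMedA;CODE-A\nMedB;CODE-B\n"


def enrich(metadata, csv_text, match_column, in_property, name, value_column):
    """Return a copy of metadata with `name` added if a CSV row matches."""
    enriched = dict(metadata)
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=";")
    for row in reader:
        # "If Expression Matches" column compared with the "In Property" metadatum.
        if row[match_column] == metadata.get(in_property):
            enriched[name] = row[value_column]
            break  # assumption: the first matching row is used
    return enriched


result = enrich(
    {"medication": "MedB"},
    CSV_TEXT,
    match_column="Medication",
    in_property="medication",
    name="ATC_CODE",
    value_column="ATC_CODE",
)
print(result)
```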

Changes in the CSV file using a spreadsheet program

If you edit the CSV using a spreadsheet program such as Excel, you must ensure that the CSV is still in UTF8 format rather than UTF8-BOM format after processing.

You can check this with any text editor such as Notepad++ and, if necessary, convert it back to the UTF8 format.
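
As an alternative to a text editor, the check can be scripted. The following Python sketch detects and removes a UTF-8 byte order mark in raw file content. The function names are hypothetical, and the sample bytes are for illustration only.

```python
import codecs


def has_utf8_bom(data):
    """True if the raw bytes start with the UTF-8 byte order mark."""
    return data.startswith(codecs.BOM_UTF8)


def strip_utf8_bom(data):
    """Return the bytes without a leading UTF-8 byte order mark."""
    return data[len(codecs.BOM_UTF8):] if has_utf8_bom(data) else data


# Sample content as a spreadsheet program might save it (with BOM).
with_bom = codecs.BOM_UTF8 + "Medication;ATC_CODE\n".encode("utf-8")
print(has_utf8_bom(with_bom))
print(strip_utf8_bom(with_bom))
```

To apply this to a real file, its content would be read and written in binary mode so that the bytes are not decoded on the way.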

File Metadata Enricher

This chapter deals with the use of the File Metadata Enricher. This plugin allows you to enrich indexed documents (e.g. PDF files) with external sources such as an XML file or a CSV file. This chapter differentiates between XML file metadata enrichment and catalog settings.

XML file metadata enrichment

This mechanism is very similar to the mechanism of CSV transformation. In essence, this is about the possibility of comparing the value of a metadatum with the value in an XML file. If, for example, there is a file with content (e.g. mindbreeze.pdf) in a data source and another file that contains the metadata separately (e.g. mindbreeze.xml), they can be merged into one result to link the content to the metadata. This mechanism is explained in more detail in the following example:

Configuration example

Explanation

File Path Source: Name of a metadatum used as a source for the enrichment. For example, a metadatum containing the path of the current result can be used. For instance, in the Microsoft File Connector, datasource/mes:key contains smb://myserver/testdaten/Content/.

File Path Pattern: Limits the enricher’s scope of application. All results whose mes:key values do not match the regex in File Path Pattern are ignored; the enricher is not applied to them.

File Path Replacement: The path that contains the metadata of a file is specified here. The files must be located locally on the appliance or at least mounted on it. It is possible to reference a matching value of the regex specified in File Path Pattern as a variable here. The matching groups (REGEX) can be referenced in ascending order with $1 (e.g. $1, $2, $3, ...). The group (.*) can therefore be referenced with $1. In our case, the name of the file is extracted from the string that matches the File Path Pattern.

Metadata Node XPath: Each XML node that is matched by this XPath expression is interpreted by the enricher as an object with metadata.

Metadata Key XPath: The string that is matched by this XPATH expression is used by the enricher as the name of the new metadatum.

Metadata Value XPath: The string that is matched by this XPATH expression is used by the enricher as the value of the new metadatum.

Date Format: If a format is specified in Java Simple Date Format, the enricher will try to interpret each string that is matched by Metadata Value XPath as a date in the specified format to provide the entire functionality of the Mindbreeze date format. If the string is not in the specified format, the enricher performs a fallback and interprets the matching string as a string.
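
The interplay of File Path Pattern and File Path Replacement can be sketched in Python. All values below are hypothetical and chosen only for illustration; in particular, the pattern, the target directory, and the assumption that the pattern must match the whole key are not taken from the product. Python writes the group reference as \1 where the setting uses $1.

```python
import re

# Hypothetical settings, for illustration only.
FILE_PATH_PATTERN = r".*/([^/]+)\.pdf"
FILE_PATH_REPLACEMENT = r"/data/metadata/\1.xml"  # \1 plays the role of $1


def metadata_file_for(mes_key):
    """Return the metadata file path, or None if the enricher does not apply."""
    match = re.fullmatch(FILE_PATH_PATTERN, mes_key)
    if match is None:
        return None  # result is ignored, the enricher is not applied
    return match.expand(FILE_PATH_REPLACEMENT)


print(metadata_file_for("smb://myserver/testdaten/Content/1.pdf"))
print(metadata_file_for("smb://myserver/testdaten/Content/notes.txt"))
```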

Example:

This chapter uses a concrete example of the enricher for illustration.

XML file (1.xml)

<?xml version="1.0" encoding="utf-8"?>

<Document>

<UserID>4711_12</UserID>

<DocID>PDF_4711_12_CV_001.pdf</DocID>

<DocType>CV</DocType>

</Document>

Explanation

For each result in the index on which the metadata enricher is configured, the metadatum datasource/mes:key is checked against the regex from File Path Pattern. If it matches, it is compared with the local or mounted file names under the File Path Replacement path. The file name that was defined as a regex group in File Path Pattern is inserted at the reference point in File Path Replacement.

Example:

Source file: …/1.pdf File Path Replacement: …/1.xml

If the paths match, the XML node /<Document>/* is searched for in the .xml and all child nodes of the node are interpreted as relevant information. The name of the node is interpreted as the metadatum name of the new metadatum to be created. If the current child node contains a text(), this is set as the value for the newly created metadatum, and the metadatum is attached to the current result in the index.

Example:

In our case, the following indexed metadata would be attached to the already indexed file 1.pdf:

UserID: 4711_12

DocID: PDF_4711_12_CV_001.pdf

DocType: CV
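
The extraction of the child nodes can be reproduced with a short Python sketch using the standard library XML parser. It is an illustration and not the enricher itself: it hard-codes the node selection /Document/* and uses the node name as key and the node text as value, as described above. The function name extract_metadata is hypothetical.

```python
import xml.etree.ElementTree as ET

# Content of 1.xml from the example above.
XML_BYTES = b"""<?xml version="1.0" encoding="utf-8"?>
<Document>
<UserID>4711_12</UserID>
<DocID>PDF_4711_12_CV_001.pdf</DocID>
<DocType>CV</DocType>
</Document>
"""


def extract_metadata(xml_bytes):
    """Map each child of <Document> that has text to a metadatum name and value."""
    root = ET.fromstring(xml_bytes)
    if root.tag != "Document":
        return {}
    metadata = {}
    for child in root:  # corresponds to the children selected by /Document/*
        if child.text and child.text.strip():
            metadata[child.tag] = child.text.strip()
    return metadata


print(extract_metadata(XML_BYTES))
```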

Catalog Settings

This mechanism uses a CSV file for enrichment. As with CSV transformation, information from Mindbreeze is compared with the value of a column in the CSV. Unlike CSV transformation, the metadatum cannot be selected for comparison because the plugin actively searches for matches in the content of the file. Another important function of the plugin is the recognition of negations. If, for example, there is a match for renal failure, but the text states that there is no renal failure, renal failure is attached to the result as a negation in a separate metadatum. Additionally, this feature allows automatic links to be attached behind the hits and visualized in the PDF preview. The detailed operation of this function is explained in the following section.

Configuration example

Explanation

Catalog File: This setting includes the path of the CSV file to be used for enrichment. The file must be located locally on the Mindbreeze InSpire Appliance or mounted on it.

Catalog Match On: This setting specifies which column of the CSV file is compared with the information from the content of the results in order to recognize a match.

Extract Metadata: In this field, the name of the metadatum is specified in which the text of the Metadata Value column is inserted if there is a match. The metadata is attached to the result.

Extract Negated Metadata: With this setting, you specify the name of the metadatum that contains the text of the Metadata Value column if the match occurs in a negation. This metadata is attached to the result. A negated match is recognized as follows:

Negation Prefix Pattern + Catalog Match On (string for matching) + Negation Postfix Pattern

Extract Metadata Item: This metadatum contains a structured form of the entire applied CSV. This metadatum is not intended to be a filter and is only intended to support the development of search applications.

Metadata Value: This field must contain the column name of the CSV defined in the Catalog File setting. If the value of the Catalog Match On column matches the string currently compared from the content of the result, the value of the column specified in the Metadata Value field is attached as a string to the metadatum Extract Metadata.

Link HREF Pattern: In this setting, a link can be assembled using the extracted metadata. This link is then available in the PDF preview of the client. This link can be interpreted by the developer of the search application. The format of the link can be specified as follows:

https://entity.mindbreeze.com/meddra/?code={{pt_code}}

At inversion time, the placeholder {{pt_code}} is replaced by the value actually extracted from the pt_code column of the CSV.

Catalog ID Column: This setting determines which column of the CSV file is unique. This is used internally by Mindbreeze.

Sentence Split Pattern: This setting serves to divide the content of a sentence into sentence parts. The enricher is applied only in those parts of sentences that correspond to the regular expression given here.

Negation Prefix Pattern: This pattern specifies the prefix (usually text) used to recognize a negation. The syntax is that of regular expressions.

Negation Postfix Pattern: This pattern specifies the postfix (usually text) used to recognize a negation. The syntax is that of regular expressions.

Replacement Patterns: This field allows certain words, sentences, or letters to be treated as synonymous by the enricher. For example, ä can be treated as ae and ae as ä. The syntax is shown in the following example:

(?i:ä|ae)|>(ä|ae)

(?i:ae|ä)|>(ae|ä)

|> is used as the separator. Each rule must be entered on a separate line in the configuration field.
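
The negation recognition described under Extract Negated Metadata (Negation Prefix Pattern + Catalog Match On + Negation Postfix Pattern) can be sketched in Python. The sketch is an illustration under assumptions: both pattern values are hypothetical, matching is done case-insensitively, and the function name classify is invented for this example.

```python
import re

# Hypothetical pattern values, for illustration only.
NEGATION_PREFIX_PATTERN = r"\b(?:no|without)\s+"
NEGATION_POSTFIX_PATTERN = r"\b"


def classify(term, sentence_part):
    """Return "negated", "match", or None for one catalog term."""
    escaped = re.escape(term)
    # Prefix pattern + string for matching + postfix pattern, as described above.
    negation = NEGATION_PREFIX_PATTERN + escaped + NEGATION_POSTFIX_PATTERN
    if re.search(negation, sentence_part, re.IGNORECASE):
        return "negated"
    if re.search(escaped, sentence_part, re.IGNORECASE):
        return "match"
    return None


print(classify("renal failure", "Patient shows no renal failure"))
print(classify("renal failure", "History of renal failure"))
print(classify("renal failure", "No findings"))
```

In terms of the settings above, a “negated” result would go to the metadatum named in Extract Negated Metadata and a “match” result to the one named in Extract Metadata.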

Using the new metadata

The document Development of Search Apps illustrates how the metadata can be used in a PDF preview when developing search applications.