Entity Recognition

Configuration

Copyright ©

Mindbreeze GmbH, A-4020 Linz, 2017.

All rights reserved. All hardware and software names are brand names and/or trademarks of their respective manufacturers.

These documents are strictly confidential. The submission and presentation of these documents does not confer any rights to our software, our services and service outcomes, or other protected rights. The dissemination, publication, or reproduction hereof is prohibited.

For ease of readability, gender differentiation has been waived. Corresponding terms and definitions apply within the meaning and intent of the equal treatment principle for both sexes.

Introduction

This document deals with the concept, setup, and troubleshooting methods for configuring entity recognition.

Entity recognition configuration

This chapter explains the concept of entity recognition using a simple example.
Follow these steps to set it up:

  • Connect to the Management Center.
  • Navigate to the index that you want to configure with entity recognition.
  • Activate and then open the advanced settings.
  • Search for the “Entity recognition parameters” setting in the Management Center.
  • In the pattern-rules field, define your entity recognition rules, which should match the content of your metadatum.
  • Regular expressions in the rules follow the RE2 syntax: https://github.com/google/re2/wiki/Syntax
  • In our concrete example:

    rule=/\// digits /\//. 
    digits=/\d+/.


    Explanation

    The rule “rule” matches a sequence of digits enclosed by two slashes; the digits themselves are matched by the sub-rule “digits” (regex):
    Example: test/1234test1234/test/543/test (543 is extracted)
  • Now add a new metadata definition to apply the rules to metadata.
  • In this example, Mindbreeze searches the “full string” of the existing metadatum for numbers between two slashes. If it finds such a number, Mindbreeze takes the part of the match covered by the sub-rule “digits” and writes it as a string to the new metadatum “myextractedVal”.

    Example
    Full string: xyz/1234/herbert543/345test
    Match of the rule “rule”: /1234/
    Value of the rule “digits”: 1234
    Value of the metadatum myextractedVal==1234
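
The behavior of the two rules can be reproduced outside Mindbreeze for quick experiments. The following is a minimal Python sketch, not the Mindbreeze implementation: it assumes that the sub-rule “digits” can be modeled as a named group, and the function name extract_digits is hypothetical.

```python
import re

# Emulates rule=/\// digits /\//. with digits=/\d+/.
# The named group plays the role of the sub-rule "digits" (an assumption
# made for this illustration only).
RULE = re.compile(r"/(?P<digits>\d+)/")


def extract_digits(full_string):
    """Return the value of "digits" for the first match of "rule", or None."""
    match = RULE.search(full_string)
    return match.group("digits") if match else None


my_extracted_val = extract_digits("xyz/1234/herbert543/345test")
print(my_extracted_val)  # 1234
```

In the second sample string from above, test/1234test1234/test/543/test, the same function returns 543, because 1234 is not followed directly by a slash.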

Notes on configuration in the Management Center

When configuring as a metadatum in Mindbreeze InSpire, the following fields must be filled in:

  • If rule matches (name of the rule)
  • Name (name of the metadatum)
  • Value (value of the rule, for example {{month}}; can be plain text or a composite, for example: “Date {{Day}}.{{Month}}.{{Year}}”)
  • Format (format of the value: “string”, “Date”, “Number”)
  • Format options (options for the format; for dates in particular, a pattern as used with SimpleDateFormat)
  • In existing metadata (the area to which the rule is applied, for example: content, title, datasource/mes:key, <ownmetadatum>, etc.)
  • Scope: The Scope setting restricts the extraction to one or more areas that are selected by an entity recognition rule. Enter the name of the rule that selects the area(s) in the Scope field. Unlike in the Value field, the rule name is entered without {{}}.
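
To illustrate how a composite Value such as “Date {{Day}}.{{Month}}.{{Year}}” is put together, here is a small Python sketch. It only models the placeholder substitution; the function name render_value and the dictionary of sub-rule values are assumptions for this example, and how Mindbreeze treats unknown placeholders is not covered here.

```python
import re

# A placeholder is a rule name wrapped in double curly braces, e.g. {{Day}}.
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def render_value(template, sub_rule_values):
    """Replace each {{name}} with the text matched by the sub-rule "name"."""
    return PLACEHOLDER.sub(lambda m: sub_rule_values[m.group(1)], template)


print(render_value("Date {{Day}}.{{Month}}.{{Year}}",
                   {"Day": "11", "Month": "03", "Year": "2014"}))
# Date 11.03.2014
```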

Entity recognition (example: file system)

This chapter uses a simple example to explain entity recognition and its setup with Mindbreeze.

Configuration of entity recognition for a file system:

First, create the rules for the extraction:

host=/[^\\]+/.

share=/[^\\]+/.

directory=/[^\\]+/.

UNCPath="\\\\" host "\\" share "\\" directory "\\".

If rule matches: UNCPath

Name: Laufwerk

Value: {{share}}

In existing metadata: datasource/mes:key

If rule matches: UNCPath

Name: Projektpfad

Value: {{directory}}

In existing metadata: datasource/mes:key

Aggregated metadata keys (; separated)

Laufwerk;Projektpfad
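
The effect of this configuration can be approximated in Python as follows. This is a sketch under assumptions: the sub-rules host, share, and directory are modeled as named groups, the sample path is made up for the illustration, and the function name extract_metadata is hypothetical.

```python
import re

# host, share and directory each correspond to /[^\\]+/ in the rules above;
# the literals "\\\\" and "\\" become escaped backslashes in the pattern.
UNC_PATH = re.compile(
    r"\\\\(?P<host>[^\\]+)\\(?P<share>[^\\]+)\\(?P<directory>[^\\]+)\\"
)


def extract_metadata(mes_key):
    """Return Laufwerk and Projektpfad for a UNC path, or {} without a match."""
    match = UNC_PATH.search(mes_key)
    if match is None:
        return {}
    return {"Laufwerk": match.group("share"),
            "Projektpfad": match.group("directory")}


# Hypothetical value of datasource/mes:key
print(extract_metadata(r"\\fileserver.mycompany.com\qa\fstest\plan.docx"))
# {'Laufwerk': 'qa', 'Projektpfad': 'fstest'}
```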

Date formats for entity recognition are based on the ICU patterns (e.g. locale … de_AT)

Configuration for entity recognition for file system paths (variant 2) – with exceptions:

Even a complex case with ambiguous rules can be handled by defining alternative rules, ordering them by name, and sequencing the metadata extractions correctly. The extracted path, a metadatum, is lower-case and is therefore better suited for CSV mapping.

The OR operator (|) does not work between sub-rules!

> Simple solution without exception:

Pattern rules:

LWPath=/\\\\[^\\]+\\[^\\]+\\[^\\]+\\[^\\]+/.

FilePath=/[^\\]+/.

FullPath=LWPath "\\" FilePath.

> Solution with an exception (data\it):

Pattern rules:

ASpecialPath="data\\it".

OtherPath=/[^\\]+/.

BaseShare=/\\\\[^\\]+\\[^\\]+\\[^\\]+/.

LWPathA= BaseShare "\\" ASpecialPath.

LWPathOther= BaseShare "\\" OtherPath.

FilePathA=/[^\\].*/.

FilePathOther=/[^\\].*/.

FullPathA=LWPathA "\\" FilePathA.

FullPathOther=LWPathOther "\\" FilePathOther.
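
Why the rule names matter here can be shown with a short Python sketch. It assumes, as stated in this document, that rules are tried in alphabetical order of their names and that the first complete match wins; the dictionary RULES, the function match_path, and the sample paths are hypothetical and only emulate that behavior.

```python
import re

BASE_SHARE = r"\\\\[^\\]+\\[^\\]+\\[^\\]+"
FILE_PATH = r"[^\\].*"

# Top-level rules keyed by rule name; the group LWPath marks the extracted part.
RULES = {
    "FullPathOther": rf"(?P<LWPath>{BASE_SHARE}\\[^\\]+)\\{FILE_PATH}",
    "FullPathA": rf"(?P<LWPath>{BASE_SHARE}\\data\\it)\\{FILE_PATH}",
}


def match_path(path):
    """Try the rules in alphabetical order of their names; first match wins."""
    for name in sorted(RULES):
        match = re.fullmatch(RULES[name], path)
        if match:
            return name, match.group("LWPath")
    return None


print(match_path(r"\\fileserver.mycompany.com\qa\fstest\data\it\budget.xlsx"))
print(match_path(r"\\fileserver.mycompany.com\qa\fstest\projekte\plan.docx"))
```

The first path is claimed by FullPathA, because that name sorts before FullPathOther. Without this ordering, FullPathOther would also match it and would extract only the path up to data.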

The following screenshot demonstrates the configuration of the rules.

CSV transformation: the extracted value (file share) is compared case-sensitively, so upper and lower case must match exactly. The path can then be used as the source metadatum.

fileshare;letter

\\fileserver.mycompany.com\qa\fstest\projekte;U:

\\fileserver.mycompany.com\qa\fstest\vorlagen;T:

\\fileserver.mycompany.com\qa\fstest\allgemein;G:

\\fileserver.mycompany.com\qa\fstest\spezial;M:

\\fileserver.mycompany.com\qa\fstest\data\it;H:

\\fileserver.mycompany.com\qa\fstest\data;H:

\\fileserver.mycompany.com\qa\fstest\data-services;H:

\\fileserver.mycompany.com\qa\fstest\allgemein-retail;G:
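
The case-sensitive lookup can be pictured with the following Python sketch. It is not the Mindbreeze CSV transformation itself; it merely loads a few rows of the table above into a dictionary, and the function name load_mapping is an assumption for this example.

```python
import csv
import io

# A few rows of the mapping table above, separated by semicolons.
CSV_TEXT = r"""fileshare;letter
\\fileserver.mycompany.com\qa\fstest\projekte;U:
\\fileserver.mycompany.com\qa\fstest\data\it;H:
\\fileserver.mycompany.com\qa\fstest\data;H:
"""


def load_mapping(text):
    """Read the CSV text into a dict that maps file share to drive letter."""
    reader = csv.DictReader(io.StringIO(text), delimiter=";")
    return {row["fileshare"]: row["letter"] for row in reader}


mapping = load_mapping(CSV_TEXT)
print(mapping.get(r"\\fileserver.mycompany.com\qa\fstest\data\it"))  # H:
# Different upper/lower case means no entry is found:
print(mapping.get(r"\\FILESERVER.mycompany.com\qa\fstest\data\it"))  # None
```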

In the CSV transformation (as in ER rules), a match against mes:key works only with: In Property = datasource/mes:key.

Please note: the /documents servlet does not return values that are only created by an index re-invert!

Troubleshooting entity recognition

This chapter deals with troubleshooting the entity recognition rules.

Important information

  1. In Mindbreeze InSpire, regular expressions are surrounded by a “/”.
  2. Each rule entry must end with a period.
  3. Rule names may not contain “_”.
  4. Rules are “greedy”, meaning they match as much as possible (be careful with “.*” or “.+” configurations).
  5. Rules are processed alphabetically (case-sensitive!). First in line are uppercase letters from A to Z, then lowercase letters from a to z.
  6. If a rule matches an entity, no second rule can match it. Example: if the words “managing board” would be matched both by the rule “committee” and by the rule “keyword”, only the metadatum of the rule “committee” will contain the words “managing board”.
  7. Entity recognition rules can only be created per index, that is, across all data sources within the index.
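
The processing order described in point 5 corresponds to a plain code-point sort, which is also what Python's sorted produces for ASCII names. The following sketch assumes that this equivalence holds for your rule names; the function name processing_order and the list of names are only illustrative.

```python
def processing_order(rule_names):
    """Sort rule names case-sensitively: A to Z first, then a to z."""
    return sorted(rule_names)


print(processing_order(["rule", "UNCPath", "digits", "LWPathA"]))
# ['LWPathA', 'UNCPath', 'digits', 'rule']
```

A rule that has to win over another one can therefore be given a name that sorts earlier, for example one that starts with an uppercase letter.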

Index

Check the index status at http://localhost:8443/index/<Indexport>/statistics

Privileged servlets:

  • Connect to the Management Center
  • Navigate to the index
  • Activate the advanced settings
  • Open the index for which you want to test entity recognition
  • Deactivate the “Disable Unrestricted Privileged Servlets” checkbox
  • Then save the settings and restart the services
  • After the services are restarted:
    • Open https://yourappliance:8443/index/<Indexport>/processitems (in our example, with the index port 23101: https://yourappliance:8443/index/23101/processitems)
    • On this page, you can test the rules (pattern rules) with a specific query (e.g. ALL).
    • After filling in the fields, click on process. If the syntax of the rules is correct, further test options appear after you press the button.
    • Select the rule that you want to match and configure the values of the rule(s).
    • Then click on process to start testing the rule(s).


Deactivating the greedy strategy of the entity recognition rules

Entity recognition rules are usually greedy. In the following example, the selected rows are matched:

Rule

R1=/ (?s)(test)(?P<line>.+)\s+(.*Page) /.

Match:

If greedy matching is deactivated, however, not everything is matched; instead, only the individual blocks that start with “test” and end with “Page” are matched:

Rule:

(?U)(?s)(test)(?P<line>.+)\s+(.*Page)(?U)

Match:
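
The difference can be tried out with a short Python sketch. Python's re module has no (?U) flag, so the sketch writes the non-greedy quantifiers (+?, *?) explicitly, which is what (?U) does implicitly in RE2. The sample text and the helper blocks are assumptions for this illustration.

```python
import re

# Hypothetical sample text with two blocks that start with "test"
# and end with "Page".
TEXT = "test one\nPage\ntest two\nPage"

GREEDY = re.compile(r"(?s)(test)(?P<line>.+)\s+(.*Page)")
# Same rule with every quantifier made non-greedy.
LAZY = re.compile(r"(?s)(test)(?P<line>.+?)\s+?(.*?Page)")


def blocks(pattern, text):
    """Return the full text of every match of pattern in text."""
    return [m.group(0) for m in pattern.finditer(text)]


print(len(blocks(GREEDY, TEXT)))  # 1
print(blocks(LAZY, TEXT))         # ['test one\nPage', 'test two\nPage']
```

The greedy variant returns one match that spans the whole text, while the non-greedy variant returns each block separately.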

Common error sources

An error with the following error message can occur while parsing the ER rules:

“MesQuery::Text::RE2Tokenizer ERROR: Matched empty (epsilon) token, pattern is”

For instance, a “\” at the end of a regex is not supported: LWPath=/\\\\[^\\]+\\/. causes an error. Better: LWPath=/\\\\[^\\]+/ "\\".

Using “.*” in rules can also cause problems.

Entity recognition rules are analyzed in alphabetical order and the first complete match wins.

Regex rules for German words do not match all characters (umlauts, etc.) with \w. Instead, you can use \pL, which matches any Unicode letter.
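
The effect of an ASCII-only \w can be reproduced in Python. This is only an approximation of the RE2 behavior: Python's standard re module does not support \pL, so the sketch uses the re.ASCII flag to imitate the ASCII-only \w and the class [^\W\d_] as a rough stand-in for “any letter”.

```python
import re

# re.ASCII restricts \w to [a-zA-Z0-9_], similar to \w in RE2.
ASCII_WORD = re.compile(r"\w+", re.ASCII)
# Word characters minus digits and underscore: roughly "any Unicode letter".
UNICODE_LETTERS = re.compile(r"[^\W\d_]+")

print(ASCII_WORD.findall("Größe"))       # ['Gr', 'e']
print(UNICODE_LETTERS.findall("Größe"))  # ['Größe']
```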

Typical use cases

Personal information

Social security number

RegEx

\d{4}(\s|\.|\-)\d{6}

Example

1237 010180

1237.010180

1237-010180

Telephone number

RegEx

(\+)([\s.\(\)]*\d{1}){8,13}(-)?(\d{1,5})

Example

+43 732 606162-0

+43 732 606162-609

+49(732)606162-609
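
Before a pattern like this goes into a rule, it can be checked against the examples with a few lines of Python. The sketch assumes that the pattern behaves the same in Python's re module as in RE2, which holds for the constructs used here; the function name is_phone_number is hypothetical.

```python
import re

PHONE = re.compile(r"(\+)([\s.\(\)]*\d{1}){8,13}(-)?(\d{1,5})")


def is_phone_number(text):
    """True if the whole text matches the telephone number pattern."""
    return PHONE.fullmatch(text) is not None


for sample in ["+43 732 606162-0", "+43 732 606162-609", "+49(732)606162-609"]:
    print(sample, is_phone_number(sample))  # True for all three samples

print(is_phone_number("0732 606162"))  # False, the leading "+" is required
```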

Number (with delimiters)

RegEx

z1=/\d/.

z2=/\d/.

(…)

Dlmtr=/[\s\-_.:]?/.

z1 Dlmtr z2 Dlmtr z3 Dlmtr z4 Dlmtr z5 Dlmtr z6.

Example

12-34567

12 34 56-7

1-2 3456.7

Amount

RegEx

((\d{1,3}(\.(\d){3})*)|\d*)(,\d{1,2})

Example

0,84

100.000,49

100.000,00

1.000.000.000.000,00

Date

Handbook for date formats: http://userguide.icu-project.org/formatparse/datetime

  • dd(.|-|/)MM(.|-|/)yyyy
    • RegEx
      ((0[1-9])|[1-9]|([1-3][0-9]))(\.|\/|-)((0[1-9])|[1-9]|10|11|12)(\.|\/|-)((19|20)\d{2})
    • Example
      11.03.2014
      11.3.2014
      3.3.2014
      03.2.2010
      11/03/2014
      11/3/2014
      3/3/2014
      03/2/2010
      11-03-2014
      11-3-2014
      3-3-2014
      03-2-2010
  • dd. MMM yyyy
    • RegEx
      ((0[1-9])|[1-9]|([1-3][0-9]))\..(January|February|March|April|May|June|July|August|September|October|November|December).((19|20)\d{2})
    • Example
      3. January 2014
      4. February 2012
      30. November 2013
  • MMM yyyy
    • RegEx
      (January|February|March|April|May|June|July|August|September|October|November|December).((19|20)\d{2})
    • Example
      February 2014
      September 2014
  • MM(.|-|/)yyyy
    • RegEx
      (January|February|March|April|May|June|July|August|September|October|November|December).((19|20)\d{2})|((0[1-9])|[1-9]|10|11|12)(\.|\/|-)((19|20)\d{2})
    • Example
      03-2014
      03.2014
      03/2014
  • yyyy(.|-|/)MM(.|-|/)dd
    • RegEx
      ((19|20)\d{2})(\.|\/|-)((0[1-9])|[1-9]|10|11|12)(\.|\/|-)((([1-3][0-9]|0[1-9])|[1-9]))
    • Example
      2014-03-21

  • Date-Regex total
    ((0[1-9])|[1-9]|([1-3][0-9]))(\.|\/|-)((0[1-9])|[1-9]|10|11|12)(\.|\/|-)((19|20)\d{2})|((0[1-9])|[1-9]|([1-3][0-9]))\..(January|February|March|April|May|June|July|August|September|October|November|December).((19|20)\d{2})|(January|February|March|April|May|June|July|August|September|October|November|December).((19|20)\d{2})|((0[1-9])|[1-9]|10|11|12)(\.|\/|-)((19|20)\d{2})|((19|20)\d{2})(\.|\/|-)((0[1-9])|[1-9]|10|11|12)(\.|\/|-)((([1-3][0-9]|0[1-9])|[1-9]))
  • Date-Regex total II
    ((((0?[1-9]|[12]\d|3[01])[\.\-\/](0?[13578]|1[02])[\.\-\/]((1[6-9]|[2-9]\d)?\d{2}))|((0?[1-9]|[12]\d|30)[\.\-\/](0?[13456789]|1[012])[\.\-\/]((1[6-9]|[2-9]\d)?\d{2}))|((0?[1-9]|1\d|2[0-8])[\.\-\/]0?2[\.\-\/]((1[6-9]|[2-9]\d)?\d{2}))|(29[\.\-\/]0?2[\.\-\/]((1[6-9]|[2-9]\d)?(0[48]|[2468][048]|[13579][26])|((16|[2468][048]|[3579][26])00)|00)))|(((0[1-9]|[12]\d|3[01])(0[13578]|1[02])((1[6-9]|[2-9]\d)?\d{2}))|((0[1-9]|[12]\d|30)(0[13456789]|1[012])((1[6-9]|[2-9]\d)?\d{2}))|((0[1-9]|1\d|2[0-8])02((1[6-9]|[2-9]\d)?\d{2}))|(2902((1[6-9]|[2-9]\d)?(0[48]|[2468][048]|[13579][26])|((16|[2468][048]|[3579][26])00)|00))))
  • Example
    31.12.2005
    12.12.12
    1.2.2003
    1.3.98
    04-05-2004

Time

RegEx

(([0-1]?[0-9])|([2][0-3])):([0-5]?[0-9])(:([0-5]?[0-9]))?

Example

11:00:23

12:30

E-mail

RegEx

([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})

Example

david.porter@inspire.mindbreeze.com

egov@mindbreeze.com

IBAN

RegEx

AT\d{18}

Example

AT002105017000123456

Split List by “,” or other symbols

In this example, a list of entries separated by commas is also interpreted as a list in Mindbreeze InSpire.

Input: List of word, word,…

value=/[^\s,][^,]*[^,\s]?/.

rule= /\s*/ value /\s*(,\s*|$)/.
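
How the two rules cut a string into list entries can be followed with this Python sketch. It assumes that the sub-rule “value” can be modeled as a named group and that every match of “rule” yields one entry; the function name split_list and the sample input are hypothetical.

```python
import re

# rule= /\s*/ value /\s*(,\s*|$)/. with value=/[^\s,][^,]*[^,\s]?/.
ITEM = re.compile(r"\s*(?P<value>[^\s,][^,]*[^,\s]?)\s*(,\s*|$)")


def split_list(text):
    """Return the text captured by "value" for every match of "rule"."""
    return [m.group("value") for m in ITEM.finditer(text)]


print(split_list("alpha, beta gamma, delta"))
# ['alpha', 'beta gamma', 'delta']
```

To split on a different symbol, replace the comma in both rules with that symbol.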