Mindbreeze Prediction Service

Using the example of text classification

Copyright ©

Mindbreeze GmbH, A-4020 Linz, 2017.

 

All rights reserved. All hardware and software names are trade names and/or trademarks of their respective owners.

These documents are confidential. The delivery and presentation of these documents alone does not justify any rights whatsoever to our software, our services and service performance results or other protected rights. The disclosure, publication or reproduction is not permitted.

For reasons of easier legibility, gender differentiation has been dispensed with. In terms of equal treatment, appropriate terms apply to both sexes.

Architecture

The components for text classification with Mindbreeze InSpire consist of

  • Mindbreeze Prediction Service
  • Mindbreeze Filter Service with a special filter for the text classification and an OCR filter plug-in (optional)

Note: The current version of Fabasoft Capture Client always requires a valid OCR plug-in.

Mindbreeze Configuration

Mindbreeze Prediction Service

The Prediction Service is a multi-client capable service that provides the infrastructure components for creating a model from a dataset by means of an algorithm. Using the dataset, the model is trained to learn relations and correlations between input and output; it can then be applied to untrained content. Specifically, this means that, for example, associations between documents and a class can be learned from the document content in order to then predict the class of new documents.

Train Dataset Source Ratio


This parameter determines what fraction of the dataset (the training documents) is used to train the model.


A value of 0.8 means, for example, that the training takes place using 80% of the dataset. The remaining 20% of the dataset can then be used to test the classification.


The section “Training and validating a model” describes how the model is tested.

The dataset is divided into training and test objects based on a hash value of each document’s URI. This makes the split very reliable: similar document names or particular folder structures in the dataset are not a problem.
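The idea behind the hash-based split can be sketched in a few lines of Python (a hypothetical illustration of the principle, not the actual Mindbreeze implementation; the function name and hashing details are assumptions):

```python
import hashlib

def split_dataset(uris, train_ratio=0.8):
    """Deterministically split documents into train/test sets by URI hash.

    Illustrative sketch: the decision depends only on the URI, so every
    run produces the same partition, regardless of document names or
    folder structures in the dataset.
    """
    train, test = [], []
    for uri in uris:
        # Map the URI to a stable number in [0, 1).
        digest = hashlib.sha1(uri.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") / 2.0**64
        (train if bucket < train_ratio else test).append(uri)
    return train, test

# With a ratio of 0.8, roughly 80% of the documents end up in `train`.
train, test = split_dataset(["file:///docs/%d.pdf" % i for i in range(1000)])
```

Because the decision depends only on the URI, re-running the split on the same dataset always yields the same partition.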

Mindbreeze Filter Service

The following plugins are preconditions for the use of text categorization:

  • FilterPlugin.ABBYYFineReaderForPDFOCR (pdfocr)
  • FilterPlugin.PredictionServiceBasedTextCategorization (textcategorize)
  • Moreover, no thumbnail should be generated for the file extension pdf.

FilterPlugin.ABBYYFineReaderForPDFOCR

Activating/deactivating the ABBYY license:

$ /opt/ABBYY/FREngine11/activatefre.sh

ABBYY FineReader Engine 11 activation script

Configuration file already exists.

1) Reconfigure service, manage licenses and set up samples

2) Manage licenses

3) Exit

Note: If OCR isn’t necessary, the PDF can be further processed 1:1 without OCR by means of the "Pass Through on Engine Failure" option.

FilterPlugin.PredictionServiceBasedTextCategorization

Starting a Project

The Prediction Service is multi-client capable. This means that several clients can operate within a service. Within one client, several projects can be administered. A project corresponds to a dataset. Several models can then be generated via one project.

Dataset

A dataset is assigned to a project and forms the basis for training. In the specific case of text classification, the dataset consists of content for which the corresponding class is already known. Currently, datasets are stored in the Mindbreeze index format, which provides efficient and flexible access to the elements.

Importing an initial dataset

To create a dataset, you just have to index a data source and thus create an index. This index can then simply be stored under the respective project. Alternatively, content can be added to the store directly via the Prediction Service. In Capture Client, this approach is essentially the same as bulk feedback.


The file system convention is structured as follows:

<PredictionService-Data-Directory>/tenants/<TenantID>/projects/<ProjectID>

  • store: the dataset (index format)
  • models: the generated models

Working with a dataset

By addressing the Prediction Service via HTTP on the respective port (e.g. 23800), the tenants and projects can be viewed.


Here is a list of services within the tenant “Training” and the project “Energiesparhaus“ (energy-efficient home):

Training and validating a model

To train a model using a dataset, there is a control command (mespredictioncontrol). It can also be run on a remote host, since communication only requires the HTTP port of the service.

mespredictioncontrol --in train_request.txt --type-in=textual --url=http://localhost:23800 --train

The file train_request.txt controls the selection of elements from the dataset as well as the feature extraction process, which turns the source elements into feature vectors suitable for the learning method:

tenant_id: "Training"

project_id: "Energiesparhaus"

dataset_source {

        query_expr {

                   kind: EXPR_UNPARSED

                   unparsed_expr: "ALL"

        }

        item_ratio_property: ITEM_PATTERN

        item_property_pattern: "{{mes:key}}"

        item_ratio: 0.9

} (data source filter)

[mes.ipc.prediction.linear_kernel_config] {

    content_feature_weight: BINRF

    text_feature_config {

         label_property: "label"

         token_pattern: "\\pL{1,4}|\\d{1,3}"

         ngram_length: 2

         max_token_count: 90

    }

} (algorithm configuration)

token_pattern

The parameter token_pattern defines how the text of the documents in the dataset is divided for the algorithm.

With the above setting, the text is divided into strings of 1 to 4 Unicode letters (\\pL{1,4}) or numbers of 1 to 3 digits (\\d{1,3}).

The most useful value of this parameter depends on the underlying dataset; tuning it influences the quality of the results obtained.

ngram_length

This parameter determines how many adjacent tokens the algorithm re-assembles into n-grams. Again, the value influences the quality of the results obtained.
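The effect of token_pattern and ngram_length can be illustrated with a small sketch (illustrative only; Python’s built-in re module does not support \pL, so [^\W\d_] is used here as an approximation of “one Unicode letter”):

```python
import re

# Approximation of token_pattern "\\pL{1,4}|\\d{1,3}": Python's built-in
# re module has no \pL, so [^\W\d_] stands in for "one Unicode letter".
TOKEN_PATTERN = re.compile(r"[^\W\d_]{1,4}|\d{1,3}")

def tokenize(text):
    """Split text into letter runs of length 1-4 and digit runs of length 1-3."""
    return TOKEN_PATTERN.findall(text)

def ngrams(tokens, n=2):
    """Re-assemble adjacent tokens into n-grams (cf. ngram_length: 2)."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenize("Rechnung Nr. 2017")
# tokens == ['Rech', 'nung', 'Nr', '201', '7']
bigrams = ngrams(tokens)
# bigrams == ['Rech nung', 'nung Nr', 'Nr 201', '201 7']
```

Changing the pattern or the n-gram length changes the feature vectors the model sees, which is why both parameters affect result quality.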

A model can then be tested (validated) using the following command:

mespredictioncontrol --in test_request.txt --type-in=textual --url=http://localhost:23800 --test

The testing configuration consists solely of the tenant, project and model IDs.

tenant_id: "Training"

project_id: "Energiesparhaus"

model_id: "7d3e97de-1680-11e5-a081-001a4af18900"

Optionally, any parameter can also be overridden on the command line. For example, to test another model, simply specify the parameter --model_id:

mespredictioncontrol --in test_request.txt --type-in=textual
--url=http://localhost:23800 --test --model_id=<Alternative Model ID>

Administering the model

The following command lists the available models for a given project:

mespredictioncontrol --availablemodels  --tenant_id=Training
--project_id=Energiesparhaus --url=http://localhost:23800

Any number of models can be generated for one project (within a tenant). If a model is specified in PredictRequest to classify a document, it will be used. If no model is defined, the default model of a project is used.


Now a model can be defined as the default model using setdefaultmodel.

mespredictioncontrol --setdefaultmodel=<Model ID>  --tenant_id=Training
--project_id=Energiesparhaus --url=http://localhost:23800

Classification using the Capture Client example

Now we can demonstrate classification using the example of Capture Client, which employs so-called “document types” that are assigned to corresponding labels of the trained model.


After the initial import (triggered via XML files in the file system), a classification is automatically carried out. The user works through the list of open documents. When the user confirms a document (Next), there are two possible outcomes: the document was classified correctly, or it was misclassified.

Below is an example in which the class is correct. This is confirmed by clicking “Next”.

Here is an example in which a correction is given as feedback by changing the document type to “Baufinanzierung” (construction financing).

Either way, feedback is sent to the Mindbreeze Prediction Service.

Recurrent feedback-based training

With the following request, 90% of the documents from the base dataset (category:Web) and 100% of the negative feedback from the current project (category:Training, where “Training” is the name of the tenant used here) are used for training. The remaining parameters are identical to the initial training request.

tenant_id: "Training"

project_id: "Energiesparhaus"

dataset_source {

        query_expr {

                   kind: EXPR_UNPARSED

                   unparsed_expr: "category:Web"

        }

        item_ratio_property: ITEM_PATTERN

        item_property_pattern: "{{mes:key}}"

        item_ratio: 0.9

}

dataset_source {

        query_expr {

                   kind: EXPR_UNPARSED

                   unparsed_expr: "category:Training feedback_flag:true"

        }

        item_ratio_property: ITEM_PATTERN

        item_property_pattern: "{{mes:key}}"

        item_ratio: 1

}

(data source filters)

[mes.ipc.prediction.linear_kernel_config] {

    content_feature_weight: BINRF

    text_feature_config {

         label_property: "label"

         token_pattern: "\\pL{1,4}|\\d{1,3}"

         ngram_length: 2

         max_token_count: 90

    }

}
mespredictioncontrol --in train_request.txt --type-in=textual --url=http://localhost:23800 --train

Extracting entities from content

Structured information can be generated from text using rules. These rules, and the metadata extracted with them, are defined in the configuration of the plug-in FilterPlugin.PredictionServiceBasedTextCategorization under the respective filter services in the Mindbreeze configuration.


Entities are defined by rules. A rule has the following syntax:

<RuleName> = <RuleDef_1> … <RuleDef_N> .

RuleName is an alphanumeric word and must begin with a letter (a-z, A-Z).

RuleDef_i can take one of the following forms:

<RuleName>

/ <regular expression> /

" <text> "

In the following example you can see the rules for “Bausparantragsnummer“ (bspranr, home loan application number) and for two invoice numbers (rnr1, rnr2), which are then extracted as metadata.

bspranr = /\d{5}\.\d{6}/.

space = /\s*/.

num4 = /\d{4}/.

num4a = /\d{4}/.

num2 = /\d{2}/.

num5 = /\d{5}/.

rest = /x+/.

rnr1 = "Rechnung Nr." space num4 "-" num5 rest.

rnr2 = "Rechnung Nr." space num4 "-" num2 "-" num4a.
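Conceptually, such a rule concatenates its parts into a single pattern. The rule rnr1 could be approximated in Python roughly as follows (a hypothetical illustration; the actual rule engine is part of the Prediction Service):

```python
import re

# Hand-translated approximation of the rule
#   rnr1 = "Rechnung Nr." space num4 "-" num5 rest.
# The sub-rules space, num4 and num5 are inlined; "rest" (/x+/) only
# consumes trailing filler in the sample and is omitted here.
RNR1 = re.compile(r"Rechnung Nr\.\s*(\d{4})-(\d{5})")

def extract_rnr(text):
    """Return the recognized invoice number without separators, or None."""
    match = RNR1.search(text)
    if match is None:
        return None
    return "".join(match.groups())

print(extract_rnr("Rechnung Nr. 2015-00042"))  # prints "201500042"
```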

In the configuration, the entities then become metadata (e.g. bspranr is mapped to data6).

By means of the Entity Recognition Workbench these rules can also be presented and tested using specific documents from the dataset:

In the above example, an invoice number is recognized from its context ("Rechnung Nr." <number>-<number>-<number>) and registered as the property "RNR" without separators.

Query language

The Mindbreeze InSpire query language is used to formulate queries in the clients.

Queries from single word beginnings (single terms)

Wildcard characters (such as %, *, etc.) are not needed when searching for word beginnings or whole words.

Example:

auto

An entry of "auto" starts a search for objects containing words that begin with the characters "auto" (the term). In principle, searches are not case sensitive; that is, a search for auto gives the same results as a search for Auto or AUTO.


Analogous to the queries of individual words or word beginnings, one can also perform a search of several words or several word beginnings in a document. The individual beginnings of words in a search query are linked with a logical AND. Your search will result in a list of only those documents which contain all of the words or the word beginnings.

Search for several word beginnings in the same document

Example:

Variant 1

auto test

Variant 2

Auto Test

Variant 3

AUTO TEST

All three variants in this example deliver the same result: all objects that contain the words "auto" and "test", either as word beginnings or as separate words (terms). The search is not case-sensitive.

Phrase search / exact search


A phrase search looks for exact words or exact phrases. This kind of search is initiated by enclosing the phrase in quotation marks (").

Example:

"Knowledge is a matter of seconds"


Only that exact phrase will be searched for. Phrase searching is therefore not useful if the exact spelling and wording of the words or phrase is not known.

Limit by file extension

Mindbreeze InSpire offers the possibility of limiting the search to known file extensions.

Example:

mind (extension:doc OR extension:xls OR extension:msg)


This query searches all files with the extensions ".doc" (Microsoft Word), ".xls" (Microsoft Excel) and ".msg" (Microsoft Outlook) for the word "mind" and for words beginning with the letters "mind"; the search is not case sensitive.

Logical links

AND

The beginnings of words, whole words and phrases of a search query are implicitly “and-linked”. That is to say, the only results delivered are the hits in which all listed beginnings of words, words and phrases occur. The AND keyword can also be explicitly present in a (e.g. nested) query.

Example:

"Mindbreeze" AND "Search"

OR

The OR link delivers all search results in which at least one of the search conditions is met, i.e. results in which at least one of the listed word beginnings, words or phrases occurs. The OR keyword must be explicitly formulated in a query and can also be nested.

Example:

Variant 1

("Mindbreeze" OR "Search") AND "Software"

Variant 2

("Mindbreeze" OR "Search") "Software"

These two queries return results in which the word "Mindbreeze" and/or the word "Search" appears together with the word "Software". That is to say, they deliver results with the combinations "Mindbreeze" and "Software", "Search" and "Software", and "Mindbreeze", "Search" and "Software".

Other keywords

NEAR

Using NEAR in a search returns only the results in which one searched word is located near another.

Example:

Mindbreeze NEAR Search

NOT

A search with NOT delivers the results from a base set in which a given word does not occur. NOT cannot be used alone.

Example:

Mindbreeze NOT slow

Metadata search

The metadata search is mainly used to further limit a result set; in this context it is also referred to as "refinement". Mindbreeze InSpire uses both fixed predefined metadata and vendor-specific metadata (defined by Mindbreeze partners themselves).

Syntax of a metadata search: <metadata>:<value>

Example:

title:Integration


The search for a file extension can be defined by the metadata item "extension".

Example:

extension:doc mind

This query yields Microsoft Word files that contain the word "mind" or words that begin with "mind".

The following metadata are defined for the default supported data sources:

Short name       Metadata    Explanation                          Available to
Name             title       Search within the name               All
Extension        extension   Search within the extension          All
File             directory   Search within the file               File System, Outlook, Exchange
Subject          subject     Search within the subject            Outlook, Exchange
From             from        Search within the sender             Outlook, Exchange
To               to          Search for recipients                Outlook, Exchange
(not displayed)  content     Search within the document content   All

Among other things, the Exchange Connector defines the metadata from and to.

Example:

from:bauernf


This search delivers those objects that contain the term "bauernf" in the sender's address.

Interval search

Generally, a search with TO delivers results between the left and right sides of the TO operator. Of particular interest is the use of the interval search in conjunction with numeric strings. Mindbreeze recognizes numerical values of various kinds, e.g.:

Text        Canonical Representation (German)
100         100,00
100.0       100,00
100,0       100,00
1.000,00    1000,00
1.000       1,00
1,000.00    1000,00
-100        -100,00
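The canonicalization shown above can be approximated with a short sketch (an illustration of the recognition behavior, not the actual Mindbreeze parser; the tie-breaking rules are inferred from the examples):

```python
def canonicalize(text):
    """Approximate German canonical form: thousands separators removed,
    comma as decimal separator, exactly two decimal places.

    Inferred rules: if both '.' and ',' occur, the last one is the
    decimal separator; a single '.' or ',' is always read as a decimal
    point (hence 1.000 -> 1,00).
    """
    s = text.strip()
    if "." in s and "," in s:
        decimal_sep = "." if s.rindex(".") > s.rindex(",") else ","
        thousands_sep = "," if decimal_sep == "." else "."
        s = s.replace(thousands_sep, "").replace(decimal_sep, ".")
    else:
        s = s.replace(",", ".")
    return ("%.2f" % float(s)).replace(".", ",")

print(canonicalize("1,000.00"))  # prints "1000,00"
```

Because all variants map to the same canonical form, an interval query such as 105 TO 110 matches regardless of how the number was written in the document.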

Syntax of an interval search: <from> TO <to>

Example:

105 TO 110

Advanced metadata interval search

Syntax of an advanced interval search:

label:[<from> TO <to>]

label:[<from>]

label:[TO <to>]

Example:

size:[1MB TO 1,4MB]

mes:date:[2012-03-20 TO 2012-03-25]

Combination of the language elements of the query language

A combination of the above mentioned language elements of the Mindbreeze InSpire query language is possible.

Example:

title:Integration from:bauernf extension:doc

This example provides Microsoft Word documents which were sent from an address with the term "bauernf". The title of the resulting objects contains the word "integration" (or a word that begins with "integration").



Familiar metadata search    Meaning            What you enter in the search field
title                       Name               title:integration
extension                   File extension     extension:doc search
directory                   Directory          directory:review
subject                     Subject            subject:Mindbreeze
from                        From               from:Lehner
to                          To                 to:Lehner
url                         Web address        url:"www.mindbreeze.com"
content                     (not displayed)    content:search content:server
Combination of metadata                        title:integration from:Smith extension:doc