Mindbreeze GmbH, A-4020 Linz, 2022.
All rights reserved. All hardware and software names used are brand names and/or trademarks of their respective manufacturers.
These documents are strictly confidential. The submission and presentation of these documents does not confer any rights to our software, our services and service outcomes, or any other protected rights. The dissemination, publication, or reproduction hereof is prohibited.
For ease of readability, gender differentiation has been waived. Corresponding terms and definitions apply within the meaning and intent of the equal treatment principle for both sexes.
Text classification with Mindbreeze InSpire has never been easier. Tag a portion of your documents with predefined labels. With the help of Mindbreeze Insight Services and Machine Learning, Mindbreeze InSpire is able to expand your knowledge and store it for future use cases. Based on this knowledge, all other documents can subsequently be classified fully automatically.
The main steps to perform this use case are:
To use text classification, certain configuration steps are necessary. Configure the following services: the Prediction Service and the Text Classification Insight Service.
In addition, you need to adjust the configuration of the Client Service and the Index Services.
Details can be found in the next sections.
In Mindbreeze Management Center, navigate to the "Configuration" menu and switch to the "Indices" tab, then add a new service.
For the minimal configuration, fill in the following fields:
Base Path
This parameter specifies the path that the Prediction Service uses to read the training/test data and to store the models it has learned. The base path is freely selectable.
Bind Port
Specifies the TCP port on which the Prediction Service is accessible. The port must not already be in use by another service (e.g. principal resolution, index or client service).
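Before entering a port, it can help to check whether it is already occupied on the host. The following is a minimal sketch, not part of Mindbreeze InSpire: the function name `port_is_free` is hypothetical, and the check only covers the loopback interface at the moment it runs.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can currently be bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Binding failed, typically because another process uses the port.
            return False
    return True


# 23910 is the example port used elsewhere in this document.
print("Port 23910 free:", port_is_free(23910))
```

A port reported as free can still be taken by another service later, so the result is only a snapshot.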
Now add the "Text Classification Insight Service". Assign a "Display Name" again and select the "TextClassificationInsightService" under "Service".
For a minimal configuration, fill in the following fields in the following configuration sections:
In addition to the Prediction Service and the Text Classification Insight Service, you need to change the configuration of the Client Service and the Index Services.
To enable users to label documents in the standard Insight app, you still need to make configuration changes in the client service.
Enable Document Labeling
Enables labeling in the Insight app. Enable this option (default: disabled).
You only need to change the other options in the "Document Labeling" configuration section if you have changed certain default values in the Text Classification Insight service:
The same value as for "Label Property Name" in the Text Classification Insight Service.
Labeling Feedback Collection
The same value as for "Feedback Collection" in the Text Classification Insight Service.
Available Labels Collection
The same value as for "Label Collection" in the Text Classification Insight Service.
Add the previously created "Text Classification Insight Service" to the index at the "Item Transformation Services". If you are using multiple indexes, repeat this step on all index services.
When the configuration is complete, you can define labels. These labels can be used by users to identify documents in the Insight app.
In the Mindbreeze Management Center, navigate to the "Insight Services" > "Text Classification" menu. Then click on "Edit" at "Label Definitions".
Now define the labels by which you want to classify your documents.
Define translations for the languages you want to support in your Insight app. If there is no translation for a language, the ID is used for display in the Insight app. Confirm your entries with the "Save" button.
Now users have the option to label documents with the labels they have just defined. After searching in the Insight app, the found documents can be labeled by selecting the desired label from the drop-down menu.
Logged-in users can read and assign labels. Anonymous users who are not logged in to the Insight app can only read automatically assigned labels; manually assigned labels are not visible to them.
If multiple users assign labels for the same document, all assignments are saved, but effectively only the label of the last assignment is used.
Users also have the option to remove their own feedback again (trash icon). If the document was previously labeled by another user, the previous label is now effective.
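The "last assignment wins" behavior described above can be illustrated with a small model. This is only a sketch of the described semantics, not the actual implementation; the `Assignment` class and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Assignment:
    user: str
    label: str


def effective_label(assignments: List[Assignment]) -> Optional[str]:
    """Return the label of the most recent assignment, or None if there is none."""
    return assignments[-1].label if assignments else None


def remove_feedback(assignments: List[Assignment], user: str) -> List[Assignment]:
    """Drop all assignments made by `user`, keeping the order of the rest."""
    return [a for a in assignments if a.user != user]


# Assignments are stored in chronological order; all of them are kept.
history = [Assignment("alice", "Documentation"), Assignment("bob", "Contract")]
print(effective_label(history))                          # Contract
print(effective_label(remove_feedback(history, "bob")))  # Documentation
```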
Once the required documents have been labeled, you can create the training dataset that will later serve as the basis for creating the model. To do this, navigate to the "Insight Services" > "Text Classification" menu in the Mindbreeze Management Center. Then click on "Edit" at "Labeled Data".
You can now check whether the users have manually labeled the documents correctly. If labels were assigned incorrectly, these assignments can be changed here or even ignored. Then click on "Create or Update Dataset" to save your changes and create the training dataset.
In the next steps, a model can now be created and tested from the training data set.
In the Mindbreeze Management Center, navigate to the "Insight Services" > "Text Classification" menu. Then click on "Train" under "Models".
Now click on "Train Model" to train a model. The default parameters are sufficient for most use cases. However, you can also fine-tune them if your use case requires it. The following parameters can be adjusted:
Must only be changed if the Dataset Label Property Name option has been changed in the Text Classification Insight Service configuration. The value specified here must match the one in the configuration.
The division of the data set into training and testing data. E.g.: "0.8" means that 80% of the data is used for training, 20% for testing.
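To make the ratio concrete, the following sketch splits a list of items with a ratio of 0.8. It is a generic illustration: the function `split_dataset`, the shuffling, and the fixed seed are assumptions and do not describe how the Prediction Service selects its training and test data.

```python
import random


def split_dataset(items, train_ratio=0.8, seed=42):
    """Shuffle the items and split them into a training and a test part."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


train, test = split_dataset(range(100), train_ratio=0.8)
print(len(train), len(test))  # 80 20
```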
If "Custom Regex" is selected, the "Custom Pattern" field appears, in which a custom regex can be specified.
Word Ngram Length
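As a general illustration of what word n-grams are, the sketch below tokenizes a text with a regular expression and builds all n-grams up to a given length. How the service itself interprets these parameters is not described here; the function `word_ngrams`, the default pattern `\w+`, the lowercasing, and the "up to length n" interpretation are all assumptions.

```python
import re


def word_ngrams(text, max_n=2, token_pattern=r"\w+"):
    """Return all word n-grams of length 1..max_n found in the text."""
    tokens = re.findall(token_pattern, text.lower())
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


print(word_ngrams("Text classification made easy", max_n=2))
# ['text', 'classification', 'made', 'easy',
#  'text classification', 'classification made', 'made easy']
```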
In the next step, you can now test the model to get information about the quality of the model you just trained. Scroll to "Test Model". The model you just trained should already be selected. If you now click on "Test Model", the model will be tested with the test data and you will receive key figures that give you information about the quality of the model, such as "Accuracy".
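For orientation, accuracy is the share of test documents whose predicted label equals the manually assigned label. The sketch below shows this calculation on made-up example labels; the function name `accuracy` is hypothetical, and the service may report further key figures.

```python
def accuracy(expected, predicted):
    """Return the fraction of predictions that match the expected labels."""
    if not expected or len(expected) != len(predicted):
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(e == p for e, p in zip(expected, predicted))
    return correct / len(expected)


expected = ["Documentation", "Contract", "Contract", "Invoice"]
predicted = ["Documentation", "Contract", "Invoice", "Invoice"]
print(accuracy(expected, predicted))  # 0.75
```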
Then click on "Set Default" so that this model is used for classification.
As already mentioned, documents are classified automatically when they pass through the Semantic Pipeline, more precisely in the Item Transformation step. Unless explicitly configured otherwise in the service configuration, the default model that you set in the previous step with "Set Default" is used for classification.
Since the Semantic Pipeline is only run through completely for new or changed documents, only new or changed documents are classified. However, to ensure that documents that have already been indexed are also classified, you have two options, which are described in more detail in the next sections:
If the index is small and a full indexing can be performed very quickly, a re-indexing is recommended to trigger a classification of all documents. To do this, navigate to "Services" in the Mindbreeze Management Center. Then click on the gear icon for the index you want to re-index and then click on "Reindex". As soon as the re-indexing is successfully completed, your documents are classified.
If the index is large and a complete indexing takes a long time, a re-inversion is recommended to trigger a classification of all documents. To do this, navigate to "Configuration" in the Management Center and switch to the "Client Service" tab. Activate the "Advanced Settings" and change the "Aggregated Metadata Keys". Changing this option will automatically re-invert the index. For example, you can specify "label" which will result in filtering by label in the Insight app. However, you can also specify a non-existent metadatum key, such as "V1". Save the configuration afterwards.
Once the re-inversion is successfully completed, your documents are classified.
Once your documents are classified, users can continue to provide feedback in the Insight app and change the label of a document if, for example, the automatic classification was inaccurate or incorrect.
If a document now changes or a new document is indexed, the newly trained model is already used for the classification. If you want to classify all documents, including those already indexed, with the new, improved model, you must trigger a re-indexing or a re-inversion.
You can perform these steps to iteratively improve the model as many times as you like until you are satisfied with the quality of your classification model.
This section describes all options available in the Text Classification Insight Service. It is relevant only if you have special use cases that require special configuration.
The TCP port of the service
Max Request Handling Threads
Maximum number of threads used to process the HTTP server requests.
Max Feedback Processing Threads (advanced)
Number of threads used to process the user feedbacks ("Labeled Data").
The URL of the Prediction Service. E.g. http://localhost:23910 if you have selected 23910 as "Bind Port" for the Prediction Service.
The project ID used to structure records in the Prediction Service. Stored in: <PredictionService-Data-Directory>/tenants/<TenantID>/projects/<ProjectID>.
The tenant ID used to structure records in the Prediction Service. Stored in: <PredictionService-Data-Directory>/tenants/<TenantID>/projects/<ProjectID>.
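The directory layout above can be expressed as a small helper. This is a sketch only: the function name `project_directory` and the example values for the data directory, tenant ID and project ID are hypothetical.

```python
from pathlib import PurePosixPath


def project_directory(data_dir: str, tenant_id: str, project_id: str) -> PurePosixPath:
    """Build <data_dir>/tenants/<tenant_id>/projects/<project_id>."""
    return PurePosixPath(data_dir) / "tenants" / tenant_id / "projects" / project_id


# Example values are placeholders, not product defaults.
print(project_directory("/data/prediction", "tenant1", "project1"))
# /data/prediction/tenants/tenant1/projects/project1
```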
Label Property Name
The name of the metadatum used for the label property on the document.
Dataset Label Property Name
The name of the property in the dataset
Default Label Value
Documents that are excluded from classification for certain reasons (e.g. because the "Minimum Content Length" has not been reached) are assigned a default value as a label. This default value can be defined here.
Model ID (optional)
If empty, the "Default Model" is used (can be set in the Management Center under "Text Classification" > "Models"). However, a model ID can also be explicitly specified here, which will then be used for the classification.
Additional Labeling Models (optional)
Here you can specify additional models that will be used in the classification.
Content Length Limit (Characters)
The maximum number of characters of the document content that will be used for classification. If the number of characters exceeds this configured value, the characters beyond it are not used during classification for performance reasons. The value "0" or an empty value disables the character limit.
Minimum Content Length (Characters) (optional)
The minimum number of characters of the document content that is required for the document to be classified. Documents that do not meet this requirement are classified with the configured "Default Label Value". The value "0" or an empty value disables this filter.
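The interplay of "Content Length Limit", "Minimum Content Length" and "Default Label Value" can be sketched as follows. The function `prepare_content`, the example default label, and the order of the two checks are assumptions made for illustration; the source does not state in which order the service applies them.

```python
def prepare_content(content, limit=0, minimum=0, default_label="unknown"):
    """Return (text_to_classify, forced_label).

    If the content is shorter than `minimum`, no text is classified and the
    default label is returned. Otherwise the content is cut to `limit`
    characters. A value of 0 disables the respective check.
    """
    if minimum and len(content) < minimum:
        return None, default_label
    if limit and len(content) > limit:
        content = content[:limit]
    return content, None


print(prepare_content("abc", minimum=5))      # (None, 'unknown')
print(prepare_content("abcdefgh", limit=4))   # ('abcd', None)
```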
Source Metadata Keys (optional)
By default, only the document content is classified. Additional metadata can be specified here, which will be included in the classification.
Should always be enabled
Training Link Extraction (optional)
Links in documents (HTML anchor tags) are not included in training and classification by default. In order to include certain links that are meaningful for labeling, rules can be defined here.
Here you can define rules to label certain documents without calling the Prediction Service. For example, you can use it to classify all documents as "Documentation" that contain "Doc" or "Documentation" in the title.
The first rule that matches a document is always applied. If no rule matches, then the prediction service is used to set the label.
Selects the documents to which the rule is applied. A document is selected if the "Value Pattern" matches the value of the metadata with the "Property Name" key.
Value Pattern (Regex)
See above. Value Pattern is a Java regex. It is case-sensitive unless the pattern starts with (?i).
Which action is to be performed:
Only relevant if "Action" is set to "Set Label" (see above)
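The rule evaluation described above (first matching rule wins, otherwise the Prediction Service decides) can be sketched as follows. This is an illustration under several assumptions: only the "Set Label" action is modeled, the `Rule` class and `classify` function are hypothetical, the pattern is applied as a substring search, and Python's `re` module stands in for the Java regex engine, which differs in some details.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    property_name: str   # metadata key whose value is checked
    value_pattern: str   # regex; case-sensitive unless it starts with (?i)
    label: str           # label set by the "Set Label" action


def classify(metadata: Dict[str, str], rules: List[Rule],
             predict: Callable[[Dict[str, str]], str]) -> str:
    """Apply the first matching rule; fall back to `predict` if none matches."""
    for rule in rules:
        value = metadata.get(rule.property_name, "")
        if re.search(rule.value_pattern, value):
            return rule.label
    return predict(metadata)


rules = [Rule("title", r"(?i)doc(umentation)?", "Documentation")]
fallback = lambda metadata: "predicted"  # stands in for the Prediction Service
print(classify({"title": "API DOC v2"}, rules, fallback))    # Documentation
print(classify({"title": "Invoice 2022"}, rules, fallback))  # predicted
```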
The ports of the indices in which the documents to be classified are located
Configure the same values for the following options as for "Resource Persistence Settings" in the Client Service: "JDBC URL", "Database Credentials", "Database Table Prefix".
JDBC URL
See Client Service.
Database Credentials
See Client Service.
Database Table Prefix
see Client Service
Owner Encryption Credential
If you use Identity Encryption in the Client Service, you must select a credential here. In this case, please select the same credential as in the client service option "Identity Encryption Credential".
The name of the collection in the "itemdata" persisted resources where user label feedback is stored.
The name of the collection in the "labeldefinition" persisted resources where the label definitions are stored.
In addition to user feedback (via the Insight app), a CSV file can be used to set labels for documents. These labels are not displayed in the Insight app, but can be used to train the classification model.
Enable CSV Processing
Activates CSV feedback processing.
CSV File Path
The path to the CSV file (write permissions required).
This section describes the options that are available in the Prediction Service in addition to the mandatory fields. It is relevant only if you have special use cases that require special configuration. Options in this section that are not marked "(Mandatory)" or "(Advanced)" are also considered advanced.
Base Path (Mandatory)
This parameter specifies the path to be used by the prediction service to get the training/test data and the path where the models learned by the service should be stored. The basepath is freely selectable.
Bind Port (Mandatory)
Specifies the TCP port on which the Prediction Service will be accessible. It is important that the port is not already in use by another service (e.g. cache, index or client service).
Dump Request/Responses (Advanced)
Here you can specify under which circumstances the requests and responses of the Prediction Service are written to the dump path. The following options can be selected:
"Never" – Never
Dump Path (Advanced)
Here you can define the path to which the dumps are written. Note that this data resides on the "/data/" partition. The subfolders can be chosen freely.
Dataset Source Query
This can be used to restrict the training set with a query (e.g.: PDFs only). If the Text Classification Insight Service is used, this setting should be left empty.
Dataset Source Property
Currently only “UNIFORM_ITEM_ID” can be selected.
Train Dataset Source Ratio
Defines the percentage of all documents that is used for training. If the Text Classification Insight Service is used, this setting should be left empty.
Label Alias CSV (optional)
With this extension you can translate the label values if the dataset contains a different value than needed for the classification.