Mindbreeze GmbH, A-4020 Linz, 2019.
All rights reserved. All hardware and software names used are registered trade names and/or registered trademarks of the respective manufacturers.
These documents are highly confidential. No rights to our software or our professional services, or results of our professional services, or other protected rights can be based on the handing over and presentation of these documents. Distribution, publication or duplication is not permitted.
Mindbreeze provides language detection for documents using the LanguageDetector ItemTransformer plugin.
To use language detection, the LanguageDetector has to be added to your Mindbreeze installation by loading the corresponding plugin (the Item Transformation Services are included in the package “Mindbreeze Item Transformation Plugins”). Install the plugin using the manager UI.
The plugin also has to be included in your Mindbreeze license.
- Activate the plugin for each index that needs it using the manager UI:
- Select the “Indices” tab and activate “Advanced Settings”.
- Scroll to the “Item Transformation Services” section.
- Select the “TextPlugin.LanguageDetector” plugin and click “Add”.
- Language Probability Threshold: Specifies the probability threshold which has to be reached for a language to be included.
- Source Property Pattern: Specifies the pattern that selects the properties used for language detection.
- Language Target Property: Specifies the new property for the detected languages. To be able to filter by this metadata, it must be aggregatable. To do this, activate the Advanced Settings and add the metadata in the Aggregated Metadata Keys option in the index configuration.
- Language Property: Defines a property that already contains the language. If it is set, language detection is skipped and the target property is set from this property.
- Language Property Pattern: Defines the languages that should be considered from the “Language Property”.
- Included Languages: Defines languages that should be considered by the detector.
- Force Included Languages: If enabled, the probabilities are only calculated on the basis of the “Included Languages” (and not on the basis of all supported languages). If only a few languages are configured in “Included Languages”, it is advisable to disable this option.
- Short Text Algorithm Text Length: For short texts, the quality of language detection can be improved by using the “Short Text Algorithm”. This setting determines the maximum length of the text (in characters) for which the “Short Text Algorithm” is used. Longer texts are analyzed with the “normal” algorithm.
- Max Text Length (characters): Determines the maximum length of the text (in number of characters) to be used for the analysis. For performance reasons, only the first characters of longer texts are analyzed and the rest is skipped. The length of the text is the sum of the contents of all metadata found with the Source Property Pattern. Default value: 100000.
- No Language found set property key and No Language found property value: If language detection could not determine a language, a metadata item can be set with a name (key) and a value (value). This can be useful to explicitly mark documents with no recognized language.
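The interplay of the settings above can be sketched in a few lines of Python. This is only an illustrative model and not the plugin's implementation: the function names, the treatment of the Source Property Pattern as a regular expression, the joining of property values with a space, and the example property names are all assumptions made for this sketch. The language probabilities are passed in directly because the detection algorithm itself is not described here.

```python
import re
from typing import Dict, List, Sequence


def collect_text(properties: Dict[str, str], source_pattern: str, max_length: int) -> str:
    """Model of "Source Property Pattern" plus "Max Text Length".

    Assumption: the pattern is a regular expression matched against the
    whole property name, and matching values are joined with a space.
    """
    matched = [value for name, value in properties.items()
               if re.fullmatch(source_pattern, name)]
    # Only the first max_length characters are analyzed; the rest is skipped.
    return " ".join(matched)[:max_length]


def choose_algorithm(text: str, short_text_length: int) -> str:
    """Model of "Short Text Algorithm Text Length"."""
    return "short" if len(text) <= short_text_length else "normal"


def select_languages(probabilities: Dict[str, float], threshold: float,
                     included: Sequence[str] = ()) -> List[str]:
    """Model of "Language Probability Threshold" and "Included Languages"."""
    candidates = {lang: p for lang, p in probabilities.items()
                  if not included or lang in included}
    kept = [lang for lang, p in candidates.items() if p >= threshold]
    # Most probable language first.
    return sorted(kept, key=lambda lang: candidates[lang], reverse=True)


def annotate(properties: Dict[str, str], languages: List[str], target: str,
             fallback_key: str, fallback_value: str) -> Dict[str, object]:
    """Model of "Language Target Property" and the "No Language found" pair."""
    if languages:
        return {**properties, target: languages}
    # No language reached the threshold: mark the document explicitly.
    return {**properties, fallback_key: fallback_value}
```

In this model, a document whose best language stays below the threshold receives only the fallback key and value, which makes such documents easy to find later.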
Run the LanguageDetector as a Separate Service
The LanguageDetector plugin can also be used as a separate service. This can improve performance on large installations with multiple indices.
Add a new service in the “Services” section of the “Indices” tab and choose “ItemTransformationServicePlugin.LanguageDetector”. In the settings of the new service, configure a “Display Name” and a free TCP port as the “Bind port”. The other settings should be configured as described above. Add the newly created Item Transformation Service to each index that should use it.
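Before entering a “Bind port”, it can help to confirm that the port is really unused on the host. The following is a minimal sketch using only the Python standard library. It is not part of Mindbreeze; the helper name `is_port_free` and the port number in the usage comment are hypothetical, and the script has to run on the machine that will host the service.

```python
import socket


def is_port_free(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if a TCP socket can currently be bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Binding failed, for example because another process uses the port.
            return False
        return True


# Example usage (23800 is an arbitrary, hypothetical port number):
# print(is_port_free(23800))
```

The check is only a snapshot: another process could claim the port between the check and the start of the service.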