HANLP Text Tokenizer Plugin

Copyright ©

Mindbreeze GmbH, A-4020 Linz, 2018.

All rights reserved. All hardware and software names used are brand names and/or trademarks of their respective manufacturers.

These documents are strictly confidential. The submission and presentation of these documents does not confer any rights to our software, our services and service outcomes, or any other protected rights. The dissemination, publication, or reproduction hereof is prohibited.

For ease of readability, gender differentiation has been waived. Corresponding terms and definitions apply within the meaning and intent of the equal treatment principle for both sexes.

Introduction

This document describes the HANLP text tokenizer plugin, which enables Mindbreeze InSpire to crawl and understand Chinese content. The plugin splits sentences into individual, interrelated parts (tokens) in order to provide an optimized search experience. It requires a tokenizer service (not included).

Requirements

A tokenizer service must already be configured before using the plugin.

Set-up

To activate the HANLP tokenizer, the following steps must be carried out:

  • Setting up the post filter
  • Setting up query transformation services
  • Re-indexing contents that were already indexed before the tokenizer was installed

Setting up the post filter

The post filter is used to tokenize (split) the contents during crawling, before they are stored in the index.

  • Navigate to the Management Center.
  • Select the “Filter” tab, activate “Advanced Settings”, and open the filter that you want to use to tokenize Chinese content.
  • Search for the “Post Filter Transformation Services” option and add the tokenizer post filter plugin (TextPlugin.HANLP). Configure the following settings:
  • EndPoint URL: The URL of the /parse servlet of the tokenizer service.
  • Tokenize ISO-8859-1 Text: If this option is enabled, ISO-8859-1 encoded text is also processed by the tokenizer.
  • Excluded Properties Pattern: Properties matching this regular expression are not processed by the tokenizer.
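The Excluded Properties Pattern is an ordinary regular expression matched against property names. As a rough illustration (the property names and the pattern below are hypothetical examples, not values shipped with the plugin), the matching behaves like this:

```python
import re

# Hypothetical "Excluded Properties Pattern": skip technical metadata
# properties that carry no Chinese prose and need no tokenization.
excluded = re.compile(r"^(mes:.*|extension|datasource/.*)$")  # example value

properties = ["content", "title", "extension", "datasource/mid"]

# Only properties that do NOT match the pattern are sent to the tokenizer.
to_tokenize = [p for p in properties if not excluded.match(p)]
# -> ["content", "title"]
```

Properties such as file extensions or data source identifiers are typical candidates for exclusion, since tokenizing them adds no search value.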

Setting up the query transformation service

The query transformation service ensures that the text entered in the search field by the end user is tokenized before the query is executed. Otherwise, the tokenization of the index would not match that of the search query, which would have the same effect as not configuring a tokenizer at all.
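Why the two sides must match can be sketched with a toy example. The segmentation results below are invented for illustration (this is not the HANLP algorithm); the point is only that an untokenized query cannot match a tokenized index:

```python
def toy_tokenize(text):
    # Hypothetical segmentation results; in practice the tokenizer
    # service returns these for both indexing and querying.
    table = {
        "我爱自然语言处理": ["我", "爱", "自然语言", "处理"],
        "自然语言处理": ["自然语言", "处理"],
    }
    return table.get(text, [text])

# Index time: the post filter tokenizes the document text.
index_tokens = set(toy_tokenize("我爱自然语言处理"))

query = "自然语言处理"

# Without query transformation, the whole query is one term -> no match.
match_untokenized = query in index_tokens  # False

# With query transformation, every query token is found in the index.
match_tokenized = all(t in index_tokens for t in toy_tokenize(query))  # True
```

This is exactly the failure mode described above: a tokenized index combined with an untokenized query behaves as if no tokenizer were configured.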

  • Navigate to the Management Center.
  • Choose the “Indices” tab.
  • Enable “Advanced Settings” and open the index containing the Chinese content. Select the filter on which you have configured the post filter.
  • Look for the “Query Transformation Services” setting and add the tokenizer service.
  • Then open the settings of the query transformation service by clicking the “plus sign” icon, and configure the EndPoint URL of the tokenizer service to match the one used in the post filter.
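Before re-indexing, it can be worth confirming that the configured EndPoint URL actually responds. The sketch below stands a local stub in for the real tokenizer service, since the exact request and response format of the /parse servlet is an assumption here; against a real deployment you would post to the configured URL instead:

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class StubTokenizer(BaseHTTPRequestHandler):
    """Local stand-in for the tokenizer service's /parse servlet."""

    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        # Fake segmentation: separate every character with a space.
        tokens = " ".join(body.decode("utf-8"))
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.end_headers()
        self.wfile.write(tokens.encode("utf-8"))

    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), StubTokenizer)
threading.Thread(target=server.serve_forever, daemon=True).start()

# In a real setup, this would be the EndPoint URL from the configuration.
endpoint = f"http://127.0.0.1:{server.server_port}/parse"
req = urllib.request.Request(
    endpoint,
    data="中文".encode("utf-8"),
    headers={"Content-Type": "text/plain; charset=utf-8"},
)
with urllib.request.urlopen(req) as resp:
    result = resp.read().decode("utf-8")  # "中 文"
server.shutdown()
```

If the real service is unreachable from the index node, neither the post filter nor the query transformation will tokenize anything, so a quick connectivity check like this can save a wasted re-indexing run.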

Content re-indexing

If documents already exist in your index, they must be re-indexed, because the existing documents have not yet been tokenized.