    Configuration
    CJK Text Tokenizer Plugin

    Introduction

    This document describes the CJK Tokenizer Plugin, which allows Mindbreeze InSpire to crawl and understand Chinese or Japanese content. For example, sentences are divided into individual parts (tokens) that belong together in order to provide an optimized search experience. The plugin supports multiple tokenizers. An external tokenizer service is also supported (not included).

    Prerequisites

    If an external Tokenizer service is to be used, this service must already be configured.

    Setup

    To activate the CJK Tokenizer, perform the following steps:

    • Set up the Launched Service
    • Set up the Postfilter
    • Set up the QueryTransformationService
    • Reindex the contents that were already indexed before the Tokenizer installation

    Setup of the Launched Service

    The CJK Tokenizer plugin is configured as a single Launched Service. This is the only way to achieve high performance. After configuration, this Launched Service is referenced as a Postfilter and QueryTransformationService.

    To set up the CJK Tokenizer Plugin Launched Service, switch to the "Index" tab in the configuration and add a new service in the "Services" section.

    Base Configuration

    Bind port

    A free TCP port on the appliance on which the Launched Service runs.

    Tokenizer

    Selects the Tokenizer mode. The following modes are selectable:

    • Jieba: internal tokenizer, Chinese
    • [Deprecated] HANLP: external tokenizer service
    • Kuromoji: internal tokenizer, Japanese

    Separation character

    Character used to separate the tokens. The default value is \uFEFF. The value can be changed for testing purposes, but the default value must be retained for the search to work correctly.

    Tokenize ISO-8859-1 Text

    If this option is activated, ISO-8859-1 encoded text is also processed by the tokenizer.

    Enable Text Normalization

    Text is normalized so that, for example, documents with full-width characters can be found even though normal western characters were used in the search. The normalization form used is NFKC.
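The effect of NFKC normalization can be reproduced with Python's standard unicodedata module. This only illustrates the normalization form itself, not the plugin's implementation; the sample strings are made up for the example.

```python
import unicodedata


def normalize_nfkc(text: str) -> str:
    """Return the NFKC-normalized form of text."""
    return unicodedata.normalize("NFKC", text)


# Full-width Latin letters, digits and the ideographic space
# fold to their ASCII counterparts.
full_width = "Ｍｉｎｄｂｒｅｅｚｅ　２０２５"
print(normalize_nfkc(full_width))  # Mindbreeze 2025

# Half-width katakana folds to regular katakana.
print(normalize_nfkc("ｶﾀｶﾅ"))  # カタカナ
```

With normalization applied on both the indexed text and the query, a search typed with ordinary western characters can match a document that was written with full-width characters.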

    Excluded Properties Pattern

    Properties whose names match the regular expression configured here are not processed by the tokenizer.
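A pattern for this setting can be tried out before it is entered in the configuration. The sketch below uses Python's re module; the property names are hypothetical, and it assumes that the pattern has to match the whole property name. The plugin's own regular expression engine and matching rules may differ.

```python
import re

# Hypothetical pattern: skip the "url" property and anything ending in "_id".
EXCLUDED = re.compile(r"url|.*_id")


def is_excluded(property_name: str) -> bool:
    """Assumption: the pattern must match the whole property name."""
    return EXCLUDED.fullmatch(property_name) is not None


for name in ["title", "url", "customer_id", "content"]:
    print(name, "skipped" if is_excluded(name) else "tokenized")
```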

    Jieba Configuration

    Note: Only relevant if Jieba is selected as the Tokenizer.

    Segmentation Dictionary

    The dictionary used for tokenizing:

    • Default: smaller vocabulary
    • Enhanced Support for traditional Chinese (Large): larger vocabulary

    Segmentation Mode

    Depending on whether the service is used as a QueryExprTransformation service or as a post-filter, different settings can be used. However, the default value "Index" is sufficient for both service types.

    • Index: for post-filter or QueryExprTransformation service
    • Search: for QueryExprTransformation service

    HANLP Configuration (deprecated)

    Note: Only relevant if HANLP is selected as the Tokenizer.

    [Deprecated] EndPoint URL

    URL of the /parse servlet of the Tokenizer service

    Kuromoji Configuration

    Note: Only relevant if Kuromoji is selected as the Tokenizer.

    Tokenizermode

    Kuromoji tokenizer mode; see also the Kuromoji Javadoc.

    Setup of the Postfilter

    The postfilter tokenizes (decomposes) the contents at crawl time, before they are stored in the index.

    • Navigate to the Management Center.
    • Select the "Filter" tab, activate the "Advanced Settings" and open the filter that should tokenize the Chinese content.
    • Search for the "Post Filter Transformation Services" option and add the reference to the CJK Tokenizer PostFilter Plugin (TextPlugin.CJKTokenizer), recognizable by the "@" in the name.

    Setup of the Query Transformation Service

    The Query Transformation Service ensures that the text entered by the end user in the search field is tokenized before the query is executed. Otherwise, the tokenization of the index does not match that of the search query, which has the same effect as not having configured a Tokenizer at all.

    • Navigate to the Management Center.
    • Select the "Indices" tab.
    • Activate the "Advanced Settings" and open the index containing the Chinese content. Select the filter on which you have configured the post filter.
    • Search for the "Query Transformation Services" setting and add the reference to the CJK Tokenizer QueryTransformation Plugin (TextPlugin.CJKTokenizer), recognizable by the "@" in the name.
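
Why the index and the query need the same tokenization can be shown with a toy model. The dictionary, the greedy longest-match segmenter and the set-based lookup below are illustrative assumptions only; they are not the algorithms used by Jieba or by the Mindbreeze index.

```python
SEPARATOR = "\ufeff"  # default separation character of the plugin
TOY_DICTIONARY = {"清洁", "技术", "能源"}  # hypothetical vocabulary


def tokenize(text: str) -> str:
    """Greedy longest-match segmentation; tokens are joined by SEPARATOR."""
    tokens, pos = [], 0
    max_len = max(len(word) for word in TOY_DICTIONARY)
    while pos < len(text):
        for size in range(min(max_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + size]
            # Fall back to a single character if no dictionary word matches.
            if candidate in TOY_DICTIONARY or size == 1:
                tokens.append(candidate)
                pos += size
                break
    return SEPARATOR.join(tokens)


def matches(indexed_text: str, query: str) -> bool:
    """A document matches if every query term is one of its indexed terms."""
    indexed_terms = set(indexed_text.split(SEPARATOR))
    return all(term in indexed_terms for term in query.split(SEPARATOR))


indexed = tokenize("清洁能源技术")  # stored as three separate tokens
print(matches(indexed, "清洁技术"))            # False: query was not tokenized
print(matches(indexed, tokenize("清洁技术")))  # True: query tokenized the same way
```

In this model the untokenized query is treated as one long term that never occurs in the index, which mirrors the situation described above when the Query Transformation Service is missing.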

    Re-indexing of contents

    If documents already exist in your index, they must be re-indexed because the existing documents have not yet been tokenized.

    Troubleshooting

    The CJK Tokenizer Plugin runs a test servlet on the BindPort, which can be used for diagnostic purposes. For example, you can tokenize arbitrary text fragments in the web browser.

    For example, the call

    https://myappliance:8443/index/{{BindPort}}/tokenize?text=清洁技术

    returns the input text split into tokens (the first token here is 清洁), joined by the separation character.

    Note: The default separation character is not visible. To make these separators visible, you can copy the result to an editor.
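
A small script can build the test URL and make the separators visible. This is a sketch under assumptions: the host name is taken from the example above, the bind port 23400 is a placeholder, and the sample response is constructed by hand rather than fetched from an appliance.

```python
from urllib.parse import quote

SEPARATOR = "\ufeff"  # default separation character


def tokenize_url(host: str, bind_port: int, text: str) -> str:
    """Build the test servlet URL with the text percent-encoded as UTF-8."""
    return f"https://{host}:8443/index/{bind_port}/tokenize?text={quote(text)}"


def show_separators(result: str, marker: str = "|") -> str:
    """Replace the invisible separation character with a visible marker."""
    return result.replace(SEPARATOR, marker)


print(tokenize_url("myappliance", 23400, "清洁技术"))

# Hand-built sample response, assuming the text is split into two tokens.
sample = "清洁" + SEPARATOR + "技术"
print(show_separators(sample))  # 清洁|技术
```

Replacing the separator with a visible marker serves the same purpose as copying the result into an editor.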
