
    Web Connector

    Setup and Troubleshooting for Advanced JavaScript Use Cases

    Copyright ©

    Mindbreeze GmbH, A-4020 Linz, 2023.

    All rights reserved. All hardware and software names are brand names and/or trademarks of their respective manufacturers.

    These documents are strictly confidential. The submission and presentation of these documents does not confer any rights to our software, our services and service outcomes or other protected rights.

    The dissemination, publication or reproduction hereof is prohibited.

    For ease of readability, gender differentiation has been waived. Corresponding terms and definitions apply within the meaning and intent of the equal treatment principle for both sexes.


    Introduction

    The Web Connector indexes HTML documents and also supports the indexing of web pages that use JavaScript by means of "JavaScript crawling".

    This documentation describes how to use and handle JavaScript crawling in the context of advanced JavaScript use cases.

    Note: For the use cases described here, basic knowledge of HTML and JavaScript is required. Furthermore, basic knowledge about the structure of the web page to be indexed is required.

    Architecture

    Web Connector without JavaScript (Regular Web Crawling)

    In this mode, the Web Connector downloads the web page to be indexed directly with a single HTTP request. Web pages with dynamic content can therefore not be indexed properly in this mode.

    Web Connector with JavaScript (JavaScript Crawling)

    In this mode, the Web Connector downloads the web pages to be indexed indirectly using JavaScript crawling, allowing it to index web pages that are dynamically loaded with JavaScript.

    However, this mode is very performance-intensive and should only be used if no better alternative is available. See the section Alternatives.

    JavaScript crawling requires some additional settings, which are described in the following sections.

    Comparison between Regular Web Crawling and JavaScript Crawling

    The following table shows the main features and differences between regular web crawling and JavaScript crawling.

    | Web Content Compatibility | Regular Web Crawling | JavaScript Crawling |
    | --- | --- | --- |
    | Crawling efficiency | Highly efficient | Very performance-intensive |
    | Static websites | Fully indexed | Fully indexed |
    | Web pages with authorization (not supported in regular crawling) | Indexed with the login mask; full indexing not possible | Automatic login is possible, so only the desired content is indexed |
    | Web pages with unwanted content (static cookie banners or ads) | Unwanted content is indexed | Unwanted content can be hidden |
    | Content loaded by user input | Some content is not loaded | User input can be simulated; all content is loaded and indexed |
    | Content with delayed (lazy) loading | Some content is not loaded | The crawler can wait for content; all content is indexed |

    Configuration of JavaScript Crawling

    Settings for indexing with JavaScript

    The following settings can be used to regulate which pages should be indexed with JavaScript. Furthermore, you can also control whether only web pages with the content type "text/html" should be indexed (this is determined in a separate HEAD request).

    A detailed explanation of the settings listed below can be found in the documentation Configuration - Web Connector.

    • Enable JavaScript
    • Skip Head Request
    • Include JavaScript URL (regex)
    • Exclude JavaScript URL (regex)

    The idea behind these fine-grained configuration options is to only use JavaScript crawling for web pages when it is necessary.

    Default behavior

    When the Web Connector is run in JavaScript mode (without configuring any other settings), the web page content is loaded and indexed as soon as the configured "Page Ready State" is reached. If the configured "Page Ready State" is not reached within the "Network Timeout", an error is logged and the content is not indexed.

    These settings are also described in the documentation Configuration - Web Connector.

    Behavior with Scripts

    Scripts are JavaScript code fragments that can be configured for advanced use cases. The following sections describe how to use these scripts:

    Content Presence Selector

    If the default behavior of the connector is not sufficient (for example, a web page is indexed with incomplete content), a "Content Presence Selector" can be defined. This selector can be specified as an XPath or CSS selector and changes how the connector detects content: instead of relying on the "Page Ready State", it waits until the selector matches. This is especially useful for complex web pages that take longer to load.
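    For illustration only (the class name article-body is a hypothetical assumption, not taken from this document), a Content Presence Selector could be written in either form:

```javascript
// Hypothetical Content Presence Selector for a page whose main content is
// rendered into <div class="article-body">...</div>.
// CSS form:
const cssSelector = 'div.article-body';
// Equivalent XPath form:
const xpathSelector = "//div[contains(@class, 'article-body')]";
// The connector waits until the configured selector matches before indexing
// the page; only then is the content considered fully loaded.
```

    Both forms describe the same element; which one to use is a matter of preference, though XPath also allows combining several conditions in one expression.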

    Scripts

    For web pages that need user input or are more complex, the "Content Presence Selector" can be extended with the "Script" setting.

    Configured scripts are executed based on events using the "Script Trigger Selectors". A Script Trigger Selector checks the DOM of the document to see if a certain condition occurs. When a Script Trigger Selector becomes active, the associated script is executed in the context of the current processing of the web page.

    Processing stops when the Content Presence Selector becomes active (when the requested content is loaded).

    For example, a Script Trigger Selector can detect whether user input is currently required (e.g. a login form is displayed) and then perform actions in the associated script (e.g. clicking the login button).

    The "Script Trigger Selectors" can be specified using XPath and CSS.

    Assume Content if no Script triggered

    In some cases, we cannot define an explicit "Content Presence Selector". This can be the case if elements of the content are already loaded in the DOM before the script is to be executed or if different web pages are to be indexed, for which we cannot define a common selector.

    The setting "Assume Content if no Script triggered" exists to support these special cases. If this setting is enabled, the processing will be stopped as soon as no Script Trigger Selector is active and the web page will be indexed in this state.

    Note: Care should be taken with pages that load slowly, as this setting may prevent subsequent scripts from running, so that the content is indexed prematurely and incompletely.

    Use Cases

    Intranet with unsupported API authentication

    This example explains the steps required to set up the Mindbreeze Web Connector to crawl an intranet website whose API authentication is not supported by regular crawling.

    Requirements Analysis

    Alternatives

    First of all, the website to be indexed must be analyzed. Since indexing with JavaScript is very performance-intensive, the following questions should be answered first and foremost:

    • Is the relevant website a supported data source for which another connector already exists (e.g. Microsoft SharePoint Online, Salesforce, Google Drive, Box)?
    • Is the authentication variant already supported with regular crawling (Form, Basic, NTLM, OAuth)?
    • Is there an API that can be integrated with the Data Integration Connector? See the documentation Configuration - Data Integration Connector.

    If none of these alternatives apply to the web page, we proceed with JavaScript crawling. Next, we analyze the behavior of the web page.

    Behavior

    In order to analyze the behavior of the web page, we first need to make sure that the necessary credentials (username, password) are available and that the login works with a browser (e.g.: Chrome, Firefox).

    If the login in the browser was successful, check whether all content that should be indexed is available. (The user may not have sufficient permissions to see the relevant content.)

    If all content is displayed correctly, the individual pages are analyzed in detail.

    For the analysis of the individual pages, a distinction must be made between three types:

    • Login pages
    • Content pages
    • Redirects ("redirected pages")
    Login Pages
    • How is the login form structured, which fields have to be filled in?
    • How is the login triggered? Which button has to be pressed?
    • How long does the entire login process take? (timeouts may have to be adjusted)
    • To which hosts are connections established (e.g. loading fonts or JavaScript components from other servers)?

    Note: Two-factor authentication is NOT supported.

    Content Pages

    The following points should be considered here:

    • Are user actions required to view the content? For example:
      • Do you need to scroll down to load the whole content?
      • Does a button need to be pressed to view the content?
      • Does an element need to be clicked and confirmed to display the content (e.g. a cookie banner)?
      • Do you need to operate a menu?
    • How quickly does the page react to these actions? (Timeouts may have to be adjusted.)
    • To which hosts are connections established (e.g. loading fonts or JavaScript components from other servers)?
    • For pages that redirect, see the section Redirects (Redirected Pages).
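    These checks map directly onto scripts. As a minimal sketch (the #accept-cookies selector is an assumption for illustration, not from this document), a script triggered when a cookie banner appears could dismiss it and scroll down so that lazily loaded content appears:

```javascript
// Hypothetical script body. In the connector's "Script" setting this would be
// written directly against window/document:
//   document.querySelector('#accept-cookies').click();
//   window.scrollTo(0, document.body.scrollHeight);
// The helper form below takes window and document as parameters so the same
// logic can also be exercised outside a browser.
function dismissBannerAndScroll(win, doc) {
  const accept = doc.querySelector('#accept-cookies'); // assumed banner button
  if (accept) {
    accept.click(); // close the cookie banner so it is not indexed
  }
  win.scrollTo(0, doc.body.scrollHeight); // scroll to the bottom to trigger lazy loading
}
```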
    Redirects (Redirected Pages)

    For web pages that redirect to other pages, it is important to note that the web connector blocks all unknown hosts by default for security reasons. (Error code: BLOCKED_BY_CLIENT)

    To allow other hosts, they must be specified as "Additional Network Resources Hosts" in the settings.

    Note: This also applies to sitemaps. Here, all relevant hosts must be specified as "Additional Network Resources Hosts".

    Browser Developer Console

    When the analysis of these pages is complete, you can begin to recreate the login using JavaScript in the browser's Developer Console (F12).

    Login Page

    The first step is to locate the elements of the login window (e.g.: username, password, login button) and note the corresponding selector.

    For the login page, the selector for the form is well suited as a "Script Trigger Selector". Using XPath, you can specify more complex selectors which, for example, match the combination of the username and password IDs:
    (//*[@id='username'] and //*[@id='password'])

    1. Locate an element in the DOM.
    2. Copy the CSS selector of the element.
    3. Copy the XPath selector of the element.

    Once the selectors are noted, they can be tested in the console tab of the Developer Console.
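    The original example for this step did not survive extraction. As a sketch (the IDs username and password are assumptions from the hypothetical login form), the console commands look like the comments below; the small helper makes the check reusable outside a browser:

```javascript
// In the browser's Developer Console (F12), a CSS selector is tested with:
//   document.querySelector('#username')   // returns the element, or null
// and an XPath selector with the console helper $x() (Chrome/Firefox):
//   $x("//*[@id='username']")             // returns an array of matching nodes
// A selector is usable when it returns exactly the intended element.
function selectorMatches(doc, cssSelector) {
  return doc.querySelector(cssSelector) !== null;
}
```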

    Once all elements of the login page have been found, you can begin recreating the login process in the console:

    1. Enter the username.
    2. Enter the password.
    3. Press "Login".

    If the login via the Developer Console was successful, the three commands can be created as a standalone script in the Management Center using the corresponding "Script Trigger Selector".
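    As a sketch of such a standalone script (the field IDs username and password and the button selector #login are assumptions from the hypothetical analysis, not part of this document):

```javascript
// Hypothetical login script for the "Script" setting in the Management Center,
// paired with the Script Trigger Selector (XPath):
//   (//*[@id='username'] and //*[@id='password'])
// In the configured script, the three statements run directly against document:
//   document.querySelector('#username').value = '<username>';
//   document.querySelector('#password').value = '<password>';
//   document.querySelector('#login').click();
// The function form below takes the document as a parameter so the same logic
// can also be exercised outside a browser.
function performLogin(doc, username, password) {
  doc.querySelector('#username').value = username; // 1. enter the username
  doc.querySelector('#password').value = password; // 2. enter the password
  doc.querySelector('#login').click();             // 3. press "Login"
}
```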

    Content Pages / Redirects (Redirected Pages)

    The content page(s) contain the content to be indexed. For these pages, it is important to note that there is only one "Content Presence Selector" per data source. It is therefore important to make sure that all pages have the same selector or that the option "Assume Content if no Script triggered" is active.

    For this step it is also important that ALL pages are checked (not only the start/login page). If it is difficult to consistently identify the content with a single Content Presence Selector, we recommend splitting the indexing into multiple data sources.

    In some cases, special user interactions may also need to be performed for the content pages to see the final content. Scripts can be configured for this purpose, following the same procedure as for the login script.

    1. Identify the selectors (the trigger selector and the selector of the element being interacted with).
    2. Test the selectors.
    3. Imitate the user interaction in the Developer Console.
    4. Copy the script to the configuration and test the indexing.

    Redirected pages behave like normal content pages and are treated the same way. If user interaction is required, a script must be configured; otherwise, the default JavaScript crawling behavior is executed. If a page links to hosts that do not correspond to the configured "Crawling Root" hosts, these hosts must also be specified as "Additional Network Resources Hosts".

    Troubleshooting

    If too few, too many or incomplete documents are indexed, check the following items:

    1. Check configuration for correctness
      1. Selectors
      2. Scripts
    2. Check website for response time, adjust timeouts
    3. Check logs for errors
      1. /current/log-mescrawler_launchedservice.log
      2. /current/job/logs/crawl.log
    4. Check app.telemetry
      1. Check the Crawler Service
      2. Check the Index Service
    5. Activate the Advanced Setting "Enable Verbose Logging", crawl again and check the logs
      1. /current/log-webdriver-webdriver/current/chromedriver-logs/*
      2. /current/log-webdriver-webdriver/current/chromedriver-screenshots/*
      3. /current/log-webdriver-webdriver/current/chromedriver-javascript-process.csv
      4. /current/log-webdriver-webdriver/current/chromedriver-network.csv
      5. /current/log-webdriver-webdriver/current/log-webdriver.log
