Copyright ©
Mindbreeze GmbH, A-4020 Linz, 2023.
All rights reserved. All hardware and software names are brand names and/or trademarks of their respective manufacturers.
These documents are strictly confidential. The submission and presentation of these documents does not confer any rights to our software, our services and service outcomes or other protected rights.
The dissemination, publication or reproduction hereof is prohibited.
For ease of readability, gender differentiation has been waived. Corresponding terms and definitions apply within the meaning and intent of the equal treatment principle for both sexes.
The Web Connector indexes HTML documents and also supports the indexing of web pages that use JavaScript by means of "JavaScript crawling".
This documentation describes how to use and handle JavaScript crawling in the context of advanced JavaScript use cases.
Note: The use cases described here require basic knowledge of HTML and JavaScript, as well as a basic understanding of the structure of the web pages to be indexed.
In regular crawling mode, the Web Connector downloads the web page to be indexed directly using a single HTTP request. In this mode, it is not possible for the connector to properly index web pages with dynamic content.
In JavaScript crawling mode, the Web Connector loads the web pages to be indexed indirectly via JavaScript crawling, which allows it to index web pages whose content is loaded dynamically with JavaScript.
However, this mode is very performance-intensive and should only be used if no better alternative is available. See the section Alternatives.
JavaScript crawling requires some additional settings, which are described in the following sections.
The following table shows the main features and differences between regular web crawling and JavaScript crawling.
Web Content Compatibility | Regular Web Crawling | JavaScript Crawling
Crawling efficiency | Highly efficient | Very performance-intensive
Static websites | Fully indexed. | Fully indexed.
Web pages with authorization (not supported in regular crawling) | Indexed with the login mask; full indexing is not possible. | Automatic login is possible so that only the desired content is indexed.
Web pages with unwanted content (static cookie banners or ads) | Unwanted content is indexed. | Unwanted content can be hidden.
Content loaded by user input | Some content is not loaded. | User input can be simulated; all content is loaded and indexed.
Content with delayed (lazy) loading | Some content is not loaded. | The crawler can wait for the content; all content is indexed.
The following settings can be used to regulate which pages should be indexed with JavaScript. Furthermore, you can also control whether only web pages with the content type "text/html" should be indexed (this is determined in a separate HEAD request).
A detailed explanation of the settings listed below can be found in the documentation Configuration - Web Connector.
The idea behind these fine-grained configuration options is to only use JavaScript crawling for web pages when it is necessary.
When the Web Connector is run in JavaScript mode (without configuring any other settings), the web page content is loaded and indexed as soon as the configured "Page Ready State" is reached. If the configured "Page Ready State" is not reached within the "Network Timeout", an error is logged and the content is not indexed.
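For reference, a page's ready state can be inspected directly in the browser console (this sketch assumes that the "Page Ready State" setting refers to the standard document.readyState values of the page, which is an assumption, not a confirmed mapping):
// Possible values: "loading", "interactive", "complete"
document.readyState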
These settings can also be found in the documentation Configuration - Web Connector.
Scripts are JavaScript code fragments that can be configured for advanced use cases. The following sections describe how to use these scripts:
If the default behavior of the connector is not sufficient (for example, a web page is indexed with incomplete content), a "Content Presence Selector" can be defined. This selector can be specified using XPath or CSS and changes how the connector recognizes that the content has loaded: the connector now waits until the selector matches. This is especially useful for complex web pages that take longer to load.
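For example, for a (hypothetical) page whose main content is rendered into an element with the ID "main-content", the "Content Presence Selector" could look as follows:
CSS: #main-content
XPath: //*[@id='main-content']
The connector then waits until this element appears in the DOM before the page is indexed.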
For web pages that need user input or are more complex, the "Content Presence Selector" can be extended with the "Script" setting.
Configured scripts are executed based on events using the "Script Trigger Selectors". A Script Trigger Selector checks the DOM of the document to see if a certain condition occurs. When a Script Trigger Selector becomes active, the associated script is executed in the context of the current processing of the web page.
Processing stops when the Content Presence Selector becomes active (when the requested content is loaded).
For example, a Script Trigger Selector can detect if user input is currently required (e.g. a login form is displayed) and then perform actions in the associated script (e.g. a click on the login button).
The "Script Trigger Selectors" can be specified using XPath and CSS.
In some cases, we cannot define an explicit "Content Presence Selector". This can be the case if elements of the content are already loaded in the DOM before the script is to be executed or if different web pages are to be indexed, for which we cannot define a common selector.
The setting "Assume Content if no Script triggered" exists to support these special cases. If this setting is enabled, the processing will be stopped as soon as no Script Trigger Selector is active and the web page will be indexed in this state.
Note: Care should be taken with pages with delayed loading, as this setting may cause subsequent scripts to no longer be executed and the content to be indexed prematurely and incompletely.
This example explains the steps required to set up the Mindbreeze Web Connector to crawl an intranet website with an authentication mechanism that is not supported in regular crawling.
First of all, the website to be indexed must be analyzed. Since indexing with JavaScript is very performance-intensive, the following questions should be answered first and foremost:
If none of these alternatives apply to the web page, we proceed with JavaScript crawling. Next, we analyze the behavior of the web page.
In order to analyze the behavior of the web page, we first need to make sure that the necessary credentials (username, password) are available and that the login works with a browser (e.g.: Chrome, Firefox).
If the login in the browser is successful, check whether all content that should be indexed is actually available. (The user may not have sufficient permissions to see the relevant content.)
If all content is displayed correctly, the individual pages are analyzed in detail.
For the analysis of the individual pages, a distinction must be made between three types of pages: the login page, the content page(s), and pages that redirect to other pages.
Note: Two-factor authentication is NOT supported.
The following points are to be considered here:
For web pages that redirect to other pages, it is important to note that the web connector blocks all unknown hosts by default for security reasons. (Error code: BLOCKED_BY_CLIENT)
To allow other hosts, they must be specified as "Additional Network Resources Hosts" in the settings.
Note: This also applies to sitemaps. Here, all relevant hosts must be specified as "Additional Network Resources Hosts".
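For example, if the login redirects to a (hypothetical) single sign-on host, this host must be added to the setting:
Additional Network Resources Hosts: sso.example.com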
When the analysis of these pages is complete, you can begin to recreate the login using JavaScript in the browser's Developer Console (F12).
The first step is to locate the elements of the login window (e.g.: username, password, login button) and note the corresponding selector.
For the login page, the selector for the login form is well suited as a "Script Trigger Selector". Using XPath, you can also specify more complex selectors which, for example, check for the combination of the username and password IDs:
(//*[@id='username'] and //*[@id='password'])
Once the selectors are noted, they can be tested in the console tab of the developer console as follows:
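In the Chrome or Firefox Developer Console, XPath selectors can be tested with the console utility $x() and CSS selectors with document.querySelectorAll() (the selectors below are placeholders for the ones actually noted):
// Returns an array with the matching elements (empty if the selector does not match):
$x("//*[@id='username']")
// Returns a NodeList with the matching elements:
document.querySelectorAll('#login-button')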
Once all elements of the login page have been found, you can begin recreating the login process in the console.
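A minimal sketch of such a login sequence might look as follows (the element IDs 'username', 'password' and 'login-button' as well as the credential values are hypothetical and must be replaced with the selectors and credentials noted above):
// Fill in the credentials and submit the login form:
document.getElementById('username').value = 'crawling-user';
document.getElementById('password').value = 'crawling-password';
document.getElementById('login-button').click();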
If the login via the Developer Console was successful, the three commands can be created as a standalone script in the Management Center using the corresponding "Script Trigger Selector".
The content page(s) contain the content to be indexed. For these pages, it is important to note that there is only one "Content Presence Selector" per data source. It is therefore important to make sure that all pages have the same selector or that the option "Assume Content if no Script triggered" is active.
For this step it is also important that ALL pages are checked (not only start/login page). If it is difficult to consistently identify the content with a single Content Presence Selector, we recommend splitting the indexing into multiple data sources.
In some cases, special user interactions may also need to be performed for the content pages to see the final content. Scripts can be configured for this purpose, following the same procedure as for the login script.
Redirected pages behave like normal content pages and are treated the same way. If user interaction is required, a script must be configured; otherwise the default JavaScript crawling behavior is used. Hosts that do not correspond to the configured "Crawling Root" hosts must also be specified as "Additional Network Resources Hosts".
This example explains the steps required to set up Mindbreeze Web Connector to crawl web pages with accordions.
If the web page loads the accordion content into the DOM, no further steps need to be taken to expand it.
Otherwise, you need to check whether the web page is a single-page application, as these are not currently supported (see the limitations of the "Enable JavaScript" option). However, this restriction can be bypassed in certain cases.
Basically, to load the content, it is important to know which of two types of accordion is used: accordions where all elements can be expanded at the same time, and accordions where only one element can be expanded at a time.
Both types require user interaction and the goal is to load all content into the DOM.
If all elements of the accordion are expandable at the same time, a script is sufficient to click on the respective buttons to load the content.
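A minimal sketch of such a script (the class name 'accordion-toggle' and the use of the 'aria-expanded' attribute are assumptions and must be adapted to the actual page):
// Click every accordion button that is not yet expanded:
document.querySelectorAll('.accordion-toggle').forEach(function (button) {
  if (button.getAttribute('aria-expanded') !== 'true') {
    button.click();
  }
});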
If not all elements can be expanded at the same time, a script is needed first to disable this mechanism; otherwise not all content will be available.
This mechanism can be implemented by JavaScript on the web page, but it can also be determined by the properties of the framework used.
In case of any ambiguity in this regard, the web developers of the website should be contacted to clarify the possibilities of disabling this mechanism.
If the mechanism is disabled, a script can expand the individual accordion elements as before.
This example explains the steps required to set up Mindbreeze Web Connector to crawl web pages that support infinite scrolling.
If the web page loads the entire content into the DOM, no further steps need to be taken to scroll.
If this is not the case, a script must be used to scroll to the bottom of the page so that all content is available in the DOM.
The difficulty with this is to recognize when you have reached the bottom, as this is not consistent across web pages.
For web pages that display a "No more results" notification at the end of the list, which does not always exist in the DOM, a "Content Presence Selector" that addresses this notification and a script that only scrolls down are sufficient.
If the notification always exists in the DOM, you must try to determine the end of the document, as with loading animations.
If there is no notification at the end of the list, a script and the setting "Assume Content if no Script triggered" must be used to find out when the end of the list is reached.
For this purpose, the loading animation of the list, as well as several height parameters can be used.
To stop the iteration of the script, the used "Script Trigger Selector" in the HTML must be changed when the end of the list is reached.
The web page to be indexed must be scrolled to the bottom in order to display all elements. As long as the list of elements is found, try to scroll down. When the end of this list is reached, a message with "No more results" is displayed.
The script is executed until the Content Presence Selector triggers (the end is reached).
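A minimal sketch for this case (all class names are hypothetical and must be replaced with the selectors of the actual page):
Content Presence Selector (CSS): .no-more-results
Script Trigger Selector (CSS): .element-list
Script:
// Scroll to the bottom; the script keeps being triggered until the notification appears.
window.scrollTo(0, document.body.scrollHeight);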
The web page to be indexed must be scrolled to the bottom in order to display all elements. As long as the list with the elements is found, try to scroll down. If more results are available, this will be indicated with a loading animation. When the end of this list is reached, no message is displayed.
Assume Content if no Script triggered: true
Script Trigger Selector (CSS): .element-list
Script:
if (window.innerHeight + window.pageYOffset == document.body.offsetHeight &&
    document.getElementsByClassName('loading-animation').length == 0) {
  // End of the list reached and no loading animation visible ('loading-animation' is an
  // example class name; use the class of the page's actual loading animation).
  // Rename the class so that the Script Trigger Selector no longer matches:
  document.getElementsByClassName('element-list')[0].className = 'element-list-done';
} else {
  // Not yet finished: scroll further down to load more results.
  window.scrollTo(0, document.body.scrollHeight);
}
Whether the end of the list has already been reached is checked here by the following two conditions:
window.innerHeight + window.pageYOffset == document.body.offsetHeight
document.getElementsByClassName('loading-animation').length == 0
The first condition checks whether the end of the document has been reached. However, since this script is executed very frequently, the bottom of the page may be reached temporarily while the loading bar is still displayed, which signals that more objects are being loaded. To avoid stopping too early, the second condition checks that the loading bar is not present.
If both conditions are met, the class of the element matched by the Script Trigger Selector is changed so that the script is not triggered again. Otherwise, the page is scrolled further down.
If too few, too many or incomplete documents are indexed, check the following items:
If "Content Presence Selector" or "Script Trigger Selector" are not executed as expected, the log "/current/log-webdriver-webdriver/current/chromedriver-logs/driver-actions.log" should be checked.
In this log file, look for the method "isElementPresent"; this is where the selectors are processed. In case of syntax errors, these would be displayed in the leave event of this method.
It is important to note that selectors should only be specified with single quotes. If double quotes are used, they must be escaped beforehand with "\".
Correct:
XPath:
//*[@id='content']
//*[@id=\"content\"]
CSS:
a[href^='https']
a[href^=\"https\"]
Incorrect (unescaped double quotes):
XPath:
//*[@id="content"]
CSS:
a[href^="https"]
Sometimes the "Status Code Description" "net::ERR_ABORTED" is displayed for documents in chromedriver-network.csv.
If this is the case, look specifically at the columns "Website URL" and "Network Resource URL", which show the requested page URL and the current URL.
These usually differ, since the current URL (Network Resource URL) represents the actually requested resource (jpg, js, svg, redirected URL, ...).
These URLs can now be searched for in the cdp-logs.csv document to get a detailed description of the error.