Copyright © Mindbreeze GmbH, A-4020 Linz.
All rights reserved. All hardware and software names used are brand names and/or trademarks of their respective manufacturers.
These documents are strictly confidential. The submission and presentation of these documents does not confer any rights to our software, our services and service outcomes, or any other protected rights. The dissemination, publication, or reproduction hereof is prohibited.
For ease of readability, gender differentiation has been waived. Corresponding terms and definitions apply within the meaning and intent of the equal treatment principle for both sexes.
Add a new index in the Indices tab with the +Add Index button. Select the desired Index Node and Client Service and in the Data Source field, specify the JiveSoftware Jive data source. Afterwards, confirm your entries by clicking the Apply button.
Additionally, you have to activate the option "Use ACL References" of the index under "Advanced Settings".
Now configure the data source.
Legend:
Crawling Root* | A URL at which a Jive sitemap is accessible. If the Mindbreeze Sitemap Generator add-on is installed on your Jive server and a sitemap has been created, enter the URL <Jive URL>/rpc/rest/mindbreeze/sitemap?jobid=full here. |
URL Regex | A regular expression that defines a pattern for the links to be indexed. |
URL Exclude Regex | A regular expression for URLs that are to be excluded from crawling. |
Skip Thumbnail Generation URL Pattern | A regular expression for URLs for which no thumbnails should be generated. |
Thumbnailer URL Exclude Pattern | A regular expression for URLs that are to be blocked when generating preview images (e.g. URLs that reload advertising). |
User Agent | The specified value is sent in the User-Agent header with each HTTP request. |
Additional Hosts File | If DNS resolution of certain web servers does not work for network reasons, you can specify their IP addresses with the "Additional Hosts File". |
Ignore Proxy | If enabled, no HTTP proxy is used, regardless of what is configured in the "Network" tab. |
Accept headers | If you want to add specific HTTP headers (for example Accept-Language), you can set them here. |
Incomplete Delta Crawl Runs | If this option is enabled, pages that are no longer accessible from the "Crawling Root" remain in the index after the crawl run. |
Enforce Matching Parent as ACL Reference | A pattern can be defined here. If a document has a parent container and its key matches the specified pattern, the ACL reference is set to the key of the parent container instead of the default ACL reference. |
Use hashing queue assignment policy | If enabled, the input URLs are distributed to parallel processing queues based on a hash. The number of processing queues can be set with the "Parallel queue count" option. If disabled, the URLs are distributed based on the hostname. |
Parallel queue count | The number of processing queues. |
Robots Honoring Policy | Determines how a robots.txt file, if present, is handled. Three options are available. |
Website Cache Directory | Path where the cache directory should be created. When this option is set, caching is enabled. |
Maximum Mirror Database Size (MB) | Maximum size of the database used for caching in MB. The default size is 512 MB. |
Use Cache Only | When enabled, documents are loaded exclusively from the cache. |
Maximum Number of Extracted Links | Maximum number of links that can be extracted from a document. The default value is 6000. |
HTTP Request Header | Here you can define HTTP headers that will be sent with each request. |
Content Signature Type | With this setting, duplicate documents can be avoided. Several methods are available for detecting duplicates and overwriting them so that they are not found multiple times. Note: For this setting to work, the Mindbreeze PostFilterTransformerPlugin SignatureToKeyRewriter must also be configured. |
Disable Diffie-Hellman ciphers | If checked, Diffie-Hellman ciphers are disabled for SSL connections. |
Support redirects in crawling roots | If the crawling root is a redirect, the redirect location is crawled. |
Ignore SSL/Certificate Errors | When enabled, HTTPS SSL or certificate errors are ignored. For security reasons, this setting may only be enabled in test systems. |
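The interplay of "URL Regex" and "URL Exclude Regex" can be illustrated with a minimal Python sketch. It is not the connector's implementation: the host jive.example.com, both patterns, and the function name should_crawl are hypothetical, and the sketch assumes that patterns must match the whole URL and that the exclude pattern takes precedence.

```python
import re

def should_crawl(url, include_regex=None, exclude_regex=None):
    """Return True if the URL passes the include pattern and is not excluded."""
    # Assumption: an empty include pattern accepts every URL.
    if include_regex and not re.fullmatch(include_regex, url):
        return False
    # Assumption: the exclude pattern wins over the include pattern.
    if exclude_regex and re.fullmatch(exclude_regex, url):
        return False
    return True

# Hypothetical patterns for a Jive instance at jive.example.com.
INCLUDE = r"https://jive\.example\.com/.*"
EXCLUDE = r".*/(login|logout)\.jspa.*"

for candidate in (
    "https://jive.example.com/docs/DOC-1",
    "https://jive.example.com/login.jspa?x=1",
    "https://other.example.com/docs/DOC-2",
):
    print(candidate, should_crawl(candidate, INCLUDE, EXCLUDE))
```

Testing candidate patterns against a handful of known URLs in this way before saving the configuration helps to catch patterns that are too broad or too narrow.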
Title | An XPath expression that sets the title to the first match. (e.g.: //h1) |
Title Element | The tag name of the title element. |
Use link text for title | A regular expression; for documents that match, the title is set from the text of the link. (e.g.: .*\.pdf) |
Content | An XPath expression that sets the content to the first match. (e.g.: //div[@class='content']) |
Content Metadata Selector | The selector for content metadata. |
Exclude Tags from Content | An XPath expression that excludes special tags from the content. |
Metadata Selector | The selector for metadata. |
Metadata Value Pattern | A regular expression for the value of the metadata. (e.g.: \W*([\w \t]*)\W*) |
URLs Excluded From Filtering | A regular expression for URLs that should be crawled but not filtered. |
Display Date Timezone | Time zone for the display date. (e.g. CET) |
Default Encoding | Encoding of the HTML documents. |
Extract Metadata | |
Exclude Documents With Matching Elements | |
Assign Metadata | This option is deprecated and should not be configured. To add additional metadata to a document, you can use Entity Recognition, CSV Transformation or Synthesized Metadata. |
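To try out XPath expressions and a "Metadata Value Pattern" before entering them, a small Python sketch such as the following may help. It uses only the standard library, whose XPath support is limited and requires relative paths (.//h1 instead of //h1). The sample page, the category element, and the helper first_text are hypothetical, and the connector's own XPath engine may behave differently.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page.
PAGE = """<html><body>
<h1>Release Notes</h1>
<h1>Second heading</h1>
<div class="nav">Home</div>
<div class="content"><p>Version 2 is <b>available</b>.</p></div>
<span class="category"> [Product News] </span>
</body></html>"""

def first_text(root, xpath):
    """Return the text of the first element matching xpath, or None."""
    element = root.find(xpath)
    if element is None:
        return None
    return "".join(element.itertext()).strip()

root = ET.fromstring(PAGE)

# "Title" and "Content": only the first match is used.
title = first_text(root, ".//h1")
content = first_text(root, ".//div[@class='content']")

# "Metadata Value Pattern": the first capture group becomes the value.
VALUE_PATTERN = re.compile(r"\W*([\w \t]*)\W*")
raw_value = first_text(root, ".//span[@class='category']")
category = VALUE_PATTERN.match(raw_value).group(1)

print(title)
print(content)
print(category)
```

The example pattern strips leading and trailing non-word characters such as brackets, so that only the inner text remains as the metadata value.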
In this section the crawl speed can be adjusted.
Memory Profile | The "InSpire" profile is the default; if required, the resource-saving "InSite" profile can be used instead. |
Number of Crawler Threads | The number of threads that crawl the web page(s) in parallel. |
Minimum Request Interval | Minimum delay in milliseconds between successive requests from the crawler. A "crawl delay" robots statement overrides this value. |
Maximum Request Interval | Maximum delay in milliseconds between successive requests from the crawler. A "crawl delay" robots statement overrides this value. |
Crawler Queue Size | Maximum number of documents in the queue that will be sent to the index. |
Mindbreeze Dispatcher Thread Count | The number of threads that send data to the index in parallel. |
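How the two request intervals and a robots "crawl delay" relate can be sketched as follows. This is only an illustration: it assumes that the crawler waits for some delay between the minimum and the maximum interval (chosen randomly here) and that a crawl delay from robots.txt is given in seconds. The function name and parameters are hypothetical.

```python
import random

def next_delay_ms(min_interval_ms, max_interval_ms, robots_crawl_delay_s=None):
    """Pick the delay before the next request, in milliseconds."""
    # A "crawl delay" robots statement overrides the configured intervals.
    if robots_crawl_delay_s is not None:
        return int(robots_crawl_delay_s * 1000)
    # Assumption: any delay between minimum and maximum is acceptable.
    return random.randint(min_interval_ms, max_interval_ms)

print(next_delay_ms(250, 1000))                          # some value from 250 to 1000
print(next_delay_ms(250, 1000, robots_crawl_delay_s=2))  # robots.txt value wins
```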
Unless the "Form Based Login" authentication method is used, a credential must be selected here. It is used for the HTTP requests during Basic Authentication and should be of the type "Username/Password".
This credential can be added and configured in the "Network" tab under "Credentials".
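For reference, Basic Authentication sends the username and password Base64-encoded in the Authorization header of each request. The following Python sketch shows how such a header is built, which can be useful when testing the credential against the sitemap URL by hand. The account name and password are placeholders.

```python
import base64

def basic_auth_header(username, password):
    """Build the Authorization header used by HTTP Basic Authentication."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return {"Authorization": "Basic " + token.decode("ascii")}

# Placeholder credentials for a hypothetical crawler account.
headers = basic_auth_header("crawler", "secret")
print(headers)
```

Because Base64 is an encoding and not encryption, such requests should only be sent over HTTPS.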
The Mindbreeze Jive Connector also supports OAuth, but due to technical limitations of the Jive API endpoints, Basic Authentication must also be configured. To configure OAuth, an add-on must be uploaded in the Jive settings. This add-on can be created here. The created add-on can be uploaded at: <Jive URL>/addon-services!input.jspa
The "Client ID" and "Secret" can then be viewed under "Action".
For OAuth authentication, a credential of type "OAuth 2" must be created, which is used for HTTP requests during OAuth authentication. The "Client ID" and the "Client Secret" are required for the credential.
The credential can then be created and configured in the "Network" tab under "Credentials".
In the "OAuth access authentication" section, select the credential that you configured earlier.
If the Jive sitemap is accessible with HTTP form-based authentication and the "Basic access authentication" method is not selected, the login parameters can be configured in the "Form Based Login" section as follows:
Login URL | The Jive URL to send the login form to, e.g. http://<jive_url>/cs_login |
Session renewer URL Pattern | Regular expression matched to the session renewal URL. |
Follow Redirects for Login Post | If enabled, "multiple-round-authentication" is supported. |
Form Elements | |
Form Password Elements | |
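A form-based login amounts to an HTTP POST of the form fields to the login URL. The Python sketch below builds such a request without sending it. The host, the field names username and password, and the function build_login_request are hypothetical; the actual field names must be taken from the login form of your Jive instance.

```python
from urllib.parse import urlencode
from urllib.request import Request

def build_login_request(login_url, form_elements, password_elements):
    """Build the POST request that submits a login form."""
    # Regular and password fields are sent together in one form body.
    fields = {**form_elements, **password_elements}
    body = urlencode(fields).encode("utf-8")
    return Request(
        login_url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )

request = build_login_request(
    "http://jive.example.com/cs_login",   # hypothetical "Login URL"
    {"username": "crawler"},              # hypothetical "Form Elements"
    {"password": "secret"},               # hypothetical "Form Password Elements"
)
print(request.get_method(), request.full_url)
print(request.data)
```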
In addition, you can define "Access Rules", which consist of the following options:
Access Check Principal | The user names can be in the format "username@domain", "domain\username" or "distinguished name". The group names can only be in the format "distinguished name". Furthermore, a reference to a capture group in the selection pattern can be used here. |
Access Check Action | "Grant" or "Deny." |
Metadata Key for Selection (e.g. url) | A metadata name, can be empty (all documents are selected). |
Selection Pattern (e.g. .*html) | A regular expression, can be empty (all documents are selected). |
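The way an access rule selects documents and derives a principal from a capture group can be sketched in Python as follows. This is an illustration only: the class AccessRule, the function apply_rule, the example URL, and in particular the $1 syntax for referencing a capture group are assumptions and may differ from the syntax the connector expects.

```python
import re
from dataclasses import dataclass

@dataclass
class AccessRule:
    principal: str               # "Access Check Principal"
    action: str                  # "Grant" or "Deny"
    metadata_key: str = ""       # empty: all documents are selected
    selection_pattern: str = ""  # empty: all documents are selected

def apply_rule(rule, metadata):
    """Return (action, principal) if the rule selects the document, else None."""
    if not rule.metadata_key or not rule.selection_pattern:
        return rule.action, rule.principal
    value = metadata.get(rule.metadata_key)
    if value is None:
        return None
    match = re.fullmatch(rule.selection_pattern, value)
    if match is None:
        return None
    # Assumption: "$1" refers to the first capture group of the pattern.
    principal = re.sub(r"\$(\d+)",
                       lambda ref: match.group(int(ref.group(1))),
                       rule.principal)
    return rule.action, principal

rule = AccessRule(
    principal="CN=$1-readers,DC=example,DC=com",
    action="Grant",
    metadata_key="url",
    selection_pattern=r"https://jive\.example\.com/groups/([^/]+)/.*",
)
document = {"url": "https://jive.example.com/groups/sales/docs/DOC-1"}
print(apply_rule(rule, document))
```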
Jive URL* | Base URL of the Jive server. |
Grant Access to Configured Principals if Key Matches | Regular expression that can be used to set static ACL grants on documents. The regex matches the document key and the entries can be configured with the next option. |
Grant Access to Principal | Specifies which grants are set when the document is matched with the previous regex. This can be used to give certain users or groups access to the documents. |
Check Tags For Update | List of tags that should be checked for new documents, one entry per line. Changes to tags that are not listed cannot be updated reliably. This option should only be used for important tags. |
The Jive Connector can also index Kaltura videos embedded in Jive. The videos must be embedded in Jive as an HTML iframe (see also Embedding Kaltura Media Players in Your Site). In Mindbreeze InSpire, the embedded videos inherit the permissions of the Jive page in which they are embedded. Please also note that a video that is embedded in several places will be found multiple times in a search.
The following options must be configured for this:
Enable | If checked, Kaltura videos will be indexed and the options below will take effect. |
Video URL Pattern | Regular expression that can be used to extract the IDs of the embedded videos from the iframe URLs (<iframe src="{URL}"></iframe>). The first capture group in the regex represents the ID of the video. For example: https?\Q://cdnapi.kaltura.com/p/9999/\E.*entry_id=([^&\/]*).* |
Kaltura URL | The URL of Kaltura, usually https://www.kaltura.com |
Secret | "Administrator Secret" (recommended) or "User Secret". Can be found in the Kaltura Management Console (KMC) under "Settings" -> "Integration Settings". |
Partner ID | Can be found in the Kaltura Management Console (KMC) under "Settings" -> "Integration Settings". |
Privileges | This field can be left empty. Otherwise, session permissions can be restricted here (see Kaltura's API Authentication and Security). |
Session Expiration | Number of seconds after which the session should expire. One day (86400 seconds) is recommended. |
Concurrent Filter and Index Dispatch Threads | The number of threads with which documents are sent from the crawler to the index. |
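The example "Video URL Pattern" can be tried out with a short Python sketch. The pattern in the table uses the \Q...\E quoting of Java-style regular expressions, which Python's re module does not support, so the sketch quotes the literal part with re.escape instead. The iframe URL and the entry ID 1_abc123 are made up for illustration.

```python
import re

# Python equivalent of: https?\Q://cdnapi.kaltura.com/p/9999/\E.*entry_id=([^&\/]*).*
VIDEO_URL_PATTERN = re.compile(
    r"https?"
    + re.escape("://cdnapi.kaltura.com/p/9999/")  # literal part, was \Q...\E
    + r".*entry_id=([^&/]*).*"
)

def extract_entry_id(iframe_src):
    """Return the video ID (first capture group), or None if there is no match."""
    match = VIDEO_URL_PATTERN.fullmatch(iframe_src)
    return match.group(1) if match else None

# Hypothetical iframe URL.
src = ("https://cdnapi.kaltura.com/p/9999/sp/999900/embedIframeJs/uiconf_id/1"
       "/partner_id/9999?iframeembed=true&entry_id=1_abc123&flashvars[x]=y")
print(extract_entry_id(src))
```

The capture group stops at the next & or /, so any query parameters that follow entry_id do not become part of the ID.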
The Jive connector also requires a caching principal resolution service to resolve permissions.
To create it, scroll to the Services section in the Indices tab and add a new service using the +Add Service button. Then select CachingJivePrincipalResolution from the Service dropdown.
Jive Server URL | Base URL of the Jive server. |
User Agent | The specified value is sent in the user agent header during HTTP requests. |
Read Timeout (Minutes) | Defines the read timeout for outgoing connections. |
Connect Timeout (Minutes) | Defines the connect timeout for outgoing connections. |
Jive Guest Access enabled | If Jive allows access for non-logged-in users, please enable this option. |
Groups Containing All Users | This option can be used to define groups so that all users are treated as if they are members of these groups. |
Keep Groups Containing All Users in Memory | By enabling this option, such groups are kept in RAM until the next cache update. |
Identity Encryption Credential | With this option, the user identity can be displayed in encrypted form in app.telemetry. |
Cache In Memory Items Size | Number of items stored in the cache. The appropriate value depends on the available memory of the JVM. |
Database Directory Path | The directory path for the cache. Example: /data/principal_resolution_cache. With a Mindbreeze Enterprise product, a path must be set. With a Mindbreeze InSpire product, the path does not need to be set. |
Cache Update Interval (Minutes) | This option determines (in minutes) when the cache should be refreshed. (Default value: 60 minutes) Values below 0 disable the cache update. When starting the service, the last (persisted) cache update time is taken into account. This means that the cache is not necessarily updated when the service is stopped and started, for example, but only at the next time interval. |
Clean Cache Update Schedule | In this field you can configure cache cleanup and update at specific times using Extended Cron Expressions (documentation and examples of Cron Expressions can be found here). |
Backup cache before cleaning | If this option is selected, a copy of the cache is created in the /data/currentservice/<service name>/temp directory. |
Retry Update Cache Run If Was Incomplete In (Minutes) | This option determines (in minutes) when the cache should perform a new update process if an update was incomplete. Values below 0 disable the cache retry update. |
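The effect of the persisted update time on "Cache Update Interval (Minutes)" can be sketched as follows. This is a simplified model with hypothetical function names, not the service's implementation: the next update is due one interval after the last persisted update, regardless of when the service was restarted, and negative intervals disable updates.

```python
from datetime import datetime, timedelta

def next_cache_update(last_update, interval_minutes):
    """Return when the next update is due, or None if updates are disabled."""
    if interval_minutes < 0:
        return None  # values below 0 disable the cache update
    return last_update + timedelta(minutes=interval_minutes)

def update_due(now, last_update, interval_minutes):
    """Return True if an update should run at the time `now`."""
    due = next_cache_update(last_update, interval_minutes)
    return due is not None and now >= due

last = datetime(2024, 1, 1, 12, 0)     # persisted time of the last update
restart = datetime(2024, 1, 1, 12, 30)  # service restarted 30 minutes later

print(next_cache_update(last, 60))
print(update_due(restart, last, 60))
```

In this model, a restart 30 minutes after the last update does not trigger a refresh; the cache is refreshed only once the full interval has elapsed.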
These configuration options are described in the Caching Principal Resolution Service documentation.
Use Parent Principals Cache Service | If this option is enabled, additional groups of the user are resolved and delivered in another cache (Parent Cache). |
Parent Principals Cache Service Port | The port used for the "Use Parent Principals Cache Service" option if enabled. |
Parent Cache Principals Include Patterns | If empty, all parent cache principals are included, otherwise a parent principal must match at least one pattern (case-insensitive) to be included. |
Parent Cache Principals Exclude Patterns | Parent cache principals that match at least one pattern line (case-insensitive) are excluded. "exclude patterns" have priority over "include patterns". |
Parent Principals Are Unique IDs | If enabled, the unique IDs of the parent principals will be resolved if they are not unique IDs. |
Webservice port | The service is available on the specified port. If multiple Principal Resolution Services are configured, make sure that they have different "Webservice Port" parameters and that they are available. |
Lowercase Principals | With this option, all principals supplied by the cache are converted to lowercase. |
Case Insensitive Member Resolution | This option determines whether users are checked regardless of their capitalization. |
Suppress JIVE Service Calls | If this option is not enabled, a request is sent directly to Jive when users cannot be resolved during a search query. |
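The combined effect of the parent cache include and exclude patterns and the "Lowercase Principals" option can be sketched in Python. The sketch assumes that each pattern line is a regular expression that must match the whole principal name; the group names and the function filter_parent_principals are hypothetical.

```python
import re

def filter_parent_principals(principals, include_patterns=(),
                             exclude_patterns=(), lowercase=False):
    """Apply include/exclude patterns, optionally lowercasing the result."""
    def matches_any(value, patterns):
        # Patterns are matched case-insensitively.
        return any(re.fullmatch(p, value, re.IGNORECASE) for p in patterns)

    kept = []
    for principal in principals:
        # Exclude patterns have priority over include patterns.
        if matches_any(principal, exclude_patterns):
            continue
        # An empty include list means that all principals are included.
        if include_patterns and not matches_any(principal, include_patterns):
            continue
        kept.append(principal.lower() if lowercase else principal)
    return kept

groups = ["Sales-EU", "sales-us", "HR-All", "Sales-Archive"]
print(filter_parent_principals(groups,
                               include_patterns=["sales-.*"],
                               exclude_patterns=[".*-archive"],
                               lowercase=True))
```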
Hint: To test the Caching Principal Resolution Service, you can use the Principal Resolution Service REST API.