The site configuration procedure consists of 7 steps, displayed in tabs. It is reached by clicking the Configure button on the Overview screen.
Site
This step allows you to set up the basics of your search facility. Each option has an associated help text. Hover your mouse over the red question mark to display help texts.
Here is a summary of the options:
- Name: This displays the name of your site.
- URL: This is the Uniform Resource Locator for your site.
- Active: This determines whether the search facility for your site should be active or inactive.
- Index IFrames: This determines whether content contained within IFrames should be indexed. IFrame content is appended to the page on which it occurs, and links within IFrames are not followed.
Click Update to save any changes that you make.
Alias
If your site spans several root domains, the Alias step allows you to include domains other than the one entered in the URL field in the previous step, Site.
For example, you may use the domain http://www.example.com/ for the bulk of your site’s content, and the domain http://news.example.com/ for news.
Pages in the news section will not automatically be included in the search, because they are placed under a different root domain. To include these pages, set up an alias, for example with the value .example.com/.
This alias will include all domains that contain this value, so http://publications.example.com/ will also be included.
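The matching described above can be pictured as a simple substring test. The following Python sketch is illustrative only: the function name matches_alias is hypothetical, and it assumes that an alias matches whenever its value occurs anywhere in the URL, as the text above describes. The matching that SearchImprove actually performs may differ in detail.

```python
def matches_alias(url: str, alias: str) -> bool:
    """Return True if the alias value occurs anywhere in the URL (assumed rule)."""
    return alias in url


urls = [
    "http://www.example.com/",
    "http://news.example.com/",
    "http://publications.example.com/",
    "http://www.example.org/",
]

# Keep only the URLs that the alias .example.com/ would cover.
included = [u for u in urls if matches_alias(u, ".example.com/")]
print(included)
```

Under this assumption, the three example.com domains are covered by the single alias, while example.org is not.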
Enter an alias in the field and click Create. Existing aliases are displayed below the Create button. Click Delete to remove an alias.
Your site must be crawled again for the changes to take effect. This is done by running a complete crawl on your site, from the Crawl summary page.
Exclusions
If you do not want the crawler to retrieve and index certain pages or links, you can create Exclusions to reject them here. You do not need to enter the complete URL; a part of it is enough.
For example, your site may use printer-friendly pages, which essentially contain a copy of the original page content. These pages may be identified by the parameter printerfriendly=yes. To exclude such pages, it is sufficient to create an exclusion for printerfriendly.
Enter the part of the URL that is common to the pages that you want to exclude, and click Create. Existing exclusions are displayed below the Create button. Click Delete to remove an exclusion.
Exclusions take effect immediately after the Create button has been clicked.
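As a rough illustration of partial URL matching, the Python sketch below treats each exclusion as a plain substring of the URL. The function name is_excluded and the sample URLs are hypothetical, and the substring rule is an assumption based on the description above rather than a statement of how the crawler is implemented.

```python
def is_excluded(url: str, exclusions: list[str]) -> bool:
    """Return True if any exclusion value occurs anywhere in the URL (assumed rule)."""
    return any(part in url for part in exclusions)


exclusions = ["printerfriendly"]

normal_page = "http://www.example.com/article?id=7"
printer_page = "http://www.example.com/article?id=7&printerfriendly=yes"

print(is_excluded(normal_page, exclusions))
print(is_excluded(printer_page, exclusions))
```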
Meta tag
Your site may contain categorizing or descriptive information in meta tags, or it may use meta tags to identify the author or the date of last modification. The Meta tag setup step allows you to determine which meta tags SearchImprove should recognize, and to set up options for each meta tag.
Each option has an associated help text. Hover your mouse over the red question mark to display help texts.
Here is a summary of the function that each option serves:
Meta tag name:
This determines the name that identifies the meta tags that SearchImprove should recognize. The example below displays meta tags with the names last-modified, description and keywords. Note that only the contents within the quotation marks are entered in the Meta tag name field.
<meta name='last-modified' content='2005-12-20'>
<meta name='description' content='This page contains help with the setup, maintenance and use of SearchImprove.'>
<meta name='keywords' content='Support; SearchImprove; Help'>
Include in XML:
Metadata that is included in the XML output is available to use in the search results presentation. Tick this box to create a tag in the XML output with this meta tag name, containing the values within it. In the example above, the XML output for the meta tag named description would look like this:
<META>
<title>
<![CDATA[
BREADCRUMB
]]>
</title>
<text>
<![CDATA[
This page contains help with the setup, maintenance and use of SearchImprove.
]]>
</text>
<raw>
<![CDATA[
This page contains help with the setup, maintenance and use of SearchImprove.
]]>
</raw>
<searchdata>
<![CDATA[
This page contains help with the setup, maintenance and use of SearchImprove.
]]>
</searchdata>
</META>
Further information about these three versions of the metadata is available in the Tag reference section.
Searchable:
Tick this box to include the values within this meta tag in the search index. Tags must be made searchable for them to appear in the Metadata tab in the Search ranking setup procedure.
Display in groups:
Tick this box to display the values contained in this meta tag in the groups setup process, under the Metadata tab. This allows you to gather pages that have the same value within this meta tag in groups. This would typically be applicable for meta tags such as Author, Keywords, Category, or other custom meta tags.
Separator:
If this meta tag can contain multiple values, enter the character that separates the values. In this instance, the separator would be ; (semicolon):
<meta name='keywords' content='Support; SearchImprove; Help'>
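The effect of a separator can be sketched in Python as follows. The function name split_meta_values is hypothetical, and trimming the whitespace around each value is an assumption; the manual does not state how SearchImprove treats surrounding spaces.

```python
def split_meta_values(content: str, separator: str) -> list[str]:
    """Split a meta tag's content into individual values.

    Whitespace around each value is trimmed and empty values are dropped
    (both are assumptions made for this sketch).
    """
    if not separator:
        return [content.strip()]
    return [value.strip() for value in content.split(separator) if value.strip()]


print(split_meta_values("Support; SearchImprove; Help", ";"))
```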
Create group:
Tick this box to automatically create a group for each value found under this meta tag. For example, if a meta tag with the name Author is used to identify the author of each page, a group will be created for each value found under that meta tag; i.e. a group for each author.
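Conceptually, automatic group creation collects pages by the value found in the chosen meta tag. The Python sketch below illustrates the idea with hypothetical page data and a hypothetical function name, groups_from_meta; it is not a description of SearchImprove's internals.

```python
from collections import defaultdict


def groups_from_meta(pages: dict[str, dict[str, str]], tag_name: str) -> dict[str, list[str]]:
    """Create one group per distinct value of the given meta tag."""
    groups = defaultdict(list)
    for url, meta in pages.items():
        value = meta.get(tag_name)
        if value:  # pages without the meta tag are not placed in any group
            groups[value].append(url)
    return dict(groups)


# Hypothetical pages and their meta tag values.
pages = {
    "http://www.example.com/a.html": {"Author": "Jane"},
    "http://www.example.com/b.html": {"Author": "John"},
    "http://www.example.com/c.html": {"Author": "Jane"},
    "http://www.example.com/d.html": {},
}

print(groups_from_meta(pages, "Author"))
```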
The Meta tag setup step allows you to create recognition for as many meta tags as you require. Click Create to add a meta tag. Existing meta tags are displayed in a list below the Create button. Click Delete to remove a meta tag.
Group Type:
Groups created from metadata can be used for rudimentary access control, so that users who have privileges to view some but not all content only search through documents that they have access to. When the user performs a search, the ID or name of the group of pages that the user has access to can be passed to SearchImprove. The results page will only contain pages within that group; i.e. pages that the user has privileges to view.
Body tag
The Body tag section allows you to identify sections within the body of the HTML document, delimited by certain tags. These tags may be basic HTML tags, such as:
<h1>; <p>; <a>; <strong>; etc.
They may also be HTML comment tags or parts of tags, such as:
<!-- Comments -->; class="content">; etc.
For example, the section of each page on your site that contains an abstract of the page content may be identified by the following tag:
<div class="abstract">
<p>This section contains an introductory passage. Search term occurrence here reflects high document relevancy.</p>
</div>
To identify this class as a container of this page’s abstract, enter <div class="abstract"> as start tag. Enter </div> as the end tag.
The tag can be named as required; in this instance we will name the tag Abstract.
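Plain string matching between a start tag and an end tag can be sketched in Python as shown below. The function name extract_section is hypothetical, and the sketch assumes that the section ends at the first occurrence of the end tag after the start tag, so nested tags of the same type are not handled.

```python
from typing import Optional


def extract_section(html: str, start_tag: str, end_tag: str) -> Optional[str]:
    """Return the text between start_tag and the next end_tag, or None if not found."""
    start = html.find(start_tag)
    if start == -1:
        return None
    start += len(start_tag)
    end = html.find(end_tag, start)
    if end == -1:
        return None
    return html[start:end].strip()


page = (
    '<body><div class="abstract">\n'
    "<p>This section contains an introductory passage.</p>\n"
    "</div><p>Remaining page content.</p></body>"
)

# The section that would be named Abstract in this example.
abstract = extract_section(page, '<div class="abstract">', "</div>")
print(abstract)
```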
Body tags support regular expression matching, for cases when it is not possible to capture the required content with normal string matching.
In order to use regular expression matching, the capture pattern must be prefixed with the string regexp:, as follows:
regexp:</a>\s*</div>
The above regular expression matches a closing <a> tag, followed by zero or more whitespace characters, followed by a closing <div> tag.
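To show how the regexp: prefix might distinguish the two matching modes, here is a small Python sketch. The function name compile_pattern is hypothetical, and Python's re module is used in place of Perl regular expressions; the two dialects differ in places, but this particular pattern behaves the same in both.

```python
import re

PREFIX = "regexp:"


def compile_pattern(capture_pattern: str) -> re.Pattern:
    """Compile a capture pattern, honouring the regexp: prefix.

    Patterns without the prefix are escaped so that they match literally.
    """
    if capture_pattern.startswith(PREFIX):
        return re.compile(capture_pattern[len(PREFIX):])
    return re.compile(re.escape(capture_pattern))


pattern = compile_pattern(r"regexp:</a>\s*</div>")

# Matches: closing a tag, whitespace, closing div tag.
print(bool(pattern.search("<a href='#'>More</a>\n  </div>")))
# Does not match: other markup sits between the two closing tags.
print(bool(pattern.search("</a><p></p></div>")))
```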
Complete details on how to use regular expressions cannot be provided in this manual; please refer to the Perl regular expressions documentation or one of many tutorials available online.
Note that this procedure does not affect search index or search results in any way; this simply identifies and names a certain part of the Body. When sections of the page body have been identified by a start tag and an end tag, these sections can be assigned different ranking scores on the Search ranking setup page, under the Body tags tab.
As with metadata, body tags can be made available for use in search results presentation by ticking the Include in XML checkbox.
File type
This step allows you to configure options for file types other than HTML on your site.
Here is a summary of the options that are available:
File type:
By default, four file types are selectable: Adobe PDF, and Microsoft Office (PowerPoint, Excel, Word). These are included because they are the most widely used formats for documents, presentations and spreadsheets. However, we provide support for numerous proprietary and open source formats if your organization has special requirements. Please enquire for further information.
A list of all files of each type on your site is displayed when you click the file type.
Include:
Tick this box to include documents of this type in the search.
Ranking:
PDF and Word documents often contain greater quantities of text than HTML pages and, as such, they are likely to contain many more instances of search terms than HTML pages. For this reason, you may find it necessary to adjust ranking scores manually for some document types, assigning a ranking from -5000 to 5000 to promote or demote these file types.
Use file name as title:
Microsoft Office and Adobe PDF offer the option of adding metadata, such as a title, to a document’s properties. However, this option is not used very consistently. By default, SearchImprove will look for such a title first. If no title is found, SearchImprove will use the file name as the document title instead.
Tick this box to use the file name instead of the meta title for all files of this type. This ensures that documents are compared for relevancy on even terms.
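The title selection described above amounts to a short fallback rule, sketched here in Python. The function name document_title and its parameters are hypothetical, and whether SearchImprove keeps the file extension in the title is an assumption made for this sketch.

```python
import posixpath
from typing import Optional
from urllib.parse import urlparse


def document_title(url: str, meta_title: Optional[str], use_file_name_as_title: bool) -> str:
    """Pick a document title from the metadata title or the file name."""
    file_name = posixpath.basename(urlparse(url).path)
    # Use the file name when the box is ticked, or when no metadata title exists.
    if use_file_name_as_title or not meta_title:
        return file_name
    return meta_title


url = "http://example.com/documents/online_form.pdf"
print(document_title(url, "Online form", False))
print(document_title(url, None, False))
print(document_title(url, "Online form", True))
```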
Inherit group association:
Groups of pages are often created from patterns in a body tag (e.g. the breadcrumb), or in the URL. In many cases, non-HTML document types cannot be included in page groups in this manner.
If this box is ticked, the documents will inherit the group association of the referring page.
For example, a page containing an online form located at http://example.com/forms/online_form.html might link to a PDF version of the form, which resides in a generic document repository, http://example.com/documents/online_form.pdf. A page group named Forms has been created with the URL match /forms/. This URL pattern ensures that the HTML version is included in the group, but not the PDF version.
By enabling group inheritance for PDF files, the PDF form is included in the Forms group, because its referring page belongs to that group.
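The Forms example can be traced with the following Python sketch. The function names and the dictionary of URL matches are hypothetical, and group membership is modelled as a plain substring test on the URL, which is an assumption for illustration only.

```python
from typing import Optional


def groups_for(url: str, url_matches: dict[str, str]) -> set[str]:
    """Return the names of all groups whose URL match occurs in the URL."""
    return {name for name, pattern in url_matches.items() if pattern in url}


def groups_with_inheritance(
    url: str, referrer: Optional[str], url_matches: dict[str, str], inherit: bool
) -> set[str]:
    """Return the document's own groups, plus the referring page's groups if inherited."""
    groups = groups_for(url, url_matches)
    if inherit and referrer is not None:
        groups |= groups_for(referrer, url_matches)
    return groups


url_matches = {"Forms": "/forms/"}
html_page = "http://example.com/forms/online_form.html"
pdf_file = "http://example.com/documents/online_form.pdf"

print(groups_with_inheritance(pdf_file, html_page, url_matches, inherit=False))
print(groups_with_inheritance(pdf_file, html_page, url_matches, inherit=True))
```

Without inheritance the PDF belongs to no group; with inheritance it joins Forms through its referring page.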
Click Update to save your changes.
Capture tags
Capture tags allow you to limit the search index to include only page content, and not recurring elements such as menus, links, adverts, or anything else that has no direct relation to the part of the page that has relevancy to users.
There are at least two advantages to identifying and focusing on page content:
- Faster search: The smaller the search index, the less time it takes for a search query to be processed. By removing irrelevant page content, the search index can be reduced substantially.
- Greater search accuracy: Different pages use different page templates. Some templates may contain the user’s search query, where others do not. By removing template content from every page, page relevancy is determined exclusively by direct comparison of page content, and is not influenced by menus, links, or other template content that have no bearing on page content.
Capture tags function in the same way as Body tags: you identify a start tag and an end tag that enclose a certain type of content. The enclosed content can be either included in or excluded from the search index. Select which under Type.
Click Create to add a capture tag.
Sequence:
You may add as many capture tags as you require, of either type – include or exclude. If multiple capture tags exist, a sequence must be entered.
The crawler will start by executing the inclusion or exclusion that is labelled 1 in the Sequence column.
If this tag is not found on a page, the crawler will automatically proceed to the inclusion or exclusion labelled 2.
If the tag is found, the crawler will perform indexing, and then stop, unless the Continue box has been ticked.
The crawler will continue until it reaches a capture tag for which the Continue box is not ticked.
The sequence in which the tags are processed can be set by entering a number in the field in the Sequence column. Click Update in order to save your changes.
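The sequence rules above can be summarised in a short Python sketch. Everything named here (CaptureTag, keep_going, apply_capture_tags) is hypothetical, and two points are assumptions rather than documented behaviour: tags are matched by plain string search, and each tag is applied to the content left over by the previous one.

```python
from dataclasses import dataclass


@dataclass
class CaptureTag:
    sequence: int
    start_tag: str
    end_tag: str
    kind: str                 # "include" or "exclude"
    keep_going: bool = False  # stands in for the Continue box


def apply_capture_tags(html: str, tags: list[CaptureTag]) -> str:
    """Apply capture tags in sequence order and return the content to index."""
    content = html
    for tag in sorted(tags, key=lambda t: t.sequence):
        start = content.find(tag.start_tag)
        if start == -1:
            continue  # tag not found: proceed to the next one in the sequence
        inner_start = start + len(tag.start_tag)
        end = content.find(tag.end_tag, inner_start)
        if end == -1:
            continue  # no matching end tag: treated as not found
        if tag.kind == "include":
            content = content[inner_start:end]
        else:
            content = content[:start] + content[end + len(tag.end_tag):]
        if not tag.keep_going:
            break  # tag found and Continue not ticked: stop here
    return content


page = (
    '<div id="menu">Home | News</div>'
    '<div id="main">Welcome <span class="ad">Buy now</span>to our site</div>'
)
tags = [
    CaptureTag(1, '<div id="main">', "</div>", "include", keep_going=True),
    CaptureTag(2, '<span class="ad">', "</span>", "exclude"),
]
print(apply_capture_tags(page, tags))
```

In this example the first tag keeps only the main section and, because Continue is ticked, the second tag then removes the advert from it.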