To open the Crawl summary, click the name of an existing crawl on the Site overview page. The Crawl summary displays the crawl's current configuration settings.
Each option has associated help text. Hover over the red question mark to display it.
Here is a summary of what each item signifies:
- Name: The name of the crawl.
- URL: The designated starting point for the crawl.
- Alias: To limit the scope of a crawl to a specific section of a site, enter a crawl alias. For example, suppose a crawl starts at the URL http://example.com/news/listing.html and should only index pages within the news section. If each news item has a URL of the form http://example.com/news/news_item_name.html, set the alias to http://example.com/news/. The crawler will not follow links to pages that do not match this URL pattern.
- Interval: How often the crawl is performed. The interval can range from 1 minute to several years.
- Crawl level: The scope of the crawl. A level 1 crawl follows the links on the start page entered in the URL field above and indexes all the pages one level below it. A level 2 crawl indexes one level deeper, and so on. A Complete crawl indexes all pages that can be reached by following links from the designated start page, at any depth.
- Active: Whether the crawl is active.
- Index first page: Whether the content of the start page entered in the URL field above should be included in the search index, or whether the start page simply refers to content pages. This option may apply to index or directory pages (such as news front pages, A-Z pages, or Sitemap pages) that do not contain content that should be searchable but are useful as starting points for a crawl.
- Is this a Delete crawl?: Whether this crawl is of the Delete type. You can manage the content of the search index by creating a page with links to pages that should be removed from the search index. This method has some advantages:
  - It reduces the need for complete site indexing, which reduces the demands on your web server resources.
  - The page that contains links to pages that should be removed can be generated using your Content Management System, making day-to-day management simple.
- Is this an Update crawl?: Whether this crawl is of the Update type. This type of crawl uses the same method as a Delete crawl: the URL field above points to an index page containing a list of links to pages that have been updated since the last crawl. This index page can be generated using your content management system, reducing the need for complete crawls.
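To make the interaction between Alias, Crawl level, and Index first page more concrete, here is a minimal Python sketch. It is not the crawler's actual implementation: it assumes the alias is applied as a simple URL prefix match, and the function names and the `links` dictionary (a stand-in for the links found on each page) are hypothetical.

```python
from collections import deque


def in_scope(url, alias):
    """Assumption: the alias acts as a plain URL prefix; an empty alias allows everything."""
    return not alias or url.startswith(alias)


def pages_to_index(links, start_url, level=None, alias="", index_first_page=True):
    """Walk a link graph breadth-first, limited by crawl level and alias.

    links: hypothetical dict mapping each URL to the URLs it links to.
    level: maximum link depth below the start page; None stands for a Complete crawl.
    """
    seen = {start_url}
    queue = deque([(start_url, 0)])
    indexed = []
    while queue:
        url, depth = queue.popleft()
        # The start page (depth 0) is indexed only if "Index first page" is on.
        if depth > 0 or index_first_page:
            indexed.append(url)
        # Stop following links once the configured crawl level is reached.
        if level is not None and depth >= level:
            continue
        for target in links.get(url, []):
            if target not in seen and in_scope(target, alias):
                seen.add(target)
                queue.append((target, depth + 1))
    return indexed
```

With a start page at http://example.com/news/listing.html, an alias of http://example.com/news/, level 1, and Index first page switched off, this sketch would return only the news items linked directly from the listing page. Links that lead outside the news section are never followed.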
Actions Associated with Crawls
The buttons below the Crawl summary let you access configuration options, delete the crawl, or start a crawl manually.
Two types of crawl can be started manually using these buttons:
- Start Crawl: Starts a regular crawl. A page is re-indexed only if its MD5 key (a unique page identifier that carries a checksum of all the characters the document contains) has changed. This updates the search index using a “soft-entry” approach, minimizing the load on web servers.
- Re-crawl Site: Starts a complete update of all pages on your site, regardless of whether the MD5 key has changed.
This is the option used when changes are made to the way in which SearchImprove interprets information from the pages it crawls. For example, if a new Capture tag is set up, SearchImprove needs to index pages from scratch because, although page content (and the MD5 key) is unchanged, SearchImprove captures different data from that page.
This is true for Capture tags, Body tags, and Meta tags: when changes are made to the data that SearchImprove should capture, the search index must be updated completely for the changes to take effect. Find out how to set up Capture tags, Body tags, and Meta tags on the Configure site help screen.
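The difference between the two manual crawl types can be illustrated with a short Python sketch. It is only an approximation under stated assumptions: the MD5 key is taken to be a plain MD5 digest of the page text, and `md5_key`, `crawl`, and the `stored_keys` dictionary are hypothetical names, not part of the product.

```python
import hashlib


def md5_key(content):
    """Assumption: the MD5 key is the MD5 digest of the page's characters."""
    return hashlib.md5(content.encode("utf-8")).hexdigest()


def crawl(pages, stored_keys, recrawl=False):
    """Return the URLs that get (re)indexed in this run.

    pages: hypothetical dict mapping URL to current page content.
    stored_keys: dict mapping URL to the MD5 key from the previous crawl;
                 it is updated in place.
    recrawl: False mimics Start Crawl, True mimics Re-crawl Site.
    """
    reindexed = []
    for url, content in pages.items():
        key = md5_key(content)
        # Start Crawl skips pages whose key is unchanged; Re-crawl Site never skips.
        if recrawl or stored_keys.get(url) != key:
            stored_keys[url] = key
            reindexed.append(url)
    return reindexed
```

In this sketch, a second regular crawl over unchanged pages re-indexes nothing, while a call with `recrawl=True` processes every page again. That is why a full re-crawl is needed after changing what is captured from a page: the page content, and therefore its key, stays the same.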