How to set up a crawler

A crawler scans a website for pages and adds them to a search index, which a search engine then uses.

Setting up a Crawler

Boundaries

  1. In the navigation, select Configuration > Crawlers.
  2. Click the New button at the top of the table.
  3. Give the crawler a name in the Name field.
  4. Insert a URL in the Websites field.
  5. Click the + icon to add the URL.
  6. Optional: Open More options and insert a URL in the Page Exceptions field. Pages listed here are crawled and indexed even if they don’t match the domain entered. Click the + icon to add the URL.
  7. Optional: Open More options and insert the URL of a sitemap in the Sitemaps field. Click the + icon to add the URL. Note that the crawler automatically detects any sitemaps declared in the robots.txt file or located at /sitemap.xml under the root of the domain.
  8. Optional: Under Excluded from the crawl:
    • Insert part of a URL or a URL parameter in the Pages field. Click the + icon to add the pattern.
    • Open More options to insert a regular expression in the URL regex field. Click the + icon to add the expression.
  9. Click Next: Structure at the bottom right (only displayed for a new crawler), or click the Structure tab at the top.
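The exclusion rules in step 8 work like URL filters: a page is skipped if any pattern matches its URL. The product's exact regex dialect isn't documented here, so the following is a minimal Python sketch with hypothetical patterns, only to illustrate how such exclusions typically behave.

```python
import re

# Hypothetical exclusion patterns; the crawler's regex dialect may differ.
EXCLUDE_PATTERNS = [
    re.compile(r"/login"),        # matches part of a URL path
    re.compile(r"\?page=\d+$"),   # matches a URL parameter
]

def should_crawl(url: str) -> bool:
    """Return False if any exclusion pattern matches the URL."""
    return not any(p.search(url) for p in EXCLUDE_PATTERNS)

print(should_crawl("https://example.com/docs"))          # True
print(should_crawl("https://example.com/login"))         # False
print(should_crawl("https://example.com/list?page=2"))   # False
```

Because patterns match anywhere in the URL, a short fragment like `/login` can exclude many pages at once; use a more specific expression when you only want to exclude a single page.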

Structure

  1. Set the Language to the language of the website.
  2. Optional: Open More options to change the Content type. The default is Web pages; set it to People directory only if the crawler is set up strictly to index people profiles.
  3. Update the default fields as needed to control exactly what content is picked up for each field.
  4. Under Page Fields and File fields, click on Add custom field + to add additional fields for the crawler to pick up.
    • Type the name of the field in the Field name field.
    • Optional: Enable the Field required toggle to only index pages that contain this field. If the crawler does not find any value for a required field, the page will not be added to the index.
    • Click on Add source +.
      • Select the type of source under Type.
      • Fill out the Value field and any other required field for the field type.
      • Optional: Select the correct data type in the Data type drop down.
      • Optional: Write a default value in the Default value field. If the crawler doesn’t find any value for the field, the page will instead be assigned this value. This is especially useful when the field is set to be required to ensure all pages will still be indexed.
    • Optional: Click Add source + again to add a fallback source, so the field value can be found in more than one way.
    • Click Apply.
  5. Click Save Crawler.
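Steps 4's source, fallback, required, and default-value options combine into one resolution rule: try each source in order, then the default, and drop the page if a required field still has no value. The crawler's internals aren't shown here; this is a minimal Python sketch, assuming a field sourced from hypothetical meta tags, only to illustrate that resolution order.

```python
from html.parser import HTMLParser

class MetaCollector(HTMLParser):
    """Collect <meta name="..." content="..."> values from a page."""
    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            a = dict(attrs)
            if "name" in a and "content" in a:
                self.meta[a["name"]] = a["content"]

def resolve_field(html, sources, default=None, required=False):
    """Resolve a custom field: first matching source wins, then the
    default; a required field with no value returns None (page skipped)."""
    parser = MetaCollector()
    parser.feed(html)
    for name in sources:            # primary source, then fallbacks
        value = parser.meta.get(name)
        if value:
            return value
    if default is not None:
        return default
    return None                     # if required, the page is not indexed

page = '<html><head><meta name="author" content="Jane Doe"></head></html>'
print(resolve_field(page, ["author", "dc.creator"]))         # Jane Doe
print(resolve_field(page, ["category"], default="General"))  # General
```

This is why a default value pairs well with the Field required toggle, as noted in the steps above: the default guarantees every page still resolves a value, so no page is dropped from the index.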
Example: a crawler with 5 required and 2 optional page fields.