Best practice for crawlers

Setting up a crawler is a required step in configuring a functional search engine. Consider both how each crawler is configured and how many crawlers you need.

Language

Crawlers are language-specific, so a site that supports multiple languages should have one crawler per language. Remember to set the correct language in the crawler settings, since this determines the language in which the pages are analyzed (read more about language support).

Fields

Title and Description

Title and Description are both default and required fields. These are also the most important fields when it comes to good relevance in search.

Title

For Title, choose the source that best fits the page structure of your site. In most cases every page has an H1, so you can use the First H1 option as the search result title. You can also use the <title> tag or metadata such as og:title.

Description

For Description, keep in mind that the fields configured in the crawler determine what is searchable. Make sure that what you configure for this field reflects the content of the page.

It would not be a good idea to set the Description to fetch only your meta description, since the actual page content would then not be searchable. If you would like to display the meta description in the results, you can set the Description field to fetch it, but you should then add a custom field that grabs the entire page content so that the content remains searchable.

Also make sure that the description doesn’t go too broad, for example by fetching the entire <body> element and thereby grabbing navigation or footer elements that are present on all pages. If it did, all of your pages would appear as results whenever a visitor searched for something that’s present in the navigation or footer, which decreases relevance.
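To illustrate why an overly broad description hurts relevance, here is a minimal Python sketch. It does not use the crawler itself: the two pages, the helper names and the naive substring matching are all assumptions made for the example, and the pages are written as well-formed markup so the standard library parser can read them.

```python
import xml.etree.ElementTree as ET

# Two hypothetical pages that share the same navigation and footer.
PAGES = {
    "/pricing": "<html><body><nav>Home Contact</nav>"
                "<main>Plans start at ten dollars.</main>"
                "<footer>Contact us</footer></body></html>",
    "/about": "<html><body><nav>Home Contact</nav>"
              "<main>We build search tools.</main>"
              "<footer>Contact us</footer></body></html>",
}

def extract_text(html, xpath):
    """Return the whitespace-normalized text of the first node matching xpath."""
    node = ET.fromstring(html).find(xpath)
    if node is None:
        return ""
    return " ".join(" ".join(node.itertext()).split())

def matching_pages(term, xpath):
    """List the pages whose extracted description contains the term."""
    term = term.lower()
    return sorted(
        url for url, html in PAGES.items()
        if term in extract_text(html, xpath).lower()
    )

# Description taken from the whole <body>: every page matches a navigation word.
print(matching_pages("contact", ".//body"))  # ['/about', '/pricing']
# Description taken from the main content only: no false matches.
print(matching_pages("contact", ".//main"))  # []
print(matching_pages("plans", ".//main"))    # ['/pricing']
```

Narrowing the selection to the main content element keeps shared navigation and footer text out of the searchable description.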

Custom fields

When setting up a crawler, consider not only which fields should be displayed, but also which fields should be searchable.

For example, you may not want to display your meta keywords in the result itself, but it would still be a good idea to set them up as a crawled field to make sure they’re searchable.

The fields you define in the crawler can also be used for boosting. Continuing the example above, you could later apply boosting to the meta keywords field, ensuring that results with the search term in their meta keywords are ranked higher.
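The effect of boosting a field can be sketched with a toy scoring function. This is only an illustration of the idea, not how the engine actually scores results: the documents, the field names and the weight of 3.0 are hypothetical.

```python
# Hypothetical crawled documents, keyed by URL, with three crawled fields each.
DOCS = {
    "/guide": {
        "title": "Crawler guide",
        "description": "How to set up a crawler.",
        "keywords": "setup, onboarding",
    },
    "/faq": {
        "title": "FAQ",
        "description": "Common questions.",
        "keywords": "crawler, indexing",
    },
}

def score(fields, term, boosts):
    """Add up the weight of every field containing the term (default weight 1.0)."""
    term = term.lower()
    return sum(
        boosts.get(name, 1.0)
        for name, text in fields.items()
        if term in text.lower()
    )

def rank(term, boosts):
    """Return URLs ordered from highest to lowest score."""
    return sorted(DOCS, key=lambda url: -score(DOCS[url], term, boosts))

print(rank("crawler", {}))                 # ['/guide', '/faq']
print(rank("crawler", {"keywords": 3.0}))  # ['/faq', '/guide']
```

With no boost, the page matching in two fields ranks first. Once the keywords field is weighted more heavily, the page that has the term in its meta keywords moves to the top.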

Using fallbacks

Most sites have a number of different page types that do not always follow the same structure. To accommodate this, it is a good idea to use fallbacks. For example, when defining your Title field, you could have the primary source be First H1, with a fallback to Page title (the <title> tag).
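The fallback idea can be sketched in a few lines of Python. This is a rough illustration under stated assumptions rather than the crawler's own logic: the function names are hypothetical, and the sample pages are well-formed markup so the standard library parser can read them.

```python
import xml.etree.ElementTree as ET

def first_h1(root):
    """Text of the first <h1>, or an empty string if the page has none."""
    node = root.find(".//h1")
    return "".join(node.itertext()).strip() if node is not None else ""

def page_title(root):
    """Text of the <title> tag, or an empty string if it is missing."""
    node = root.find(".//title")
    return "".join(node.itertext()).strip() if node is not None else ""

def resolve_title(html, sources=(first_h1, page_title)):
    """Try each source in order and return the first non-empty value."""
    root = ET.fromstring(html)
    for source in sources:
        value = source(root)
        if value:
            return value
    return ""

article = ("<html><head><title>Pricing | Example</title></head>"
           "<body><h1>Pricing</h1></body></html>")
landing = ("<html><head><title>Welcome | Example</title></head>"
           "<body><p>No heading here</p></body></html>")

print(resolve_title(article))  # Pricing
print(resolve_title(landing))  # Welcome | Example
```

The article page has an H1, so the primary source wins. The landing page has none, so the title falls back to the <title> tag instead of being left empty.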

Meta tags vs. XPath

When configuring a crawler, you need to specify where in the HTML each piece of data can be found.
For many fields, the data is found either in the meta tags in the <head> of the HTML or in the body. Data in the body can be located using XPath. However, since an XPath expression describes a path through the HTML structure, any change to that structure can cause the expression to stop matching.

To avoid this risk, it is generally recommended to configure crawler fields based on metadata. An exception to this would of course be the Description, where it is encouraged to write an XPath that grabs the page content.
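The following Python sketch shows the kind of breakage this recommendation guards against. The markup, the author meta tag and the positional path are all made up for the example. A banner added in a redesign shifts the position of the element the XPath pointed at, while the meta tag lookup is unaffected.

```python
import xml.etree.ElementTree as ET

# The same hypothetical page before and after a redesign that adds a banner.
BEFORE = ("<html><head><meta name='author' content='Ada'/></head>"
          "<body><div>Menu</div><div><span>Ada</span></div></body></html>")
AFTER = ("<html><head><meta name='author' content='Ada'/></head>"
         "<body><div>Banner</div><div>Menu</div>"
         "<div><span>Ada</span></div></body></html>")

def from_xpath(html, path="./body/div[2]/span"):
    """Read a value from the body using a position-based path."""
    node = ET.fromstring(html).find(path)
    return node.text if node is not None else None

def from_meta(html, name="author"):
    """Read a value from a named meta tag, wherever it sits in the document."""
    node = ET.fromstring(html).find(f".//meta[@name='{name}']")
    return node.get("content") if node is not None else None

print(from_xpath(BEFORE), from_meta(BEFORE))  # Ada Ada
print(from_xpath(AFTER), from_meta(AFTER))    # None Ada
```

If you do need an XPath, for example for the Description, selecting by a stable attribute such as an id is usually less fragile than selecting by position.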

Multiple sites

When you need a global search across sites that have different structures, it is recommended to configure multiple crawlers to account for those differences, since you’ll want to configure the page fields differently for each structure. All of the crawlers can later be added to the same engine.

If all of the sites have the same structure, it can be beneficial to keep them in one crawler for maximum efficiency.
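One way to decide how many crawlers you need is to group your sites by language and page structure, creating one crawler per group. The sketch below assumes you can label each site with a template name. The Site class, the template labels and the URLs are hypothetical and do not correspond to actual crawler settings.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Site:
    url: str
    language: str  # language the crawler should analyze the pages in
    template: str  # hypothetical label for sites sharing one page structure

def plan_crawlers(sites):
    """Group site URLs so that each (language, template) pair gets one crawler."""
    groups = defaultdict(list)
    for site in sites:
        groups[(site.language, site.template)].append(site.url)
    return dict(groups)

SITES = [
    Site("https://shop.example.com", "en", "shop"),
    Site("https://blog.example.com", "en", "blog"),
    Site("https://docs.example.com", "en", "blog"),
    Site("https://shop.example.de", "de", "shop"),
]

for (language, template), urls in plan_crawlers(SITES).items():
    print(language, template, urls)
```

Here the two English sites that share a structure end up in one crawler, while the German shop gets its own crawler because crawlers are language-specific.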

Avoiding duplicate results

If you’re dealing with duplicate results in your search, there’s probably an explanation for it. Visit this page to learn more!

Need help? Don’t hesitate to contact support!
