What is a Crawler?

A crawler is usually the first step in creating a functional site search. The crawler is responsible for checking all available pages on the given domain(s) and indexing pages and files according to its configuration. Once indexed, those pages and files can be searched using the search engine.

When indexing pages, the crawler keeps track of unique pages, so the same URL is never indexed twice by the same crawler.
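
As a rough illustration only (not the crawler's actual implementation), this deduplication can be thought of as keeping a set of URLs that have already been indexed. The Python sketch below assumes a simple queue-based crawl with a caller-supplied link-discovery function; the example site and link graph are made up:

    from urllib.parse import urldefrag

    def crawl(start_urls, fetch_links):
        """Illustrative sketch: index each unique URL at most once."""
        seen, queue, indexed = set(), list(start_urls), []
        while queue:
            url, _ = urldefrag(queue.pop(0))   # ignore #fragments when comparing URLs
            if url in seen:
                continue                       # the same URL is never indexed twice
            seen.add(url)
            indexed.append(url)                # stand-in for the real indexing step
            queue.extend(fetch_links(url))     # discover further links on the page
        return indexed

    # Usage: a toy link graph standing in for a real site
    links = {
        "https://example.com/": ["https://example.com/a", "https://example.com/"],
        "https://example.com/a": ["https://example.com/"],
    }
    print(crawl(["https://example.com/"], lambda u: links.get(u, [])))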

How does a Crawler find pages and files?

XML sitemaps

Sitemaps are structured files, usually auto-generated, that provide a list of all the pages on a site. A sitemap file will typically contain the URL of each page along with its last modified time.

When the crawler locates a sitemap, it will look through all the URLs and attempt to index all pages and files that match the crawler settings.
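
For reference, a minimal sitemap looks like the snippet embedded below, and the URL and last-modified entries can be read out of it as shown. The namespace is the standard sitemaps.org schema; the URLs and dates are illustrative assumptions, and this is not the crawler's own parsing code:

    import xml.etree.ElementTree as ET

    SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>https://example.com/</loc>
        <lastmod>2024-01-15</lastmod>
      </url>
      <url>
        <loc>https://example.com/docs/getting-started</loc>
        <lastmod>2024-02-03</lastmod>
      </url>
    </urlset>"""

    NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    root = ET.fromstring(SITEMAP)
    for entry in root.findall("sm:url", NS):
        loc = entry.findtext("sm:loc", namespaces=NS)
        lastmod = entry.findtext("sm:lastmod", namespaces=NS)
        print(loc, lastmod)   # each page URL plus its last-modified time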

Following links on the site

In addition to the sitemap, the crawler always detects the links available on the pages it comes across, follows them, and indexes those pages that match the crawler settings.

The crawler will only ever crawl and index URLs that match the domain(s) added in its settings.
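
A rough illustration of that domain restriction is shown below; the allowed-domain list and the helper function are assumptions made for the example, not part of the product's configuration:

    from urllib.parse import urljoin, urlparse

    ALLOWED_DOMAINS = {"example.com", "docs.example.com"}  # assumed crawler setting

    def in_scope(link, base_url):
        """Resolve a link found on a page and check it against the configured domains."""
        absolute = urljoin(base_url, link)       # handles relative links like "/pricing"
        return urlparse(absolute).hostname in ALLOWED_DOMAINS

    print(in_scope("/pricing", "https://example.com/"))                # True
    print(in_scope("https://other.org/page", "https://example.com/"))  # False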
