How to configure a crawl delay
A crawl delay limits how frequently the crawler requests pages on a website. This results in a slower overall crawl but can prevent overloading the website with too many requests at once. It is rarely needed, but it can be useful for "sensitive" servers that don't have a lot of bandwidth.
The crawl delay is defined in the website's robots.txt file, using the following format:
User-agent: cludo
Disallow:
Crawl-delay: 5
The example above sets the crawl delay to 5 seconds, allowing the crawler to make a new request only once every 5 seconds.
Keep in mind that a crawler will run for 24 hours before stopping, so setting a crawl delay can limit the maximum number of pages the crawler can fetch in a day. For example, a 10-second delay means at most 8,640 pages can be crawled in one day (86,400 seconds in a day divided by 10). For websites with a lot of pages, you should make sure the entire site can still be crawled within the 24-hour period.
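The calculation above can be sketched as a simple formula; this is just an illustrative snippet (the function name is our own, not part of any Cludo tooling):

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds in a 24-hour crawl window

def max_pages(crawl_delay_seconds: int) -> int:
    """Upper bound on pages the crawler can request in one 24-hour run,
    given a Crawl-delay of `crawl_delay_seconds` seconds between requests."""
    return SECONDS_PER_DAY // crawl_delay_seconds

print(max_pages(5))   # 17,280 pages with a 5-second delay
print(max_pages(10))  # 8,640 pages with a 10-second delay, as in the example
```

Running the numbers this way before setting a delay helps confirm the whole site fits inside a single crawl.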