Tag: Crawling

What is smart crawling?

Smart crawling allows the crawler to run more frequently, leveraging the XML sitemap(s) of your site. It uses the lastmod timestamps in the sitemap to detect if a page was updated since the last crawl. This allows the crawler to only re-crawl recently modified pages, saving time and resources when …
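For illustration, this is what a sitemap entry with a lastmod timestamp looks like (the URL and date are placeholders):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <!-- lastmod tells the crawler when this page last changed -->
    <loc>https://www.example.com/news/article-1</loc>
    <lastmod>2024-05-01T09:30:00+00:00</lastmod>
  </url>
</urlset>
```

If lastmod is newer than the time of the previous crawl, the page is re-crawled; otherwise it can be skipped.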

Setting up time-scheduled crawling

It is possible to schedule the crawler to run at a specific time of day. Currently, it is not possible to set a crawler's time schedule via MyCludo. To configure time-scheduled crawling, submit a support ticket letting us know your timezone and at which time of …

What is async crawling?

Async crawling is meant for websites where content is loaded asynchronously (AJAX-generated content). AJAX lets the page exchange data with the server and update its content without the browser having to reload the page. For example, if you hit a “Submit” button on the page, AJAX processes the information and updates the content …
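As a rough sketch of what AJAX-generated content means in practice (the endpoint and element IDs below are hypothetical), the page fetches data in the background and injects it into the DOM without a reload:

```ts
// Hypothetical endpoint and element IDs, for illustration only.
async function onSubmit(): Promise<void> {
  const response = await fetch("/api/results"); // data is fetched in the background
  const data: { message: string } = await response.json();

  // The new text is injected after the initial HTML has already been served,
  // so a crawler that only reads the initial HTML would never see it.
  const target = document.getElementById("results");
  if (target) {
    target.textContent = data.message;
  }
}

document.getElementById("submit")?.addEventListener("click", () => void onSubmit());
```

Content produced this way only exists after the scripts on the page have run, which is what async crawling is designed to pick up.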

Can I set up smart crawling?

Currently, it’s not possible to enable smart crawling directly in MyCludo. To enable it, submit a support request letting us know which crawler you would like smart crawling enabled for. A support agent will confirm that your site is eligible and inform you when it is …

What are the crawlers’ user agent and IP addresses?

In some cases, the crawler may be blocked from indexing your website. To fix this, you may need to whitelist our IP addresses to allow the crawler to access the site. Our crawler’s user agent can be referred to simply as cludo:

User-agent: cludo
Allow: *

Our …
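As a minimal robots.txt sketch, assuming a hypothetical /private/ path you want to keep other bots out of, the cludo user agent can be singled out like this:

```
# Allow the cludo crawler everywhere,
# while keeping other bots out of a hypothetical /private/ section.
User-agent: cludo
Allow: *

User-agent: *
Disallow: /private/
```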

What is the maximum file size Cludo can index?

Cludo’s crawlers can index files up to 15 MB. Anything larger can be pushed directly via Cludo’s API. During extraction, images and other irrelevant content are stripped from a file before its size is measured against this limit. For reference, the raw text of the entire Bible is around 5 MB.
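The exact push endpoint, authentication, and payload shape depend on your Cludo account and are not shown here; the TypeScript sketch below uses placeholder values purely to illustrate the idea of pushing an oversized document directly instead of having it crawled:

```ts
// Hypothetical endpoint, credentials, and payload shape - consult the
// Cludo API documentation for the real values before adapting this.
async function pushDocument(): Promise<void> {
  const response = await fetch("https://api.example.com/documents", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      // Placeholder credentials.
      Authorization: "Basic " + Buffer.from("customerId:apiKey").toString("base64"),
    },
    body: JSON.stringify({
      Title: "Quarterly report",
      Url: "https://www.example.com/files/quarterly-report.pdf",
      Description: "Text extracted from a file too large for the crawler",
    }),
  });

  if (!response.ok) {
    throw new Error(`Push failed with status ${response.status}`);
  }
}

pushDocument().catch(console.error);
```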

Deleting a crawler

For security reasons, crawlers can only be deleted by Cludo staff. If you need to delete a crawler, please contact support and let us know the ID(s) of the crawler(s) you would like to delete.

What file types does Cludo index?

The indexability of a file is not defined by its extension (e.g. “.pdf”), but rather by the content type, as returned in the HTTP headers. In the list below, we have added extensions as examples. Supported file types …
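Because the decision is based on headers rather than the file name, you can check what a given URL reports by looking at its Content-Type response header; a small TypeScript sketch with a placeholder URL:

```ts
// Placeholder URL; fetch is available in Node 18+ and in browsers.
async function printContentType(url: string): Promise<void> {
  const response = await fetch(url, { method: "HEAD" }); // headers only, no body
  console.log(response.headers.get("content-type"));     // e.g. "application/pdf"
}

printContentType("https://www.example.com/files/report.pdf").catch(console.error);
```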

How many requests does the crawler make?

Our crawler will always attempt to make as many requests as possible, often requesting multiple pages per second, but the actual frequency depends on how the website’s server responds. Some websites might also have a crawl delay set in their robots.txt, which can impact how many requests …
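A crawl delay is declared in robots.txt; a minimal sketch, assuming a 2-second delay aimed specifically at our crawler’s user agent:

```
# Ask the cludo crawler to wait at least 2 seconds between requests.
User-agent: cludo
Crawl-delay: 2
```

A higher value means fewer requests per second, which in turn lengthens how long a full crawl takes.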

Why is this page not indexed?

Once a crawler has crawled the defined domain(s), you may find that a specific page has not been added to the search index. This will typically be due to one of the following reasons: … Feel free to contact support if you have further questions about why a page was not indexed as …