Tag: Crawling

Ways to index content

Indexing content is an important part of efficient search functionality. There are three distinct ways to index content. Crawling: Crawling is the most common method of indexing content and is performed automatically once a crawler is set up and activated. Crawling is the activity of . . . Read more

How to use the Update Content tool

When you have recently published or updated a page and would like it to be indexed for your search immediately, the Update Content tool can come in handy.

How to use cludooff/cludoon

If there are certain parts of the content on a page that you would like the crawler to ignore, this can be achieved using cludooff/cludoon.

How does the crawler index and delete pages?

Cludo’s strategy for crawling sites is based on finding as many pages as possible within the user-defined domains, indexing, and storing their content. The step-by-step process can be seen in detail in the diagram at the end of the article and will be explained further below: Crawling: Step-by-step process 1: Sites . . . Read more

How to set up scoped searching

If you would like an existing engine to only show results for a specific area, this can be done by adding a filter in the script. Scoped search allows you to limit search results to a specific section or type of content within the website instead of searching across the whole . . . Read more

How does Cludo index files?

As long as a file is machine-readable (not an image), Cludo is able to crawl its content along with the information sent with the HTTP headers. File titles: It is possible to select how the file title should be extracted by selecting one of the following: Automatic: The default option is Automatic, . . . Read more

What is Page Inventory?

If you’re ever wondering about the number of pages in your search results or need to check up on any indexed content, Page Inventory is here to help. Page Inventory gives you an overview of indexed content for all your crawlers to provide you with a . . . Read more

Best practice for avoiding duplicate results

When searching, you may experience the same content appearing more than once in the results. Since a crawler is unable to index the same URL twice, this will always be due to the same content existing on multiple URLs. That is, of course, unless you have two crawlers that index . . . Read more
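
To illustrate how the same content can end up on multiple URLs, here is a minimal Python sketch that groups pages by a hash of their text. It is not part of Cludo's tooling: the function name `find_duplicate_content`, the `pages` mapping, and the example URLs are all hypothetical, and the sketch assumes you already have the extracted text of each page.

```python
import hashlib
from collections import defaultdict


def find_duplicate_content(pages):
    """Group URLs whose page text is identical after light normalization.

    `pages` is a hypothetical mapping of URL -> extracted page text.
    Returns a list of URL groups, each containing two or more URLs.
    """
    groups = defaultdict(list)
    for url, text in pages.items():
        # Collapse whitespace and lowercase so trivial differences are ignored.
        normalized = " ".join(text.split()).lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        groups[digest].append(url)
    # Keep only the hashes that more than one URL maps to.
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]


# Illustrative data only; these URLs are made up.
sample_pages = {
    "https://example.com/about": "About  us\nWe build search.",
    "https://example.com/about?ref=nav": "About us We build search.",
    "https://example.com/contact": "Contact us",
}
duplicates = find_duplicate_content(sample_pages)
```

A check like this only catches exact matches after normalization; pages that differ by a banner or timestamp would need a fuzzier comparison.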

What is smart crawling?

Smart crawling allows the crawler to run more frequently, leveraging the XML sitemap(s) of your site. It uses the lastmod timestamps in the sitemap to detect if a page was updated since the last crawl. This allows the crawler to only re-crawl recently modified pages, saving time and resources when . . . Read more
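
The idea of using lastmod timestamps to pick out recently modified pages can be sketched in Python as follows. This is an illustration of the general technique under stated assumptions, not Cludo's implementation: the function names, the sample sitemap, and the choice to treat pages without a lastmod value as changed are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_lastmod(value):
    """Parse a lastmod value (date or datetime) into an aware datetime."""
    value = value.strip()
    if value.endswith("Z"):
        # Older Python versions do not accept a trailing "Z" in fromisoformat.
        value = value[:-1] + "+00:00"
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        # Assumption: values without a timezone are treated as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def urls_modified_since(sitemap_xml, last_crawl):
    """Return the URLs whose lastmod is later than `last_crawl`."""
    root = ET.fromstring(sitemap_xml)
    changed = []
    for entry in root.findall("sm:url", SITEMAP_NS):
        loc = entry.findtext("sm:loc", namespaces=SITEMAP_NS)
        lastmod = entry.findtext("sm:lastmod", namespaces=SITEMAP_NS)
        if loc is None:
            continue
        # Assumption: a page with no lastmod is re-crawled to be safe.
        if lastmod is None or parse_lastmod(lastmod) > last_crawl:
            changed.append(loc.strip())
    return changed


# Illustrative sitemap; the URLs are made up.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/old</loc><lastmod>2024-04-01</lastmod></url>
  <url><loc>https://example.com/new</loc><lastmod>2024-05-02T10:00:00Z</lastmod></url>
  <url><loc>https://example.com/unknown</loc></url>
</urlset>"""

last_crawl = datetime(2024, 5, 1, tzinfo=timezone.utc)
to_recrawl = urls_modified_since(SAMPLE_SITEMAP, last_crawl)
```

Here only the page modified after the last crawl and the page with no lastmod value are selected, which is why accurate lastmod values in the sitemap matter for this approach.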

How to test a crawler

To test the crawler configuration, you can run a crawl against a specific URL to see which data will be indexed for the page.