Tag: Crawling

Crawlers

What is a crawler? A crawler is usually the first step to creating a functional site search. The crawler is in charge of checking all available pages on the given domain(s) and indexing any pages and files according to the configuration. Once indexed, the pages and files can be searched . . . Read more

How to set up date crawling

You can configure your crawler to index dates, which can then be used for purposes such as date filtering or date freshness boosting. To set this up: When selecting the Date data type, a Date format field will appear. It’s generally recommended to leave this field blank as the crawler . . . Read more

Ways to index content

Indexing content is an important part of efficient search functionality. There are three distinct ways to index content. Crawling: Crawling is the most common method of indexing content and is performed automatically when a crawler is set up and activated. Crawling is the activity of . . . Read more

How to use cludooff/cludoon

If there are certain parts of the content on a page that you would like the crawler to ignore, this can be achieved using cludooff/cludoon.
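As a rough illustration of the idea, the sketch below removes everything between an "off" marker and the next "on" marker before a page's text would be indexed. The marker strings (HTML comments of the form `<!--cludooff: index-->` and `<!--cludoon: index-->`) are an assumption made for this example, and `strip_ignored_sections` is a hypothetical helper, not part of Cludo; see the full article for the exact syntax.

```python
import re

# Assumed marker syntax, for illustration only; confirm it in the full article.
OFF_MARKER = "<!--cludooff: index-->"
ON_MARKER = "<!--cludoon: index-->"


def strip_ignored_sections(html: str) -> str:
    """Return html with every OFF_MARKER ... ON_MARKER span removed.

    An "off" marker without a matching "on" marker is left untouched.
    """
    pattern = re.compile(
        re.escape(OFF_MARKER) + r".*?" + re.escape(ON_MARKER),
        flags=re.DOTALL,  # let the ignored span cover several lines
    )
    return pattern.sub("", html)


page = (
    "<p>Keep this paragraph.</p>"
    + OFF_MARKER + "<nav>Skip this menu.</nav>" + ON_MARKER
    + "<p>Keep this one too.</p>"
)
print(strip_ignored_sections(page))
```

Wrapping repeated elements such as navigation menus or footers this way keeps them out of the indexed text while leaving the page itself unchanged for visitors.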

How does the crawler index and delete pages?

Cludo’s strategy for crawling sites is based on finding as many pages as possible within the user-defined domains, then indexing and storing their content. The step-by-step process is shown in detail in the diagram at the end of the article and is explained further below. Crawling: Step-by-step process. 1: Sites . . . Read more
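The general idea of finding as many pages as possible within a set of domains can be sketched as a breadth-first traversal of links. This is a minimal sketch and not Cludo's actual implementation: the `SITE` dictionary stands in for real HTTP requests, and the `crawl` function and its parameters are hypothetical names chosen for this example. Page deletion is not covered.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
SITE = {
    "https://example.com/": ("Home", ["https://example.com/a", "https://other.org/x"]),
    "https://example.com/a": ("Page A", ["https://example.com/", "https://example.com/b"]),
    "https://example.com/b": ("Page B", []),
    "https://other.org/x": ("External", []),
}


def crawl(start_url, allowed_domains):
    """Follow links breadth-first and store the text of every in-domain page."""
    index = {}
    queue = deque([start_url])
    seen = {start_url}  # each URL is queued at most once
    while queue:
        url = queue.popleft()
        # Skip pages outside the configured domains and pages that do not exist.
        if urlparse(url).netloc not in allowed_domains or url not in SITE:
            continue
        text, links = SITE[url]
        index[url] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index


index = crawl("https://example.com/", {"example.com"})
print(sorted(index))
```

The `seen` set is what keeps the same URL from being indexed twice, even when several pages link to it.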

How to set up scoped searching

If you would like an existing engine to show results only for a specific area, you can do so by adding a filter in the script. Scoped search lets you limit search results to a specific section or type of content within the website instead of searching across the whole . . . Read more
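Conceptually, a scoped search applies a filter on top of the normal query so that only one section's pages can match. The sketch below shows that idea on an in-memory list; the `section` field, the sample data, and the `scoped_search` function are hypothetical and do not reflect the actual filter syntax of the Cludo script.

```python
# Hypothetical indexed pages, each tagged with the site section it belongs to.
RESULTS = [
    {"title": "Library opening hours", "section": "library"},
    {"title": "Library card renewal", "section": "library"},
    {"title": "Pool opening hours", "section": "sports"},
]


def scoped_search(results, query, section):
    """Return titles that contain the query, limited to a single section."""
    needle = query.lower()
    return [
        page["title"]
        for page in results
        if page["section"] == section and needle in page["title"].lower()
    ]


print(scoped_search(RESULTS, "opening hours", "library"))
```

Without the `section` condition, the same query would also return the pool page.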

How does Cludo index files?

As long as a file is machine-readable (not an image), Cludo is able to crawl its content along with the information sent with the HTTP headers. File titles: You can choose how the file title is extracted by selecting one of the following options. Automatic: The default option is Automatic, . . . Read more

Page Inventory

This article contains the following sections: Pages; Crawler summary. If you’re ever wondering about the number of pages in your search results or find the need to check up on any indexed content, Page Inventory is here to help. Page Inventory will provide you with an overview of indexed content . . . Read more

Best practice for avoiding duplicate results

When searching, you may experience the same content appearing more than once in the results. Since a crawler is unable to index the same URL twice, this will always be due to the same content existing on multiple URLs. That is, of course, unless you have two crawlers that index . . . Read more
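One way to check whether the same content exists on multiple URLs is to group pages by a hash of their normalized text. This is a minimal sketch under that assumption; the `find_duplicates` function and the sample URLs are hypothetical, and it is not a description of how Cludo detects duplicates.

```python
import hashlib
from collections import defaultdict


def find_duplicates(pages):
    """Group URLs whose text is identical after whitespace and case normalization."""
    by_hash = defaultdict(list)
    for url, content in pages.items():
        normalized = " ".join(content.split()).lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        by_hash[digest].append(url)
    # Only groups with more than one URL are duplicates.
    return [sorted(urls) for urls in by_hash.values() if len(urls) > 1]


pages = {
    "https://example.com/about": "About  our company",
    "https://example.com/about?ref=nav": "about our company",
    "https://example.com/contact": "Contact us",
}
print(find_duplicates(pages))
```

In this example the two `/about` URLs differ only by a query parameter, which is a common source of duplicate results.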