Tag: Crawling
What is a crawler? A crawler is usually the first step to creating a functional site search. The crawler is in charge of checking all available pages on the given domain(s) and indexing any pages and files according to the configuration. Once indexed, the pages and files can be searched . . . Read more
You can configure your crawler to index dates, which can be utilized for various purposes such as date filtering or date freshness boosting. To set this up: When selecting the Date data type, the indexed value will appear in the Date format. You can verify the correct date format by . . . Read more
Indexing content is an important part of efficient search functionality. There are three distinct ways to index content. Crawling: Crawling is the most common method of indexing content and is performed automatically when a crawler is set up and activated. Crawling is the activity of . . . Read more
When you have recently published or updated a page and want it indexed for your search immediately, the Update Content tool can come in handy.
If there are certain parts of the content on a page that you would like the crawler to ignore, this can be achieved using cludooff/cludoon.
Cludo’s strategy for crawling sites is based on finding as many pages as possible within the user-defined domains, then indexing and storing their content. The step-by-step process is shown in detail in the diagram at the end of the article and is explained further below. Crawling: Step-by-step process. 1: Sites . . . Read more
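The general idea of finding as many pages as possible within a set of user-defined domains can be illustrated with a generic breadth-first traversal. This is only a minimal sketch and not Cludo's actual implementation: the function `crawl`, the `get_links` callback, and the example URLs are all hypothetical, and a plain dictionary stands in for fetching and parsing real pages.

```python
from collections import deque
from urllib.parse import urlparse


def crawl(start_urls, allowed_domains, get_links):
    """Breadth-first traversal that never leaves the allowed domains.

    get_links is a hypothetical callback returning the URLs linked from a
    page; a real crawler would fetch and parse the page instead.
    """
    seen = set()
    queue = deque()
    for url in start_urls:
        if url not in seen and urlparse(url).hostname in allowed_domains:
            seen.add(url)
            queue.append(url)

    visited = []
    while queue:
        url = queue.popleft()
        visited.append(url)  # a real crawler would index and store content here
        for link in get_links(url):
            # Skip URLs already queued and links that point outside the domains.
            if link not in seen and urlparse(link).hostname in allowed_domains:
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical link graph used in place of real HTTP requests.
LINKS = {
    "https://example.com/": ["https://example.com/a", "https://other.org/x"],
    "https://example.com/a": ["https://example.com/", "https://example.com/b"],
}

pages = crawl(["https://example.com/"], {"example.com"}, lambda u: LINKS.get(u, []))
```

The `seen` set is what keeps each URL from being visited more than once, even when pages link back to each other.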
As long as a file is machine-readable (not an image), Cludo is able to crawl its content along with the information sent with the HTTP headers. How to enable or disable file indexing By default, the crawler is configured to index files for the specified domain. You can enable or disable . . . Read more
If you’re ever wondering about the number of pages in your search results or find the need to check up on any indexed content, Page Inventory is here to help. Page Inventory will provide you with an overview of indexed content for all your crawlers to provide you with a . . . Read more
When searching, you may experience the same content appearing more than once in the results. Since a crawler is unable to index the same URL twice, this will always be due to the same content existing on multiple URLs. That is, of course, unless you have two crawlers that index . . . Read more
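One way to spot the same content living on multiple URLs is to group pages by a hash of their normalized text. The sketch below is an illustration only, not a Cludo feature: the function `find_duplicate_content`, the normalization rule (collapse whitespace, lowercase), and the sample pages are assumptions made for the example.

```python
import hashlib
from collections import defaultdict


def find_duplicate_content(pages):
    """Return groups of URLs whose page text is identical after normalization.

    pages maps URL -> extracted page text (hypothetical input format).
    """
    groups = defaultdict(list)
    for url, text in pages.items():
        # Collapse whitespace and lowercase so trivial differences are ignored.
        normalized = " ".join(text.split()).lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        groups[digest].append(url)
    # Only groups with more than one URL are duplicates.
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]


# Hypothetical sample: the same article reachable with and without a query string.
sample_pages = {
    "https://example.com/a": "Hello  World",
    "https://example.com/a?ref=1": "hello world",
    "https://example.com/b": "Something else",
}

duplicates = find_duplicate_content(sample_pages)
```

An exact hash only catches content that matches after normalization; pages that differ by even a single word end up in separate groups.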
Smart crawling allows the crawler to run more frequently, leveraging the XML sitemap(s) of your site. It uses the lastmod timestamps in the sitemap to detect if a page was updated since the last crawl. This allows the crawler to only re-crawl recently modified pages, saving time and resources when . . . Read more
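The lastmod comparison described above can be sketched with a standard XML sitemap. This is a minimal illustration, not Cludo's implementation: the function names, the sample sitemap, and the choice to re-crawl any URL that has no lastmod value are assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_lastmod(text):
    """Parse a lastmod value such as 2024-05-10 or 2024-05-10T08:00:00Z."""
    text = text.strip()
    if text.endswith("Z"):
        text = text[:-1] + "+00:00"
    parsed = datetime.fromisoformat(text)
    if parsed.tzinfo is None:
        # Assumption: values without an offset are treated as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def urls_modified_since(sitemap_xml, last_crawl):
    """Return the URLs whose lastmod is newer than the previous crawl."""
    root = ET.fromstring(sitemap_xml)
    changed = []
    for entry in root.findall("sm:url", SITEMAP_NS):
        loc = entry.findtext("sm:loc", namespaces=SITEMAP_NS)
        lastmod = entry.findtext("sm:lastmod", namespaces=SITEMAP_NS)
        if loc is None:
            continue
        # Assumption: a URL without lastmod is re-crawled to be safe.
        if lastmod is None or parse_lastmod(lastmod) > last_crawl:
            changed.append(loc.strip())
    return changed


# Hypothetical sitemap used for illustration.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/a</loc><lastmod>2024-05-10T08:00:00Z</lastmod></url>
  <url><loc>https://example.com/b</loc><lastmod>2024-04-01</lastmod></url>
  <url><loc>https://example.com/c</loc></url>
</urlset>"""

last_crawl = datetime(2024, 5, 1, tzinfo=timezone.utc)
to_recrawl = urls_modified_since(SAMPLE_SITEMAP, last_crawl)
```

Only pages returned by `urls_modified_since` would need to be fetched again, which is where the time and resource savings come from.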