How to exclude pages

Pages can be excluded in several different ways, each with its own set of pros and cons. It is recommended to read and understand all the ways of excluding pages before setting up the exclusion logic, to ensure the best setup for your needs.

How to exclude pages via the crawler

Crawler exclusions will prevent the crawler from ever indexing any pages that match the exclusion. Changes here affect all engines using the crawler in question.

  1. In the navigation, select Configuration, then Crawlers.
  2. Select the crawler in the list that you would like to exclude pages for.
  3. To exclude based on specific URLs:
    • Under Excluded from the crawl, enter a URL in the Pages field. You can use both absolute and relative URLs.
    • Click the plus icon.
  4. To exclude based on a URL regular expression:
    • Under Excluded from the crawl, enter a regular expression in the URL regex field.
    • Click the plus icon.
  5. Click the Save Crawler button.
  6. All pages that match the exclusion(s) will now be ignored by the crawler. Any existing pages that match the exclusion(s) will be dropped from the index.

Example

A crawler is set up to exclude /login/. This will make the crawler ignore pages like:
clumobile.com/login/my_profile
clumobile.com/login/my_subscription

However, a page like clumobile.com/login will still be crawled, as its URL does not contain /login/ with the trailing slash.
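As the example suggests, a page exclusion behaves like a simple substring match against the URL. A minimal sketch in Python for checking a list of URLs against an exclusion, assuming that substring-matching behavior:

# Sketch: check which URLs a page exclusion would catch, assuming the
# exclusion works as a plain substring match (as the example above suggests).
exclusion = "/login/"

urls = [
    "https://clumobile.com/login/my_profile",
    "https://clumobile.com/login/my_subscription",
    "https://clumobile.com/login",  # no trailing slash, so it is not matched
]

for url in urls:
    print(url, "->", "excluded" if exclusion in url else "crawled")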

Example

A crawler is set up with a URL regex of “products\/\d{3}$”. This will exclude any product page whose URL ends in exactly three digits, such as:
clumobile.com/products/123
clumobile.com/products/456
clumobile.com/products/321

However, any product page with more or fewer than three digits at the end of the URL would still be crawled, such as:
clumobile.com/products/12
clumobile.com/products/1234
clumobile.com/products/12345
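Before saving a URL regex in MyCludo, it can help to test it locally. A minimal sketch in Python, assuming the crawler matches the pattern anywhere in the URL (as the examples above imply); adjust the pattern to your own URL structure:

import re

# Mirror of the example pattern above.
pattern = re.compile(r"products\/\d{3}$")

urls = [
    "clumobile.com/products/123",   # excluded: ends in exactly three digits
    "clumobile.com/products/12",    # crawled: too few digits
    "clumobile.com/products/1234",  # crawled: too many digits
]

for url in urls:
    print(url, "->", "excluded" if pattern.search(url) else "crawled")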

How to exclude pages via the engine settings (engine filters)

Excluding pages via the engine will prevent pages from being shown in the engine’s results. It will only affect search engines, and only the single engine that it is being configured for.

  1. In the navigation, select Configuration, then Engines.
  2. Select the engine to exclude pages for in the list.
  3. Under Advanced -> Pre-filter search results, click the Add New Filter button.
  4. Select a crawled field to filter on in the dropdown.
  5. Type a field value in the field below the dropdown. Note that this value is case-sensitive and must reflect exactly what has been crawled.
  6. Click the plus icon.
  7. Optional: To add more field values, repeat the last two steps.
  8. Optional: To add more filters, repeat the last five steps.
  9. Click the Save Engine button.

Example

An engine is set up with a filter for the field “Category” to match “Blog”. This will limit the shown results for the engine to only show pages where the category is “Blog”.
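Conceptually, a pre-filter keeps only results whose crawled field matches the configured value exactly, including case. A sketch in Python of that matching logic (the result records are hypothetical, made up for illustration; the actual filtering happens inside the engine):

# Hypothetical result records; real filtering happens server-side.
results = [
    {"title": "Summer sale announced", "Category": "Blog"},
    {"title": "Returns policy", "Category": "blog"},  # wrong case, filtered out
    {"title": "CluMobile X2", "Category": "Products"},
]

# Case-sensitive exact match, as described in step 5 above.
filtered = [r for r in results if r.get("Category") == "Blog"]
for r in filtered:
    print(r["title"])  # prints only "Summer sale announced"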

How to exclude pages via the Excluded pages tool

Excluding pages via the Excluded pages tool allows specific pages to be removed from the search results of an engine. It will only affect search engines, and only the single engine that it is being configured for.

  1. In the navigation, select Configuration, then Excluded pages.
  2. Select the engine to exclude pages for from the drop-down list.
  3. Click the New button.
  4. Choose whether to find pages by title or URL using the Search Pages By radio buttons.
  5. Enter the title or URL of the page to exclude in the Find page field.
  6. Click the plus icon next to the correct result in the list.
  7. Optional: to add more pages repeat the previous three steps.
  8. Click the Save button.
  9. The selected page(s) are now added to the list of pages to be excluded from the search and will stay excluded until removed from the list.

Example

An engine is set up to exclude the blog overview page, so users only find individual blog posts among the results.

How to exclude pages via canonicalization

Excluding pages via a canonical tag will prevent pages from being indexed because the content on the page also exists on a different URL. It will affect all search engines including external search engines like Google. Read more on canonicalization here.

Setting up canonical tags does not happen in MyCludo, but rather in the HTML of a page. The canonical tag must point to the URL of a different page to cause the page to be ignored.

Example

Interacting with the pagination on a site's blog page generates a new URL, e.g. clumobile.com/blog/?p=1, clumobile.com/blog/?p=2, and so forth. We do not want any search engine to index these pages, only the first page, clumobile.com/blog/.

To achieve this, a canonical tag is added to point to the original /blog/ page:
<link rel="canonical" href="https://clumobile.com/blog/" />

With this canonical tag in place, the ?p= pages will not be indexed, but the /blog/ page will be.
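How the canonical URL is produced depends on the CMS, but for paginated pages the logic usually amounts to dropping the pagination parameters. A minimal sketch in Python, with a hypothetical canonical_url helper:

from urllib.parse import urlsplit, urlunsplit

# Hypothetical helper: derive the canonical URL for a paginated page by
# dropping the query string. Real sites usually emit this from the CMS template.
def canonical_url(url: str) -> str:
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

for page in ("https://clumobile.com/blog/?p=1", "https://clumobile.com/blog/?p=2"):
    print(f'<link rel="canonical" href="{canonical_url(page)}" />')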

How to exclude pages via noindex tags

Excluding pages via the noindex tag will prevent the page from being indexed at all. It will affect all search engines including external search engines like Google.

Setting up noindex tags is not done in MyCludo, but rather in the HTML of a page. A meta tag named “robots” must be set with the content value of “noindex” as shown here:

<meta name="robots" content="noindex">

Most CMSes have an option to exclude a page from search, which injects a noindex tag on the page in question.
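To confirm that the CMS actually injected the tag, the page's HTML can be inspected. A minimal sketch in Python using only the standard library (fetch the HTML however you prefer):

from html.parser import HTMLParser

# Sketch: verify that a page's HTML carries a robots noindex meta tag.
class NoindexChecker(HTMLParser):
    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and attr.get("name", "").lower() == "robots":
            if "noindex" in (attr.get("content") or "").lower():
                self.noindex = True

html = '<html><head><meta name="robots" content="noindex"></head><body></body></html>'
checker = NoindexChecker()
checker.feed(html)
print("noindex present:", checker.noindex)  # True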

How to exclude pages via robots.txt rules

It is possible to exclude pages or entire areas of a site by leveraging robots.txt rules. The rules are flexible, so it is possible to make specific rules for Cludo, or make global rules which will also affect external search engines like Google and Bing.

Setting up the robots.txt file is not done in MyCludo, but rather on the web server hosting the website. The file must follow a specific format, in which one or more user agents can be addressed with rules for what may and may not be crawled by that agent. Read more about robots.txt here.

Example of robots.txt

User-agent: Googlebot
Disallow: /employee/

User-agent: cludo
Allow: /

User-agent: *
Disallow: /internal/

In the example above, Google’s crawler (Googlebot) is not allowed to crawl any URL whose path starts with /employee/. Cludo’s user agent (cludo) is allowed full access to crawl any URL it can find on the site. Any other user agent (*) is not allowed to crawl URLs whose path starts with /internal/. Note that a crawler follows only the most specific group that matches its user agent, so Googlebot and cludo obey their own rules rather than the wildcard rules.
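To verify how these rules apply to a given crawler and URL, the file can be tested locally with Python's built-in robots.txt parser. A minimal sketch (the user agents and URLs mirror the example above):

import urllib.robotparser

# Sketch: test the example rules above. Each crawler follows only its
# most specific matching group.
rules = """\
User-agent: Googlebot
Disallow: /employee/

User-agent: cludo
Allow: /

User-agent: *
Disallow: /internal/
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch("Googlebot", "https://clumobile.com/employee/jane"))     # False
print(parser.can_fetch("cludo", "https://clumobile.com/internal/docs"))         # True: its own group wins
print(parser.can_fetch("SomeOtherBot", "https://clumobile.com/internal/docs"))  # False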
