Best practice for avoiding duplicate results

When searching, you may see the same content appear more than once in the results. A single crawler does not index the same URL twice, so duplicates within one crawler are always caused by the same content existing on multiple URLs. The one exception is an engine that has two crawlers indexing the same pages.

Multiple pages with the same content

When running, the crawler detects and follows all available links on the site. Many of these links differ only in their URL parameters, so the same content ends up on multiple unique URLs. To account for this, as with SEO in general, we encourage the use of canonical tags.

A canonical tag (aka “rel canonical”) is a way of telling search engines that
a specific URL represents the master copy of a page. Using the canonical tag prevents
problems caused by identical or “duplicate” content appearing on multiple URLs.
Practically speaking, the canonical tag tells search engines which version of a URL
you want to appear in search results.

moz.com: Canonicalization
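
To make the idea concrete, here is a minimal Python sketch of how a canonical tag can be read from a page. It is an illustration only, not the crawler's actual implementation. The `CanonicalFinder` class, the `find_canonical` helper, and the example.com URLs are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collects the href of the first <link rel="canonical"> in a page."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        # Only <link> tags matter, and only the first canonical one is kept.
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower().split()
        if "canonical" in rel:
            self.canonical = attr.get("href")


def find_canonical(html):
    """Return the canonical URL declared in the HTML, or None if absent."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


# Hypothetical page served at https://example.com/shoes?sort=price
page = """
<html><head>
  <link rel="canonical" href="https://example.com/shoes">
</head><body>...</body></html>
"""

print(find_canonical(page))  # https://example.com/shoes
```

If every parameterized variant of the page declares the same canonical URL, all variants can be treated as one page rather than as separate results.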

In addition to canonical tags, the crawler respects noindex and nofollow directives as well as the rules set in the robots.txt file. These checks can be disabled in the crawler settings, but we encourage keeping them enabled.
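
As a rough illustration of how robots.txt rules can keep duplicate-prone URLs out of the index, the sketch below uses Python's standard `urllib.robotparser` module. The rules, the example.com URLs, and the `ExampleCrawler` user agent are assumptions made up for this example. The crawler's own rule matching may differ in detail.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks search-result and print-view pages,
# two common sources of duplicate content.
robots_txt = """\
User-agent: *
Disallow: /search
Disallow: /print/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

urls = [
    "https://example.com/blog/post-1",
    "https://example.com/search?q=shoes",
    "https://example.com/print/post-1",
]

for url in urls:
    # "ExampleCrawler" is a placeholder user agent, not the real crawler's name.
    allowed = parser.can_fetch("ExampleCrawler", url)
    print(url, "allowed" if allowed else "blocked")
```

With these rules, only the blog post is allowed. The search and print-view URLs are blocked and would never reach the index.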

Multiple crawlers on a single engine

While each crawler keeps a unique list of indexed pages, an engine with multiple crawlers can show the same page twice in the results if more than one of those crawlers indexes it. Keep each crawler's logic clear and easy to understand to avoid accidentally having the same page indexed by more than one crawler added to the same engine.

Example

A crawler named “Blog” is configured to crawl only the blog posts on a website. Another crawler called “All” is configured to crawl all content on the same site. If both crawlers are added to a single engine, every blog post is shown twice in the results, because it is indexed by both crawlers. The easy solution in this case is to remove the “Blog” crawler from the engine, or to have the “All” crawler exclude pages from the blog section of the site.
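
The overlap in this example can be checked with a short Python sketch. It models each crawler's scope as include and exclude path prefixes, which is a simplification assumed for illustration and not the product's actual configuration format. The `in_scope` and `overlapping_urls` helpers and the example.com URLs are hypothetical.

```python
from urllib.parse import urlparse


def in_scope(url, include, exclude=()):
    """True if the URL path starts with an include prefix and no exclude prefix."""
    path = urlparse(url).path
    if any(path.startswith(prefix) for prefix in exclude):
        return False
    return any(path.startswith(prefix) for prefix in include)


def overlapping_urls(urls, crawlers):
    """Map each URL claimed by more than one crawler to those crawlers' names."""
    overlaps = {}
    for url in urls:
        owners = [name for name, scope in crawlers.items() if in_scope(url, **scope)]
        if len(owners) > 1:
            overlaps[url] = owners
    return overlaps


urls = [
    "https://example.com/",
    "https://example.com/blog/post-1",
    "https://example.com/products/shoes",
]

# Both crawlers on one engine: the blog post is claimed twice.
crawlers = {
    "Blog": {"include": ["/blog/"]},
    "All": {"include": ["/"]},
}
print(overlapping_urls(urls, crawlers))
# {'https://example.com/blog/post-1': ['Blog', 'All']}

# Excluding the blog section from "All" removes the overlap.
crawlers["All"]["exclude"] = ["/blog/"]
print(overlapping_urls(urls, crawlers))
# {}
```

Running a check like this against a sample of site URLs before adding a second crawler to an engine shows which pages would be indexed twice.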
