This post originally appeared in 2012. I’ve updated it to include more approaches to handling duplicate and similar content. May 8, 2014
A while back, I wrote about how Bing had duplicate content in its index even with the use of some tools that should have removed it. Here I cover the topic of duplicate content on dynamic websites. As you probably know, duplicate content is a common technical SEO problem, and it is frighteningly easy to create on dynamic websites.
If you see that some of your pages end up in the dreaded supplemental index, then you may have a duplicate content problem.
There are three common culprits for duplicate content:
This is one reason to avoid using categories in your WordPress permalinks.
There are several fixes you can consider for duplicate content.
<meta name="robots" content="noindex">. The next time Google crawls the page, it will remove the page from its index. This solution is straightforward when there are actually two separate duplicate pages, rather than different URLs leading to the same page. In the latter case, the rel canonical tag might work better for you.
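If you want to confirm that a page really serves the noindex tag before waiting on a recrawl, a quick check along these lines can help. This is only a minimal sketch using Python's standard library; the names `RobotsMetaParser` and `has_noindex` are my own for illustration, and it only looks at the meta tag in the HTML you pass in.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. <meta name>), so default to "".
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() == "robots":
            for directive in attr_map.get("content", "").split(","):
                self.directives.add(directive.strip().lower())


def has_noindex(html):
    """Return True if the HTML contains a robots meta tag with noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives
```

You would feed it the HTML of the duplicate page, however you fetch it, and expect `True` for pages you want dropped from the index.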
There is no one solution that fits all situations. It depends on your technical capabilities (for some sites, 301 redirects are a problem to implement) and the nature of the duplicate content. Removing URLs via Google Webmaster Tools (GWMT) is a very fast solution (waiting for Google to recrawl pages with noindex tags on a very large site can take months!). However, this approach creates a black hole that spiders can’t enter. If you are removing a lot of duplicate, low quality pages and you want PageRank to flow through them to the rest of the site, the noindex tag might be a better bet. And you don’t want to remove URLs that have a lot of good backlinks to them. I typically use the Remove URL feature for removing low quality sections of websites that have been hit by Panda, rather than for addressing some of the examples I give above.
A common mistake that website owners make is to set up Disallows in the robots.txt file and think they are done fixing their duplicate content. Nothing could be further from the truth. All the Disallow does is prevent the spider from crawling the URLs; the URLs are NOT removed from the index. And don’t think adding the noindex tag to the page helps: the spider will never see the tag, because it is blocked from crawling the page.
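To spot pages where a noindex tag can never be seen, you can test your URLs against your robots.txt rules. The sketch below uses Python's standard `urllib.robotparser`; the helper name `is_crawl_blocked`, the sample rules, and the example.com URLs are assumptions for illustration. Note that this parser does simple prefix matching and may not mirror every nuance of how Googlebot reads robots.txt.

```python
from urllib.robotparser import RobotFileParser


def is_crawl_blocked(robots_txt, url, user_agent="Googlebot"):
    """Return True if robots_txt disallows user_agent from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


# Hypothetical rules: the printer-friendly duplicates are disallowed.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /print/
"""

# A noindex tag on this page would never be read, because the crawl is blocked.
blocked = is_crawl_blocked(SAMPLE_ROBOTS_TXT, "https://www.example.com/print/widget.html")
```

Any URL that is both blocked here and carries a noindex tag is a candidate for removing the Disallow so the tag can do its job.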
The GWMT URL parameters settings tell the crawler which parameters change the content on the page and which don’t. For the parameters that don’t, Googlebot then knows there is no point in crawling additional pages that differ only in that parameter’s value. Bing Webmaster Tools has a similar feature, “URL Normalization”.
My experience with the URL parameters settings is that they are best for keeping Google from crawling duplicate content in the first place, while the rel canonical tag is heeded more quickly to remove duplicate content from Google’s index. As with the robots.txt Disallow, the GWMT URL parameters settings mainly affect the crawling, not the indexing, of your content. So setting “No URLs” for a parameter does not remove URLs with that parameter from the index.
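To make the idea of parameters that don’t change content concrete, here is a rough sketch of how a site might collapse such URLs into a single form, for example to fill in the href of a rel canonical tag. The parameter names in `IGNORED_PARAMS`, the function name `canonical_url`, and the example.com URLs are all hypothetical; which parameters are safe to drop depends entirely on your site.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameters that do not change the content of the page.
IGNORED_PARAMS = frozenset({"sessionid", "sort", "utm_source"})


def canonical_url(url, ignored=IGNORED_PARAMS):
    """Drop content-neutral parameters and sort the rest into a stable order."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in ignored
    ]
    kept.sort()  # so ?a=1&b=2 and ?b=2&a=1 map to the same URL
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))
```

With this, `https://www.example.com/shoes?sort=price&color=red&sessionid=abc123` and `https://www.example.com/shoes?color=red` both reduce to the second URL.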
As you can see, the topic of managing duplicate content is a multi-faceted one. Ideally you should try to set up your website to avoid these problems in the first place. For example, on Microsoft implementations, determining a case standard up front is very important, so that you are not serving the same page with both mixed case and lower case URLs – which Google may decide are different. Fortunately we have many tools at our disposal to address these problems.
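As one illustration of settling on a case standard, the sketch below works out where a mixed case URL should 301 redirect to if you standardize on lower case paths. The function name `lowercase_redirect_target` and the example URL are made up for this post, and I am assuming the query string should be left alone because parameter values may be case sensitive on your site.

```python
from urllib.parse import urlsplit, urlunsplit


def lowercase_redirect_target(url):
    """Return the lower case URL to redirect to, or None if no redirect is needed."""
    parts = urlsplit(url)
    lowered_path = parts.path.lower()
    if lowered_path == parts.path:
        return None  # already in the standard form
    # Assumption: only the path is normalized; the query string is kept as-is.
    return urlunsplit(
        (parts.scheme, parts.netloc, lowered_path, parts.query, parts.fragment)
    )
```

A request handler or rewrite rule would issue a 301 to the returned URL whenever it is not `None`.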
Kathy Alice Brown is an SEO expert specializing in Technical SEO and Content. In her spare time she loves to get outside.