This post originally appeared in 2012. I’ve updated it to include more approaches to handling duplicate and similar content. May 8, 2014
A while back, I wrote about how Bing had duplicate content in its index despite the use of tools that should have removed it. Here I cover the topic of duplicate content on dynamic websites. As you probably know, duplicate content is a common technical SEO problem, and it is frighteningly easy to create on dynamic websites.
First some definitions
- Dynamic Websites are websites that are generated in whole or in part at the time of access. This is often done by assembling information retrieved from a database. A WordPress site is an example of a dynamic website. All the content is stored in a database and it is presented as needed in multiple forms.
- Duplicate Content here refers to duplicate content within a site (not duplicate content across domains). In this case, duplicate content is when more than one page has the same content. Or, put another way, there are multiple URLs that lead to the same page. SEOs also classify pages with very similar content as duplicate content. I touch on this briefly below, but it really deserves a separate post.
Why is duplicate content bad?
- Your domain is only given a fixed amount of PR (PageRank) / link juice. Duplicate content wastes that link juice on pages that you don’t need in the index.
- Extra pages that are the same as other pages waste the spiders’ time when they crawl your site. Googlebot and the other search engine crawlers may not get to your “good” pages because they have wasted time on your duplicate pages.
- To Google it is a low quality signal, which means that Google adjusts its opinion of your site downwards. I’ve seen at least one partial recovery from Panda after fixing a duplicate content problem on a site.
If you see that some of your pages end up in the dreaded supplemental index, then you may have a duplicate content problem.
What causes duplicate content on dynamic websites?
There are three common culprits for duplicate content:
- Multiple navigation paths are reflected in the URL. As a user navigates through a site, the directories are added to the URL. This is fine, but the problem comes in when there are multiple ways to navigate to the same page. So let’s say you have a cooking site and you have a category called vegetables and a category called soup. Your vegetable soup would fit into both categories and could be reached via two URLs, one under each category.
- URL parameters can cause duplicate content, especially on sites that use them as a tracking mechanism. The problem is that now you have two URLs that render as the same page, one with the URL parameter added and one without. On our cooking site, the vegetable soup page with a tracking parameter appended would be a second URL for the same page.
- Aggregate pages are pages that have summaries or excerpts of your content pages. This is more of an example of very similar content rather than duplicate content, but I wanted to mention it because it is a common problem in WordPress sites. Using excerpts is one approach to handling duplicate content on aggregate pages in WordPress.
Multiple navigation paths are one reason to avoid using categories in your WordPress permalinks.
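The first two culprits can be spotted mechanically in a list of crawled URLs. The sketch below is a minimal illustration in Python, under the assumption that the last path segment (the slug) identifies a page. The example.com URLs and the function name `duplicate_groups` are hypothetical and only stand in for the cooking site example.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def duplicate_groups(urls):
    """Group URLs by slug and return only the groups with more than one URL.

    Assumption: the last path segment uniquely identifies a page, so two
    URLs that share a slug are treated as the same page.
    """
    groups = defaultdict(list)
    for url in urls:
        # urlsplit().path excludes the query string, so tracking
        # parameters do not affect the grouping.
        path = urlsplit(url).path.rstrip("/")
        slug = path.rsplit("/", 1)[-1]
        groups[slug].append(url)
    return {slug: found for slug, found in groups.items() if len(found) > 1}


# Hypothetical URLs for the cooking site example.
crawled = [
    "https://example.com/vegetables/vegetable-soup",
    "https://example.com/soup/vegetable-soup",
    "https://example.com/soup/vegetable-soup?utm_source=newsletter",
    "https://example.com/soup/chicken-soup",
]

for slug, found in duplicate_groups(crawled).items():
    print(slug, "->", len(found), "URLs")
```

On a real site the slug assumption may not hold (two different pages can share a final path segment), so treat the groups as candidates to review rather than confirmed duplicates.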
Common fixes for duplicate content
There are several fixes you can consider for duplicate content.
- Use the rel canonical tag to specify which page is the authoritative page that should be indexed.
- Tag the page with <meta name="robots" content="noindex">. The next time Google crawls the page, it will remove the page from its index. This solution is straightforward when there are actually two separate duplicate pages, rather than different URLs leading to the same page. In the latter case, the canonical might work better for you.
- If you’ve had the duplicate content for some time and you know you have backlinks to the page you want to remove, a better solution (than the noindex tag) is an HTTP 301 redirect. Once implemented, both users and crawlers following those backlinks will be redirected to the page you are keeping.
- Another option is to remove the duplicate pages with the GWMT (Google Webmaster Tools) Remove URLs feature. This is a two-step process and works best when the URLs you want to remove are all in the same folder. Let’s say you accidentally published a set of pages twice on your website, one set in a folder called /sales and the other in /leads. You’ve decided to remove all the duplicate pages that are in /leads.
- First: Add a Disallow directive for the /leads/ folder to your robots.txt file.
- Next: Log in to GWMT as a fully privileged user and select the “Remove URLs” option under the “Google Index” menu. Enter the URL of the directory you want to remove (in this case /leads/) and then select “Remove Directory” from the drop-down.
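For the Disallow step, the rule presumably looks like the `Disallow` line in the sketch below, assuming the duplicate set lives under /leads/ as in the example. Python's standard `urllib.robotparser` module can confirm that the rule blocks what you expect before you file the removal request. The example.com URLs and the helper name `is_crawlable` are illustrative only.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content for the /sales vs /leads example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /leads/
"""


def is_crawlable(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """Return True if the given robots.txt text lets `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


print(is_crawlable(ROBOTS_TXT, "https://example.com/leads/page1.html"))  # False
print(is_crawlable(ROBOTS_TXT, "https://example.com/sales/page1.html"))  # True
```

The standard-library parser follows the basic prefix-matching rules, so a check like this is a sanity test of the rule, not a guarantee of how any particular search engine will interpret the file.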
Which approach to use?
There is no one solution that fits all situations. It depends on your technical capabilities (for some sites, 301 redirects are a problem to implement), and the nature of the duplicate content. Removing URLs via GWMT is a very fast solution (waiting for Google to recrawl pages with noindex tags on a very large site can take months!). However, this approach creates a black hole that spiders can’t enter. If you are removing a lot of duplicate, low quality pages that you want PageRank to flow through to the rest of the site, the noindex tag might be a better bet. And you don’t want to remove URLs that have a lot of good backlinks to them. I typically use the Remove URL feature for removing low quality sections of websites that have been hit by Panda, rather than for addressing some of the examples I give above.
Just Disallowing is not Enough
A common mistake that website owners make is to set up the Disallows in the robots.txt file and think they are done with fixing their duplicate content. Nothing could be further from the truth. All the Disallow does is prevent the spider from crawling the URLs; the URLs are NOT removed from the index. And don’t think adding the noindex tag to the page helps, because the spider is blocked from crawling the page and will never see the tag.
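The conflict between Disallow and noindex can be checked for directly. The following is a rough sketch, assuming you already have the HTML of each page on hand (for example from a site export). The names `RobotsMetaFinder`, `has_noindex`, and `hidden_noindex`, along with the sample pages, are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaFinder(HTMLParser):
    """Records whether a <meta name="robots"> tag containing noindex was seen."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            if "noindex" in (attributes.get("content") or "").lower():
                self.noindex = True


def has_noindex(html: str) -> bool:
    finder = RobotsMetaFinder()
    finder.feed(html)
    return finder.noindex


def hidden_noindex(robots_txt: str, pages: dict, agent: str = "Googlebot") -> list:
    """Return URLs whose noindex tag is unreachable because robots.txt blocks them."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [
        url
        for url, html in pages.items()
        if has_noindex(html) and not parser.can_fetch(agent, url)
    ]


# Hypothetical site snapshot: both pages carry noindex, but only /sales/ is crawlable.
NOINDEX_PAGE = '<html><head><meta name="robots" content="noindex"></head></html>'
pages = {
    "https://example.com/leads/a.html": NOINDEX_PAGE,
    "https://example.com/sales/a.html": NOINDEX_PAGE,
}
robots = "User-agent: *\nDisallow: /leads/\n"

print(hidden_noindex(robots, pages))
```

Any URL this reports carries a noindex tag that a compliant crawler cannot reach, so either the Disallow or the tag needs to change.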
What about GWMT URL parameters?
The GWMT URL parameters settings tell the crawler which parameters change the content on the page and which don’t. For the parameters that don’t, Googlebot now knows that there is no point in crawling additional pages that differ only in this parameter, even with a different value. Bing Webmaster Tools has a similar feature called “URL Normalization”.
My experience with the URL parameters settings is that they are best at keeping Google from crawling duplicate content in the first place, while the rel canonical tag is heeded more quickly to remove duplicate content from Google’s index. As with the robots.txt Disallow, the GWMT URL parameters mainly impact the crawling but not the indexing of your content. So setting “No URLs” for a parameter does not remove URLs with that parameter from the index.
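The idea behind the parameter settings, that some parameters change the content and others do not, can be mirrored in your own tooling, for instance when deciding which URL to put in a canonical tag. This is a minimal sketch rather than how Google or Bing implement it. The parameter names in `IGNORED_PARAMS` and the function name `normalize` are assumptions; substitute the parameters your site actually uses.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed list of parameters that do not change page content.
IGNORED_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid"}


def normalize(url: str) -> str:
    """Drop parameters that do not change content; keep the ones that do."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in IGNORED_PARAMS
    ]
    # The fragment is dropped as well, since it is never sent to the server.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


print(normalize("https://example.com/soup/vegetable-soup?utm_source=newsletter"))
print(normalize("https://example.com/recipes?page=2&sessionid=abc"))
```

Here `page` survives because it selects different content, while the tracking and session parameters are stripped, so every variant of a page collapses to one URL.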
As you can see, the topic of managing duplicate content is a multi-faceted one. Ideally you should try to set up your website to avoid these problems in the first place. For example, on Microsoft implementations, determining a case standard up front is very important, so that you are not serving the same page with both mixed case and lower case URLs, which Google may treat as different pages. Fortunately, we have many tools at our disposal to address these problems.
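As one illustration of enforcing a case standard, the sketch below decides whether a requested path should be answered with a 301 to its lowercase form. It assumes lowercase is the chosen standard and looks only at the path, since query string values can be case-sensitive. The function name `lowercase_redirect` and the sample paths are hypothetical, and wiring this into a real server is left out.

```python
def lowercase_redirect(path: str):
    """Return (301, lowercase_path) if the path has uppercase letters, else None."""
    lowered = path.lower()
    if lowered != path:
        return 301, lowered
    return None


print(lowercase_redirect("/Soup/Vegetable-Soup.aspx"))
print(lowercase_redirect("/soup/vegetable-soup.aspx"))
```

A request for the mixed case path gets a permanent redirect, and the lowercase path is served as-is, so only one version of each URL is ever returned with content.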