Most people outside the industry think that Technical SEO is all about on-page optimization; in other words, updating a page’s metadata and content to have the right keywords in it. However, there is another aspect of Technical SEO that has nothing to do with keywords and can be very powerful for larger dynamic websites. This discipline makes sure the site is friendly to search engine crawlers: the search bots can find all the pages, and no single page is reachable under multiple URLs, a problem known as duplicate content. Duplicate content within a site is more common than you might think, and I have repeatedly seen significant traffic increases from eliminating it. This is a case study of how we addressed Google’s over-indexation of duplicate content and increased traffic by more than 150% from its low point.
The site in question is a B2B eCommerce site that does a brisk business on weekdays but has much lower traffic on weekends and holidays. The chart below tracks its weekly non branded organic search traffic for the last 9 months. Here “non branded” means that we exclude organic traffic that arrives via company and domain name keywords. Note, however, that we do include product keywords, which are a significant part of the site’s traffic.
Like many online businesses, the site has seasonal ebbs and flows in traffic: spring is the strongest period, with a second ramp-up in early fall. So the decline in traffic in the late spring was at first not a concern. However, when traffic hit its nadir at just 1,100 non branded organic visits per week, well off the previous year’s pace, it got people’s attention.
Since it was the height of the Penguin hysteria, and with Google continuing to roll out Panda algorithm changes at a dizzying pace, we initially looked for drops in traffic that coincided with known Google algorithm releases. We did not find a strong correlation, so we started looking elsewhere.
Indexation ramps up and traffic drops
Google Webmaster Tools (GWMT) has a neat feature called Index Status, found under the Health menu, that shows the indexation of the site over time. It was clear that the number of pages Google had indexed for the site had increased dramatically at the same time that traffic was decreasing. Correlation is not always causation, but in this case the relationship seemed clear.
Several scenarios that caused duplicate content
- Several pages existed under both “generic” and “vanity” URLs.
- The site had links to URLs that were missing the trailing slash.
- There were many articles that appeared under more than one id (which was part of the URL).
- The site had a print function that rendered the page without the sidebar and header and added a parameter to the URL. While this is not “true” duplicate content, it is still two URLs that show the same core content so it can be treated as a duplicate content problem.
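Two of these scenarios, the missing trailing slash and the print parameter, can be spotted mechanically in a list of crawled URLs. The following is a minimal sketch, not the tooling used on this project: the parameter name `print` and the function names are assumptions made for illustration. It does not catch vanity URLs or articles published under several ids, since those need knowledge of the site's content.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical name of the print-view query parameter; the real one is not given.
PRINT_PARAM = "print"


def normalize(url):
    """Reduce a URL to a comparison key that ignores the two variations above."""
    parts = urlsplit(url)
    # Treat /page and /page/ as the same page; keep the bare root as "/".
    path = parts.path.rstrip("/") or "/"
    # Drop only the print-view parameter and keep every other parameter.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != PRINT_PARAM
    ]
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), "")
    )


def find_duplicates(urls):
    """Group URLs by normalized key and return only groups with several members."""
    groups = defaultdict(list)
    for url in urls:
        groups[normalize(url)].append(url)
    return {key: members for key, members in groups.items() if len(members) > 1}
```

Feeding it a crawl export or a list of URLs from the server logs gives a quick first estimate of how many duplicate groups exist before deciding how to fix each one.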
There are a couple of different approaches to the first three problems. The brute force method is to just 404 the undesired pages (return a “not found” HTTP status code). Google sees these and eventually drops them from the index. However, this also drops any link juice from incoming external links into a black hole. A better approach is to use a 301 redirect from the old page to the new one, which gets both the search engines and your human visitors to the new page.
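To make the difference between the two responses concrete, here is a small sketch of the decision a server could make for each request. The paths, the `REDIRECTS` table, and the `resolve` function are all hypothetical, and the sketch assumes the preferred form of every URL ends with a trailing slash.

```python
from http import HTTPStatus

# Hypothetical set of preferred URLs (assumed to end with a trailing slash).
KNOWN_PATHS = {"/products/widgets/", "/articles/widget-guide/"}

# Hypothetical map from retired duplicate paths to the preferred path.
REDIRECTS = {
    "/promo/widgets": "/products/widgets/",
    "/articles/1234/widget-guide": "/articles/widget-guide/",
}


def resolve(path):
    """Return (status, location) for a request path."""
    if path in KNOWN_PATHS:
        return HTTPStatus.OK, None
    if path in REDIRECTS:
        # Retired duplicate: send crawlers and visitors to the preferred URL.
        return HTTPStatus.MOVED_PERMANENTLY, REDIRECTS[path]
    if not path.endswith("/") and path + "/" in KNOWN_PATHS:
        # Same page requested without its trailing slash.
        return HTTPStatus.MOVED_PERMANENTLY, path + "/"
    # Nothing sensible to redirect to, so report the page as missing.
    return HTTPStatus.NOT_FOUND, None
```

With this table, a request for `/promo/widgets` gets a 301 pointing at `/products/widgets/`, so an external link to the old URL still leads somewhere useful, while a path with no counterpart falls through to a 404.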
For the print parameter, we updated the parameter in GWMT to be set as “NoURLs” – telling Google we didn’t want it to crawl any URLs that had this parameter.
We also cleaned up broken links and made sure the site was linking to the right URLs. The site’s traffic started recovering, increasing 60% to between 1,800 and 2,000 organic non branded visits a week, and we planned to revisit the project in 30 days.
Finding more duplicates in Google’s index
Despite the nice recovery in traffic, I was sure there was more to be done. The number of pages returned by the site: command and shown by GWMT’s Index Status was still much higher than it should have been. So when I returned to the project in the early fall, I started really looking through what Google had indexed for the site, relying heavily on the inurl search operator to look at patterns of URLs. I soon stumbled on the causes of the over-indexation.
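For readers who have not used these operators, the queries are just text typed into the Google search box. The sketch below builds a list of them to work through by hand. The helper name and the example fragments are assumptions, not the exact queries used on this project.

```python
def index_queries(domain, fragments):
    """Build one site: query for the whole domain plus one per URL fragment."""
    queries = [f"site:{domain}"]
    for fragment in fragments:
        # Narrow the site: results to indexed URLs containing the fragment.
        queries.append(f"site:{domain} inurl:{fragment}")
    return queries


if __name__ == "__main__":
    # Hypothetical fragments worth checking on a site like the one described.
    for query in index_queries("www.mysite.com", ["print", "yahoo"]):
        print(query)
```

Each narrowed query shows whether a suspicious URL pattern actually made it into the index and roughly how many pages share it.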
A few of the broken links we had fixed were external links erroneously generated as relative links. To illustrate what I mean by this: say the link was supposed to be to www.yahoo.com; since the site thought this was an internal link, the link ended up being generated as www.mysite.com/articles/www.yahoo.com/article. No big deal, right? Just fix the problem and move on. The problem is that this malformed URL still RENDERED, and not only did it render (the page displayed and no error code was sent), but the www.yahoo.com was carried forward and placed into the URLs of all the other generated links on the page. Which all worked. So Googlebot had a little feast crawling all these duplicate, malformed, yet working URLs until we found and fixed the links in our first pass. But the damage had been done. If a URL is in Google’s index and “works” (returns an HTTP status code 200), Google will keep it in the index even if you can no longer reach that URL from the site.
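A pattern like this is easy to audit for once you know it exists. As a rough sketch, the check below flags any URL path containing a segment that looks like a hostname. The regular expression is a heuristic with a deliberately short, assumed list of top-level domains, and the function name is hypothetical.

```python
import re

# Heuristic: a path segment shaped like "www.example.com" or "example.org".
# The list of top-level domains is an assumption and far from complete.
HOST_LIKE = re.compile(
    r"^(?:www\.)?[a-z0-9-]+(?:\.[a-z0-9-]+)*\.(?:com|net|org)$",
    re.IGNORECASE,
)


def path_has_embedded_host(path):
    """Return True if any segment of a URL path looks like a hostname."""
    segments = [segment for segment in path.split("/") if segment]
    return any(HOST_LIKE.match(segment) for segment in segments)
```

Running this over the paths in a crawl or in the server logs lists the malformed URLs that still answer with a 200, which is the set that needs a canonical, a redirect, or an error response.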
I found a couple of variants of this problem, which ended up adding about 2,000 extra duplicate pages to the index. For this situation we used rel=canonical to tell Google which URL was the one it should keep in the index. We implemented the canonical on all the article pages that had the problem. Soon after implementation we saw another nice spike in traffic (which unfortunately is somewhat marred by the temporary analytics glitch), and now that the holidays are past, the site is averaging more than 3,000 non branded organic visits a week, which is more than a 150% gain from its low point.
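The canonical itself is a single link element in the head of each page. Below is a minimal sketch of how a template helper might produce it. The function name and the example paths are assumptions, and how a real site looks up the preferred path for an article depends on its CMS.

```python
from html import escape
from urllib.parse import urlsplit, urlunsplit


def canonical_link(requested_url, canonical_path):
    """Build the canonical link element for a page served at requested_url."""
    parts = urlsplit(requested_url)
    # Keep the scheme and host of the request, but always point at the
    # preferred path with no query string or fragment.
    canonical_url = urlunsplit((parts.scheme, parts.netloc, canonical_path, "", ""))
    return f'<link rel="canonical" href="{escape(canonical_url, quote=True)}">'
```

Every variant of an article, including the malformed ones, then emits the same element, so Google gets one consistent signal about which URL to keep.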
Google still hasn’t dropped all the duplicates from the index; there are still hundreds of pages. As these continue to get removed, it will be interesting to see if the traffic increase continues. Unfortunately, the GWMT URL parameter configuration for the print parameter has not helped at all with indexation, so we may need to come up with another approach for those duplicates. Still, this case study does demonstrate the power of this flavor of technical SEO and why it’s important to have ongoing SEO-aware maintenance for your dynamic website.
I never thought that duplicate content was such a minefield! Thanks for the information in this well-
As a non-techie, I learned a lot from your post. Also, I just wanted to second the importance of treating each product as a unique entity. Creating engaging, informative, unique content for each product goes a long way to ensuring good Google rankings.
Thanks for the comment! I would agree, making sure you have separate, well optimized landing pages for each of your products is great for SEO.