I took a look and sure enough the site’s robots.txt file was set up to block Google and the other search engines from crawling the entire site. Fortunately the fix was easy. I changed the file from this:
(You can also just remove the file.)
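If you want to confirm whether a robots.txt file blocks an entire site, here is a minimal sketch using Python's standard `urllib.robotparser`. The two rule sets below are assumptions: they are the typical "block everything" and "allow everything" forms, not necessarily the exact contents of the site's file, and `example.com` and the `can_crawl` helper are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_txt: str, url: str, user_agent: str = "Googlebot") -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Assumed "block the whole site" form: every path starts with "/".
BLOCK_ALL = """User-agent: *
Disallow: /
"""

# Assumed "allow the whole site" form: an empty Disallow blocks nothing.
ALLOW_ALL = """User-agent: *
Disallow:
"""

URL = "https://example.com/some-page/"  # hypothetical URL

print(can_crawl(BLOCK_ALL, URL))  # False
print(can_crawl(ALLOW_ALL, URL))  # True
```

Note that this only tells you whether a crawler may fetch the URL. It says nothing about whether the URL can end up in the index.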
I might be going out on a limb here, but I’ve seen more problems caused by misuse of the robots.txt file than solved.
One of the big misconceptions about robots.txt disallow directives is that they are a foolproof way to keep pages out of the Google index. Not only is this untrue, but when the pages are indexed, they are indexed with almost no information. That adds a lot of low-quality, near-duplicate content to the index, which might drag down the SEO performance of your site.
The robots.txt file has been around for years. In its early days, bandwidth was more precious and Googlebot often taxed servers, even crashing them, when it crawled a site. Using the disallow directive to keep Google from crawling pages often helped keep a site up. Those concerns are a distant memory today.
When you add a disallow directive to your robots.txt, you are telling Googlebot and other search bots not to crawl that page, or the pages in that directory. For example, when I originally wrote this post, my robots.txt included:
The first directive disallows any URL that starts with sitename.com/wp-admin, including anything in the /wp-admin/ directory. The second disallows any URL that has a question mark in it (useful to avoid crawling the original ?p= permalink structure). One of the better explanations of the various patterns you can use in robots.txt for allows and disallows can be found in the Google developer information on robots.txt.
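To make the pattern matching concrete, here is a rough sketch of how Google-style disallow patterns behave: a plain path is a prefix match, `*` matches any run of characters, and a trailing `$` anchors the end of the URL. The two rules in `DISALLOW` are my reconstruction from the description above, so treat them as assumptions. The helper names are hypothetical, and the sketch ignores Allow rules and rule precedence. (Python's built-in `urllib.robotparser` does not handle `*` wildcards, which is why this uses a small regex translation instead.)

```python
import re
from typing import Iterable


def robots_pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and let each '*' match any run of characters.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_disallowed(path: str, disallow_patterns: Iterable[str]) -> bool:
    """True if any disallow pattern matches from the start of the path."""
    return any(robots_pattern_to_regex(p).match(path) for p in disallow_patterns)


# Assumed rules, reconstructed from the prose description (not the real file).
DISALLOW = ["/wp-admin", "/*?"]

for path in ["/wp-admin/options.php", "/wp-administrator", "/?p=123", "/blog/post/"]:
    print(path, is_disallowed(path, DISALLOW))
```

The `/wp-administrator` case shows why "starts with" matters: a pattern without a trailing slash blocks more than just the directory.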
But telling Google and the other bots that they can’t crawl the page doesn’t necessarily prevent them from indexing the page. Not crawling and not indexing are two different things. The first means the spider will not visit the page at all; the second means that Google or Bing will not make the page available in the search results.
This is when we run into problems.
Just because your robots.txt prevents the spider from visiting the page doesn’t mean that Google or Bing doesn’t know about it. The search engines can learn about the page from external sites linking to it and even from your internal links (especially if the link does not have the rel="nofollow" attribute). Google, in particular, is a greedy little monster and will voraciously index anything it finds a reference to, even if it can’t crawl the page. So you end up with references in the Google index that have the URL (not the page title, because Google can’t see it!) and a snippet that says this:
A description for this result is not available because of this site’s robots.txt – learn more.
You don’t want a lot of these in the Google index.
You have three ways to get the URLs out of the Google index.
Incoming links to the page may not be the only way that a URL blocked by robots.txt gets indexed. Here are a couple of surprising ones:
I would also remove any Adsense on a page you intend to keep out of Google’s index. I don’t have any evidence that Google Adsense causes a page to be indexed, but I would remove it anyway.
In Google Search Console (formerly known as Google Webmaster Tools) you can check the “Blocked Resources” report in the Google Index section to double-check that you are not blocking anything that Google considers important.
There are two scenarios that I can think of where robots.txt disallows are still useful:
Also, updates to the robots.txt file are not processed instantly. I’ve seen cases where Google crawled a number of URLs before processing the disallows. So add your disallows at least 24 hours in advance.
With the Google Search Console Remove URLs feature you can remove a page, a subfolder or an entire site from Google’s index, as long as the site is blocked by robots.txt or the page returns a 404 Not Found HTTP status code. You need admin privileges to submit removal requests. And keep in mind that the removal can be temporary. More information on the Remove URLs feature can be found here.
The robots.txt file is old and its usefulness has diminished. Yes, there are still scenarios where disallows are useful, but they are often misused.
This post was originally published on September 10, 2012 and was updated on May 26, 2016.
Spider image courtesy of openclipart.org