I took a look and sure enough the site’s robots.txt file was set up to block Google and the other search engines from crawling the entire site. Fortunately the fix was easy. I changed the file from this:
User-agent: *
Disallow: /
To this:
User-agent: *
Disallow:
Problem solved.
(You can also just remove the file.)
I might be going out on a limb here, but I’ve seen the robots.txt file cause more problems than it has solved.
One of the big misconceptions about robots.txt disallow directives is that they are a foolproof way to keep pages out of the Google index. Not only is that untrue, but when blocked pages do get indexed, they are indexed with almost no information, adding a lot of low-quality, near-duplicate content to the index that can drag down your site’s SEO performance.
The robots.txt file has been around for years. In those early days, bandwidth was more precious and Googlebot often taxed servers, even crashing them, when it crawled a site. So using the disallow directive to keep Google from crawling pages often helped keep a site up. Those concerns are a distant memory today.
Crawled and Indexed are two different things
When you add a disallow directive to your robots.txt, you are telling Googlebot and other search bots not to crawl that page, or the pages in that directory. For example, when I originally wrote this post, my robots.txt included:
Disallow: /wp-admin
Disallow: /*?
The first directive disallows any URL that starts with sitename.com/wp-admin, including anything in the /wp-admin/ directory. The second disallows any URL that has a question mark in it (useful to avoid crawling the original ?p= permalink structure). One of the better explanations of the various patterns you can use in robots.txt allows and disallows can be found in the Google developer information on robots.txt.
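To make the pattern matching concrete, here is a sketch of what those two rules do and do not match (the example URLs are hypothetical):

User-agent: *
# Prefix match: blocks /wp-admin, /wp-admin/edit.php, and even
# /wp-admin-notes/ (anything whose path begins with /wp-admin)
Disallow: /wp-admin
# Wildcard match: blocks any URL containing a "?", such as /?p=123
# or /some-post/?replytocom=42, but not /some-post/
Disallow: /*?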
But telling Google and the other bots that they can’t crawl a page doesn’t necessarily prevent them from indexing it. Not crawling and not indexing are two different things. The first means the spider will not visit the page at all; the second means that Google or Bing will not make the page available in the search results.
This is when we run into problems.
Just because your robots.txt prevents the spider from visiting the page doesn’t mean that Google or Bing doesn’t know about it. The search engines can learn about the page from external sites linking to it and even from your own internal links (especially if the link does not have a rel nofollow attribute). Google, in particular, is a greedy little monster and will voraciously index anything it finds a reference to, even if it can’t crawl the page. So you end up with entries in the Google index that have the URL (but not the page title, because Google can’t see it!) and a snippet that says this:
A description for this result is not available because of this site’s robots.txt – learn more.
You don’t want a lot of these in the Google index.
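As an aside, if you want to ask crawlers not to follow a particular internal link (a login link, say), the markup looks like this, with a purely illustrative URL:

<a href="/login" rel="nofollow">Log in</a>

Keep in mind this is only a hint; it doesn’t by itself keep the target URL out of the index.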
What to do if your blocked URLs get indexed
You have three ways to get the URLs out of the Google index.
- Often the best way is to add a meta robots noindex tag to your page’s HTML head section (an example appears after this list). This tells the spiders not to put the URL into their index. IMPORTANT: The spider has to see the tag to process the “noindex”, so you MUST remove the disallow directive from your robots.txt file so the spider can reach the page and understand that it should remove the URL from the index.
- If the page has been removed, remove the disallow and let Googlebot and the other search bots crawl it and see the 404 (or, even better, a 410). It’s not harmful to have Not Found or Gone pages on your site, especially if they were low-quality pages. These will eventually drop out of the index.
- Another method is to use the URL removal tool in your Google Webmaster Tools account (Bing Webmaster Tools also has a removal tool). With this approach you want to keep the disallows in place, as this is a requirement for the removal. Note that there have been some reports of URLs reappearing in the index after the 90-day period, so your mileage may vary.
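Here is a minimal sketch of the noindex tag from the first option; the page it sits on is hypothetical, and remember the crawler must be able to fetch the page (no robots.txt disallow) for the tag to be seen:

<!-- In the <head> of the page you want dropped from the index -->
<meta name="robots" content="noindex, follow">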
Other surprising ways your URL gets indexed
Incoming links to the page may not be the only way that a URL blocked by robots.txt gets indexed. Here are a couple of surprising ones:
- You have a Google +1 button on your page. Google assumes that a page with this button should be publicly available, even if it is noindexed as well.
- The URL is in your sitemap. Posts in this webmasterworld thread report that Google indexed blocked URLs in a sitemap.
I would also remove any AdSense code from a page you intend to keep out of Google’s index. I don’t have any evidence that AdSense causes a page to be indexed, but I would remove it anyway.
Don’t block JavaScript files and other resources with robots.txt disallows
It used to be common practice to use robots.txt disallows to keep web crawlers away from non-HTML files like CSS, JavaScript, and image files. However, on October 27, 2014, Google updated its Technical Webmaster Guidelines to recommend against this practice, as its indexing system now behaves more like a modern browser. In the October announcement Google states: “Disallowing crawling of Javascript or CSS files in your site’s robots.txt directly harms how well our algorithms render and index your content and can result in suboptimal rankings.”
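As a rough before-and-after sketch of what this means in practice (the WordPress-style directory names are my own assumption, not taken from any particular site):

# Old practice: keeps crawlers away from theme CSS and JavaScript,
# which now hurts rendering and indexing
User-agent: *
Disallow: /wp-includes/
Disallow: /wp-content/themes/

# Current recommendation: drop those disallows so the bots can fetch
# the CSS, JavaScript and images they need to render the page
User-agent: *
Disallow: /wp-admin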
In Google Search Console (formerly known as Google Webmaster Tools) you can check the Blocked Resources report to double-check that you are not blocking anything Google considers important.
When to use robots.txt disallows
There are two scenarios that I can think of where robots.txt disallows are still useful:
- You want to remove a site or a section of a site quickly: It’s a lot faster to use a robots.txt disallow plus Google Search Console’s Remove URL feature to remove a site or a section of a site from the Google index than to add a meta robots noindex tag and wait for Googlebot to recrawl the pages and heed the noindex. I had a client that had been hit by Panda. We found that they had a section of their site that was mostly duplicated across their portfolio of sites; when we removed that section from the Google index using a robots.txt disallow and the GSC Remove URL feature, their site traffic recovered within a month. Another common scenario is finding that a staging or development site (often a subdomain) has gotten indexed and needing to remove it from the Google search results.
- You want to conserve crawl bandwidth: A common scenario I see is sites that create a separate “return path” URL every time a user clicks a Login link, one for each source page that contains the link. Normally I would advise just adding the meta robots noindex tag to that login page (and all of its variants), but a possible concern is that crawling those pages wastes the crawl bandwidth Googlebot has allocated to your site. I still think the meta robots noindex tag is the way to go; however, with large, complex sites it is possible to have filters and parameters that create an endless number of pages that Googlebot shouldn’t crawl. A robots.txt disallow might be appropriate in some of these cases (a sketch appears after this list).
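A hedged sketch of that kind of disallow, with made-up paths and parameter names for illustration:

User-agent: *
# One login URL gets generated per source page, e.g. /login?return=/some-article/
Disallow: /login?
# Faceted navigation parameters that can create endless combinations
Disallow: /*?filter=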
Also, updates to the robots.txt file are not processed instantly. I’ve seen cases where Google crawled a number of URLs before processing the disallows, so add your disallows at least 24 hours in advance.
With the Google Search Console Remove URLs feature you can remove a page, a subfolder, or an entire site from Google’s index, as long as the site is blocked by robots.txt or the page returns a 404 Not Found HTTP status code. You do need admin privileges to submit the removal requests. And keep in mind that the removal can be temporary. More information on the Remove URLs feature can be found here.
Use robots.txt disallow sparingly
The robots.txt file is old and its usefulness has diminished. Yes, there are still scenarios where disallows are useful, but they are often misused.
This post was originally published on September 10, 2012 and was updated on May 26, 2016.
Spider image courtesy of openclipart.org
Kathy,
Your recommendations for getting pages out of the index, despite the robots file, work most of the time. But not ALL of the time. It seems that pages somehow come back into the index.
That’s why I’m now experimenting with the X-Robots-Tag HTTP header. You can insert a noindex directive via your .htaccess file, which should prevent pages from being indexed.
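A minimal sketch of that approach, assuming Apache with mod_headers enabled (the file pattern is just an example):

# In .htaccess: send a noindex header with every PDF on the site
<FilesMatch "\.pdf$">
  Header set X-Robots-Tag "noindex"
</FilesMatch>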
Thanks for the informative post. It’s very helpful, keep it up!
It’s always better to allow all files and URLs in robots.txt and just control pages through a meta robots tag.