What is Google indexation?
By indexation we mean the pages of your site that Google has crawled and put into its index. When you type a search term into Google, it uses the index as a repository from which to retrieve pages to show in the search engine results pages. Just because a search engine has crawled a page on your site doesn’t mean that it puts the page into the index. For one thing, you might have told it not to index the page. Or Google might have decided the page is not worthy of being indexed.
What has Google indexed from your site?
To see whether Google has indexed your site at all, type a site: query into Google, substituting your own domain name for “foo.com” (for example, site:foo.com).
If you get a list of URLs, you are in business. If you don’t, the simplest way to get your site indexed is to add a link back to your site from a social media network such as Twitter or LinkedIn. Assuming your site is indexed, the next step is to click through the result pages (using the page links at the bottom of the search results page) to see what Google has indexed. You might be surprised.
When I ran the site: command below for yahoo.com, I got 222,000 pages.
site:www.yahoo.com – Google Search.
When I did this and started looking through the pages of search results, I found that many apparent variants of the Yahoo home page had been indexed (sg3.www.yahoo.com?), which is a form of duplicate content. Soon I also started running into the all-too-familiar:
A description for this result is not available because of this site’s robots.txt – learn more.
Now, I did not look into the Yahoo case (and Yahoo may not care at all about its Google indexation), but recently it seems that Google is indexing “blocked” URLs more aggressively.
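To see whether a given URL falls under a robots.txt block in the first place, a check along the following lines may help. This is only a sketch using Python's standard urllib.robotparser; the robots.txt content and the example.com URLs are made-up placeholders, not Yahoo's actual rules. Keep in mind that robots.txt blocks crawling, not indexing, which is why a blocked URL can still show up in the results with the message above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def is_blocked(robots_txt, url, user_agent="Googlebot"):
    """Return True if robots_txt disallows user_agent from crawling url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)

# example.com is a placeholder domain.
print(is_blocked(ROBOTS_TXT, "https://example.com/private/page.html"))
print(is_blocked(ROBOTS_TXT, "https://example.com/public.html"))
```

A URL for which this returns True cannot be crawled by a compliant bot, but it may still be indexed if other pages link to it.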
Here’s another case to look at.
site:www.wikipedia.org – Google Search.
When I did this, I got 9 results (9 whole pages from Wikipedia!). But note the dreaded supplemental results message at the bottom. This message indicates a duplicate content problem (Google sees similar or duplicate pages when visiting multiple URLs). When you see it, click on the link “repeat the search with the omitted results included.” This time we see 6,270 pages. So wikipedia.org probably has around 6,000 pages indexed by Google.
Probably? Well, the problem with the site: command is that it is not very reliable. It’s very common to get different results for different runs of the command (the theory is that you are accessing different Google servers each time). To compound the issue, the result numbers can change as you click through the result pages. For example, for clothing site loft.com I initially get 53,600 pages, which changes to 92,300 when I access the last result page by adding &start=990&filter=0 at the end of the Google query. That’s a big difference!
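If you check these numbers often, it can be handy to build the query URLs rather than edit them by hand. The following is a minimal sketch using Python's standard library; the function name site_query_url and its parameters are my own invention, and the only Google parameters it relies on are the q, start and filter ones mentioned above.

```python
from urllib.parse import urlencode

def site_query_url(domain, start=0, include_omitted=False):
    """Build a Google search URL for a site: query on the given domain.

    start           -- offset of the first result (990 jumps to the last page)
    include_omitted -- add filter=0 to include the omitted (similar) results
    """
    params = {"q": f"site:{domain}"}
    if start:
        params["start"] = start
    if include_omitted:
        params["filter"] = 0
    return "https://www.google.com/search?" + urlencode(params)

# "First page" query, then the "click through" query for the last page.
print(site_query_url("loft.com"))
print(site_query_url("loft.com", start=990, include_omitted=True))
```

The colon in the query is percent-encoded as %3A, which Google treats the same as typing site:loft.com into the search box.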
Fortunately, Google has provided (as of July 2012) more accurate indexation reporting in Google Webmaster Tools. Called “Index Status” and located under the Health menu, this report shows your indexation count over time.
The tool also has an advanced view that shows URLs not selected, ever crawled, removed and blocked by robots.txt. The “not selected” stat might bear further looking into if it is higher than you expected, as it indicates Google is choosing not to index some of your site’s pages.
Why site: can still be useful
Why even look at site: results when they are unreliable and a more accurate tool is available? The problem with the GWMT Index Status report is that it is only updated once a week (usually Sunday), and changes in indexation take a long time to show up, especially when you are dealing with a large number (thousands and thousands) of pages that have been added or marked with the noindex tag. I recently worked with a client that had an overindexation problem, which we solved mostly (although not completely) by tagging the pages we wanted out of the index with the meta robots noindex tag. The drops in indexation counts first showed up in the site: “click through” results (what you see when you click through to the last page, e.g. page 70 for loft.com). If this number is below 1,000 (even when the “first page” number is in the thousands), it is likely to be the more accurate number. Next you should see drops in the “first page” site: indexation count. Finally, the GWMT Index Status report will show the drop in indexation.
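Before waiting for the counts to drop, it is worth confirming that the pages you want removed really carry the noindex tag. Here is a rough sketch that scans a page's HTML for a meta robots (or googlebot) tag containing noindex. The class and function names are hypothetical, it uses only Python's built-in html.parser, and it does not look at the X-Robots-Tag HTTP header, which can also carry a noindex directive.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots"> / <meta name="googlebot"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() in ("robots", "googlebot"):
            content = attr.get("content") or ""
            # Directives are comma-separated, e.g. "noindex, follow".
            self.directives.update(
                d.strip().lower() for d in content.split(",") if d.strip()
            )

def has_noindex(html):
    """Return True if the HTML contains a meta robots tag with noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives

page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(has_noindex(page))
```

You would feed this the HTML of each page you tagged, fetched however you prefer, and follow up on any page where it returns False.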
Update: You do indeed get very few results (6 or 9) from the site: query for www.wikipedia.org. However, if you do the site: query with wikipedia.org (drop the www), you get a more believable result of 62,000 pages. This is kind of interesting, since there is a 301 redirect from wikipedia.org to www.wikipedia.org.