When we first build a website, the thought of actually telling Google and the other search engines not to spider a given web page seems counter-intuitive. Why would anyone want Google to skip their website? (Well, except when you are Rue La La.)
Here’s one reason. A more sophisticated website might have a login page or registration page. These pages usually shouldn’t be indexed, since they add no value for ranking on keywords. Compounding the issue, in one case I looked at, the single registration page was manifesting as many registration pages, because the site was appending a return URL as a query parameter (so that after registering, the user would be sent back to the calling page), creating duplicate content.
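To make that concrete, here is a hypothetical version of what the spider was seeing (the domain and parameter name are made up for illustration):

http://www.example.com/register?returnurl=/articles/first-post
http://www.example.com/register?returnurl=/articles/second-post
http://www.example.com/register?returnurl=/pricing

Three distinct URLs, one identical registration page; the spider dutifully treats each one as a separate page to crawl and index.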
If you have many URLs that all point to the same page, that is known as duplicate content (this is different from duplicate content across many websites, and worse), and it is definitely to be avoided. Each site gets limited link juice and a limited spider crawl budget; you don’t want to waste either on yet another version of a page the spider has already seen.
So to tell the spiders you don’t want a page indexed, you put the noindex meta tag into the HTML source code (between the opening and closing <head> tags) of that page:
<META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">
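If it helps to see it in context, here is a minimal sketch of where the tag sits in a hypothetical registration page’s source (the title and page contents are placeholders):

<html>
<head>
  <title>Register</title>
  <!-- tell spiders: don't index this page, but do follow its links -->
  <meta name="robots" content="noindex, follow">
</head>
<body>
  ... registration form and links ...
</body>
</html>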
Why the follow? So that link juice from external incoming links and from internal links can pass through to the links on the page you are noindexing. Otherwise you create a dead end that stops the link juice cold. The registration page itself might not be important, but it might link to articles that are.
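For contrast, the dead-end version you’ll commonly see quoted looks like this; use it only when you truly want the spider to ignore the page’s links as well:

<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">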
I wanted to point this out explicitly, because if you search on “meta tag no index” you will find lots of examples of “noindex, nofollow”. Lindsay Wassell makes a compelling case for the right use of this tag in her seomoz.org article and explains why using robots.txt instead is not a viable alternative.
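For comparison, the robots.txt route would look something like this (assuming, for illustration, that the registration page lives at /register):

User-agent: *
Disallow: /register

The catch is that robots.txt only blocks crawling. A URL blocked this way can still show up in the index if other sites link to it, and because the spider never fetches the page, none of its link juice flows anywhere.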
More on the noindex meta tag.