Check your WordPress robots.txt file

I recently looked at a client’s robots.txt file for a WordPress site. He had all sorts of things blocked (using Disallow directives), including:


/*/feed/$
/*/comments/$
/*/trackback/$

As I have written before, robots.txt disallows are generally not as useful as people think they are. True, they prevent Google and other search engine spiders from crawling those URLs, but that doesn’t necessarily mean those pages stay out of the index (which is usually the reason for the Disallow directive). These days we have much better tools at our disposal: the meta robots noindex tag and the canonical link element.
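To see exactly what a Disallow controls, here is a minimal sketch using Python’s standard-library `urllib.robotparser` (the example.com URLs are placeholders of my own): a disallowed URL simply won’t be fetched by a well-behaved crawler, which says nothing about whether the URL still ends up indexed via links from elsewhere.

```python
import urllib.robotparser

# Parse a small robots.txt held in memory (paths are illustrative).
rules = """\
User-agent: *
Disallow: /wp-admin
""".splitlines()

parser = urllib.robotparser.RobotFileParser()
parser.parse(rules)

# Disallow only governs crawling, not indexing.
print(parser.can_fetch("*", "http://example.com/wp-admin/options.php"))   # False
print(parser.can_fetch("*", "http://example.com/domain-forwarding-seo/")) # True
```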

Perhaps the creator of this robots.txt was concerned about the “extra” URLs that are created for a post when someone comments or a trackback is created. For example, my post about Domain Forwarding and SEO has gotten a number of comments. When you look at the page source, the comment URLs look something like:

http://webenso.com/domain-forwarding-seo/#comment-19503
/domain-forwarding-seo/?replytocom=19503#respond

The first URL is actually not a concern at all, as the search engines generally ignore everything after the hash. But even if that were not true, neither of these URLs is going to harm your SEO in a modern WordPress installation, because WordPress automatically adds a canonical URL to both pages and posts. The canonical URL, which for this post is:

<link rel="canonical" href="http://webenso.com/domain-forwarding-seo/" />

tells the search engines this is the URL to index even if the page was reached by a different URL such as one of the above.
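As a rough sketch of how a crawler picks up that tag, here is a small extractor built on Python’s standard `html.parser` module (the `CanonicalFinder` class name is mine, not a real library API):

```python
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    """Record the href of the first <link rel="canonical"> encountered."""
    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link" and attrs.get("rel") == "canonical" and self.canonical is None:
            self.canonical = attrs.get("href")

finder = CanonicalFinder()
finder.feed('<link rel="canonical" href="http://webenso.com/domain-forwarding-seo/" />')
print(finder.canonical)  # http://webenso.com/domain-forwarding-seo/
```

However the page was reached, the crawler consolidates on the href it finds here.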

You definitely don’t want to block Google (and especially Bing) from crawling your feed URLs – a feed is a great map for Google to find all your pages and posts – and Bing is on record as preferring RSS feeds as sitemaps. So don’t disallow any feed URLs in your WordPress robots.txt.

I just updated my own robots.txt file and, to my chagrin, found several things in there that shouldn’t have been. I also decided to remove the Disallow for /*?* because, as I pointed out above for the ?p URLs, the canonical should take care of it (although this is something I should confirm). A quick update, and now my WordPress robots.txt file looks pretty similar to this:


User-agent: *
Disallow: /cgi-bin
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content/themes
Disallow: /wp-content/plugins/
Allow: /wp-content/uploads
Allow: /feed*

(Mine is a little different because I use a subdirectory.) Note that these are all “internal” URLs that shouldn’t ever be publicly facing.
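As a sanity check, the prefix rules above can be run through `urllib.robotparser`. One caveat: the stdlib parser does plain prefix matching and does not expand the `*` wildcard (Google’s own matching rules are richer), so I leave the `Allow: /feed*` line out of this sketch and test only the literal prefixes against placeholder example.com URLs.

```python
import urllib.robotparser

# The prefix rules from the file above; the wildcard "Allow: /feed*" line is
# omitted because the stdlib parser doesn't understand "*".
rules = """\
User-agent: *
Disallow: /cgi-bin
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content/themes
Disallow: /wp-content/plugins/
Allow: /wp-content/uploads
""".splitlines()

parser = urllib.robotparser.RobotFileParser()
parser.parse(rules)

# Internal URLs are blocked...
print(parser.can_fetch("*", "http://example.com/wp-admin/"))                    # False
print(parser.can_fetch("*", "http://example.com/wp-content/themes/style.css"))  # False
# ...while uploads and ordinary posts stay crawlable.
print(parser.can_fetch("*", "http://example.com/wp-content/uploads/logo.png"))  # True
print(parser.can_fetch("*", "http://example.com/a-blog-post/"))                 # True
```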

Hat tip to Yoast for suggesting the inclusion of the plugins directory.

Photo Credit: exalted Flickr Creative Commons License

About the Author Kathy Alice

Kathy Alice Brown is a traffic and conversion expert specializing in SEO, Copywriting and Facebook Ad Campaigns. In her spare time she loves to get outside.
