Robots.txt is a plain text file located in the root of a web server that allows the administrator of a website to define which content is allowed or disallowed for the bots that visit the site.
All major search engines, including Google, Yahoo and MSN, honor the Robots Exclusion Protocol. There are several elements every website owner needs to understand to make crawling of their website easier. The following are the top 10 common mistakes to avoid when creating a robots.txt file.
1. Placing robots.txt outside the root directory – This is one of the most common mistakes webmasters make: they upload the robots.txt file to the wrong place. It must reside in the root of the domain and must be named “robots.txt”. A robots.txt file uploaded to a subdirectory is not valid, since bots check for robots.txt only in the root of the domain. A minimal, valid file placed in the root looks like this:
User-agent: *
Disallow:
2. Wrong syntax in robots.txt – Another common cause of problems is that the webmaster used the wrong syntax when creating the robots.txt file. Therefore, always double-check the file using a tool such as Robots.txt Checker.
Here is an example of a wrong entry:
User-agent: *
Disallow: private.html
We advise you to start a file or directory name with a leading slash character (example: /private.html).
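With the leading slash added, the corrected entry looks like this:
User-agent: *
Disallow: /private.html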
3. Adding comments at the end of an entry instead of at the beginning of a line – If you wish to include comments in your robots.txt file, precede them with a # sign, like this:
# Here are my comments about this entry.
User-agent: *
Disallow:
4. An empty robots.txt file is almost the same as not having one – If you have created a robots.txt file under your root directory and there is nothing in it, it is effectively the same as not having one at all. Because nothing is disallowed and no User-agent is given, everything is allowed for every bot.
5. Blocking the pages you need to get indexed – If you are blocking spider bots from pages using robots.txt, you should have a thorough understanding of the syntax; any mistake can cause huge problems with the spider bots.
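For example, one stray slash can hide an entire site: the entry below (shown here only as an illustration) tells every bot to stay away from everything, including the pages you want indexed.
User-agent: *
Disallow: /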
6. URL paths are case sensitive – URL paths are often case sensitive, so be consistent with your site’s capitalization. WARNING! Many robots and web servers are case-sensitive, so a path written with the wrong capitalization will not match root-level folders named private or PRIVATE.
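As an illustration, assuming your server has a root-level folder named /private/, the following entry will not block it on a case-sensitive server, because the capitalization differs:
User-agent: *
Disallow: /Private/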
7. Misspelled robot/user-agent names – Spider bots will ignore misspelled User-agent names. Check your raw server logs to find the exact User-agent names you need to block, and see UserAgentString.com for a list of User-agent names.
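As a hypothetical illustration, a misspelled name like the one below is simply ignored, so nothing gets blocked; the second entry uses the correct spelling.
# Wrong – the robot's name is misspelled, so this entry is ignored
User-agent: Googelbot
Disallow: /private/
# Correct
User-agent: Googlebot
Disallow: /private/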
8. Don’t add all the files in one single line – Another common mistake is listing every directory in a single Disallow entry.
For example:
User-agent: *
Disallow: /private/ /images/ /javascript/
This is wrong syntax and robots will not understand this format. The correct syntax is given below.
User-agent: *
Disallow: /private/
Disallow: /images/
Disallow: /javascript/
9. No Allow command in robots.txt – The original Robots Exclusion Protocol defines only one command, Disallow:; there is no standard Allow: command (although some bots, such as Googlebot, support Allow: as an extension). So if you want to allow bots to visit a page, simply don’t list it.
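For example, to keep bots out of /private/ while still allowing everything else (including a page such as /public.html, used here only as an illustration), list only what you want blocked:
User-agent: *
Disallow: /private/
# /public.html is not listed, so it stays crawlable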
10. Missing the colon – Another mistake is omitting the colon in a Disallow or User-agent entry. Here is an example of an entry with a missing colon:
#This is a wrong entry
User-agent: googlebot
Disallow /
#The correct entry
User-agent: googlebot
Disallow: /
Please leave a comment if you find any other common mistakes that need to be avoided while generating a robots.txt file. Below are a few useful robots.txt resources and tools.
http://www.mcanerin.com/en/search-engine/robots-txt.asp
http://webtools.live2support.com/se_robots.php
http://googlewebmastercentral.blogspot.com/2008/03/speaking-language-of-robots.html
Thanks for this list. Bookmarked
Hello Thomson, very informative about SEO and good advice on adding robots.txt to our website. Very interesting.
nice tips….. thanks
Jolina,
Thanks for the comments and good luck with your website. Also, let me know if you have any problems while installing the robots.txt file on your site.
This post is basic, but useful. Thanks!
Hey,
If I want the bot to read index.html in my root, but nothing else on the site, how would I do this?
Thanks ya’ll!
Another invaluable resource is Google’s Webmaster Tools (www.google.com/webmasters/tools/). They have a whole section dedicated not only to helping you build and test your robots.txt file, but they will actually give you a list of URLs blocked by Google.
great post, learned lots… and you answered questions that needed to be answered
Thanx for the info!
Helpful tips, can imagine these mistakes are so common
You have come up with a very nice list, great information.
Just one thing: Google does support the “Allow:” command. One use for it is when you want to disallow your pages from being crawled but keep showing AdSense ads on them; then you disallow all bots and allow the Google bot that crawls pages with ads.
This is awesome. Thanks for the list!
Hello,
27 (out of 100) of my site links have been blocked by the robots.txt file. Would it be a problem for getting traffic to our site? If so, what should I do to unblock the links?
Thanks for providing this info.
Thanks for the valuable information. We will follow your rules to create robots.txt files for our websites.
An ideal robots.txt file is as follows:
User-agent: *
Crawl-delay: 2
Disallow: /cgi-bin
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content/plugins
Disallow: /wp-content/cache
Disallow: /wp-content/themes
Disallow: /category
Disallow: /tag
Disallow: /author
Disallow: /trackback
Disallow: /*trackback
Disallow: /*trackback*
Disallow: /*/trackback
Disallow: /*?*
Disallow: /*.html/$
Disallow: /*feed*
# Google Image
User-agent: Googlebot-Image
Disallow:
Allow: /*
# Google AdSense
User-agent: Mediapartners-Google*
Disallow:
Allow: /*
Well-written article on the robots.txt file. This post is highly recommended reading for every webmaster.
RE: #9 — google’s own robots.txt has many ‘Allow’ statements: http://www.google.com/robots.txt
So, is this still right?
I have been looking for clear information about the robots.txt file for a long time; this blog definitely helped me understand it better.