Robots.txt is a special plain text file located in the root of a web server that allows the administrator of a website to define which content is allowed or disallowed to the bots that visit the site.
All major search engines, including Google, Yahoo, and MSN, honor the Robots Exclusion Protocol. There are several elements every website owner needs to understand to make crawling of their site easier. The following are the top 10 common mistakes to avoid when creating a robots.txt file.
1. Adding robots.txt outside the root directory – This is one of the most common mistakes webmasters make: uploading the robots.txt file to the wrong place. The file must reside in the root of the domain and must be named “robots.txt”. A robots.txt file uploaded to a subdirectory is not valid, since bots check for the file only in the root of the domain.
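For example (example.com is a placeholder domain), bots will only ever request the file from the root:

```text
# Valid - bots will find this:
#   http://www.example.com/robots.txt
# Invalid - bots will never look here:
#   http://www.example.com/blog/robots.txt
```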
2. Wrong syntax in robots.txt – Another common problem is incorrect syntax in the robots.txt file, so always double-check it using a tool such as Robots.txt Checker. For example, always start a file or directory name with a leading slash character (e.g. /private.html).
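A minimal well-formed file might look like this (the path /private.html is only an illustration):

```text
User-agent: *
Disallow: /private.html
```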
3. Adding a comment at the end of a line instead of at the beginning – If you wish to include comments in your robots.txt file, precede them with a # sign, like this:
# Here are my comments about this entry.
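In context, keep each comment on its own line rather than appending it to the end of a directive, since some older parsers handle trailing comments poorly (the path is illustrative):

```text
# Keep the staging area out of the index
User-agent: *
Disallow: /staging/
```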
4. An empty robots.txt file is almost like not having one – If you have created a robots.txt file under your root directory and there is nothing in it, that is much the same as not having one at all. Because nothing is disallowed and no User-agent is given, everything is allowed for every bot.
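If you really do want to allow everything, it is clearer to say so explicitly than to leave the file empty; an empty Disallow value means nothing is blocked:

```text
User-agent: *
Disallow:
```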
5. Blocking pages which you need to get indexed – If you are blocking spider bots and pages using robots.txt, you should have a thorough understanding of the syntax; any mistake can cause huge problems with the spider bots.
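The classic example of this mistake is a single stray slash, which blocks the entire site from every bot:

```text
User-agent: *
Disallow: /
```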
6. URL paths are case sensitive – URL paths are often case sensitive, so be consistent with your site’s capitalization. WARNING! Many robots and web servers are case-sensitive, so a path such as /Private will not match root-level folders named /private or /PRIVATE.
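For instance, on a case-sensitive server the rule below leaves /private/ fully crawlable (folder names are illustrative):

```text
User-agent: *
Disallow: /Private/
```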
7. Misspelled robot/user-agent names – Spider bots will ignore misspelled User-agent names. Check your raw server logs to find the User-agent names you need to block, and see UserAgentString.com for a list of User-agent names.
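Verify the exact spelling before blocking; for example, Google’s crawler identifies itself as Googlebot (the blocked path is a hypothetical example):

```text
User-agent: Googlebot
Disallow: /no-google/
```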
8. Don’t add all the files in one single line – Another common mistake is listing several files or directories after a single Disallow directive. Robots will not understand that format; each path needs its own Disallow line.
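For example (directory names are illustrative):

```text
# Wrong - several paths on one Disallow line:
User-agent: *
Disallow: /css/ /cgi-bin/ /images/

# Correct - one path per Disallow line:
User-agent: *
Disallow: /css/
Disallow: /cgi-bin/
Disallow: /images/
```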
9. No Allow command in robots.txt – The original Robots Exclusion Protocol defines only a Disallow: directive; there is no standard Allow: command (although some major engines, such as Google, do recognize one). So if you want the bots to visit a page, the portable approach is simply not to list it under Disallow.
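For example, to let bots crawl everything except one directory, disallow only that directory; everything else is allowed by omission (the path is illustrative):

```text
User-agent: *
Disallow: /cgi-bin/
```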
10. Missing the colon – Leaving out the colon in a Disallow or User-agent entry makes the line invalid. Here is an example of a missing-colon entry (the path is illustrative):

#This is a wrong entry
Disallow /images/

#The correct entry
Disallow: /images/
Please leave a comment if you find any other common mistakes that need to be avoided when generating a robots.txt file. Below are a few useful robots.txt resources and tools.