Wednesday, September 21, 2011

What Is a Robots.txt File? Creating and Using Robots.txt

About Robots.txt file

It is great when search engines visit your site frequently and index your content, but there are often cases where you do not want parts of your online content indexed. You can achieve this by creating a simple text file in the root path of your server and naming it exactly robots.txt. In other words, robots.txt is a file placed on your server to tell the various search engine spiders not to crawl or index certain sections or pages of your site. You can use it to prevent indexing totally, to keep certain areas of your site from being indexed, or to issue individual indexing instructions to specific search engines.

One thing to note here is that search engines are under no obligation to honor robots.txt. Compliance is voluntary, although search engines generally obey what they are asked not to do.

How and where to create it?

The file itself is a simple text file, which can be created in Notepad or your favorite text editor. It needs to be saved to the root directory of your site, which is the directory where your home page or index page is. Misspellings are common, so be sure to name it exactly "robots.txt", all in lowercase, and not "Robot.txt" or "Robots.txt".
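
Because the file has to sit in the root directory, its address is the site's scheme and host followed by /robots.txt. As a minimal sketch in Python, the address can be derived from any page URL. The function name robots_url and the example.com address are illustrative assumptions, not part of the original article.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt address for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host (including any port), replace the path,
    # and drop the query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page URL used only for illustration.
print(robots_url("http://www.example.com/blog/2011/09/post.html?x=1"))
```

A file saved anywhere else, for example under /blog/robots.txt, is not at the address this function returns, which is why crawlers would not look for it there.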

Structure of Robots.txt file

The structure of a robots.txt file is pretty simple (and barely flexible): it is a list, as long as you need, of user agents and the files and directories disallowed for them. Basically, the syntax is as follows:

User-Agent: [Spider or Bot Name]
Disallow: [Directory or Specific File Name]
 
User-agent: names the search engine crawler or bot that the rules apply to (* matches all of them), and Disallow: names a file or directory to be excluded from indexing. If you also want to include comment lines, just put the # sign at the beginning of the line:

# All user agents are disallowed to index the secure directory.
User-agent: *
Disallow: /secure/
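
To see how a well-behaved crawler would read these lines, the sketch below feeds them to Python's standard-library urllib.robotparser. This parser is only a stand-in for a real search engine, and the bot name ExampleBot and the example.com URLs are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# The rules from the example above, comment line included.
rules = """\
# All user agents are disallowed to index the secure directory.
User-agent: *
Disallow: /secure/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# "ExampleBot" is a hypothetical crawler name; it falls under "User-agent: *".
secure_allowed = parser.can_fetch("ExampleBot", "http://www.example.com/secure/login.html")
public_allowed = parser.can_fetch("ExampleBot", "http://www.example.com/index.html")

print("secure page allowed:", secure_allowed)
print("home page allowed:", public_allowed)
```

The comment line is ignored by the parser, and only the path under /secure/ is reported as off limits.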

Examples of Usage

A few examples will make it clearer how to write the contents of a robots.txt file.

Exclude all robots from the entire web site

User-agent: *
Disallow: /

Allow all robots to access the entire web site

The only difference from the example above is that the '/' is omitted from the Disallow line, leaving its value empty. Alternatively, you can create an empty robots.txt file or not create one at all.

User-agent: *
Disallow:
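
The one-character difference between the two files above is easy to get wrong, so here is a small check, again using Python's urllib.robotparser as a stand-in for a crawler. The helper name is_allowed, the bot name ExampleBot, and the URL are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(rules: str, agent: str, url: str) -> bool:
    """Parse robots.txt text and report whether agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url)


BLOCK_ALL = "User-agent: *\nDisallow: /\n"   # every path starts with "/"
ALLOW_ALL = "User-agent: *\nDisallow:\n"     # empty value excludes nothing
EMPTY_FILE = ""                              # no rules at all

url = "http://www.example.com/index.html"   # hypothetical page
for name, rules in [("block all", BLOCK_ALL), ("allow all", ALLOW_ALL), ("empty file", EMPTY_FILE)]:
    print(name, "->", is_allowed(rules, "ExampleBot", url))
```

With this parser, only the first file refuses the page; the empty Disallow value and the empty file both leave the whole site open.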

Exclude a part of your web site

For example, if you wish to exclude some directories (but not all), you can use the following syntax.

User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
Disallow: /secure/
Disallow: /temp/
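
Each Disallow value is compared against the beginning of the URL path, so the trailing slash matters: /temp/ covers everything inside that directory but not a page such as /temporary.html. The sketch below illustrates this with Python's urllib.robotparser; the bot name ExampleBot and the sample paths are assumptions, and real crawlers may differ in details.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
Disallow: /secure/
Disallow: /temp/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Hypothetical paths chosen to show where the prefix match starts and stops.
paths = ["/temp/cache.html", "/temporary.html", "/private/", "/index.html"]
results = {
    path: parser.can_fetch("ExampleBot", "http://www.example.com" + path)
    for path in paths
}

for path, ok in results.items():
    print(path, "allowed" if ok else "disallowed")
```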

Exclude a single bot from the entire web site

User-agent: Slurp
Disallow: /

Allow a single bot

User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
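
In this file the group that names Googlebot takes effect for Googlebot, and every other crawler falls back to the User-agent: * group. A quick check with Python's urllib.robotparser, used here as a stand-in for real crawlers, shows the effect. The example.com URL is an assumption for illustration.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

url = "http://www.example.com/index.html"  # hypothetical page
googlebot_ok = parser.can_fetch("Googlebot", url)  # matches its own group
slurp_ok = parser.can_fetch("Slurp", url)          # falls back to "*"

print("Googlebot:", googlebot_ok)
print("Slurp:", slurp_ok)
```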

Exclude a file from an Individual Search Engine

For example, suppose you want to exclude from Google your mydata.htm file, which is placed under the 'secure' directory. (The name of the Google bot that indexes pages is Googlebot.)

User-agent: Googlebot
Disallow: /secure/mydata.htm
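
A rule like this is narrow in two ways: it applies to one crawler and to one file. The sketch below checks both with Python's urllib.robotparser as a stand-in for the crawlers. The second file name other.htm and the example.com host are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: Googlebot
Disallow: /secure/mydata.htm
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

base = "http://www.example.com"  # hypothetical host
google_file = parser.can_fetch("Googlebot", base + "/secure/mydata.htm")
google_other = parser.can_fetch("Googlebot", base + "/secure/other.htm")
slurp_file = parser.can_fetch("Slurp", base + "/secure/mydata.htm")

print("Googlebot, mydata.htm:", google_file)
print("Googlebot, other.htm:", google_other)
print("Slurp, mydata.htm:", slurp_file)
```

Only the first check is refused. Other files in the directory, and other crawlers, are unaffected because no rule in the file covers them.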

Exclude a file from all Search Engines

User-agent: *
Disallow: /secure/mydata.htm

Handling Complex Situations

You can also combine multiple rules one after another to handle complex situations. Let's walk through a slightly more complicated example step by step.

(1) First, you would ban all search engines from the directories you do not want indexed at all:

User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
Disallow: /secure/
Disallow: /temp/
 
(2) Next, suppose you want to exclude Yahoo (whose crawler is named Slurp) from the entire web site:

User-agent: Slurp
Disallow: /


(3) Further, if you want to keep Google from indexing the images on your web site:

User-agent: Googlebot-Image
Disallow: /Images/
Disallow: /Public/Images/


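(4) Finally, if you want to exclude certain files from all spiders. (Note that /private/ was already disallowed in step (1), so as written this rule adds nothing; a rule of this kind matters only for files outside the directories listed there.)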

User-agent: *
Disallow: /private/mybio.html
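
Putting the four steps into one file raises a question worth checking: which group does each crawler end up following? The sketch below assembles the steps and asks Python's urllib.robotparser, used as a stand-in for real crawlers, which may differ in details. The bot name ExampleBot, the file names, and the example.com host are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Steps (1) to (4) combined into a single robots.txt.
rules = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
Disallow: /secure/
Disallow: /temp/

User-agent: Slurp
Disallow: /

User-agent: Googlebot-Image
Disallow: /Images/
Disallow: /Public/Images/

User-agent: *
Disallow: /private/mybio.html
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())


def check(agent: str, path: str) -> bool:
    """Report whether agent may fetch path on the hypothetical host."""
    return parser.can_fetch(agent, "http://www.example.com" + path)


print("Slurp /index.html:", check("Slurp", "/index.html"))
print("Googlebot-Image /Images/logo.png:", check("Googlebot-Image", "/Images/logo.png"))
print("Googlebot-Image /images/logo.png:", check("Googlebot-Image", "/images/logo.png"))
print("Googlebot-Image /cgi-bin/script.cgi:", check("Googlebot-Image", "/cgi-bin/script.cgi"))
print("ExampleBot /private/mybio.html:", check("ExampleBot", "/private/mybio.html"))
print("ExampleBot /index.html:", check("ExampleBot", "/index.html"))
```

Two details stand out in the results. Paths are case-sensitive, so /images/ is not covered by a rule for /Images/. And a crawler that has its own group follows that group instead of the User-agent: * group, so with this parser Googlebot-Image is not kept out of /cgi-bin/; if it should be, those Disallow lines would need repeating in its group.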

http://info.webtoolhub.com/kb-a19-what-is-robots-txt-file-creating-and-using-robots-txt.aspx
