Robots.txt

Robots.txt is a plain text file that gives instructions to web crawlers (also called robots) about which areas of your site should not be crawled. The file lives in the root (top-level) directory of your site, for example https://www.example.com/robots.txt, and is the first file a well-behaved crawler requests. The crawler reads the rules contained in robots.txt and proceeds accordingly.

Syntax

User-agent: * (agent name)
Disallow: / (file path)

In the lines above, the agent name is to be replaced by the name of the search engine bot you wish to exclude, and the file path by the URL path, relative to the root of your site (for example /private/page.html), of the file or folder you wish to exclude.

Example 1

User-agent: *
Disallow: /

These lines tell all robots to stay out of the entire site, stopping obedient crawlers from accessing any of the files on the server.

Example 2

User-agent: Googlebot
Disallow: /

These lines tell Googlebot (Google's crawler) to stay out of the entire site, stopping it from accessing any of the files on the server.
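To see how an obedient crawler evaluates these rules, here is a minimal Python sketch using the standard library's urllib.robotparser module; the example.com URLs and the OtherBot name are placeholders, not real endpoints:

from urllib import robotparser

# Feed the rules from Example 2 to Python's standard-library parser.
rules = [
    "User-agent: Googlebot",
    "Disallow: /",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)

# Googlebot is named in the rule group, so every path is off limits to it.
print(rp.can_fetch("Googlebot", "http://example.com/page.html"))  # False

# Other agents are unaffected, because no rule group names them.
print(rp.can_fetch("OtherBot", "http://example.com/page.html"))   # True

An obedient crawler performs exactly this kind of check before requesting any URL. With the rules from Example 1, both calls would return False, since User-agent: * matches every agent.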

When should you use robots.txt?

A pretty useful question, right? Well, you should use robots.txt if there are scripts or sections on your server that you do not want bots to access, or if you want specific bots to skip the contents of your site entirely. For smaller sites of fewer than a hundred pages there is rarely any need for robots.txt, as there are usually no special scripts to hide. Larger sites with huge databases behind them, however, may well have pages or scripts that need to be kept away from the bots; in that case, you should use a robots.txt file.


I have created robots.txt, now my secret pages are safe!

I have heard many people say this, but in reality robots.txt works only for obedient robots. Search engine bots may or may not follow the instructions in the file: the obedient ones will honor them and stay away from the secret pages, while the disobedient ones will ignore the instructions and crawl them anyway. So if you want to keep pages out of the index, use the noindex meta tag (<meta name="robots" content="noindex"> in the page's head) instead. For more information on blocking pages from search engines, see: How to block pages from search engines.
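To make the point concrete, here is a minimal sketch of what a disobedient bot does: it never consults robots.txt at all and requests the page directly (the URL below is a hypothetical placeholder):

from urllib import request

# No robots.txt check of any kind happens here. The server will
# happily return the "disallowed" page unless it is protected by
# real access control such as authentication.
response = request.urlopen("http://example.com/secret-page.html")
print(response.status)

This is why robots.txt is a politeness convention, not a security mechanism.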

Robots.txt Example Entries and Use of Wildcards

1- To disallow crawling of a folder named “Abs”

User-agent: *
Disallow: /Abs/

2- To block a page named “Soc.html”

User-agent: *
Disallow: /Soc.html

3- To block web pages whose file names end with .php (here * matches any sequence of characters and $ anchors the match to the end of the URL; these wildcards are extensions supported by major crawlers such as Googlebot)

User-agent: *
Disallow: /*.php$

4- To block Googlebot (Google’s crawling agent) from accessing the contents of the folder named “B”

User-agent: Googlebot
Disallow: /B/

5- To block the crawlers from all images with the .jpg extension

User-agent: *
Disallow: /*.jpg$

6- To exclude a file named “joker.php” contained in the folder named “circus”

User-agent: *
Disallow: /circus/joker.php

7- To prevent Googlebot-Image (Google’s image crawler) from accessing images on your site

User-agent: Googlebot-Image
Disallow: /
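All of these entries can also be combined into a single robots.txt file: each User-agent line starts a new group, a group may contain several Disallow lines, and groups are separated by blank lines. A sketch combining some of the rules above:

User-agent: Googlebot-Image
Disallow: /

User-agent: *
Disallow: /Abs/
Disallow: /circus/joker.php
Disallow: /*.php$

A crawler obeys the most specific group that matches its name and falls back to the * group otherwise.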

Tricky Question

What will the following entry do?

User-agent: *
Disallow:

Answer: It will allow the crawlers to access every folder and every web page on your server, because no folder or file name has been mentioned after Disallow; an empty Disallow value means nothing is disallowed.
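If you want to double-check this, the standard-library parser used earlier confirms it (example.com and AnyBot are placeholders):

from urllib import robotparser

rp = robotparser.RobotFileParser()
# An empty Disallow value disallows nothing, so everything is allowed.
rp.parse(["User-agent: *", "Disallow:"])
print(rp.can_fetch("AnyBot", "http://example.com/any/page.html"))  # True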

Free Robots.txt Generators

There are some free tools on the web that can help you create your own robots.txt file. These tools are given below:

Seobook robots.txt generator
Advanced robots.txt generator (Software free download)
Yellowpipe robots.txt generator
Seochat robots generator
1 hit robots generator

 
