Tuesday, January 6, 2009

Don't Overlook Robots.txt

In the good old days, web spiders crawled your site only after you registered it with a search engine. Today they are a lot more proactive and may start crawling a site as soon as its domain name is registered. For this reason, adding a robots.txt file that instructs robots not to crawl the site is not optional during the development phase. Every project should have one.

It's super easy. Just create a text file named robots.txt in the website root directory and put the following text in it.

User-agent: *
Disallow: /

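That's it. Now your temporary website is off-limits to every webcrawler that honors robots.txt, which includes the major search engines. The file is a request rather than access control, so a badly behaved bot can still ignore it. Note that each subdomain needs its own robots.txt in its root path.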
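If you want to confirm that the two lines above really block everything, a quick check with Python's standard urllib.robotparser module is one way to do it. This is only a sketch: the host dev.example.com, the user agent name ExampleBot, and the helper is_allowed are made up for illustration and are not part of any real setup.

```python
from urllib.robotparser import RobotFileParser

# The two-line robots.txt from this post.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url.

    Hypothetical helper: it parses the rules from a string instead of
    downloading them, so it can be run before the site is even deployed.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Placeholder URLs on an assumed development host.
    for url in (
        "http://dev.example.com/",
        "http://dev.example.com/images/button.png",
    ):
        print(url, "allowed" if is_allowed(ROBOTS_TXT, "ExampleBot", url) else "blocked")
```

Both URLs should be reported as blocked. Keep in mind that this only shows how a well-behaved crawler would read the file; it does not stop a crawler that ignores robots.txt.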


There is no need to put a robots.txt in subdirectories or virtual paths, such as

Please note that without a robots.txt file, a web spider will attempt to crawl every file and every path it can discover on your website. It is rare that you would actually want a webcrawler to do this. For instance, do you really want all your button images and background images indexed?
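For a live site, the usual alternative to blocking everything is to disallow only the paths you don't want indexed. The sketch below assumes the button and background images live under /images/ and the stylesheets under /css/. Both paths, the bot name ExampleBot, and the helper blocked_paths are hypothetical, so substitute whatever your site actually uses.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for a live site: pages stay crawlable,
# asset directories (assumed names) do not.
SELECTIVE_ROBOTS_TXT = """\
User-agent: *
Disallow: /images/
Disallow: /css/
"""


def blocked_paths(robots_txt: str, user_agent: str, paths: list[str]) -> list[str]:
    """Return the subset of paths that robots_txt tells user_agent not to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [path for path in paths if not parser.can_fetch(user_agent, path)]


if __name__ == "__main__":
    sample_paths = ["/index.html", "/images/button.png", "/css/site.css"]
    for path in blocked_paths(SELECTIVE_ROBOTS_TXT, "ExampleBot", sample_paths):
        print("blocked:", path)
```

With these rules /index.html stays crawlable while the image and stylesheet paths are reported as blocked.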

