
Tuesday, January 6, 2009

Don't Overlook Robots.txt

In the good old days, web spiders would only crawl your site after you registered it with a search engine. Today they are far more proactive, often crawling a site as soon as its domain name is registered. For this reason, adding a robots.txt file that instructs robots not to crawl the site is not optional during the development phase; every project should have one.

It's super easy. Just create a text file named robots.txt in the website root directory and put the following text in it.

User-agent: *
Disallow: /

That's it. Now your temporary website is safe from most web crawlers. Note that each subdomain is treated as a separate site, so each one needs its own robots.txt in its root path.

Examples:

http://www.mysite.com
http://demo.mysite.com
http://admin.mysite.com
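
A crawler always requests the file from the root of each host, so (using the hypothetical hostnames above) the file would need to be reachable at:

http://www.mysite.com/robots.txt
http://demo.mysite.com/robots.txt
http://admin.mysite.com/robots.txt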

There is no need to put a robots.txt file in subdirectories or virtual paths such as

http://www.mysite.com/admin
http://www.mysite.com/users
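
Crawlers only ever ask for /robots.txt at the root of the host, so path rules belong in that one root file. As a minimal sketch, assuming you wanted to keep crawlers out of the hypothetical /admin and /users paths above once the site goes live:

User-agent: *
Disallow: /admin/
Disallow: /users/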

More info on robots.txt: http://www.robotstxt.org/

More in-depth background: http://en.wikipedia.org/wiki/Robots.txt

And here is a tutorial for when you do want your site to be crawled: http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=40360

Please note that without a robots.txt file, a web spider will attempt to crawl every file and every path in your website. It is rare that you would actually want a web crawler to do this. For instance, do you really want all your button images and background images indexed?
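
If you do let crawlers in but want to keep that clutter out of the index, you can disallow just the asset folders. A minimal sketch, assuming your graphics live in hypothetical /images/ and /css/ directories:

User-agent: *
Disallow: /images/
Disallow: /css/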
