In the good old days, web spiders would crawl your sites only after you registered them with a search engine. Today they are a lot more proactive, often crawling sites as soon as the domain names are registered. For this reason, it is essential during the development phase to add a robots.txt file to every project that instructs robots not to crawl the site.
It's super easy. Just create a text file named robots.txt in the website root directory and put the following text in it:
User-agent: *
Disallow: /
That's it. Now your temporary website is safe from most web crawlers. Note that every subdomain must have its own robots.txt in its root path.
Examples:
http://www.mysite.com
http://demo.mysite.com
http://admin.mysite.com
There is no need to put a robots.txt in subdirectories or virtual paths; the single file at the root covers them (see the sketch after these examples). For example:
http://www.mysite.com/admin
http://www.mysite.com/users
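You can confirm this behavior by testing paths against the root file yourself. Below is a minimal sketch using Python's standard urllib.robotparser module; the www.mysite.com hostname and the /admin and /users paths are just the placeholder examples from above.

import urllib.robotparser

# Point the parser at the single robots.txt in the site root;
# this is the only copy crawlers consult for the whole host.
parser = urllib.robotparser.RobotFileParser()
parser.set_url("http://www.mysite.com/robots.txt")
parser.read()

# Both virtual paths fall under the root "Disallow: /" rule.
print(parser.can_fetch("*", "http://www.mysite.com/admin"))   # False
print(parser.can_fetch("*", "http://www.mysite.com/users"))   # False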
More info on robots.txt here: http://www.robotstxt.org/
More in-depth info here: http://en.wikipedia.org/wiki/Robots.txt
A tutorial for when you want your site to be crawled: http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=40360
Please note that without a robots.txt file, a web spider will attempt to crawl every file and every path on your website. It is rare that you would actually want a web crawler to do this. For instance, do you really want all your button images and background images indexed?
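When the site does go live and you want it crawled, you can still keep asset folders out of the index. A minimal sketch, assuming the button and background images live in a hypothetical /images/ directory (adjust the path to match your own site):

User-agent: *
Disallow: /images/

Everything else on the site stays crawlable; only the image folder is skipped.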