In the good old days, web spiders would crawl your sites only after you registered them with a search engine. Today they are a lot more proactive, often crawling sites as soon as the domain names are registered. For this reason, it is essential during the development phase to add a robots.txt file to every project that instructs robots not to crawl the site.
It's super easy. Just create a text file named robots.txt in the website root directory and put the following text in it:
User-agent: *
Disallow: /
That's it. Now your temporary website is safe from most web crawlers. Note that every subdomain must have its own robots.txt in its root path.
Examples:
http://www.mysite.com
http://demo.mysite.com
http://admin.mysite.com
There is no need to put a robots.txt in subdirectories or virtual paths; the single file at the root covers them (see the sketch after these examples). For example:
http://www.mysite.com/admin
http://www.mysite.com/users
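You can confirm this behavior by testing paths against the root file yourself. Below is a minimal sketch using Python's standard urllib.robotparser module; the www.mysite.com hostname and the /admin and /users paths are just the placeholder examples from above.

import urllib.robotparser

# Point the parser at the single robots.txt in the site root;
# this is the only copy crawlers consult for the whole host.
parser = urllib.robotparser.RobotFileParser()
parser.set_url("http://www.mysite.com/robots.txt")
parser.read()

# Both virtual paths fall under the root "Disallow: /" rule.
print(parser.can_fetch("*", "http://www.mysite.com/admin"))   # False
print(parser.can_fetch("*", "http://www.mysite.com/users"))   # False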
More info on robots.txt here: http://www.robotstxt.org/
More in-depth info here: http://en.wikipedia.org/wiki/Robots.txt
A tutorial for when you want your site to be crawled: http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=40360
Please note that without a robots.txt file, a web spider will attempt to crawl every file and every path on your website. It is rare that you would actually want a web crawler to do this. For instance, do you really want all your button images and background images indexed?
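When the site does go live and you want it crawled, you can still keep asset folders out of the index. A minimal sketch, assuming the button and background images live in a hypothetical /images/ directory (adjust the path to match your own site):

User-agent: *
Disallow: /images/

Everything else on the site stays crawlable; only the image folder is skipped.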