Definitions
Robots.txt file - the file that instructs robots/spiders/search engine bots how to behave.
Web server - the computer (and the software on that computer) that hosts your website.
Crawlers/robots/Googlebot - also known as a "bot" or a "spider", a search engine crawler follows links to web pages and then reads and retains the information it finds. This information eventually becomes the "copy" of a website in a search engine's index. This process is often referred to as "crawling" the web. "Googlebot" is the name of Google's main web crawler.
The robots.txt file is used to deny specific spiders, or all spiders, access to folders or pages that you don't want indexed. It is the first file search engine "spiders" look for when indexing a website, and it tells them which files and directories they are NOT allowed to index. This helps prevent incomplete site indexing and keeps files and directories you don't want listed out of search results. You may, for example, disallow "Google Images" from indexing certain directories and pages on your site without blocking "Google" itself from indexing those same files.
You may have created a personnel page for company employees that you don't want listed. Some webmasters use it to exclude their guest book pages in order to avoid spam. There are many different reasons to use the robots.txt file.
The robots.txt file must be uploaded to the root of your website or it will not work, and each rule needs both a user agent and a file or folder to disallow. The content of the file tells search engine crawlers how they should visit your site: if there are files and directories you do not want indexed, you can use robots.txt to define where the robots cannot go. It is a very simple text file placed in the root folder of your web server. An example is http://www.yourwebsite.com/robots.txt
If you want to see any website's robots.txt file, just add "/robots.txt" to its domain name. What do the robot instructions mean?
The "User-agent" part is there to specify directions to a specific robot if needed. There are two ways to use this in your file.
If you want to tell all robots the same thing, you put a "*" after "User-agent". It would look like this...
User-agent: * (This line is saying "these directions apply to all robots")
If you want to tell a specific robot something (in this example Googlebot) it would look like this...
User-agent: Googlebot (this line is saying "these directions apply to just Googlebot")
The "Disallow" part is there to tell the robots what folders they should not look at.
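Putting "User-agent" and "Disallow" together, a complete robots.txt might look like the sketch below; the directory names are only illustrative examples, not part of any real site:

```
# These directions apply only to Googlebot
User-agent: Googlebot
Disallow: /personnel/

# These directions apply to all other robots
User-agent: *
Disallow: /guestbook/
```

Each "User-agent" line starts a new block of rules, and the "Disallow" lines beneath it apply only to that block.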
You can ONLY have one robots.txt on your site and only in the root directory (where your home page is):
BAD - Won't work: www.yourdomain.com/subdirectory/robots.txt
All major search engine spiders respect this, and naturally most spambots (email collectors for spammers) do not. If you truly want security on your site, you will have to actually put the files in a protected directory, rather than trusting the robots.txt file to do the job. It's guidance for robots, not security from prying eyes.
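Because compliant crawlers follow these rules voluntarily, you can see exactly what a given rule set permits by checking it the way a well-behaved robot does. This sketch uses Python's standard urllib.robotparser module; the site name and paths are illustrative assumptions, not real rules:

```python
from urllib import robotparser

# Rules as they might appear in a site's robots.txt (illustrative paths).
rules = [
    "User-agent: *",
    "Disallow: /cgi-bin/",
    "Disallow: /images/",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)  # on a live site you would use set_url(...) followed by read()

# A well-behaved crawler asks before fetching each URL.
print(rp.can_fetch("Googlebot", "http://www.yourwebsite.com/cgi-bin/form.cgi"))  # False
print(rp.can_fetch("Googlebot", "http://www.yourwebsite.com/index.html"))        # True
```

A spambot simply skips this check, which is exactly why robots.txt is guidance for robots rather than security.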
It's really nothing more than a "Notepad" type .txt file named "robots.txt"
The basic syntax is:
User-agent: spider name here
Disallow: /file or folder name here
If you use:
User-agent: *
Disallow: /
the "*" acts as a wildcard that matches all spiders, and "Disallow: /" blocks the entire site. You may want to use this to stop search engines listing unfinished pages.
To disallow an entire directory use:
Disallow: /directoryname/
To disallow an individual file use:
Disallow: /directoryname/filename.html
You have to use a separate line for each disallow. You cannot, for example, use:
Disallow: /cgi-bin/ /images/
Instead, you should use:
Disallow: /cgi-bin/
Disallow: /images/
There are several reasons you would want to control a robot's visits to your website:
It saves your bandwidth -
The spider won't visit areas where there is no useful information (your cgi-bin, images, etc.).
It gives you a very basic level of protection -
Although it's not real security, it will keep people from easily finding material you don't want exposed via search engines. They actually have to visit your site and browse to the directory instead of finding it on Google, MSN, Yahoo, or Teoma.
It cleans up your logs -
Every time a search engine visits your site it requests robots.txt, which can happen several times a day. If you don't have one, each request generates a "404 Not Found" error, and it's hard to wade through all of those entries to find genuine errors at the end of the month.
It can prevent spam and penalties associated with duplicate content -
Let's say you have a high-speed and a low-speed version of your site, or a landing page intended for use with advertising campaigns. If this content duplicates other content on your site, you can find yourself in ill favor with some search engines. You can use the robots.txt file to prevent the duplicate content from being indexed, and therefore avoid issues. Some webmasters also use it to exclude "test" or "development" areas of a website that are not ready for public viewing yet.
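For the duplicate-content case, a sketch of the relevant rules might look like this; the directory names are only examples and would need to match your own site's layout:

```
User-agent: *
Disallow: /lowspeed/
Disallow: /landing/
Disallow: /test/
```

Each disallowed directory keeps that alternate or unfinished version of the content out of the index while the primary pages remain crawlable.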
Yahoo, Google, MSN, and Ask all accept this as a way to find your sitemap. Just add a line like the following to your robots.txt file and search engine spiders will find your sitemap very easily:
Sitemap: http://www.yourwebsite.com/sitemap.xml
You can generate one at Google Webmaster tools.