All search engines, or at least all the important ones, now look for a robots.txt file as soon as their spiders or bots arrive on your site. So even if you currently do not need to exclude the spiders from any part of your site, having a robots.txt file is still a good idea: it can act as a sort of invitation into your site.
What is a robots.txt file?
A robots.txt file provides restrictions to search engine robots (known as “bots”) that crawl the web. Robots are used to find content to index in the search engine’s database.
These bots are automated, and before they access any sections of a site, they check to see if a robots.txt file exists that prevents them from indexing certain pages.
The robots.txt file is a simple text file (no HTML) that must be placed in your root directory, for example: http://www.yourdomain.com/robots.txt
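Note that crawlers only look for the file at the root of your domain; a robots.txt placed in a subdirectory (here a made-up /pages/ folder is used purely as an illustration) is ignored. Using the same fictional domain:
http://www.yourdomain.com/robots.txt (read by crawlers)
http://www.yourdomain.com/pages/robots.txt (ignored – not in the root directory)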
Reasons for using a robots.txt file
There are 3 primary reasons for using a robots.txt file on your website:
- Information you don’t want made public through search
If you have content on your website that you don’t want found via searches, the robots.txt file will prevent search engines from including it in their index.
- Duplicate content
Often similar content is presented on a website under several URLs (e.g. the same blog post might appear under various categories). Duplicate content can incur penalties from search engines, which is bad from an SEO point of view. The robots.txt file can help you control which version of the content the search engines include in their index.
- Manage bandwidth usage
Some websites have limited bandwidth allowances (based on their hosting packages). Since robots use up bandwidth when crawling your site, you might in some instances want to stop certain user agents from indexing parts of your site to conserve bandwidth. A sample file covering all three cases is sketched below.
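For illustration only, here is a minimal robots.txt addressing all three reasons; the directory names (/private/, /print/ and /downloads/) are hypothetical and would need to match your own site’s structure:
User-agent: *
Disallow: /private/
Disallow: /print/
Disallow: /downloads/
The first rule hides confidential pages from search, the second blocks printer-friendly duplicates of existing pages, and the third keeps robots away from large files that would eat into your bandwidth allowance.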
How to create a robots.txt file?
The robots.txt file is just a simple text file. To create your own robots.txt file, open a new document in a simple text editor (e.g. Notepad).
The content of a robots.txt file consists of “records” which tell the specific search engine robots what to index and what not to access.
Each of these records consists of two fields – the User-agent line (which specifies the robot to control) and one or more Disallow lines. Here’s an example:
User-agent: googlebot
Disallow: /admin/
This example record allows “googlebot”, which is Google’s spider, to access every page of the site except files from the “admin” directory. All files in the “admin” directory will be ignored.
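With that record in place, URLs such as the following (hypothetical paths on the example domain) would not be crawled by googlebot, while everything outside the “admin” directory remains crawlable:
http://www.yourdomain.com/admin/
http://www.yourdomain.com/admin/users.html
http://www.yourdomain.com/admin/reports/index.html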
If you only want specific pages excluded, you need to specify the exact file. For example:
User-agent: googlebot
Disallow: /admin/login.html
This record excludes only the specified login page, while the rest of the “admin” directory remains crawlable. Should you want your entire site and all of its content to be indexed, simply leave the Disallow line blank.
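For example, the following record (a minimal sketch) places no restrictions at all – the empty Disallow value tells every robot that nothing is off limits:
User-agent: *
Disallow: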
If you want all the search engine robots to have access to the same content, you can use a generic user agent record – which will control all of them in the same way.
User-agent: *
Disallow: /admin/
Disallow: /comments/
How to find which User Agents to control?
The first place to look for a list of the robots currently indexing your website is in your log files.
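As an illustration, here is a hypothetical entry from an Apache-style access log (the exact format depends on your server configuration). The quoted user agent string at the end identifies the crawler – in this case Google’s spider, whose robots.txt token is “Googlebot”:
66.249.66.1 - - [12/Mar/2010:06:25:24 +0000] "GET /robots.txt HTTP/1.1" 200 156 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"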
For SEO purposes, you’ll generally want all search engines indexing the same content, so using “User-agent: *” is the best strategy.
If you want to get specific with your user agents, you can find a comprehensive list at http://www.user-agents.org/
At Levonsys, we are constantly researching and innovating with new Search Engine Optimization techniques to ensure we are giving clients the best possible SEO services, and the best possible results. Our Search Engine Optimization goals may be ambitious, but we also make sure they are achievable by using tried and tested ethical Search Engine Optimization techniques.
Levonsys
Email: sales@levonsys.com | Website: www.levonsys.com