Why do you need robots.txt?
There are a number of reasons to use a robots.txt file.
- When our site is in development and some of its sections or pages are still under construction.
- We have a few directories or files which we want to keep private, such as draft designs or experimental data.
- Keeping search engine crawlers away from certain files on your website (images, PDFs, etc.)
But a misconfiguration in the robots.txt file can be very costly! If you accidentally disallow Googlebot from crawling your entire site, it can disappear from search results. Review your rule set carefully and make sure you are not making any mistakes.
Search engines have web robots that frequently crawl the web. A robots.txt file simply tells these web robots (also called search engine spiders or searchbots) which directories and pages of a website they are allowed to crawl and which areas or files they must stay away from.
For example, if you do not wish your page thank-you.html to be scanned by web robots, you can restrict them:
User-agent: *
Disallow: /thank-you.html
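If you want to sanity-check a rule like this before deploying it, Python's standard-library urllib.robotparser can simulate a crawler's decision (example.com here is just a placeholder domain):

```python
from urllib.robotparser import RobotFileParser

# Feed the rule set above to the parser line by line.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /thank-you.html",
])

# A well-behaved robot must skip the disallowed page but may fetch the rest.
print(rp.can_fetch("*", "https://example.com/thank-you.html"))  # False
print(rp.can_fetch("*", "https://example.com/index.html"))      # True
```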
How to create a robots.txt file
Creating a robots.txt file is easy. Open any text editor, add your rules, and save the file with a .txt extension. Then simply upload robots.txt to the root directory of your website.
The "User-agent:" line specifies which search engine robot a group of rules applies to. An asterisk (*) is used as a wildcard with User-agent to address all search engines at once. So the robots.txt snippet below disallows all user-agents from crawling or indexing the website.
Disallow all indexing (Universal Match)
User-agent: *
Disallow: /
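Simulating this rule set with urllib.robotparser confirms how sweeping it is, since the single Disallow: / line shuts out every crawler from every path (example.com is a placeholder):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /"])

# Every path on the site is off limits to every user-agent.
print(rp.can_fetch("*", "https://example.com/"))                       # False
print(rp.can_fetch("Googlebot", "https://example.com/any/page.html"))  # False
```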
Disallow specific User-agents
If you want to block a specific user-agent, for example Googlebot, from a particular directory such as /images/, then we write:
User-agent: Googlebot
Disallow: /images/
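The same simulation technique shows that this rule set only restricts Googlebot; other user-agents, which have no group of their own here, remain free to crawl (example.com and Bingbot are stand-ins for illustration):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse(["User-agent: Googlebot", "Disallow: /images/"])

# Googlebot is kept out of /images/ only; everyone else is unrestricted.
print(rp.can_fetch("Googlebot", "https://example.com/images/logo.png"))  # False
print(rp.can_fetch("Googlebot", "https://example.com/about.html"))       # True
print(rp.can_fetch("Bingbot", "https://example.com/images/logo.png"))    # True
```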
To block access to all URLs that include a question mark (?), you could use the following entry:
User-agent: *
Disallow: /*?
We can also use $ to block URLs ending with a specific file type. For example, if we want to block URLs that end with .php:
User-agent: Googlebot
Disallow: /*.php$
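These * and $ wildcards are extensions supported by major crawlers such as Googlebot rather than part of the original robots.txt standard, so Python's built-in parser does not handle them. As a rough illustration of how a crawler interprets them, here is a small hypothetical matcher (rule_matches is our own helper, not a library function):

```python
import re

def rule_matches(pattern, path):
    """Illustrative matcher for robots.txt wildcard rules:
    '*' matches any run of characters, and a trailing '$'
    anchors the pattern to the end of the URL path."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape everything except '*', which becomes the regex '.*'.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    return re.match(regex, path) is not None

print(rule_matches("/*.php$", "/index.php"))      # True: ends with .php
print(rule_matches("/*.php$", "/index.php?x=1"))  # False: $ anchors to the end
print(rule_matches("/*?", "/search?q=robots"))    # True: contains a ?
print(rule_matches("/*?", "/search"))             # False: no ? anywhere
```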
What happens if your website has no robots.txt?
Well, in this case, robots simply visit and index every directory, page, and piece of content they can reach. The same is true for an empty robots.txt file.
This may be a little complex, so let's work through an example. Assume that our web directory contains a sub-directory with a few pages, and we want only one specific page in it to be crawled and indexed.
User-agent: *
Allow: /site2/special-page.php
Disallow: /site2/
So we disallowed the directory 'site2' but allowed 'special-page.php' inside that same directory. To make this rule work across the widest range of robots, place the Allow directive(s) first, followed by the Disallow: robots that apply rules in order will honor the exception only when the order is respected.
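Python's urllib.robotparser happens to apply rules in file order, so it can also demonstrate why the Allow line must come first in this rule set (example.com is a placeholder):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Allow: /site2/special-page.php",   # the exception, listed first
    "Disallow: /site2/",                # the broader block, listed second
])

# Only the special page inside /site2/ is crawlable; the rest of the
# directory is blocked, and pages outside it are unaffected.
print(rp.can_fetch("*", "https://example.com/site2/special-page.php"))  # True
print(rp.can_fetch("*", "https://example.com/site2/other.html"))        # False
print(rp.can_fetch("*", "https://example.com/index.html"))              # True
```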
We can specify rules for multiple user-agents, as shown in the example below:
User-agent: googlebot        # all Google services
Disallow: /private/          # disallow specified directory

User-agent: googlebot-news   # only the news service
Disallow: /                  # disallow everything

User-agent: *                # any robot
Disallow: /something/        # disallow specified directory