Generating a robots.txt file
What is the robots.txt file?
Robots.txt is a plain text file placed in the root directory of a site, so its path looks like this: site.com/robots.txt
The file tells search bots how they should crawl and index the site. The rules can apply to all bots or only to those you specify.
The robots.txt file helps you make good use of the crawl budget: the number of pages on the site that a search bot will crawl. If the bot is allowed to scan every existing page indiscriminately, the crawl budget may run out before the bot reaches the important pages.
What you should block in robots.txt
- Site search results pages;
- Cart and checkout pages;
- Sorting and filter pages;
- Registration pages and the personal account area.
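For example, on an online store these rules could look like the snippet below. The paths (/search/, /cart/, /checkout/, /account/) and the ?sort= parameter are placeholders; use the paths that actually exist on your site.
User-agent: *
Disallow: /search/
Disallow: /cart/
Disallow: /checkout/
Disallow: /*?sort=
Disallow: /account/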
How does the service for creating robots.txt work?
- To generate a file, you can set the crawl delay: the time interval, in seconds, that bots should wait between requests to the site.
- If you have a sitemap, add a link to it so that it is included in the generated file.
- In the indexing rules section, add the pages that should or should not be indexed and, if necessary, target a specific bot.
- In addition, restrict pages with the robots meta tag, because search bots can still find pages hidden only with robots.txt (see the example after this list).
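For instance, a page that must stay out of search results can carry a robots meta tag in its <head> section (a generic example, not tied to any particular page):
<meta name="robots" content="noindex, nofollow">
Unlike a Disallow rule, this tag is read when the page is crawled and tells the bot not to add the page to the index.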
Syntax of the robots.txt file
The robots.txt file provides rules for web crawlers on how to interact with a website. It includes several directives:
User-agent: Specifies which web crawlers the rules apply to. Using * applies the rules to all crawlers. For example:
User-agent: *
Or for a specific crawler like Googlebot:
User-agent: Googlebot
Disallow: Lists the URLs or paths that crawlers are not allowed to crawl. For instance:
Disallow: /private/
Allow: Works alongside Disallow to permit crawling of certain URLs or paths within a disallowed directory. For example:
Disallow: /images/
Allow: /images/logo.png
Crawl-delay: Sets a delay between requests made by the crawler to the server, measured in seconds. This helps reduce server load, although not every crawler honors this directive. Example:
Crawl-delay: 10
Sitemap: Points to the XML sitemap of the website. Example:
Sitemap: https://www.example.com/sitemap.xml
Host: Specifies the preferred domain (main mirror) for the website, which matters for sites with mirror domains; this directive is recognized mainly by Yandex. Example:
Host: www.example.com
A complete robots.txt file might look like this:
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/
Crawl-delay: 5
Sitemap: https://www.example.com/sitemap.xml
Host: www.example.com
This example tells all crawlers to avoid crawling certain directories (/cgi-bin/, /tmp/, /private/), to wait 5 seconds between requests, and provides the site's sitemap and preferred domain.
Questions and answers about robots.txt
Difference between Sitemap and robots.txt
A sitemap is a file that shows search bots all the website pages available for indexing and indicates how often the content is updated. The robots.txt file, by contrast, does not list all the pages available for indexing; it contains the rules by which existing pages are crawled.
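A minimal sitemap entry, using example.com as a placeholder, looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/page.html</loc>
    <lastmod>2024-05-01</lastmod>
    <changefreq>weekly</changefreq>
  </url>
</urlset>
The sitemap lists the URLs to be crawled, while robots.txt describes how they may be crawled.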
What errors can occur when using the robots.txt file?
Incorrect use of the robots.txt file can block important pages from indexing. In addition, it is important to understand that robots.txt is not a reliable way to protect confidential information, as some crawlers may ignore it.
How to check that the robots.txt file is working properly?
You can use Google Search Console or Yandex.Webmaster to verify that your robots.txt file works properly; both services provide tools for checking it.
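You can also run a quick local check with Python's standard urllib.robotparser module. This is a minimal sketch; the domain and paths below are placeholders for your own site.
from urllib.robotparser import RobotFileParser

# Load the live robots.txt file (www.example.com is a placeholder domain).
parser = RobotFileParser()
parser.set_url("https://www.example.com/robots.txt")
parser.read()

# Check whether a given user agent is allowed to fetch a given URL.
print(parser.can_fetch("*", "https://www.example.com/private/page.html"))
print(parser.can_fetch("Googlebot", "https://www.example.com/images/logo.png"))
If can_fetch returns False for a URL that should stay open to bots, the rules are stricter than intended.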