Robots.txt Disallow: how to block indexing of unnecessary pages and how to find the pages that should be closed from indexing

When visiting a site, a search robot has a limited amount of resources for indexing. That is, it can download only a certain number of pages in one visit. Depending on the update frequency, volume, number of documents, and many other factors, robots may come more often and download more pages.

The more pages are downloaded, and the more often, the faster information from your site gets into the search results. Besides pages appearing in search sooner, changes to the content of documents also take effect faster.

Fast site indexing

Fast indexing of site pages helps fight the theft of unique content and lets pages rank thanks to their freshness and relevance. But most importantly, faster indexing allows you to track how particular changes affect the site's position in search results.

Poor, slow site indexing

Why is the site indexed badly? There can be many reasons, and here are the main reasons for the slow indexing of the site.

  • Website pages are loading slowly. This can cause the site to be completely excluded from the index.
  • The site is rarely updated. Why would a robot come often to a site where new pages appear once a month?
  • Non-unique content. If the site contains borrowed materials (articles, photos), the search engine will lower its trust in your site and reduce the resources allotted for indexing it.
  • A large number of pages. If the site has a lot of pages, it can take a very long time to index or re-index all of them.
  • Complex site structure. The intricate structure of the site and a large number of attachments make it very difficult to index the pages of the site.
  • Lots of extra pages. Every site has landing pages, whose content is static, unique, and useful to users, and side pages like login or filter pages. If such pages exist, there are usually a lot of them, but not all of them get indexed. And the ones that do end up in the index compete with the landing pages. All these pages are regularly re-indexed, consuming the already limited resource allocated for indexing your site.
  • Dynamic pages. If the site has pages whose content does not depend on dynamic parameters (example: site.ru/page.html?lol=1&wow=2&bom=3), many duplicates of the site.ru/page.html landing page may appear as a result.

There are other reasons for poor site indexing. However, the most common mistake is failing to do the following.

Remove everything unnecessary from indexing

There are many opportunities to rationally use the resources that search engines allocate for indexing a site. And it is robots.txt that opens wide opportunities for managing the indexing of the site.

Using the directives Allow, Disallow, Clean-param and others, you can effectively distribute not only the attention of the search robot, but also significantly reduce the load on the site.

First, you need to exclude everything unnecessary from indexing, using the Disallow directive.

For example, let's disable the login and registration pages:

Disallow: /login
Disallow: /register

Disable tag indexing:

Disallow: /tag

Some dynamic pages:

Disallow: /*?lol=1

Or all dynamic pages:

Disallow: /*?*

Or nullify pages with dynamic parameters:

Clean-param: lol&wow&bom /
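
Putting these directives together, a minimal robots.txt for such a cleanup might look like the sketch below (the paths and parameter names are just the placeholders from the examples above):

User-agent: *
Disallow: /login
Disallow: /register
Disallow: /tag
Clean-param: lol&wow&bom /    # lol, wow and bom are placeholder parameter names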

On many sites, the number of pages found by the robot may differ from the number of pages in the search by a factor of three or more. That is, more than 60% of the site's pages do not participate in the search and are ballast that must either be brought into the search or removed. By excluding non-landing pages and bringing the share of pages in search closer to 100%, you will see a significant increase in indexing speed, higher positions in search results, and more traffic.

Read more about site indexing, its impact on search results, site pages, other ways to speed up indexing, and the reasons for poor indexing in the following posts. Meanwhile:

Drop unnecessary ballast and go to the top faster.

How to prevent certain pages from being indexed?

Permissions and prohibitions for indexing are taken by all search engines from the robots.txt file located in the root directory of the server. A ban on indexing certain pages may be needed, for example, for reasons of secrecy or to avoid indexing the same documents in different encodings. The fewer documents on your server, the faster the robot will crawl it. Therefore, disallow in robots.txt all documents that make no sense to index (for example, statistics files or directory listings). Pay special attention to CGI or ISAPI scripts - our robot indexes them along with other documents.

In its simplest form (everything is allowed except the script directory), the robots.txt file looks like this:

User-agent: *
Disallow: /cgi-bin/

A detailed description of the file specification can be found in the robots.txt standard.

When writing robots.txt, pay attention to the following common mistakes:

1. The line with the User-agent field is required and must precede the lines with the Disallow field. For example, the following robots.txt file does not prohibit anything:

Disallow: /cgi-bin
Disallow: /forum

2. Blank lines in the robots.txt file are significant: they separate entries related to different robots. For example, in the following fragment of a robots.txt file, the line Disallow: /forum is ignored because there is no User-agent field line before it.

User-agent: *
Disallow: /cgi-bin

Disallow: /forum

3. A line with the Disallow field can prohibit indexing of documents with only one prefix. To disallow multiple prefixes, write multiple lines. For example, the file below prevents indexing of documents beginning with "/cgi-bin /forum" (including the space), which most likely do not exist - not documents with the prefixes /cgi-bin and /forum.

User-agent: *
Disallow: /cgi-bin /forum
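
To disallow both prefixes, give each one its own line:

User-agent: *
Disallow: /cgi-bin
Disallow: /forum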

4. Lines with the Disallow field take relative prefixes, not absolute ones. That is, the file

User-agent: *
Disallow: www.myhost.ru/cgi-bin

prohibits, for example, document indexing http://www.myhost.ru/www.myhost.ru/cgi-bin/counter.cgi, but does NOT prevent indexing of the document http://www.myhost.ru/cgi-bin/counter.cgi.
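
The relative form that actually blocks the script directory is simply:

User-agent: *
Disallow: /cgi-bin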

5. Lines with the Disallow field take prefixes, and nothing else. So the file:

User-agent: *
Disallow: *

prohibits indexing of documents starting with the character "*" (which do not exist in nature), and is very different from a file:

User-agent: *
Disallow: /

which prevents indexing of the entire site.

If you cannot create or modify the robots.txt file, all is not lost - just add the robots meta tag to the HTML code of your page (inside the <head> tag):

<meta name="robots" content="noindex">

Then this document will also not be indexed.

You can also use the tag

<meta name="robots" content="nofollow">

It means that the search engine robot should not follow links from this page.

To simultaneously prohibit page indexing and following links from it, use the tag

<meta name="robots" content="noindex, nofollow">

How to prevent certain parts of the text from being indexed?

To prevent certain pieces of text in a document from being indexed, mark them with the tags <noindex>…</noindex>.

Attention! The NOINDEX tag must not break the nesting of other tags. If you specify the following erroneous construct:

<NOINDEX>
…code1…
<TABLE>
…code2…
</NOINDEX>
…code3…
</TABLE>

the ban on indexing will include not only "code1" and "code2", but also "code3".

How to choose a master virtual host from multiple mirrors?

If your site is located on the same server (same IP), but is visible in the outside world under different names (mirrors, different virtual hosts), Yandex recommends that you select the name under which you want to be indexed. Otherwise, Yandex will choose the main mirror on its own, and other names will be prohibited from indexing.

In order for the mirror you have chosen to be indexed, it is enough to prohibit indexing of all the other mirrors using robots.txt. This can be done with the non-standard robots.txt extension - the Host directive, specifying the name of the main mirror as its parameter. If www.glavnoye-zerkalo.ru is the main mirror, then robots.txt should look something like this:

User-agent: *
Disallow: /forum
Disallow: /cgi-bin
Host: www.glavnoye-zerkalo.ru

For compatibility with robots that do not fully follow the standard when processing robots.txt, the Host directive must be added in the group starting with the User-Agent entry, immediately after the Disallow entries.

The argument of the Host directive is a domain name followed by a port number (80 by default), separated by a colon. If a site is not specified as the argument of any Host directive, then for it the directive Disallow: / is implied, i.e. a complete ban on indexing (provided the group contains at least one correct Host directive). So robots.txt files of the form

User-agent: *
Host: www.myhost.ru

User-agent: *
Host: www.myhost.ru:80

are equivalent and forbid indexing of both www.otherhost.ru and www.myhost.ru:8080.

The parameter of the Host directive must consist of a single valid host name (i.e. one that complies with RFC 952 and is not an IP address) and a valid port number. Incorrect Host lines are ignored.

# Examples of ignored Host directives
Host: www.myhost-.ru
Host: www.- myhost.ru
Host: www.myhost.ru:0
Host: www.my_host.ru
Host: . my-host.com:8000
Host: my-host.ru.
Host: my..host.ru
Host: www.myhost.ru/
Host: www.myhost.ru:8080/
Host: http://www.myhost.ru
Host: www.mysi.te
Host: 213.180.194.129
Host: www.firsthost.ru, www.secondhost.ru
Host: www.firsthost.ru www.secondhost.ru

If you have an Apache server, instead of using the Host directive, you can set robots.txt using SSI directives:


User-agent: *
Disallow: /

In this file, the robot is prohibited from bypassing all hosts except www.main_name.ru
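
That selectivity comes from a server-side conditional that wraps these directives, so that only non-main hosts are served the ban. A minimal sketch, assuming Apache's mod_include (classic SSI expression syntax) and the placeholder host www.main_name.ru:

<!--#if expr="'${HTTP_HOST}' != 'www.main_name.ru'" -->
# served only on mirrors other than the main host
User-agent: *
Disallow: /
<!--#endif -->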

To enable SSI, see your server's documentation or contact your system administrator. You can check the result by simply querying the pages:

http://www.main_name.ru/robots.txt
http://www.other_name.ru/robots.txt etc. The results should be different.

Recommendations for the web server Russian Apache

In robots.txt on sites with Russian Apache, all encodings, except for the main one, should be prohibited for robots.

If the encodings are split across ports (or servers), then DIFFERENT robots.txt files must be served on the different ports (servers). Namely, all robots.txt files for all ports/servers except the "main" one should contain:

User-agent: *
Disallow: /

To do this, you can use the SSI mechanism described above.

If encodings in your Apache are distinguished by the names of "virtual" directories, then you need to write one robots.txt, which should contain approximately the following lines (depending on the names of the directories):

User-agent: *
Disallow: /dos
Disallow: /mac
Disallow: /koi


Issues discussed in the material:

  • What role does the robots.txt file play in site indexing
  • How to disable the indexing of the site and its individual pages using robots.txt
  • What robots.txt directives are used for site indexing settings
  • What are the most common mistakes made when creating a robots.txt file

The web resource is ready for work: it is filled with high-quality unique texts and original images, navigation through the sections is convenient, and the design is pleasing to the eye. It remains only to present your brainchild to Internet users. But search engines should be the first to get acquainted with the portal. This process of getting acquainted is called indexing, and one of the main roles in it is played by the text file robots.txt. For the site to be indexed successfully, robots.txt must meet a number of specific requirements.



The web resource's engine (CMS) is one of the factors that significantly affect the speed of indexing by search spiders. Why is it important to direct crawlers only to the important pages that should appear in SERPs?

  1. The search engine robot looks at a limited number of files on a particular resource, and then goes to the next site. In the absence of specified restrictions, the search spider can start by indexing engine files, the number of which is sometimes in the thousands - the robot will simply not have time for the main content.
  2. Or it will index completely different pages than the ones you plan to promote. Even worse, search engines may see the duplicate content they hate so much, when different links lead to the same (or almost identical) text or image.

Therefore, forbidding search engine spiders from seeing too much is a necessity. This is what robots.txt is intended for - a regular text file whose name is written in lowercase letters without capitals. It is created in any text editor (Notepad++, SciTE, VEdit, etc.) and edited there as well. The file allows you to influence the indexing of the site by Yandex and Google.

For a programmer who does not yet have sufficient experience, it is better to first study examples of correctly filled files. Select web resources of interest and type site.ru/robots.txt in the browser's address bar (where the part before "/" is the name of the portal).

It is important to look only at sites running on the engine you are interested in, since the CMS folders that must be closed from indexing are named differently in different content management systems. Therefore, the engine is the starting point. If your site is powered by WordPress, you need to look at blogs running on the same engine; Joomla! will have its own ideal robots.txt, and so on. At the same time, it is advisable to take as samples files from portals that attract significant search traffic.

What is site indexing with robots.txt



Search indexing is the most important indicator, on which the success of promotion largely depends. It may seem that the site was built perfectly: user requests are taken into account, the content is excellent, navigation is convenient - and yet the site cannot make friends with search engines. The reasons must be sought on the technical side, specifically in the tools with which you can influence indexing.

There are two of them - Sitemap.xml and robots.txt. Important files that complement each other and at the same time solve polar problems. The sitemap invites the spiders to "Welcome, please index all of these sections" by giving the bots the URL of each page to be indexed and the time it was last updated. The robots.txt file, on the other hand, serves as a stop sign, preventing spiders from crawling through any part of the site.

This file and the similarly named robots meta tag, which allows for finer settings, contain clear instructions for search engine crawlers, indicating prohibitions on indexing certain pages or entire sections.

Correctly set limits have the best effect on the indexing of the site. Although there are still amateurs who believe that bots can be allowed to study absolutely all files, the number of pages entered into the search engine database does not by itself mean high-quality indexing. Why, for example, would robots need the administrative and technical parts of the site, or print versions of pages (convenient for the user, but seen by search engines as duplicate content)? There are a lot of pages and files on which bots waste their time for nothing.

When a spider visits your site, it immediately looks for the robots.txt file intended for it. Having not found a document or finding it in an incorrect form, the bot starts to act independently, indexing literally everything in a row according to an algorithm known only to it. It doesn't necessarily start with new content that you would like to notify users about first. At best, indexing will simply drag on, at worst, it can also result in penalties for duplicates.

Having a proper robots text file will avoid many problems.



There are three ways to prevent indexing of sections or pages of a web resource, from the most granular to the most global:

  • The noindex tag and the nofollow attribute are completely different code elements that serve different purposes, but are equally valuable helpers for SEO specialists. The question of how search engines process them has become almost philosophical, but the fact remains: noindex allows you to hide part of the text from robots (it is not in the HTML standards, but it definitely works for Yandex), and nofollow prohibits following a link and passing its weight (it is part of the standard and is honored by all search engines).
  • The robots meta tag on a particular page affects that particular page. Below we will take a closer look at how to indicate in it the prohibition of indexing and following links located in the document. The meta tag is completely valid, systems take into account (or try to take into account) the specified data. Moreover, Google, choosing between robots in the form of a file in the root directory of the site and the page meta tag, gives priority to the latter.
  • robots.txt - this method is completely valid and is supported by all search engines and other bots living on the Web. Nevertheless, its directives are not always regarded as an order to be executed (the non-authoritativeness for Google was mentioned above). The indexing rules specified in the file apply to the site as a whole: individual pages, directories, sections.

Using examples, consider a ban on indexing the portal and its parts.



1. How to block the entire site from indexing using robots.txt

There are many reasons to stop spiders from indexing a website: it is still under development, it is being redesigned or upgraded, or the resource is an experimental platform not intended for users.

A site can be blocked from indexing by robots.txt for all search engines, for an individual robot, or it can be banned for all but one.
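
Sketches of the three variants (Yandex and Googlebot are used below simply as examples of real bot names):

# Block the site for all search engines:
User-agent: *
Disallow: /

# Block the site for one robot only:
User-agent: Yandex
Disallow: /

# Block the site for everyone except one robot:
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /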

2. How to block individual pages from indexing using robots.txt

If the resource is small, then it is unlikely that you will need to hide pages (what is there to hide on a business card site), and large portals containing a substantial amount of service information cannot do without prohibitions. It is necessary to close from robots:

  • administrative panel;
  • service directories;
  • site search;
  • Personal Area;
  • registration forms;
  • order forms;
  • comparison of goods;
  • favorites;
  • basket;
  • captcha;
  • pop-ups and banners;
  • session IDs.

Irrelevant news and events, calendar events, promotions, special offers - these are the so-called garbage pages that are best hidden. It is also better to close outdated content on information sites in order to prevent negative ratings from search engines. Try to keep the updates regular - then you won't have to play hide-and-seek with search engines.

Prohibition of robots for indexing:



In robots.txt, you can specify complete or selective prohibitions on indexing folders, files, scripts, and utm tags; the prohibitions can be addressed to individual search spiders or to the robots of all systems.
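
For example, a selective set of prohibitions addressed to all robots might look like the following sketch (the folder, file, and parameter names are placeholders):

User-agent: *
Disallow: /cgi-bin/        # a folder with scripts
Disallow: /old-page.html   # a single file
Disallow: /*utm*=          # links with utm tags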

Prohibition of indexing:

The robots meta tag serves as an alternative to the text file of the same name. It is written in the source code of the web page (in the index.html file) and placed in the <head> container. It is necessary to specify who may not index the site: if the ban is general, write robots; if entry is denied to only one crawler, specify its name (Google - Googlebot, Yandex - Yandex).

There are two options for writing a meta tag.
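
For instance, a general ban and a ban addressed to a single crawler can be written as follows (a minimal sketch; googlebot is the name Google recognizes in this meta tag):

<!-- applies to all robots -->
<meta name="robots" content="noindex, nofollow">
<!-- applies only to Google's crawler -->
<meta name="googlebot" content="noindex, nofollow">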

The "content" attribute can have the following values:

  • none - prohibition of indexing (including noindex and nofollow);
  • noindex - prohibition of content indexing;
  • nofollow - prohibition on following links from the page;
  • follow - permission to follow links from the page;
  • index - allow content indexing;
  • all - allow content and links to be indexed.

For different cases you need to use combinations of values. For example, to disable content indexing while allowing bots to follow the links, use: content="noindex, follow".


By closing the website from search engines through meta tags, the owner does not need to create robots.txt at the root.

It must be remembered that when it comes to indexing, a lot depends on the "politeness" of the spider. If it is "well-mannered", the rules prescribed by the webmaster will be respected. But in general, the validity of robots directives (both the file and the meta tag) does not mean they are followed one hundred percent. Even for search engines, not every ban is ironclad, to say nothing of the various content thieves, who are set up from the start to circumvent all prohibitions.

In addition, not all crawlers are interested in content. For some, only links are important, for others - micro-markup, others check mirror copies of sites, and so on. At the same time, system spiders do not crawl around the site at all, like viruses, but remotely request the necessary pages. Therefore, most often they do not create any problems for resource owners. But, if mistakes were made during the design of the robot or some external non-standard situation occurred, the crawler can significantly load the indexed portal.



Commands used:

1. "User-agent:"

The main directive of the robots.txt file. It is used for addressing: the name of the bot for which the following instructions are intended is entered here. For example:

  • User-agent: Googlebot - a directive in this form means that all the following commands concern only Google's indexing robot;
  • User-agent: Yandex - the prescribed permissions and prohibitions are intended for the Yandex robot.

The entry User-agent: * means an address to all other search engines (the special character "*" stands for "any text"). Taking the example above into account, the asterisk designates all search engines except Yandex, which is addressed personally; Google in that case is content with the general "any text" designation.


2. "Disallow:"

The most common command, used to prohibit indexing. After addressing the robot in "User-agent:", the webmaster indicates that the bot is not allowed to index part of the site or the whole site (in the latter case the path from the root is indicated). The search spider understands this from how the command is expanded. Let's figure it out too.

User-agent: Yandex

Disallow: /

If there is such an entry in robots.txt, the Yandex search bot understands that it cannot index the web resource at all: there are no clarifications after the forbidding sign "/".

User-agent: Yandex

Disallow: /wp-admin

In this example, there are clarifications: the ban on indexing applies only to the system folder wp-admin(the site is powered by WordPress). The Yandex robot sees the command and does not index the specified folder.

User-agent: Yandex

Disallow: /wp-content/themes

This directive tells the crawler that it may index all the content of "wp-content" except "themes" - and that is exactly what the robot will do.

User-agent: Yandex

Disallow: /index$

Here another important symbol, "$", appears, which allows for flexibility in prohibitions. In this case, the robot understands that it is not allowed to index pages whose URLs end exactly in "index". A separate file with the similar name "index.php" can still be indexed, and the robot clearly understands this.

You can enter a ban on the indexing of individual pages of the resource, the links of which contain certain characters. For example:

User-agent: Yandex

Disallow: /*&*

The Yandex robot reads the command this way: do not index all pages with URLs containing "&" between any other characters.

User-agent: Yandex

Disallow: /*&$

In this case, the robot understands that pages cannot be indexed only if their addresses end with "&".

Why system files, archives, and users' personal data should not be indexed is, we think, clear - this is not up for discussion. There is no need for a search bot to waste time checking data that nobody needs. But with regard to bans on page indexing, many people ask: why are prohibitive directives expedient at all? Experienced developers can give a dozen different reasons, but the main one is the need to get rid of duplicate pages in the search. Duplicates dramatically hurt ranking, relevance, and other important metrics. Therefore, internal SEO optimization is unthinkable without robots.txt, in which dealing with duplicates is quite simple: you just need to use the "Disallow:" directive and the special characters correctly.
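
For instance, duplicates created by sorting or tracking parameters can be cut off with wildcard rules along these lines (the parameter names are just examples):

User-agent: *
Disallow: /*?sort=    # duplicate listings produced by sorting
Disallow: /*utm*=     # duplicates produced by utm tags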

3. "Allow:"



The magic robots file allows you not only to hide unnecessary things from search engines, but also to open the site up for indexing. A robots.txt containing the command "Allow:" tells search engine spiders which elements of the web resource must be added to the database. The same clarifications as in the previous command come to the rescue, only now they expand the range of permissions for crawlers.

Let's take one of the examples given in the previous paragraph and see how the situation changes:

User-agent: Yandex

Allow: /wp-admin

If "Disallow:" meant a ban, then now the contents of the system folder wp-admin becomes the property of Yandex legally and may appear in search results.

But in practice, this command is rarely needed, and for a perfectly logical reason: the absence of a ban set with "Disallow:" lets search spiders treat the entire site as allowed for indexing, so no separate directive is required for that. If there are prohibitions, the content that does not fall under them is indexed by robots by default.
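
Where "Allow:" does earn its place is in reopening a subsection inside a disallowed one. A minimal sketch, assuming WordPress-style paths:

User-agent: *
Disallow: /wp-content/          # block the whole wp-content folder
Allow: /wp-content/uploads/     # but keep the uploads subfolder open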



Two more important commands for search spiders. "Host:" is a directive aimed at the domestic search engine: Yandex is guided by it when determining the main mirror of a web resource, whose address (with or without www) will participate in the search.

Consider the example of PR-CY.ru:

User-agent: Yandex

Host: pr-cy.ru

The directive is used to avoid duplication of resource content.

The "Sitemap:" command helps robots find their way to the site map - a special file that represents the hierarchical structure of pages, content types, information about update frequency, and so on. The file sitemap.xml (on the WordPress engine, also sitemap.xml.gz) serves as a navigator for search spiders, and they need to reach it as quickly as possible. Then indexing speeds up not only for the site map itself, but for all the other pages, which will then appear in search results without delay.

Hypothetical example:
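
Assuming a placeholder domain, such an entry is a single line:

Sitemap: http://site.ru/sitemap.xml    # site.ru is a placeholder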

Commands that are indicated in the robots text file and are accepted by Yandex:

Directive - What it does

User-agent* - Names the search spider for which the rules listed in the file are written.

Disallow - Indicates a ban for robots on indexing the site, its sections, or individual pages.

Sitemap - Specifies the path to the sitemap hosted on the web resource.

Clean-param - Tells the search spider that the page URL includes parameters that should not be indexed (such as UTM tags).

Allow - Gives permission to index sections and pages of a web resource.

Crawl-delay - Allows you to delay scanning. Indicates the minimum time (in seconds) for the crawler between page loads: after checking one page, the spider waits the specified amount of time before requesting the next page from the list.

*Required directive.
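
For instance, a Crawl-delay entry asking the Yandex robot to pause between requests might look like this (the two-second value is arbitrary):

User-agent: Yandex
Crawl-delay: 2    # wait at least 2 seconds between page requests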

The Disallow, Sitemap, and Clean-param commands are the most commonly requested. Let's look at an example:

  • User-agent: * #indicating the robots to which the following commands are intended.
  • Disallow: /bin/ # Prevent indexers from crawling links from the Shopping Cart.
  • Disallow: /search/ # disallow indexing of search pages on the site.
  • Disallow: /admin/ # disallow crawling of the admin panel.
  • Sitemap: http://example.com/sitemap # indicates the path to the sitemap for the crawler.
  • Clean-param: ref /some_dir/get_book.pl

Recall that the above interpretations of directives are relevant for Yandex - spiders of other search engines can read commands differently.



The theoretical base is created - it is time to create an ideal (well, or very close to it) text file robots. If the site is running on an engine (Joomla!, WordPress, etc.), it is supplied with a mass of objects, without which normal operation is impossible. But there is no informative component in such files. In most CMS, the content storage is the database, but the robots cannot get to it. And they continue to look for content in engine files. Accordingly, the time allocated for indexing is wasted.

It is very important to strive for unique content on your web resource, carefully monitoring the occurrence of duplicates. Even a partial repetition of the site's information content does not have the best effect on its evaluation by search engines. If the same content is available at different URLs, that is also considered duplication.

The two main search engines, Yandex and Google, will inevitably reveal duplication during crawling and artificially lower the position of the web resource in the search results.

Don't forget a great tool that helps deal with duplication - the canonical meta tag. By writing the preferred URL in it, the webmaster indicates to the search spider which page should be treated as the canonical one for indexing.

For example, a page with pagination https://ktonanovenkogo.ru/page/2 contains the Canonical meta tag pointing to https://ktonanovenkogo.ru , which eliminates problems with duplicate headers.
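
In HTML this is a link element placed in the head of the paginated page; a minimal sketch using the addresses from the example above:

<!-- placed in the <head> of https://ktonanovenkogo.ru/page/2 -->
<link rel="canonical" href="https://ktonanovenkogo.ru">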

So, we put together all the theoretical knowledge gained and proceed to their practical implementation in robots.txt for your web resource, the specifics of which must be taken into account. What is required for this important file:

  • text editor (Notepad or any other) for writing and editing robots;
  • a tester who will help find errors in the created document and check the correctness of indexing bans (for example, Yandex.Webmaster);
  • An FTP client that simplifies uploading the finished and verified file to the root of the web resource (if the site is running on WordPress, robots.txt is most often stored in the public_html folder).

The first thing a search crawler does is request a file created specifically for it and located at the URL "/robots.txt".

A web resource can contain a single file "/robots.txt". No need to put it in custom subdirectories where spiders won't look for the document anyway. If you want to create robots in subdirectories, you need to remember that you still need to collect them into a single file in the root folder. Using the "Robots" meta tag is more appropriate.

URLs are case sensitive - remember that "/robots.txt" is not capitalized.

Now you need to be patient and wait for the search spiders, who will first examine your properly created, correct robots.txt and start crawling your web portal.

Correct setting of robots.txt for indexing sites on different engines

If you have a commercial resource, then the creation of the robots file should be entrusted to an experienced SEO specialist. This is especially important if the project is complex. For those who are not ready to accept what has been said for an axiom, let us explain: this important text file has a serious impact on the indexing of the resource by search engines, the speed of processing the site by bots depends on its correctness, and the content of robots has its own specifics. The developer needs to take into account the type of site (blog, online store, etc.), engine, structural features and other important aspects that a novice master may not be able to do.

At the same time, you need to make the most important decisions: what to close from crawling, what to leave visible to crawlers so that the pages appear in the search. It will be very difficult for an inexperienced SEO to cope with such a volume of work.


1. Robots.txt, an example for WordPress

User-agent: * # general rules for robots, except for Yandex and Google, which have their own groups below

Disallow: /cgi-bin # hosting folder
Disallow: /? # all query parameters on the main page
Disallow: /wp- # all WP files: /wp-json/, /wp-includes, /wp-content/plugins
Disallow: /wp/ # if there is a /wp/ subdirectory where the CMS is installed (if not, the rule can be removed)
Disallow: *?s= # search
Disallow: *&s= # search
Disallow: /search/ # search
Disallow: /author/ # author archive
Disallow: /users/ # user archive
Disallow: */trackback # trackbacks, notifications in comments about an open link to an article
Disallow: */feed # all feeds
Disallow: */rss # rssfeed
Disallow: */embed # all embeds
Disallow: */wlwmanifest.xml # Windows Live Writer manifest xml file (can be removed if not used)
Disallow: /xmlrpc.php # WordPress API file
Disallow: *utm*= # links with utm tags
Disallow: *openstat= # links with openstat tags
Allow: */uploads # open folder with uploads files
Sitemap: http://site.ru/sitemap.xml # sitemap address

User-agent: GoogleBot # rules for Google

Disallow: /cgi-bin

Disallow: /wp-
Disallow: /wp/
Disallow: *?s=
Disallow: *&s=
Disallow: /search/
Disallow: /author/
Disallow: /users/
Disallow: */trackback
Disallow: */feed
Disallow: */rss
Disallow: */embed
Disallow: */wlwmanifest.xml
Disallow: /xmlrpc.php
Disallow: *utm*=
Disallow: *openstat=
Allow: */uploads # open the folder with uploads files
Allow: /*/*.js # open js scripts inside /wp- (/*/ - for priority)
Allow: /*/*.css # open css files inside /wp- (/*/ - for priority)
Allow: /wp-*.png # images in plugins, cache folder, etc.
Allow: /wp-*.jpg # images in plugins, cache folder, etc.
Allow: /wp-*.jpeg # pictures in plugins, cache folder, etc.
Allow: /wp-*.gif # pictures in plugins, cache folder, etc.
Allow: /wp-admin/admin-ajax.php # used by plugins to not block JS and CSS

User-agent: Yandex # rules for Yandex

Disallow: /cgi-bin

Disallow: /wp-
Disallow: /wp/
Disallow: *?s=
Disallow: *&s=
Disallow: /search/
Disallow: /author/
Disallow: /users/
Disallow: */trackback
Disallow: */feed
Disallow: */rss
Disallow: */embed
Disallow: */wlwmanifest.xml
Disallow: /xmlrpc.php
Allow: */uploads
Allow: /*/*.js
Allow: /*/*.css
Allow: /wp-*.png
Allow: /wp-*.jpg
Allow: /wp-*.jpeg
Allow: /wp-*.gif
Allow: /wp-admin/admin-ajax.php
Clean-Param: utm_source&utm_medium&utm_campaign # Yandex recommends not closing these from indexing but removing the tag parameters; Google does not support such rules
Clean-Param: openstat # similar



2. Robots.txt, an example for Joomla

User-agent: *
Disallow: /administrator/
Disallow: /cache/
Disallow: /includes/
Disallow: /installation/
Disallow: /language/
Disallow: /libraries/
Disallow: /media/
Disallow: /modules/
Disallow: /plugins/
Disallow: /templates/
Disallow: /tmp/
Disallow: /xmlrpc/
Sitemap: http://path of your XML sitemap



3. Robots.txt, an example for 1C-Bitrix

User-agent: *
Disallow: /*index.php$
Disallow: /bitrix/
Disallow: /auth/
Disallow: /personal/
Disallow: /upload/
Disallow: /search/
Disallow: /*/search/
Disallow: /*/slide_show/
Disallow: /*/gallery/*order=*
Disallow: /*?print=
Disallow: /*&print=
Disallow: /*register=
Disallow: /*forgot_password=
Disallow: /*change_password=
Disallow: /*login=
Disallow: /*logout=
Disallow: /*auth=
Disallow: /*?action=
Disallow: /*action=ADD_TO_COMPARE_LIST
Disallow: /*action=DELETE_FROM_COMPARE_LIST
Disallow: /*action=ADD2BASKET
Disallow: /*action=BUY
Disallow: /*bitrix_*=
Disallow: /*backurl=*
Disallow: /*BACKURL=*
Disallow: /*back_url=*
Disallow: /*BACK_URL=*
Disallow: /*back_url_admin=*
Disallow: /*print_course=Y
Disallow: /*COURSE_ID=
Disallow: /*?COURSE_ID=
Disallow: /*?PAGEN
Disallow: /*PAGEN_1=
Disallow: /*PAGEN_2=
Disallow: /*PAGEN_3=
Disallow: /*PAGEN_4=
Disallow: /*PAGEN_5=
Disallow: /*PAGEN_6=
Disallow: /*PAGEN_7=


Disallow: /*PAGE_NAME=search
Disallow: /*PAGE_NAME=user_post
Disallow: /*PAGE_NAME=detail_slide_show
Disallow: /*SHOWALL
Disallow: /*show_all=
Sitemap: http://path of your XML sitemap



4. Robots.txt, an example for MODX

User-agent: *
Disallow: /assets/cache/
Disallow: /assets/docs/
Disallow: /assets/export/
Disallow: /assets/import/
Disallow: /assets/modules/
Disallow: /assets/plugins/
Disallow: /assets/snippets/
Disallow: /install/
Disallow: /manager/
Sitemap: http://site.ru/sitemap.xml

5. Robots.txt, an example for Drupal

User-agent: *
Disallow: /database/
Disallow: /includes/
Disallow: /misc/
Disallow: /modules/
Disallow: /sites/
Disallow: /themes/
Disallow: /scripts/
Disallow: /updates/
Disallow: /profiles/
Disallow: /profile
Disallow: /profile/*
Disallow: /xmlrpc.php
Disallow: /cron.php
Disallow: /update.php
Disallow: /install.php
Disallow: /index.php
Disallow: /admin/
Disallow: /comment/reply/
Disallow: /contact/
Disallow: /logout/
Disallow: /search/
Disallow: /user/register/
Disallow: /user/password/
Disallow: *register*
Disallow: *login*
Disallow: /top-rated-
Disallow: /messages/
Disallow: /book/export/
Disallow: /user2userpoints/
Disallow: /myuserpoints/
Disallow: /tagadelic/
Disallow: /referral/
Disallow: /aggregator/
Disallow: /files/pin/
Disallow: /your-votes
Disallow: /comments/recent
Disallow: /*/edit/
Disallow: /*/delete/
Disallow: /*/export/html/
Disallow: /taxonomy/term/*/0$
Disallow: /*/edit$
Disallow: /*/outline$
Disallow: /*/revisions$
Disallow: /*/contact$
Disallow: /*downloadpipe
Disallow: /node$
Disallow: /node/*/track$

Disallow: /*?page=0
Disallow: /*section
Disallow: /*order
Disallow: /*?sort*
Disallow: /*&sort*
Disallow: /*votesupdown
Disallow: /*calendar
Disallow: /*index.php
Allow: /*?page=

Sitemap: http://path to your XML sitemap

ATTENTION! Site content management systems are constantly updated, so the robots file may also change: additional pages or groups of files may be closed, or, conversely, opened for indexing. It depends on the goals of the web resource and the current engine changes.

7 common mistakes when indexing a site using robots.txt



Errors made when creating the file cause robots.txt to work incorrectly or even make it stop working altogether.

What errors are possible:

  • Logical (the specified rules conflict with each other). You can identify this type of error during testing in Yandex.Webmaster and the Google Robots Testing Tool.
  • Syntactic (directives are written with errors).

More common than others are:

  • entries written without regard to case;
  • capital letters used in the file name;
  • all rules are listed on one line;
  • rules are not separated by an empty line;
  • specifying the crawler in the directive;
  • each file of the folder that needs to be closed is listed separately;
  • the mandatory Disallow directive is missing.

Consider common mistakes, their consequences and, most importantly, measures to prevent them on your web resource.

  1. File location. The URL of the file should be of the following form: http://site.ru/robots.txt (instead of site.ru, the address of your site is listed). The robots.txt file is based exclusively in the root folder of the resource - otherwise, search spiders will not see it. Without getting banned, they will crawl the entire site and even those files and folders that you would like to hide from search results.
  2. Case sensitive. No capital letters. http://site.ru/Robots.txt is wrong. In this case, the search engine robot will receive a 404 (error page) or 301 (redirect) as a server response. Crawling will take place without taking into account the directives indicated in robots. If everything is done correctly, the server response is code 200, in which the owner of the resource will be able to control the search crawler. The only correct option is "robots.txt".
  3. Opening in the browser. Search spiders will only be able to correctly read and use the directives of robots.txt if it opens as a plain page in the browser. It is important to pay close attention to the server side of the engine: sometimes a file of this type is offered for download instead. In that case, fix how it is served - otherwise the robots will crawl the site as they please.
  4. Prohibition and permission errors."Disallow" - a directive to prohibit scanning of the site or its sections. For example, you need to prevent robots from indexing pages with search results on the site. In this case, the robots.txt file should contain the line: "Disallow: /search/". The crawler understands that all pages where "search" occurs are prohibited from crawling. With a total ban on indexing, Disallow: / is written. But the allowing directive "Allow" is not necessary in this case. Although it is not uncommon for a command to be written like this: “Allow:”, assuming that the robot will perceive this as permission to index “nothing”. You can allow the entire site to be indexed through the "Allow: /" directive. There is no need to confuse commands. This leads to crawling errors by spiders, which eventually add pages that are absolutely not the ones that should be promoted.
  5. Conflicting directives. Disallow: and Allow: for the same page are sometimes found in robots.txt, which makes crawlers prioritize the allowing directive. For example, the section was initially opened for crawling by spiders; then, for some reason, it was decided to hide it from the index. A ban is added to robots.txt, but the webmaster forgets to remove the permission. For search engines the ban is then not decisive: they prefer to index the page, bypassing the mutually exclusive commands.
  6. The Host directive. Recognized only by Yandex spiders and used to determine the main mirror. A useful command, but, alas, it appears erroneous or unknown to all other search engines. When using it in your robots.txt, it is best to specify a general User-agent group for all robots and a separate Yandex group, in which the Host command can be written:

    User-agent: Yandex
    Host: site.ru

    The directive prescribed for all crawlers will be perceived by them as erroneous.

  7. Sitemap directive:. With the help of a sitemap, bots find out what pages are on a web resource. A very common mistake is developers not paying attention to the location of the sitemap.xml file, although it determines the list of URLs included in the map. By placing the file outside the root folder, the developers themselves put the site at risk: crawlers incorrectly determine the number of pages, as a result, important parts of the web resource are not included in the search results.

For example, by placing a Sitemap file in a directory at the URL http://primer.ru/catalog/sitemap.xml , you can include any URLs starting with http://primer.ru/catalog/ ... And URLs like, say, http://primer.ru/images/ ... should not be included in the list.

Summarize. If the site owner wants to influence the process of indexing a web resource by search bots, the robots.txt file is of particular importance. It is necessary to carefully check the created document for logical and syntactical errors, so that in the end the directives work for the overall success of your site, ensuring high-quality and fast indexing.

How to avoid errors by creating the correct robots.txt structure for site indexing



The structure of robots.txt is clear and simple, it is quite possible to write the file yourself. You just need to carefully monitor the syntax that is extremely important for robots. Search bots follow the directives of the document voluntarily, but search engines interpret the syntax differently.

A list of the following mandatory rules will help eliminate the most common mistakes when creating robots.txt. To write the right document, you should remember that:

  • each directive starts on a new line;
  • in one line - no more than one command;
  • a space cannot be placed at the beginning of a line;
  • the command parameter must be on one line;
  • directive parameters do not need to be quoted;
  • command parameters do not require a semicolon at the end;
  • the directive in robots.txt is specified in the format: [command_name]:[optional space][value][optional space];
  • after the pound sign # comments are allowed in robots.txt;
  • an empty string can be interpreted as the end of the User-agent command;
  • the prohibiting directive with an empty value - "Disallow:" is similar to the directive "Allow: /" that allows scanning the entire site;
  • "Allow", "Disallow" directives can contain no more than one parameter. Each new parameter is written on a new line;
  • only lowercase letters are used in the name of the robots.txt file. Robots.txt or ROBOTS.TXT - erroneous spellings;
  • The robots.txt standard does not regulate case sensitivity, but files and folders are often sensitive in this matter. Therefore, although it is acceptable to use capital letters in the names of commands and parameters, this is considered bad form. It is better not to get carried away with the upper case;
  • when the command parameter is a folder, a slash "/" is required before the name, for example: Disallow: /category;
  • if the robots.txt file weighs more than 32 KB, search bots perceive it as equivalent to "Disallow:" and consider it completely allowing indexing;
  • unavailability of robots.txt (for whatever reason) can be perceived by crawlers as the absence of any crawling prohibitions;
  • empty robots.txt is regarded as allowing indexing of the site as a whole;
  • if multiple "User-agent" commands are listed without a blank line between them, search spiders may treat the first directive as the only one, ignoring all subsequent "User-agent" directives;
  • robots.txt does not allow the use of any symbols of national alphabets.

The above rules are not relevant for all search engines, because they interpret the robots.txt syntax differently. For example, "Yandex" selects entries by the presence in the "User-agent" line, so it does not matter for it the presence of an empty line between different "User-agent" directives.

In general, robots should contain only what is really needed for proper indexing. No need to try to embrace the immensity and fit the maximum data into the document. The best robots.txt is a meaningful file, the number of lines doesn't matter.

The robots text document should be checked for correct structure and syntax with the help of services available on the Web. To do this, you first need to upload robots.txt to the root folder of your site, otherwise the service may report that it was unable to load the required document. It is also recommended to check beforehand that the file is available at its address (your_site.ru/robots.txt).

The largest search engines Yandex and Google offer their website analysis services to webmasters. One of the aspects of analytical work is robots check:

There are a lot of online robots.txt validators on the Internet, you can choose any one you like.


The robots.txt file is a set of directives (a set of rules for robots) with which you can prevent or allow search robots to index certain sections and files of your site, as well as provide additional information. Initially, with the help of robots.txt, it was really only possible to prohibit the indexing of sections, the ability to allow indexing appeared later, and was introduced by the search leaders Yandex and Google.

The structure of the robots.txt file

First, the User-agent directive is written, which shows which crawler the instructions refer to.

A small list of well-known and commonly used User-agents:

  • User-agent: *
  • User-agent: Yandex
  • User-agent: Googlebot
  • User-agent: Bingbot
  • User-agent: YandexImages
  • User-agent: Mail.RU

Next, the Disallow and Allow directives are specified, which prohibit or allow indexing of sections, individual pages of the site or files, respectively. Then we repeat these steps for the next User-agent. At the end of the file, the Sitemap directive is specified, where the address of your sitemap is specified.

When writing the Disallow and Allow directives, you can use the special characters * and $. Here * means "any sequence of characters" and $ means "end of line". For example, Disallow: /admin/*.php means that indexing is prohibited for all files in the admin folder whose names end with .php; Disallow: /admin$ prohibits the /admin address itself, but does not prohibit /admin.php or /admin/new/, if they exist.

If all User-agents use the same set of directives, there is no need to duplicate this information for each of them, User-agent: * will suffice. In the case when it is necessary to supplement information for some of the user-agent, you should duplicate the information and add a new one.

Example robots.txt for WordPress:
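
A minimal sketch of such a file (the paths assume a standard WordPress install; site.ru is a placeholder domain):

User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
Allow: /wp-admin/admin-ajax.php

User-agent: Yandex
Disallow: /wp-admin/
Disallow: /wp-includes/
Allow: /wp-admin/admin-ajax.php
Clean-param: utm_source&utm_medium&utm_campaign

Sitemap: http://site.ru/sitemap.xml    # placeholder address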

*Note for User-agent: Yandex

Check robots.txt

Old version of Search Console

To check the correctness of robots.txt, you can use Webmaster tools from Google: go to the "Crawl" section, then "Fetch as Google", and click the "Fetch and render" button. As a result of the scan, two screenshots of the site will be presented, showing how users see the site and how search robots see it. Below you will see a list of files whose blocking from indexing prevents search robots from reading your site correctly (they will need to be opened for indexing for the Google robot).

Usually these are various style files (css), JavaScript, and images. After you allow these files to be indexed, both screenshots in Webmaster should become identical. The exceptions are files that are located remotely, for example the Yandex.Metrica script, social network buttons, etc. You will not be able to prohibit or allow them for indexing. For more information on how to resolve the "Googlebot cannot access CSS and JS files on the site" error, read our blog.

New version of Search console

In the new version there is no separate menu item for checking robots.txt. Now it is enough to simply paste the address of the desired page into the search bar.

In the next window, click "Examine the scanned page."

In the window that appears, you can see the resources that, for one reason or another, are inaccessible to the Google robot. In this particular example, there are no resources blocked by the robots.txt file.

If there are such resources, you will see messages like the following:

Every site has a unique robots.txt, but some common features can be listed as follows:

  • Close from indexing authorization and registration pages, password recovery pages, and other technical pages.
  • Resource admin panel.
  • Sorting pages and pages that change how information is displayed on the site.
  • For online stores: shopping cart and favorites pages. You can read more details in the tips for online stores on indexing settings on the Yandex blog.
  • Search page.

This is just an approximate list of what can be closed from indexing from search engine robots. In each case, you need to understand on an individual basis, in some situations there may be exceptions to the rules.

Conclusion

The robots.txt file is an important tool for regulating the relationship between the site and the search engine robot, it is important to take the time to set it up.

A large part of this article is devoted to the Yandex and Google robots, but this does not mean that you need to create a file only for them. There are other robots - Bing, Mail.ru, etc. You can supplement robots.txt with instructions for them.

Many modern CMSs create the robots.txt file automatically, and it may contain obsolete directives. Therefore, after reading this article, I recommend checking the robots.txt file on your site and, if obsolete directives are present there, deleting them. If you don't know how to do this, please contact us.

Robots.txt for WordPress is one of the main tools for setting up indexing. Earlier we talked about speeding up and improving the article indexing process. Moreover, we considered this issue as if the search robot knew nothing and could do nothing on its own - we had to tell it everything. For that we used a sitemap file.

Perhaps you still do not know how the search robot indexes your site? By default, everything is allowed to be indexed. But he doesn't do it right away. The robot, having received a signal that it is necessary to visit the site, puts it in a queue. Therefore, indexing does not occur instantly at our request, but after some time. Once it's your site's turn, this spider robot is right there. First of all, it looks for the robots.txt file.

If robots.txt is found, the robot reads all the directives and sees the sitemap address at the end. Then, following the sitemap, it crawls all the materials offered for indexing. It does this within a limited period of time. That is why, if you have created a site with several thousand pages and published it in its entirety, the robot simply will not have time to go through all the pages in one visit: only those it managed to view will get into the index. The robot walks all over the site and spends its time there, and it is not at all certain that it will first view exactly the pages you are waiting to see in search results.

If the robot does not find the robots.txt file, it considers that everything is allowed to be indexed. And it begins to rummage through every back street. After making a complete copy of everything it could find, it leaves your site until next time. As you understand, after such a crawl, everything gets into the search engine's index: what is needed and what is not. What needs to be indexed is your articles, pages, pictures, videos, etc. But what should not be indexed?

For WordPress, this turns out to be a very important question. The answer affects both the speed of indexing of your site's content and its security. The point is that there is no need to index any of the service information, and it is generally desirable to hide WordPress system files from prying eyes. This reduces the chance of your site being hacked.

WordPress creates a lot of copies of your articles with different URLs but the same content. It looks like this:

//site_name/article_name,

//site_name/category_name/article_name,

//site_name/heading_name/subheading_name/article_name,

//site_name/tag_name/article_name,

//site_name/archive_creation_date/article_name

With tags and archives it is a real disaster. As many tags as an article is attached to, that many copies are created. When an article is edited, archives are created for different dates, and just as many new addresses with almost identical content appear. And there are also copies of articles with addresses for each comment. It's just plain awful.

A huge number of duplicates search engines evaluate as a bad site. If all these copies are indexed and provided in the search, then the weight of the main article will be spread over all copies, which is very bad. And it is not a fact that the article with the main address will be shown as a result of the search. Hence it is necessary to forbid indexing of all copies.

WordPress formats images as separate articles without text. In this form, without text and description, they look absolutely incorrect as articles. Therefore, you need to take measures to prevent these addresses from being indexed by search engines.

Why shouldn't it be indexed?

Five reasons to ban indexing!

  1. Full indexing puts extra load on your server.
  2. It takes precious time of the robot itself.
  3. Perhaps this is the most important thing, incorrect information can be misinterpreted by search engines. This will lead to incorrect ranking of articles and pages, and subsequently to incorrect results in the search results.
  4. Folders with templates and plugins contain a huge number of links to the sites of creators and advertisers. This is very bad for a young site, when there are no or very few links to your site from outside.
  5. By indexing all the copies of your articles in archives and comments, the search engine forms a bad opinion of your site: lots of duplicates, lots of outbound links. The search engine will downgrade your site in search results, up to filtering it out entirely. And pictures, formatted as separate articles with a title and no text, terrify the robot. If there are a lot of them, the site may fall under the Yandex AGS filter. My site was there. Checked!

Now, after all that has been said, a reasonable question arises: "Is it possible to somehow prohibit indexing of what is not needed?" It turns out you can - at least not by order, but by recommendation. The situation where indexing of some objects is not completely prohibited arises because of the sitemap.xml file, which is processed after robots.txt; it turns out that robots.txt prohibits, while sitemap.xml allows. And yet this problem can be solved. Let's look at how to do it right now.

By default the WordPress robots.txt file is dynamic and does not physically exist in WordPress. It is generated only at the moment someone requests it, be it a robot or an ordinary visitor. That is, if you connect to the site over FTP, you simply will not find a robots.txt file in the root folder, yet if you type its address http://your_site_name/robots.txt into the browser, its contents appear on the screen as if the file existed. The content of this generated WordPress robots.txt file will be roughly the following:
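(A hedged sketch of the typical output; the exact content varies by WordPress version, and newer versions also add Disallow: /wp-admin/ together with Allow: /wp-admin/admin-ajax.php.)

User-agent: *
Disallow: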

According to the rules for composing robots.txt, such a file allows everything to be indexed. The User-agent: * directive indicates that all subsequent commands apply to the robots of all search engines (*), and after it nothing is restricted. As you already know, that is not enough: we have discussed at some length the folders and records that should be kept out of the index.

In order to be able to make changes to the robots.txt file and save them there, you need to create it in a static, permanent form.

How to create robots.txt for wordpress

In any text editor (just never use MS Word or anything similar with automatic text-formatting features), create a text file with the approximate content shown below and upload it to the root folder of your site. Changes can be made to it later as needed.

You just need to take into account the following rules for composing the file:

There must be no numbers at the beginning of the lines; the numbers are given in this article only for convenience when reviewing the contents of the file. There must be no extra characters at the end of any line, including spaces or tabs. Between blocks there must be an empty line without any characters, including spaces. Just one stray space can do you great harm - BE CAREFUL.

How to check robots.txt for wordpress

You can check robots.txt for extra spaces in the following way: in a text editor, select all the text by pressing Ctrl+A. If there are no spaces at the end of lines and no stray empty lines, nothing unusual will be highlighted. If you do see highlighted emptiness, remove those spaces and everything will be fine.
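If you prefer an automated check, here is a minimal Python sketch; the file name and location (robots.txt in the current folder) are my assumption, so adjust them as needed. It flags trailing spaces or tabs and also warns if the file exceeds the 32 kB limit mentioned at the end of this article:

# check_robots.py - a minimal sanity check for a local robots.txt file
import os

path = "robots.txt"  # assumed location; point this at your own file

# warn about the 32 kB size limit mentioned at the end of the article
size = os.path.getsize(path)
if size > 32 * 1024:
    print(f"Warning: the file is {size} bytes, which exceeds 32 kB")

# flag trailing spaces or tabs at the end of lines
with open(path, encoding="utf-8") as f:
    for number, line in enumerate(f, start=1):
        content = line.rstrip("\n")
        if content != content.rstrip():
            print(f"Line {number}: trailing whitespace found")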

You can check if the prescribed rules are working correctly at the following links:

  • Robots.txt analysis in Yandex.Webmaster
  • Robots.txt analysis in Google Search Console
  • Service for creating a robots.txt file: http://pr-cy.ru/robots/
  • Service for creating and checking robots.txt: https://seolib.ru/tools/generate/robots/
  • Documentation from Yandex
  • Documentation from Google (in English)

There is another way to check the robots.txt file of a WordPress site: upload its content to Yandex.Webmaster or specify the address where it is located. If there are any errors, you will find out about them immediately.
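For another quick local check you can use Python's standard urllib.robotparser module. This is only a rough sketch: the standard parser follows the basic robots.txt rules and does not necessarily interpret wildcard masks (*) exactly the way Yandex and Google do, so treat the validators listed above as the source of truth. The addresses below are placeholders:

# robots_check.py - rough local check of which URLs a robots.txt allows
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("http://site.ru/robots.txt")  # placeholder, substitute your own domain
rp.read()  # downloads and parses the live robots.txt

# ask whether particular URLs may be fetched by a given user agent
for url in ("http://site.ru/wp-admin/", "http://site.ru/my-article/"):
    print(url, "->", "allowed" if rp.can_fetch("*", url) else "disallowed")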

Correct robots.txt for wordpress

Now let's move on to the actual content of the robots.txt file for a WordPress site and the directives that must be present in it. An approximate robots.txt file for WordPress, taking its features into account, is given below:

1 User-agent: *
2 Disallow: /wp-login.php
3 Disallow: /wp-admin
4 Disallow: /wp-includes
5 Disallow: /wp-content/plugins
6 Disallow: /wp-content/themes
7 Disallow: */*comments
8 Disallow: */*category
9 Disallow: */*tag
10 Disallow: */trackback
11 Disallow: */*feed
12 Disallow: /*?*
13 Disallow: /?s=
14 Allow: /wp-admin/admin-ajax.php
15 Allow: /wp-content/uploads/
16 Allow: /*?replytocom
17
18 User-agent: Yandex
19 Disallow: /wp-login.php
20 Disallow: /wp-admin
21 Disallow: /wp-includes
22 Disallow: /wp-content/plugins
23 Disallow: /wp-content/themes
24 Disallow: */comments
25 Disallow: */*category
26 Disallow: */*tag
27 Disallow: */trackback
28 Disallow: */*feed
29 Disallow: /*?*
30 Disallow: /*?s=
31 Allow: /wp-admin/admin-ajax.php
32 Allow: /wp-content/uploads/
33 Allow: /*?replytocom
34 Crawl-delay: 2.0
35
36 Host: site.ru
37
38 Sitemap: http://site.ru/sitemap.xml

Wordpress robots.txt directives

Now let's take a closer look:

Lines 1 - 16 - the settings block for all robots.

User-agent: is a mandatory directive that defines the search agent. The asterisk means that the block's directives apply to the robots of all search engines. If a block is intended for a specific robot, you must specify its name instead, for example Yandex, as in line 18.

By default, everything is allowed for indexing. This is equivalent to the Allow: / directive.

Therefore, to prohibit indexing of specific folders or files, a special Disallow: directive is used.

In our example, folder names and file-name masks are used to ban all the WordPress service folders, such as admin, themes, plugins, comments, category, tag... If you write the directive as Disallow: /, the entire site is banned from indexing.
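For clarity, here is how the two forms differ (a standalone illustration, not lines from the file above; the # comments are just explanations):

Disallow: /wp-admin    # closes only the wp-admin folder to indexing
Disallow: /            # closes the entire site to indexing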

Allow: - as I said, this directive allows indexing of folders or files. It should be used when, deep inside forbidden folders, there are files that still need to be indexed.

In my example, line 3, Disallow: /wp-admin, prohibits indexing of the /wp-admin folder, while line 14, Allow: /wp-admin/admin-ajax.php, allows indexing of the admin-ajax.php file located inside the banned /wp-admin/ folder.

17 - An empty line (just press the Enter key, no spaces).

18 - 33 - the settings block specifically for the Yandex agent (User-agent: Yandex). As you have noticed, this block repeats all the commands of the previous block, and a reasonable question arises: why bother duplicating it? It is done solely for the sake of the few directives discussed below.

34 - Crawl-delay - an optional directive for Yandex only. It is used when the server is heavily loaded and cannot keep up with the robot's requests. It lets you set the minimum delay for the search robot (in seconds and tenths of a second) between finishing the download of one page and starting the download of the next. The maximum allowed value is 2.0 seconds. The directive is placed directly after the Disallow and Allow directives.

35 - An empty line.

36 - Host: site.ru - the domain name of your site (a MANDATORY directive for the Yandex block). If your site uses the HTTPS protocol, the address must be specified in full, as shown below:

Host: https://site.ru

37 - An empty line (just press the Enter key, no spaces) must be present.

38 - Sitemap: http://site.ru/sitemap.xml - the address of the sitemap.xml file(s) (a MANDATORY directive). It is placed at the end of the file after an empty line and applies to all blocks.
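If the site has several sitemap files, each one can be listed on its own Sitemap: line; the second file name below is only a hypothetical example:

Sitemap: http://site.ru/sitemap.xml
Sitemap: http://site.ru/sitemap-images.xml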

Masks for robots.txt file directives for wordpress

Now a little about how masks are built (a few matching examples follow the list):

  1. Disallow: /wp-register.php - Disable indexing of the wp-register.php file located in the root folder.
  2. Disallow: /wp-admin - prohibits indexing the contents of the wp-admin folder located in the root folder.
  3. Disallow: /trackback - disables indexing of notifications.
  4. Disallow: /wp-content/plugins - prohibits indexing the contents of the plugins folder located in a subfolder (second level folder) of wp-content.
  5. Disallow: /feed - prohibits indexing of the feed, i.e. it closes the site's RSS feed.
  6. * - means any sequence of characters, so it can stand in for a single character, part of a name, or the entire name of a file or folder. The absence of a specific name at the end is equivalent to writing *.
  7. Disallow: */*comments - prohibits indexing of folders and files whose names contain comments and which are located in any folder. In this case it prevents comments from being indexed.
  8. Disallow: *?s= - prohibits indexing of search pages.
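To illustrate how these masks match addresses, here are a few hypothetical URLs (site.ru stands in for your own domain):

Disallow: */*comments  blocks http://site.ru/article-name/comments/
Disallow: */*tag       blocks http://site.ru/tag/wordpress/
Disallow: /*?*         blocks http://site.ru/page.html?lol=1&wow=2
Allow: /wp-content/uploads/  explicitly leaves http://site.ru/wp-content/uploads/photo.jpg open for indexing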

The lines above can be used as a working robots.txt file for WordPress. You only need to enter the address of your own site in lines 36 and 38 and be sure to REMOVE the line numbers. The result is a working robots.txt file for WordPress, adapted to any search engine.

The only restriction is that a working robots.txt file for a WordPress site must not exceed 32 kB in size.

If Yandex does not interest you at all, you will not need lines 18 - 35. That is probably all. I hope the article was useful. If you have any questions, write them in the comments.