A robots.txt file is a must-have for every site that wants to be indexed by search engines.
Let's be honest: you don't want a search engine crawling and indexing every single page on your site. Google's bots only have a limited crawl budget when evaluating a site.
You don't want them to waste that budget on pages that don't matter.
Let's say Google has allocated enough budget to crawl 500 URLs on your website, and your website has 100 blog posts and 450 images, for 550 pages in total. That means 50 of those pages aren't going to be crawled, and some of those could be the blog posts that would drive the most traffic to your site.
Make Google work smarter, not harder.
We know that setting up a robots.txt file in WordPress can be difficult if you've never done it before. You might not be familiar with the different components of a robots.txt file, or you might not understand how to implement them properly.
Most WordPress robots.txt files look like this:
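Out of the box, the virtual robots.txt that WordPress generates is typically just these three lines:

```
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
```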
That’s simply not good enough if we want Google to crawl efficiently and effectively.
This post is going to explain the components of a robots.txt file. We'll also show you how to set up your WordPress robots.txt file and why we recommend each command.
What is a Robots.txt File
A robots.txt file is a text file that instructs web robots how to crawl your site correctly.
More specifically, robots.txt files let search engines know what they can and can't crawl and index. In other words, they influence whether these pages can be seen on a search engine results page (SERP).
You give user-agents or crawlers a set of rules. You can allow them or disallow them to crawl certain pages on your site.
Without these rules, Google makes up its own rules. They will crawl and index at will. It’s up to you to give them the guidelines they need to get to the right pages.
What Are the Components of a Robots.txt File
User-Agent: A user-agent is a name that identifies the software making a request on the web, such as a browser or a search engine crawler. When used in a robots.txt file, the user-agent line names the crawler you are allowing or disallowing to crawl and index your site. That's what makes your site's content accessible to individuals searching for something specific that you might have the content for. Examples of user-agents are Googlebot and Bingbot.
Allow: The allow component of a robots.txt file tells the user-agent which pages it can go to. The command is straightforward and not used often, since Google assumes it is allowed to crawl any page that isn't specifically blocked by the robots.txt file or tagged with noindex or nofollow. Its main use is to make an exception for a page inside a folder you've disallowed.
Disallow: Disallow is another command much like allow, but instead of telling the user-agent it can crawl a page, you're saying "don't even look at it." Disallow prevents the crawler from visiting a specific page or group of pages.
Sitemap: The sitemap component is pretty much what the word says. It is a collection of the pages that you want the crawler to hit. The robots.txt file is the set of rules you want crawlers to follow; the sitemap is the blueprint for where to go. You always want user-agents to crawl the sitemap, because they need to know the structure of your site from the top down.
Crawl-Delay: Crawl-delay is something to watch if you're running a really big site. It matters for smaller sites too, but for big ones it can make a huge difference. Essentially, adding a crawl-delay to your robots.txt file tells a user-agent to wait the number of seconds you entered before making another crawl request on your site. It can save your server's bandwidth. Crawl-delays are typically used on large content machines like news sites or ecommerce platforms with hundreds of thousands of pages.
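As a quick sketch, asking Bing's crawler to wait ten seconds between requests looks like this (note that Googlebot ignores the crawl-delay directive):

```
User-agent: Bingbot
Crawl-delay: 10
```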
How a robots.txt file works
Folders vs URLs
Disallowing or allowing folders is the most efficient way to keep whole sections of pages from being indexed. /tags/ is an example of a folder that you would want to disallow. Disallowing or allowing individual URLs is a completely viable option when creating a robots.txt file, but we wouldn't say it's the most efficient: you would have to go back into the robots.txt file every time a new page is created with a certain URL component in it.
Regex use in robots
Robots.txt files don't fully support regular expressions (regex), but it is possible to use wildcard characters such as * and $.
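As a quick sketch: * matches any sequence of characters and $ anchors the end of a URL, so a hypothetical file could use them like this:

```
User-agent: *
# Block every URL that ends in .pdf
Disallow: /*.pdf$
# Block every URL containing a "print=" parameter
Disallow: /*print=
```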
Yoast does a great job explaining the use of Regex and wildcard characters here.
Techopedia.com defines regex as:
“A regular expression is a method used in programming for pattern matching. Regular expressions provide a flexible and concise means to match strings of text.”
Constructing a Robots.txt file
This is where we're going to show you how to construct a robots.txt file for WordPress. Not all of these will be relevant for your site but these are the most common commands we see on WordPress sites.
```
User-agent: *
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-admin/
Disallow: /tag/
Disallow: /author/
Disallow: /category/
Disallow: /thank-you/
Disallow: *thank-you*
Disallow: /wp-content/
Disallow: /?s=*
Sitemap: link
```
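Before uploading a file like this, you can sanity-check how a compliant crawler will read it. Here's a minimal sketch using Python's standard-library `urllib.robotparser`; example.com and the test paths are placeholders:

```python
# Check a few of the rules above with Python's built-in robots.txt parser.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-admin/
Disallow: /tag/
Disallow: /author/
Disallow: /category/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# admin-ajax.php is reachable even though the rest of /wp-admin/ is blocked
print(parser.can_fetch("*", "https://example.com/wp-admin/admin-ajax.php"))  # True
print(parser.can_fetch("*", "https://example.com/wp-admin/options.php"))     # False
print(parser.can_fetch("*", "https://example.com/tag/seo/"))                 # False
print(parser.can_fetch("*", "https://example.com/blog/hello-world/"))        # True
```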
1) User-agent: *
By identifying the user-agent with *, you're saying that all of these commands apply to every crawler that hits your site. Sometimes you'll want to target a specific crawler, depending on site size, but it's not always necessary.
2) Allow: /wp-admin/admin-ajax.php
Many plugins and themes rely on admin-ajax.php to load assets like CSS. If you didn't allow crawlers to hit this part of your site, Google and other search engines would have a hard time rendering your pages and might not be able to get to your content.
3) Disallow: /wp-admin/
This command blocks the crawler from anything with /wp-admin/ in the URL. It's important because you don't need admin pages that are supposed to be internal out there for everyone to see.
You'll see some blogs claim that adding this is a security measure to prevent hackers from reaching the admin area. It isn't: almost every WordPress site has a /wp-admin/ area, so it's hardly a secret where to log in to WordPress. That doesn't mean you want these pages indexed, but blocking them won't make your site any "safer."
4) Disallow: /tag/
Disallowing /tag/ stops the crawler from hitting every tag archive page. Think about it: you can have hundreds of tags on one site, and you don't want a crawler hitting every one of those pages. It would eat up your crawl budget.
Disclaimer: Don’t add this if you have tag pages that are generating organic traffic. The point of this is to remove pages from the index that are wasting Google’s time and aren’t generating traffic. Tag pages usually don’t generate any meaningful traffic, but that’s not a guarantee.
5) Disallow: /author/
You disallow /author/ for a lot of the same reasons as /tag/: it's just a waste if the crawler hits every author page.
Disclaimer: Don’t add this if you have author pages that are generating organic traffic. The point of this is to remove pages from the index that are wasting Google’s time and aren’t generating traffic. Author pages usually don’t generate any meaningful traffic, but that’s not a guarantee.
6) Disallow: /category/
You disallow /category/ for a lot of the same reasons as /tag/: it's just a waste if the crawler hits every category page.
Disclaimer: Don’t add this if you have category pages that are generating organic traffic. The point of this is to remove pages from the index that are wasting Google’s time and aren’t generating traffic. Category pages usually don’t generate any meaningful traffic, but that’s not a guarantee.
7) Disallow Your Thank You Page
Disallowing your thank you page or pages is critical.
Most B2B lead gen websites set up goals in Google Analytics to measure form submissions. These are commonly done through destination goals.
If a user hits a thank you page, then a goal completion is sent to Google Analytics.
If thank you pages started showing up on a SERP and people started clicking through to them, you would have inflated, incorrect goal data in Google Analytics.
Option 1: Disallowing /thank-you/ in the URL will block any /thank-you/ folder you have set up. If your thank you page looks like this: www.mysite.com/contact/thank-you/, then this would be the correct option for you.
Option 2: Disallowing *thank-you* differs from /thank-you/ in that it blocks any URL containing "thank-you" anywhere, not just pages inside a /thank-you/ folder like the one above. If your thank you page looks like this: www.mysite.com/contact-thank-you/, this is the right option for you.
Option 3: Block each individual thank you page by its full URL. It takes more time, but it's still effective, and it works for any URL structure.
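Side by side, the three options might look like this in a robots.txt file (the individual URLs under Option 3 are hypothetical examples):

```
# Option 1: block the /thank-you/ folder
Disallow: /thank-you/
# Option 2: block any URL containing "thank-you"
Disallow: *thank-you*
# Option 3: block individual pages (example paths)
Disallow: /contact-thank-you/
Disallow: /newsletter-thank-you/
```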
8) Disallow: /wp-content/
Disallowing /wp-content/ keeps your images out of the SERP. People can still see your images once they hit your site, but you don't need them showing up on the results pages. This also helps against infringement claims in case you didn't have the right to use an image.
9) Disallow Your Site Search Results
Disallowing your site search results pages means crawlers won't crawl your on-page search results. There is simply too much a user could search for on a site, and no two individuals' searches will be the same.
This can be done in a number of ways; it all depends on how your search URLs are set up.
Go to your search bar, search for anything, and look at the URL to work out which pattern to disallow.
We used visitraleigh.com as an example:
In Visit Raleigh’s case, they could use either */search/ or */?q to disallow their searches.
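As a sketch, assuming a search URL structure like Visit Raleigh's, either line on its own would cover their search results:

```
User-agent: *
Disallow: */search/
Disallow: /*?q=
```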
10) Sitemap: link
This command points the crawler to your sitemap so it understands how the site is built and can crawl its structure. The "link" is a placeholder for your site's actual sitemap URL.
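For example, a finished file might end with a line like this (the domain is a placeholder, and sitemap_index.xml assumes a Yoast-style sitemap index):

```
Sitemap: https://www.mysite.com/sitemap_index.xml
```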