Jeremy Smith Jan 1, 2020 8:06:38 AM 9 min read

Crawlability: The truth about robots and sitemaps

If you work in SEO, you probably know that for your site to be listed in search engine results, it first has to be crawled and indexed by the search engine. You know that search engine “robots” do this work.

But what you may have overlooked is how much a website administrator can influence what a robot sees.

Two files on your website are crucial to how your site and its individual pages are seen and shown by search engines: robots.txt and sitemap.xml.

And the truth about them is, they are not automatic. What they say, and what they cause search engines to do, is up to you.

Let’s take a look at some dos and don’ts about robots and sitemaps.

Program Your Website to Instruct Robots

For all the power of Google and other search engines (well, Google mainly), their robots have not yet taken over. We can tell search engine robots what they may and may not do.

The first thing a search engine spider like Googlebot looks at when it visits a website is the robots.txt file. The robots.txt file on your website contains instructions for search engine robots that crawl and index the site’s contents.

The file’s content tells the crawler which files to index and which ones to ignore.

It is important that the robots.txt file does not block page resources, such as CSS and JavaScript files, that Google needs to understand, index and rank your pages.

But there may be files on your site that Google does not need to see. Your robots.txt file can block crawlers from these pages. Without a robots.txt file, nothing is blocked from crawlers.

A properly written robots.txt file makes the crawl go faster. A longer crawl time, which can occur on a complex site with nothing blocked, can slow site response and hurt search engine ranking.

The instructions in a robots.txt file can cause one of three results regarding search engine robots:

  • All content may be crawled.
  • No content may be crawled.
  • Some content may be crawled.
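Those three outcomes map to simple directives. Here is a minimal sketch of each, shown as three standalone robots.txt files (each would live on its own at the site root):

```text
# File 1: all content may be crawled (an empty Disallow blocks nothing)
User-agent: *
Disallow:

# File 2: no content may be crawled
User-agent: *
Disallow: /

# File 3: some content may be crawled (the directory names are hypothetical)
User-agent: *
Disallow: /tmp/
Disallow: /cgi-bin/
```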

The common mistakes are disallowing nothing or disallowing everything. A proper robots.txt file allows the site’s relevant content to be crawled while blocking the parts that don’t need to be, which speeds the process and is better for SEO.

Here are a few quick examples of content that does not need to be crawled and should therefore be blocked by your robots.txt file:

  • Search results pages, which provide internal links but don’t display the site’s actual contents.
  • Auto-generated content, like text generated from scraping Atom/RSS feeds or by combining content from different web pages without adding sufficient originality.
  • Content of a website that has been automatically translated from another language.
  • Technical pages that are part of the site’s publishing platform and do not contain user-oriented content.
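As a sketch, a robots.txt file covering those cases might look like the following. The paths here are hypothetical; you would substitute whatever directories your own site actually uses:

```text
# Hypothetical example: block low-value sections from all crawlers
User-agent: *
Disallow: /search/       # internal search results pages
Disallow: /feeds/        # auto-generated feed content
Disallow: /translated/   # machine-translated copies of pages
Disallow: /admin/        # technical pages of the publishing platform
```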

The Big Truth About Robots.txt Files

For all the usefulness of the robots.txt file, the problem with it is that it is wide open to anyone who wants to look at it. It is a public file, and is not secure.

Anyone can see what files or folders a robots.txt file blocks search engines from crawling. If this is sensitive information, there are other ways in, and you’ve just shown anyone with nefarious intentions where to start probing.

A more secure approach is to block robots with a meta “noindex” tag, and to use password protection on folders or files to deter snooping.
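The noindex tag is a single line in the page’s head. For example:

```html
<!-- In the <head> of the page you want kept out of search results -->
<meta name="robots" content="noindex">
```

Note that for the tag to work, the page must not also be blocked in robots.txt; if crawlers cannot fetch the page, they never see the noindex instruction.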

Better yet, don’t put information on your website that you want kept private.

Organize Sitemaps For Faster Crawling

A robots.txt file should also include a link to the sitemap.xml file or sitemap index. An XML sitemap is designed to help search engines discover and index your site’s most valuable pages. It works like a map, guiding the robot by showing how the site is configured and how content is related.
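The sitemap reference is a single line that can appear anywhere in robots.txt (the domain here is a placeholder):

```text
Sitemap: https://www.example.com/sitemap.xml
```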

The sitemap is an XML file that lists all the website URLs you want to rank in organic search. The file can also contain ancillary information, such as when a page was last updated, how often it changes and its relative importance.
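A minimal sketch of such a file, using a placeholder domain and paths:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/</loc>
    <lastmod>2019-12-15</lastmod>
    <changefreq>weekly</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://www.example.com/blog/crawlability</loc>
    <lastmod>2019-12-20</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
```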

All URLs in a sitemap should be the canonical version: the URL of the page version that you want search engines to index.

If there are errors in your sitemap, search engines will have a more difficult time finding and indexing your content, and this will slow the crawl and damage your site’s search ranking.

Don’t send a robot to any page that redirects to another page or a dead page. Keep orphaned pages out of your sitemap, as well. An orphaned page is one that no page on your website links to. It cannot be reached by crawlers following links, making it, for all SEO intents and purposes, useless.

Crawlers don’t require strict organization of your sitemap. Computers can process information much faster than humans, so they’ll figure it out. However, when your developer looks at a sitemap, it’s helpful to have it organized in a way that will make it easy to read.

The Big Truth About Sitemaps

Your website should have more than one sitemap.

Google recommends creating separate sitemaps for different types of content: images, videos and text, for example. Then it’s important to update them each time new content is published in those formats. If you have a separate mobile-friendly subdomain, it needs its own sitemap too.

If you have a very large site, keep in mind that Google accepts a maximum of 50,000 URLs per sitemap file. If you have more, you’ll need to split them across additional sitemaps and list those in a sitemap index file.
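A sitemap index is itself a small XML file that points to the individual sitemap files. A sketch, with placeholder domain and filenames:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://www.example.com/sitemap-posts-1.xml</loc>
    <lastmod>2019-12-20</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://www.example.com/sitemap-posts-2.xml</loc>
  </sitemap>
</sitemapindex>
```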

Keep in mind, too, that search engines will not index pages that are blocked, whether by noindex tags or by the robots.txt file. Including blocked pages in a sitemap works against the sitemap’s purpose.

Finally, you have to tell Google about your sitemap. Submit the file(s) through Google Search Console: in the left sidebar, click Sitemaps, then enter the sitemap URL and submit it.



Jeremy Smith

Digital marketer with a penchant for dance; helping clients see the light through the jungle of tweets since before Twitter was cool.