Robots.txt Explained

Updated: 2022-07-26 / Article by: Jerry Low

The robots.txt file is a simple text document containing instructions for search engine crawlers. It tells them which pages to crawl and which ones to avoid. It’s like a sign for bots saying, “come here for the rules you need to use this website.”

The purpose of these files is to help search engines determine how best to crawl your site. That serves to reduce the burden on the bot and your server. After all, unnecessary requests for data won't benefit anyone in a meaningful way.

For example, there's no reason for Googlebot (or any other bot) to crawl anything but your newest blog posts or the posts that have recently been updated.

How the Robots.txt File Works

The easiest way to understand how it works is to think of search engine bots as guests in your house. You have all of these things you want to show off on your walls, but you don't want guests wandering around and touching everything. So, you tell them: “Hey! Stay out of this room, please.”

That's what the robots.txt file does – it tells search engines where they should go (and where they shouldn't). You can achieve this miracle with simple instructions that follow some pre-defined rules.

Each website can have only a single robots.txt file, and the file must use that exact name – no more, no less.
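A minimal, fully permissive robots.txt (the domain in this discussion is a placeholder) looks like this:

```
User-agent: *
Allow: /
```

The `User-agent: *` line addresses every bot, and `Allow: /` permits crawling of the entire site.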

Do I Need a Robots.txt File?

The short answer is yes. You should have a robots.txt file on your website.

The longer answer is that you need to know how search engine bots will crawl and index your site and then write your robots.txt file accordingly.

A properly structured and maintained robots.txt file can improve how efficiently search engines crawl your site, which indirectly helps your visibility in search results. One caution, though: robots.txt is not a security tool. The file is publicly readable, and listing “sensitive” paths in it actually advertises them, so never rely on it to hide information from spammers or hackers.

How to Build Your Robots.txt File

The robots.txt file starts life as a simple, blank text document. That means you can create one with a tool as simple as a plain text editor like MS Notepad. You can also use the text editor in your web hosting control panel, but creating the file on your computer is safer.

Once you’ve created the document, it’s time to start filling it with instructions. You need two things for this to happen. First, you must know what you want the robots.txt file to tell bots. Next, you need to understand how to use the instructions bots can understand.

Part 1: What the Robots.txt File Can Do

  • Allow or block specific bots
  • Control the files that bots can crawl
  • Control the directories that bots can crawl
  • Control access to images
  • Define your sitemap

And more.
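The capabilities above map to a handful of directives. Here is a sketch (all paths, the bot name "BadBot," and the sitemap URL are hypothetical examples):

```
# Block one bot entirely
User-agent: BadBot
Disallow: /

# Rules for every other bot
User-agent: *
Disallow: /cgi-bin/              # keep bots out of a directory
Disallow: /drafts/preview.html   # block a single file
Disallow: /images/               # block an image directory

# Tell crawlers where your sitemap lives
Sitemap: https://www.example.com/sitemap.xml
```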

Part 2: Understanding How Robots.txt Syntax Works

Many people get confused when looking at robots.txt samples because the content looks like tech jargon – and to the average person, it is. The key to understanding robots.txt is to think like a computer.

Computers need instructions to work, and they process things based on them. The same is true for bots. They read instructions one line at a time. Each of those lines has to follow a specific format.

Here are some common rules for the robots.txt file:

Code:
  User-agent: Googlebot-news
  Allow: /
  User-agent: *
  Disallow: /
Action: Only allow Google’s news bot to crawl your website.

Code:
  User-agent: Googlebot-Image
  Disallow: /images/dogs.jpg
Action: Stop the dogs.jpg image from showing in Google Image search results.

Code:
  User-agent: Googlebot
  Disallow: /*.gif$
Action: Block Google’s bot from crawling any image file with the .gif extension.

You can get a more comprehensive list of instructions for your robots.txt file on Google’s developer documentation.

For example, here is Facebook’s robots.txt file (https://www.facebook.com/robots.txt), and here is Google’s (https://www.google.com/robots.txt).

Best Practices for Robots.txt

Follow instructions for robots.txt, or things can go poorly for your website. (Source: Google)

While, in some ways, robots.txt allows you to customize bot behavior, the requirements for this to work can be pretty rigid. For example, you must place the robots.txt file in the root directory of your website. That generally means public_html or www.

While some rules are negotiable, it’s best to understand some standard guidelines:

Watch Your Order

Conflicting rules in a robots.txt file can bite you. For Google’s crawlers, order doesn’t decide the winner: when an Allow and a Disallow rule both match a URL, the most specific rule (the one with the longest matching path) takes priority. Other crawlers may simply process rules in order, so avoid writing rules that contradict each other.
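Per Google’s documentation, a longer matching path beats a shorter one. A sketch with hypothetical paths:

```
User-agent: *
Disallow: /blog/
Allow: /blog/public/
```

A URL like /blog/public/post.html matches both rules, but Allow: /blog/public/ is the longer (more specific) match, so Google will crawl it while the rest of /blog/ stays blocked.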

Be Detailed

When creating instructions, be as specific as possible with your parameters. The bots don’t negotiate, so tell them precisely what needs to happen.
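Small differences in a path change what a rule matches. A sketch with hypothetical paths:

```
User-agent: *
Disallow: /private     # matches /private, /private/, and /private-notes.html
Disallow: /private/    # matches only URLs inside the /private/ directory
```

The trailing slash is the whole difference – when in doubt, spell out exactly the directory or file you mean.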

Subdomains Are Possible

Each subdomain can have its own robots.txt file. However, the rules in each file apply only to the subdomain where the file resides.
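For instance (with a hypothetical subdomain and paths), each host serves its own independent file:

```
# https://www.example.com/robots.txt
User-agent: *
Disallow: /tmp/

# https://blog.example.com/robots.txt – a separate file with its own rules
User-agent: *
Disallow: /drafts/
```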

Check The File

Writing a robots.txt file and uploading it untested is a recipe for disaster. Make sure the rules you’re adding work as intended before letting them loose on your live site.
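One quick way to sanity-check rules before uploading is Python’s built-in robots.txt parser. (The rules and URLs below are hypothetical; note that Python’s parser applies rules in file order rather than by Google’s most-specific-match logic, so keep your test rules unambiguous.)

```python
# Sanity-check robots.txt rules with Python's standard-library parser.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# A blocked path should be rejected; everything else allowed.
print(parser.can_fetch("*", "https://example.com/private/page.html"))  # False
print(parser.can_fetch("*", "https://example.com/blog/post.html"))     # True
```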

Don’t Noindex Anything

Google has not supported a noindex directive in robots.txt since 2019, and other search engines never reliably did. If you want a page kept out of search results, don’t use robots.txt for it at all – block indexing on the page itself, and leave the page crawlable so bots can actually see that instruction.
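Per Google’s documentation, the supported way to block indexing is a robots meta tag in the page’s HTML (or the equivalent X-Robots-Tag HTTP response header):

```html
<!-- Place in the <head> of the page you want excluded from search results -->
<meta name="robots" content="noindex">
```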

Final Thoughts

Strictly speaking, you don’t need a robots.txt file. That’s especially true for smaller or static websites that don’t have a lot of content to crawl. However, larger websites will find robots.txt indispensable in reducing resources lost to web crawlers. It gives you much better control over how bots view your website.

About Jerry Low

Founder of WebHostingSecretRevealed.net (WHSR) – a hosting review site trusted and used by 100,000s of users. More than 15 years of experience in web hosting, affiliate marketing, and SEO. Contributor to ProBlogger.net, Business.com, SocialMediaToday.com, and more.