Robots.txt File: A Beginner's Guide
What is a Robots.txt File?
A robots.txt file is a text file located on a website's server that serves as a set of instructions for web crawlers or robots, such as search engine spiders.
It's designed to communicate with these automated agents, guiding them on which parts of the website are open for indexing and which should be excluded. By defining specific rules and directives within the robots.txt file, website administrators can control how search engines and other automated tools access and interact with their site's content.
For example, below is a screenshot of Apple.com's robots.txt file. It's quite long, so we've just included the first part, but it shows what this file looks like if you were to visit it in your browser.
Why are Robots.txt Files Important for SEO?
There are many reasons why Robots.txt files are necessary and useful for your SEO:
Stop Duplicate Content Appearing in SERPs
The Robots.txt file helps prevent duplicate content issues by instructing search engine bots to avoid crawling and indexing duplicate or redundant pages that have been flagged. This ensures that only the most relevant and authoritative version of a page is displayed in search engine results pages (SERPs), improving the overall search ranking and user experience.
Important Note: Ideally, the best practice for duplicate content is to use canonical tags rather than robots.txt; both aim for the same outcome of keeping duplicates out of search results, but canonical tags also consolidate ranking signals onto the preferred page.
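If you do handle a duplicate section with robots.txt, a minimal sketch (assuming, purely for illustration, that printer-friendly duplicates live under a hypothetical /print/ directory) would be:

User-agent: *
Disallow: /print/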
Keep Certain Website Sections Private
Robots.txt allows website owners to designate specific areas or directories as off-limits to search engine crawlers. This is vital for maintaining the privacy and security of sensitive information, such as login pages or administrative sections, ensuring that confidential content remains hidden from public search results.
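As a simple sketch, assuming an admin area at /admin/ and a login page at /login/ (both hypothetical paths), the rules could look like this:

User-agent: *
Disallow: /admin/
Disallow: /login/

Bear in mind that robots.txt is itself publicly readable, so it should guide crawlers rather than act as the only layer of protection for genuinely sensitive areas.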
Prevent Internal Search URLs from Becoming Public
To avoid cluttering search engine results with internal search results pages, which are often low-quality and irrelevant to users, robots.txt can be used to block the indexing of these URLs. This streamlines the indexing process, prevents search engines from wasting resources, and keeps search results cleaner and more user-friendly.
For example, if there is a search page at example.com/search that includes the query in the URL (e.g. example.com/search?query=my+search+keywords), then every search a visitor runs can generate a messy or irrelevant page that shouldn't be shown in SERPs.
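A minimal sketch of such a rule, assuming the internal search sits at /search as in the example above, would be:

User-agent: *
# Blocks any internal search URL that carries a query string
Disallow: /search?

The plain /search page itself would remain crawlable with this pattern; extend the rule to Disallow: /search if you want to block that too.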
Block Certain Files from Being Indexed (Images, PDFs, CSVs)
Robots.txt files allow webmasters to specify which types of files should be excluded from search engine indexing. By blocking files like images, PDFs, or CSVs, website owners can improve crawl efficiency and focus search engine attention on the most valuable content, potentially boosting SEO performance.
There are some instances where it may be useful to allow certain types of files to be indexed, for example, if your PDFs are content pieces intended for public consumption.
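A hedged sketch of that kind of setup, assuming public PDFs live in a hypothetical /whitepapers/ folder, might be:

User-agent: *
# Block all PDFs site-wide...
Disallow: /*.pdf$
# ...except those in the public /whitepapers/ folder
Allow: /whitepapers/*.pdf$

Google follows the most specific (longest) matching rule, so the Allow line wins for files inside /whitepapers/.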
Prevent Servers from Being Overloaded
In some cases, aggressive bots or excessive crawling activity can overwhelm a website's servers, leading to poor site performance and downtime. Robots.txt helps reduce this strain by telling well-behaved bots which areas to stay out of and, where supported, how quickly they may crawl, ensuring that server resources are used efficiently and protecting the website's overall SEO health. It also allows for managing and optimizing the crawl budget by limiting how many pages can be crawled during a certain period of time.
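For example, a minimal sketch asking a particular bot to slow down (the bot name is hypothetical, and note that support for Crawl-delay varies, with Googlebot ignoring it) could look like this:

# Ask ExampleBot to wait 10 seconds between requests
User-agent: ExampleBot
Crawl-delay: 10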
Show Location of Sitemap File(s)
Robots.txt can be used to indicate the location of a website's XML sitemap(s). This assists search engines in finding and indexing all relevant pages efficiently, which is essential for ensuring that a website's content is properly represented in search results and subsequently improving its SEO visibility.
For example, you'll see a Sitemap directive pointing to the sitemap index file in the robots.txt file below:
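In its simplest form, the directive is a single line pointing to the sitemap's full URL (example.com is just a placeholder here):

Sitemap: https://www.example.com/sitemap_index.xml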
Where Should You Put Your Robots.txt File?
To ensure that search engine crawlers can locate and read your robots.txt file, you should place it in the root directory of your website's server. The root directory is the main folder where your website's homepage (e.g. index.html) is located. Here's the typical file path:
www.example.com/robots.txt
It's vital to make sure that the robots.txt file is accessible via a direct URL and that there are no restrictions or password protection in place that might prevent search engine bots from accessing it. This placement allows web crawlers to easily locate and follow the instructions contained in the robots.txt file to determine which parts of your website they can crawl and index.
Robots.txt Limitations
While this essential file can help with many things when it comes to guiding crawlers, there are certain limitations as well, according to Google:
Not all search engines support robots.txt rules
While robots.txt is a standard protocol for guiding web crawlers, not all spiders fully adhere to its directives. Some search engines may choose to ignore or only partially follow the instructions provided in a robots.txt file, potentially leading to unintended indexing of restricted content.
Crawlers vary in their interpretation of syntax
Different web crawlers may interpret the syntax and directives in a robots.txt file differently. This variance in interpretation can result in inconsistencies in how search engines follow the rules, possibly leading to unexpected content indexing or exclusion, depending on the specific crawler's behavior.
A disallowed page in robots.txt can still be indexed via external links
While robots.txt can prevent search engine bots from crawling a particular page directly, it doesn't prevent external websites from linking to that page. If other websites link to a disallowed page, search engines may discover and index it through those external links, bypassing the robots.txt restrictions and potentially making the page accessible in search results.
Search Engine Bot Examples
Below are some of the most common search engine bots, so that when it comes to creating your robots.txt file, you know which ones are out there in case you want to single out or block any of them:
Google Bots (e.g. Googlebot, Googlebot-Image, Googlebot-News, Googlebot-Video)
Baidu Bots (e.g. Baiduspider, Baiduspider-image)
Bing Bots (e.g. Bingbot, AdIdxBot)
MSN Bots (e.g. msnbot, msnbot-media)
Other Bots (e.g. DuckDuckBot, YandexBot, Yahoo's Slurp)
Robots.txt Syntax
Learning the syntax for Robots.txt is quite easy, so let's explore what each directive means and what it does:
User-agent:
This line specifies the web crawler or user agent to which the subsequent rules or directives apply. For example, "User-agent: Googlebot" would target Google's web crawler. A potential danger is accidentally placing restrictive rules under "User-agent: *" when you only meant to target one bot, which would effectively prevent all search engines from crawling the specified content. If you want the same rules to apply to every bot, set this line to "User-agent: *".
Disallow:
This directive indicates the web pages or directories that should not be crawled or indexed by the user agent mentioned earlier. It specifies the relative paths of the disallowed content. The danger here lies in overusing Disallow, potentially blocking vital pages or entire sections of your site from search engines, which can harm your search visibility.
Allow:
This rule can be used to counteract a broader Disallow rule. It allows you to permit the indexing of certain pages or directories that would otherwise be blocked by a Disallow directive. The danger is that it may not be recognized by non-Google crawlers and could lead to different behavior among search engines.
Crawl-delay:
This command sets a delay (in seconds) between subsequent requests made by the user agent to your server. It can be used to reduce server load caused by excessive crawling. The danger is that not all crawlers support this directive (Googlebot, for example, ignores it), and excessive delays can negatively affect your site's crawl efficiency.
Sitemap:
The Sitemap directive informs web crawlers about the location of your XML sitemap(s), which lists all the URLs you want to be indexed. There are no inherent dangers with this directive, but its misuse can occur if the provided URLs lead to non-existent or incomplete sitemaps, which can confuse crawlers or lead to incomplete indexing.
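Putting these directives together, a hedged sketch of a complete file (every path and URL below is a placeholder) might look like this:

User-agent: *
Disallow: /admin/
Allow: /admin/public-info/
# Ignored by Googlebot but respected by some other crawlers
Crawl-delay: 5

Sitemap: https://www.example.com/sitemap.xml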
There are also two symbols that can be used within URL paths to create more complex rules:
Star Symbol
* serves as a wildcard that matches any sequence of characters; in a User-agent line, it simply means all/any crawlers.
For example, Disallow: /page/* prevents crawling of any pages nested within the /page/ directory, including the /page/ root page itself.
Dollar Sign Symbol
$ denotes the end of a URL string, so a rule using it only matches URLs that end exactly at that point.
For example, Disallow: /page/$ prevents crawling of the /page/ URL itself only; pages and subdirectories nested below it (e.g. /page/article) remain crawlable.
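To make the difference concrete, here's a hedged sketch using the same hypothetical /page/ directory (robots.txt supports # comments, used here for annotation):

User-agent: *
# Blocks /page/ and everything nested under it
Disallow: /page/*
# Blocks only the exact URL /page/
Disallow: /page/$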
Example Robots.txt Directives
To understand some use cases, we've shared some common examples of Robots.txt directives that you can adapt to your own requirements:
Allow all web crawlers access to all content:
User-agent: *
Disallow:
Block all web crawlers from all website content:
User-agent: *
Disallow: /
Block all web crawlers from PDF files, site-wide:
User-agent: *
Disallow: /*.pdf
Block all web crawlers from JPG files, only within a specific subfolder:
User-agent: *
Disallow: /subfolder/*.jpg$
Note: URLs and filenames are case-sensitive, so in the above example, .JPG files would still be allowed.
Block a specific web crawler from a specific folder:
User-agent: bingbot
Disallow: /blocked-subfolder/
Block a specific web crawler from a specific web page:
User-agent: baiduspider
Disallow: /subfolder/page-to-block
Note: Be careful with the above one: robots.txt rules are prefix matches, so this also blocks any URL that begins with /subfolder/page-to-block (e.g. /subfolder/page-to-block-2), while adding a trailing slash would mark it as a directory instead of a single page!
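As a quick illustration of that note (both paths are hypothetical):

User-agent: baiduspider
# Blocks the page and anything else starting with this prefix
Disallow: /subfolder/page-to-block
# Treats it as a directory and blocks everything inside it
Disallow: /subfolder/page-to-block/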
Below, you can see a few of the ways in which Waterstones.com (a UK-based bookstore chain) uses different directives. You can see it's blocking SemrushBot (an SEO platform) and it also lists its sitemap index file. Finally, it adds a crawl delay of 0.2 seconds for Bingbot specifically.
Robots.txt Best Practices
In general, the following best practices will help ensure that robots.txt files are created and used correctly, so that no unintended issues arise.
Robots.txt Testing Tool
Once the robots.txt file has been created according to your requirements and made live, it's important to check that it's valid.
Google has a free tool to do just that, which can be found here:
https://www.google.com/webmasters/tools/robots-testing-tool
Simply submit a URL to the robots.txt Tester tool and it will flag up any errors or warnings.
For example, you'll see Hike's robots.txt file is error and warning-free:
Robots.txt vs. Meta Robots vs. X-Robots
In addition to the Robots.txt file, you may have already heard of meta robots and x-robots and were wondering what the differences between them are.
Firstly, Robots.txt is a standalone text file, while meta robots and x-robots are directives delivered alongside specific pages (in the HTML and in the HTTP response headers, respectively).
The main difference between them is that Robots.txt defines site-wide or directory-wide crawl behavior (including specific pages), while meta robots and x-robots indicate indexation behavior only at the page level.
More specifically, the differences between meta robots and x-robots are as follows:
Meta Robots is an HTML meta tag that is placed within the head section of individual web pages. It allows you to specify page-specific directives for web crawlers, such as "noindex" to prevent a specific page from being indexed or "nofollow" to instruct crawlers not to follow links on that page. Meta Robots provides more granular control compared to Robots.txt.
For example, on Hike's Learn SEO page, the meta robots tag has "index" and "follow", which tells search engine bots to index this page and follow all the links on this page:
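The standard form of that tag, placed in the page's head section, looks like this:

<meta name="robots" content="index, follow">

Since index and follow are the default behaviors anyway, crawlers treat a page this way even when the tag is omitted; the tag matters most when you switch to noindex or nofollow.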
X-Robots is an HTTP header (X-Robots-Tag) that can be set at the server level or for specific web pages. It offers similar capabilities to Meta Robots but is typically used for advanced scenarios, such as applying "noindex" to non-HTML files like PDFs or images, or controlling how snippets and previews are displayed. It's more flexible and powerful but may require more technical expertise to implement.
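For example, a response that includes this raw HTTP header tells crawlers not to index the file it accompanies, which is handy for non-HTML content such as PDFs:

X-Robots-Tag: noindex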
Hike SEO
When it comes to empowering agencies serving SEO beginners and small business owners, Hike SEO makes for the perfect platform to enhance organic search visibility and traffic over time. With its diverse yet simple, practical tools, users can easily grasp what needs to be done, even without prior SEO knowledge.
Try Hike today, and start improving your SEO performance in easy-to-follow action steps.