Robots.txt is a text file containing instructions that tell search engine robots which pages they should and should not crawl. These instructions are made specific by ‘allowing’ or ‘disallowing’ certain behaviour for particular bots, or for all of them.
Why is Robots.txt important?
A robots.txt file helps control and manage web crawler activity so that crawlers do not overload your website or index pages that are not intended for public viewing. There are several reasons why you may want to use a robots.txt file, for example:
1. Optimizing Crawl Budget
2. Blocking Duplicate & Non-Public Pages
3. Hiding Resources
How does a Robots.txt file work?
Robots.txt files provide search engine crawlers with information on which URLs they can and, more importantly, cannot crawl.
When a bot arrives at a website, its first action is to look for a robots.txt file. If one is found, it reads the file before crawling anything else.
You assign rules to bots by stating their user-agent (the name of the search engine bot) followed by directives (the rules). Like a code of conduct, a robots.txt file can only provide instructions; it cannot enforce them.
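As an illustration, a minimal robots.txt block might look like the following; the ‘/private/’ path is only a placeholder:

User-agent: *
Disallow: /private/

The asterisk in the ‘User-agent’ line addresses all crawlers, and the ‘Disallow’ line asks them not to crawl anything under /private/.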
How to Find a Robots.txt File
Just like any other file on your website, the robots.txt file is hosted on your server. You can see the robots.txt file for any given website by typing the full URL for the homepage and then adding /robots.txt.
For example, for ‘www.hikeseo.co’, you would visit ‘www.hikeseo.co/robots.txt’. A robots.txt file should always sit at the root of your domain. Crawlers will assume you do not have one if it is anywhere else.
Robots.txt Syntax
A robots.txt file is made up of one or more blocks of ‘directives’ (rules), each with a specified ‘user-agent’ (search engine bot) and an ‘allow’ or ‘disallow’ instruction.
The first line of every block of directives is the ‘user-agent’, which identifies the crawler the block addresses.
The second line in any block of directives is the ‘Disallow’ line. You can have multiple Disallow directives, each specifying a part of your site that the crawler should not access.
An empty ‘Disallow’ line means that you are not disallowing anything – enabling a crawler to access all sections of your site. The ‘Allow’ directive allows search engines to crawl a subdirectory or specific page, even in an otherwise disallowed directory.
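As a hypothetical example, the block below (the paths are placeholders) disallows an entire directory while allowing a single page inside it:

User-agent: *
Disallow: /blog/
Allow: /blog/robots-txt-guide/

Here every crawler is asked to skip the /blog/ directory, except for the one post that is explicitly allowed.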
The ‘Sitemap’ directive tells search engines, specifically Google, Bing, and Yandex, where they can find your XML sitemap. The pages you want search engines to crawl and index are generally the ones included in your sitemap.
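The ‘Sitemap’ directive is usually placed at the very top or bottom of the file and points to the absolute URL of your XML sitemap, for example (using a placeholder domain):

Sitemap: https://www.example.com/sitemap.xml

You can include more than one ‘Sitemap’ line if your site has multiple sitemaps.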
Want to Learn More About SEO?
Then why not check out the Hike SEO Academy? It’s full of courses that teach you the fundamentals of SEO from scratch. From onsite to offsite, and even technical SEO, the academy is a video-led resource that is easy to follow and action-oriented. It’s included for free with all Hike products. Sign up today and start improving your SEO knowledge.