A file telling search engine crawlers which pages or sections of a site they should not crawl.
Robots.txt is a text file placed in a website's root directory that provides instructions to search engine crawlers about which pages or sections of a site they should not access. This file serves as the first point of communication between a website and search engine bots, acting as a digital "Do Not Enter" sign for specific areas of your site.
The file follows the Robots Exclusion Protocol, a standard that's been respected by major search engines since 1994. When a crawler visits your site, it checks for robots.txt at yourdomain.com/robots.txt before crawling any other pages. This makes robots.txt a critical component of technical SEO, allowing site owners to control how search engines interact with their content.
Why It Matters for AI SEO
With AI-powered crawlers becoming more numerous and resource-intensive, robots.txt has gained renewed importance. Search engine crawlers need clear guidance about which content to prioritize, especially as they process sites for AI Overviews and other enhanced search features, and a well-configured robots.txt preserves crawl budget for your most valuable pages instead of letting crawlers spend it on irrelevant content. Reputable AI crawlers also respect robots.txt directives when collecting content for training language models or for AI-generated answers, so your robots.txt configuration directly affects whether your content can appear in AI-powered search features. Additionally, many AI SEO tools now analyze robots.txt configurations to identify crawling issues that could affect AI content discovery.
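For instance, a site that wants to keep an AI training crawler away from a particular section can address that crawler by its user-agent token. The sketch below uses a hypothetical token, ExampleAIBot, as a stand-in for whichever crawler's documented token applies to you:

```
# "ExampleAIBot" is a hypothetical token: substitute the documented
# user-agent string of the AI crawler you want to restrict.
User-agent: ExampleAIBot
Disallow: /drafts/
Disallow: /internal-research/

# All other crawlers keep full access (an empty Disallow allows everything).
User-agent: *
Disallow:
```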
How It Works
The robots.txt file uses simple directives written in plain text. The most common directives include "User-agent" (specifying which crawler the rule applies to), "Disallow" (blocking access to specific paths), and "Allow" (explicitly permitting access). You can also include your XML sitemap location using the "Sitemap" directive.
For example, a basic robots.txt might block crawlers from accessing admin areas while allowing access to everything else:
```
User-agent: *
Disallow: /admin/
Disallow: /wp-admin/
Sitemap: https://example.com/sitemap.xml
```
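The "Allow" directive mentioned above can carve out an exception inside a blocked path. A common WordPress pattern, for example, keeps /wp-admin/ blocked while still permitting the AJAX endpoint that front-end features rely on:

```
User-agent: *
Disallow: /wp-admin/
# Allow overrides the broader Disallow for this single path.
Allow: /wp-admin/admin-ajax.php
```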
Tools like Google Search Console show how your robots.txt affects crawling, while technical SEO tools like Screaming Frog and Sitebulb can test your robots.txt configuration against your actual site structure. Always validate changes before deploying them (Search Console's robots.txt report shows how Google fetches and parses the file), because blocking the wrong pages can severely impact your search visibility.
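If you prefer to sanity-check a configuration programmatically, Python's standard-library urllib.robotparser applies the Robots Exclusion Protocol rules (its parsing can differ from Google's in edge cases, so treat it as a rough check). A minimal sketch with placeholder URLs:

```python
from urllib.robotparser import RobotFileParser

# Fetch and parse the live robots.txt (placeholder domain).
parser = RobotFileParser()
parser.set_url("https://example.com/robots.txt")
parser.read()

# Check whether a given crawler may fetch specific paths.
for path in ("/admin/settings", "/blog/some-post"):
    allowed = parser.can_fetch("Googlebot", f"https://example.com{path}")
    print(f"{path}: {'allowed' if allowed else 'blocked'}")
```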
Common Mistakes
The most dangerous mistake is accidentally blocking important pages or entire sections of your site. A single misplaced "/" in a disallow directive – "Disallow: /" on its own – blocks crawlers from your entire website. Many site owners also mistakenly believe robots.txt prevents pages from being indexed – it only prevents crawling. Blocked pages can still appear in search results if they're linked from other sites. For true index blocking, use a noindex robots meta tag or an X-Robots-Tag HTTP header instead of relying solely on robots.txt.
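A minimal example of the meta tag form, which must sit on a page that crawlers are allowed to fetch or they will never see it:

```
<!-- Placed in the page's <head>; tells compliant crawlers not to index the page -->
<meta name="robots" content="noindex">
```

For non-HTML files such as PDFs, the X-Robots-Tag HTTP response header serves the same purpose.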