Robots.txt Complete Guide
Everything you need to know about robots.txt files
What is Robots.txt?
A robots.txt file is a text file that tells search engine crawlers which pages or sections of your website they can or cannot access. It's part of the Robots Exclusion Protocol (REP), a group of web standards that regulate how robots crawl the web.
The file must be placed at the root of your website (e.g., https://example.com/robots.txt).
Basic Syntax
User-agent: *
Disallow: /admin/
Allow: /admin/public/
Sitemap: https://example.com/sitemap.xml
- User-agent: Specifies which crawler the rules apply to (* = all crawlers)
- Disallow: Tells crawlers not to access certain paths
- Allow: Explicitly allows access to a path; for major crawlers a more specific Allow overrides a broader Disallow (see the example below)
- Sitemap: Points to your XML sitemap location
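To see how these directives interact, here is the same file again with comments noting which hypothetical URLs each rule affects (for major crawlers, the longer, more specific rule wins):

User-agent: *
Disallow: /admin/            # blocks /admin/, /admin/settings, and everything else under /admin/
Allow: /admin/public/        # more specific, so /admin/public/help.html stays crawlable
Sitemap: https://example.com/sitemap.xml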
Common User-Agents
- Googlebot: Google's main web crawler
- Googlebot-Image: Google's image search crawler
- Bingbot: Microsoft Bing's crawler
- * (wildcard): applies to all bots
Best Practices
✅ Do:
- Place robots.txt at your website root
- Include a sitemap reference
- Block admin and private areas
- Test before deploying
- Keep it simple and readable
❌ Don't:
- Use robots.txt for security (it's publicly visible)
- Block your entire site unless intentional
- Forget to allow important content
- Use it as the only crawl control method
- Block CSS/JS files (can hurt SEO)
Common Patterns
Allow Everything
User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml
Block Specific Paths
User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /temp/
Sitemap: https://example.com/sitemap.xml
Allow Only Specific Bots
User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

Sitemap: https://example.com/sitemap.xml
🌐 Real-World Use Cases
Here's how major websites use robots.txt to control their crawl budget and protect their content:
📰 News Websites (Like CNN, BBC)
News sites need search engines to quickly index breaking news while avoiding duplicate content from archives and print versions.
User-agent: *
Disallow: /print/           # Block printer-friendly versions
Disallow: /amp/archive/     # Block archived AMP pages
Disallow: /search?          # Block search result pages
Allow: /amp/                # Allow current AMP articles
Sitemap: https://news-site.com/sitemap.xml
Why: Focuses crawl budget on current news while preventing duplicate content penalties from print and archive versions.
🛒 E-commerce Sites (Like Amazon, eBay)
E-commerce platforms must balance indexing product pages while blocking duplicate filter/sort combinations and checkout flows.
User-agent: *
Disallow: /cart/            # Block shopping cart
Disallow: /checkout/        # Block checkout process
Disallow: /my-account/      # Block user accounts
Disallow: /*?sort=          # Block sort parameters
Disallow: /*?filter=        # Block filter parameters
Allow: /products/           # Allow product pages
Sitemap: https://shop.com/sitemap-products.xml
Why: Prevents wasting crawl budget on infinite filter combinations while ensuring all unique products are indexed.
📝 WordPress Blogs
WordPress sites need to block admin areas, duplicate content from tags/categories, and trackback URLs while allowing posts and pages.
User-agent: *
Disallow: /wp-admin/                # Block WordPress admin
Allow: /wp-admin/admin-ajax.php     # Allow AJAX (needed)
Disallow: /wp-includes/             # Block WP core files
Disallow: /wp-content/plugins/      # Block plugins
Disallow: /wp-content/themes/       # Block themes
Disallow: /*?s=                     # Block search results
Disallow: /*?replytocom=            # Block comment replies
Disallow: /tag/                     # Block tag archives (if using categories)
Sitemap: https://blog.com/sitemap_index.xml
Why: Protects admin areas from exposure while avoiding duplicate content from WordPress's various archive types.
🏢 SaaS Platforms (Like Slack, Notion)
SaaS platforms need to allow crawling of marketing pages while completely blocking the application itself and user data.
User-agent: *
Disallow: /app/             # Block entire application
Disallow: /api/             # Block API endpoints
Disallow: /dashboard/       # Block user dashboards
Disallow: /workspace/       # Block workspaces
Allow: /                    # Allow marketing site
Allow: /pricing
Allow: /features
Allow: /blog/
Sitemap: https://saas.com/marketing-sitemap.xml
Why: Ensures Google only indexes public marketing content, not user-generated private content behind login.
🎬 Streaming Services (Like Netflix, YouTube)
Video platforms need to index content pages while blocking infinite scroll, recommendations, and personalized feeds.
User-agent: *
Disallow: /recommendations/     # Block personalized content
Disallow: /my-list/             # Block user lists
Disallow: /*?autoplay=          # Block autoplay URLs
Disallow: /*?t=                 # Block timestamp URLs
Allow: /watch?v=                # Allow video pages
Allow: /channel/                # Allow channel pages

User-agent: Googlebot-Image
Allow: /thumbnails/             # Allow thumbnail crawling

Sitemap: https://streaming.com/video-sitemap.xml
Why: Indexes individual videos while avoiding duplicate content from recommendations and time-stamped URLs.
💡 Pro Tip: Use our Examples page for more pre-built templates you can customize for your website type.
SEO Impact
Robots.txt directly affects how search engines crawl your site:
- Crawl Budget: Help search engines focus on your important pages
- Privacy: Keep private or duplicate content out of search results
- Performance: Reduce server load by blocking unnecessary crawling
- Control: Manage which bots can access your content (see the sketch below)
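As a small sketch of that control, the rules below shut out one misbehaving crawler while leaving the site open to everyone else ("ExampleBadBot" is a placeholder name, not a real bot):

# Block one specific crawler entirely
User-agent: ExampleBadBot
Disallow: /

# Every other crawler may access the whole site
User-agent: *
Allow: /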
⚠️ Warning: Blocking important pages can hurt your SEO. Always test thoroughly before deploying.
⚡ Performance Tips
Optimize your robots.txt file for better performance and faster processing by search engines:
📏 Keep File Size Under 500KB
Google, for example, processes only the first 500 KiB of a robots.txt file; rules beyond that limit may be ignored. Keep your rules concise and avoid unnecessary repetition.
🎯 Use Wildcards Efficiently
Instead of listing every file individually, use wildcards to match patterns:
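For example (the paths below are illustrative):

User-agent: *
Disallow: /*.pdf$          # every URL ending in .pdf
Disallow: /*?sessionid=    # every URL carrying a sessionid parameter
Disallow: /private*        # every URL starting with /private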
⏱️ Minimize Redundant Rules
Each rule takes time to process. Combine similar rules under the same User-agent block instead of creating separate blocks. Remove duplicate or contradictory rules.
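As a sketch, the two fragments below express the same policy; the second is the leaner, easier-to-maintain form (paths are illustrative):

# Redundant: two separate blocks for the same user-agent
User-agent: *
Disallow: /tmp/

User-agent: *
Disallow: /cache/

# Leaner: one block with the rules combined
User-agent: *
Disallow: /tmp/
Disallow: /cache/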
🚀 Consider Crawl Budget
Block low-value pages (admin panels, duplicate content, infinite scroll pages) to save your crawl budget for important content. This helps search engines discover your best pages faster.
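A minimal sketch of that idea, with placeholder paths standing in for whatever low-value sections your site has:

User-agent: *
Disallow: /admin/          # admin panel has no search value
Disallow: /print/          # duplicate printer-friendly pages
Disallow: /search?         # endless internal search result pages
Allow: /                   # leave the crawl budget for real content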
📊 Monitor Parse Time
Use our Validator to check your file's parse time. Files with hundreds of lines or over 50KB will show estimated processing time. Large files may slow down crawler processing.
🔄 Cache-Friendly Updates
Search engines cache robots.txt files for up to 24 hours. When making critical changes, use Google Search Console to request an immediate re-crawl of your robots.txt file.
Testing Your Robots.txt
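Before relying on an online tester, you can sanity-check a robots.txt file locally. The sketch below uses Python's standard-library urllib.robotparser; it follows the original REP, applies the first matching rule, and does not understand Google-style wildcards (* and $), so treat it as a rough check rather than an exact replica of Googlebot. The domain and paths are placeholders.

from urllib.robotparser import RobotFileParser

# Point the parser at the live file (placeholder domain)
rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetches and parses the file

# Ask whether a given user-agent may fetch a given URL
for url in ("https://example.com/", "https://example.com/admin/settings"):
    print(url, "->", rp.can_fetch("Googlebot", url))

# The stdlib parser is conservative about wildcards and rule precedence,
# so confirm critical rules with Google Search Console as well.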
🔧 Troubleshooting Common Issues
Having problems with your robots.txt file? Here are solutions to the most common issues:
❌ Problem: "Robots.txt not found" (404 error)
Symptoms: Search Console shows 404 error when accessing robots.txt
Solutions:
- Verify file is named exactly "robots.txt" (lowercase, no spaces)
- Place file in website root directory, not in subdirectories
- Check file permissions - should be readable by web server (644)
- Clear CDN/caching if using Cloudflare or similar services
- Test access directly: curl https://yourdomain.com/robots.txt
⚠️ Problem: Rules not working as expected
Symptoms: URLs are crawled despite Disallow rules, or blocked despite Allow rules
Solutions:
- Rule precedence: Google applies the most specific (longest-matching) rule regardless of order, but some crawlers use the first match they find, so list specific rules before general ones (see the example after this list)
- Check for typos in paths (robots.txt is case-sensitive)
- Wait 24 hours - search engines cache robots.txt
- Use our Tester to verify rules match your URLs
- Verify no conflicting rules (Allow overrides Disallow for same path)
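A sketch of how precedence plays out for major crawlers (paths are illustrative): the longer, more specific match decides, whichever order the rules appear in.

User-agent: *
Disallow: /docs/            # general rule: /docs/internal/plan.html is blocked
Allow: /docs/public/        # more specific rule: /docs/public/guide.html is crawlable
Disallow: /*.zip$           # wildcard rule: /downloads/archive.zip is blocked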
⏱️ Problem: Changes not taking effect
Symptoms: Updated robots.txt but bots still use old rules
Solutions:
- Search engines cache robots.txt for up to 24 hours
- Use Google Search Console → Crawl → robots.txt Tester to force refresh
- Clear your CDN cache if using one (Cloudflare, etc.)
- Verify changes actually saved by viewing source in browser
- Check server isn't serving cached version
🚫 Problem: Entire site blocked accidentally
Symptoms: No pages being indexed, traffic dropped significantly
Solutions:
- Check for "Disallow: /" under "User-agent: *" - this blocks everything
- Remove or comment out overly broad Disallow rules
- Use our Validator to check for syntax errors
- Request re-indexing in Google Search Console after fixing
- Consider using Allow rules to explicitly permit important content (see the sketch below)
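A minimal sketch of the accidental lockout and one way to recover (the blocked paths are placeholders):

# Accidental lockout: blocks every crawler from every page
User-agent: *
Disallow: /

# Fixed: block only what was actually meant to be private
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /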
📄 Problem: Syntax errors or validation warnings
Symptoms: Search Console reports errors, unexpected bot behavior
Solutions:
- Each User-agent group should contain at least one Disallow or Allow directive
- Directive names must be spelled exactly ("User-agent:", "Disallow:", "Allow:"); a space after the colon is fine, but a misspelled or split directive name is not (see the example after this list)
- Use our Validator for real-time syntax checking
- Paths must start with "/" (e.g., "/admin/" not "admin/")
- Save the file as plain UTF-8 text; smart quotes and other word-processor characters can break parsing
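As a sketch, the fragment below pairs a few frequently seen mistakes (shown as comments) with a corrected version:

# Wrong (illustrative mistakes):
#   user agent: *        <- directive name split in two
#   Disalow: /tmp/       <- directive name misspelled
#   Disallow: admin/     <- path missing its leading slash

# Right:
User-agent: *
Disallow: /tmp/
Disallow: /admin/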
Frequently Asked Questions
What happens if I don't have a robots.txt file?
If you don't have a robots.txt file, search engine crawlers will assume they can access and index all pages on your website. This is equivalent to having an "Allow: /" rule for all user-agents. While not required, having a robots.txt file gives you more control over how search engines crawl your site.
Can I block all search engines from my site?
Yes, using "User-agent: *" with "Disallow: /" will tell all crawlers not to access any pages. However, this doesn't guarantee complete removal from search results - URLs may still appear if other sites link to them. For complete removal, use noindex meta robots tags or X-Robots-Tag headers, and remember that crawlers must be able to fetch a page to see those tags.
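A sketch of the block-everything file, with its main limitation noted in a comment:

# Tell every crawler not to fetch any page on the site
User-agent: *
Disallow: /

# Note: URLs can still appear in results via external links; for true
# removal, let crawlers fetch the page and see a noindex meta tag or
# X-Robots-Tag header instead.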
How long do robots.txt changes take to take effect?
Search engines typically cache robots.txt files for up to 24 hours. This means changes may not take effect immediately. For urgent changes, you can use Google Search Console to request a re-crawl of your robots.txt file, which will update Google's cached version faster.
Should I block CSS and JavaScript files?
No, you should not block CSS and JavaScript files. Google and other search engines need to render your pages to understand their content and layout. Blocking these resources can negatively impact your SEO and may prevent search engines from properly indexing your site. Only block truly sensitive or unnecessary content.
What's the difference between robots.txt and meta robots tags?
Robots.txt controls whether crawlers can access pages, while meta robots tags control whether pages can be indexed in search results. A page blocked by robots.txt won't be crawled, so crawlers can't see its meta tags. Use robots.txt for crawl control and meta tags for indexing control. For best results, use both appropriately based on your needs.
Can I use wildcards in robots.txt?
Yes, most modern search engines support wildcards: asterisk (*) matches any sequence of characters, and dollar sign ($) matches the end of the URL. For example, "Disallow: /*.pdf$" blocks all PDF files, and "Disallow: /private*" blocks all URLs starting with /private. However, not all bots support wildcards, so test thoroughly.
Need Help?
Use our Validator, Tester, and Examples tools to make robots.txt creation easy.
