Robots.txt Complete Guide

Everything you need to know about robots.txt files

What is Robots.txt?

A robots.txt file is a text file that tells search engine crawlers which pages or sections of your website they can or cannot access. It's part of the Robots Exclusion Protocol (REP), a group of web standards that regulate how robots crawl the web.

The file must be placed at the root of your website (e.g., https://example.com/robots.txt).

Basic Syntax

robots.txt
User-agent: *
Disallow: /admin/
Allow: /admin/public/

Sitemap: https://example.com/sitemap.xml
  • User-agent: Specifies which crawler the rules apply to (* = all crawlers)
  • Disallow: Tells crawlers not to access certain paths
  • Allow: Explicitly allows access (can override Disallow rules, as the sketch after this list shows)
  • Sitemap: Points to your XML sitemap location
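
You can sanity-check rules like these locally with Python's standard-library urllib.robotparser. This is only a minimal sketch with a placeholder example.com domain; note that the stdlib parser applies the first matching rule in file order rather than Google's longest-match rule, so the Allow line is listed first here to keep both interpretations in agreement:

check-basic-rules.py
from urllib.robotparser import RobotFileParser

# Rules mirroring the Basic Syntax example above, with Allow listed first.
rules = """
User-agent: *
Allow: /admin/public/
Disallow: /admin/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# /admin/ is blocked, but the more specific /admin/public/ stays accessible.
print(parser.can_fetch("Googlebot", "https://example.com/admin/settings"))    # False
print(parser.can_fetch("Googlebot", "https://example.com/admin/public/faq"))  # True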

Common User-Agents

  • Googlebot: Google's main web crawler
  • Googlebot-Image: Google's image search crawler
  • Bingbot: Microsoft Bing's crawler
  • * (wildcard): applies to all bots

Best Practices

✅ Do:

  • Place robots.txt at your website root
  • Include a sitemap reference
  • Block admin and private areas
  • Test before deploying
  • Keep it simple and readable

❌ Don't:

  • Use robots.txt for security (it's publicly visible)
  • Block your entire site unless intentional
  • Forget to allow important content
  • Use it as the only crawl control method
  • Block CSS/JS files (can hurt SEO)

Common Patterns

Allow Everything

allow-all.txt
User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml

Block Specific Paths

block-paths.txt
User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /temp/

Sitemap: https://example.com/sitemap.xml

Allow Only Specific Bots

selective-bots.txt
User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

Sitemap: https://example.com/sitemap.xml
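
If you want to confirm that a per-bot layout like this behaves as intended before uploading it, here is a rough check using Python's urllib.robotparser (example.com and the /page path are placeholders, and real crawlers may differ in edge cases):

check-bot-access.py
from urllib.robotparser import RobotFileParser

# The same groups as the selective-bots example above.
rules = """
User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# Named bots match their own group; everything else falls back to the catch-all.
print(parser.can_fetch("Googlebot", "https://example.com/page"))      # True
print(parser.can_fetch("Bingbot", "https://example.com/page"))        # True
print(parser.can_fetch("SomeOtherBot", "https://example.com/page"))   # False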

🌐 Real-World Use Cases

Here's how major websites use robots.txt to control their crawl budget and protect their content:

📰 News Websites (Like CNN, BBC)

News sites need search engines to quickly index breaking news while avoiding duplicate content from archives and print versions.

User-agent: *
Disallow: /print/         # Block printer-friendly versions
Disallow: /amp/archive/   # Block archived AMP pages
Disallow: /search?        # Block search result pages
Allow: /amp/              # Allow current AMP articles

Sitemap: https://news-site.com/sitemap.xml

Why: Focuses crawl budget on current news while preventing duplicate content penalties from print and archive versions.

🛒 E-commerce Sites (Like Amazon, eBay)

E-commerce platforms must balance indexing product pages while blocking duplicate filter/sort combinations and checkout flows.

User-agent: *
Disallow: /cart/          # Block shopping cart
Disallow: /checkout/      # Block checkout process
Disallow: /my-account/    # Block user accounts
Disallow: /*?sort=        # Block sort parameters
Disallow: /*?filter=      # Block filter parameters
Allow: /products/         # Allow product pages

Sitemap: https://shop.com/sitemap-products.xml

Why: Prevents wasting crawl budget on infinite filter combinations while ensuring all unique products are indexed.

📝 WordPress Blogs

WordPress sites need to block admin areas, duplicate content from tags/categories, and trackback URLs while allowing posts and pages.

User-agent: *
Disallow: /wp-admin/      # Block WordPress admin
Allow: /wp-admin/admin-ajax.php  # Allow AJAX (needed)
Disallow: /wp-includes/   # Block WP core files
Disallow: /wp-content/plugins/  # Block plugins
Disallow: /wp-content/themes/   # Block themes
Disallow: /*?s=           # Block search results
Disallow: /*?replytocom=  # Block comment replies
Disallow: /tag/           # Block tag archives (if using categories)

Sitemap: https://blog.com/sitemap_index.xml

Why: Protects admin areas from exposure while avoiding duplicate content from WordPress's various archive types.

🏢 SaaS Platforms (Like Slack, Notion)

SaaS platforms need to allow crawling of marketing pages while completely blocking the application itself and user data.

User-agent: *
Disallow: /app/           # Block entire application
Disallow: /api/           # Block API endpoints
Disallow: /dashboard/     # Block user dashboards
Disallow: /workspace/     # Block workspaces
Allow: /                  # Allow marketing site
Allow: /pricing
Allow: /features
Allow: /blog/

Sitemap: https://saas.com/marketing-sitemap.xml

Why: Ensures Google only indexes public marketing content, not user-generated private content behind login.

🎬 Streaming Services (Like Netflix, YouTube)

Video platforms need to index content pages while blocking infinite scroll, recommendations, and personalized feeds.

User-agent: *
Disallow: /recommendations/  # Block personalized content
Disallow: /my-list/          # Block user lists
Disallow: /*?autoplay=       # Block autoplay URLs
Disallow: /*?t=              # Block timestamp URLs
Allow: /watch?v=             # Allow video pages
Allow: /channel/             # Allow channel pages

User-agent: Googlebot-Image
Allow: /thumbnails/          # Allow thumbnail crawling

Sitemap: https://streaming.com/video-sitemap.xml

Why: Indexes individual videos while avoiding duplicate content from recommendations and time-stamped URLs.

💡 Pro Tip: Use our Examples page for more pre-built templates you can customize for your website type.

SEO Impact

Robots.txt directly affects how search engines crawl your site:

  • Crawl Budget: Help search engines focus on your important pages
  • Privacy: Keep private or duplicate content out of search results
  • Performance: Reduce server load by blocking unnecessary crawling
  • Control: Manage which bots can access your content

⚠️ Warning: Blocking important pages can hurt your SEO. Always test thoroughly before deploying.

⚡ Performance Tips

Optimize your robots.txt file for better performance and faster processing by search engines:

📏 Keep File Size Under 500KB

Search engines have a 500KB limit for robots.txt files. Files larger than this may not be fully processed. Keep your rules concise and avoid unnecessary repetition.
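
A quick way to keep an eye on this is to fetch the file and measure it. A minimal sketch, assuming a placeholder example.com domain:

check-size.py
import urllib.request

# Placeholder URL; substitute your own domain.
with urllib.request.urlopen("https://example.com/robots.txt") as response:
    body = response.read()

size_kb = len(body) / 1024
print(f"robots.txt is {size_kb:.1f} KB")
if size_kb > 500:
    # Rules past the limit may simply be ignored by crawlers.
    print("Warning: file exceeds the 500 KB that crawlers are guaranteed to parse")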

🎯 Use Wildcards Efficiently

Instead of listing every file individually, use wildcards to match patterns:

# ❌ Inefficient:
Disallow: /temp/file1.html
Disallow: /temp/file2.html
Disallow: /temp/file3.html
# ✅ Better:
Disallow: /temp/

⏱️ Minimize Redundant Rules

Each rule takes time to process. Combine similar rules under the same User-agent block instead of creating separate blocks. Remove duplicate or contradictory rules.

🚀 Consider Crawl Budget

Block low-value pages (admin panels, duplicate content, infinite scroll pages) to save your crawl budget for important content. This helps search engines discover your best pages faster.

📊 Monitor Parse Time

Use our Validator to check your file's parse time. Files with hundreds of lines or over 50KB will show estimated processing time. Large files may slow down crawler processing.

🔄 Cache-Friendly Updates

Search engines cache robots.txt files for up to 24 hours. When making critical changes, use Google Search Console to request an immediate re-crawl of your robots.txt file.
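
Your own server or CDN may add caching on top of the search engines' 24-hour cache. A small sketch (placeholder domain) for inspecting the caching headers you currently serve for robots.txt:

check-cache-headers.py
import urllib.request

# HEAD request so only the headers are fetched; example.com is a placeholder.
request = urllib.request.Request("https://example.com/robots.txt", method="HEAD")
with urllib.request.urlopen(request) as response:
    print("Status:", response.status)
    print("Cache-Control:", response.headers.get("Cache-Control"))
    print("Last-Modified:", response.headers.get("Last-Modified"))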

Testing Your Robots.txt

  1. Create your robots.txt file using our Generator
  2. Validate it with our Validator
  3. Test specific URLs to ensure they're allowed/blocked correctly (see the sketch after this list)
  4. Upload to your website root
  5. Verify it's accessible at yourdomain.com/robots.txt
  6. Check it in Google Search Console's robots.txt report (the old Crawl → robots.txt Tester has been retired)
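
For step 3, Python's urllib.robotparser can fetch your live file and evaluate URLs against it. A minimal sketch with placeholder domain and paths; note that the stdlib parser ignores * and $ wildcards, so wildcard rules still need a dedicated tester:

test-live-rules.py
from urllib.robotparser import RobotFileParser

# Placeholder domain and paths; replace with your own.
parser = RobotFileParser("https://example.com/robots.txt")
parser.read()  # fetch and parse the live file

for path in ["/", "/admin/", "/products/widget"]:
    url = f"https://example.com{path}"
    verdict = "allowed" if parser.can_fetch("Googlebot", url) else "blocked"
    print(f"{path}: {verdict}")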

🔧 Troubleshooting Common Issues

Having problems with your robots.txt file? Here are solutions to the most common issues:

❌ Problem: "Robots.txt not found" (404 error)

Symptoms: Search Console shows 404 error when accessing robots.txt

Solutions:

  • Verify file is named exactly "robots.txt" (lowercase, no spaces)
  • Place file in website root directory, not in subdirectories
  • Check file permissions - should be readable by web server (644)
  • Clear CDN/caching if using Cloudflare or similar services
  • Test access directly: curl https://yourdomain.com/robots.txt
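
The same check as the curl command, in Python if that's handier (the domain is a placeholder):

check-status.py
import urllib.error
import urllib.request

try:
    with urllib.request.urlopen("https://yourdomain.com/robots.txt") as response:
        print("HTTP", response.status)   # 200 means the file is being served
except urllib.error.HTTPError as error:
    print("HTTP", error.code)            # 404 means it's missing or in the wrong place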

⚠️ Problem: Rules not working as expected

Symptoms: URLs are crawled despite Disallow rules, or blocked despite Allow rules

Solutions:

  • Rule precedence: Google applies the most specific (longest) matching rule, not the first one listed, so check which rule actually wins for a given URL
  • Check for typos in paths (robots.txt paths are case-sensitive; see the sketch after this list)
  • Wait 24 hours - search engines cache robots.txt
  • Use our Tester to verify rules match your URLs
  • Check for conflicting rules (when an Allow and a Disallow rule match equally, Google applies the less restrictive Allow)
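
To see the case-sensitivity point concretely, here is a small sketch using Python's urllib.robotparser with a made-up /Private/ path (the stdlib parser handles plain prefix rules only, not wildcards):

case-sensitivity.py
from urllib.robotparser import RobotFileParser

# A wrong-case rule does NOT block the lowercase path.
rules = """
User-agent: *
Disallow: /Private/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

print(parser.can_fetch("*", "https://example.com/Private/report.html"))  # False (blocked)
print(parser.can_fetch("*", "https://example.com/private/report.html"))  # True (not blocked)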

⏱️ Problem: Changes not taking effect

Symptoms: Updated robots.txt but bots still use old rules

Solutions:

  • Search engines cache robots.txt for up to 24 hours
  • In Google Search Console, open the robots.txt report and request a recrawl of the file
  • Clear your CDN cache if using one (Cloudflare, etc.)
  • Verify changes actually saved by viewing source in browser
  • Check server isn't serving cached version

🚫 Problem: Entire site blocked accidentally

Symptoms: No pages being indexed, traffic dropped significantly

Solutions:

  • Check for "Disallow: /" under "User-agent: *" - this blocks everything
  • Remove or comment out overly broad Disallow rules
  • Use our Validator to check for syntax errors
  • Request re-indexing in Google Search Console after fixing
  • Consider using Allow rules to explicitly permit important content

📄 Problem: Syntax errors or validation warnings

Symptoms: Search Console reports errors, unexpected bot behavior

Solutions:

  • Each User-agent group should contain at least one Disallow or Allow directive
  • Write one directive per line in the form "Directive: value" (a single space after the colon is standard); don't put multiple paths on one line
  • Use our Validator for real-time syntax checking
  • Paths must start with "/" (e.g., "/admin/" not "admin/")
  • Remove special characters that might break parsing
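
For a very rough automated pass over the points above, the sketch below flags lines that don't look like "Directive: value", a comment, or a blank line. It is not a full validator, and the URL is a placeholder:

lint-robots.py
import re
import urllib.request

url = "https://example.com/robots.txt"   # placeholder; use your own domain
with urllib.request.urlopen(url) as response:
    text = response.read().decode("utf-8", errors="replace")

# One directive per line: a name, a colon, a single value, optional trailing comment.
directive = re.compile(r"^[A-Za-z][A-Za-z-]*\s*:\s*\S*(\s+#.*)?$")

for number, raw in enumerate(text.splitlines(), start=1):
    line = raw.strip()
    if line and not line.startswith("#") and not directive.match(line):
        print(f"Line {number}: doesn't look like a directive -> {raw!r}")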

Frequently Asked Questions

What happens if I don't have a robots.txt file?

If you don't have a robots.txt file, search engine crawlers will assume they can access and index all pages on your website. This is equivalent to having an "Allow: /" rule for all user-agents. While not required, having a robots.txt file gives you more control over how search engines crawl your site.

Yes, using "User-agent: *" with "Disallow: /" will tell all crawlers not to access any pages. However, this doesn't guarantee complete removal from search results - URLs may still appear if other sites link to them. For complete removal, you need to use meta robots tags or X-Robots-Tag headers in addition to robots.txt.

How long do robots.txt changes take to take effect?

Search engines typically cache robots.txt files for up to 24 hours. This means changes may not take effect immediately. For urgent changes, you can use Google Search Console to request a re-crawl of your robots.txt file, which will update Google's cached version faster.

Should I block CSS and JavaScript files?

No, you should not block CSS and JavaScript files. Google and other search engines need to render your pages to understand their content and layout. Blocking these resources can negatively impact your SEO and may prevent search engines from properly indexing your site. Only block truly sensitive or unnecessary content.

What's the difference between robots.txt and meta robots tags?

Robots.txt controls whether crawlers can access pages, while meta robots tags control whether pages can be indexed in search results. A page blocked by robots.txt won't be crawled, so crawlers can't see its meta tags. Use robots.txt for crawl control and meta tags for indexing control. For best results, use both appropriately based on your needs.

Can I use wildcards in robots.txt?

Yes, most modern search engines support wildcards: asterisk (*) matches any sequence of characters, and dollar sign ($) matches the end of the URL. For example, "Disallow: /*.pdf$" blocks all PDF files, and "Disallow: /private*" blocks all URLs starting with /private. However, not all bots support wildcards, so test thoroughly.
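
As an illustration of those two metacharacters (not how any particular crawler implements them), this sketch translates a robots-style pattern into a regular expression:

wildcard-demo.py
import re

def robots_pattern_to_regex(pattern: str) -> re.Pattern:
    # '*' matches any run of characters; a trailing '$' anchors the end of the URL.
    anchored = pattern.endswith("$")
    body = re.escape(pattern.rstrip("$")).replace(r"\*", ".*")
    return re.compile("^" + body + ("$" if anchored else ""))

rule = robots_pattern_to_regex("/*.pdf$")
print(bool(rule.match("/files/report.pdf")))         # True: the URL ends in .pdf
print(bool(rule.match("/files/report.pdf?page=2")))  # False: '$' requires the URL to end there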

Need Help?

Use our tools to make robots.txt creation easy.