
Robots.txt for AI Crawlers in 2026: Allow, Block, or Negotiate?

By Daniel Shashko · 3 min read
AI Summary
To maximize AI visibility in 2026, most brands should allow both search-time and training crawlers via robots.txt. Blocking training crawlers like GPTBot or CCBot is advisable only for paywalled content, proprietary data, or image-heavy creative work; for everyone else it is largely futile, because Common Crawl ingests content anyway. A practical robots.txt template allows all major AI crawlers, including OAI-SearchBot and Google-Extended, to ensure citation and entity recognition.

There are now over 30 distinct AI-related crawlers hitting your site. Some retrieve content for live AI answers, citing you in the process. Some scrape for model training, with no citation in return. Some do both. The robots.txt decisions you make in 2026 either compound your AI visibility or quietly cut you out of half the citation ecosystem.

The 4 categories of AI crawler

  1. Search-time retrieval crawlers. OAI-SearchBot, PerplexityBot, and the Google-Extended token (which gates grounding in Gemini; Google's AI Overviews follow normal Googlebot indexing instead). These are the ones that get you cited. Block them and you disappear from AI search results.
  2. Live user-fetch agents. ChatGPT-User, Perplexity-User, Claude-User. These fetch a page in real time when a user asks a question that requires fresh data. Blocking these makes you invisible for live queries.
  3. Training crawlers. GPTBot, ClaudeBot, Google-Extended (when used for model training), CCBot (Common Crawl, whose corpus feeds models across the industry). These ingest content for future model training. No direct citation benefit; it's a brand-strategy decision.
  4. Misclassified or hostile bots. Anthropic's deprecated anthropic-ai token, plus scraping bots impersonating legitimate ones. These need separate handling (a log-triage sketch follows this list).
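
As a first step in that separate handling, here is a minimal log-triage sketch, assuming a combined-format access.log with the client IP in the first field (the file name and format are assumptions; adjust for your setup). It lists the source IPs claiming to be GPTBot so you can cross-check them against OpenAI's published IP ranges:

# List source IPs claiming to be GPTBot, most frequent first.
# Cross-check them against the vendor's published ranges; IPs outside
# those ranges are likely impostors.
awk '/GPTBot/ {print $1}' access.log | sort | uniq -c | sort -rn | head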

The default-allow stance

For most brands, the right default is to ALLOW search and live-fetch crawlers, and ALLOW training crawlers as well. Three reasons:

  • Citation upside outweighs training-data downside in almost all categories.
  • Training-data ingestion compounds your entity recognition across future model versions.
  • Blocking training crawlers does not actually remove you from training data: unless you also block CCBot, Common Crawl ingests you anyway and feeds many models.

When blocking training crawlers makes sense

  • Paywalled or members-only content. If your business model depends on gated access, blocking training crawlers is a legitimate defense.
  • Original research and proprietary data. Independent research firms often block training crawlers to preserve the value of their data.
  • Image-heavy creative work. Photographers, illustrators, and designers have legitimate concerns about training-data appropriation. Blocking GPTBot via robots.txt is reasonable.
  • Legal or regulatory restrictions. Some jurisdictions and contracts require opt-out from AI training.

Even in these cases, allow the search-time crawlers (OAI-SearchBot, PerplexityBot) so you remain discoverable; the second template below does exactly this.

The robots.txt template that handles 2026

A practical default for a brand that wants maximum AI visibility:

User-agent: *
Allow: /

User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Claude-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

User-agent: Google-Extended
Allow: /

Sitemap: https://yoursite.com/sitemap.xml
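
Once deployed, a quick sanity check that the live file serves the groups you expect (yoursite.com is the placeholder from the template):

# Fetch the live robots.txt and show the OAI-SearchBot group.
curl -s https://yoursite.com/robots.txt | grep -A 1 'User-agent: OAI-SearchBot'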

If you want to block training only (crawlers you don't name here fall back to your * group, or to default-allow if no * group exists):

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

# Still allow search-time
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

Beyond robots.txt: meta tags and HTTP headers

  • noai and noimageai meta tags. Less universally honored than robots.txt, but they signal an opt-out to compliant scrapers.
  • X-Robots-Tag HTTP header. Apply to non-HTML resources (PDFs, images) where meta tags don’t fit; see the header sketch after this list.
  • Cloudflare AI Audit and similar services. Provide per-bot allow/block at the CDN edge with logging. Useful for high-traffic sites.
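
A minimal sketch of the header approach for nginx (the extension list is illustrative, and noai/noimageai are informal opt-out tokens rather than a formal standard):

# nginx: attach an AI opt-out header to non-HTML assets.
# Compliant scrapers may honor these tokens; hostile ones will not.
location ~* \.(pdf|png|jpe?g|webp)$ {
    add_header X-Robots-Tag "noai, noimageai";
}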

Auditing what your robots.txt actually does

  1. Pull your server access logs from the last 30 days. Filter for known AI user-agents (a one-liner sketch follows this list).
  2. Check which crawlers actually hit you and at what frequency. Verify your robots.txt rules match your intent.
  3. Test reachability per crawler by spoofing the user-agent with curl: curl -A 'OAI-SearchBot' https://yoursite.com/robots.txt. This catches server- or WAF-level blocks that robots.txt rules alone won't reveal.
  4. Spot-check that your most important pages return 200 to all allowed crawlers.
  5. Re-audit quarterly as new crawlers emerge.
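
A minimal filter for steps 1 and 2, assuming a single combined-format access.log (adjust the path and log rotation for your setup). Note that Google-Extended is a robots.txt token rather than a crawler, so it never appears in logs; Googlebot does the fetching:

# Count hits per AI crawler user-agent, most active first.
grep -oE 'GPTBot|OAI-SearchBot|ChatGPT-User|ClaudeBot|Claude-User|PerplexityBot|Perplexity-User|CCBot' access.log | sort | uniq -c | sort -rn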

Frequently Asked Questions

Will blocking GPTBot remove me from ChatGPT entirely?
No. Blocking GPTBot only opts you out of OpenAI’s training corpus. You still appear in ChatGPT search results as long as OAI-SearchBot is allowed and you’re indexed in Bing.
Should I block CCBot (Common Crawl)?
Only if you have specific reasons to opt out of broad training data. CCBot is used by dozens of academic, commercial, and open-source projects, and blocking it has wide ripple effects.
Are robots.txt rules legally binding on AI companies?
Robots.txt is a voluntary standard, not law. Major AI companies (OpenAI, Anthropic, Google, Perplexity, Microsoft) honor it. Smaller scrapers and hostile actors often don’t. For hard enforcement, use CDN-level blocking.
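
As a sketch of what CDN-level blocking looks like, here is a minimal Cloudflare custom-rule expression (confirm field availability on your plan; cf.client.bot is true for requests from known good bots) that blocks requests claiming to be GPTBot which Cloudflare cannot verify:

(http.user_agent contains "GPTBot" and not cf.client.bot)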

Want this implemented for your brand?

I help growth-stage companies own their category in AI search. Audit your AI crawler footprint.