Grounding Citation Analysis

A research project that reverse-engineers Google AI Mode’s sentence-level citation behavior using #:~:text= URL fragments. The first reproducible study to decode Web Text Fragments at scale, analyzing 42,971 AI citations across 6 platforms.

The Core Insight

Every Google AI Mode and Gemini citation URL contains a hidden Web Text Fragment anchor. By decoding these fragments, we can see exactly which sentences Google extracted from source pages – no guesswork required.

Research Pipeline

  1. Collect – The Google AI Mode Scraper gathers citation URLs for 100 queries across 12 categories
  2. Parse – Decode #:~:text= fragments to extract the exact cited sentence from every URL
  3. Scrape – Fetch source pages and locate each cited sentence within the document (positional analysis)
  4. Analyze – Statistical tests for positional bias, sentence length preferences, structured content advantage
  5. Visualize – Publication-quality charts for the accompanying article

Research Questions

  • Do cited sentences cluster in the top 30% of documents? (Positional bias)
  • Are cited sentences shorter than average page text? (Length preferences)
  • Are structured pages (lists/tables) cited more frequently?
  • Do AI Mode and Gemini cite overlapping or distinct URLs?
  • Does sentence length vary by query category?
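
Two of these questions reduce to short computations, sketched below with the standard library only. The first is a one-sided exact binomial test of whether more citations fall in the top 30% than a uniform spread over the page would predict (SciPy's `scipy.stats.binomtest` computes the same quantity); the second is a Jaccard overlap between two platforms' cited URLs after stripping fragments. The uniform null hypothesis, the function names, and the choice of these particular statistics are assumptions for illustration, not the project's documented method.

```python
from math import exp, fsum, lgamma, log
from urllib.parse import urldefrag


def top_share_p_value(positions: list[float], cutoff: float = 0.30) -> tuple[int, int, float]:
    """One-sided binomial test: are positions below `cutoff` over-represented?

    Under the (assumed) null, each citation lands in the top `cutoff` share
    of the page with probability `cutoff`. Returns (hits, total, p_value).
    """
    n = len(positions)
    if n == 0:
        raise ValueError("no positions to test")
    k = sum(1 for p in positions if p < cutoff)

    def log_pmf(i: int) -> float:
        # Log-space binomial PMF keeps large samples from overflowing.
        return (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                + i * log(cutoff) + (n - i) * log(1.0 - cutoff))

    # P(X >= k): sum the upper tail of the binomial distribution.
    p_value = min(1.0, fsum(exp(log_pmf(i)) for i in range(k, n + 1)))
    return k, n, p_value


def url_overlap(urls_a: list[str], urls_b: list[str]) -> float:
    """Jaccard similarity of two URL collections, ignoring #fragments."""
    a = {urldefrag(u).url for u in urls_a}
    b = {urldefrag(u).url for u in urls_b}
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

Fragments are stripped before comparing because two citations of the same page usually carry different `#:~:text=` anchors and would otherwise count as distinct URLs.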

Tech Stack

  • Python 3.11+
  • AI Mode Scraper + Web Unlocker
  • SciPy for statistical analysis
  • Matplotlib/Seaborn for visualizations
  • Jupyter Notebooks for methodology documentation

Frequently Asked Questions

What data does this analysis cover?
42,971 AI citations across Google AI Mode, Gemini, ChatGPT, Perplexity, Copilot, and Grok. The citations span 100 queries across 12 categories, providing broad coverage of how different AI models cite web content.
Can I reproduce this research?
Yes, the entire pipeline is open source. You’ll need an account with the scraping service (approximately $6-25 for data collection) and Python 3.11+. The scripts run sequentially from collection through analysis and chart generation.
How does this help with SEO/GEO?
Understanding exactly which sentences AI models cite – and where they appear on the page – gives you actionable optimization targets. If cited sentences cluster in the top 30% of pages, front-loading your key insights becomes a data-backed strategy, not just a best practice.