This research project reverse-engineers Google AI Mode’s sentence-level citation behavior using #:~:text= URL fragments. It is the first reproducible study to decode Web Text Fragments at scale, analyzing 42,971 AI citations across 6 platforms.
The Core Insight
Every Google AI Mode and Gemini citation URL contains a hidden Web Text Fragment anchor. By decoding these fragments, we can see exactly which sentences Google extracted from source pages – no guesswork required.
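As a minimal sketch of what decoding a fragment involves, the snippet below splits a citation URL into its text directive parts using only the Python standard library. It assumes the standard text fragment syntax, `text=[prefix-,]start[,end][,-suffix]`. The function name `parse_text_fragments` and the example URL are hypothetical and are not taken from the project's code.

```python
from urllib.parse import urlsplit, unquote


def parse_text_fragments(url: str) -> list:
    """Return one dict per text= directive found after the ':~:' delimiter.

    Hypothetical helper for illustration; not the project's actual parser.
    """
    fragment = urlsplit(url).fragment  # still percent-encoded at this point
    _, delimiter, directive = fragment.partition(":~:")
    if not delimiter:
        return []  # the URL carries no fragment directive

    results = []
    # Multiple directives are separated by '&'. Literal '&', ',' and '-'
    # inside the quoted text are percent-encoded, so splitting on the raw
    # characters before decoding is safe.
    for part in directive.split("&"):
        if not part.startswith("text="):
            continue
        pieces = part[len("text="):].split(",")
        parsed = {"prefix": None, "start": None, "end": None, "suffix": None}
        # An optional prefix ends with '-', an optional suffix starts with '-'.
        if len(pieces) > 1 and pieces[0].endswith("-"):
            parsed["prefix"] = unquote(pieces.pop(0)[:-1])
        if len(pieces) > 1 and pieces[-1].startswith("-"):
            parsed["suffix"] = unquote(pieces.pop()[1:])
        if pieces:
            parsed["start"] = unquote(pieces[0])
        if len(pieces) > 1:
            parsed["end"] = unquote(pieces[1])
        results.append(parsed)
    return results


# Made-up example URL, for illustration only.
example = ("https://example.com/guide#:~:text=Intro-,"
           "Text%20fragments%20link,to%20a%20sentence.,-Next")
for directive in parse_text_fragments(example):
    print(directive)
```

When only `start` is present, it holds the full cited passage. When `end` is also present, the citation spans from the `start` text to the `end` text, so recovering the full sentence requires the source page.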
Research Pipeline
- Collect – The Google AI Mode Scraper gathers citation URLs for 100 queries across 12 categories
- Parse – Decode #:~:text= fragments to extract the exact cited sentence from every URL
- Scrape – Fetch source pages and locate each cited sentence within the document (positional analysis)
- Analyze – Run statistical tests for positional bias, sentence-length preferences, and structured-content advantage
- Visualize – Produce publication-quality charts for the accompanying article
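The positional analysis in the Scrape step can be sketched roughly as follows: once a page's visible text has been extracted, find the cited sentence and express its offset as a fraction of the document length. This is an illustrative approach under simple assumptions (exact substring match after whitespace and case normalization). The helper names `normalize` and `relative_position` are hypothetical, and the project's actual matching logic may differ.

```python
import re
from typing import Optional


def normalize(text: str) -> str:
    """Collapse whitespace runs and lowercase the text before matching."""
    return re.sub(r"\s+", " ", text).strip().lower()


def relative_position(page_text: str, cited: str) -> Optional[float]:
    """Return where the cited text starts, as a fraction in [0, 1).

    0.0 means the very top of the page. None means the cited text was not
    found, which can happen when the page changed after it was cited.
    """
    page = normalize(page_text)
    needle = normalize(cited)
    if not page or not needle:
        return None
    index = page.find(needle)
    if index == -1:
        return None
    return index / len(page)


# Made-up page text, for illustration only.
page = "Alpha beta.  Gamma delta.\nEpsilon zeta."
print(relative_position(page, "Gamma   delta."))
print(relative_position(page, "Not on this page."))
```

Character offsets are a simple proxy for position. Counting sentences or DOM blocks instead would be a reasonable alternative, at the cost of needing a sentence splitter or an HTML parser.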
Research Questions
- Do cited sentences cluster in the top 30% of documents? (Positional bias)
- Are cited sentences shorter than average page text? (Length preferences)
- Are structured pages (lists/tables) cited more frequently than unstructured pages?
- Do AI Mode and Gemini cite overlapping or distinct URLs?
- Does sentence length vary by query category?
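One way to frame the positional-bias question as a test is sketched below. If cited sentences were spread uniformly through a document, about 30% of them would start in the top 30%, so a one-sided exact binomial test can check whether the observed share is higher. The uniform null, the function names, and the sample positions are all assumptions made for illustration. The positions are invented and are not results from the study. The sketch uses only the standard library. With SciPy, `scipy.stats.binomtest(k, n, p, alternative="greater")` computes the same one-sided p-value.

```python
from math import comb


def binom_sf_at_least(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p): a one-sided exact p-value."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def top_share_test(positions, cutoff: float = 0.30):
    """Return (observed share above the cutoff line, one-sided p-value).

    positions: relative positions in [0, 1), where 0 is the top of the page.
    Null hypothesis (an assumption): positions are uniform, so each citation
    falls in the top `cutoff` fraction with probability `cutoff`.
    """
    n = len(positions)
    if n == 0:
        raise ValueError("need at least one position")
    k = sum(1 for pos in positions if pos < cutoff)
    return k / n, binom_sf_at_least(k, n, cutoff)


# Invented positions, for illustration only (not study data).
sample = [0.05, 0.10, 0.12, 0.20, 0.25, 0.28, 0.40, 0.55, 0.70, 0.90]
share, p_value = top_share_test(sample)
print(f"share in top 30%: {share:.2f}, one-sided p-value: {p_value:.4f}")
```

The test treats every citation as independent. Several citations drawn from the same page would violate that assumption, so a per-page aggregate may be a safer unit of analysis.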
Tech Stack
- Python 3.11+
- AI Mode Scraper + Web Unlocker
- SciPy for statistical analysis
- Matplotlib/Seaborn for visualizations
- Jupyter Notebooks for methodology documentation