GitHub Project

Grounding Citation Analysis

A reproducible study that decodes Google AI Mode's #:~:text= citation fragments to reveal the exact sentences AI engines quote, across 42,971 citations and 6 platforms. Powered by an AI web scraping API.

7 stars, 2 forks, MIT license

42,971 AI citations analyzed

6 AI platforms

100 queries

12 query categories

Quick Start

git clone https://github.com/danishashko/grounding-citation-analysis

Grounding Citation Analysis is an open, reproducible study that reverse-engineers how Google AI Mode and Gemini decide what to quote. Every AI Mode citation URL hides a #:~:text= Web Text Fragment anchor; by decoding those fragments at scale we can see the exact sentence Google extracted from each source page, with no guesswork. The dataset spans 42,971 AI citations across six platforms.
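As a rough illustration of the decoding step, the sketch below pulls the quoted text out of a citation URL using only the Python standard library. It assumes the fragment follows the published text-fragment syntax, `#:~:text=[prefix-,]start[,end][,-suffix]`, with multiple directives joined by `&`. The names `TextDirective` and `parse_text_directives` are hypothetical and are not taken from the repository.

```python
from dataclasses import dataclass
from typing import List, Optional
from urllib.parse import unquote, urlsplit

FRAGMENT_DELIMITER = ":~:"  # separates a normal fragment from its directives


@dataclass
class TextDirective:
    """One decoded text= directive (hypothetical container, not from the repo)."""
    start: str
    end: Optional[str] = None
    prefix: Optional[str] = None
    suffix: Optional[str] = None


def parse_text_directives(url: str) -> List[TextDirective]:
    """Return every text= directive found in the URL's fragment."""
    fragment = urlsplit(url).fragment  # still percent-encoded at this point
    if FRAGMENT_DELIMITER not in fragment:
        return []
    directive_part = fragment.split(FRAGMENT_DELIMITER, 1)[1]

    directives = []
    for item in directive_part.split("&"):
        if not item.startswith("text="):
            continue
        # Split on literal commas before decoding, so an encoded %2C
        # inside the quoted text is not mistaken for a separator.
        parts = item[len("text="):].split(",")
        prefix = suffix = None
        if parts and parts[0].endswith("-"):
            prefix = unquote(parts.pop(0)[:-1])
        if parts and parts[-1].startswith("-"):
            suffix = unquote(parts.pop()[1:])
        if not parts or len(parts) > 2:
            continue  # malformed directive: skip rather than guess
        start = unquote(parts[0])
        end = unquote(parts[1]) if len(parts) == 2 else None
        directives.append(TextDirective(start, end, prefix, suffix))
    return directives


if __name__ == "__main__":
    example = "https://example.com/post#:~:text=Cited%20sentences%20cluster,of%20the%20page"
    for directive in parse_text_directives(example):
        print(directive)
```

When a directive has both `start` and `end`, the fragment only marks the boundaries of the quoted range, so recovering the full sentence still requires the scraped page text.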

The core insight

If you can see which sentence was cited and where it sat on the page, AI search optimization stops being a guessing game. Front-loading key insights, writing quotable sentences and structuring content for extraction all become data-backed strategies rather than best-practice folklore.

Research questions

Do cited sentences cluster in the top 30% of a page? (positional bias)
Are cited sentences shorter than the average sentence on the page? (length preference)
Are structured pages (lists and tables) cited more often than unstructured ones?
Do AI Mode and Gemini cite overlapping or distinct URLs?
Does cited-sentence length vary by query category?
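The positional-bias question can be made concrete with a small sketch. It assumes a cited sentence's position is the character offset where it starts, divided by the page length, and that under the null hypothesis citations fall uniformly across the page, so 30% of them would land in the top 30%. The function names, the whitespace and case normalization, and the exact one-sided binomial test are illustrative assumptions, not necessarily the study's method.

```python
from math import exp, lgamma, log
from typing import Optional, Sequence, Tuple


def relative_position(page_text: str, cited: str) -> Optional[float]:
    """Fraction of the page (0.0 = top) at which the cited text starts."""
    # Collapse whitespace and ignore case so layout differences do not block a match.
    norm_page = " ".join(page_text.split()).lower()
    norm_cited = " ".join(cited.split()).lower()
    if not norm_cited:
        return None
    index = norm_page.find(norm_cited)
    if index < 0:
        return None  # the cited text was not found on the page
    return index / len(norm_page)


def binomial_tail(successes: int, trials: int, p: float) -> float:
    """Exact one-sided P(X >= successes) for X ~ Binomial(trials, p), 0 < p < 1."""
    total = 0.0
    for k in range(successes, trials + 1):
        # Work in log space so large sample sizes do not overflow.
        log_pmf = (
            lgamma(trials + 1) - lgamma(k + 1) - lgamma(trials - k + 1)
            + k * log(p) + (trials - k) * log(1 - p)
        )
        total += exp(log_pmf)
    return min(total, 1.0)


def top_share_test(positions: Sequence[float], cutoff: float = 0.30) -> Tuple[float, float]:
    """Share of citations above the cutoff, plus a p-value against a uniform spread."""
    if not positions:
        raise ValueError("positions must not be empty")
    hits = sum(1 for pos in positions if pos < cutoff)
    return hits / len(positions), binomial_tail(hits, len(positions), cutoff)


if __name__ == "__main__":
    share, p_value = top_share_test([0.05, 0.10, 0.20, 0.90])
    print(f"share in top 30%: {share:.2f}, one-sided p-value: {p_value:.4f}")
```

A small p-value here would suggest that more citations sit near the top of the page than a uniform spread would produce.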

Key features

Decodes #:~:text= Web Text Fragments to extract the exact cited sentence from every URL

Sentence-level positional analysis: where on a page AI engines actually quote from

Cross-platform coverage: Google AI Mode, Gemini, ChatGPT, Perplexity, Copilot and Grok

Statistical tests for positional bias, sentence length and structured-content advantage

Fully reproducible open-source pipeline: collect, parse, scrape, analyze, visualize

Publication-quality charts generated for the accompanying write-up
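To give a sense of what a sentence-length test might look like, here is a minimal sketch that compares the word counts of cited sentences against a baseline sample of page sentences. The project lists SciPy in its stack. This sketch swaps in a plain permutation test from the standard library so it runs without dependencies, and it should not be read as the study's actual test. The function names and the sample numbers are made up for illustration.

```python
import random
from typing import Sequence, Tuple


def word_count(sentence: str) -> int:
    """Length of a sentence in whitespace-separated words."""
    return len(sentence.split())


def _mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def permutation_test_shorter(
    cited: Sequence[float],
    baseline: Sequence[float],
    n_resamples: int = 10_000,
    seed: int = 0,
) -> Tuple[float, float]:
    """One-sided permutation test for mean(cited) < mean(baseline).

    Returns the observed difference in means and an estimated p-value.
    """
    if not cited or not baseline:
        raise ValueError("both samples must be non-empty")
    rng = random.Random(seed)  # fixed seed keeps the result reproducible
    observed = _mean(cited) - _mean(baseline)
    pooled = list(cited) + list(baseline)
    n_cited = len(cited)

    as_extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # relabel the sentences at random
        diff = _mean(pooled[:n_cited]) - _mean(pooled[n_cited:])
        if diff <= observed:
            as_extreme += 1
    # The +1 correction keeps the estimated p-value strictly above zero.
    return observed, (as_extreme + 1) / (n_resamples + 1)


if __name__ == "__main__":
    cited_lengths = [word_count(s) for s in (
        "Front-load the key insight.",
        "Short sentences are easy to quote.",
    )]
    baseline_lengths = [18, 22, 25, 27, 30]  # made-up baseline word counts
    print(permutation_test_shorter(cited_lengths, baseline_lengths, n_resamples=2000))
```

A rank-based test such as Mann-Whitney U would be a reasonable alternative when sentence lengths are heavily skewed.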

Built with

Python 3.11
AI web scraping API
SciPy
Matplotlib
Seaborn
Jupyter

Frequently Asked Questions

What data does this analysis cover?

The dataset covers 42,971 AI citations from Google AI Mode, Gemini, ChatGPT, Perplexity, Copilot and Grok. The citations come from 100 queries in 12 categories, giving broad coverage of how different AI engines cite web content.

Can I reproduce this research?

Yes, the entire pipeline is open source. You'll need an AI web scraping API account (roughly $6 to $25 for the data collection) and Python 3.11 or later. The scripts run sequentially from collection through analysis and chart generation.

How does this help with SEO and GEO?

Knowing exactly which sentences AI engines cite, and where they sit on the page, gives you concrete optimization targets. When cited sentences cluster in the top 30% of a page, front-loading your key insights becomes a data-backed strategy rather than a hunch.
