Retrieval, Not Rankings: How AI Engines Choose What to Cite

A page can rank #1 on Google and never get cited by a single AI engine. I watched this happen on my own properties before I understood why. Rankings and citations come out of two different machines. Rankings sort pages by relevance to a query. Citations attribute specific claims to specific sources. Once you see the pipeline that produces a citation, the gap stops being mysterious.

This post walks through that pipeline stage by stage. The stages are durable: they describe how retrieval-augmented systems work, not how one product behaves this quarter. The per-engine details near the end are dated on purpose, because those change.

The question you typed is not the question that gets searched

When someone asks an AI engine “what's the best CRM for a small sales team,” the engine doesn't search that sentence. It rewrites the question into several smaller queries: CRM comparisons, pricing for specific tools, reviews, feature lists for small teams. Google has documented this for AI Mode under the name query fan-out. The other engines do versions of the same thing.

This is the first place intuition from SEO breaks. You're not competing for one query. You're competing for sub-queries you never see, generated by a model, in shapes you didn't predict. A page that's a decent answer to the whole question can lose to five pages that are each the exact answer to one fragment of it.
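A toy sketch of that last point, under loud assumptions: the sub-queries and pages below are invented for illustration, and word overlap stands in for whatever relevance model a real engine uses. It only shows the shape of the problem: the generalist page wins the original question and loses every fragment.

```python
import re

def tokens(text: str) -> set[str]:
    """Lowercased word set; a crude stand-in for a real relevance model."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def overlap(query: str, page: str) -> float:
    """Fraction of the query's words that also appear in the page."""
    q = tokens(query)
    return len(q & tokens(page)) / len(q)

def best_page(query: str, pages: dict[str, str]) -> str:
    """Name of the page with the highest overlap for this query."""
    return max(pages, key=lambda name: overlap(query, pages[name]))

QUESTION = "what's the best CRM for a small sales team"

# Hypothetical sub-queries; real engines generate these with a model.
SUB_QUERIES = [
    "crm pricing small team",
    "crm reviews small sales team",
    "crm feature comparison",
]

# Hypothetical pages: one generalist, three specialists.
PAGES = {
    "general-guide": "the best crm for a small sales team",
    "pricing-page": "crm pricing for a small team per seat",
    "reviews-page": "crm reviews from small sales team owners",
    "comparison-page": "crm feature comparison table",
}

head_winner = best_page(QUESTION, PAGES)
fragment_winners = {q: best_page(q, PAGES) for q in SUB_QUERIES}

print("whole question:", head_winner)
for query, winner in fragment_winners.items():
    print(f"{query!r}: {winner}")
```

The generalist page is the best match for the sentence the user typed, and it is not the best match for any of the three fragments that would actually be searched.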

The pipeline

Six stages sit between a user's question and your domain appearing as a source. Here's the whole machine, then each stage in detail.

| Stage | What happens | What you control |
| --- | --- | --- |
| 1. Query fan-out | The question is rewritten into multiple sub-queries | Content that matches question fragments, not just head terms |
| 2. Candidate retrieval | Each sub-query pulls candidate pages from an index | Being indexed by whichever index the engine uses |
| 3. Fetch and access | Live engines fetch current page content | Crawler access: robots.txt, WAF rules, server-rendered HTML |
| 4. Passage selection | Pages are split into chunks and scored per sub-query | Extractable blocks: one clean answer per section |
| 5. Grounding and re-ranking | Passages are re-scored for relevance and source confidence | Entity signals, consistency, corroboration |
| 6. Synthesis and citation | The model writes the answer and attaches sources to claims | Owning the specific fact, not just the topic |

Stage 2: candidate retrieval

Every engine starts from an index. If your pages aren't in that index, nothing downstream can save you. The index is not always Google's, which surprises people. Which index each engine uses is in the dated section below, because that detail has already changed once and will change again.

Stage 3: fetch and access

Engines with live web access also fetch pages at answer time. This stage silently kills more sites than anyone admits. The usual culprits: a WAF rule that challenges bots, a CDN-managed robots.txt you didn't know you had, or content that only exists after JavaScript runs. I wrote up one specific failure: Cloudflare's managed robots.txt was blocking AI crawlers on sites whose owners never opted into anything. The page looks fine in a browser. To the retrieval system, it doesn't exist.
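One of those checks is easy to script. The sketch below uses Python's standard `urllib.robotparser` to ask which AI crawler tokens a given robots.txt body would block. The robots.txt text and the `example.com` URL are made up for illustration; in practice you would paste in whatever your live `/robots.txt` actually serves, including anything your CDN injects. It says nothing about WAF challenges or JavaScript-only content, which need separate tests.

```python
from urllib.robotparser import RobotFileParser

# Crawler tokens named in this post. Google-Extended is a robots.txt
# control token, so checking it against robots.txt rules is valid.
AI_BOTS = ["GPTBot", "OAI-SearchBot", "ClaudeBot", "PerplexityBot", "Google-Extended"]

def blocked_bots(robots_txt: str, url: str, bots: list[str] = AI_BOTS) -> list[str]:
    """Return the bots that this robots.txt body disallows from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in bots if not parser.can_fetch(bot, url)]

# Hypothetical robots.txt: open to everyone except one AI crawler.
SAMPLE_ROBOTS = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

print(blocked_bots(SAMPLE_ROBOTS, "https://example.com/page"))
```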

Stage 4: passage selection

The model never reads your page. It reads chunks of your page: passages a few hundred tokens long, embedded and scored against each sub-query. Most of your page is invisible to the system at any given moment.

This is the mechanical reason answer-first writing wins, and it has nothing to do with style. A direct answer in a self-contained block scores well as a passage. The same answer spread across four paragraphs of context-setting scores poorly, because no single chunk contains it. People treat “lead with the answer” as a writing tip. It's a chunking constraint.
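The constraint is easy to see in a toy model. The sketch below is an assumption-heavy simplification: real systems embed passages of a few hundred tokens, often with overlap, while this one uses fixed 8-word chunks and plain word overlap. The two sample texts and the query are invented. The point is only that a page is scored by its best single chunk, so an answer spread across chunks never scores as a whole.

```python
import re

def words(text: str) -> list[str]:
    """Lowercased word list."""
    return re.findall(r"[a-z0-9]+", text.lower())

def chunks(text: str, size: int) -> list[list[str]]:
    """Split text into consecutive, non-overlapping chunks of `size` words."""
    w = words(text)
    return [w[i:i + size] for i in range(0, len(w), size)]

def best_chunk_score(query: str, text: str, size: int = 8) -> float:
    """Score of the best single chunk: share of query words it contains."""
    q = set(words(query))
    return max(len(q & set(chunk)) / len(q) for chunk in chunks(text, size))

QUERY = "refund window annual plans"

# Same facts, two layouts (both texts are hypothetical).
ANSWER_FIRST = (
    "The refund window for annual plans is thirty days. "
    "Contact support to start a request and include your invoice number."
)
SPREAD_OUT = (
    "We offer several plans and customers often ask about them. "
    "Billing can be monthly or annual depending on your needs. "
    "If you change your mind there is a window of thirty days. "
    "During that time a refund is available."
)

print(best_chunk_score(QUERY, ANSWER_FIRST))
print(best_chunk_score(QUERY, SPREAD_OUT))
```

In the answer-first layout one chunk contains every query word. In the spread-out layout each chunk holds at most one of the four, so no chunk looks like an answer even though the page as a whole contains it.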

Stage 5: grounding and re-ranking

Candidate passages get re-scored before synthesis: relevance to the sub-query, but also confidence in the source. This is where entity resolution enters the pipeline. The system is checking whether the claim's source is a coherent, corroborated entity or a fragmented one. I've written about that check in detail in Entity Authority Engineering: How AI Decides Who to Cite; in pipeline terms, that whole discipline lives at stage 5.
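To make the idea of re-scoring concrete, here is a deliberately simple sketch. Nothing in it reflects any engine's real formula: the `corroborating_sources` field, the cap of 3, and the 50/50 weighting are all assumptions chosen so the arithmetic is easy to follow. It shows only that a slightly less relevant passage from a corroborated source can overtake a more relevant passage from an uncorroborated one.

```python
from dataclasses import dataclass

@dataclass
class Passage:
    source: str
    relevance: float            # stage 4 score in [0, 1]
    corroborating_sources: int  # hypothetical signal: independent sources that agree

def source_confidence(passage: Passage, cap: int = 3) -> float:
    """Map corroboration count to [0, 1]; the cap of 3 is an arbitrary assumption."""
    return min(passage.corroborating_sources, cap) / cap

def rerank(passages: list[Passage], weight: float = 0.5) -> list[Passage]:
    """Order passages by a blend of relevance and source confidence."""
    def blended(p: Passage) -> float:
        return (1 - weight) * p.relevance + weight * source_confidence(p)
    return sorted(passages, key=blended, reverse=True)

candidates = [
    Passage("uncorroborated.example", relevance=0.9, corroborating_sources=0),
    Passage("corroborated.example", relevance=0.8, corroborating_sources=3),
]

for p in rerank(candidates):
    print(p.source)
```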

When no retrieved passage grounds a claim the model wants to make, the model makes the claim anyway, from training data, and sometimes gets it wrong. That failure has a cost I've named elsewhere: the Hallucination Tax. Stage 5 is where you either pay it or don't.

Stage 6: synthesis and citation attachment

The model writes the answer and attaches sources claim by claim. The unit of citation is the individual sentence the model just wrote, paired with the source that backs it.

This is the core difference between the two machines. Google's ranking answers “which pages are most relevant to this query.” Citation attachment answers “which source backs this specific sentence.” A topically relevant page with no extractable claims gets read and discarded. A page that owns one precise, verifiable fact gets cited every time that fact is needed.

[Screenshot: ChatGPT answer for a Makati barangay zonal value query with five separate REN.PH citation chips, one attached to each individual claim.] Captured 4 June 2026, logged-out session. Stage 6 made visible: each chip is one claim grounded to the source that holds its fact, five separate attachments in one answer.

It also explains citation stacking. When one domain owns the answers to several sub-queries from the same fan-out, it gets attached to several claims in the same answer.
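Stacking falls out of per-claim attachment almost by definition, as this sketch shows. The candidate lists, scores, and `.example` domains are all hypothetical stand-ins for whatever stages 4 and 5 hand to synthesis. Each claim takes its best-scoring passage, and the per-domain count is just a tally afterwards.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical output of stages 4 and 5: for each claim, candidate
# passages as (url, score) pairs.
CANDIDATES = {
    "city-wide value range": [
        ("https://data.example/city", 0.92),
        ("https://blog.example/overview", 0.61),
    ],
    "barangay A value": [
        ("https://data.example/barangay-a", 0.95),
        ("https://blog.example/overview", 0.40),
    ],
    "barangay B value": [
        ("https://data.example/barangay-b", 0.93),
    ],
    "office tower pricing": [
        ("https://developer.example/leasing", 0.90),
        ("https://data.example/city", 0.35),
    ],
}

def attach_sources(candidates: dict[str, list[tuple[str, float]]]) -> dict[str, str]:
    """For each claim, attach the URL of its highest-scoring passage."""
    return {
        claim: max(passages, key=lambda pair: pair[1])[0]
        for claim, passages in candidates.items()
    }

def stack_by_domain(attachments: dict[str, str]) -> Counter:
    """Count how many claims each domain ended up attached to."""
    return Counter(urlparse(url).netloc for url in attachments.values())

attachments = attach_sources(CANDIDATES)
print(stack_by_domain(attachments))
```

The domain with one page per narrow question is attached three times, the domain that owns the one remaining fact is attached once, and the overview page that was second-best everywhere is attached to nothing.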

A citation stack, explained mechanically

REN.PH holds 35,700+ barangay-level pages of Philippine property valuation data, each one resolved to an official PSGC code, the government's geographic standard. That's roughly 85% of the country's 42,011 barangays; the gap is mostly abolished or renamed barangays and source-data artifacts that don't survive verification. Each page is one clean answer to one narrow question.

Search Google for “taguig zonal value” and open the AI Overview. In my captures from May and June 2026, the headline claim (residential values from ₱7,000 to ₱523,000 per square meter) is attributed to REN.PH's Taguig city page, and the barangay-level claims stack multiple REN.PH URLs on a single sentence: five in the May capture, three in June (the East Rembo page, the Titulok page, and the city hub, all attached to one claim about transitional areas).

[Screenshot: Google AI Overview for “taguig zonal value” showing REN.PH cited for the headline value range and a stack of three REN.PH pages attached to a single barangay-level claim.] Captured 4 June 2026. The expanded citation popover on the transitional-areas claim: three REN.PH URLs attached to one sentence. One of them, the Titulok page, actually belongs to a Sultan Kudarat barangay: a stage 5 grounding collision on the shared name Bagumbayan.

For months I would have explained that with trust language: the engines like the site. The pipeline gives a better explanation. “Taguig zonal value” fans out into sub-queries about specific barangays, classifications, and rates. Each REN.PH page is a passage-sized answer to exactly one of those fragments. The domain wins stage 4 several times in the same answer, so stage 6 attaches it several times.

That same stack also caught the engine making a mistake, and it's worth being honest about it. One of the three URLs on the transitional-areas claim, the Titulok page, isn't Taguig at all. Titulok is a barangay in the municipality of Bagumbayan, Sultan Kudarat, down in Mindanao. Taguig also has a barangay named Bagumbayan, and the fan-out almost certainly collided on that shared name and grounded the Sultan Kudarat page onto a Taguig claim. REN.PH's own page is correctly resolved to its Sultan Kudarat PSGC code; the error is in Google's grounding, not the data. That's a stage 5 slip happening live, in the same answer the rest of the stack got right, and it's the clearest demonstration I have of why the stage exists. Ambiguous place names are exactly the noise grounding has to resolve, and strong, consistent entity signals are what tip it the right way more often than not.
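The failure mode is easy to reproduce in miniature. This sketch is not a claim about how Google grounds anything; it only illustrates why a bare name is ambiguous and why ancestry resolves it. The two records mirror the places described above, but the `code` values are placeholders, not real PSGC codes, and the `parents` lists are trimmed to what this post states.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Place:
    code: str                 # placeholder id, NOT a real PSGC code
    name: str
    parents: tuple[str, ...]  # enclosing localities, nearest first

PLACES = [
    Place("placeholder-1", "Bagumbayan", ("Taguig",)),
    Place("placeholder-2", "Titulok", ("Bagumbayan", "Sultan Kudarat")),
]

def match_by_name(name: str) -> list[Place]:
    """Naive match: the name is the place's own name or any of its ancestors."""
    return [p for p in PLACES if name == p.name or name in p.parents]

def match_in_context(name: str, context: str) -> list[Place]:
    """Keep only matches whose ancestry includes the locality the query is about."""
    return [p for p in match_by_name(name) if context in p.parents]

print([p.name for p in match_by_name("Bagumbayan")])
print([p.name for p in match_in_context("Bagumbayan", "Taguig")])
```

Matching on the name alone pulls in the Sultan Kudarat barangay because "Bagumbayan" appears in its ancestry. Requiring the query's locality in the ancestry drops it.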

The June capture shows the inverse in the same answer. The claim about BGC tower pricing is attributed to Ayala Land's leasing site, because Ayala Land owns those specific facts: they're Ayala's buildings. One answer, each claim routed to the source that holds its fact. Stage 6 working exactly as described.

The pattern carries across engines. The same day, a ChatGPT query for a single Quezon City barangay (Immaculate Concepcion, Cubao) attributed the barangay-level value, the street-level range, the commercial ceiling, and the catch-all classification to REN.PH in one answer, with four separate citation chips, and reproduced a street-by-street table from the underlying data.

[Screenshot: ChatGPT answer for a Cubao barangay zonal value query with four REN.PH citation chips and a street-level value table.] Captured 4 June 2026. Claim-level attribution on a different engine: each chip marks a claim grounded to a REN.PH page.

You can verify this yourself, with the caveat that AI answers vary between runs. Search “taguig zonal value” on Google and expand the AI Overview sources, or ask ChatGPT, Gemini, or Perplexity for the zonal value of a specific Taguig or Quezon City barangay. The full build is documented in the REN.PH case study.

The Philippines turned out to be a useful lab for this: government data is messy, entity resolution is hard, and almost nothing was structured before. If the pipeline mechanics work there, they're not an artifact of an easy environment.

What each engine does differently (as of mid-2026)

Last verified: June 2026.

Everything above is durable. This section isn't, so it's dated, and I'll update it as the engines change.

ChatGPT search has historically retrieved through Bing's index alongside OpenAI's own crawling (OAI-SearchBot), with OpenAI visibly building out independent retrieval. Perplexity runs its own crawler and index (PerplexityBot) and leans hardest on freshness. Gemini and Google AI Overviews retrieve from Google's index, and AI Overviews draws candidates from systems adjacent to normal search ranking, which makes it the one surface where rankings and citations still correlate strongly. If you only remember one engine-specific fact: ranking well helps you most in AI Overviews and least in ChatGPT, because of where each one gets its candidates.

I don't know how long any of this paragraph stays true. The six stages above will outlast it.

What this changes about how you build

Three of my earlier posts turn out to be descriptions of single stages. The piece on why AI search gives different answers is about variance inside stages 2 and 5: different retrieval configs, different cohorts, different snapshots. The Entity Authority Engineering work is stage 5. The crawler access research is stage 3. The pipeline is the map they all sit on.

The practical move: take the one query that matters most to your business, run it through two engines, and read the sources. Then figure out which stage you're dying in. Not indexed is stage 2. Indexed but blocked is stage 3. Fetched but never extracted is stage 4. Extracted but outranked by corroborated competitors is stage 5. Each failure has a different fix, and most “GEO checklists” sell you all the fixes without telling you which stage is your problem.
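That triage reads naturally as a short decision function. The sketch below just encodes the four checks from the paragraph above in order. The field names are mine, and each one is something you would judge by hand (index lookups, crawler logs, reading the sources an engine shows), not something this code can measure for you.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Observation:
    """Hand-gathered findings for one page and one query (field names are illustrative)."""
    indexed: bool            # the page is in the index the engine uses
    fetched: bool            # AI crawlers can and do fetch it
    passage_extracted: bool  # some passage of it shows up as a source or candidate
    cited: bool              # the final answer attaches it to a claim

def failing_stage(obs: Observation) -> Optional[int]:
    """Return the first pipeline stage that fails, or None if the page is cited."""
    if not obs.indexed:
        return 2  # not indexed
    if not obs.fetched:
        return 3  # indexed but blocked
    if not obs.passage_extracted:
        return 4  # fetched but never extracted
    if not obs.cited:
        return 5  # extracted but outranked by corroborated competitors
    return None

print(failing_stage(Observation(indexed=True, fetched=False, passage_extracted=False, cited=False)))
```

The order matters: a later check is meaningless until the earlier ones pass, which is why the cheaper upstream tests come first.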

Mine was stage 3 for the first stretch after launch. I found out by reading crawler logs, not by reading advice.

FAQ

Does ranking #1 on Google mean AI engines will cite me?

No. Rankings sort pages by query relevance. Citations attach sources to individual claims. A #1 page with no extractable, claim-level answers can be retrieved and still never cited. The correlation is strongest in Google AI Overviews and weakest in ChatGPT.

Why does the same site get cited multiple times in one AI answer?

The engine rewrites one question into several sub-queries. If one domain holds the best passage-level answer to several sub-queries, the synthesis stage attaches it to several claims. One page per narrow question is the structural pattern that produces stacking.

Why is my page indexed but never cited?

Usually stage 3 or stage 4: AI crawlers can't fetch the page (robots.txt, WAF, JavaScript-only content), or the answers exist but no single passage contains them. Check crawler access in your server logs first; it's the cheaper test.

What is query fan-out?

Query fan-out is the technique where an AI engine rewrites one user question into multiple smaller sub-queries and retrieves results for each in parallel, then synthesizes one answer from the combined results. Google has documented the technique by this name for AI Mode; other engines use similar approaches. The practical consequence: your content competes against sub-queries the user never typed.

How do I check if AI crawlers can access my site?

Search your server or CDN logs for the AI crawler user agents: GPTBot, OAI-SearchBot, ClaudeBot, and PerplexityBot. Zero or near-zero fetches from these bots on a content site is a stage 3 failure. Google-Extended is different: it is a robots.txt control token, not a separate crawler, so it never appears in logs (Google fetches with its regular user agents). Check it in robots.txt instead. Then read your live robots.txt, including anything your CDN manages on your behalf, since CDN-managed rules can block AI crawlers without you opting in. A quick manual test: request a page with curl using one of the bot user agent strings and confirm you get the full HTML back, not a challenge page.
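A minimal sketch of the log check, assuming access logs where the user agent string appears somewhere on each line. The sample lines are fabricated, the IPs come from a documentation range, and the user agent strings are shortened placeholders, not the bots' real full strings. Google-Extended is left out because it is a robots.txt token and does not show up as a request user agent.

```python
from collections import Counter

# Substrings to look for in each log line (case-insensitive).
BOT_TOKENS = ["GPTBot", "OAI-SearchBot", "ClaudeBot", "PerplexityBot"]

def count_bot_fetches(log_lines: list[str]) -> Counter:
    """Count log lines mentioning each bot token; bots never seen stay at 0."""
    counts = Counter({bot: 0 for bot in BOT_TOKENS})
    for line in log_lines:
        lowered = line.lower()
        for bot in BOT_TOKENS:
            if bot.lower() in lowered:
                counts[bot] += 1
    return counts

# Fabricated sample lines with simplified user agent strings.
SAMPLE_LOG = [
    '203.0.113.5 - - [04/Jun/2026:10:00:00 +0000] "GET /page-a HTTP/1.1" 200 5120 "-" "GPTBot/1.0"',
    '203.0.113.6 - - [04/Jun/2026:10:00:05 +0000] "GET /page-b HTTP/1.1" 200 4096 "-" "GPTBot/1.0"',
    '203.0.113.7 - - [04/Jun/2026:10:01:00 +0000] "GET /page-a HTTP/1.1" 200 5120 "-" "PerplexityBot/1.0"',
    '203.0.113.8 - - [04/Jun/2026:10:02:00 +0000] "GET /page-a HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
]

counts = count_bot_fetches(SAMPLE_LOG)
for bot in BOT_TOKENS:
    print(f"{bot}: {counts[bot]}")
```

A bot sitting at zero over a meaningful window is the signal to go read robots.txt and WAF rules. A user agent string can be spoofed, so treat nonzero counts as a hint, not proof.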

About the Author

Aaron Zara is the founder of Godmode Digital and the engineer behind REN.PH (60,000+ verified Philippine real estate data nodes). He holds a PRC real estate broker license and has spent 18 years building across digital marketing and business operations.

godmode.ph · REN.PH · github.com/GodModeArch
