
Answer Engine Optimization

Entity-First Information Architecture: Why AI Engines Resolve Entities Before They Rank Pages

A page can carry perfect schema markup and never get cited. Two controlled studies just confirmed it. What moves the needle is entity strength, and that is an architecture problem.

June 5, 2026 · 8 min read

[Figure: Entity resolution as a left-to-right pipeline. A scattered cluster of dim, duplicate, unresolved entities on the left resolves into a single bright canonical entity node on the right.]

Entity-first information architecture organizes a site around resolvable real-world entities: people, places, organizations, datasets. One canonical URL per entity, each grounded against an authoritative identifier. AI engines retrieve by resolving entities first, then selecting passages. Sites with ambiguous or fragmented entities lose citations regardless of content quality, and regardless of how much schema markup they carry.

That last clause used to be my opinion. As of last month, it has third-party data behind it.

Two studies changed what we can claim

On May 11, 2026, Ahrefs published a study by Louise Linehan and Xibeijia Guan titled “We Tracked 1,885 Pages Adding Schema. AI Citations Barely Moved.” The setup matters. Their first pass looked at 6 million URLs and found that AI-cited pages were almost three times more likely to carry JSON-LD than non-cited pages. That correlation is the statistic the schema-for-AI pitch has been built on. So they ran a second test designed to isolate causation: 1,885 pages that added JSON-LD between August 2025 and March 2026 were matched against 4,000 control pages with similar pre-treatment citation levels, then compared with a difference-in-differences analysis of the 30 days before and the 30 days after the markup went live.

The result: +2.4 percent in Google AI Mode and +2.2 percent in ChatGPT, both indistinguishable from random noise. AI Overviews citations on the treated pages declined 4.6 percent, small but statistically significant, and Ahrefs itself declines to confidently attribute the decline to schema. The correlation was real. The causation was not there.
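For readers who have not met difference-in-differences before, here is a minimal sketch of the arithmetic. The function name and every number in it are made up for illustration; none of them come from the Ahrefs dataset, and the real study would also involve matching and significance testing that this leaves out.

```python
def diff_in_diff(treated_before, treated_after, control_before, control_after):
    """Change in the treated group minus change in the control group."""
    treated_change = treated_after - treated_before
    control_change = control_after - control_before
    return treated_change - control_change


# Hypothetical mean citations per page. NOT figures from the study.
estimate = diff_in_diff(
    treated_before=120.0, treated_after=126.0,   # pages that added JSON-LD
    control_before=118.0, control_after=123.0,   # matched pages that did not
)
print(estimate)  # 1.0
```

The point of subtracting the control group's change is that citations drift for everyone over a 60-day window. Only the part of the treated group's change that the control group did not also experience can be credited to the markup.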

This was not a one-off. Search Atlas published an analysis comparing citation rates across domains with full, partial, and zero schema coverage. Flat across the board, on OpenAI, Gemini, and Perplexity. Domains with complete coverage performed no better than domains with none.

Two caveats, and I want them in the body of this article, not buried.

First, the Ahrefs dataset only included pages that were already being cited heavily. Every page in it had 100+ AI Overview citations before any schema was added. The study cannot tell us whether markup helps an uncited page get crawled, parsed, and indexed in the first place. The authors say this themselves.

Second, the platforms complicate the picture, in both directions. Bing's Fabrice Canel said on stage at SMX Munich in March 2025 that schema markup helps Microsoft's LLMs understand content for Copilot. That is the only first-party, on-the-record confirmation from a major platform. Google now points the other way: its official AI features guide states you do not need new machine-readable files or markup to appear in generative AI search. Understanding content and citing it are different operations, and the one platform that confirmed using schema confirmed it for the first, not the second.

So the defensible read is narrow: schema markup, added to content that is already being cited, does not move citation frequency. The Ahrefs team's follow-up commentary points at what does correlate with citations: strong organic performance, entity strength in the Knowledge Graph and across platforms, and information gain, meaning the page contains something the engine cannot get elsewhere.

Entity strength. That is the part this article is about.

Resolution happens before ranking

Here is the mechanical claim, and it is the reason keyword-first sites underperform in AI answers even when their content is good.

When a query hits a generative engine, the engine does not start by ranking pages. It starts by resolving the entities in the query. “Zonal value of Barangay San Antonio, Quezon City” contains three entities: a concept (zonal value), a barangay (San Antonio), and a city (Quezon City). Before any passage gets selected, the system has to decide which San Antonio. The Philippines has dozens of barangays with that exact name. If your site mentions San Antonio in fifteen blog posts with no canonical page, no parent-place context, and no stable identifier, the engine cannot ground the reference. It will pull from a source that can be grounded, even a worse-written one.
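A toy version of that grounding decision, as a sketch only. The `Place` record, the `REGISTRY` list, and the `placeholder-...` identifiers are all hypothetical; they are not real PSGC codes, and no engine exposes its resolver like this. The sketch just shows why a bare name fails and a name with parent context succeeds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Place:
    entity_id: str  # placeholder identifier, not a real PSGC code
    name: str
    parent: str


# Illustrative registry: one name, several distinct places.
REGISTRY = [
    Place("placeholder-001", "San Antonio", "Quezon City"),
    Place("placeholder-002", "San Antonio", "Makati"),
    Place("placeholder-003", "San Antonio", "Pasig"),
]


def resolve(name, parent=None):
    """Return the single matching entity, or None if ambiguous or unknown."""
    matches = [p for p in REGISTRY if p.name.lower() == name.lower()]
    if parent is not None:
        matches = [p for p in matches if p.parent.lower() == parent.lower()]
    return matches[0] if len(matches) == 1 else None


print(resolve("San Antonio"))                 # None: three candidates
print(resolve("San Antonio", "Quezon City"))  # the Quezon City record
```

A page that only ever says "San Antonio" hands the resolver the first case. A canonical entity page with parent-place context and an identifier hands it the second.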

I covered the retrieval side of this in Retrieval, Not Rankings: engines answering from live retrieval pull a candidate set, then extract. Entity resolution is the gate in front of that candidate set. Keyword matching gets you considered. Entity resolution gets you trusted as the referent.

This is also why the Philippines turned out to be a useful place to test the idea. The country has 42,011 barangays, heavy name collisions, multiple renames per decade, and government records published as free text with no stable public identifiers. If entity resolution works here, it works in easier markets.

Entity Debt

A term from the Entity Authority Engineering framework:

Entity Debt is the accumulated inference gap between what an entity actually is and what AI models currently believe about it. An entity with high Entity Debt is either unknown to the models, misrepresented by them, or filled in with hallucinated details. It compounds over time as competitors build their own entity profiles and the gap widens.

The architecture lens on that gap is what this article is about. In information-architecture terms, Entity Debt is the accumulated cost of publishing content about entities your site has never canonically defined, disambiguated, or grounded against an authoritative identifier. Every new page that references an ambiguous entity raises the cost of resolving it later.

This post and Entity Authority Engineering: How AI Decides Who to Cite cover two different lanes of the same problem. That post is about the confidence check: how an engine evaluates whether a source's signals are consistent enough to trust. This one is about information architecture: how you structure the site so the check has a single resolvable referent to land on in the first place.

You can spot it without tooling. The same entity named three different ways across the site. Entity information smeared across blog posts instead of living at one stable URL. No grounding against any external authority. And the symptom that usually gets a founder's attention: AI engines describing your own entities wrong, or citing a competitor for queries about data you published first.

Integrity Gap

The companion term. The full treatment is in The Integrity Gap whitepaper; the working definition:

Integrity Gap is the measurable distance between what AI engines say about an entity and what the authoritative record says. It is measured by running a fixed query set against each engine on a schedule and scoring the answers against ground truth.

In brief: this is the diagnostic metric. Entity Debt is the disease, Integrity Gap is the blood test. The measurement method, including query set construction and per-engine citation tracking, is the subject of a companion piece on measuring AI visibility. If you only take one action from this article, it should be running that test on your three most ambiguous entities before changing anything.
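One way the scoring step could look, as a sketch under loud assumptions. The definition above does not fix a scoring rubric, so this uses the crudest one available: the share of queries whose answer does not exactly match the record. The function name, the queries, the engine labels, and the values are all hypothetical, and real answers would need human or fuzzy grading rather than string equality.

```python
def integrity_gap(answers, ground_truth):
    """Share of queries whose answer disagrees with the record (0.0 = no gap)."""
    if not ground_truth:
        raise ValueError("ground_truth must not be empty")
    wrong = sum(
        1
        for query, truth in ground_truth.items()
        # A missing answer counts as wrong.
        if answers.get(query, "").strip().lower() != truth.strip().lower()
    )
    return wrong / len(ground_truth)


# Hypothetical fixed query set and authoritative values.
GROUND_TRUTH = {
    "entity A: parent city": "Quezon City",
    "entity A: identifier": "placeholder-001",
    "entity B: parent city": "Makati",
    "entity B: identifier": "placeholder-002",
}

# Hypothetical answers recorded from two engines on one run.
RUN = {
    "engine_a": {**GROUND_TRUTH, "entity B: parent city": "Pasig"},
    "engine_b": {
        "entity A: parent city": "quezon city",
        "entity A: identifier": "placeholder-001",
    },
}

for engine, answers in RUN.items():
    print(engine, integrity_gap(answers, GROUND_TRUTH))
# engine_a 0.25
# engine_b 0.5
```

Keeping the query set fixed is what makes the number comparable from one scheduled run to the next.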

What paying it down looked like at 37,000-barangay scale

REN.PH is a Philippine real estate data platform I run: 60,000+ pages, 234,337 zonal value rows, 1,626 cities and municipalities, as of June 2026. The raw input was BIR zonal value records. Anyone who has worked with that data knows the state of it: barangay names as free text, subdivisions renamed between revisions, phantom rows, inconsistencies across district offices for the same physical place.

The grounding work was entity resolution against the PSGC, the Philippine Standard Geographic Code, which assigns every region, province, city, municipality, and barangay a canonical identifier. Current production state: 37,660 barangay rows, 35,743 of them PSGC-grounded. 94.9 percent. The remaining 1,917 are residuals, mostly renamed or dissolved units that need manual review.
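A heavily simplified sketch of what grounding free-text rows against a reference table involves, followed by the coverage arithmetic on the figures above. This is not the REN.PH pipeline. The `normalize` rules, the `REFERENCE` table, the `placeholder-...` identifiers, and the sample rows are assumptions made up for illustration.

```python
import re


def normalize(name):
    """Lowercase, drop a leading 'Barangay'/'Brgy.' prefix, collapse punctuation.

    Limitation: non-ASCII letters such as 'ñ' are treated as separators here.
    """
    name = name.strip().lower()
    name = re.sub(r"^(barangay|brgy\.?)\s+", "", name)
    return re.sub(r"[^a-z0-9]+", " ", name).strip()


# Hypothetical reference: (normalized name, normalized parent) -> identifier.
REFERENCE = {
    ("san antonio", "quezon city"): "placeholder-001",
    ("san antonio", "makati"): "placeholder-002",
}


def ground(rows):
    """Split raw (name, parent) rows into grounded matches and residuals."""
    grounded, residuals = {}, []
    for raw_name, raw_parent in rows:
        key = (normalize(raw_name), normalize(raw_parent))
        if key in REFERENCE:
            grounded[(raw_name, raw_parent)] = REFERENCE[key]
        else:
            residuals.append((raw_name, raw_parent))
    return grounded, residuals


rows = [
    ("Brgy. San Antonio", "Quezon City"),
    ("BARANGAY San-Antonio", "Makati"),
    ("Sitio Bagong Pangalan", "Pasig"),  # made-up row with no reference match
]
grounded, residuals = ground(rows)
print(len(grounded), len(residuals))  # 2 1

# Coverage arithmetic using the production figures quoted above.
total_rows, grounded_rows = 37_660, 35_743
print(total_rows - grounded_rows)            # 1917
print(f"{grounded_rows / total_rows:.1%}")   # 94.9%
```

Exact matching after normalization only catches spelling and prefix noise. Renamed or dissolved units never match, which is why they end up in the residual pile for manual review.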

Every grounded barangay gets one canonical URL, its PSGC code, its parent hierarchy, and its zonal value data in one place. The naming is consistent sitewide. Name variants live on the entity page, not scattered through content.

The observable outcome: Gemini answers barangay-level zonal value queries with citation stacks pointing at REN.PH pages, frequently multiple REN.PH citations in a single answer. The queries are in the verification block below so you can check the current behavior yourself rather than take a screenshot's word for it. The full build is documented in the REN.PH case study.

One thing I cannot claim cleanly, so I will say it plainly. REN.PH also carries Dataset schema across those pages, implemented at the same time as the grounding work. On a single property, I cannot fully separate the two variables. What I can say is that the Ahrefs and Search Atlas results make the markup a weak explanation for the citations, and the grounding a strong one. The entity resolution is also the part that was hard. The JSON-LD took days. The PSGC grounding took months.

The entity-first pattern

What this looks like as an implementation checklist, in priority order:

  1. One canonical URL per entity, stable permanently. The entity page is the unit of architecture, not the article.
  2. Ground every entity against an external authority where one exists. PSGC for Philippine places. PRC license numbers for professionals. SEC registrations for companies. ISBNs, ticker symbols, whatever your vertical offers. The identifier is what lets an engine confirm your San Antonio is the right San Antonio.
  3. One canonical name per entity, used everywhere. Variants get listed once, on the entity page.
  4. Connect the graph. @id-linked Organization, Person, and Dataset nodes, with sameAs pointing at off-site profiles. Note what this is for after the May studies: it is not a citation lever, it is connective tissue. It declares which entities you mean. The declaring is the value.
  5. Disambiguation pages for collision-heavy names, the same way Wikipedia handles them.
  6. Reinforce off-site. The same entity, same canonical name, same facts on GitHub, LinkedIn, third-party mentions. The Ahrefs follow-up work points at off-site mentions as one of the stronger citation predictors, and that matches what I see in client audits.
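Item 4 in the checklist above is the easiest one to get subtly wrong, so here is a minimal sketch of an @id-linked graph plus a check that every reference lands on a declared node. The domain, names, and profile URLs are placeholders on example.com, not REN.PH's actual markup, and `dangling_refs` is a hypothetical helper, not part of any library.

```python
import json

BASE = "https://example.com"  # hypothetical domain

graph = {
    "@context": "https://schema.org",
    "@graph": [
        {
            "@type": "Organization",
            "@id": f"{BASE}/#org",
            "name": "Example Data Co",
            "founder": {"@id": f"{BASE}/#founder"},
            "sameAs": ["https://github.com/example-org"],
        },
        {
            "@type": "Person",
            "@id": f"{BASE}/#founder",
            "name": "Example Founder",
            "worksFor": {"@id": f"{BASE}/#org"},
        },
        {
            "@type": "Dataset",
            "@id": f"{BASE}/datasets/zonal-values#dataset",
            "name": "Example zonal value dataset",
            "description": "Illustrative dataset node for this sketch.",
            "creator": {"@id": f"{BASE}/#org"},
        },
    ],
}


def dangling_refs(doc):
    """Return @id references that no node in the graph declares."""
    defined = {node["@id"] for node in doc["@graph"]}
    referenced = set()

    def walk(value):
        if isinstance(value, dict):
            if set(value) == {"@id"}:  # a pure reference, not a full node
                referenced.add(value["@id"])
            for child in value.values():
                walk(child)
        elif isinstance(value, list):
            for item in value:
                walk(item)

    walk(doc["@graph"])
    return referenced - defined


print(dangling_refs(graph))  # set()
print(json.dumps(graph, indent=2)[:60])
```

An empty set means every founder, worksFor, and creator pointer resolves to one declared node, which is the "one resolvable referent" idea expressed in markup.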

Keyword-first vs entity-first

| | Keyword-first IA | Entity-first IA |
| --- | --- | --- |
| Organizing unit | The target keyword | The real-world entity |
| URL strategy | One page per query variant | One canonical page per entity |
| Success metric | Rankings per keyword | Citations and answer accuracy per entity |
| Typical failure | Cannibalization between near-duplicate pages | Residual entities that resist grounding |
| What the engine sees | Many pages competing to match a string | One resolvable referent with an identifier |

Verify this yourself

Run these against Gemini and ChatGPT with browsing on:

  1. What is the zonal value of properties in Western Bicutan, Taguig City?
  2. What is the zonal value of properties in Dagatan, Lipa City?
  3. What is the zonal value of properties in Loyola Heights, Quezon City?

Look at the citation stack, not just the answer. Then run the same pattern on your own site: pick your most ambiguous entity, ask three engines about it, score the answers against your authoritative record. That score is your Integrity Gap. If you want the full methodology, start with how to measure AI visibility. If you want it done for you, that is the work.

Where this article fits

This piece is itself an entity deposit. Entity Debt and Integrity Gap are defined above, marked up as DefinedTerms, attributed to one author entity with a stable identifier graph. Whether engines adopt the terms is not something I will guess at. It is a line item in the same monthly citation report I run for clients, and the next data point lands in July.

FAQ

What is entity-first information architecture?

Organizing a site around resolvable real-world entities instead of keyword-targeted pages: one canonical URL per entity, grounded against an authoritative identifier, with consistent naming sitewide.

Does schema markup improve AI citations?

The available evidence says no. Ahrefs' May 2026 controlled study of 1,885 pages found no citation lift after adding JSON-LD (+2.4% AI Mode, +2.2% ChatGPT, both noise), and Search Atlas found flat citation rates across schema coverage levels. Markup still has a role in declaring entity relationships and supporting parsing. It is not a citation lever.

What is Entity Debt?

The accumulated inference gap between what an entity actually is and what AI models currently believe about it. In information-architecture terms, it is the cost of publishing content about entities a site has never canonically defined, disambiguated, or grounded. It compounds with every new page that references an ambiguous entity.

How do you measure AI visibility for an entity?

Run a fixed query set against each engine on a schedule, record which sources get cited, and score answer accuracy against the authoritative record. The distance between the engine's answer and the record is the Integrity Gap.


About the Author

Aaron Zara is the founder of Godmode Digital and the engineer behind REN.PH (60,000+ verified Philippine real estate data nodes). He holds a PRC real estate broker license and has 18 years of experience building across digital marketing and business operations.

godmode.ph · REN.PH · github.com/GodModeArch

Measure the Integrity Gap on Your Worst Entities

30 minutes. We'll take your three most ambiguous entities, ask Gemini, ChatGPT, and Perplexity about them, read the citation stacks, and score the answers against your real record. You leave knowing exactly where your Entity Debt is.