500M AI searches: the signals that drive citations
A 500-million-search dataset shows AI citations concentrate around a small set of trusted sources. Most enterprise content is structurally invisible to the models.
Key takeaways
- Traditional SEO authority is necessary but not sufficient for AI citations.
- Citations cluster around sources LLMs already trust: Reuters, Bloomberg, Wikipedia, regulators, Reddit.
- PDF-first publishing is the single biggest citation killer for multilaterals and policy institutions.
- PR and analyst relations are now AEO infrastructure, not adjacent functions.
- Citation share is becoming a proxy for thought-leadership share in any given category.
What happened
Per Search Engine Journal, an analysis spanning 500 million AI searches points to a small set of signals that consistently determine whether a brand gets cited inside ChatGPT, Perplexity, Gemini, and Google's AI Overviews. The headline finding: traditional SEO authority still matters, but it is not sufficient. Citation-worthy content is structurally different from content that ranks.
The piece, published this week, argues that the brands winning AI citations are doing three things at once: matching prompt intent at the passage level, earning third-party validation that LLMs already trust, and publishing in formats the models can extract cleanly. Volume of content is not on the list.
That reframing matters. For two years, marketing teams have been told that "great SEO is great AEO." The 500-million-search dataset suggests that is a polite oversimplification.
Why it matters for your brand
If you run content for a bank, an insurer, a UN agency, or an industrial group, the practical implication is that your existing content library is probably misaligned with how LLMs select sources. Most enterprise content is written to brand guidelines, not to prompts. It opens with mission language, buries the answer in paragraph four, and uses internal terminology no buyer types into ChatGPT. Models skip it.
The fix is not more content. It is restructuring what already exists so that the answer to a likely prompt sits in the first 60 words of a page, with the entity (your firm, your fund, your standard, your framework) named explicitly nearby. For a CGAP or a UNDRR, this means the policy briefs and technical notes need a top-of-page synthesis written in the language a finance ministry official or a resilience officer would actually use when querying an LLM. The PDF-first habit is the single biggest citation killer in the multilateral sector. LLMs cite HTML pages with clean structure. They under-cite PDFs, even authoritative ones.
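For teams that want to pressure-test this, a minimal audit sketch follows, in Python using the `requests` and `beautifulsoup4` packages. It is an illustration of the heuristic described above, not part of the SEJ analysis: the 60-word window is the article's rule of thumb, and the URL and entity name in the usage comment are placeholders.

```python
# A minimal audit sketch, not from the SEJ analysis: given a page URL and an
# entity name, check whether the entity appears in the first 60 words of the
# main content. Requires the third-party packages `requests` and `beautifulsoup4`.
import requests
from bs4 import BeautifulSoup

FIRST_WORDS = 60  # the "answer window" heuristic described above

def entity_in_answer_window(url: str, entity: str) -> bool:
    """Return True if `entity` is named in the first 60 words of the page body."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")

    # Prefer the <main> element if the page has one; fall back to <body>.
    root = soup.find("main") or soup.body or soup
    words = root.get_text(separator=" ", strip=True).split()

    window = " ".join(words[:FIRST_WORDS]).lower()
    return entity.lower() in window

# Hypothetical usage: audit a policy brief's landing page.
# print(entity_in_answer_window("https://example.org/brief", "CGAP"))
```

A page that fails this check is not necessarily invisible, but it is asking the model to do extraction work it will usually skip.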
For financial services, the second signal (third-party validation the model already trusts) is where the gap is widest. Citations cluster around sources the model has been trained to weight: Reuters, the FT, Bloomberg, the BIS, Moody's, the regulators themselves, and increasingly Reddit and Wikipedia for retail-facing queries. A bank's owned content rarely gets cited in a query like "best private banks for ultra-high-net-worth clients in Singapore" because the model defaults to third-party rankings and forum threads. The implication: PR and analyst relations are now AEO infrastructure. Getting named in a Reuters explainer or a Wikipedia entry on your category does more for citation share than a year of blog output.
For major industrial groups, including a Holcim or a comparable building materials peer, the citation battleground is technical authority on sustainability, materials science, and standards. The companies winning citations on "low-carbon cement" or "EPD-compliant concrete" are the ones whose technical white papers are indexed in HTML, cross-referenced by trade press, and aligned with ISO and IEEE language. If your engineering teams publish to a gated portal, the model cannot see you, and your competitor with an open technical library wins the citation by default.
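Both failure modes named here, gated access and PDF-only publishing, are checkable from outside the firewall. Below is a rough sketch, again an illustration rather than anything from the study, that fetches a URL with no credentials and flags the two problems; the example URL is hypothetical, and the login-redirect test is a simple heuristic.

```python
# A rough visibility check, assumed rather than taken from the study: fetch a
# technical page anonymously (no cookies, no auth) and flag the two failure
# modes discussed above, gated access and PDF-only publishing.
import requests

def crawler_visible(url: str) -> tuple[bool, str]:
    """Return (visible, reason) for an unauthenticated fetch of `url`."""
    resp = requests.get(url, timeout=10, allow_redirects=True)

    if resp.status_code in (401, 403):
        return False, f"gated: HTTP {resp.status_code} without credentials"
    if "login" in resp.url.lower():  # crude heuristic for a login redirect
        return False, f"gated: redirected to {resp.url}"

    content_type = resp.headers.get("Content-Type", "")
    if "pdf" in content_type.lower():
        return False, "PDF-only: models under-cite this format"
    if "text/html" not in content_type.lower():
        return False, f"unexpected content type: {content_type}"
    return True, "open HTML"

# Hypothetical usage against a white-paper URL:
# print(crawler_visible("https://example.com/low-carbon-cement-whitepaper"))
```

Running a check like this across a technical library is a fast way to find the pages a competitor's open equivalent is beating by default.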
For philanthropic and policy institutions, the risk is subtler. LLMs are increasingly the first stop for journalists, programme officers, and grantees doing background research. If a foundation's framework is not cited in the LLM's answer to "what are the leading approaches to financial inclusion," that framework loses agenda-setting power, regardless of how many times it has been downloaded. Citation share is becoming a proxy for thought-leadership share.
The signal in context
The 500-million-search analysis aligns with what a growing number of independent studies have found over the past 18 months: AI citations concentrate. A small number of sources capture a disproportionate share of references on any given topic, and the composition of that small group is determined more by training-data trust and structural clarity than by recent publishing volume. This is the opposite of the long-tail logic that shaped enterprise content strategy from 2015 to 2022.
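To see what a disproportionate share looks like in numbers, here is a toy calculation with fabricated counts (no real data): tally the domains cited across a batch of category prompts and measure the share held by the top five sources.

```python
# An illustrative back-of-envelope calculation with made-up data: how
# concentrated are citations for a category? Count which domains a set of
# AI answers cited, then report the share captured by the top 5 sources.
from collections import Counter

# Hypothetical sample: domains cited across repeated prompts in one category.
cited_domains = (
    ["reuters.com"] * 14 + ["wikipedia.org"] * 11 + ["bloomberg.com"] * 9
    + ["bis.org"] * 6 + ["reddit.com"] * 5 + ["ft.com"] * 3
    + ["bank-a.com", "bank-b.com"]  # owned content barely appears
)

counts = Counter(cited_domains)
top5 = counts.most_common(5)
top5_share = sum(n for _, n in top5) / len(cited_domains)

print(f"Top 5 sources capture {top5_share:.0%} of citations")
# With this invented sample, prints: Top 5 sources capture 90% of citations
```

Repeating a measurement like this for your own category, with real prompts and real answers, is the cheapest way to find out whether you are inside or outside the cited set.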
The strategic consequence for senior marketers is that the unit of competition has shifted. It is no longer "do we rank for this keyword." It is "are we one of the three to five sources a model will name when asked about our category." That is a much smaller, harder-won list, and it is built through a combination of editorial-grade owned content, earned coverage in publications the model trusts, and structural discipline on the pages themselves. Brands that treat AEO as a tactical SEO update will keep losing citation share to competitors who treat it as a repositioning exercise.