Technical Deep-Dive

How ChatGPT picks sources — and how to be one of them

ChatGPT cites sources via two completely different paths: training data and live browsing. Here is what governs each, and what you can actually influence.

by Robert Langner · Published: 2026-04-01 · 8 min read

Tags: ChatGPT, Citations, Technical

ChatGPT looks like a single product but contains at least two different retrieval mechanisms. Understanding which one is operating in any given answer is the difference between gaming the system and actually showing up.

Path 1: pure language-model recall

When ChatGPT answers without invoking the browse tool, it is responding from training data — the snapshot of the public web that OpenAI used to train the underlying model. This snapshot has a hard cutoff (today, somewhere around late 2024 or early 2025 depending on the model variant). Brands and facts that came online after the cutoff cannot appear via this path.

What governs visibility here is entity prevalence: how often your brand co-occurs with the topic across the training corpus. The corpus is dominated by Wikipedia, news outlets, structured directories, GitHub, Stack Overflow, official documentation and a long tail of indexed web content. A brand that has a Wikipedia stub, a Crunchbase entry, three positive industry-list inclusions and a coherent owned site will generally be recalled ahead of a brand with 1,000 obscure backlinks.
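Nobody outside OpenAI can measure co-occurrence over the real training corpus, but the idea can be illustrated on a small collection of pages you gather yourself. The following is a toy sketch under that assumption; the function name, the sample documents and the brand "Acme Analytics" are hypothetical.

```python
import re


def cooccurrence_rate(documents, brand, topic):
    """Share of documents mentioning `topic` that also mention `brand`.

    A toy proxy for entity prevalence, not a measure of any real corpus.
    """
    brand_re = re.compile(r"\b" + re.escape(brand) + r"\b", re.IGNORECASE)
    topic_re = re.compile(r"\b" + re.escape(topic) + r"\b", re.IGNORECASE)

    # Only documents about the topic are relevant to the ratio.
    topic_docs = [doc for doc in documents if topic_re.search(doc)]
    if not topic_docs:
        return 0.0

    both = sum(1 for doc in topic_docs if brand_re.search(doc))
    return both / len(topic_docs)


# Hypothetical sample documents standing in for third-party pages.
corpus = [
    "Acme Analytics is a product analytics tool for SaaS teams.",
    "Top product analytics tools: Acme Analytics, Foo, Bar.",
    "Bar raises funding for its product analytics platform.",
    "Acme Analytics opens a new office.",
]

rate = cooccurrence_rate(corpus, "Acme Analytics", "product analytics")
print(f"{rate:.2f}")
```

Three of the four sample documents mention the topic, and two of those also mention the brand. The fourth document mentions the brand without the topic and so does not count, which is the point of the "same factual context" framing.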

Path 2: browsing-grounded retrieval

When ChatGPT decides to invoke browsing (it does this for queries with explicit time markers such as "latest", "2026" or "today", or for fact-checks where it is uncertain), the model issues real-time queries to a search backend, scrapes results, and feeds the snippets back into its answer generation. This is similar to Perplexity's mode but invoked selectively.

Two things govern visibility here: (a) whether your URL ranks in the underlying search backend for the query OpenAI synthesised, and (b) whether the page is parseable on first fetch — no JS-only rendering, no auth wall, no excessive lazy-loading. ChatGPT's browsing tool is impatient.
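One way to approximate "parseable on first fetch" is to look only at the raw HTML, without executing any JavaScript, and check whether a headline and a reasonable amount of text are already there. The sketch below does that with the standard library; the class and function names, the sample pages and the 50-word threshold are assumptions for illustration, not anything OpenAI documents.

```python
from html.parser import HTMLParser


class FirstFetchAudit(HTMLParser):
    """Collects H1 text and visible text from raw, un-rendered HTML."""

    # Content inside these tags is not readable text for a fetcher.
    SKIP = {"script", "style", "template"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._in_h1 = False
        self.h1_parts = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "h1":
            self._in_h1 = True

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        if self._skip_depth:
            return
        text = data.strip()
        if not text:
            return
        self.text_parts.append(text)
        if self._in_h1:
            self.h1_parts.append(text)


def audit_initial_html(html, min_words=50):
    """Report what a non-JS fetcher would see in the initial response."""
    parser = FirstFetchAudit()
    parser.feed(html)
    parser.close()
    words = sum(len(part.split()) for part in parser.text_parts)
    h1 = " ".join(parser.h1_parts)
    return {
        "h1": h1,
        "visible_words": words,
        # Heuristic only: an H1 plus some body text in the raw HTML.
        "looks_server_rendered": bool(h1) and words >= min_words,
    }


# Hypothetical pages: one server-rendered, one empty JS shell.
ssr_page = (
    "<html><body><h1>Acme Analytics pricing</h1><p>"
    + "word " * 60
    + "</p></body></html>"
)
spa_shell = (
    '<html><body><div id="root"></div>'
    '<script>window.__APP__ = {"title": "Acme"};</script></body></html>'
)

print(audit_initial_html(ssr_page))
print(audit_initial_html(spa_shell))
```

In practice you would feed this the body of a plain HTTP GET of your own page. The check says nothing about auth walls or lazy-loading, which need to be tested separately.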

What you can actually do

For training-data path

Get into Wikipedia if at all possible. A short, neutral, well-cited stub is enough. This is the single highest-leverage move for AI visibility.
Get into structured directories that are likely included in training data: Crunchbase, G2, Capterra, Product Hunt, Stack Overflow tags, and GitHub topics if you build developer tooling.
Generate consistent third-party mentions in news, podcasts and industry reports. Volume matters less than consistency: the same canonical name in the same factual context across sources.
Maintain canonical brand documentation at a stable URL (/about, /company, /press). This page is what gets indexed and re-cited.

For browsing-grounded path

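Allow OpenAI's crawlers in robots.txt. OpenAI uses GPTBot for training and OAI-SearchBot for the browse tool; page fetches triggered directly by a user identify as ChatGPT-User. OAI-SearchBot is the one that matters for this path, GPTBot for the training-data path. Allow both.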
Render content server-side — at minimum, render the H1, lead paragraph, key facts and FAQ in initial HTML.
Add `Article`, `FAQPage` and `Organization` schema to the highest-traffic pages. Browsing-grounded retrieval favors pages with unambiguous structure.
Keep `lastModified` dates fresh and visible (for example, `dateModified` in schema markup and `<lastmod>` in the sitemap). Freshness is a strong signal for the browse tool.
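The robots.txt item in the list above is easy to verify offline. The sketch below uses Python's standard `urllib.robotparser` to check which OpenAI user agents a given robots.txt would admit. The sample rules and URL are hypothetical, and the agent list (GPTBot, OAI-SearchBot, ChatGPT-User) should be checked against OpenAI's current crawler documentation; the standard-library parser's matching may also differ in detail from how a real crawler interprets the file.

```python
from urllib.robotparser import RobotFileParser

# User agents as named in OpenAI's crawler documentation (verify before use).
OPENAI_AGENTS = ("GPTBot", "OAI-SearchBot", "ChatGPT-User")


def openai_access(robots_txt, url):
    """Return {agent: allowed} for each OpenAI agent against robots_txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in OPENAI_AGENTS}


# Hypothetical robots.txt that blocks training but allows search.
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: *
Disallow: /admin/
"""

access = openai_access(robots_txt, "https://example.com/about")
for agent, allowed in access.items():
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

With this sample file, GPTBot is blocked while OAI-SearchBot and ChatGPT-User can fetch `/about`. A site configured this way can still appear through browsing, but is opting out of future training data.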

What does not work

Stuffing your brand name into hidden text. Buying mentions on low-trust networks. Writing AI-targeted content with no human readability. AI engines weight quality and consistency, not surface optimisation. The detection mechanisms (cross-reference checks, factual coherence checks, source-trust scores) are good and getting better.

Diagnostic test: which path are you missing?

Run the same prompt three times: in ChatGPT with browsing on, in ChatGPT with browsing off, and on Perplexity. If you appear with browsing off, you have training-data presence. If you appear only with browsing on, you have web-index presence but no entity recall: your brand is too new or too thinly mentioned. If you appear on Perplexity but not on browsing-on ChatGPT, your domain probably has crawl issues for the OpenAI infrastructure.
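If you run this test across many prompts, it helps to record the three observations as booleans and apply the rules mechanically. A minimal sketch of that decision table follows; the function name and the finding labels are made up for illustration, and the "no presence" case is an addition for completeness.

```python
def diagnose(browsing_off, browsing_on, perplexity):
    """Map the three observations to the findings described above.

    Each argument is True if the brand appeared in that run.
    """
    findings = []
    if browsing_off:
        findings.append("training-data presence")
    elif browsing_on:
        # Appears only when the browse tool is used.
        findings.append("web-index presence, no entity recall")
    if perplexity and not browsing_on:
        findings.append("possible crawl issues for OpenAI infrastructure")
    if not (browsing_off or browsing_on or perplexity):
        # Not covered by the rules above; added so every case returns something.
        findings.append("no presence detected on any path")
    return findings


print(diagnose(browsing_off=False, browsing_on=True, perplexity=True))
print(diagnose(browsing_off=True, browsing_on=False, perplexity=True))
```

Model answers vary from run to run, so a single pass per prompt is weak evidence. Repeating each prompt a few times before recording a boolean makes the result more trustworthy.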

Frequently asked questions

Does ChatGPT use Bing or Google?

ChatGPT's browsing tool used Bing historically and now uses OpenAI's own search infrastructure (sometimes blended). For your purposes the practical answer is: optimise for retrieval-grounded engines in general — robots.txt, schema, server-side rendering — and the specific backend doesn't matter much.

How often is the training data updated?

OpenAI ships new model variants roughly every 6–12 months, each with an updated training cutoff. Between releases, language-model recall is fixed. This is why third-party trust signals (Wikipedia, structured directories) are so valuable — they get included in the next training pass.

Can I check whether ChatGPT "knows" my brand?

Yes. Ask: "What do you know about [Brand Name]?" with browsing turned off. If the answer is detailed and accurate, you have language-model presence. If the answer is generic, hallucinated, or "I don't know", you do not.