Claude and GPT Disagree: Why a Universal Agent-Readiness Score Is Unreachable

A 908-brand correlation study published 2026-04-26 by Respectarium found four agent-readiness checks where the same signal predicts opposite outcomes in Claude and GPT. Sites with a sitemap are more likely to be listed by Claude. Sites with a sitemap are less likely to be listed by GPT. Both directions cleared FDR significance in the same dataset.
Two LLMs disagreeing on which brands matter is not a measurement quirk. It is two LLMs selecting by structurally different criteria. And Gemini, the third LLM in the study, does not fit either pattern. Three LLMs read by three different rules. A universal agent-readiness score that optimizes across all three at once is unreachable in the data. This piece is the breakdown of why.
Key Takeaways
- Four agent-readiness checks (sitemap-exists, oauth-discovery, robots-txt-exists, markdown-negotiation) predict opposite outcomes in Claude vs GPT, all four FDR-significant in both directions in Respectarium's 908-brand correlation study.
- The strongest single correlation in the entire study is sitemap-exists predicting GPT non-listing at ρ = -0.172. A check designed to help agents find a site predicts the opposite outcome for one of the three LLMs measured.
- Gemini reads by yet another pattern. markdown-url-support (ρ = +0.163) and redirect-behavior (ρ = -0.141) dominate Gemini's brand selection. Three LLMs, three different criteria.
- Effect sizes are small to medium across the entire study. Best Cohen's d = 0.64; no effects > 0.8. Real, but not silver bullets.
- A universal agent-readiness score that optimizes across Claude, GPT, and Gemini simultaneously is unreachable in the current data. Per-LLM strategies are different problems.
- Action: stop chasing a single agent-readiness score. Track Claude, GPT, and Gemini visibility separately. The signals that move one may move another the wrong way.
Why does the same agent-readiness check predict opposite outcomes in Claude and GPT?
A 908-brand correlation study found four agent-readiness checks where the same signal moves Claude and GPT in opposite directions on which brands each LLM picks. Both directions cleared FDR significance. Same checks. Two LLMs. Opposite verdicts.
The full correlation study was published 2026-04-26 by Respectarium. It tested 50 predictors against five LLM-visibility outcomes across Claude, GPT, and Gemini. Two outcomes anchor most of the analysis. The first is the rank a brand holds (1 to 20) when a given LLM lists it. The second is the binary "listed by this LLM at all" outcome across all 908 brands. The reversal lives in the binary view: which brands an LLM picks at all, not where the brand ranks once picked. FDR-corrected p values are reported throughout (false-discovery-rate adjustment, the standard correction when many statistical tests run together). A finding that survives FDR at p < 0.05 across hundreds of comparisons holds up against the kind of statistical fishing that would otherwise inflate false positives.
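For readers who want the mechanics of that correction, here is a minimal Benjamini-Hochberg sketch, assuming a flat array of raw p-values from the per-LLM tests. It is illustrative, not a reproduction of the study's scripts, and the example values are made up:

```python
# Minimal Benjamini-Hochberg FDR adjustment (illustrative, not the study's
# script). Input: raw p-values from many tests. Output: adjusted p-values;
# a finding "survives FDR" when its adjusted p < 0.05.
import numpy as np

def fdr_bh(pvals):
    pvals = np.asarray(pvals, dtype=float)
    n = len(pvals)
    order = np.argsort(pvals)                        # ascending raw p
    scaled = pvals[order] * n / np.arange(1, n + 1)  # p * n / rank
    # Enforce monotonicity from the largest rank downward (BH step-up)
    adjusted = np.minimum.accumulate(scaled[::-1])[::-1]
    out = np.empty(n)
    out[order] = np.clip(adjusted, 0, 1)
    return out

print(fdr_bh([0.0002, 0.0009, 0.004, 0.03, 0.2]))
```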
The four checks are agent-readiness baseline. The kind of thing every site is supposed to do. sitemap-exists checks whether the site publishes an XML sitemap at /sitemap.xml or as declared in robots.txt. robots-txt-exists checks whether the site publishes a robots.txt file at all (separately from what is in it). oauth-discovery checks for an OAuth authorization-server metadata document at /.well-known/oauth-authorization-server. markdown-negotiation checks whether the server returns text/markdown when an agent requests it via an Accept header instead of returning HTML. Standard stuff.
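As a concrete illustration of what those four checks test, here is a hedged probe sketch. It is not Respectarium's scanner code: it skips the robots.txt-declared sitemap path, and the timeout and status-code conventions are assumptions.

```python
# Illustrative probes for the four baseline checks (not the study's scanner).
import requests

def probe(domain: str, timeout: int = 10) -> dict:
    base = f"https://{domain}"

    def ok(path: str) -> bool:
        return requests.get(base + path, timeout=timeout).status_code == 200

    results = {
        "robots-txt-exists": ok("/robots.txt"),
        # Simplified: ignores sitemaps declared inside robots.txt
        "sitemap-exists": ok("/sitemap.xml"),
        "oauth-discovery": ok("/.well-known/oauth-authorization-server"),
    }
    # markdown-negotiation: does the server honor Accept: text/markdown?
    resp = requests.get(base + "/", headers={"Accept": "text/markdown"},
                        timeout=timeout)
    results["markdown-negotiation"] = resp.headers.get(
        "Content-Type", "").startswith("text/markdown")
    return results

print(probe("example.com"))
```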
Four signals reverse direction between Claude and GPT
| Signal | Claude ρ | GPT ρ | Claude FDR p | GPT FDR p |
|---|---|---|---|---|
| sitemap-exists | +0.134 | -0.172 | 0.0009 | <0.0001 |
| oauth-discovery | +0.135 | -0.113 | 0.0009 | 0.0097 |
| robots-txt-exists | +0.125 | -0.104 | 0.0018 | 0.0212 |
| markdown-negotiation | -0.106 | +0.119 | 0.0111 | 0.0082 |
Source: Respectarium correlation study, 02-per-llm.md, per-LLM binary analysis. All eight entries are FDR-significant within their respective LLM cell.
Reading row one. Brands with a sitemap are MORE likely to make Claude's list. Brands with a sitemap are LESS likely to make GPT's list. Both correlations clear FDR significance in their respective LLM analysis. Across the 300 per-LLM tests in the binary analysis, the GPT side of this row is the strongest single correlation in the study.
Reading row four. Brands that ship markdown content negotiation are LESS likely to make Claude's list. Brands that ship markdown negotiation are MORE likely to make GPT's list. Same signal. Reversed pattern.
The shape of the finding matters more than any single number. Same checks. Two LLMs. Opposite verdicts. Both directions FDR-significant. This is not noise canceling out across LLMs. It is two LLMs selecting brands by structurally different criteria.
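A minimal sketch of the computation behind each table cell, assuming a signals.csv with one row per brand; the column names here (sitemap_exists, listed_by_claude, ...) are hypothetical stand-ins for the study's actual schema:

```python
# Per-LLM binary analysis sketch: Spearman rho between one binary check and
# each LLM's binary "listed at all" outcome. Column names are hypothetical.
import pandas as pd
from scipy.stats import spearmanr

df = pd.read_csv("signals.csv")
for llm in ("claude", "gpt", "gemini"):
    rho, raw_p = spearmanr(df["sitemap_exists"], df[f"listed_by_{llm}"],
                           nan_policy="omit")
    print(f"{llm:7s} rho={rho:+.3f} raw_p={raw_p:.4g}")
# Per the study: positive for Claude, negative for GPT, null-ish for Gemini.
# Raw p-values would then pass through the FDR adjustment shown earlier.
```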
Gemini reads by yet another pattern
Gemini's strongest binary predictors are different signals entirely. Two checks dominate Gemini's brand selection at FDR significance:
| Signal | Gemini ρ | FDR p |
|---|---|---|
| markdown-url-support | +0.163 | <0.0001 |
| redirect-behavior | -0.141 | 0.0009 |
Neither of these signals appears in Claude's or GPT's top picks. On the four signals where Claude and GPT reverse, Gemini is essentially null. sitemap-exists shows ρ = -0.071 in Gemini's binary outcome (raw p = 0.038, but does not survive FDR correction). The other three reversed signals do not reach Gemini's top ten predictors.
The story is not "Claude versus GPT, with Gemini siding with one of them." It is three LLMs reading by three genuinely different criteria. A unified agent-readiness score has to reconcile three patterns, not two. And on the four signals where Claude and GPT reverse, Gemini does not weigh in either way.
What might explain the reversal?
The data establishes that the reversal exists. It does not establish why. Two testable hypotheses fit the pattern. Picking between them needs experiments the cross-sectional data cannot run. The structural fact (opposite directions on the same four signals at FDR significance) is what survives any explanation.
Hypothesis 1: brand vintage
GPT's older training cutoff means it can name household-brand candidates from the training-data distribution at inference time, without crawling the live web. Sites that ship a sitemap and a robots.txt skew toward established players GPT already knows. Sites without them skew newer or scrappier and are overrepresented in GPT's "fresh recall" surface area. The correlation runs the wrong way for the readiness signal because the signal acts as a proxy for established-player status, and GPT's response set is already biased toward established players from training in the first place.
A supporting fact in the same study: GPT over-includes bot-protected brands by +8.1 percentage points versus unblocked brands. Surfacing brands whose sites block crawlers is consistent with selection by name recognition rather than crawlability. As one Hacker News commenter put it in a related discussion about LLM training behavior: "Training is not inference, there is no reasoning happening then either". The brand-vintage hypothesis turns on exactly that distinction: the brand candidates GPT surfaces at inference time were largely set during the training crawl, and the crawl involves no reasoning about which brands are best. The practical implication for SEO and agent-readiness work: ChatGPT's behavior on search-shaped queries differs structurally from a Google fetch, and the brand candidates surfaced may already be locked in before any live web read happens.
Hypothesis 2: crawler policy
Sites that bother shipping a robots.txt are also more likely to ship explicit disallow rules. Some of those rules exclude GPT's crawler specifically (User-agent: GPTBot, Disallow: / configurations). Sites with no robots.txt at all are implicitly permissive toward every crawler. The inversion follows: well-configured sites are under-represented in GPT's reachable set, exactly as observed in the study. The data records whether each site has a robots.txt file but not which crawlers each configuration disallows. The hypothesis is testable but not yet tested. A follow-up study would parse the robots.txt of every brand in the sample, classify by GPTBot disallow status, and re-test the correlation conditional on policy.
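A sketch of what that classification step could look like, using Python's standard-library parser; the domains are placeholders for the brand sample, and a real study would need more care than a single can_fetch call:

```python
# Classify a domain by whether its robots.txt disallows GPTBot (sketch).
from urllib.robotparser import RobotFileParser

def gptbot_blocked(domain: str) -> bool:
    rp = RobotFileParser(f"https://{domain}/robots.txt")
    rp.read()  # an unreachable robots.txt parses as fully permissive
    return not rp.can_fetch("GPTBot", f"https://{domain}/")

for domain in ("example.com", "example.org"):  # placeholder brand domains
    print(domain, "blocks GPTBot:", gptbot_blocked(domain))
```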
What survives
The cross-sectional data does not pick between these two candidates. Both predict the observed direction. Neither is asserted here as a finding. Effect sizes are small to medium across the entire study. The largest Cohen's d is 0.64 (oauth-discovery against Claude rank); no effect anywhere reaches d > 0.8. The reversal holds up on the data measured. It is not large in absolute size.
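For reference, Cohen's d as cited above is the standardized mean difference between two groups. A minimal sketch with a pooled standard deviation, run on illustrative arrays rather than study data:

```python
# Cohen's d with pooled standard deviation (illustrative, not study data).
import numpy as np

def cohens_d(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    na, nb = len(a), len(b)
    pooled = np.sqrt(((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1))
                     / (na + nb - 2))
    return (a.mean() - b.mean()) / pooled

# Conventional bands: ~0.2 small, ~0.5 medium, ~0.8 large. The study's
# ceiling of 0.64 lands in the medium band.
rng = np.random.default_rng(0)
print(cohens_d(rng.normal(0.64, 1, 500), rng.normal(0.0, 1, 500)))
```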
What does this mean for agent-readiness strategy?
A universal agent-readiness score that optimizes across Claude, GPT, and Gemini at the same time is unreachable in the current data. Per-LLM tuning is the only honest play. And per-LLM tuning can mean directionally opposite advice for the same signal. This is not a limitation of one scanner or one weighting scheme. It is a feature of the data: the LLMs themselves disagree.
The empirical anchor
The Respectarium scanner's own aggregate score had mean |ρ| = 0.016 across the five LLM-visibility outcomes in the study. FDR-adjusted p = 0.69. Zero predictive power. Respectarium published this null result openly in the preprint methodology. The mechanism behind a null aggregate is exactly what the four-signal reversal explains. Opposing per-LLM correlations cancel in any cross-LLM average. Build an aggregate from a Claude-positive signal and a GPT-negative signal and the result hovers near zero. Visible the moment the data is split by LLM.
This generalizes beyond Respectarium's scanner. Any single-number agent-readiness score that aggregates across LLMs will produce the same kind of canceling. The marketing claim "ship these signals and improve your AI visibility" is a cross-LLM average of effects. The data shows the cross-LLM average is the part with no signal.
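The cancellation is visible in three lines of arithmetic. A toy illustration using the study's sitemap-exists correlations, with equal LLM weighting as an assumption for illustration, not any scorer's actual formula:

```python
# Toy illustration of cross-LLM canceling on sitemap-exists.
claude_rho, gpt_rho, gemini_rho = +0.134, -0.172, -0.071
print(f"cross-LLM average rho = {(claude_rho + gpt_rho + gemini_rho) / 3:+.3f}")
# -> -0.036: two real, opposing per-LLM signals average to roughly nothing.
```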
The narrow per-LLM prescription
Three actions, each load-bearing.
- Pick the LLM that matters to your buyer. Claude tends to matter most in coding-tool-adjacent buying audiences. GPT in mass-market and consumer surfaces. Gemini in Workspace-bundled accounts. The choice is a buyer call, not an SEO call.
- Tune for that LLM's signals. Expect the choice to age as the ecosystem matures and crawlers shift behavior. The reversal observed in 2026-04 is not guaranteed to hold in 2027.
- Ignore agent-readiness scoring that averages across LLMs. The averaging hides the reversal. Whether the product calls it an Agent Readiness Score or an AI Visibility Index, the cross-LLM aggregate is the part the data says has no signal.
The ecosystem context
The agent-readiness ecosystem itself does not yet match where the marketing reads. 20 of 66 measured per-check signals had less than 5% real-world adoption among the 908-brand sample. The bleeding-edge protocol family (MCP server cards, A2A agent cards, OAuth protected resource declarations, AGENTS.md, web-bot-auth, x402, mpp, ucp, acp, ap2) clusters near zero adoption. The specs exist. Practice has not yet caught up. The only mainstream-brand-level spec ship in the last 90 days was Yoast's llms.txt for Shopify on 2026-03-31. One ship across roughly 14 spec announcements.
CompetLab AI Visibility tracking reports Claude, GPT, and Gemini outcomes separately by default, exactly because the cross-LLM averaging that hides the signal in a universal score does not work for users either. Per-LLM is the only frame that survives the data.
How was this measured, and what are the limits?
The Respectarium correlation study (study-2026-04) is a cross-sectional snapshot of 908 brands across 50 B2B SaaS categories. Three independent agent-readiness scanners (Respectarium, Cloudflare, and Fern) produced 72 measurements per brand. Each brand's listing position across Claude, GPT, and Gemini was captured from Respectarium's tracked leaderboards over those 50 categories. Sample sizes per binary test are typically 845 to 906 (some predictors are null for ~60 brands). Per-LLM rank-only analyses use smaller samples, since each LLM lists only a subset of the 908 brands in its category responses. Claude n ≈ 640. Gemini n ≈ 380. GPT n ≈ 378. Pre-registered analytical thresholds were committed in writing on 2026-04-24, two days before data analysis began. All analyses are deterministic. The source data (signals.csv) and all 11 stats scripts are public on GitHub.
Selection effect
Every brand in the dataset was already LLM-mentioned by construction. The brand universe was assembled from brands that appeared in at least one LLM's response to a category-level "top brands" prompt. Findings characterize relative rank among already-mentioned brands. The headline marketing question (does shipping these signals get an LLM to mention me at all) cannot be answered from this data. There are no non-mentioned brands in the sample. The data tells you which signals shift ranking among brands LLMs already know.
Critical scope limit on the reversal
All four sign-reversals are measured by Respectarium's scanner specifically. No Cloudflare or Fern check shows the same Claude/GPT divergence at FDR significance. Cross-scanner correlation on same-named checks is ρ ≈ 0.03 across three pairs (fern.redirect-behavior ↔ respectarium.redirect-behavior, fern.markdown-url-support ↔ respectarium.markdown-url-support, fern.rendering-strategy ↔ respectarium.rendering-strategy). The implementations of the "same" check differ enough that the reversal cannot be replicated by simply running the other two scanners on the same brand set. Treat the reversal as supported by the data available, not as settled across the agent-readiness field. Scanner divergence is its own published finding.
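The cross-scanner agreement figure is itself a simple computation. A sketch, assuming signals.csv columns named scanner.check (the naming is a guess at the published schema):

```python
# Cross-scanner agreement sketch on the three same-named check pairs.
# Column naming (scanner.check) is an assumption about the public schema.
import pandas as pd
from scipy.stats import spearmanr

df = pd.read_csv("signals.csv")
for check in ("redirect-behavior", "markdown-url-support", "rendering-strategy"):
    rho, _ = spearmanr(df[f"fern.{check}"], df[f"respectarium.{check}"],
                       nan_policy="omit")
    print(f"{check}: cross-scanner rho = {rho:+.3f}")  # study reports ~0.03
```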
Independent COI disclosure
Respectarium operates one of the three scanners evaluated. The preprint reports the zero-predictive-power result on the scanner's aggregate score (mean |ρ| = 0.016, FDR-adjusted p = 0.69) as a primary finding rather than a footnote. Pre-registered thresholds are auditable in the open research repo. Per-scanner reporting throughout, so readers can examine each scanner's signals independently. Full COI section at methodology.md §4.
The companion specification (the open Agent-Adoption Specification v1) is published under CC-BY 4.0 and is implementable by independent scanners. CompetLab maintains a separate implementation: scan your own site against the open spec. Plain-language interpretation of the findings lives at the same domain.
The signals that move Claude, GPT, and Gemini brand selection in 2026 may shift as crawler behavior shifts. The current cross-sectional data establishes the gap. Two LLMs, same checks, opposite verdicts. Three LLMs, three different patterns. The next quarterly re-run will say whether the reversal holds. Plan per-LLM until the data says otherwise.
Frequently Asked Questions
Does this mean optimizing for agent-readiness hurts my GPT visibility?
Not exactly. The reversal is in relative ranking among brands GPT already lists. The data does not say 'if you implement these signals, GPT removes you from its list.' It says 'among brands GPT already mentions, those without a sitemap rank higher in GPT's response than those with a sitemap, all else equal.' The mechanism behind the inversion is unsettled. Two hypotheses fit, neither tested. Do not tear out your sitemap based on a cross-sectional observation.
Are the effect sizes large enough to act on?
The largest Cohen's d in the entire study is 0.64 (medium). The largest |ρ| is 0.172. These are small-to-medium effects, not silver bullets. They are real (FDR-significant in both directions) and honestly reported; Respectarium's preprint states the sizes plainly. The takeaway: agent-readiness signals shift relative ranking among already-listed brands by a few positions on average, and the direction depends on which LLM you measure.
What about Gemini? Should I tune for Gemini separately too?
Yes, if Gemini matters to your buyer. Different signals dominate Gemini's brand selection: markdown-url-support (ρ = +0.163) and redirect-behavior (ρ = -0.141), both FDR-significant. Neither is in Claude's or GPT's top picks. Gemini reads by a third pattern. The general implication: track each LLM's visibility separately. A unified agent-readiness score has to reconcile three patterns, and the data shows the three are not converging.
Can I trust a finding when it is only on one of the three scanners?
This is the key methodological limit. All four sign-reversals are on the Respectarium scanner. No Cloudflare or Fern check shows the same Claude/GPT divergence at FDR significance. Cross-scanner correlation on same-named checks is ρ ≈ 0.03 across three pairs. Implementations differ enough that simple replication is not possible. Treat the reversal as robust on the data available, not as settled across the agent-readiness field. Respectarium discloses this scope limit in the methodology.
Is a universal agent-readiness score impossible, or just hard right now?
Unreachable in the current data, in the precise sense that any cross-LLM aggregate cancels the four FDR-significant reversals between Claude and GPT. Until LLM crawling and brand-recognition behaviors converge across vendors, optimization across all three at once is a contradiction. Any weighting that helps Claude visibility on these four signals hurts GPT visibility on the same signals. The quarterly correlation re-runs are designed to track whether the reversal stabilizes or shifts.