00 · Methodologyaxo-v0-haiku-4.5-2026-05-05

What the score knows, and what it doesn't.

AXO is an empirical, per-page agent-readiness score. The number comes from probing real LLM agents against real pages and watching what happens. This document is the long-form answer to every question that starts with "how do you know."

5,000

Corpus

HN-sourced URLs scored

4,541

Probes evaluated

govt archetype excluded

Haiku 4.5

Calibration model

claude-haiku-4-5-20251001

2026-05-05

Frozen

version tag immutable

One · The probe

An agent reads the page.

Every URL in the corpus is scraped once and then handed to a Claude Haiku 4.5 agent in BROWSER_USE modality on a READtask. The agent is asked to extract the page's primary content. We record three things: did it succeed (binary), how many tokens it used (input and output), and the resulting USD cost.

Failure is not a single thing. It can be an anti-bot wall, a paywall, a JavaScript-only renderer that the scraper missed, a layout the agent couldn't parse, or a page that simply has no content where its title promised some. We record the outcome and the cost; we don't try to classify the cause at probe time.

The corpus

5,000 URLs.

Hacker News submissions form the seed list. The corpus skews long-tail and English-language; it is not a representative sample of the whole web. Government pages (archetype = govt) are excluded from validation because their probe outcomes are dominated by access policy, not page structure.

N=4,541 is the validation sample. The full 5,000 corpus carries scores; only the 4,541 is what the published ρ refers to.

Two · The score

Eleven dimensions, equal weight.

The AXO score is the count of how many of eleven boolean dimensions a page satisfies, mapped to a 0–100 scale. The dimensions are arranged so true means agentically good: js-independent, structured-data present, heading hierarchy valid, semantic HTML present, machine-readable blocks present, answer-first, navigable, selector-rich, prose-heavy, text-container is prose, and references are contextual.

Equal weight is a choice, not a fallback. We tested learned weights against binary success across five seeds: held-out ρ fell from +0.322 (equal-weight) to +0.257 (learned, mean over 5 seeds). The static-feature space simply does not benefit from learning on this label. v0 is frozen at the equal-weight composite because that is the strongest defensible scorer the data supports.

Three · Validation

ρ = +0.322.

The published correlation is Spearman ρ = +0.322 between the equal-weight v0 score and binary probe success on the N=4,541 validation sample. This is a direct rank-correlation, not a regression fit. It is also the strongest figure across every static- feature alternative we have tried: HTTPA page features failed a 40% coverage gate before regression; HTTPA origin features came back at Δρ=−0.084 (actively harmful); declarative dims as a block gave Δρ=+0.019 (below our +0.03 ship gate); and a cherry-picked best single feature collapsed across seeds to mean Δρ=−0.023.

We treat +0.322 as the structural ceiling for static-feature agent-readiness on this corpus. Future lift, if any, must come from probe-behavior signals or multi-model calibration — not more dimensions.

The cost surface

Measured, not modeled.

Every probe records its USD cost (and input/output tokens) directly from the Anthropic API. The score-card surface reports observed mean cost per URL — not a prediction.

As an aside on what the v0 dimensions could predict: ρ against cost is +0.259 (equal-weight), rising to +0.333 (learned, σ=0.003 across 5 seeds). Cost is a more stable training signal than success. We have not switched the headline label because the published v0 number is success-aligned and the lift from re-fitting on cost (+0.011 over equal-weight against success) does not justify breaking methodology continuity.

Four · What this is not

Limitations, stated plainly.

honest

This is not a declarative checklist. Cloudflare's Agent Readiness Score and Fern's Agent Score both audit a site for the presence of llms.txt, MCP cards, robots.txt directives, structured-data tags, and similar. AXO does not. We score what an agent actually experiences when probing the page, not what the page declares. The two are correlated but not identical.

This is not multi-agent. v0 is calibrated exclusively against Claude Haiku 4.5. A page that scores 80 here may score differently when probed by Sonnet, Opus, GPT-5, or Gemini. Multi-agent score vectors are on the roadmap; v0 is a single-model snapshot.

This is not site-level. AXO scores individual URLs, not domains. The same site can have a score-95 article and a score-40 product page. Aggregate at your own risk.

This is not stationary. Probe outcomes depend on the agent (Haiku 4.5 today; future model versions), the scraper (in-house Playwright; not a stealth fetcher), and the corpus (long-tail, English, HN-sourced). When any of those change, ρ changes. We re-validate; we do not pretend the number is fixed.

Version

axo-v0-haiku-4.5-2026-05-05

frozen

This tag identifies the scorer that produced every AXO score on this site. The algorithm, the dimension set, the calibration model, and the corpus are all bound to this string. Future versions (multi-agent vectors, probe- behavior signals, broader corpora) will carry distinct tags. Old scores will not be silently rewritten.

API responses include this tag at axo_score.version. Quote it when comparing scores across time.