Methodology

Benchmark v2 is explicit about what is inferred, what is measured, and why confidence moves.

WhatsMy.AI is meant to answer one question quickly: what kind of local AI should this desktop or laptop realistically handle, and how much confidence should you give that answer? The benchmark spec is now versioned so the public explanation and the implementation can stay aligned.

Benchmark contract

Manifest version: 2026.03-browser-manifest-v2

Benchmark generation ID: 2026.03-browser-benchmark-v2

Prompt bundle version: 2026.03-reference-lm-prompts-v1

Current score version: 2026.03-browser-v6

Release timestamp: 2026-03-13T00:00:00Z

Read current benchmark manifest JSON

Read benchmark v2 spec and workload matrix

Read reproducibility bundle v1 contract

The linked manifest and spec are the public release record for this recommendation lineage. Any benchmark-sensitive change now requires both a manifest diff and a changelog entry before release.

Latest manifest change: 2026-03-13T00:00:00Z. Adds deterministic WebGPU fallback lineage, adapter-class telemetry, and degraded-path confidence guardrails to the active benchmark release.

The linked versioned spec includes the workload matrix, changelog notes, and the review checklist for future benchmark changes.
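The release rule above (a manifest diff plus a changelog entry for any benchmark-sensitive change) can be expressed as a small pre-release check. This is a minimal sketch, not the project's actual verify flow: the `ProposedChange` shape and the function name are hypothetical and introduced only for illustration.

```typescript
// Hypothetical description of a pending change; field names are illustrative only.
interface ProposedChange {
  benchmarkSensitive: boolean; // does the change touch scoring or confidence logic?
  manifestDiff: string | null; // text of the manifest diff, if one was prepared
  changelogEntry: string | null; // text of the changelog entry, if one was written
}

// Returns the release requirements that are still missing.
// An empty array means the change satisfies the rule described in the text.
function missingReleaseRequirements(change: ProposedChange): string[] {
  if (!change.benchmarkSensitive) {
    return []; // the rule only applies to benchmark-sensitive changes
  }
  const missing: string[] = [];
  if (change.manifestDiff === null || change.manifestDiff.trim() === "") {
    missing.push("manifest diff");
  }
  if (change.changelogEntry === null || change.changelogEntry.trim() === "") {
    missing.push("changelog entry");
  }
  return missing;
}
```

A change that only carries a manifest diff would come back with `["changelog entry"]`, which a release script could use to block the release.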

Benchmark v2 separates browser-visible capability detection from score-bearing browser-native workloads.

Capability detection stays browser-native and runtime-agnostic. The score-bearing reference workload standardizes on ONNX Runtime Web over WebNN, WebGPU, or WASM so results stay repeatable across browsers while still tracking real accelerator backends.

WebLLM remains an important compatibility target, but it is not the scoring harness until its browser-to-browser repeatability can be held to the same standard as the current reference path.
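The split between capability detection and the score-bearing path can be illustrated with a pure function that turns detected capabilities into a backend preference list. This is a sketch under stated assumptions: the priority order (WebNN, then WebGPU, then WASM) and the idea that WASM is always available as a last resort are assumptions, not something the spec text states, and the `DetectedCapabilities` shape is hypothetical. The backend strings are meant to mirror the execution provider names that ONNX Runtime Web accepts.

```typescript
// Backends named in the text for the score-bearing reference workload.
type Backend = "webnn" | "webgpu" | "wasm";

// Hypothetical result of the browser-native capability detection step.
interface DetectedCapabilities {
  webnn: boolean;
  webgpu: boolean;
}

// Builds an ordered preference list from detected capabilities.
// Assumption: accelerated paths are preferred, and WASM is the final fallback.
function backendPreference(caps: DetectedCapabilities): Backend[] {
  const order: Backend[] = [];
  if (caps.webnn) order.push("webnn");
  if (caps.webgpu) order.push("webgpu");
  order.push("wasm"); // assumed to be available in every supported browser
  return order;
}
```

Keeping this step free of any model work means detection can run everywhere, while the scored workload only starts once a backend list exists.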

What is inferred before scoring

Hardware identity. CPU, GPU, and NPU labels are inferred from WebGPU, WebGL, UA hints, and open hardware references.

Memory headroom. Installed RAM is taken directly when exposed, otherwise inferred conservatively from browser limits and bounded probes.

Runtime path coverage. WebNN, WebGPU, WebGL, and worker/runtime features define the browser-side ceiling before any score-bearing model work runs.

What is measured in the benchmark

CPU and graphics throughput screen. Median-of-samples quick pass with adaptive extra sampling when variance is high.

Time to first token. Measures reply-start latency after prompt ingest on the browser reference LM.

Short and long prefill. Measures prompt-ingest speed on short chat prompts and longer context-heavy prompts.

Structured output and tool-call emission. Uses a deterministic structured prompt so browser backends are checked on JSON-like/tool-call style emission, not only free-form text.

Sustained decode. Measures both burst decode speed and slower steady-state decode under a longer run.

Context stability and bounded memory probes. Sweeps longer prompts and bounded allocations to find where browser-native execution stops feeling stable.

Repeatability spread. Tracks TTFT, prefill, and decode spread across repeated short runs so noisy environments lose confidence.
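The median-of-samples screen and the repeatability spread described above can be sketched together. This is an illustration only: the sample counts, the 15% spread threshold, and the choice of (max - min) / median as the spread measure are all assumptions, since the text does not publish the real values or formula.

```typescript
// Median of a non-empty list of numbers.
function median(values: number[]): number {
  if (values.length === 0) throw new Error("median of empty sample set");
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Assumed spread measure: range divided by median.
function relativeSpread(values: number[]): number {
  const max = Math.max(...values);
  const min = Math.min(...values);
  const m = median(values);
  if (m === 0) return max === min ? 0 : Infinity;
  return (max - min) / m;
}

// Hypothetical tuning knobs; the real values are not given in the text.
interface SamplingOptions {
  baseSamples: number;
  extraSamples: number;
  maxRelativeSpread: number;
}

const DEFAULT_SAMPLING: SamplingOptions = { baseSamples: 5, extraSamples: 5, maxRelativeSpread: 0.15 };

interface ScreenResult {
  median: number;
  spread: number;
  sampleCount: number;
}

// Takes a base batch of samples and adds one extra batch when variance is high.
function adaptiveMedian(sample: () => number, options: SamplingOptions = DEFAULT_SAMPLING): ScreenResult {
  const values: number[] = [];
  for (let i = 0; i < options.baseSamples; i++) values.push(sample());
  if (relativeSpread(values) > options.maxRelativeSpread) {
    for (let i = 0; i < options.extraSamples; i++) values.push(sample());
  }
  return { median: median(values), spread: relativeSpread(values), sampleCount: values.length };
}
```

The returned `spread` is the kind of signal a confidence step could use to downgrade noisy environments, while the median keeps a single outlier from setting the headline number.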

Repeatability and confidence

Warm-up first. A non-scored warm-up run executes before the scored prompt suite to reduce one-time compilation and cache noise.

Median before headline numbers. Quick benchmarks use median-style aggregation, while repeatability spread exposes when reruns are likely to move around.

Fixed budget and bounded probes. Context and memory probes stop early when latency or allocation behavior becomes unstable, keeping the benchmark safe and repeatable.

High confidence. Use this when a recommended desktop browser completed the measured reference workload cleanly with stable repeatability.

Medium confidence. Useful for shortlisting, but some runtime coverage is weaker, partially inferred, or noisier than ideal.

Low confidence. Directional only. Re-run in a stronger browser/runtime path before making hardware or deployment decisions.
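The fixed-budget, bounded-probe rule from this section can be sketched as a sweep that stops at the first sign of instability. This is a minimal sketch, assuming a probe callback that reports latency and success; the `ProbeResult` shape, the limit names, and the stop reasons are hypothetical, and the real stop conditions may be richer.

```typescript
// Hypothetical outcome of one context or allocation probe.
interface ProbeResult {
  latencyMs: number;
  ok: boolean; // false when the allocation or run failed
}

// Hypothetical limits; the real thresholds are not given in the text.
interface SweepLimits {
  maxLatencyMs: number; // per-probe latency ceiling
  budgetMs: number; // total time allowed for the whole sweep
}

interface SweepOutcome {
  lastStable: number | null; // largest size that stayed within limits
  reason: "completed" | "failed" | "latency" | "budget";
}

// Walks sizes in ascending order and stops early when a probe fails,
// exceeds the latency ceiling, or the total budget is spent.
function boundedSweep(sizes: number[], probe: (size: number) => ProbeResult, limits: SweepLimits): SweepOutcome {
  let lastStable: number | null = null;
  let spentMs = 0;
  for (const size of sizes) {
    if (spentMs >= limits.budgetMs) return { lastStable, reason: "budget" };
    const result = probe(size);
    spentMs += result.latencyMs;
    if (!result.ok) return { lastStable, reason: "failed" };
    if (result.latencyMs > limits.maxLatencyMs) return { lastStable, reason: "latency" };
    lastStable = size;
  }
  return { lastStable, reason: "completed" };
}
```

Because the sweep returns as soon as a limit is crossed, larger sizes are never attempted, which is what keeps the probe safe on constrained machines.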

Anti-gaming controls

Deterministic seeded prompt bundles. Each benchmark generation pins a seeded prompt bundle so repeated runs stay reproducible while still making score chasing harder than memorizing one static prompt set.

Rotating holdout task. A holdout task rotates every 14 days and is validated separately from the headline prompt trio so obvious replay patterns and tuned one-prompt runs are easier to spot.

Aggregate anomaly exclusion. Leaderboard and known-device aggregates exclude replay fingerprints and throughput spikes by default instead of treating every shared run as equally trustworthy evidence.

Holdout rotation cadence. The holdout task currently rotates every 14 days. The stored share payload records the benchmark generation ID, browser, OS, GPU adapter class, accelerator mode, and model hash used for each run.
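A 14-day rotation can be derived deterministically from a timestamp, and the share payload fields listed above map naturally onto a typed record. The sketch below is illustrative only: the rotation anchor (here taken to be a release timestamp passed in by the caller), the modulo-based task selection, and every identifier name are assumptions rather than the site's actual implementation.

```typescript
const ROTATION_DAYS = 14; // cadence stated in the text
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Fields the text says are stored with each shared run; names are hypothetical.
interface SharePayload {
  benchmarkGenerationId: string;
  browser: string;
  os: string;
  gpuAdapterClass: string;
  acceleratorMode: string;
  modelHash: string;
}

// Index of the 14-day window that contains `at`, counted from `anchor`.
function holdoutWindowIndex(at: Date, anchor: Date): number {
  const elapsedMs = at.getTime() - anchor.getTime();
  if (elapsedMs < 0) throw new RangeError("timestamp precedes the rotation anchor");
  return Math.floor(elapsedMs / (ROTATION_DAYS * MS_PER_DAY));
}

// Assumed selection rule: cycle through the task list one window at a time.
function pickHoldoutTask<T>(tasks: T[], at: Date, anchor: Date): T {
  if (tasks.length === 0) throw new Error("no holdout tasks configured");
  return tasks[holdoutWindowIndex(at, anchor) % tasks.length];
}
```

With an anchor of 2026-03-13T00:00:00Z, any run before 2026-03-27T00:00:00Z falls in window 0 and a run at that instant falls in window 1, so two runs can be compared fairly only when their window index and `benchmarkGenerationId` match.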

Trust and monetization boundaries

Current disclosure policy: 2026-03-12-disclosure-v1. Referral payouts, sponsorship inventory, and sales goals cannot change model order, runtime order, upgrade guidance, or any confidence wording.

Ranking stays sealed off from money. Benchmark score, model order, runtime order, and confidence bands stay derived from benchmark logic only. No sponsor, payout rate, or partner inventory can change that answer.

Revenue lives in separate, labeled surfaces. Any affiliate, sponsor, or premium offer must appear after the recommendation or on a guide page with plain labeling so users can distinguish advice from monetization.

No paid boosts inside recommendation rows. Sponsored runtimes or paid placements can never appear inside the ranked model list, runtime list, or confidence explanation.

Eligibility is chosen before payout rate. An offer can show only when the benchmark outcome makes it relevant first, such as upgrade-heavy machines or low-confidence local outcomes.

Experiment flags default off and only gate optional post-result CTAs, never the benchmark answer itself.

Pricing holdout prompts, when enabled, stay in a separate result-page card and are logged independently from trust feedback.

Read the sponsorship and referral disclosure policy

Model catalog provenance

Every published model entry carries a source URL, source-normalized release date, license, modality, runtime availability, a last-verified timestamp, and machine-readable evidence URLs for any `origin: US` claim. Each entry also carries a per-runtime verification block that records the checked target, source URL, timestamp, and caveats for Ollama, LM Studio, and llama.cpp.

Ownership sits with the WhatsMy.AI benchmark maintainers. Entries are reviewed every 30 days, and the project verify flow now fails if required provenance metadata is missing or drifts away from the tracked runtime paths.

Compare cards surface those runtime verification badges and the last-checked date directly from the registry.

Current US-origin eligibility policy: 2026-03-12-us-origin-v2. The public provenance guide now carries the model-family decision changelog, linked evidence bundles, and current edge-case decisions so the catalog rules stay auditable.

Latest provenance policy update: 2026-03-12. Published the US-origin eligibility policy, added machine-readable evidence URLs plus per-runtime verification metadata to the catalog, and made the verification gate fail when a recommended entry is missing policy evidence or a current downloadable runtime path.

When a family lands near the inclusion line, it stays out of the visible shortlist until the provenance guide records the edge-case rule and links that rationale from the public decision changelog.

Review the American-model decision changelog

Read edge-case decisions

Read the full model provenance guide

Interpretation limits

Browsers still do not expose full VRAM, thermals, power state, or native runtime scheduler behavior.

The measured browser matrix currently tracks 1 cohort for score version 2026.03-browser-v6. Confidence now follows measured cohorts, not a fixed browser list.

TFLOPs-eq is a site-local comparison estimate, not a manufacturer specification.

Important purchase or deployment decisions should still be validated in the exact local runtime you plan to use.

Measured browser cohorts

Chrome. 1 measured run across 1 cohort. Effective confidence: low. Paths: Full WebGPU. Reason codes: Full WebGPU cohort, Thin coverage, Stale cohort, Primary browser path.

Public proof release gate

Release criteria stay public even while the live evidence surfaces remain gated off.

Benchmark coverage. Require at least 40 completed browser benchmarks so that public proof is more than a handful of anecdotes. Release gate: 40 completes.

Paired outcome feedback. Require at least 15 matched or not-matched answers so the public matched-rate reflects real use rather than soft sentiment alone. Release gate: 15 paired answers.

Redacted trace coverage. Require at least 12 opt-in redacted traces so repeated errors can be reviewed without exposing raw personal telemetry. Release gate: 12 traces.

Chrome and Edge coverage. Require at least 10 paired Chrome or Edge feedback samples before making purchase-grade claims on the highest-confidence browser path. Release gate: 10 Chrome/Edge pairs.

Firefox coverage. Require at least 5 paired Firefox samples before publishing cross-browser comparisons that include Firefox. Release gate: 5 Firefox pairs.

Known-device clusters. Require at least 3 publishable device clusters so public evidence reflects multiple hardware families instead of one standout machine. Release gate: 3 clusters.
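The six release gates above are simple minimum counts, so they can be checked with one lookup table. The thresholds below come from the text; the key names, the `EvidenceCounts` type, and the function are hypothetical. The sketch reports which gates are unmet rather than a single pass or fail, because the text ties some gates (Chrome and Edge, Firefox) to specific kinds of public claims.

```typescript
// Minimum counts taken from the release criteria in the text; key names are illustrative.
const RELEASE_GATES = {
  benchmarkCompletes: 40,
  pairedAnswers: 15,
  redactedTraces: 12,
  chromeEdgePairs: 10,
  firefoxPairs: 5,
  deviceClusters: 3,
} as const;

type GateName = keyof typeof RELEASE_GATES;
type EvidenceCounts = Record<GateName, number>;

// Lists every gate whose current count is below its minimum.
function unmetGates(counts: EvidenceCounts): GateName[] {
  return (Object.keys(RELEASE_GATES) as GateName[]).filter(
    (name) => counts[name] < RELEASE_GATES[name],
  );
}
```

For example, counts that meet every minimum except four Firefox pairs would return only `firefoxPairs`, signalling that cross-browser comparisons including Firefox should stay unpublished.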

Known-device confidence rules

Publish known-device page. A device page only publishes after at least 5 loaded shared runs on the current score version land in the same hardware cluster.

Medium confidence cluster. Medium confidence requires 5 or more shared runs plus at least 58% average deployment confidence across that cluster.

High confidence cluster. High confidence requires 10 or more shared runs plus at least 72% average deployment confidence across that cluster.
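The known-device rules above reduce to two inputs: the number of shared runs in a cluster and the cluster's average deployment confidence. The sketch below encodes the stated thresholds. It assumes the confidence is expressed as a percentage from 0 to 100, and it labels a published cluster that misses the medium bar as "low", which is an assumption because the text does not name that case.

```typescript
type ClusterBand = "unpublished" | "low" | "medium" | "high";

// sharedRuns: runs on the current score version in one hardware cluster.
// avgDeploymentConfidencePct: cluster average, assumed to be on a 0-100 scale.
function clusterBand(sharedRuns: number, avgDeploymentConfidencePct: number): ClusterBand {
  if (sharedRuns < 5) return "unpublished"; // publish rule: at least 5 shared runs
  if (sharedRuns >= 10 && avgDeploymentConfidencePct >= 72) return "high";
  if (avgDeploymentConfidencePct >= 58) return "medium"; // run count is already at least 5 here
  return "low"; // assumed label for published clusters below the medium bar
}
```

Note that run count caps the band: a cluster with 9 runs stays at medium even with a very high average, because the high band requires 10 or more runs.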

Public proof privacy boundary

Public. Aggregated benchmark and feedback counts. Publish benchmark completes, paired feedback counts, trace coverage, and matched-rate only as rolled-up totals.

Public. Cluster summaries and representative runs. Publish known-device averages, browser mix, confidence band, and a representative share result closest to the cluster average.

Public. Mismatch review workflow. Publish hotspot labels, mismatch rate, and the next calibration action once a repeated issue crosses the review threshold.

Public. Sanitized native validation comparisons. Publish browser-to-native comparison deltas only after the paired run set is large enough to avoid turning one machine into a public case study.

Admin only. Raw telemetry exports. Session IDs, full event payloads, and exact timestamps remain behind the admin export token.

Admin only. Free-form notes and unaggregated traces. Unstructured feedback notes and individual opt-in traces stay private unless they were explicitly shared as public result links.

Admin only. Internal calibration decisions. Pre-public review notes, rejected hypotheses, and unpublished calibration experiments remain admin-only until promoted into aggregated public evidence.
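Among the public items above, the representative share result is described as the one closest to the cluster average. A minimal sketch of that selection follows, assuming each run can be summarized by a single numeric score; the `SharedRun` shape and the tie-breaking rule (the earliest run wins) are assumptions.

```typescript
// Hypothetical minimal view of a shared run; only an identifier and one score.
interface SharedRun {
  id: string;
  score: number;
}

// Returns the run whose score is nearest to the cluster mean.
// On ties, the run that appears first in the list is kept.
function representativeRun(runs: SharedRun[]): SharedRun {
  if (runs.length === 0) throw new Error("cluster has no runs");
  const mean = runs.reduce((sum, run) => sum + run.score, 0) / runs.length;
  return runs.reduce((best, run) =>
    Math.abs(run.score - mean) < Math.abs(best.score - mean) ? run : best,
  );
}
```

Publishing the run nearest the average, rather than the best one, keeps a single standout machine from defining the cluster's public face.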

Benchmark vs native validation

No publishable native validation comparisons exist yet. Keep browser-to-native validation private until multiple paired runs are available.

Native validation is meant to compare browser predictions against the exact local runtime path before stronger purchase guidance is claimed publicly.