# What'sMy.AI benchmark v2 spec and workload matrix

Status: active
manifestVersion: 2026.03-browser-manifest-v2
benchmarkVersion: 2026.03-browser-benchmark-v2
promptBundleVersion: 2026.03-reference-lm-prompts-v1
scoreVersion: 2026.03-browser-v6
releaseTimestamp: 2026-03-13T00:00:00.000Z

## 1. Purpose

Benchmark v2 defines the first versioned benchmark contract for representative browser-native local LLM workloads.
It exists to keep the public methodology, the implementation, and stored result metadata aligned.

This version standardizes on ONNX Runtime Web as the score-bearing browser harness.
Browser capability detection still looks at WebNN, WebGPU, WebGL, workers, and browser-exposed memory/runtime signals,
but those are separated from the measured reference-model loop so the output is explicit about what was inferred,
what was measured, and what was derived from the combination.

## 2. Scope boundary

- Benchmark v2 is browser-native. It does not claim native-runtime parity for Ollama, LM Studio, or llama.cpp.
- The score-bearing reference path runs a deterministic browser reference LM through ONNX Runtime Web.
- WebLLM compatibility remains important, but it is not the normative scoring harness for v2.
- The benchmark is intended for screening and shortlist guidance. Hardware purchasing or deployment decisions still require native validation on the target runtime.

## 3. Output signal boundary

| Output area | Classification | Boundary |
| --- | --- | --- |
| `snapshot` | Inferred / browser-visible | Hardware identity, memory exposure, browser/runtime features, and accelerator labels come from browser APIs plus conservative heuristics. |
| `quick` | Measured | CPU and graphics preflight values are measured in-browser and aggregated with stabilized sampling. |
| `deep.prompts` | Measured | TTFT, per-prompt prefill, decode, prompt match rate, and output previews come from the reference LM run. |
| `deep.repeatability` | Measured | Relative spread is computed from repeated short-prompt runs. |
| `deep.memoryProbe` | Measured | Bounded ArrayBuffer, GPUBuffer, and model-load-class probes are direct browser measurements. |
| `score` | Derived | Overall score, tier, fit guidance, and confidence band are derived from measured results plus inferred browser/runtime context. |
| `estimatedComputeTflopsEq` | Derived / inferred | This is a site-local normalization, not a vendor specification. |

## 4. Workload matrix

### 4.1 Normative browser-native LLM workloads

| Workload ID | Representative browser-native task | Current harness shape | Primary outputs | Scoring role |
| --- | --- | --- | --- | --- |
| `ttft` | Reply-start latency after prompt ingest | First decode step from the short chat prompt | `ttftMs` | Direct deep-score input and confidence input |
| `prefill-short` | Short chat prompt ingest | Deterministic short prompt | `prefillShortTokensPerSecond` | Direct deep-score input |
| `prefill-long` | Long-context prompt ingest | Deterministic long prompt repeated by profile | `prefillLongTokensPerSecond` | Direct deep-score input |
| `structured-output` | Structured output / tool-call emission | Deterministic JSON-like tool prompt | `matchRate`, `decodeTokensPerSecond`, `outputPreview` | Contributes through correctness and decode averages |
| `decode-burst` | Short autoregressive generation | Prompt-suite decode loop | `decodeTokensPerSecond` | Direct deep-score input |
| `decode-sustained` | Longer generation under browser pressure | Long-prompt sustained decode loop | `sustainedTokensPerSecond`, `stabilityDropPct` | Direct deep-score input and stability signal |
| `stability-loops` | Repeatability and bounded context stability | Repeated short runs plus context sweep | `prefillSpread`, `decodeSpread`, `ttftSpread`, `maxStableContextTokens` | Confidence gating and context-fit adjustment |
| `memory-probes` | Bounded browser memory headroom | ArrayBuffer, GPUBuffer, and model-load attempts | `arrayBufferMb`, `gpuBufferMb`, `modelLoadClass` | Memory/context guidance, not a standalone score |

### 4.2 Auxiliary measured preflight

The quick screen is still versioned with this benchmark generation even though it is not a browser-native LM loop.
It remains part of the score input surface because it stabilizes CPU, graphics, and memory-path expectations before the reference LM runs.

| Workload ID | Task | Aggregation | Primary outputs | Scoring role |
| --- | --- | --- | --- | --- |
| `quick-throughput` | CPU worker loop plus graphics probe | Robust median with adaptive extra samples | `cpuScore`, `cpuOpsPerSecond`, `memoryGbps`, `gpuScore`, `gpuGflops` | Direct score input |

## 5. Execution protocol

### 5.1 Runtime selection

- Browser capability detection runs first.
- The score-bearing reference LM uses an explicit ONNX Runtime Web provider chain of `webgpu`, then `wasm`.
- WebNN, WebGL, secure-context state, adapter acquisition, and browser feature support are still captured in the capability snapshot, but they are not alternate score-bearing execution providers in this benchmark generation.
- The `webgpu` session-init attempt uses a `2500 ms` timeout with no retry before the benchmark records the failure reason and moves on.
- The `wasm` fallback attempt uses a `3500 ms` timeout and allows one retry so degraded runs still have a deterministic CPU-class recovery path.
- If the browser is missing secure context, `navigator.gpu`, adapter availability, or required runtime features, the `webgpu` attempt is skipped with an explicit fallback reason before the `wasm` path is tried.
- If the reference LM cannot initialize on any path, the benchmark keeps the quick-screen results and emits a result that relies more heavily on inferred signals.
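
The provider chain above can be sketched as a small fallback loop. This is a minimal illustration under stated assumptions, not the harness implementation: `initReferenceSession`, `InitSession`, `withTimeout`, and the `webgpuSkipReason` parameter are hypothetical names, and the `init` callback stands in for whatever creates the ONNX Runtime Web session. Only the provider order, timeouts, and retry counts come from this section.

```typescript
// Sketch of the section 5.1 provider chain. All names here are hypothetical.
type Provider = "webgpu" | "wasm";

interface ProviderPolicy {
  provider: Provider;
  timeoutMs: number;
  retries: number; // extra attempts allowed after the first one
}

// Order, timeouts, and retry counts as stated in section 5.1.
const PROVIDER_CHAIN: ProviderPolicy[] = [
  { provider: "webgpu", timeoutMs: 2500, retries: 0 },
  { provider: "wasm", timeoutMs: 3500, retries: 1 },
];

// Hypothetical callback that creates a session for one provider.
type InitSession<S> = (provider: Provider) => Promise<S>;

interface InitOutcome<S> {
  session: S | null;
  provider: Provider | null;
  fallbackReasons: string[];
}

// Rejects if `work` does not settle within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms} ms`)), ms);
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (error) => { clearTimeout(timer); reject(error); },
    );
  });
}

async function initReferenceSession<S>(
  init: InitSession<S>,
  webgpuSkipReason: string | null, // e.g. "no secure context"; null means eligible
): Promise<InitOutcome<S>> {
  const fallbackReasons: string[] = [];
  for (const policy of PROVIDER_CHAIN) {
    if (policy.provider === "webgpu" && webgpuSkipReason !== null) {
      // Capability detection already ruled WebGPU out, so no attempt is made.
      fallbackReasons.push(`webgpu skipped: ${webgpuSkipReason}`);
      continue;
    }
    for (let attempt = 0; attempt <= policy.retries; attempt++) {
      try {
        const session = await withTimeout(init(policy.provider), policy.timeoutMs, policy.provider);
        return { session, provider: policy.provider, fallbackReasons };
      } catch (error) {
        const message = error instanceof Error ? error.message : String(error);
        fallbackReasons.push(`${policy.provider} attempt ${attempt + 1}: ${message}`);
      }
    }
  }
  // No path initialized: the caller falls back to the quick screen only.
  return { session: null, provider: null, fallbackReasons };
}
```

Note that the timeout in this sketch only stops waiting; it does not cancel the underlying initialization, so a real harness would also need to release any session that resolves late.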

### 5.2 Warm-up

- The quick CPU worker path performs an untimed JIT and memory-path warm-up inside each worker before the timed loop.
- The deep reference-LM suite performs one non-scored warm-up pass before the measured prompt suite.
- The warm-up result is never emitted as a headline metric.

### 5.3 Profile-specific run budgets

| Profile | Quick base CPU samples | Prompt-suite decode steps | Repeatability runs | Sustained decode steps |
| --- | --- | --- | --- | --- |
| Phone | 2 | 12 | 3 | 48 |
| Tablet | 2 | 16 | 3 | 64 |
| Desktop / laptop default | 3 | 20 | 4 | 72 |
| Desktop with high-parallelism deep profile (`hardwareConcurrency >= 16`) | 3 to 4 | 24 | 4 | 96 |
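
For readers who consume the budgets programmatically, the table above could be encoded as a lookup. This is a sketch only: the identifiers (`ProfileId`, `RunBudget`, `RUN_BUDGETS`, `selectProfile`) and the `formFactor` input are hypothetical, and how the harness actually detects the form factor is outside what this section specifies. The numeric values are copied from the table.

```typescript
// Hypothetical encoding of the section 5.3 run budgets.
type ProfileId = "phone" | "tablet" | "desktop" | "desktop-high-parallelism";

interface RunBudget {
  quickBaseCpuSamples: { min: number; max: number };
  promptSuiteDecodeSteps: number;
  repeatabilityRuns: number;
  sustainedDecodeSteps: number;
}

const RUN_BUDGETS: Record<ProfileId, RunBudget> = {
  phone: {
    quickBaseCpuSamples: { min: 2, max: 2 },
    promptSuiteDecodeSteps: 12,
    repeatabilityRuns: 3,
    sustainedDecodeSteps: 48,
  },
  tablet: {
    quickBaseCpuSamples: { min: 2, max: 2 },
    promptSuiteDecodeSteps: 16,
    repeatabilityRuns: 3,
    sustainedDecodeSteps: 64,
  },
  desktop: {
    quickBaseCpuSamples: { min: 3, max: 3 },
    promptSuiteDecodeSteps: 20,
    repeatabilityRuns: 4,
    sustainedDecodeSteps: 72,
  },
  "desktop-high-parallelism": {
    quickBaseCpuSamples: { min: 3, max: 4 }, // "3 to 4" in the table
    promptSuiteDecodeSteps: 24,
    repeatabilityRuns: 4,
    sustainedDecodeSteps: 96,
  },
};

// Assumes the form factor was already classified elsewhere; only the
// hardwareConcurrency >= 16 rule comes from the table.
function selectProfile(
  formFactor: "phone" | "tablet" | "desktop",
  hardwareConcurrency: number,
): ProfileId {
  if (formFactor !== "desktop") return formFactor;
  return hardwareConcurrency >= 16 ? "desktop-high-parallelism" : "desktop";
}
```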

### 5.4 Prompt suite

- Short prompt: reply-start and short-ingest screening.
- Long prompt: longer context ingest plus sustained decode staging.
- Structured prompt: deterministic tool-call / JSON-like emission.
- Repeatability prompt: repeated short prompt with shorter decode depth to measure variance cheaply.

### 5.5 Aggregation and variance

- Quick CPU outputs use `robustMedian(...)`. When enough samples exist, outliers are filtered with a MAD-based pass before taking the median.
- Quick CPU sampling adds 1 to 2 extra samples when relative spread exceeds `0.10`.
- Quick GPU sampling adds 1 to 2 extra samples when relative spread exceeds `0.12`.
- Deep headline `ttftMs` uses the median across the measured prompt suite.
- Deep repeatability spread uses `relativeSpread = (max - min) / median`.
- Current v2 output emits spread values instead of a formal statistical confidence interval. Confidence is expressed as product confidence bands, not a 95% CI.
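
The aggregation rules above can be illustrated with a short sketch. The spread formula and the `0.10` / `0.12` thresholds come from this section; the minimum sample count for filtering, the MAD cutoff, and the `1.4826` consistency scale are assumptions chosen for illustration, since the spec does not state them. The function bodies, including the one for `robustMedian`, are a plausible reading rather than the shipped implementation.

```typescript
// Plain median of a non-empty sample set.
function median(values: number[]): number {
  if (values.length === 0) throw new Error("median of empty sample set");
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Assumptions, not taken from the spec:
const MIN_SAMPLES_FOR_FILTER = 4; // "enough samples" is not quantified in section 5.5
const MAD_CUTOFF = 3;             // keep samples within 3 scaled MADs of the median
const MAD_SCALE = 1.4826;         // common scale factor relating MAD to a standard deviation

// Median after a MAD-based outlier pass, when enough samples exist.
function robustMedian(values: number[]): number {
  const center = median(values);
  if (values.length < MIN_SAMPLES_FOR_FILTER) return center;
  const mad = median(values.map((v) => Math.abs(v - center)));
  if (mad === 0) return center; // no dispersion to filter against
  const limit = MAD_CUTOFF * MAD_SCALE * mad;
  const kept = values.filter((v) => Math.abs(v - center) <= limit);
  return median(kept);
}

// relativeSpread = (max - min) / median, as defined in section 5.5.
function relativeSpread(values: number[]): number {
  return (Math.max(...values) - Math.min(...values)) / median(values);
}

// Thresholds from section 5.5. How many extra samples (1 or 2) are taken
// once the threshold is exceeded is not specified here.
const EXTRA_SAMPLE_THRESHOLD = { cpu: 0.10, gpu: 0.12 } as const;

function needsExtraSamples(kind: "cpu" | "gpu", values: number[]): boolean {
  return relativeSpread(values) > EXTRA_SAMPLE_THRESHOLD[kind];
}
```

For example, `relativeSpread([10, 11, 12])` evaluates to `2 / 11`, about `0.18`, which would exceed both the CPU and GPU thresholds.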

## 6. Stability and bounded safety rules

- Context probing stops when prefill cost rises above roughly 18 ms per token or a browser failure occurs.
- ArrayBuffer and GPUBuffer probes are bounded to browser-safe candidate sizes and stop on first failure.
- Model-load probing attempts progressively larger hidden sizes and records the largest successful class.
- The benchmark prefers early stop behavior over aggressive probing when the browser becomes unstable.
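
A minimal sketch of these early-stop rules follows. The `18` ms-per-token ceiling and the stop-on-first-failure behavior come from this section; the function names, the candidate lists, and the `measure` / `tryAllocate` callbacks are hypothetical, and treating the last passing candidate as the reported value is an assumption.

```typescript
// "Roughly 18 ms per token" from section 6.
const MAX_PREFILL_MS_PER_TOKEN = 18;

// Hypothetical callback: runs a prefill at the given context length and
// resolves with elapsed milliseconds, or rejects on a browser failure.
type MeasurePrefill = (contextTokens: number) => Promise<number>;

// Walks ascending context sizes and returns the largest one that stayed
// under the per-token ceiling without a failure (0 if none did).
async function probeMaxStableContext(
  candidates: number[],
  measure: MeasurePrefill,
): Promise<number> {
  let maxStable = 0;
  for (const tokens of candidates) {
    let elapsedMs = Number.POSITIVE_INFINITY; // counts as a failure unless measured
    try {
      elapsedMs = await measure(tokens);
    } catch {
      // Browser failure: leave elapsedMs at Infinity so the check below stops the sweep.
    }
    if (elapsedMs / tokens > MAX_PREFILL_MS_PER_TOKEN) break;
    maxStable = tokens;
  }
  return maxStable;
}

// Generic bounded probe: tries ascending sizes and stops on the first failure.
function probeLargest(candidatesMb: number[], tryAllocate: (mb: number) => boolean): number {
  let largest = 0;
  for (const mb of candidatesMb) {
    let ok = false;
    try {
      ok = tryAllocate(mb);
    } catch {
      ok = false; // allocation threw (for example a RangeError)
    }
    if (!ok) break;
    largest = mb;
  }
  return largest;
}

// ArrayBuffer instance of the bounded probe.
function tryArrayBuffer(mb: number): boolean {
  const bytes = mb * 1024 * 1024;
  return new ArrayBuffer(bytes).byteLength === bytes;
}
```

A call such as `probeLargest([64, 128, 256], tryArrayBuffer)` would report the largest candidate that allocated before the first failure; the candidate sizes here are illustrative, not the harness's actual list.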

## 7. Score-version inputs

`scoreVersion: 2026.03-browser-v6` currently consumes these inputs:

| Input family | Current contribution |
| --- | --- |
| Quick CPU | 20% with deep run, 31% without deep run |
| Quick GPU | 20% with deep run, 31% without deep run |
| Memory estimate / probe-adjusted memory class | 18% with deep run, 25% without deep run |
| Runtime-path coverage | 10% with deep run, 13% without deep run |
| Deep suite composite | 32% when a measured deep run exists |

The deep-suite composite is currently weighted as:

| Deep input | Weight |
| --- | --- |
| Short prefill | 10% |
| Long prefill | 18% |
| Decode throughput | 20% |
| Sustained decode | 16% |
| KV-cache append proxy | 8% |
| Correctness / structured prompt match | 18% |
| TTFT | 10% |

Any change to the workload list, aggregation rule, threshold, or score weighting requires a benchmark-version review and usually a version bump.
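
The two weight tables in this section can be read as a nested weighted sum. The sketch below uses the published percentages; everything else is an assumption for illustration, including the identifier names and the premise that each input has already been normalized to a common 0 to 100 scale (the normalization formulas are not part of this section).

```typescript
type QuickFamily = "quickCpu" | "quickGpu" | "memory" | "runtimeCoverage";
type DeepInput =
  | "shortPrefill" | "longPrefill" | "decode" | "sustainedDecode"
  | "kvCacheAppend" | "correctness" | "ttft";

// Percentages from section 7, expressed as fractions.
const WEIGHTS_WITH_DEEP: Record<QuickFamily | "deep", number> = {
  quickCpu: 0.20, quickGpu: 0.20, memory: 0.18, runtimeCoverage: 0.10, deep: 0.32,
};
const WEIGHTS_WITHOUT_DEEP: Record<QuickFamily, number> = {
  quickCpu: 0.31, quickGpu: 0.31, memory: 0.25, runtimeCoverage: 0.13,
};
const DEEP_WEIGHTS: Record<DeepInput, number> = {
  shortPrefill: 0.10, longPrefill: 0.18, decode: 0.20, sustainedDecode: 0.16,
  kvCacheAppend: 0.08, correctness: 0.18, ttft: 0.10,
};

// Sum of weight * score over every key in `weights`.
function weightedSum<K extends string>(
  weights: Record<K, number>,
  scores: Record<K, number>,
): number {
  let total = 0;
  for (const key of Object.keys(weights) as K[]) {
    total += weights[key] * scores[key];
  }
  return total;
}

// Inputs are assumed to be pre-normalized sub-scores on a shared scale.
// Pass `deep = null` when no measured deep run exists.
function overallScore(
  quick: Record<QuickFamily, number>,
  deep: Record<DeepInput, number> | null,
): number {
  if (deep === null) return weightedSum(WEIGHTS_WITHOUT_DEEP, quick);
  const deepComposite = weightedSum(DEEP_WEIGHTS, deep);
  return weightedSum(WEIGHTS_WITH_DEEP, { ...quick, deep: deepComposite });
}
```

As a worked example under these assumptions, quick inputs of 80 across the board with deep inputs of 50 across the board give `0.68 * 80 + 0.32 * 50 = 70.4`.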

## 8. Confidence bands

The confidence band is a derived product signal, not a direct measurement.

| Band | Current rule of thumb |
| --- | --- |
| High | Measured reference run completed on a non-hardware-estimate path, deployment confidence is at least `0.78`, correctness is at least `0.99`, prefill spread is at most `0.12`, decode spread is at most `0.12`, and TTFT spread is at most `0.18`. |
| Medium | Deployment confidence is at least `0.60`, but the run is noisier, more partial, or less fully covered than the high-confidence path. |
| Low | Mostly inferred or too incomplete / noisy for more than directional guidance. |
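
The band rules above can be approximated as a small classifier. The thresholds are taken from the table; the field names and the exact shape of `ConfidenceInputs` are hypothetical. Because the table describes rules of thumb, this sketch makes one simplifying assumption: any run that misses the High criteria but has deployment confidence of at least `0.60` is treated as Medium, and everything else as Low.

```typescript
type ConfidenceBand = "high" | "medium" | "low";

// Hypothetical input shape; the real payload fields may differ.
interface ConfidenceInputs {
  measuredRunCompleted: boolean; // reference LM run finished
  hardwareEstimatePath: boolean; // result relied on a hardware-estimate path
  deploymentConfidence: number;  // 0..1
  correctness: number;           // 0..1 structured-prompt match rate
  prefillSpread: number;
  decodeSpread: number;
  ttftSpread: number;
}

function classifyBand(input: ConfidenceInputs): ConfidenceBand {
  const meetsHigh =
    input.measuredRunCompleted &&
    !input.hardwareEstimatePath &&
    input.deploymentConfidence >= 0.78 &&
    input.correctness >= 0.99 &&
    input.prefillSpread <= 0.12 &&
    input.decodeSpread <= 0.12 &&
    input.ttftSpread <= 0.18;
  if (meetsHigh) return "high";
  // Simplification: deployment confidence is the only Medium gate here.
  if (input.deploymentConfidence >= 0.60) return "medium";
  return "low";
}
```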

## 9. Changelog notes

### `benchmarkVersion: 2026.03-browser-benchmark-v2`

- First versioned benchmark contract for browser-native local LLM screening.
- Separates inferred browser capability detection from measured score-bearing workloads.
- Standardizes the reference browser harness on ONNX Runtime Web.
- Adds an explicit structured output / tool-call prompt to the measured workload set.
- Makes repeatability spreads, bounded context probing, and bounded memory probes part of the published contract.
- Publishes the score-version input surface alongside the workload matrix so methodology and stored result metadata stay aligned.

### `manifestVersion: 2026.03-browser-manifest-v2`

- Extends the active release lineage with adapter-class capability telemetry and deterministic WebGPU-to-WASM fallback policy metadata.
- Aligns the published methodology with the explicit session-init timeout and retry behavior used by the runtime benchmark harness.
- Tightens degraded-path guidance so weak fallback-only evidence cannot surface as high-confidence shortlist output.

## 10. Review checklist for future benchmark-version changes

- Confirm the workload matrix still represents real browser-native local LLM behavior rather than only a microbenchmark.
- Confirm measured vs inferred vs derived output boundaries are still explicit and accurate.
- Bump `benchmarkVersion` when prompt classes, workload IDs, aggregation rules, thresholds, or confidence-band rules change.
- Bump `scoreVersion` when score weights, score inputs, or score normalization formulas change.
- Bump `manifestVersion` and add a changelog entry when recommendation or score behavior changes.
- Update methodology copy and the linked spec document in the same change.
- Reject the review if any sponsor, referral CTA, or paid placement can change model order, runtime order, upgrade guidance, or confidence language.
- Confirm that any monetized CTA still renders in a separately labeled surface after the recommendation rationale, and that the disclosure policy, privacy page, and FAQ stay aligned.
- Verify stored result payloads, share pages, telemetry exports, and public methodology surfaces all agree on the current versions.
- Re-run benchmark-related tests plus the documented repo verification gate before review.
