# What'sMy.AI reproducibility bundle v1

Status: active
Format: reproducibility-bundle-v1
Published: 2026-03-12

## Purpose

This bundle lets third-party reviewers validate published cohort-level benchmark behavior without receiving user-identifying traces.

The admin export route emits this format when called with the `format` query parameter:

```text
GET /api/admin/export?format=reproducibility-bundle-v1
```
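As a rough illustration, a reviewer's script might build the request URL as follows. This is a sketch only: `buildBundleExportUrl` and the origin passed to it are hypothetical, and authentication for the admin route is not described here, so it is left out.

```ts
// Format identifier taken from this document.
const BUNDLE_FORMAT = "reproducibility-bundle-v1";

// Hypothetical helper: builds the export URL for a given deployment origin.
// Only the path and the `format` parameter come from this document.
function buildBundleExportUrl(origin: string): string {
  // Drop trailing slashes so the path is not doubled up.
  const base = origin.replace(/\/+$/, "");
  return `${base}/api/admin/export?format=${encodeURIComponent(BUNDLE_FORMAT)}`;
}

// Usage (placeholder origin; credentials are deployment-specific):
// const response = await fetch(buildBundleExportUrl("https://example.invalid"));
// const bundle = await response.json();
```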

The default admin export response remains the raw telemetry export used by internal growth and calibration tooling.

## Privacy boundary

Bundle v1 intentionally excludes or transforms values that could reasonably identify one machine or person.

- Stripped:
  - `userAgent`
  - `userAgentHints`
  - share slugs
  - exact `createdAt` / `capturedAt` timestamps
  - viewport dimensions
  - quick/deep notes
  - deep prompt `outputPreview`
- Hashed:
  - cohort key
  - CPU label
  - GPU label
  - NPU label
- Bucketed:
  - `hardwareConcurrency`
  - `deviceMemoryGb`

Only published cohorts are exported: every exported run belongs to a device cohort whose shared benchmark runs already met the internal publication threshold.
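A reviewer can spot-check the stripped list by scanning a parsed bundle for key names that should never appear. The sketch below is a minimal, assumed approach: `findStrippedKeys` is a hypothetical helper, and it only covers the fields this document names by key. Share slugs, viewport dimensions, and notes are skipped because their key names are not given here.

```ts
// Key names taken from the "Stripped" list above.
const STRIPPED_KEYS = ["userAgent", "userAgentHints", "createdAt", "capturedAt", "outputPreview"];

// Recursively collects the paths of any stripped keys found in a parsed JSON value.
// Only object keys are checked, so string values such as the entries of
// `anonymization.strippedFields` are not reported.
function findStrippedKeys(value: unknown, path: string = "$"): string[] {
  let hits: string[] = [];
  if (Array.isArray(value)) {
    for (let i = 0; i < value.length; i += 1) {
      hits = hits.concat(findStrippedKeys(value[i], `${path}[${i}]`));
    }
  } else if (value !== null && typeof value === "object") {
    const record = value as Record<string, unknown>;
    for (const key of Object.keys(record)) {
      const childPath = `${path}.${key}`;
      if (STRIPPED_KEYS.indexOf(key) !== -1) hits.push(childPath);
      hits = hits.concat(findStrippedKeys(record[key], childPath));
    }
  }
  return hits;
}
```

An empty result does not prove that the bundle is anonymous; it only shows that none of the named keys leaked through.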

## Top-level shape

```ts
interface ReproducibilityBundleV1 {
  format: "reproducibility-bundle-v1";
  generatedAt: string;
  benchmarkManifest: {
    schemaVersion: number;
    benchmarkVersion: string;
    scoreVersion: string;
    benchmarkSpecPath: string;
    bundleSpecPath: string;
    runtimeDecision: { label: string; summary: string };
    inferredSignals: Array<{ id: string; label: string }>;
    measuredWorkloads: Array<{ id: string; label: string }>;
    repeatability: Array<{ id: string; label: string }>;
  };
  anonymization: {
    strippedFields: string[];
    hashedFields: string[];
    bucketedFields: string[];
  };
  deviceCohorts: ReproducibilityBundleCohortV1[];
  runs: ReproducibilityBundleRunV1[];
}
```
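Before doing any recomputation, it can help to confirm that a downloaded file has the expected top-level shape. The following is a minimal sketch, not a full validator: `looksLikeBundleV1` is a hypothetical name, and nested fields are deliberately not checked.

```ts
// Shallow structural check on a parsed export; nested fields are not validated.
function looksLikeBundleV1(value: unknown): boolean {
  if (value === null || typeof value !== "object") return false;
  const candidate = value as Record<string, unknown>;
  return (
    candidate.format === "reproducibility-bundle-v1" &&
    typeof candidate.generatedAt === "string" &&
    candidate.benchmarkManifest !== null &&
    typeof candidate.benchmarkManifest === "object" &&
    candidate.anonymization !== null &&
    typeof candidate.anonymization === "object" &&
    Array.isArray(candidate.deviceCohorts) &&
    Array.isArray(candidate.runs)
  );
}
```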

## Cohorts

Each `deviceCohorts[]` entry represents one published device cohort after anonymization.

```ts
interface ReproducibilityBundleCohortV1 {
  cohortId: string;
  anonymizedMetadata: {
    platform: string;
    deviceClass: "desktop" | "tablet" | "phone";
    hardwareConcurrencyBucket: string;
    memoryBucket: string;
    browserMix: string[];
    webgpuSupported: boolean;
    webnnSupported: boolean;
    cpuSignature: string;
    gpuSignature: string;
    npuSignature: string;
  };
  publishedSummary: {
    runs: number;
    averageScore: number;
    averageComputeTflopsEq: number | null;
    averageDecodeTokensPerSecond: number | null;
    averageSustainedTokensPerSecond: number | null;
    averageTtftMs: number | null;
    averageBrowserConfidence: number;
    averageDeploymentConfidence: number;
    bestTier: string;
    stabilityLabel: string;
    confidenceBand: "high" | "medium" | "low";
    browserMix: string[];
    representativeSampleId: string;
  };
}
```

## Runs

Each `runs[]` entry is one anonymized benchmark sample from a published cohort.

```ts
interface ReproducibilityBundleRunV1 {
  sampleId: string;
  cohortId: string;
  browser: string;
  metricSummary: {
    quick: {
      durationMs: number;
      cpuMsPerOp: number;
      cpuScore: number;
      cpuWorkers: number;
      cpuOpsPerSecond: number;
      memoryGbps: number;
      gpuMode: "webgpu" | "webgl" | "none";
      gpuScore: number;
      gpuGflops: number | null;
      gpuMemoryScore: number | null;
      mobileReducedMode: boolean;
    };
    deep: null | {
      provider: "webgpu" | "webnn" | "wasm" | "unavailable";
      durationMs: number;
      compileMs: number;
      ttftMs: number;
      prefillShortTokensPerSecond: number;
      prefillLongTokensPerSecond: number;
      prefillTokensPerSecond: number;
      decodeTokensPerSecond: number;
      sustainedTokensPerSecond: number;
      stabilityDropPct: number;
      kvCacheAppendTokensPerSecond: number;
      correctnessRate: number;
      maxStableContextTokens: number;
      repeatability: {
        prefillSpread: number;
        decodeSpread: number;
        ttftSpread: number;
        sustainedSpread: number;
      };
      memoryProbe: {
        arrayBufferMb: number;
        gpuBufferMb: number | null;
        modelLoadClass: "tiny" | "small" | "medium" | "large";
      };
      promptSummaries: Array<{
        id: string;
        promptTokens: number;
        generatedTokens: number;
        ttftMs: number;
        prefillTokensPerSecond: number;
        decodeTokensPerSecond: number;
        matchRate: number;
        passed: boolean;
      }>;
    };
    scoreInputs: {
      overallScore: number;
      estimatedComputeTflopsEq: number | null;
      browserConfidence: number;
      deploymentConfidence: number;
      confidenceBand: "high" | "medium" | "low";
      cpuClass: string;
      gpuClass: string;
      memoryClass: string;
      pairedCalibrationSamples: number;
      stabilityLabel: string;
      subscores: {
        cpu: number;
        gpu: number;
        memory: number;
        runtime: number;
        deep: number;
      };
    };
  };
  recommendationOutput: {
    tier: string;
    estimatedModelFit: string;
    experienceLabel: "barely usable" | "comfortable" | "strong";
    confidenceBand: "high" | "medium" | "low";
    runtimeRecommendations: Array<{
      id: "ollama" | "lm-studio" | "llama-cpp";
      priority: "best-start" | "best-gui" | "best-performance";
      starter: string;
    }>;
    compatibleModelIds: string[];
    stretchModelIds: string[];
    upgradeGuidance: {
      path: string;
      title: string;
    };
    blockers: string[];
    caveats: string[];
  };
}
```
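Because every run carries a `cohortId`, a reviewer can group runs by cohort and flag any run whose cohort is missing from `deviceCohorts[]`. The helpers below are a hypothetical sketch. They assume that a well-formed export never contains such orphan runs, which this document implies but does not state outright.

```ts
// Only the fields this sketch reads; the full interfaces are defined above.
interface CohortRef { cohortId: string }
interface RunRef { sampleId: string; cohortId: string }

// Groups runs by cohortId, using a plain object as the lookup table.
function groupRunsByCohort<T extends RunRef>(runs: T[]): Record<string, T[]> {
  const groups: Record<string, T[]> = {};
  for (const run of runs) {
    if (!Object.prototype.hasOwnProperty.call(groups, run.cohortId)) {
      groups[run.cohortId] = [];
    }
    groups[run.cohortId].push(run);
  }
  return groups;
}

// Returns the sampleId of every run whose cohortId has no matching cohort entry.
function findOrphanRuns(cohorts: CohortRef[], runs: RunRef[]): string[] {
  const known: Record<string, boolean> = {};
  for (const cohort of cohorts) known[cohort.cohortId] = true;
  const orphans: string[] = [];
  for (const run of runs) {
    if (!Object.prototype.hasOwnProperty.call(known, run.cohortId)) {
      orphans.push(run.sampleId);
    }
  }
  return orphans;
}
```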

## Reproducing published cohort stats

To recompute a cohort summary:

1. Filter `runs[]` to one `cohortId`.
2. Recompute each average from the corresponding `metricSummary` values of those runs.
3. Round using the same public display precision:
   - `averageScore`: average rounded to one decimal place, then to the nearest integer
   - `averageComputeTflopsEq`: 2 decimals
   - `averageDecodeTokensPerSecond`: 1 decimal
   - `averageSustainedTokensPerSecond`: 1 decimal
   - `averageTtftMs`: 1 decimal
   - `averageBrowserConfidence`: 2 decimals
   - `averageDeploymentConfidence`: 2 decimals
4. Rebuild `browserMix` from unique `browser` values.
5. Rebuild `bestTier` as the highest `recommendationOutput.tier` among the cohort's runs.
6. Rebuild `confidenceBand` from the cohort's run count plus the averaged deployment confidence.
7. Select the representative sample as the run whose `metricSummary.scoreInputs.overallScore` is closest to the cohort average score, then take `stabilityLabel` from that run's `scoreInputs`.
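The steps above can be sketched roughly as follows. Several details are assumptions rather than documented behavior: the mapping from summary fields to run fields is inferred from their names, null values are skipped when averaging, ties for the representative sample go to the first run, and `browserMix` keeps first-seen order. Steps 5 and 6 are left out because the tier ordering and confidence-band thresholds are not specified here. `recomputeCohortSummary` and its helpers are hypothetical names.

```ts
// Only the run fields this sketch reads; see ReproducibilityBundleRunV1 for the full shape.
interface RunForSummary {
  sampleId: string;
  browser: string;
  metricSummary: {
    deep: null | { decodeTokensPerSecond: number; sustainedTokensPerSecond: number; ttftMs: number };
    scoreInputs: {
      overallScore: number;
      estimatedComputeTflopsEq: number | null;
      browserConfidence: number;
      deploymentConfidence: number;
      stabilityLabel: string;
    };
  };
}

// Rounds to a fixed number of decimals with Math.round.
function roundTo(value: number, decimals: number): number {
  const factor = Math.pow(10, decimals);
  return Math.round(value * factor) / factor;
}

// Assumption: null inputs are skipped, and an all-null column averages to null.
function roundedMean(values: Array<number | null>, decimals: number): number | null {
  let sum = 0;
  let count = 0;
  for (const value of values) {
    if (value !== null) {
      sum += value;
      count += 1;
    }
  }
  return count === 0 ? null : roundTo(sum / count, decimals);
}

// Expects the runs of a single cohort (step 1 has already been applied).
function recomputeCohortSummary(cohortRuns: RunForSummary[]) {
  if (cohortRuns.length === 0) throw new Error("cohort has no runs");
  const inputs = cohortRuns.map((run) => run.metricSummary.scoreInputs);
  const deeps = cohortRuns.map((run) => run.metricSummary.deep);
  const meanScore = inputs.reduce((sum, s) => sum + s.overallScore, 0) / inputs.length;

  // Step 7: run closest to the mean. Assumptions: unrounded mean, first run wins ties.
  let representative = cohortRuns[0];
  for (const run of cohortRuns) {
    const best = Math.abs(representative.metricSummary.scoreInputs.overallScore - meanScore);
    if (Math.abs(run.metricSummary.scoreInputs.overallScore - meanScore) < best) representative = run;
  }

  // Step 4: unique browsers, kept in first-seen order (ordering is an assumption).
  const browserMix: string[] = [];
  for (const run of cohortRuns) {
    if (browserMix.indexOf(run.browser) === -1) browserMix.push(run.browser);
  }

  return {
    runs: cohortRuns.length,
    // One reading of the averageScore rule: one decimal first, then nearest integer.
    averageScore: Math.round(roundTo(meanScore, 1)),
    averageComputeTflopsEq: roundedMean(inputs.map((s) => s.estimatedComputeTflopsEq), 2),
    averageDecodeTokensPerSecond: roundedMean(deeps.map((d) => (d ? d.decodeTokensPerSecond : null)), 1),
    averageSustainedTokensPerSecond: roundedMean(deeps.map((d) => (d ? d.sustainedTokensPerSecond : null)), 1),
    averageTtftMs: roundedMean(deeps.map((d) => (d ? d.ttftMs : null)), 1),
    averageBrowserConfidence: roundedMean(inputs.map((s) => s.browserConfidence), 2),
    averageDeploymentConfidence: roundedMean(inputs.map((s) => s.deploymentConfidence), 2),
    browserMix,
    stabilityLabel: representative.metricSummary.scoreInputs.stabilityLabel,
    representativeSampleId: representative.sampleId,
  };
}
```

If a recomputed value disagrees with the published one, the assumptions listed above are the first place to look before concluding that the export is inconsistent.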

The included `deviceCohorts[].publishedSummary` is the canonical output produced by the current implementation. Reviewers can compare their recomputed summary against that value to verify the export.
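One way to do that comparison is a field-by-field diff over the scalar summary values. The sketch below is illustrative only: `diffSummaries` is a hypothetical helper, the `1e-9` tolerance is an arbitrary choice to absorb floating-point noise, and array fields such as `browserMix` would need a separate comparison.

```ts
type SummaryValue = number | string | null;

// Numbers match within a small tolerance; everything else must be strictly equal.
function valuesMatch(a: SummaryValue | undefined, b: SummaryValue | undefined): boolean {
  if (typeof a === "number" && typeof b === "number") return Math.abs(a - b) <= 1e-9;
  return a === b;
}

// Returns the names of recomputed fields that disagree with the published summary.
function diffSummaries(
  published: Record<string, SummaryValue>,
  recomputed: Record<string, SummaryValue>
): string[] {
  const mismatches: string[] = [];
  for (const key of Object.keys(recomputed)) {
    if (!valuesMatch(published[key], recomputed[key])) mismatches.push(key);
  }
  return mismatches;
}
```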

## Versioning policy

- Change the `format` string when the bundle structure or anonymization contract changes.
- Keep the `benchmarkManifest` aligned with the active benchmark and score versions.
- Publish a new Markdown contract alongside any future bundle version so old exports remain auditable.
