What's My AI

Runtime fit page built from tracked setup paths and model-fit coverage.

best model

Best local models for llama.cpp

Best for people who care about low-level control, serving flags, and GGUF tuning. This page ranks cleaner starting points first, then links you into the model pages when you need exact memory and hardware guidance.
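
If "serving flags and GGUF tuning" sounds abstract, the sketch below shows the kind of knobs involved, using the llama-cpp-python bindings rather than the raw CLI. The model path and parameter values are placeholders assumed for illustration, not recommendations from this page.

```python
# Minimal sketch of llama.cpp's low-level knobs via the llama-cpp-python
# bindings (pip install llama-cpp-python). Model path and values below are
# placeholders; pick them from the model page and your own hardware.
from llama_cpp import Llama

llm = Llama(
    model_path="models/granite-4.0-micro.Q4_K_M.gguf",  # any local GGUF file
    n_ctx=4096,        # context window you actually need, not the model max
    n_threads=8,       # CPU threads used for generation
    n_gpu_layers=0,    # 0 = pure CPU; raise only if built with GPU offload
)

out = llm("Explain what a GGUF quantization level is in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```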

starter pick: gpt-oss-20b
tracked models: 11
tier span: 3B to Frontier MoE
runtime tradeoff: asks for more setup fluency than Ollama or LM Studio before you get to the first answer

benchmark first

Benchmark before you commit to a llama.cpp download.

llama.cpp asks for more setup fluency than Ollama or LM Studio before you get to the first answer. The benchmark is still the fastest way to confirm whether your machine belongs in the size band that makes llama.cpp worth using.

start here

Best first models for llama.cpp

why this runtime

What you are choosing with llama.cpp

  • Best for: People who want the most control over quantization, serving shape, and local inference knobs.
  • Tradeoff: It asks for more setup fluency than Ollama or LM Studio before you get to the first answer.
  • Benchmark flow: Use the benchmark first when the question is about your machine, then use this page to choose the cleanest first pull inside llama.cpp; a rough sizing sketch follows this list.
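
As a companion to the benchmark flow, here is a rough sizing sketch in Python. The bytes-per-parameter figures are back-of-the-envelope assumptions for common GGUF quant levels, not exact file sizes; treat the benchmark and the per-model pages as the real memory guidance.

```python
# Rough sizing sketch: estimate whether a GGUF quant is likely to fit in RAM.
# The bytes-per-parameter figures are approximations (assumptions), e.g.
# ~0.55 bytes/param for Q4_K_M and ~1.05 for Q8_0.
QUANT_BYTES_PER_PARAM = {"Q4_K_M": 0.55, "Q5_K_M": 0.68, "Q8_0": 1.05}

def estimated_gguf_gb(params_billion: float, quant: str, overhead_gb: float = 1.0) -> float:
    """Model weights plus a small allowance for KV cache and runtime overhead."""
    weights_gb = params_billion * QUANT_BYTES_PER_PARAM[quant]
    return weights_gb + overhead_gb

for model, size_b in [("Granite 4.0 Micro", 3), ("Llama 3.1 8B", 8), ("Gemma 3 12B", 12)]:
    print(f"{model}: ~{estimated_gguf_gb(size_b, 'Q4_K_M'):.1f} GB at Q4_K_M")
```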

broader catalog

More tracked models for llama.cpp

Granite 4.0 Micro

3B class • 2.5 GB minimum

IBM's smallest Granite 4.0 instruct release is a pragmatic US-origin starter for local chat, extraction, and agent scaffolding. A minimal pull-and-run sketch follows this card.

Open model page
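
The sketch below shows a first pull-and-run for a Granite 4.0 Micro GGUF via llama-cpp-python (it also needs huggingface_hub installed). The Hugging Face repo id and filename pattern are hypothetical placeholders; copy the exact ids from the model page instead.

```python
# Sketch of a first pull-and-run for a Granite 4.0 Micro GGUF. The repo id
# and filename glob below are placeholders, not real published ids.
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="your-org/granite-4.0-micro-GGUF",   # placeholder repo id
    filename="*Q4_K_M.gguf",                     # glob selects one quant file
    n_ctx=2048,
)

reply = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize this page in one line."}]
)
print(reply["choices"][0]["message"]["content"])
```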

OLMo 3 Instruct 7B

7B class • 5.0 GB minimum

Ai2's 7B instruct release is the clearest Apache-licensed American alternative to Llama when you want a smaller fully open local model.

Open model page

Llama 3.1 8B

7B class • 6.5 GB minimum

Meta's 8B instruct release remains the safest broad-compatibility US local model when you want maximum runtime coverage.

Open model page

Phi-4-reasoning

13B class • 8.5 GB minimum

Phi-4-reasoning is the clearest text-first American recommendation around the 13B class when you care about reasoning quality more than multimodal extras.

Open model page

Gemma 3 12B

13B class • 11.0 GB minimum

Gemma 3 12B stays interesting when you want a smaller multimodal American model, but it is less turnkey than Phi-4-reasoning for plain text work.

Open model page

Granite 4.0 H-Small

34B class • 19.5 GB minimum

Granite 4.0 H-Small is a credible American midrange choice for RAG-heavy work, but it is more specialized than the general-purpose winners above it.

Open model page

runtime smoke

Monthly runtime smoke matrix

Each row installs or updates the tracked runtime, downloads the starter model, and proves one local inference with the pinned prompt bundle.

These rows use hosted CPU runners so stale guidance is visible before the public install copy drifts too far from reality.
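
For readers who want to reproduce a row locally, here is a hedged sketch of what one smoke run amounts to, not the project's actual CI code: prove the installed binary can load the starter GGUF, answer one pinned prompt, and record what was tested. The binary name and flags (llama-cli, -m, -p, -n) match current llama.cpp releases but should be re-checked against the tool's --help output.

```python
# Sketch of a single smoke-matrix row: version check, one short inference
# against the starter model, and a record of when it was verified.
import datetime
import subprocess

MODEL = "models/granite-4.0-micro.Q4_K_M.gguf"      # placeholder local path
PINNED_PROMPT = "Reply with the single word OK."     # stand-in for the prompt bundle

def smoke_run() -> dict:
    version = subprocess.run(
        ["llama-cli", "--version"], capture_output=True, text=True
    )
    inference = subprocess.run(
        ["llama-cli", "-m", MODEL, "-p", PINNED_PROMPT, "-n", "16"],
        capture_output=True, text=True, timeout=600,
    )
    return {
        "verified_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "runtime_version": (version.stdout or version.stderr).strip(),
        "inference_ok": inference.returncode == 0 and bool(inference.stdout.strip()),
    }

if __name__ == "__main__":
    print(smoke_run())
```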

llama.cpp

Runtime guidance currently needs review

Last verified: Not yet verified

Tested runtime version: Not yet verified

Monthly smoke cadence (31-day review window)

Prompt bundle: 2026.03-reference-lm-prompts-v1

Linux

GitHub-hosted Ubuntu x64 CPU runner

Install recipe: Install the latest llama.cpp prebuilt CPU release for Ubuntu before each run.

Last verified: Not yet verified

Tested version: Not yet verified

Model pull: Granite 4.0 Micro GGUF

Stale: No successful monthly smoke run recorded yet.

macOS

GitHub-hosted macOS CPU runner

Install recipe: Install the latest llama.cpp prebuilt binary release for macOS before each run.

Last verified: Not yet verified

Tested version: Not yet verified

Model pull: Granite 4.0 Micro GGUF

Stale: No successful monthly smoke run recorded yet.

Windows

GitHub-hosted Windows x64 CPU runner

Install recipe: Install the latest llama.cpp prebuilt CPU release for Windows before each run.

Last verified: Not yet verified

Tested version: Not yet verified

Model pull: Granite 4.0 Micro GGUF

Stale: No successful monthly smoke run recorded yet.
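
The Linux, macOS, and Windows rows above all share the same "install the latest prebuilt release" recipe. The sketch below is one hedged way to locate that release asset from GitHub; the repository path and the asset-name keywords are assumptions about how llama.cpp currently publishes builds, so verify them against the project's releases page before automating anything.

```python
# Hedged sketch: find the latest prebuilt llama.cpp release asset for a
# platform. Asset-name keywords are assumptions and may drift between releases.
import json
import urllib.request

RELEASES_URL = "https://api.github.com/repos/ggml-org/llama.cpp/releases/latest"
PLATFORM_KEYWORD = {"linux": "ubuntu-x64", "macos": "macos", "windows": "win"}

def latest_asset_url(platform: str) -> str:
    with urllib.request.urlopen(RELEASES_URL) as resp:
        release = json.load(resp)
    keyword = PLATFORM_KEYWORD[platform]
    for asset in release["assets"]:
        if keyword in asset["name"].lower():
            return asset["browser_download_url"]
    raise RuntimeError(f"No release asset matched keyword {keyword!r}")

print(latest_asset_url("linux"))
```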