What's My AI

Runtime fit page built from tracked setup paths and model-fit coverage.

best model

Best local models for llama.cpp

Best for people who care about low-level control, serving flags, and GGUF tuning. This page ranks cleaner starting points first, then links you into the model pages when you need exact memory and hardware guidance.
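
If "serving flags and GGUF tuning" sounds abstract, the sketch below shows the kind of knobs involved, using the llama-cpp-python bindings rather than the raw CLI. The model path and parameter values are placeholders assumed for illustration, not recommendations from this page.

```python
# Minimal sketch of llama.cpp's low-level knobs via the llama-cpp-python
# bindings (pip install llama-cpp-python). Model path and values below are
# placeholders; pick them from the model page and your own hardware.
from llama_cpp import Llama

llm = Llama(
    model_path="models/granite-4.0-micro.Q4_K_M.gguf",  # any local GGUF file
    n_ctx=4096,        # context window you actually need, not the model max
    n_threads=8,       # CPU threads used for generation
    n_gpu_layers=0,    # 0 = pure CPU; raise only if built with GPU offload
)

out = llm("Explain what a GGUF quantization level is in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```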

starter pick: gpt-oss-20b
tracked models: 11
tier span: 3B to Frontier MoE
runtime tradeoff: asks for more setup fluency than Ollama or LM Studio before you get to the first answer

benchmark first

Benchmark before you commit to a llama.cpp download.

llama.cpp asks for more setup fluency than Ollama or LM Studio before you get to the first answer. The benchmark is still the fastest way to confirm whether your machine belongs in the size band that makes llama.cpp worth using.

start here

Best first models for llama.cpp

why this runtime

What you are choosing with llama.cpp

  • Best for: People who want the most control over quantization, serving shape, and local inference knobs.
  • Tradeoff: It asks for more setup fluency than Ollama or LM Studio before you get to the first answer.
  • Benchmark flow: Use the benchmark first when the question is about your machine, then use this page to choose the cleanest first pull inside llama.cpp; a rough sizing sketch follows this list.
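
As a companion to the benchmark flow, here is a rough sizing sketch in Python. The bytes-per-parameter figures are back-of-the-envelope assumptions for common GGUF quant levels, not exact file sizes; treat the benchmark and the per-model pages as the real memory guidance.

```python
# Rough sizing sketch: estimate whether a GGUF quant is likely to fit in RAM.
# The bytes-per-parameter figures are approximations (assumptions), e.g.
# ~0.55 bytes/param for Q4_K_M and ~1.05 for Q8_0.
QUANT_BYTES_PER_PARAM = {"Q4_K_M": 0.55, "Q5_K_M": 0.68, "Q8_0": 1.05}

def estimated_gguf_gb(params_billion: float, quant: str, overhead_gb: float = 1.0) -> float:
    """Model weights plus a small allowance for KV cache and runtime overhead."""
    weights_gb = params_billion * QUANT_BYTES_PER_PARAM[quant]
    return weights_gb + overhead_gb

for model, size_b in [("Granite 4.0 Micro", 3), ("Llama 3.1 8B", 8), ("Gemma 3 12B", 12)]:
    print(f"{model}: ~{estimated_gguf_gb(size_b, 'Q4_K_M'):.1f} GB at Q4_K_M")
```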

broader catalog

More tracked models for llama.cpp

Granite 4.0 Micro

3B class • 2.5 GB minimum

IBM's smallest Granite 4.0 instruct release is a pragmatic US-origin starter for local chat, extraction, and agent scaffolding. A minimal pull-and-run sketch follows this card.

Open model page
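
The sketch below shows a first pull-and-run for a Granite 4.0 Micro GGUF via llama-cpp-python (it also needs huggingface_hub installed). The Hugging Face repo id and filename pattern are hypothetical placeholders; copy the exact ids from the model page instead.

```python
# Sketch of a first pull-and-run for a Granite 4.0 Micro GGUF. The repo id
# and filename glob below are placeholders, not real published ids.
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="your-org/granite-4.0-micro-GGUF",   # placeholder repo id
    filename="*Q4_K_M.gguf",                     # glob selects one quant file
    n_ctx=2048,
)

reply = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize this page in one line."}]
)
print(reply["choices"][0]["message"]["content"])
```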

OLMo 3 Instruct 7B

7B class • 5.0 GB minimum

Ai2's 7B instruct release is the clearest Apache-licensed American alternative to Llama when you want a smaller fully open local model.

Open model page

Llama 3.1 8B

7B class • 6.5 GB minimum

Meta's 8B instruct release remains the safest broad-compatibility US local model when you want maximum runtime coverage.

Open model page

Phi-4-reasoning

13B class • 8.5 GB minimum

Phi-4-reasoning is the clearest text-first American recommendation around the 13B class when you care about reasoning quality more than multimodal extras.

Open model page

Gemma 3 12B

13B class • 11.0 GB minimum

Gemma 3 12B stays interesting when you want a smaller multimodal American model, but it is less turnkey than Phi-4-reasoning for plain text work.

Open model page

Granite 4.0 H-Small

34B class • 19.5 GB minimum

Granite 4.0 H-Small is a credible American midrange choice for RAG-heavy work, but it is more specialized than the general-purpose winners above it.

Open model page

runtime smoke

Monthly runtime smoke matrix

Each row installs or updates the tracked runtime, downloads the starter model, and proves one local inference with the pinned prompt bundle.

These rows use hosted CPU runners so stale guidance is visible before the public install copy drifts too far from reality.
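
For readers who want to reproduce a row locally, here is a hedged sketch of what one smoke run amounts to, not the project's actual CI code: prove the installed binary can load the starter GGUF, answer one pinned prompt, and record what was tested. The binary name and flags (llama-cli, -m, -p, -n) match current llama.cpp releases but should be re-checked against the tool's --help output.

```python
# Sketch of a single smoke-matrix row: version check, one short inference
# against the starter model, and a record of when it was verified.
import datetime
import subprocess

MODEL = "models/granite-4.0-micro.Q4_K_M.gguf"      # placeholder local path
PINNED_PROMPT = "Reply with the single word OK."     # stand-in for the prompt bundle

def smoke_run() -> dict:
    version = subprocess.run(
        ["llama-cli", "--version"], capture_output=True, text=True
    )
    inference = subprocess.run(
        ["llama-cli", "-m", MODEL, "-p", PINNED_PROMPT, "-n", "16"],
        capture_output=True, text=True, timeout=600,
    )
    return {
        "verified_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "runtime_version": (version.stdout or version.stderr).strip(),
        "inference_ok": inference.returncode == 0 and bool(inference.stdout.strip()),
    }

if __name__ == "__main__":
    print(smoke_run())
```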

llama.cpp

Runtime guidance currently needs review

Last verified: Not yet verified

Tested runtime version: Not yet verified

Monthly smoke cadence (31-day review window)

Prompt bundle: 2026.03-reference-lm-prompts-v1

Linux

GitHub-hosted Ubuntu x64 CPU runner

Install recipe: Install the latest llama.cpp prebuilt CPU release for Ubuntu before each run.

Last verified: Not yet verified

Tested version: Not yet verified

Model pull: Granite 4.0 Micro GGUF

Stale: No successful monthly smoke run recorded yet.

macOS

GitHub-hosted macOS CPU runner

Install recipe: Install the latest llama.cpp prebuilt binary release for macOS before each run.

Last verified: Not yet verified

Tested version: Not yet verified

Model pull: Granite 4.0 Micro GGUF

Stale: No successful monthly smoke run recorded yet.

Windows

GitHub-hosted Windows x64 CPU runner

Install recipe: Install the latest llama.cpp prebuilt CPU release for Windows before each run.

Last verified: Not yet verified

Tested version: Not yet verified

Model pull: Granite 4.0 Micro GGUF

Stale: No successful monthly smoke run recorded yet.
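
The Linux, macOS, and Windows rows above all share the same "install the latest prebuilt release" recipe. The sketch below is one hedged way to locate that release asset from GitHub; the repository path and the asset-name keywords are assumptions about how llama.cpp currently publishes builds, so verify them against the project's releases page before automating anything.

```python
# Hedged sketch: find the latest prebuilt llama.cpp release asset for a
# platform. Asset-name keywords are assumptions and may drift between releases.
import json
import urllib.request

RELEASES_URL = "https://api.github.com/repos/ggml-org/llama.cpp/releases/latest"
PLATFORM_KEYWORD = {"linux": "ubuntu-x64", "macos": "macos", "windows": "win"}

def latest_asset_url(platform: str) -> str:
    with urllib.request.urlopen(RELEASES_URL) as resp:
        release = json.load(resp)
    keyword = PLATFORM_KEYWORD[platform]
    for asset in release["assets"]:
        if keyword in asset["name"].lower():
            return asset["browser_download_url"]
    raise RuntimeError(f"No release asset matched keyword {keyword!r}")

print(latest_asset_url("linux"))
```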