Skip to content

← All writing

How Much VRAM Do You Need to Run an LLM Locally?

· by Andergrove Software

Running an open model locally — with Ollama, LM Studio, llama.cpp or vLLM — comes down to one nervous question: will it fit on my GPU? Get it right and the model runs fast, entirely on the card. Get it wrong and it either refuses to load or spills into system RAM and crawls.

The good news is that VRAM use is mostly predictable from a few numbers. Here's where the memory actually goes, what quantization saves you, and a rough sizing table. Plug your own model into the LLM GPU Memory (VRAM) Calculator for an exact estimate.

The three things that eat VRAM

When you load a model for inference, your card's memory is spent on three things:

  1. Model weights. The parameters themselves — by far the biggest chunk for short prompts. Size depends on the parameter count and the precision (more on that below).
  2. The KV cache. As the model generates, it caches a key and value vector for every token in the context, on every layer. This grows with your context length — a long prompt or a long chat can consume gigabytes here alone.
  3. Overhead. Activations, the CUDA context, framework buffers, and fragmentation. Budget roughly 1–2 GB on top of everything else.

Total VRAM ≈ weights + KV cache + overhead. The weights are fixed once you pick a model and precision; the KV cache is the part people forget.

Quantization: the size dial

A model's weights can be stored at different precisions, and that choice scales the weight memory almost linearly:

PrecisionBytes per parameter7B model13B model70B model
FP16 / BF16 (full)2~14 GB~26 GB~140 GB
8-bit (Q8)~1~7 GB~13 GB~70 GB
4-bit (Q4)~0.5~3.5 GB~6.5 GB~35 GB

The headline: 4-bit quantization roughly quarters the weight memory versus full precision, which is what makes big models runnable on consumer cards at all. The trade-off is a small quality loss — usually minor at 8-bit and modest at 4-bit, and often well worth it for the memory you get back. Below 4-bit (3-bit, 2-bit) the quality drop becomes noticeable.

A handy mental formula: weight VRAM ≈ parameters (in billions) × bytes-per-param. A 13B model at 4-bit is 13 × 0.5 ≈ 6.5 GB of weights.

Why context length costs memory too

The KV cache is the sneaky one. Its size grows with the context window you actually use:

KV cache ≈ 2 × layers × context length × key/value dimension × bytes-per-value

The exact figure depends on the model's architecture (number of layers and key/value heads), but the shape is what matters: double the context, double the cache. On a large model with a long context, the KV cache can rival or exceed the weights. This is why a model that loads fine on a short prompt can run out of memory deep into a long conversation — and why "how much VRAM" has no single answer until you fix a context length.

"Will it fit?" by GPU

Rough guidance for inference at 4-bit, leaving headroom for a reasonable context. Treat these as starting points, not guarantees — exact fit depends on the model and your context length:

Your GPU VRAMComfortableTight / short context
8 GB7–8B at Q413B at Q4
12 GB13B at Q414–20B at Q4
16 GB13B at Q8, 20B at Q432B at Q4
24 GB32–34B at Q470B at low-bit + short context
48 GB+70B at Q4larger / longer context

If you're picking hardware, 24 GB (e.g. an RTX 3090/4090) is the sweet spot for hobbyist local LLMs — it comfortably runs the popular mid-size models with room for context.

When it doesn't fit: CPU offloading

If a model is slightly too big, runtimes like llama.cpp can offload some layers to system RAM and run them on the CPU. It works — but every offloaded layer is dramatically slower than GPU compute, so tokens-per-second can fall off a cliff as soon as you spill out of VRAM. A model that's 90% on the GPU is usually fine; one that's 50% offloaded often isn't worth it. The goal is almost always to fit the whole model (and your working context) in VRAM.

Putting it together

To size a local LLM:

  1. Weights: parameters × bytes-per-param for your chosen precision (Q4 ≈ 0.5).
  2. KV cache: scale with the context length you actually need.
  3. Overhead: add ~1–2 GB.
  4. Compare the total to your card's VRAM, leaving a little headroom.

Rather than do that by hand for every model and quantization level, drop the parameter count, precision and context into the VRAM Calculator — it adds up the weights, KV cache and overhead and tells you whether it fits your GPU. Pair it with the LLM Token Counter when you're reasoning about how long your prompts (and therefore your KV cache) really are.