How Much VRAM Do You Need to Run an LLM Locally?
· by Andergrove Software
Running an open model locally — with Ollama, LM Studio, llama.cpp or vLLM — comes down to one nervous question: will it fit on my GPU? Get it right and the model runs fast, entirely on the card. Get it wrong and it either refuses to load or spills into system RAM and crawls.
The good news is that VRAM use is mostly predictable from a few numbers. Here's where the memory actually goes, what quantization saves you, and a rough sizing table. Plug your own model into the LLM GPU Memory (VRAM) Calculator for an exact estimate.
The three things that eat VRAM
When you load a model for inference, your card's memory is spent on three things:
- Model weights. The parameters themselves — by far the biggest chunk for short prompts. Size depends on the parameter count and the precision (more on that below).
- The KV cache. As the model generates, it caches a key and value vector for every token in the context, on every layer. This grows with your context length — a long prompt or a long chat can consume gigabytes here alone.
- Overhead. Activations, the CUDA context, framework buffers, and fragmentation. Budget roughly 1–2 GB on top of everything else.
Total VRAM ≈ weights + KV cache + overhead. The weights are fixed once you pick a model and precision; the KV cache is the part people forget.
Quantization: the size dial
A model's weights can be stored at different precisions, and that choice scales the weight memory almost linearly:
| Precision | Bytes per parameter | 7B model | 13B model | 70B model |
|---|---|---|---|---|
| FP16 / BF16 (full) | 2 | ~14 GB | ~26 GB | ~140 GB |
| 8-bit (Q8) | ~1 | ~7 GB | ~13 GB | ~70 GB |
| 4-bit (Q4) | ~0.5 | ~3.5 GB | ~6.5 GB | ~35 GB |
The headline: 4-bit quantization roughly quarters the weight memory versus full precision, which is what makes big models runnable on consumer cards at all. The trade-off is a small quality loss — usually minor at 8-bit and modest at 4-bit, and often well worth it for the memory you get back. Below 4-bit (3-bit, 2-bit) the quality drop becomes noticeable.
A handy mental formula: weight VRAM ≈ parameters (in billions) × bytes-per-param. A 13B model at 4-bit is 13 × 0.5 ≈ 6.5 GB of weights.
Why context length costs memory too
The KV cache is the sneaky one. Its size grows with the context window you actually use:
KV cache ≈ 2 × layers × context length × key/value dimension × bytes-per-value
The exact figure depends on the model's architecture (number of layers and key/value heads), but the shape is what matters: double the context, double the cache. On a large model with a long context, the KV cache can rival or exceed the weights. This is why a model that loads fine on a short prompt can run out of memory deep into a long conversation — and why "how much VRAM" has no single answer until you fix a context length.
"Will it fit?" by GPU
Rough guidance for inference at 4-bit, leaving headroom for a reasonable context. Treat these as starting points, not guarantees — exact fit depends on the model and your context length:
| Your GPU VRAM | Comfortable | Tight / short context |
|---|---|---|
| 8 GB | 7–8B at Q4 | 13B at Q4 |
| 12 GB | 13B at Q4 | 14–20B at Q4 |
| 16 GB | 13B at Q8, 20B at Q4 | 32B at Q4 |
| 24 GB | 32–34B at Q4 | 70B at low-bit + short context |
| 48 GB+ | 70B at Q4 | larger / longer context |
If you're picking hardware, 24 GB (e.g. an RTX 3090/4090) is the sweet spot for hobbyist local LLMs — it comfortably runs the popular mid-size models with room for context.
When it doesn't fit: CPU offloading
If a model is slightly too big, runtimes like llama.cpp can offload some layers to system RAM and run them on the CPU. It works — but every offloaded layer is dramatically slower than GPU compute, so tokens-per-second can fall off a cliff as soon as you spill out of VRAM. A model that's 90% on the GPU is usually fine; one that's 50% offloaded often isn't worth it. The goal is almost always to fit the whole model (and your working context) in VRAM.
Putting it together
To size a local LLM:
- Weights: parameters × bytes-per-param for your chosen precision (Q4 ≈ 0.5).
- KV cache: scale with the context length you actually need.
- Overhead: add ~1–2 GB.
- Compare the total to your card's VRAM, leaving a little headroom.
Rather than do that by hand for every model and quantization level, drop the parameter count, precision and context into the VRAM Calculator — it adds up the weights, KV cache and overhead and tells you whether it fits your GPU. Pair it with the LLM Token Counter when you're reasoning about how long your prompts (and therefore your KV cache) really are.