How much VRAM does an LLM need?
A practical breakdown of GPU memory for language models — weights by precision, the KV cache for inference, and gradients plus optimizer states for training — with the formulas the calculator uses.
Open the VRAM Calculator →
Weights: the starting point
The base memory is the number of parameters times the bytes used per weight:
weights = parameters × bytes_per_weight
fp32 = 4 bytes
fp16/bf16 = 2 bytes
int8 = 1 byte
int4 = 0.5 bytes
So a 7-billion-parameter model is about 14 GB in fp16, 7 GB in int8, or 3.5 GB in int4. Quantization is the quickest way to make a model fit a smaller GPU, trading a little accuracy for a lot of memory.
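The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's actual source; the names `BYTES_PER_WEIGHT` and `weights_gb` are made up for this example, and 1 GB is taken as 10^9 bytes.

```python
# Bytes per weight for each precision, as listed above.
BYTES_PER_WEIGHT = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weights_gb(parameters: float, precision: str) -> float:
    """Weight memory in decimal gigabytes (1 GB = 1e9 bytes)."""
    return parameters * BYTES_PER_WEIGHT[precision] / 1e9


# A 7-billion-parameter model at three precisions.
for precision in ("fp16", "int8", "int4"):
    print(f"{precision}: {weights_gb(7e9, precision):.1f} GB")
```

Running it for a 7B model reproduces the 14 GB, 7 GB and 3.5 GB figures quoted above.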
Inference overhead and the KV cache
On top of the weights, inference needs working memory for activations, the CUDA context and framework buffers, usually another 15–25% of the weight size. For long contexts the KV cache can dominate. It stores the attention keys and values for every past token in every layer:
kv_cache = 2 × layers × hidden_size × context_tokens × batch × bytes
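The factor of two is for keys and values, and bytes is the size of each cached value (2 for fp16). Because the cache scales with context length and batch, a long prompt or many concurrent requests can need more memory than the weights themselves. The formula assumes standard multi-head attention; models that share key/value heads (grouped-query or multi-query attention) cache proportionally less.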
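As a rough sketch, the KV cache formula and the overhead percentage can be combined into one inference estimate. The function names are hypothetical, the model shape (32 layers, hidden size 4096) is an assumed 7B-class configuration rather than any specific model, and the 20% overhead is simply a midpoint of the 15–25% range, applied to the weights as an assumption.

```python
def kv_cache_gb(layers, hidden_size, context_tokens, batch=1, bytes_per_value=2.0):
    """KV cache in decimal GB: 2 (keys and values) x layers x hidden x tokens x batch x bytes."""
    total_bytes = 2 * layers * hidden_size * context_tokens * batch * bytes_per_value
    return total_bytes / 1e9


def inference_gb(parameters, bytes_per_weight, kv_gb, overhead=0.20):
    """Weights plus overhead (assumed to be a fraction of the weights) plus KV cache."""
    weights = parameters * bytes_per_weight / 1e9
    return weights * (1 + overhead) + kv_gb


# Assumed 7B-class shape: 32 layers, hidden size 4096, fp16 cache, 4096-token context.
single = kv_cache_gb(layers=32, hidden_size=4096, context_tokens=4096)
batched = kv_cache_gb(layers=32, hidden_size=4096, context_tokens=4096, batch=8)

print(f"KV cache, batch 1: {single:.2f} GB")
print(f"KV cache, batch 8: {batched:.2f} GB")
print(f"Total fp16 inference, batch 1: {inference_gb(7e9, 2.0, single):.2f} GB")
```

Under these assumptions a single 4096-token request caches a little over 2 GB, while a batch of eight caches about 17 GB, more than the 14 GB of fp16 weights.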
Training: weights, gradients and optimizer state
Training stores far more. With the Adam optimizer in mixed precision you keep:
- fp16 weights (2 bytes) and fp16 gradients (2 bytes),
- fp32 optimizer momentum and variance (4 + 4 bytes),
- an fp32 master copy of the weights (4 bytes).
That adds up to 16 bytes per parameter (2 + 2 + 4 + 4 + 4), so a 7B model needs roughly 112 GB before activations, which is why training is spread across many GPUs with tensor or pipeline parallelism. SGD with momentum stores one optimizer state instead of two; plain SGD stores none. Activation memory depends on batch size and sequence length and comes on top of this.
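The per-parameter accounting can be written out as a short sketch. The names below are illustrative, and the code assumes that the fp32 master copy and fp32 optimizer states are kept for every optimizer, which the text only states explicitly for Adam.

```python
# Number of fp32 optimizer states kept per parameter.
OPTIMIZER_STATES = {"adam": 2, "sgd_momentum": 1, "sgd": 0}


def training_bytes_per_param(optimizer="adam"):
    """Bytes per parameter for mixed-precision training, excluding activations."""
    weights_fp16 = 2
    gradients_fp16 = 2
    master_weights_fp32 = 4  # assumed to be kept for every optimizer
    optimizer_state_fp32 = 4 * OPTIMIZER_STATES[optimizer]
    return weights_fp16 + gradients_fp16 + master_weights_fp32 + optimizer_state_fp32


def training_gb(parameters, optimizer="adam"):
    """Training memory in decimal GB before activations."""
    return parameters * training_bytes_per_param(optimizer) / 1e9


for name in OPTIMIZER_STATES:
    print(f"{name}: {training_bytes_per_param(name)} bytes/param, "
          f"{training_gb(7e9, name):.0f} GB for 7B")
```

For Adam this gives the 16 bytes per parameter and 112 GB quoted above; under the stated assumption, SGD with momentum comes to 12 bytes and plain SGD to 8.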
How to fit a bigger model
- Quantize for inference: relative to fp16, int8 halves the weight memory and int4 quarters it.
- Use LoRA / QLoRA for fine-tuning so only small adapters are trained.
- Shard with tensor/pipeline parallelism or ZeRO/FSDP across GPUs.
- Offload optimizer state or weights to CPU/NVMe when you must.
- Shorten context or batch to shrink the KV cache.
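For the sharding option, a back-of-the-envelope GPU count can be sketched as follows. This is an idealized estimate: it assumes memory splits perfectly evenly across GPUs and ignores communication buffers and any replicated state, and both the function name and the 90% usable fraction are assumptions made for this example.

```python
import math


def gpus_needed(total_gb, gpu_gb, usable_fraction=0.9):
    """Smallest GPU count whose combined usable memory covers total_gb.

    Idealized: assumes an even split and no per-GPU replication.
    """
    usable_per_gpu = gpu_gb * usable_fraction
    return math.ceil(total_gb / usable_per_gpu)


# 112 GB of Adam mixed-precision state for a 7B model, before activations.
print("80 GB GPUs:", gpus_needed(112, 80))
print("24 GB GPUs:", gpus_needed(112, 24))
```

Real deployments usually need more than this lower bound once activations and framework overhead are included.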
Privacy
The calculator is pure arithmetic that runs entirely in your browser — nothing is uploaded.
FAQ
How much VRAM to run a 7B model?
About 14 GB in fp16 plus overhead, or about 4 GB in int4 (3.5 GB of weights plus overhead), so a 7B int4 model fits on a consumer GPU.
Why is training so much heavier?
It also stores gradients and optimizer state — roughly 16 bytes/param for Adam mixed precision.
Are the numbers exact?
No — they're planning estimates; leave headroom for activations and framework overhead.
Ready to try it? Open the VRAM Calculator →