How much VRAM do I need to run an LLM?

For inference, start with parameters × bytes-per-weight: a 7B model is about 14 GB in fp16, 7 GB in int8, or 3.5 GB in int4, plus 15–25% runtime overhead and the KV cache for long contexts. The calculator works this out for you.

Why does training need so much more memory than inference?

Training also stores gradients and optimizer state. With Adam in mixed precision you keep fp16 weights and gradients plus fp32 momentum, variance and a master copy — roughly 16 bytes per parameter, so a 7B model needs on the order of 112 GB before activations.

What is the KV cache?

During generation, attention keys and values for past tokens are cached so they aren't recomputed. The cache grows with the number of layers, hidden size, context length and batch size: 2 × layers × hidden × tokens × batch × bytes-per-element.

Are these numbers exact?

No. They are planning estimates. Actual usage depends on the framework, attention kernels, activation memory, and fragmentation. Always leave headroom.

How Much VRAM Does an LLM Need? — Andergrove Software

Weights: the starting point

The base memory is the number of parameters times the bytes used per weight:

weights = parameters × bytes_per_weight

fp32 = 4 bytes   fp16/bf16 = 2 bytes   int8 = 1 byte   int4 = 0.5 byte

So a 7-billion-parameter model is about 14 GB in fp16, 7 GB in int8, or 3.5 GB in int4. Quantization is the quickest way to make a model fit a smaller GPU, trading a little accuracy for a lot of memory.
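
As a quick check of the arithmetic above, here is a minimal Python sketch. The function name weight_memory_gb and the BYTES_PER_WEIGHT table are illustrative names, not part of any library, and gigabytes are taken as decimal (1 GB = 10^9 bytes), which is the convention the 14 / 7 / 3.5 GB figures imply.

```python
# Bytes per weight for the precisions listed above.
BYTES_PER_WEIGHT = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(parameters: float, precision: str) -> float:
    """Weight memory in decimal gigabytes (assumes 1 GB = 1e9 bytes)."""
    return parameters * BYTES_PER_WEIGHT[precision] / 1e9


if __name__ == "__main__":
    # Reproduce the 7B figures quoted in the text.
    for precision in ("fp16", "int8", "int4"):
        print(f"7B in {precision}: {weight_memory_gb(7e9, precision):.1f} GB")
```

This covers the weights only; runtime overhead and the KV cache come on top.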

Inference overhead and the KV cache

On top of the weights, inference needs working memory for activations, the CUDA context and framework buffers, usually another 15–25% of the weight size. For long contexts the KV cache can dominate. It stores attention keys and values for every past token:

kv_cache = 2 × layers × hidden_size × context_tokens × batch × bytes

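The factor of two is for keys and values. Because the cache scales with context length and batch, a long prompt or many concurrent requests can need more memory than the weights themselves. The formula assumes standard multi-head attention; models that use grouped-query or multi-query attention cache fewer key/value heads and need proportionally less.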
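
The KV cache formula can be sketched in Python as follows. The model shape used in the example (32 layers, hidden size 4096) is a hypothetical 7B-class configuration chosen for illustration, not a figure from this article, and treating the overhead as a fraction of weight memory is likewise an assumption.

```python
def kv_cache_gb(layers, hidden_size, context_tokens, batch=1, bytes_per_element=2.0):
    """KV cache size in decimal GB, following the formula above."""
    # The leading 2 accounts for storing both keys and values.
    total_bytes = 2 * layers * hidden_size * context_tokens * batch * bytes_per_element
    return total_bytes / 1e9


def inference_memory_gb(weights_gb, kv_gb, overhead=0.20):
    """Weights plus runtime overhead plus KV cache.

    Assumption: overhead is a fraction of weight memory (0.15 to 0.25).
    """
    return weights_gb * (1 + overhead) + kv_gb


if __name__ == "__main__":
    # Hypothetical shape: 32 layers, hidden size 4096, fp16 cache (2 bytes).
    short = kv_cache_gb(layers=32, hidden_size=4096, context_tokens=4096)
    long_batch = kv_cache_gb(layers=32, hidden_size=4096, context_tokens=32768, batch=8)
    print(f"4k tokens, batch 1:  {short:.2f} GB")
    print(f"32k tokens, batch 8: {long_batch:.2f} GB")
    print(f"Total at 4k, fp16 weights: {inference_memory_gb(14.0, short):.2f} GB")
```

With these assumed values, the long-context, batched case needs far more memory for the cache than the 14 GB of fp16 weights, which is the effect the paragraph above describes.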

Training: weights, gradients and optimizer state

Training stores far more. With the Adam optimizer in mixed precision you keep:

fp16 weights (2 bytes) and fp16 gradients (2 bytes),
fp32 optimizer momentum and variance (4 + 4 bytes),
an fp32 master copy of the weights (4 bytes).

That is about 16 bytes per parameter, so a 7B model needs roughly 112 GB before activations (7 billion × 16 bytes), which is why large-model training is sharded across many GPUs with tensor or pipeline parallelism or ZeRO/FSDP. SGD with momentum stores one optimizer state instead of two; plain SGD stores none. Activation memory depends on batch and sequence length and is extra.
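
The per-parameter breakdown above adds up as in this short Python sketch. The names are illustrative, and the SGD rows assume the optimizer state and master weights are kept in fp32 just as with Adam, which the article does not state explicitly.

```python
# Optimizer state in bytes per parameter, assuming fp32 state tensors.
OPTIMIZER_STATE_BYTES = {
    "adam": 8,          # momentum (4) + variance (4)
    "sgd_momentum": 4,  # one momentum buffer (assumed fp32)
    "sgd": 0,           # no optimizer state
}


def training_bytes_per_param(optimizer="adam"):
    """Bytes per parameter for mixed-precision training, before activations."""
    fp16_weights = 2
    fp16_gradients = 2
    fp32_master_weights = 4  # assumed to be kept for every optimizer
    return fp16_weights + fp16_gradients + fp32_master_weights + OPTIMIZER_STATE_BYTES[optimizer]


def training_memory_gb(parameters, optimizer="adam"):
    """Training memory in decimal GB, excluding activations."""
    return parameters * training_bytes_per_param(optimizer) / 1e9


if __name__ == "__main__":
    for name in OPTIMIZER_STATE_BYTES:
        print(f"7B with {name}: {training_memory_gb(7e9, name):.0f} GB before activations")
```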

How to fit a bigger model

Quantize for inference — int8 halves and int4 quarters the weights.
Use LoRA / QLoRA for fine-tuning so only small adapters are trained.
Shard with tensor/pipeline parallelism or ZeRO/FSDP across GPUs.
Offload optimizer state or weights to CPU/NVMe when you must.
Shorten context or batch to shrink the KV cache.
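
To illustrate the quantization option in the list above, this Python sketch checks which precisions let a model's weights fit a given VRAM budget. The function name, the 20% overhead default and the example GPU sizes are assumptions for illustration; the KV cache is deliberately ignored, so treat the result as a lower bound.

```python
# Bytes per weight, ordered from largest to smallest precision.
BYTES_PER_WEIGHT = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def precisions_that_fit(parameters, vram_gb, overhead=0.20):
    """Return the precisions whose weights plus overhead fit in vram_gb.

    Assumptions: decimal GB, overhead as a fraction of weight memory,
    and no KV cache (short context, batch of one).
    """
    fitting = []
    for precision, bytes_per_weight in BYTES_PER_WEIGHT.items():
        needed_gb = parameters * bytes_per_weight * (1 + overhead) / 1e9
        if needed_gb <= vram_gb:
            fitting.append(precision)
    return fitting


if __name__ == "__main__":
    # Hypothetical GPU sizes for a 7B model.
    for vram_gb in (8, 12, 24):
        print(f"{vram_gb} GB GPU: {precisions_that_fit(7e9, vram_gb)}")
```

A real deployment would also budget for the KV cache at the intended context length and batch size.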

Privacy

The calculator is pure arithmetic that runs entirely in your browser — nothing is uploaded.

FAQ

How much VRAM to run a 7B model?

About 14 GB in fp16 plus overhead, or about 3.5 GB of weights in int4 (roughly 4 GB with overhead), so a 7B int4 model fits a consumer GPU.

Why is training so much heavier?

It also stores gradients and optimizer state — roughly 16 bytes/param for Adam mixed precision.

Are the numbers exact?

No — they're planning estimates; leave headroom for activations and framework overhead.

Ready to try it? Open the VRAM Calculator →
