Skip to content

How much VRAM does an LLM need?

A practical breakdown of GPU memory for language models — weights by precision, the KV cache for inference, and gradients plus optimizer states for training — with the formulas the calculator uses.

Open the VRAM Calculator →

Weights: the starting point

The base memory is the number of parameters times the bytes used per weight:

weights = parameters × bytes_per_weight

fp32 = 4 bytes   fp16/bf16 = 2 bytes   int8 = 1 byte   int4 = 0.5 byte

So a 7-billion-parameter model is about 14 GB in fp16, 7 GB in int8, or 3.5 GB in int4. Quantization is the quickest way to make a model fit a smaller GPU, trading a little accuracy for a lot of memory.

Inference overhead and the KV cache

On top of the weights, inference needs working memory for activations, the CUDA context and framework buffers — usually 15–25%. For long contexts the KV cache can dominate. It stores attention keys and values for every past token:

kv_cache = 2 × layers × hidden_size × context_tokens × batch × bytes

The factor of two is for keys and values. Because it scales with context length and batch, a long prompt or many concurrent requests can need more memory than the weights themselves.

Training: weights, gradients and optimizer state

Training stores far more. With the Adam optimizer in mixed precision you keep:

  • fp16 weights (2 bytes) and fp16 gradients (2 bytes),
  • fp32 optimizer momentum and variance (4 + 4 bytes),
  • an fp32 master copy of the weights (4 bytes).

That is about 16 bytes per parameter, so a 7B model needs roughly 112 GB before activations — which is why training uses model, tensor or pipeline parallelism across many GPUs. SGD with momentum stores one optimizer state instead of two; plain SGD stores none. Activation memory depends on batch and sequence length and is extra.

How to fit a bigger model

  • Quantize for inference — int8 halves and int4 quarters the weights.
  • Use LoRA / QLoRA for fine-tuning so only small adapters are trained.
  • Shard with tensor/pipeline parallelism or ZeRO/FSDP across GPUs.
  • Offload optimizer state or weights to CPU/NVMe when you must.
  • Shorten context or batch to shrink the KV cache.

Privacy

The calculator is pure arithmetic that runs entirely in your browser — nothing is uploaded.

FAQ

How much VRAM to run a 7B model?

About 14 GB in fp16 plus overhead, or ~4 GB in int4 — so a 7B int4 model fits a consumer GPU.

Why is training so much heavier?

It also stores gradients and optimizer state — roughly 16 bytes/param for Adam mixed precision.

Are the numbers exact?

No — they're planning estimates; leave headroom for activations and framework overhead.

Ready to try it? Open the VRAM Calculator →

Related guides