VRAM Sizing for LLMs: How Much GPU Memory You Actually Need
Picking a GPU for a large-language-model workload almost always comes down to one number: VRAM. Get it wrong and the job either refuses to start (CUDA out of memory) or runs at a fraction of expected throughput because the framework starts swapping or recomputing. The hourly rate is irrelevant if the GPU cannot hold your model.
This page works through the budget the model itself imposes, the additional memory training adds on top, and the headroom that practical workloads need. The goal is to give you a single number — bytes of VRAM — that you can compare against the cards in the hardware guide.
The base formula
Model weights are the floor. The size of the weights in memory is straightforward:

weight bytes = parameter count × bytes per parameter
Bytes-per-parameter depends on the precision the weights are stored in:
| Precision | Bytes per parameter | Notes |
|---|---|---|
| FP32 | 4 | Rare for inference; mostly seen during training as the master copy. |
| FP16 / BF16 | 2 | Default for most modern inference and mixed-precision training. |
| FP8 | 1 | Hopper- and Blackwell-class GPUs only; tools and frameworks still maturing. |
| INT8 | 1 | Common quantized inference target; small accuracy hit on most models. |
| INT4 (e.g. GPTQ, AWQ) | 0.5 | Aggressive inference quantization; quality varies by model and method. |
So a 7-billion-parameter model needs roughly 14 GB in FP16, 7 GB in INT8, and 3.5 GB in INT4 — for the weights alone.
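The weight floor is easy to script. A minimal sketch (the precision names and the 1 GB = 10^9 bytes convention are my assumptions; the byte counts mirror the table above):

```python
# Bytes per parameter for common storage precisions (matches the table above).
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0,
                   "fp8": 1.0, "int8": 1.0, "int4": 0.5}

def weight_gb(params_billion: float, precision: str) -> float:
    """Weight footprint in GB (1 GB = 1e9 bytes): parameters x bytes per parameter."""
    return params_billion * BYTES_PER_PARAM[precision]

# weight_gb(7, "fp16") -> 14.0, weight_gb(7, "int8") -> 7.0, weight_gb(7, "int4") -> 3.5
```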
Inference: weights plus a small overhead
For inference, the dominant cost beyond the weights themselves is the KV cache: the attention key/value tensors stored for every token in the running context. The KV cache scales with batch size, context length, the number of attention heads, and the per-head dimension. For most production-style serving, a useful rule of thumb is:

inference VRAM ≈ weight bytes × 1.2 to 1.5
The lower end is for small-context, low-batch use cases; the upper end is for long-context serving with parallel requests. If you are pushing 32k or 128k context, the KV cache can rival the weights themselves and the multiplier rises further.
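The KV-cache term itself can be estimated directly from the architecture. A sketch, assuming illustrative 7B-class dimensions (32 layers, 32 KV heads, head dimension 128, FP16 cache; these numbers are assumptions, and grouped-query-attention models store far fewer KV heads):

```python
def kv_cache_gb(batch: int, seq_len: int, n_layers: int,
                n_kv_heads: int, head_dim: int, bytes_per_elem: int = 2) -> float:
    """KV cache in GB: 2 tensors (K and V) per layer, per token, per KV head."""
    elems = 2 * batch * seq_len * n_layers * n_kv_heads * head_dim
    return elems * bytes_per_elem / 1e9

# With the assumed dims: 4k context at batch 1 is ~2.1 GB,
# while 128k context is ~69 GB, rivaling the 14 GB of FP16 weights.
```

At long context the cache alone can exceed the FP16 weights of a 7B model, which is why the multiplier in the rule of thumb climbs for long-context serving.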
That gives the following inference budgets at common precisions:
| Model size | FP16 weights | FP16 inference (≈1.3×) | INT8 inference | INT4 inference |
|---|---|---|---|---|
| 7B | 14 GB | ~18 GB | ~10 GB | ~6 GB |
| 13B | 26 GB | ~34 GB | ~18 GB | ~10 GB |
| 34B | 68 GB | ~88 GB | ~46 GB | ~24 GB |
| 70B | 140 GB | ~180 GB | ~94 GB | ~48 GB |
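Adding the rule-of-thumb multiplier to the weight floor reproduces the FP16 column above (a sketch; the 1.3 default is simply the mid-range of the 1.2 to 1.5 band):

```python
def inference_vram_gb(params_billion: float, bytes_per_param: float,
                      overhead: float = 1.3) -> float:
    """Inference budget: weights times a 1.2-1.5x multiplier covering the
    KV cache and runtime overhead (push higher for long context or big batches)."""
    return params_billion * bytes_per_param * overhead

# inference_vram_gb(7, 2) -> 18.2 GB, inference_vram_gb(70, 2) -> 182.0 GB
```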
Training: 16 bytes per parameter is a good first cut
Training is much more memory-hungry than inference because the GPU has to keep:
- The weights (2 bytes per parameter in BF16/FP16).
- The gradients (same size as the weights: another 2 bytes per parameter).
- The optimizer state (mixed-precision Adam keeps an FP32 master copy of the weights plus two FP32 moments, ~12 bytes per parameter).
- Activations stored for the backward pass (varies wildly with sequence length, batch size, and recomputation policy).
For a vanilla mixed-precision Adam training loop without aggressive sharding or activation checkpointing:

memory per parameter ≈ 2 (BF16 weights) + 2 (gradients) + 4 (FP32 master weights) + 4 + 4 (Adam moments) = 16 bytes
That is the source of the "16 bytes per parameter" rule. A 7B model in this regime needs 112 GB just for the model state — already past a single H100 80 GB — before activations enter the picture.
Two tools push this number back down:
- ZeRO / FSDP sharding. Splitting optimizer state, gradients, and weights across N GPUs reduces the per-GPU memory roughly N-fold for the largest stage (ZeRO-3 / FSDP full-shard).
- Activation checkpointing. Recomputing activations during the backward pass instead of storing them trades extra compute for big memory savings — typically 30–50% lower activation memory at the cost of ~30% more compute.
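The training-side arithmetic, with ZeRO-3/FSDP-style sharding folded in, is a one-liner (a sketch; the activation budget is left as an input because it varies so widely with sequence length, batch size, and checkpointing policy):

```python
def training_vram_per_gpu_gb(params_billion: float, n_gpus: int = 1,
                             activation_gb: float = 0.0,
                             bytes_per_param: float = 16.0) -> float:
    """Per-GPU training memory: model state (weights + gradients + Adam state,
    ~16 bytes/param for mixed-precision Adam) divided across n_gpus under
    ZeRO-3 / FSDP full-shard, plus an unsharded activation budget."""
    return params_billion * bytes_per_param / n_gpus + activation_gb

# training_vram_per_gpu_gb(7) -> 112.0 GB: a 7B model's state alone
# already exceeds a single H100 80GB before activations are counted.
```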
Worked example: fine-tuning a 13B model
Suppose you want to fine-tune a 13-billion-parameter model with full-parameter training, BF16 mixed precision, Adam, and standard activation checkpointing.
- Model state (no sharding): 13B × 16 bytes ≈ 208 GB.
- Activations at sequence length 2048, batch 8 with checkpointing: ~30–40 GB.
- Total per GPU, no sharding: ~240 GB. That is more than any single GPU on the market.
- With FSDP across 4 H100 80GB: model state divides to ~52 GB per GPU plus shared activations of ~30 GB → roughly 80 GB per GPU, which is at the limit. Move to 8 GPUs and the budget becomes comfortable.
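The napkin math in the bullets above can be checked directly (the 30 GB activation figure is the assumed mid-range of the checkpointed estimate):

```python
model_state_gb = 13 * 16                          # 208 GB of weights + grads + Adam state
activations_gb = 30                               # assumed mid-range with checkpointing

no_shard   = model_state_gb + activations_gb      # 238 GB: beyond any single GPU
per_gpu_x4 = model_state_gb / 4 + activations_gb  # 82.0 GB: at the limit of an H100 80GB
per_gpu_x8 = model_state_gb / 8 + activations_gb  # 56.0 GB: comfortable
```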
If you switch to LoRA (only training a small low-rank adapter) the same model fits trivially on a single H100 80GB or even an A100 40GB, because the optimizer state and gradients only apply to the adapter weights — typically tens of millions of parameters, not 13 billion.
Common mistakes
- Forgetting the optimizer. Quoting only weight size and concluding "70B fits on an A100 80GB" is wrong for full training: it ignores the gradients and optimizer state, which push the total to roughly 8× the FP16 weight footprint (the 16-bytes-per-parameter rule).
- Conflating inference and training budgets. A model that fits comfortably for inference may need 4–8× more memory to train or fully fine-tune.
- Ignoring context length. For long-context inference, the KV cache can dwarf the weights. A 7B model serving 128k-token windows is no longer a 14 GB workload.
- Underestimating per-layer overhead. Frameworks reserve memory for kernel workspaces, communicators, and CUDA caching allocators. Plan for 5–10% headroom on top of the calculated number.
- Picking the smallest GPU that "should fit". A model that fits at 95% utilization will hit fragmentation issues. Prefer the next size up unless cost is the binding constraint.
Sizing checklist
- Pick the precision you actually plan to run in (FP16/BF16 for most cases).
- Multiply parameters × bytes-per-parameter for the weight floor.
- For inference, multiply by 1.2 to 1.5 to cover the KV cache and runtime overhead. Push higher for long context.
- For training, start from 16 bytes per parameter for full-precision Adam, then divide by the number of GPUs you are sharding across, and add an activation budget.
- Cross-check the result against the VRAM column in the hardware guide.
- Plug the GPU count and hours into the cost calculator to get a workload cost.
- If the model only just fits, step up one tier (or shard wider) — fragmentation alone can push real usage 10% above the napkin estimate.
Related
- H100 vs A100 — VRAM is one of the dimensions where Hopper-class cards pull away from Ampere.
- Glossary — definitions for VRAM, HBM, FP16, FP8, mixed precision, and quantization.
- GPU cost calculator — once you know how many GPUs the model needs, the calculator turns that into a dollar figure.