VRAM Sizing for LLMs: How Much GPU Memory You Actually Need

Last reviewed on 2026-04-30 · 9 min read

Picking a GPU for a large-language-model workload almost always comes down to one number: VRAM. Get it wrong and the job either refuses to start (CUDA out of memory) or runs at a fraction of expected throughput because the framework starts swapping or recomputing. The hourly rate is irrelevant if the GPU cannot hold your model.

This page works through the budget the model itself imposes, the additional memory training adds on top, and the headroom that practical workloads need. The goal is to give you a single number — bytes of VRAM — that you can compare against the cards in the hardware guide.

The base formula

Model weights are the floor. The size of the weights in memory is straightforward:

VRAM_weights ≈ parameters × bytes-per-parameter

Bytes-per-parameter depends on the precision the weights are stored in:

| Precision | Bytes per parameter | Notes |
|---|---|---|
| FP32 | 4 | Rare for inference; mostly seen during training as the master copy. |
| FP16 / BF16 | 2 | Default for most modern inference and mixed-precision training. |
| FP8 | 1 | Hopper- and Blackwell-class GPUs only; tools and frameworks still maturing. |
| INT8 | 1 | Common quantized inference target; small accuracy hit on most models. |
| INT4 (e.g. GPTQ, AWQ) | 0.5 | Aggressive inference quantization; quality varies by model and method. |

So a 7-billion-parameter model needs roughly 14 GB in FP16, 7 GB in INT8, and 3.5 GB in INT4 — for the weights alone.
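
As a sanity check, those numbers drop out of a few lines of Python. A minimal sketch; the byte counts mirror the table above, and GB here means 10^9 bytes:

```python
# Bytes per parameter at common precisions (see table above).
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0,
                   "fp8": 1.0, "int8": 1.0, "int4": 0.5}

def weight_vram_gb(params: float, precision: str) -> float:
    """Weight-memory floor in GB: parameters x bytes-per-parameter."""
    return params * BYTES_PER_PARAM[precision] / 1e9

for precision in ("fp16", "int8", "int4"):
    print(f"7B @ {precision}: {weight_vram_gb(7e9, precision):.1f} GB")
# 7B @ fp16: 14.0 GB
# 7B @ int8: 7.0 GB
# 7B @ int4: 3.5 GB
```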

Inference: weights plus a small overhead

For inference, the dominant cost beyond the weights themselves is the KV cache: the attention key/value tensors stored for every token in the running context. The KV cache scales with batch size, context length, the number of attention heads, and the per-head dimension. For most production-style serving, a useful rule of thumb is:

VRAM_inference ≈ VRAM_weights × 1.2 to 1.5

The lower end is for small-context, low-batch use cases; the upper end is for long-context serving with parallel requests. If you are pushing 32k or 128k context, the KV cache can rival the weights themselves and the multiplier rises further.
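
To see when the multiplier breaks down, you can estimate the KV cache directly. A minimal sketch assuming standard multi-head attention with an FP16 cache (two tensors, K and V, per layer per token); the 32-layer / 32-head / head-dim-128 shape is a Llama-2-7B-style example, not a universal constant:

```python
def kv_cache_gb(batch: int, context: int, layers: int,
                heads: int, head_dim: int, bytes_per_elem: float = 2.0) -> float:
    """KV cache in GB: K and V tensors for every layer and token."""
    return 2 * batch * context * layers * heads * head_dim * bytes_per_elem / 1e9

# Llama-2-7B-style shape: 32 layers, 32 heads, head_dim 128.
print(kv_cache_gb(batch=1, context=4096, layers=32, heads=32, head_dim=128))
# ~2.1 GB: small next to 14 GB of weights
print(kv_cache_gb(batch=8, context=32768, layers=32, heads=32, head_dim=128))
# ~137 GB: the cache dwarfs the weights, so the 1.2-1.5x band no longer holds
```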

That gives the following inference budgets at common precisions:

| Model size | FP16 weights | FP16 inference (≈1.3×) | INT8 inference | INT4 inference |
|---|---|---|---|---|
| 7B | 14 GB | ~18 GB | ~10 GB | ~6 GB |
| 13B | 26 GB | ~34 GB | ~18 GB | ~10 GB |
| 34B | 68 GB | ~88 GB | ~46 GB | ~24 GB |
| 70B | 140 GB | ~180 GB | ~94 GB | ~48 GB |

Training: a 4× multiplier on the FP32 weight size is a good first cut

Training is much more memory-hungry than inference because the GPU has to keep several copies of the model state at once:

  - the working weights (FP16/BF16 under mixed precision);
  - the gradients, stored at the same precision as the weights;
  - the optimizer state: for Adam, an FP32 master copy of the weights plus two FP32 moment buffers;
  - the activations saved from the forward pass for reuse in the backward pass.

For a vanilla mixed-precision Adam training loop without aggressive sharding or activation checkpointing:

VRAM_train (single GPU) ≈ parameters × 16 bytes (weights, grads, Adam states) + activations

That is the source of the "16 bytes per parameter" rule. A 7B model in this regime needs 112 GB just for the model state — already past a single H100 80 GB — before activations enter the picture.
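
Those 16 bytes decompose into five buffers; spelling them out makes the rule easy to audit (a sketch of the standard mixed-precision Adam accounting):

```python
# Per-parameter model state for mixed-precision Adam, no sharding.
MODEL_STATE_BYTES = {
    "bf16 weights":        2,
    "bf16 gradients":      2,
    "fp32 master weights": 4,
    "fp32 Adam momentum":  4,
    "fp32 Adam variance":  4,
}
assert sum(MODEL_STATE_BYTES.values()) == 16

print(7e9 * 16 / 1e9)  # 112.0 GB of model state for a 7B model
```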

Two tools push this number back down:

  - Sharding (ZeRO / FSDP): splits the weights, gradients, and optimizer state across GPUs, dividing the per-GPU model-state cost by roughly the number of devices.
  - Activation checkpointing: drops most activations during the forward pass and recomputes them during the backward pass, trading extra compute for a much smaller activation footprint.
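
Both are a few lines in PyTorch. A minimal sketch, assuming torch.distributed is already initialized (e.g. under torchrun) and using a hypothetical MyTransformer as a stand-in for your model:

```python
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.utils.checkpoint import checkpoint

model = MyTransformer()   # hypothetical model class; substitute your own
model = FSDP(model)       # shards weights, grads, and optimizer state across ranks

# Inside the model's forward pass, wrap expensive blocks so their
# activations are recomputed in the backward pass rather than stored:
#   hidden = checkpoint(self.block, hidden, use_reentrant=False)
```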

Worked example: fine-tuning a 13B model

Suppose you want to fine-tune a 13-billion-parameter model with full-parameter training, BF16 mixed precision, Adam, and standard activation checkpointing.

  1. Model state (no sharding): 13B × 16 bytes ≈ 208 GB.
  2. Activations at sequence length 2048, batch 8 with checkpointing: ~30–40 GB.
  3. Total per GPU, no sharding: ~240 GB. That is more than any single GPU on the market.
  4. With FSDP across 4 H100 80GB: the model state divides to ~52 GB per GPU, but each GPU still holds its own ~30 GB of activations → roughly 82 GB per GPU, over the 80 GB limit once runtime overhead is counted. Move to 8 GPUs and the budget becomes comfortable.

If you switch to LoRA (only training a small low-rank adapter) the same model fits trivially on a single H100 80GB or even an A100 40GB, because the optimizer state and gradients only apply to the adapter weights — typically tens of millions of parameters, not 13 billion.
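
The same arithmetic as a function, handy for re-running with your own numbers. A sketch: the ~30 GB activation term is the rough estimate from step 2 above, not something the function computes:

```python
def train_vram_per_gpu_gb(params: float, num_gpus: int = 1,
                          activations_gb: float = 30.0) -> float:
    """Model state (16 bytes/param) sharded across GPUs, plus per-GPU activations."""
    return params * 16 / 1e9 / num_gpus + activations_gb

print(train_vram_per_gpu_gb(13e9, num_gpus=1))  # ~238 GB: no single GPU fits
print(train_vram_per_gpu_gb(13e9, num_gpus=4))  # ~82 GB: right at an H100's 80 GB
print(train_vram_per_gpu_gb(13e9, num_gpus=8))  # ~56 GB: comfortable
```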

Common mistakes

  - Budgeting for the weights alone. The KV cache is negligible at short context but can rival the weights at 32k+ context with parallel requests.
  - Sizing training like inference. Full-parameter Adam training needs roughly 16 bytes per parameter of model state, about 8× the FP16 weight footprint.
  - Forgetting activations, which scale with batch size and sequence length and are not reduced by sharding the model state.
  - Leaving no headroom. A model that just fits on paper usually hits CUDA out-of-memory in practice once fragmentation and runtime overhead are counted.

Sizing checklist

  1. Pick the precision you actually plan to run in (FP16/BF16 for most cases).
  2. Multiply parameters × bytes-per-parameter for the weight floor.
  3. For inference, multiply by 1.2 to 1.5 to cover the KV cache and runtime overhead. Push higher for long context.
  4. For training, start from 16 bytes per parameter for mixed-precision Adam, then divide the model state by the number of GPUs you are sharding across, and add an activation budget.
  5. Cross-check the result against the VRAM column in the hardware guide.
  6. Plug the GPU count and hours into the cost calculator to get a workload cost.
  7. If the model only just fits, step up one tier (or shard wider) — fragmentation alone can push real usage 10% above the napkin estimate.
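
The whole checklist condenses into one small estimator. A sketch under the assumptions above (1.2 to 1.5× inference overhead, 16 bytes per parameter of training model state, a flat activation budget you supply yourself):

```python
def estimate_vram_gb(params: float, mode: str, *,
                     bytes_per_param: float = 2.0,  # fp16/bf16; see precision table
                     overhead: float = 1.3,         # 1.2-1.5; raise for long context
                     num_gpus: int = 1,
                     activations_gb: float = 30.0) -> float:
    """Rough per-GPU VRAM estimate for 'inference' or 'training'."""
    if mode == "inference":
        return params * bytes_per_param / 1e9 * overhead / num_gpus
    if mode == "training":
        return params * 16 / 1e9 / num_gpus + activations_gb
    raise ValueError(f"unknown mode: {mode}")

print(estimate_vram_gb(70e9, "inference"))             # ~182 GB: needs multiple 80 GB cards
print(estimate_vram_gb(13e9, "training", num_gpus=8))  # ~56 GB per GPU
```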
