Inference vs Training: How the GPU Choice Differs
The same conversation comes up on every team that has shipped a model: the GPU that trained it is not the GPU that should serve it. Training and inference have different bottlenecks, different sensitivities to memory and bandwidth, and different cost structures. Treating them as the same workload locks teams into either an overpriced inference fleet or an undersized training cluster.
This page lays out where the two diverge, which GPU each one prefers, and how to think about the cost of each.
Why the workloads are different
Training is dominated by throughput. The metric that matters is samples-per-second across the entire dataset for a fixed wall-clock budget. Each step does a forward pass, a backward pass, and an optimizer update; memory holds the weights, gradients, optimizer state, and activations stored for backprop. The job is roughly compute-bound when you can keep the GPU's tensor cores busy with large matrix multiplies, which is why batch size and sequence length matter so much.
Inference is dominated by per-request latency. The metrics that matter are time-to-first-token and tokens-per-second per request, sometimes alongside aggregate throughput across many concurrent requests. Memory only needs to hold the weights plus the KV cache for the active requests; there are no gradients or optimizer states. The job is often memory-bandwidth bound, especially during autoregressive decoding, where each generated token reads the full weights from VRAM.
What each workload prefers
| Dimension | Training | Inference |
|---|---|---|
| Dominant bottleneck | Compute (FP8/FP16 throughput) | Memory bandwidth, then compute |
| VRAM driver | Weights + grads + optimizer + activations | Weights + KV cache |
| Best precision | BF16 / FP16 mixed; FP8 where supported | FP16 / BF16, then INT8 / INT4 quantized |
| Batch size | As large as memory allows | Often 1 per request; dynamic batching across requests for throughput |
| Multi-GPU | Routine — FSDP, DeepSpeed, Megatron | Only when model exceeds single-GPU VRAM (tensor parallelism) |
| Pricing model fit | Spot / preemptible viable with checkpoints | On-demand or reserved; spot rarely fits SLAs |
This is why a 70B fine-tune might end up on 8x H100 80GB while serving the same 70B model in production might run on 2x H100 (or even 1x H200 141GB) with INT8 quantization. The training cluster is sized for memory + compute; the inference deployment is sized for "smallest config that hits the latency target".
How memory differs in practice
For training, plan for roughly 16 bytes per parameter for a vanilla mixed-precision Adam loop (weights + grads + 2 Adam moments) before activations enter the picture. Sharding across GPUs with FSDP or DeepSpeed ZeRO-3 reduces the per-GPU number proportionally. Activation checkpointing trades extra compute for lower activation memory.
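That rule of thumb can be sketched as a quick calculation. This is an approximation, not a sizing tool: the 16 bytes/param figure assumes BF16 weights and gradients plus FP32 Adam state, activations are deliberately excluded, and the even-sharding assumption is idealized.

```python
def training_state_gb(params_billions: float, bytes_per_param: int = 16,
                      num_gpus: int = 1) -> float:
    """Per-GPU VRAM for weights + grads + Adam moments, activations excluded.

    16 bytes/param assumes BF16 weights and grads plus FP32 Adam state;
    FSDP / ZeRO-3 shards that state roughly evenly across GPUs.
    """
    return params_billions * bytes_per_param / num_gpus

# 7B model: ~112 GB of state unsharded, ~14 GB per GPU across 8 GPUs
print(training_state_gb(7))              # 112.0
print(training_state_gb(7, num_gpus=8))  # 14.0
```

Activation memory comes on top of this and depends on batch size, sequence length, and whether activation checkpointing is enabled.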
For inference, plan for roughly 2 bytes per parameter at FP16, 1 byte at INT8, and 0.5 bytes at INT4, plus a KV cache that scales with batch and context length. A 7B model that needed 8 GPUs to train can serve from a single GPU once it is quantized.
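The inference side of the same math, with the KV cache made explicit. The layer/head dimensions below assume a Llama-style 7B architecture (32 layers, 32 KV heads of dimension 128); GQA models have far fewer KV heads and a proportionally smaller cache.

```python
def weights_gb(params_billions: float, bytes_per_param: float) -> float:
    """Model weights at a given serving precision (2=FP16, 1=INT8, 0.5=INT4)."""
    return params_billions * bytes_per_param

def kv_cache_gb(batch: int, seq_len: int, n_layers: int = 32,
                n_kv_heads: int = 32, head_dim: int = 128,
                dtype_bytes: int = 2) -> float:
    # 2x for the K and V tensors stored per layer, per token, per request
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * batch * seq_len / 1e9

# 7B at INT8 serving 8 concurrent 4k-token requests (FP16 KV cache)
total = weights_gb(7, 1) + kv_cache_gb(batch=8, seq_len=4096)
print(round(total, 1))  # 24.2 GB -- fits a 40 GB GPU with headroom
```

Note how quickly the cache dominates: at this batch and context, the KV cache (~17 GB) is more than twice the quantized weights.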
The detailed math, including a sizing checklist, lives on the VRAM sizing page.
How throughput differs
Training throughput scales nearly linearly with the GPU's tensor-core throughput at the precision you can use. An H100 trains in roughly a third of the time an A100 needs on the same model, because its FP16/BF16 throughput is roughly three times higher and FP8 unlocks another step. The benchmarks in H100 vs A100 show how that maps to wall-clock time.
Inference throughput is harder to read from the spec sheet. Single-stream decoding spends a lot of its time waiting on memory bandwidth, so an H200 (4.8 TB/s) often beats an H100 (3.35 TB/s) on tokens-per-second per request even though their compute throughput is similar. For batched inference with many concurrent requests, the picture flips back toward compute throughput.
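The bandwidth-bound regime gives a useful back-of-the-envelope ceiling: if every generated token must stream the full weights from VRAM, tokens/sec cannot exceed bandwidth divided by model size. This sketch ignores KV-cache reads, compute, and kernel overheads, so real deployments land well below it.

```python
def decode_tps_ceiling(model_gb: float, bandwidth_tb_s: float) -> float:
    """Upper bound on single-stream tokens/sec for autoregressive decode:
    memory bandwidth divided by the bytes of weights read per token."""
    return bandwidth_tb_s * 1e12 / (model_gb * 1e9)

# 70B model at INT8 (~70 GB of weights)
print(round(decode_tps_ceiling(70, 3.35)))  # H100: ~48 tok/s ceiling
print(round(decode_tps_ceiling(70, 4.8)))   # H200: ~69 tok/s ceiling
```

The same arithmetic explains why quantization helps latency, not just capacity: halving the bytes per parameter doubles the ceiling on the same GPU.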
Cost economics
Training is a burst workload. You spend a lot for a defined number of hours and then walk away. The total bill is dominated by GPU-hours, and the path to lower cost is faster training, spot capacity (see the spot guide), or both.
Inference is a sustained workload. Once you ship the model, the meter runs forever. Total cost is dominated by hourly utilization, and the path to lower cost is smaller / quantized models, batching across requests, autoscaling to zero where the SLA permits, and committed-use or reservation discounts (see pricing models) for the always-on baseline.
Two consequences worth internalizing:
- It is cheap to be wrong about training GPU choice — a 30% inefficiency for a one-week job costs you a few thousand dollars. It is expensive to be wrong about inference GPU choice — the same 30% compounds for the entire serving lifetime of the model.
- Inference cost is largely a function of model size and quantization, not GPU choice, once you are past the "does it fit" threshold. Picking a smaller or quantized model is usually a bigger win than picking a fancier GPU.
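To make the compounding in the first point concrete, a back-of-the-envelope comparison. The $4/GPU-hour rate, fleet sizes, and two-year serving lifetime are illustrative assumptions, not quotes.

```python
HOURS_PER_WEEK = 7 * 24  # 168

def overspend(gpus: int, rate_per_gpu_hour: float, hours: float,
              inefficiency: float = 0.30) -> float:
    """Dollars wasted by a given inefficiency over a given runtime."""
    return gpus * rate_per_gpu_hour * hours * inefficiency

# Same 30% inefficiency: a one-week 8-GPU training job
# vs a 2-GPU endpoint running for two years
train_waste = overspend(8, 4.0, HOURS_PER_WEEK)
serve_waste = overspend(2, 4.0, HOURS_PER_WEEK * 52 * 2)
print(round(train_waste))  # ~$1,613
print(round(serve_waste))  # ~$41,933
```

A one-time mistake vs a mistake multiplied by every hour the endpoint stays up.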
Decision criteria
- Will the model fit on a single GPU at the precision you plan to serve in? If yes, choose the cheapest GPU on which it fits comfortably. If no, you are picking a multi-GPU node and tensor-parallelism support enters the picture.
- What is the latency target? Tight TTFT or strict tokens-per-second targets push toward higher-bandwidth GPUs (H200, B200) or smaller / quantized models.
- How does load vary? Spiky traffic punishes large reserved fleets; flat traffic rewards them.
- Is the workload restartable? Training: usually yes. Inference: rarely, unless you can fail over to another instance instantly.
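The criteria above can be strung together into a toy decision sketch. Everything here is illustrative: the 20% headroom for the KV cache and the branch thresholds are assumptions, not a real capacity planner.

```python
import math

def pick_serving_config(weights_gb: float, gpu_vram_gb: float,
                        tight_latency: bool, spiky_traffic: bool):
    """Toy walk through the decision criteria; thresholds are illustrative."""
    # Does it fit? Reserve ~20% headroom for KV cache and runtime overhead.
    n_gpus = math.ceil(weights_gb * 1.2 / gpu_vram_gb)
    notes = []
    if n_gpus > 1:
        notes.append("multi-GPU: check tensor-parallelism support")
    if tight_latency:
        notes.append("prefer high-bandwidth SKU or quantize further")
    notes.append("autoscale" if spiky_traffic else "reserve baseline capacity")
    return n_gpus, notes

# 70B at INT8 (~70 GB) on 80 GB GPUs, tight latency target, flat traffic
print(pick_serving_config(70, 80, tight_latency=True, spiky_traffic=False))
```

Even this crude version captures the shape of the decision: fit first, then latency, then the pricing model.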
Worked example: same model, two deployments
Consider a 13B-parameter model that the team has just fine-tuned and wants to ship.
- Training the fine-tune: 4–8x A100 80GB or 2–4x H100 80GB, FSDP, BF16 mixed precision, large batch with activation checkpointing. The deciding factor is wall-clock time vs hourly rate.
- Serving the fine-tune: a single A100 40GB, RTX 4090 (for self-hosted dev), or H100 80GB depending on context length, with INT8 or INT4 quantization to reduce both VRAM and per-token bandwidth pressure. The deciding factor is latency target and request volume.
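The serving side of this example reduces to a one-line fit check. The fixed KV-cache budget below is an illustrative knob, not derived from a specific architecture.

```python
def serving_vram_gb(params_billions: float, dtype_bytes: float,
                    kv_budget_gb: float) -> float:
    """Weights at serving precision plus a fixed KV-cache budget."""
    return params_billions * dtype_bytes + kv_budget_gb

# 13B at INT8 with ~6 GB of KV cache fits a 24 GB RTX 4090;
# at FP16 the same model needs the 40 GB A100.
print(serving_vram_gb(13.0, 1, 6))  # 19.0 GB
print(serving_vram_gb(13.0, 2, 6))  # 32.0 GB
```

The quantization decision, not the GPU catalog, is what moves this model between consumer and datacenter hardware.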
Common mistakes
- Putting inference on the same SKU you trained on "to keep things simple". The training cluster usually has 5–10× more compute than serving needs, and you pay for it idle.
- Picking the inference GPU before deciding on a quantization strategy. Quantization decisions change the VRAM budget by 2–4×, and that changes the GPU.
- Optimizing only the model, not the serving stack. Switching from a generic loop to a batched runtime (vLLM, TensorRT-LLM, TGI) often beats moving up a GPU tier.
- Putting an inference SLA on spot capacity. Spot can absorb interruptions for some serving patterns, but most production endpoints cannot tolerate the variance.
Related reading
- VRAM sizing for LLMs — the memory math behind both workloads.
- H100 vs A100 comparison — how training-relevant numbers compare.
- Pricing models compared — different discount structures fit training vs inference.
- GPU cost calculator — model the cost of either workload.