Inference vs Training: How the GPU Choice Differs

Last reviewed on 2026-04-30 · 8 min read

The same conversation comes up on every team that has shipped a model: the GPU that trained it is not the GPU that should serve it. Training and inference have different bottlenecks, different sensitivities to memory and bandwidth, and different cost structures. Treating them as the same workload locks teams into either an over-priced inference fleet or an under-sized training cluster.

This page lays out where the two diverge, which GPU each one prefers, and how to think about the cost of each.

Why the workloads are different

Training is dominated by throughput. The metric that matters is samples-per-second across the entire dataset for a fixed wall-clock budget. Each step does a forward pass, a backward pass, and an optimizer update; memory holds the weights, gradients, optimizer state, and activations stored for backprop. The job is roughly compute-bound when you can keep the GPU's tensor cores busy with large matrix multiplies, which is why batch size and sequence length matter so much.

Inference is dominated by latency at fixed batch shape. The metric that matters is time-to-first-token and tokens-per-second per request, sometimes alongside aggregate throughput across many concurrent requests. Memory only needs to hold the weights plus the KV cache for the active requests; there are no gradients or optimizer states. The job is often memory-bandwidth bound, especially for autoregressive decoding where each generated token reads the full weights from VRAM.

The mental model: training likes lots of compute and lots of VRAM headroom. Inference likes lots of memory bandwidth and the smallest VRAM that still fits the model and KV cache. The two recommendations rarely point at the same SKU.

What each workload prefers

| Dimension | Training | Inference |
| --- | --- | --- |
| Dominant bottleneck | Compute (FP8/FP16 throughput) | Memory bandwidth, then compute |
| VRAM driver | Weights + grads + optimizer + activations | Weights + KV cache |
| Best precision | BF16 / FP16 mixed; FP8 where supported | FP16 / BF16, then INT8 / INT4 quantized |
| Batch size | As large as memory allows | Often 1 (single user), up to dynamic batching for throughput |
| Multi-GPU | Routine (FSDP, DeepSpeed, Megatron) | Only when the model exceeds single-GPU VRAM (tensor parallelism) |
| Pricing model fit | Spot / preemptible viable with checkpoints | On-demand or reserved; spot rarely fits SLAs |

This is why a 70B fine-tune might end up on 8x H100 80GB while serving the same 70B model in production might run on 2x H100 (or even 1x H200 141GB) with INT8 quantization. The training cluster is sized for memory + compute; the inference deployment is sized for "smallest config that hits the latency target".

How memory differs in practice

For training, plan for roughly 16 bytes per parameter for a vanilla mixed-precision Adam loop (weights + grads + 2 Adam moments) before activations enter the picture. Sharding across GPUs with FSDP or DeepSpeed ZeRO-3 reduces the per-GPU number proportionally. Activation checkpointing trades extra compute for lower activation memory.
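As a rough sketch of that arithmetic (the 16-bytes-per-parameter figure is the vanilla mixed-precision Adam assumption above, and activations are deliberately left out), the per-GPU state footprint under full sharding looks like this:

```python
def training_vram_gb(params_billion: float, num_gpus: int = 1,
                     bytes_per_param: int = 16) -> float:
    """Per-GPU VRAM (GB) for weights + gradients + Adam moments,
    assuming the state is fully sharded (FSDP / ZeRO-3) across num_gpus.
    Activations are NOT included and often add a large chunk on top."""
    total_bytes = params_billion * 1e9 * bytes_per_param
    return total_bytes / num_gpus / 1e9

# 7B model sharded over 8 GPUs: ~14 GB of optimizer/weight state per GPU,
# before activations and framework overhead.
print(f"{training_vram_gb(7, num_gpus=8):.0f} GB per GPU")
```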

For inference, plan for roughly 2 bytes per parameter at FP16, 1 byte at INT8, and 0.5 bytes at INT4, plus a KV cache that scales with batch size and context length. A 7B model that needed 8 GPUs to train can be served from a single GPU once it is quantized.
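A minimal serving-side sketch under the same rules of thumb. The layer count, KV-head count, and head dimension below are illustrative 7B-class values, not any specific model's published config:

```python
def weights_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Weight memory in GB: 2 bytes/param at FP16, 1 at INT8, 0.5 at INT4."""
    return params_billion * 1e9 * bytes_per_param / 1e9

def kv_cache_gb(batch: int, seq_len: int, num_layers: int,
                num_kv_heads: int, head_dim: int,
                bytes_per_elem: int = 2) -> float:
    """KV cache: two tensors (K and V) per layer, per token, per KV head."""
    elems = 2 * num_layers * num_kv_heads * head_dim * batch * seq_len
    return elems * bytes_per_elem / 1e9

# Illustrative 7B-class model (32 layers, 32 KV heads, head_dim 128):
w = weights_gb(7, bytes_per_param=1.0)            # INT8 weights: ~7 GB
kv = kv_cache_gb(batch=8, seq_len=4096, num_layers=32,
                 num_kv_heads=32, head_dim=128)   # ~17 GB of KV cache
print(f"weights {w:.1f} GB + KV cache {kv:.1f} GB")
```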

The detailed math, including a sizing checklist, lives on the VRAM sizing page.

How throughput differs

Training throughput scales nearly linearly with the GPU's tensor-core throughput at the precision you can use. An H100 trains in roughly a third of the time an A100 needs on the same model, because its FP16/BF16 throughput is roughly three times higher and FP8 unlocks another step. The benchmarks in H100 vs A100 show how that maps to wall-clock time.

Inference throughput is harder to read from the spec sheet. Single-stream decoding spends a lot of its time waiting on memory bandwidth, so an H200 (4.8 TB/s) often beats an H100 (3.35 TB/s) on tokens-per-second per request even though their compute throughput is similar. For batched inference with many concurrent requests, the picture flips back toward compute throughput.
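The back-of-envelope behind that observation: at batch size 1, every generated token streams the full weights from VRAM, so memory bandwidth divided by model size gives a hard ceiling on tokens-per-second per request. The bandwidth figures are the ones quoted above; the result is an upper bound, not a benchmark:

```python
def decode_tps_ceiling(model_gb: float, bandwidth_tb_s: float) -> float:
    """Upper bound on single-stream decode speed: each new token
    has to read every weight byte from VRAM once."""
    return bandwidth_tb_s * 1000 / model_gb  # (GB/s) / (GB per token)

model_gb = 140  # 70B model at FP16
for name, bw in [("H100, 3.35 TB/s", 3.35), ("H200, 4.8 TB/s", 4.8)]:
    print(f"{name}: <= {decode_tps_ceiling(model_gb, bw):.0f} tokens/s per request")
# Roughly 24 vs 34 tokens/s: the bandwidth gap, not compute, sets the
# single-request speed difference between the two cards.
```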

Cost economics

Training is a burst workload. You spend a lot for a defined number of hours and then walk away. The total bill is dominated by GPU-hours, and the path to lower cost is faster training, spot capacity (see the spot guide), or both.

Inference is a sustained workload. Once you ship the model, the meter runs forever. Total cost is dominated by hourly utilization, and the path to lower cost is smaller / quantized models, batching across requests, autoscaling to zero where the SLA permits, and committed-use or reservation discounts (see pricing models) for the always-on baseline.
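A toy comparison makes the asymmetry concrete. The hourly rate and run lengths below are placeholders for illustration, not quotes:

```python
HOURLY_RATE = 3.00  # hypothetical on-demand $/GPU-hour, for illustration only

training_cost  = 8 * 72 * HOURLY_RATE          # 8 GPUs for a 72-hour run, once
inference_cost = 2 * 24 * 30 * HOURLY_RATE     # 2 GPUs, 24/7, for one month

print(f"one-off training run: ${training_cost:,.0f}")
print(f"serving, per month:   ${inference_cost:,.0f}")
# ~$1,728 once vs ~$4,320 every month: after a few months the serving bill
# dwarfs the training bill, which is why quantization, batching, and
# reservations matter more than shaving hours off the fine-tune.
```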

Two consequences worth internalising:

  1. Training cost is a one-off bill that scales with GPU-hours, so a faster, pricier GPU (or interruptible spot capacity with solid checkpointing) can still be the cheaper way to finish the run.
  2. Inference cost scales with time in service, so every efficiency gain (quantization, batching, right-sized hardware) is multiplied by every hour the endpoint stays up.

Decision criteria

  1. Will the model fit on a single GPU at the precision you plan to serve in? If yes, choose the cheapest GPU on which it fits comfortably (a sketch of that check follows this list). If no, you are picking a multi-GPU node and tensor-parallelism support enters the picture.
  2. What is the latency target? Tight TTFT or strict tokens-per-second targets push toward higher-bandwidth GPUs (H200, B200) or smaller / quantized models.
  3. How does load vary? Spiky traffic punishes large reserved fleets; flat traffic rewards them.
  4. Is the workload restartable? Training: usually yes. Inference: rarely, unless you can fail over to another instance instantly.
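One way to script the first criterion, given the memory estimates from earlier. The candidate list, its price ordering, and the 90% usable-VRAM figure are illustrative assumptions:

```python
# Candidate GPUs, cheapest first (VRAM in GB; list and ordering are illustrative).
CANDIDATES = [("L40S", 48), ("A100 80GB", 80), ("H100 80GB", 80), ("H200", 141)]

def first_gpu_that_fits(model_gb: float, kv_cache_gb: float,
                        usable_fraction: float = 0.9):
    """Return the first (cheapest) candidate whose usable VRAM covers
    weights + KV cache; None means a multi-GPU node is needed."""
    need = model_gb + kv_cache_gb
    for name, vram in CANDIDATES:
        if vram * usable_fraction >= need:
            return name
    return None

# 13B at FP16 (~26 GB) plus a ~10 GB KV-cache budget fits the cheapest card here.
print(first_gpu_that_fits(model_gb=26, kv_cache_gb=10))  # -> "L40S"
```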

Worked example: same model, two deployments

Consider a 13B-parameter model that the team has just fine-tuned and wants to ship.
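A back-of-envelope sketch of the two deployments, reusing the rules of thumb from the memory section. The 4-way sharding, concurrency, and KV-cache allowance are assumptions for illustration rather than a sizing recommendation:

```python
# Worked numbers for a 13B model, using the rough rules of thumb above.
params_b = 13

# Training: ~16 bytes/param for weights + grads + Adam moments, before activations.
train_state_gb = params_b * 16          # ~208 GB of state in total
per_gpu_gb_4way = train_state_gb / 4    # ~52 GB/GPU sharded 4 ways (FSDP/ZeRO-3)

# Serving: FP16 vs INT8 weights, plus a KV-cache allowance for a handful of
# concurrent 4K-token requests (the allowance is an assumption, not a measurement).
serve_fp16_gb = params_b * 2            # ~26 GB
serve_int8_gb = params_b * 1            # ~13 GB
kv_budget_gb = 10

print(f"training, 4-way sharded: ~{per_gpu_gb_4way:.0f} GB of state per GPU")
print(f"serving FP16: ~{serve_fp16_gb + kv_budget_gb} GB, "
      f"serving INT8: ~{serve_int8_gb + kv_budget_gb} GB")
# ~52 GB/GPU to fine-tune on a 4x 80GB node (activations eat much of the rest),
# versus ~36 GB to serve at FP16 or ~23 GB at INT8 on a single card.
```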

Common mistakes

Related reading