Key takeaways
- Inference cost per token — not training cost — dominates the lifetime economics of a production LLM serving stack once traffic is non-trivial.
- Continuous batching, popularised alongside vLLM’s PagedAttention, lifts GPU utilisation by 2-4x over static batching by scheduling at the token level rather than the request level.
- The KV cache — not the model weights — is usually the binding memory constraint for long-context serving, which is why paged attention, prefix caching, and eviction policies matter.
- Quantization (INT8, INT4, FP8) reduces memory bandwidth pressure and lets larger models fit on fewer GPUs, with accuracy losses typically under 1% for well-tuned schemes.
- Tensor and pipeline parallelism are complementary: tensor parallelism reduces per-GPU compute inside a layer, pipeline parallelism partitions layers across GPUs, and both are standard in frameworks like vLLM, TensorRT-LLM, and SGLang.
Why inference is the hard part
Training a frontier model is a capital expense that happens a handful of times per year. Serving that model to users happens millions of times per hour and never stops. For most production systems the cumulative inference spend overtakes training spend within weeks, which is why the serving stack — not the model architecture — determines whether a product is economically viable. See the large language models primer for background on the models themselves.

LLM inference has two phases with very different computational profiles. The prefill phase processes the entire input prompt in parallel and is compute-bound — GPUs are doing matrix multiplies at high utilisation. The decode phase generates tokens one at a time and is memory-bound — each new token requires streaming the full KV cache plus model weights through the GPU, and the arithmetic intensity is low. Optimisations for prefill and decode look different, and a serving stack tuned for one can be disastrous for the other.
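The prefill/decode asymmetry can be made concrete with a roofline-style estimate. The sketch below is a back-of-envelope calculation, not a benchmark; the hardware numbers (roughly H100-class) and the 2-FLOPs-per-parameter-per-token rule of thumb are stated assumptions.

```python
# Back-of-envelope roofline: why prefill is compute-bound and decode is
# memory-bound. All hardware numbers are illustrative assumptions
# (roughly H100-class: ~1000 TFLOP/s FP16, ~3.35 TB/s HBM bandwidth).
PEAK_TFLOPS = 1000e12        # FP16 tensor-core throughput, FLOP/s (assumed)
PEAK_BW = 3.35e12            # HBM bandwidth, bytes/s (assumed)
PARAMS = 70e9                # model parameters
BYTES_PER_PARAM = 2          # FP16

def arithmetic_intensity(tokens_per_step: int) -> float:
    """FLOPs per byte of weights streamed for one forward step.

    A forward pass costs ~2 FLOPs per parameter per token; the weights
    (PARAMS * 2 bytes) must be streamed from HBM once per step regardless
    of how many tokens are processed in that step.
    """
    flops = 2 * PARAMS * tokens_per_step
    bytes_moved = PARAMS * BYTES_PER_PARAM
    return flops / bytes_moved

# The hardware "balance point": FLOPs per byte the GPU can sustain.
machine_balance = PEAK_TFLOPS / PEAK_BW

prefill = arithmetic_intensity(2048)   # whole prompt in one step
decode = arithmetic_intensity(1)       # one token per step
print(f"machine balance: {machine_balance:.0f} FLOP/byte")
print(f"prefill (2048 tokens): {prefill:.0f} FLOP/byte (compute-bound)")
print(f"decode (1 token):      {decode:.0f} FLOP/byte (memory-bound)")
```

Prefill's intensity sits far above the machine balance point, decode's far below it, which is why the two phases reward different optimisations.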
Batching strategies
Static batching
The naive approach groups N requests, pads them to the longest sequence, runs them through the model together, and returns all outputs at once. This wastes compute on padding tokens and — critically — stalls the entire batch until the slowest sequence finishes decoding. For chat workloads with highly variable output lengths, static batching leaves 50-80% of GPU time idle.
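The padding cost is easy to quantify. A toy calculation, not tied to any framework's API:

```python
# Toy illustration: how much compute static batching wastes on padding
# when sequence lengths in a batch vary widely.
def padding_waste(lengths: list[int]) -> float:
    """Fraction of token slots in a padded batch that are padding."""
    padded = max(lengths) * len(lengths)   # every row padded to the longest
    real = sum(lengths)
    return 1 - real / padded

# A chat-like batch: one long sequence forces heavy padding on the rest.
print(padding_waste([512, 64, 32, 30]))   # most slots are padding
```

One outlier sequence is enough to waste the majority of the batch's token slots, which is the failure mode continuous batching removes.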
Dynamic batching
Dynamic batching improves on this by grouping whatever requests are in the queue at the moment the previous batch finishes. It reduces queuing latency but still suffers from the slowest-sequence problem within a batch. NVIDIA Triton Inference Server and most early serving stacks implement dynamic batching.
Continuous batching
Continuous batching, sometimes called in-flight batching or iteration-level scheduling, schedules at the token level rather than the request level. When a sequence finishes, a new one slots in on the next iteration without waiting for the rest of the batch. The technique was introduced as iteration-level scheduling by Orca (Yu et al., 2022) and popularised by the vLLM paper (Kwon et al., 2023) alongside PagedAttention; it is now the default in vLLM, TensorRT-LLM, SGLang, and TGI. Reported throughput gains are typically 2-4x over dynamic batching for chat-style workloads.
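A toy scheduler makes the idea concrete. This sketch models only decode, with each request reduced to a count of tokens it still needs; real schedulers also handle prefill, preemption, and KV-cache admission, so treat it as an illustration of iteration-level backfilling only.

```python
# Minimal sketch of iteration-level (continuous) batching: the scheduler
# runs one decode step at a time and backfills finished slots from the
# queue immediately, instead of waiting for the whole batch to drain.
from collections import deque

def continuous_batch(queue: deque, max_batch: int) -> int:
    """Each queue item is the number of tokens a request still needs.
    Returns the number of decode iterations used to serve everything."""
    running: list[int] = []
    steps = 0
    while queue or running:
        # Backfill free slots at every iteration boundary.
        while queue and len(running) < max_batch:
            running.append(queue.popleft())
        steps += 1
        # One decode iteration: every running sequence emits one token;
        # finished sequences leave the batch immediately.
        running = [r - 1 for r in running if r - 1 > 0]
    return steps

# Three requests needing 3, 1, and 5 tokens on a 2-slot batch finish in
# 6 iterations; static batching ([3,1] then [5]) would need 8.
print(continuous_batch(deque([3, 1, 5]), max_batch=2))
```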
KV cache management
Every token generated by a transformer references the keys and values of all previous tokens via attention. Recomputing them on every step would be prohibitive, so serving stacks cache them. For a 70B model with 8K context and FP16, the KV cache per sequence is on the order of gigabytes — quickly larger than the model weights once batch size climbs.
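That gigabytes figure is straightforward to derive. The config below is an assumption loosely modelled on a Llama-70B-style model with grouped-query attention (80 layers, 8 KV heads, head dim 128, FP16); substitute real values for a given model.

```python
# KV-cache footprint, back of the envelope. Defaults are assumed,
# Llama-70B-like values with grouped-query attention.
def kv_cache_bytes(seq_len: int, layers: int = 80, kv_heads: int = 8,
                   head_dim: int = 128, dtype_bytes: int = 2) -> int:
    # Per token, each layer stores one K and one V vector per KV head.
    per_token = layers * 2 * kv_heads * head_dim * dtype_bytes
    return seq_len * per_token

gb = kv_cache_bytes(8192) / 2**30
print(f"8K-context sequence: {gb:.2f} GiB of KV cache")   # 2.50 GiB
```

At roughly 2.5 GiB per 8K-context sequence, a batch of a few dozen sequences overtakes the weights of even a large model, which is why cache management dominates serving design.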
PagedAttention
Traditional KV cache allocation reserved contiguous memory for the maximum sequence length, wasting space on shorter sequences and causing fragmentation. PagedAttention borrows the virtual-memory paging idea from operating systems: the cache is split into fixed-size blocks, allocated on demand, and referenced through a block table. Memory utilisation rises from around 20-40% to 95%+ in published benchmarks.
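A minimal allocator sketch shows the bookkeeping. Block size and the class interface here are illustrative choices, not vLLM's actual internals.

```python
# Toy paged KV-cache allocator in the spirit of PagedAttention: fixed-size
# blocks handed out on demand, tracked per sequence through a block table.
class PagedAllocator:
    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free = list(range(num_blocks))
        self.tables: dict[str, list[int]] = {}   # seq id -> block ids
        self.lengths: dict[str, int] = {}        # seq id -> token count

    def append_token(self, seq: str) -> None:
        """Account for one new token; grab a fresh block only when the
        current last block is full (on-demand allocation)."""
        n = self.lengths.get(seq, 0)
        table = self.tables.setdefault(seq, [])
        if n % self.block_size == 0:   # last block full, or no blocks yet
            if not self.free:
                raise MemoryError("KV cache exhausted: evict or swap")
            table.append(self.free.pop())
        self.lengths[seq] = n + 1

    def release(self, seq: str) -> None:
        """Sequence finished: its blocks return to the free pool."""
        self.free.extend(self.tables.pop(seq, []))
        self.lengths.pop(seq, None)
```

With 16-token blocks, a 20-token sequence holds exactly two blocks instead of a contiguous max-length slab, and freed blocks are immediately reusable by any other sequence.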
Prefix caching
When many requests share a common prefix — a long system prompt, a RAG-retrieved document, a few-shot example set — recomputing the prefix’s KV cache for each request is pure waste. Prefix caching stores and reuses the KV cache for repeated prefixes. Frameworks like SGLang and vLLM implement it natively, with radix-tree or hash-based lookup. Realistic deployments with a shared system prompt see 30-70% prefill-time reductions.
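A hash-based lookup can be sketched in a few lines. The interface below is hypothetical; SGLang's RadixAttention uses a radix tree with block-granular matching rather than whole-prefix hashes.

```python
# Minimal hash-based prefix cache sketch: computed KV state is stored
# keyed by a hash of the token prefix, and a new request reuses the
# longest cached prefix, prefilling only its suffix.
import hashlib

class PrefixCache:
    def __init__(self):
        self.store: dict[str, object] = {}   # prefix hash -> cached KV state

    @staticmethod
    def _key(tokens: tuple[int, ...]) -> str:
        return hashlib.sha256(repr(tokens).encode()).hexdigest()

    def put(self, tokens: tuple[int, ...], kv_state: object) -> None:
        self.store[self._key(tokens)] = kv_state

    def longest_hit(self, tokens: tuple[int, ...]):
        """Return (matched_len, kv_state) for the longest cached prefix,
        or (0, None) on a miss. Linear scan for clarity; a radix tree
        makes this a single walk."""
        for n in range(len(tokens), 0, -1):
            kv = self.store.get(self._key(tokens[:n]))
            if kv is not None:
                return n, kv
        return 0, None
```

Cache a 1,000-token system prompt once and every subsequent request that starts with it skips that much prefill, which is where the 30-70% reductions come from.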
Eviction and offloading
When KV cache memory is exhausted, something has to give. Policies include evicting least-recently-used sequences, swapping cache to CPU memory, or recomputing from scratch. The right choice depends on whether the workload is latency-sensitive (evict) or throughput-oriented (swap).
Quantization
Quantization reduces the numerical precision of model weights and activations — typically from FP16 or BF16 down to INT8, INT4, or FP8. Memory footprint shrinks proportionally, memory bandwidth demand drops, and on hardware with dedicated low-precision units (NVIDIA H100, B200; AMD MI300) throughput rises. A 70B model that needs two 80GB GPUs in FP16 can fit on one in INT4.
Weight-only vs. activation quantization
Weight-only quantization (GPTQ, AWQ, bitsandbytes NF4) compresses just the weights; activations stay in higher precision. This is simple and preserves accuracy well but gives up some speed, because dequantization happens on the fly. Full weight-and-activation quantization (SmoothQuant, FP8 schemes) is faster but harder to calibrate without accuracy loss.
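A pure-Python toy shows the weight-only scheme: symmetric per-row INT8 with one scale per row. Real kernels (GPTQ, AWQ) use grouped scales and fused dequant-matmul on GPU; this only illustrates the arithmetic.

```python
# Weight-only quantization sketch: symmetric per-row INT8, dequantized
# on the fly at matmul time. Toy code, one scale per weight row.
def quantize_row(w: list[float]) -> tuple[list[int], float]:
    """Map FP weights to int8 with one scale per row: w ~= q * scale."""
    scale = max(abs(x) for x in w) / 127 or 1.0   # guard all-zero rows
    q = [round(x / scale) for x in w]
    return q, scale

def dequantize_row(q: list[int], scale: float) -> list[float]:
    return [x * scale for x in q]

row = [0.12, -0.5, 0.33, 0.02]
q, s = quantize_row(row)
restored = dequantize_row(q, s)
err = max(abs(a - b) for a, b in zip(row, restored))
print(q, f"max abs error {err:.4f}")   # error bounded by scale / 2
```

The storage cost drops from 2 bytes to 1 byte per weight (0.5 bytes for INT4 schemes), while the error per weight stays below half the scale, which is why accuracy holds up when the scales are tuned per group rather than per tensor.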
FP8 on modern hardware
NVIDIA’s Hopper and Blackwell architectures have native FP8 tensor cores. FP8 offers close-to-FP16 accuracy with half the memory and roughly double the arithmetic throughput. Frameworks such as TensorRT-LLM expose FP8 kernels directly, and it is becoming the default precision for new deployments on H100-class hardware.
Accuracy considerations
Aggressive quantization, particularly INT4, can degrade long-context reasoning and code-generation quality in ways that standard benchmarks miss. Teams running production LLMs should evaluate quantized variants on task-specific data rather than trusting published perplexity numbers.
Speculative decoding
Decode is memory-bound, meaning the GPU is mostly idle waiting on memory reads. Speculative decoding exploits this by using a small, fast “draft” model to propose several tokens ahead, then having the large “target” model verify them in a single forward pass. Accepted draft tokens cost only one target decode step between them; at the first mismatch the target’s own prediction is kept and the rest of the draft is discarded, so output quality matches the target model exactly under greedy decoding. Published speedups range from 2x to 3x when draft and target models are well-matched.
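The accept/verify loop can be sketched with stand-in models. Both `draft` and `target` below are plain functions mapping a context to a greedy next token, a deliberate simplification of real sampling-based verification.

```python
# Schematic speculative decoding step (greedy verification): a cheap draft
# proposes k tokens, the target checks them in one pass, and generation
# keeps the longest agreeing run plus one corrected or bonus token.
def speculative_step(prefix: list[int], draft, target, k: int = 4):
    """Returns all tokens produced by one target forward pass."""
    # Draft phase: k cheap autoregressive proposals.
    proposal, ctx = [], list(prefix)
    for _ in range(k):
        t = draft(ctx)
        proposal.append(t)
        ctx.append(t)
    # Verify phase: one target pass scores every proposed position.
    accepted, ctx = [], list(prefix)
    for t in proposal:
        want = target(ctx)
        if t != want:
            accepted.append(want)    # target's correction comes for free
            return accepted
        accepted.append(t)
        ctx.append(t)
    accepted.append(target(ctx))     # bonus token after full acceptance
    return accepted

# Stand-in "models": deterministic next-token functions for illustration.
target = lambda ctx: len(ctx) % 7
perfect_draft = target
print(speculative_step([0], perfect_draft, target, k=4))  # 5 tokens/pass
```

When the draft always agrees, each target pass yields k+1 tokens; when it always disagrees, the system degrades gracefully to one token per pass, never worse than plain decoding.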
Variants include Medusa (multiple prediction heads on the target model, no separate draft), Lookahead Decoding (algorithmic self-speculation), and EAGLE (feature-level speculation). All are supported in some combination by current serving frameworks.
Parallelism at scale
Tensor parallelism
Tensor parallelism splits each layer’s weight matrices across multiple GPUs, with an all-reduce inside each layer to combine partial results. It keeps per-token latency low and is the standard way to serve a model that does not fit on one GPU. High inter-GPU bandwidth (NVLink, NVSwitch) is required; over PCIe it becomes the bottleneck.
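A CPU simulation shows the sharding pattern. Splitting the output dimension of a linear layer across shards is the column-parallel case; the concatenation stands in for the collective, and everything else (NCCL, attention-head sharding) is abstracted away.

```python
# Column-parallel linear layer, simulated on CPU: the weight matrix is
# split along its output dimension across "devices", each shard computes
# a slice of the output, and concatenation stands in for the all-gather.
def matvec(w: list[list[float]], x: list[float]) -> list[float]:
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def column_parallel(w: list[list[float]], x: list[float], shards: int):
    """w is out_dim x in_dim; out_dim is assumed divisible by shards.
    Each shard holds 1/shards of the weights and computes its slice."""
    per = len(w) // shards
    parts = [matvec(w[i * per:(i + 1) * per], x) for i in range(shards)]
    return [v for part in parts for v in part]   # all-gather stand-in

w = [[1, 0], [0, 1], [2, 2], [3, 0]]
x = [1, 2]
assert column_parallel(w, x, shards=2) == matvec(w, x)  # same result
```

The complementary row-parallel case splits the input dimension instead, so each shard produces partial sums over the full output and the collective is an all-reduce rather than an all-gather; Megatron-style layers alternate the two to minimise communication.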
Pipeline parallelism
Pipeline parallelism assigns different layers to different GPUs. It increases throughput by overlapping stages but adds latency because each token passes through all pipeline stages. It is most useful when scaling to dozens of GPUs per model, where tensor parallelism alone runs out of bandwidth.
Expert parallelism for MoE
Mixture-of-Experts models like Mixtral, DeepSeek-V3, and Qwen3-MoE route each token to a subset of experts. Expert parallelism distributes experts across GPUs, with token routing via all-to-all communication. This is how sparse trillion-parameter models are served at reasonable cost — only active experts contribute to per-token compute.
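Top-k routing and the per-expert dispatch can be sketched as follows; the router scores and expert count are illustrative.

```python
# Top-k expert routing sketch: a router scores experts per token, each
# token is dispatched to its top-k experts, and under expert parallelism
# each expert's token bucket lives on a different GPU, exchanged via
# all-to-all communication.
def route(scores: list[list[float]], k: int = 2) -> list[list[int]]:
    """scores[t][e] = router logit of token t for expert e.
    Returns the chosen expert ids per token."""
    return [
        sorted(range(len(s)), key=lambda e: s[e], reverse=True)[:k]
        for s in scores
    ]

def dispatch(assignments: list[list[int]], num_experts: int):
    """Group token indices by expert: the per-GPU buckets that the
    all-to-all would exchange."""
    buckets = [[] for _ in range(num_experts)]
    for tok, experts in enumerate(assignments):
        for e in experts:
            buckets[e].append(tok)
    return buckets

scores = [[0.1, 0.9, 0.5], [0.8, 0.2, 0.7]]
print(dispatch(route(scores, k=2), num_experts=3))
```

Uneven bucket sizes are exactly the routing imbalance that becomes an operational concern at scale: a hot expert's GPU saturates while the others idle.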
Serving frameworks in practice
The landscape in 2026 is narrowing around a small set of mature open-source stacks:
- vLLM — the reference implementation of PagedAttention and continuous batching. Widest model support, strong community, reasonable default choice for most teams.
- TensorRT-LLM — NVIDIA’s high-performance stack. Best raw throughput on NVIDIA hardware, particularly FP8 and kernel-fused paths, but slower to add new models and tied to the NVIDIA toolchain.
- SGLang — strong on structured generation, prefix caching (RadixAttention), and programmable control flow. Popular for agentic and RAG workloads.
- Text Generation Inference (TGI) — Hugging Face’s production server, tight integration with the Hub, solid default for HF-centric stacks.
- Ray Serve — general-purpose Python serving layer; pairs with vLLM or custom engines for ML-specific workloads, useful when the application is bigger than just the model. For the broader operational context see the mlops primer.
Benchmark numbers between these frameworks shift with every release. Any team committing to one should run its own workload-representative benchmarks rather than trusting vendor marketing.
Latency vs. throughput tradeoffs
Inference optimisation is ultimately a Pareto curve between per-request latency (time to first token, time per output token) and cluster throughput (tokens per second across all users). Continuous batching and large batch sizes push throughput up but can raise tail latency. Tensor parallelism lowers latency but increases cost per token. Quantization lowers cost but can hurt quality. Speculative decoding lowers latency but needs a second model in memory.
The right operating point depends on the product. A user-facing chat UI prioritises time-to-first-token and tight tail latency. A batch summarisation pipeline prioritises throughput and is tolerant of seconds of queuing. A coding assistant with tool use prioritises structured-output support and low per-token latency. There is no single “optimal” configuration — only configurations tuned to a workload. See the transformers primer for how the underlying architecture shapes these tradeoffs.
Frequently asked questions
Which framework should a team start with in 2026?
vLLM is the default recommendation — broadest model coverage, active development, and good out-of-the-box performance. Teams with heavy NVIDIA investment and willingness to wait for model support may get better absolute numbers from TensorRT-LLM. Teams building agentic or RAG-heavy products should evaluate SGLang for its structured generation and prefix caching. Running a representative workload through two or three candidates for a weekend before committing is cheap insurance.
Is INT4 quantization safe for production?
For most chat and summarisation workloads, well-tuned INT4 (AWQ, GPTQ) preserves quality within measurement noise. For long-context reasoning, code generation, and agentic tool use, quality can degrade in ways that generic benchmarks miss. The rule is to evaluate on task-specific data, not trust published perplexity or MMLU scores. FP8 on H100-class hardware is a safer default when available.
How does serving change for mixture-of-experts models?
MoE models shift the bottleneck from dense compute to memory capacity and all-to-all bandwidth. Expert parallelism is standard, routing imbalance becomes an operational concern, and KV cache pressure grows because active batch sizes are larger. Frameworks that handle MoE well (vLLM, SGLang, TensorRT-LLM with recent releases) are not interchangeable with dense-model-only stacks. Teams deploying MoE should budget for more tuning and more per-model engineering than dense deployments.