The serving bottleneck
Training a large language model is expensive, but serving it in production presents equally complex challenges. Each generation request requires a KV cache (key-value cache) that grows proportionally to sequence length and the number of model layers. With hundreds of concurrent requests, GPU memory management becomes the limiting factor: the naive approach pre-allocates contiguous memory blocks for each sequence, wasting 60% to 80% of available memory due to fragmentation and conservative pre-allocation.
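To make the memory pressure concrete, here is a rough back-of-the-envelope estimate of KV cache size per sequence. The function and the model dimensions (40 layers, 40 heads, head dimension 128, fp16) are illustrative assumptions for a 13B-class model, not taken from the text above.

```python
# Rough KV-cache size estimate: 2 tensors (K and V) per layer, each of
# shape [num_heads, seq_len, head_dim], at dtype_bytes per element.
# All specific numbers below are illustrative assumptions.
def kv_cache_bytes(num_layers, num_heads, head_dim, seq_len, dtype_bytes=2):
    return 2 * num_layers * num_heads * head_dim * seq_len * dtype_bytes

# A hypothetical 13B-class model (40 layers, 40 heads, dim 128, fp16)
# at a 2048-token context:
per_seq = kv_cache_bytes(40, 40, 128, 2048)
print(per_seq / 2**30)  # → 1.5625 (GiB per sequence)
```

At roughly 1.6 GiB per sequence, a few dozen concurrent requests exhaust even an 80 GB GPU, which is why fragmentation and over-allocation matter so much.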
PagedAttention: virtual memory for Transformers
vLLM, developed by the research team at UC Berkeley, introduces PagedAttention, a KV cache management mechanism inspired by the concept of virtual memory paging in operating systems. Instead of allocating contiguous blocks for the entire sequence, PagedAttention divides the KV cache into fixed-size blocks (the “pages”) that can be allocated at non-contiguous positions in GPU memory.
A mapping table — analogous to an operating system’s page table — translates the logical addresses of the sequence into physical addresses of the blocks. Blocks are allocated on-demand as generation proceeds, eliminating the need to pre-allocate memory for the maximum sequence length.
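The block-table idea can be sketched in a few lines. This is a minimal illustration of the mechanism, not the real vLLM data structures; the class names, the 16-token block size, and the free-list allocator are all assumptions made for clarity.

```python
BLOCK_SIZE = 16  # tokens per KV block (illustrative; vLLM's is configurable)

class BlockAllocator:
    """Hands out physical block ids from a free pool (hypothetical helper)."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def alloc(self):
        return self.free.pop()

class Sequence:
    """Tracks one sequence's logical-to-physical block mapping."""
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []   # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # Allocate a new physical block only when the current one is full:
        # this is the on-demand allocation described above.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.alloc())
        self.num_tokens += 1

    def physical_slot(self, pos):
        # The "page table" lookup: logical position -> (block id, offset).
        return self.block_table[pos // BLOCK_SIZE], pos % BLOCK_SIZE
```

A 20-token sequence occupies exactly two blocks here, even though those blocks may sit anywhere in GPU memory; no space is reserved for tokens that have not been generated yet.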
Continuous batching and sharing
The second fundamental contribution is continuous batching: new requests are inserted into the execution batch as soon as a sequence finishes, without waiting for the entire batch to complete. This eliminates idle time where the GPU waits for all sequences in the batch to reach completion.
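The scheduling policy can be illustrated with a toy loop. This sketch models each request as an id plus a number of decode steps; the function name and batching details are assumptions, but the key property matches the description above: a finished sequence frees its slot immediately, and a waiting request fills it on the very next step.

```python
from collections import deque

def serve(requests, max_batch=4, steps=100):
    """Toy continuous-batching loop (illustrative, not vLLM's scheduler).

    requests: iterable of (id, tokens_to_generate) pairs.
    Returns the ids in order of completion.
    """
    waiting = deque(requests)
    running, done = [], []
    for _ in range(steps):
        # Admit waiting requests into any free batch slots before each step.
        while waiting and len(running) < max_batch:
            running.append(list(waiting.popleft()))
        if not running and not waiting:
            break
        for seq in running:
            seq[1] -= 1  # one decode step for every active sequence
        finished = [s for s in running if s[1] == 0]
        running = [s for s in running if s[1] > 0]
        done.extend(s[0] for s in finished)  # slots free up immediately
    return done

print(serve([("a", 2), ("b", 5), ("c", 1), ("d", 3), ("e", 1)], max_batch=2))
# → ['a', 'c', 'b', 'd', 'e']
```

With static batching, "c" would have to wait until both "a" and "b" finished; here it enters as soon as "a" completes, which is exactly the idle time continuous batching eliminates.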
PagedAttention also enables memory sharing across sequences: when techniques such as beam search or parallel sampling generate multiple sequences from the same prompt, the KV cache blocks of the shared prefix are tracked with reference counts rather than duplicated. The reported memory savings reach up to 55%.
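The reference-counting scheme behind prefix sharing can be sketched as follows. The class name and the free-pool behavior are illustrative assumptions; the point is that a shared prompt block is stored once and only reclaimed when the last sequence referencing it releases it.

```python
class SharedBlocks:
    """Reference-counted KV block sharing (illustrative sketch)."""
    def __init__(self):
        self.refcount = {}  # physical block id -> number of sequences using it

    def share(self, block_id):
        self.refcount[block_id] = self.refcount.get(block_id, 0) + 1

    def release(self, block_id):
        # Returns True when the block's last reference is dropped,
        # i.e. when it can go back to the free pool.
        self.refcount[block_id] -= 1
        if self.refcount[block_id] == 0:
            del self.refcount[block_id]
            return True
        return False

# Four parallel samples from one prompt share the same three prompt blocks:
prompt_blocks = [0, 1, 2]
pool = SharedBlocks()
for _ in range(4):
    for b in prompt_blocks:
        pool.share(b)
print(pool.refcount)  # → {0: 4, 1: 4, 2: 4}
```

The prompt's KV cache is stored once instead of four times; only the blocks holding each sample's divergent continuation are private, which is where the savings for beam search and parallel sampling come from.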
Results and adoption
In published benchmarks, vLLM achieves 2-4x higher throughput than naive inference with HuggingFace Transformers, and up to 24x higher in scenarios with long sequences and high parallelism. The project is released as an open-source library exposing an OpenAI-compatible API, allowing drop-in integration with existing clients. For production deployment of language models, the serving layer becomes as critical as the model itself.
Link: docs.vllm.ai
