A memory management algorithm for LLM inference that works like virtual memory and paging in an operating system. Instead of allocating one contiguous buffer per request, PagedAttention partitions each request's Key-Value (KV) cache into fixed-size blocks that can live in non-contiguous GPU memory, with a per-request block table mapping logical token positions to physical blocks. This nearly eliminates memory fragmentation (waste is limited to the last, partially filled block of each sequence), so the system can batch many more requests together and significantly increase throughput.
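A minimal sketch of the idea in Python, not the actual vLLM implementation: physical blocks come from a shared free pool, and each sequence keeps a block table mapping its logical blocks to whatever physical blocks it was handed. All class and method names here are illustrative assumptions.

```python
BLOCK_SIZE = 4  # tokens per KV-cache block (illustrative; vLLM uses e.g. 16)

class BlockAllocator:
    """Hands out fixed-size physical blocks from a shared free pool."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def allocate(self):
        if not self.free:
            raise MemoryError("KV cache exhausted")
        return self.free.pop()

    def release(self, block_ids):
        self.free.extend(block_ids)

class Sequence:
    """Per-request block table: logical block index -> physical block id."""
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []
        self.num_tokens = 0

    def append_token(self):
        # Grab a new physical block only when the current one is full,
        # so waste is at most one partially filled block per sequence.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def free(self):
        self.allocator.release(self.block_table)
        self.block_table.clear()

alloc = BlockAllocator(num_blocks=8)
a, b = Sequence(alloc), Sequence(alloc)
for _ in range(5):
    a.append_token()  # 5 tokens -> spans 2 blocks
for _ in range(3):
    b.append_token()  # 3 tokens -> fits in 1 block
# a and b interleave freely in physical memory; freeing a sequence
# returns its blocks to the pool for any other request to reuse.
print(a.block_table, b.block_table, len(alloc.free))
```

Because blocks are returned to a shared pool on completion, a finished request's memory is immediately reusable by any other request, which is what lets the scheduler keep large batches resident.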
Introduced in 'Efficient Memory Management for Large Language Model Serving with PagedAttention' (Kwon et al., SOSP 2023), the UC Berkeley paper that also introduced vLLM.
The core technology behind vLLM, now widely adopted in high-performance inference engines.