vLLM is a high-throughput, open-source library for serving large language models. Its key innovation is PagedAttention, an algorithm that manages the key-value (KV) cache in fixed-size blocks, analogous to how virtual memory pages work in operating systems. This reduces memory fragmentation and enables much larger effective batch sizes, yielding significantly higher throughput and lower latency.
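The virtual-memory analogy can be made concrete with a small sketch. This is purely illustrative, not vLLM's actual implementation: the KV cache is divided into fixed-size physical blocks, and each sequence keeps a block table mapping its logical token positions to physical blocks, so memory is allocated on demand rather than pre-reserved for a maximum sequence length. All class and variable names here are hypothetical.

```python
# Illustrative sketch of PagedAttention-style block allocation
# (hypothetical code, not vLLM internals).

BLOCK_SIZE = 16  # tokens stored per physical KV-cache block


class BlockAllocator:
    """Pool of free physical blocks in the KV cache."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))

    def allocate(self) -> int:
        return self.free_blocks.pop()

    def free(self, block: int) -> None:
        self.free_blocks.append(block)


class Sequence:
    """Tracks the physical blocks backing one request's KV cache."""

    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []  # logical block -> physical block
        self.num_tokens = 0

    def append_token(self) -> None:
        # A new physical block is claimed only when the last one fills up,
        # so no memory is reserved for tokens that don't exist yet.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1


allocator = BlockAllocator(num_blocks=64)
seq = Sequence(allocator)
for _ in range(40):  # generate 40 tokens
    seq.append_token()
print(len(seq.block_table))  # 40 tokens occupy ceil(40/16) = 3 blocks
```

Because blocks are allocated lazily and freed when a request finishes, many concurrent sequences can share one GPU's cache with little waste, which is where the throughput gains come from.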
Developed at UC Berkeley's Sky Computing Lab in 2023.
A de facto standard for self-hosting open-weight models such as Llama 3 and Mistral.