A serving technique in which new requests join the running batch as soon as earlier ones finish, rather than waiting for the entire batch to complete. Also known as 'iteration-level scheduling' or 'Orca scheduling'. In traditional (static) batching, the GPU must wait for the longest sequence in the batch to finish before admitting new requests, leaving capacity idle. Continuous batching avoids this by scheduling at the token level: after each decode step, completed sequences leave the batch and queued sequences can take their slots.
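A minimal sketch of this scheduling loop, with hypothetical names (`continuous_batching`, `step_fn`, the request dict fields) chosen for illustration rather than taken from any real engine:

```python
from collections import deque

def continuous_batching(waiting, max_batch_size, step_fn):
    """Iteration-level scheduling sketch (illustrative, not vLLM/TGI code).

    waiting:        deque of request dicts: {"id", "tokens", "max_tokens"}
    max_batch_size: maximum number of sequences decoded together
    step_fn:        stand-in for one decode step, returns the next token
    """
    running = []
    completed = []
    while waiting or running:
        # Admit new requests the moment slots open, instead of
        # waiting for the whole batch to drain (static batching).
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        # One decode step for the whole batch: every sequence gains a token.
        for seq in running:
            seq["tokens"].append(step_fn(seq))
        # Retire sequences that just finished; their slots free up
        # immediately for the next iteration's admissions.
        still_running = []
        for seq in running:
            if len(seq["tokens"]) >= seq["max_tokens"]:
                completed.append(seq)
            else:
                still_running.append(seq)
        running = still_running
    return completed
```

The key difference from static batching is that admission happens inside the per-token loop, so a short sequence finishing early frees its slot while longer sequences keep decoding.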
Popularized by the 'Orca' paper (Yu et al., OSDI 2022), continuous batching is now standard practice in modern LLM inference servers (vLLM, Hugging Face TGI, NVIDIA Triton) for maximizing GPU utilization and throughput.