A serving technique in which new requests join the running batch as soon as earlier ones finish, rather than waiting for the entire batch to complete. Also known as 'iteration-level scheduling' or 'Orca scheduling'. In traditional (static) batching, the GPU must wait for the longest sequence in the batch to finish before admitting new requests, leaving capacity idle. Continuous batching avoids this by scheduling at the token level: after each decode step, completed sequences leave the batch and queued sequences can take their slots.
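A minimal sketch of this scheduling loop, with hypothetical names (`continuous_batching`, `step_fn`, the request dict fields) chosen for illustration rather than taken from any real engine:

```python
from collections import deque

def continuous_batching(waiting, max_batch_size, step_fn):
    """Iteration-level scheduling sketch (illustrative, not vLLM/TGI code).

    waiting:        deque of request dicts: {"id", "tokens", "max_tokens"}
    max_batch_size: maximum number of sequences decoded together
    step_fn:        stand-in for one decode step, returns the next token
    """
    running = []
    completed = []
    while waiting or running:
        # Admit new requests the moment slots open, instead of
        # waiting for the whole batch to drain (static batching).
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        # One decode step for the whole batch: every sequence gains a token.
        for seq in running:
            seq["tokens"].append(step_fn(seq))
        # Retire sequences that just finished; their slots free up
        # immediately for the next iteration's admissions.
        still_running = []
        for seq in running:
            if len(seq["tokens"]) >= seq["max_tokens"]:
                completed.append(seq)
            else:
                still_running.append(seq)
        running = still_running
    return completed
```

The key difference from static batching is that admission happens inside the per-token loop, so a short sequence finishing early frees its slot while longer sequences keep decoding.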
Popularized by the 'Orca' paper (Yu et al., OSDI 2022), continuous batching is now standard practice in modern LLM inference servers (vLLM, Hugging Face TGI, NVIDIA Triton) for maximizing GPU utilization and throughput.