A serving technique in which new requests are added to the running batch as soon as earlier ones finish, rather than waiting for the entire batch to complete. Also known as 'iteration-level scheduling'.
Essential for LLM inference servers, where output lengths vary widely across requests; by refilling freed batch slots at every iteration, it keeps GPU utilization high for interactive workloads such as chat.
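A minimal sketch of the scheduling loop, assuming a hypothetical `Request` class, a stand-in `step_batch` forward pass, and an assumed `MAX_BATCH` capacity (none of these names come from a real serving framework):

```python
# Sketch of iteration-level scheduling (continuous batching).
# Request, step_batch, and MAX_BATCH are hypothetical names for illustration.

from collections import deque
from dataclasses import dataclass, field

MAX_BATCH = 4  # assumed per-iteration batch capacity


@dataclass
class Request:
    prompt: str
    max_new_tokens: int
    generated: list = field(default_factory=list)

    def is_finished(self) -> bool:
        return len(self.generated) >= self.max_new_tokens


def step_batch(batch: list) -> None:
    """Stand-in for one forward pass: every active request emits one token."""
    for req in batch:
        req.generated.append("<tok>")


def serve(waiting: deque) -> None:
    running: list = []
    while waiting or running:
        # Admit new requests the moment capacity frees up,
        # rather than waiting for the whole running batch to drain.
        while waiting and len(running) < MAX_BATCH:
            running.append(waiting.popleft())

        step_batch(running)  # one decoding iteration for the whole batch

        # Retire finished requests immediately; their slots are
        # reused on the very next iteration.
        running = [r for r in running if not r.is_finished()]


if __name__ == "__main__":
    queue = deque(Request(f"prompt {i}", max_new_tokens=i + 1) for i in range(6))
    serve(queue)
```

The key contrast with static batching is in the admission check: it runs once per decoding iteration, so a short request leaving the batch frees a slot for a waiting request without stalling the others.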