Quantization reduces the precision of a model's weights (e.g., from 16-bit to 4-bit) to lower memory usage and increase inference speed, often with minimal loss in accuracy.
An essential technique for running large models on memory-constrained hardware: at 4-bit precision, a 70B-parameter model's weights shrink from roughly 140 GB (16-bit) to roughly 35 GB, small enough to split across a pair of 24 GB consumer GPUs.
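As a minimal sketch of the core idea, the snippet below implements symmetric absmax quantization in NumPy. The function names and per-tensor scale are illustrative choices, not any particular library's API; production schemes (e.g., GPTQ, AWQ) typically quantize per-group and calibrate against activations to reduce error.

```python
import numpy as np

def quantize_absmax(weights: np.ndarray, bits: int = 4):
    """Symmetric absmax quantization: map floats to signed integer codes.

    Illustrative sketch: one scale for the whole tensor, no calibration.
    """
    qmax = 2 ** (bits - 1) - 1                    # e.g. 7 for 4-bit, 127 for 8-bit
    scale = np.abs(weights).max() / qmax          # largest weight maps to qmax
    q = np.round(weights / scale)                 # nearest integer code
    q = np.clip(q, -qmax, qmax).astype(np.int8)   # stay inside the signed range
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from integer codes."""
    return q.astype(np.float32) * scale

# Round-trip a random weight matrix and measure the quantization error.
w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_absmax(w, bits=4)
w_hat = dequantize(q, scale)
print("max abs error:", np.abs(w - w_hat).max())
```

The error printed at the end is the "minimal loss in accuracy" trade-off in miniature: each weight moves by at most half a quantization step (scale / 2), and narrower bit widths mean larger steps.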