Activation-aware Weight Quantization (AWQ) is a technique for compressing Large Language Models (LLMs) so they run efficiently on consumer hardware. Its key insight is that not all weights in an LLM matter equally: by examining the magnitudes of the activations that flow through each layer, AWQ identifies the small fraction of 'salient' weight channels that have the most significant impact on output quality. Rather than keeping these weights in higher precision, AWQ protects them by scaling them up before quantization (and folding the inverse scale into the activations), so the entire weight matrix can still be quantized to the same low bit-width. This allows for a significant reduction in model size with minimal loss in accuracy.
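The core idea can be illustrated with a minimal numpy sketch. This is not the reference implementation: the toy weight matrix, calibration activations, group size, and the 11-point grid over the scaling exponent are all illustrative choices, but the structure mirrors the method described above: measure per-input-channel activation magnitude, scale salient weight columns up before round-to-nearest quantization, fold the inverse scale into the activations, and search for the exponent that minimizes output error.

```python
import numpy as np

def quantize_rtn(w, bits=4, group=16):
    """Round-to-nearest quantization with per-group asymmetric scales.
    Groups run along the input-channel axis, as in common 4-bit schemes."""
    out, cin = w.shape
    wg = w.reshape(out, cin // group, group)
    lo, hi = wg.min(-1, keepdims=True), wg.max(-1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / (2**bits - 1)
    q = np.clip(np.round((wg - lo) / scale), 0, 2**bits - 1)
    return (q * scale + lo).reshape(out, cin)  # dequantized weights

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))     # toy linear layer, (out_features, in_features)
X = rng.normal(size=(256, 64))    # toy calibration activations
X[:, :4] *= 20.0                  # a few input channels carry outlier activations

Y_ref = X @ W.T
act = np.abs(X).mean(axis=0)      # per-input-channel activation magnitude

def output_err(s):
    # Scale salient weight columns up by s, quantize, and fold the
    # inverse scale into the activations so the product is unchanged
    # except for quantization error.
    return np.abs((X / s) @ quantize_rtn(W * s).T - Y_ref).mean()

# Grid-search the scaling exponent alpha, as the AWQ paper does;
# alpha = 0 gives s = 1, i.e. plain round-to-nearest with no scaling.
best_s = min((act**a for a in np.linspace(0.0, 1.0, 11)), key=output_err)

print("RTN error:", output_err(np.ones(64)))
print("AWQ error:", output_err(best_s))  # never worse: alpha=0 is in the grid
```

Because the search grid includes the no-scaling case, the activation-aware result can only match or beat plain round-to-nearest quantization on the calibration data; with outlier channels like the ones simulated here, it is typically substantially better.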
The AWQ method was introduced in a 2023 paper by researchers at MIT (Lin et al.), who demonstrated that LLMs could be quantized to 4-bit precision with negligible performance degradation. Their work provided a more effective way to compress models for inference, making it possible to run powerful LLMs on devices with limited memory and computational resources.
AWQ was quickly adopted by the open-source community and is now one of the most popular methods for quantizing LLMs. It is supported by many of the leading high-performance serving engines, including vLLM and Hugging Face's Text Generation Inference (TGI). Alongside the GPTQ technique and the GGUF file format used by llama.cpp, AWQ has become a standard way to create and distribute compressed models that can run on a wide range of hardware.