EXL2 is a highly specialized quantization format for large language models (LLMs), created for the exllamav2 inference library. Its primary design goal is to achieve the fastest possible inference speeds on modern NVIDIA GPUs. A key feature of EXL2 is its support for mixed and fractional bits-per-weight (bpw) settings (e.g., 4.5 bpw). This allows for extremely granular control over the trade-off between model quality and VRAM usage, enabling users to tailor a quantization so the model fits their specific hardware constraints as closely as possible.
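To make the fractional-bpw idea concrete, here is a minimal sketch of how a non-integer average like 4.5 bpw can arise from mixing integer bit widths across layer groups. The function name and the weight counts are illustrative assumptions, not EXL2's actual allocation scheme:

```python
# Illustrative sketch (not EXL2's real algorithm): a fractional average
# bits-per-weight emerges when different layer groups are quantized at
# different integer bit widths.

def average_bpw(allocations):
    """allocations: list of (num_weights, bits) pairs, one per layer group."""
    total_bits = sum(n * bits for n, bits in allocations)
    total_weights = sum(n for n, _ in allocations)
    return total_bits / total_weights

# Example: half the weights at 4 bits, half at 5 bits -> 4.5 bpw on average.
mix = [(1_000_000, 4), (1_000_000, 5)]
print(average_bpw(mix))  # 4.5
```

In practice the quantizer chooses these per-layer widths to hit a user-requested target bpw, which is what makes settings like 4.5 bpw possible.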
EXL2 was developed by the creator of the exllamav2 library as an evolution of the GPTQ quantization method. The goal was to build upon the foundation of GPTQ with new CUDA kernels and a more flexible format to maximize performance on NVIDIA's tensor cores. It emerged from the open-source community's need for a high-performance solution for running LLMs locally on consumer-grade gaming GPUs.
EXL2 has become the format of choice for many enthusiasts in the local LLM community who use NVIDIA GPUs and prioritize raw inference speed (tokens per second). It often significantly outperforms more general-purpose formats like GGUF on this specific hardware. However, this speed comes with trade-offs: EXL2 is not compatible with CPUs or non-NVIDIA GPUs, and the quantization process is more complex, requiring a calibration dataset. It is widely supported in popular front-ends like the Oobabooga text-generation-webui.
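The hardware-fit trade-off above can be sketched with a back-of-envelope estimate: at a given bpw, the quantized weights occupy roughly `params × bpw / 8` bytes. This hypothetical helper ignores KV cache, activations, and format overhead, so real VRAM usage is higher:

```python
# Hedged back-of-envelope sketch: approximate weight footprint of a
# quantized model, to judge whether it fits a GPU's VRAM. Ignores
# KV cache, activations, and per-group metadata overhead.

def weight_vram_gib(num_params, bpw):
    """Approximate size of quantized weights in GiB."""
    return num_params * bpw / 8 / 1024**3

# A 7-billion-parameter model at 4.5 bpw:
print(round(weight_vram_gib(7e9, 4.5), 2))  # ~3.67 GiB for weights alone
```

This is why fractional bpw matters: nudging a model from 5.0 to 4.5 bpw can be the difference between fitting and not fitting a given card.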