GGUF (GPT-Generated Unified Format) is a binary file format designed for storing and running large language models (LLMs) efficiently on consumer hardware. It is a single-file format that contains everything needed to load and run a model: the model weights, the tokenizer vocabulary, and self-describing metadata. GGUF is optimized for fast loading via memory mapping (mmap), which lets the operating system page weights in from disk on demand, so models can start quickly and run on systems with limited RAM. Offloading some layers to the GPU is a separate, complementary feature of runtimes such as llama.cpp, not a property of mmap itself.
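To make the "single file with metadata" idea concrete, here is a minimal sketch of parsing a GGUF file's fixed header. It assumes the layout documented in the GGUF specification (version 3): a 4-byte magic `GGUF`, a little-endian `uint32` version, then `uint64` tensor and metadata key-value counts. The function name `read_gguf_header` is ours, not part of any library.

```python
import struct

GGUF_MAGIC = b"GGUF"  # first four bytes of every GGUF file

def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed GGUF header (assumed layout per the v3 spec):
    4-byte magic, uint32 version, uint64 tensor count,
    uint64 metadata key-value count, all little-endian."""
    if data[:4] != GGUF_MAGIC:
        raise ValueError("not a GGUF file")
    version, n_tensors, n_kv = struct.unpack_from("<IQQ", data, 4)
    return {
        "version": version,
        "tensor_count": n_tensors,
        "metadata_kv_count": n_kv,
    }

# Example with a synthetic header (values are illustrative):
header_bytes = GGUF_MAGIC + struct.pack("<IQQ", 3, 291, 24)
info = read_gguf_header(header_bytes)
```

Everything after this 28-byte header is a sequence of typed key-value metadata entries (architecture, context length, quantization details, tokenizer data) followed by tensor descriptors and the aligned tensor data itself, which is what makes the format both self-describing and mmap-friendly.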
GGUF was developed by the team behind the llama.cpp project as a successor to the earlier GGML format. The primary motivation was to create a more extensible and future-proof format that could accommodate new model architectures and quantization methods without breaking backward compatibility. Where the earlier formats used rigid, version-dependent layouts and carried little information about the model itself, GGUF stores self-describing key-value metadata, so new fields can be added without invalidating existing readers.
GGUF has become the de facto standard file format for running LLMs locally on consumer-grade hardware (CPUs and GPUs). It is widely used by the open-source AI community for sharing quantized models on platforms like Hugging Face. The efficiency and flexibility of GGUF have been instrumental in making powerful LLMs accessible to a broader audience of developers and enthusiasts without requiring high-end server infrastructure.