Diffusion models are a class of generative models that work by systematically destroying structure in data and then learning how to recover it. This involves two key stages: a forward diffusion process, in which noise is gradually added to an input (such as an image) until it is indistinguishable from pure noise, and a reverse diffusion process, in which a neural network is trained to undo this noising step by step, starting from pure noise and producing a coherent output. This approach lets the model generate high-fidelity, diverse samples, making it a cornerstone of modern generative AI.
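The forward process above has a convenient closed form: after t steps of adding Gaussian noise, the noised sample can be drawn directly from the original input without simulating every intermediate step. The sketch below illustrates this, assuming the linear variance schedule from Ho et al. (2020); the function and variable names are illustrative, not from any particular library.

```python
import numpy as np

def forward_diffusion(x0, t, betas, rng=None):
    """Sample a noised version of x0 at step t using the closed-form
    forward process: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise."""
    rng = rng or np.random.default_rng(0)
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)[t]  # cumulative fraction of signal retained at step t
    noise = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise
    return xt, noise

# Linear beta schedule over T = 1000 steps, as in Ho et al. (2020).
T = 1000
betas = np.linspace(1e-4, 0.02, T)

x0 = np.ones(4)  # toy "image": a 4-pixel constant signal
x_early, _ = forward_diffusion(x0, t=10, betas=betas)      # still mostly signal
x_late, _ = forward_diffusion(x0, t=T - 1, betas=betas)    # essentially pure noise
```

The reverse process is where the learning happens: a neural network is trained to predict the `noise` term from `xt` and `t`, and at generation time that prediction is used to gradually denoise a random sample back toward the data distribution.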
The concept was first introduced in 2015 by Sohl-Dickstein et al. in 'Deep Unsupervised Learning using Nonequilibrium Thermodynamics,' drawing inspiration from statistical physics. However, diffusion models only saw widespread adoption after the 2020 paper 'Denoising Diffusion Probabilistic Models' by Ho et al., which demonstrated their ability to generate high-quality images rivaling or surpassing GANs.
Since 2020, diffusion models have become the state of the art for many generative tasks, particularly image synthesis. They are the core technology behind prominent text-to-image models such as DALL-E 2, Stable Diffusion, and Midjourney. Their application has expanded beyond images to audio (DiffWave), video (Sora), and even 3D model generation.