Key takeaways
- Diffusion models generate images by iteratively denoising random noise, learning the reverse of a process that progressively corrupts training images.
- The key papers: DDPM (2020) formalized the approach; Latent Diffusion (2021) made it computationally feasible by working in compressed latent space.
- Stable Diffusion (2022), DALL-E 2/3, Midjourney, and Imagen are all diffusion models with different architecture, training, and conditioning choices.
- Diffusion is now extending to video (Sora, Runway Gen-3, Veo), audio (Stable Audio), and 3D generation.
- Compared to earlier GANs, diffusion trains more stably and produces more diverse outputs; compared to autoregressive image models, it offers better fine detail and more practical high-resolution sampling.
The intuition
Training: take an image, add a little noise. Add more noise. Keep going until the image is pure noise. Now train a neural network to reverse each step — given a noisy image, predict how to denoise it slightly. After training, you can generate images from scratch by starting with pure noise and iteratively applying the denoiser. The network learned to hallucinate the structure of images out of noise.

This is a gross simplification of the math but captures the idea. The network is called a denoiser or noise predictor, the iterative denoising is the sampling process, and the number of denoising steps is a tunable knob (more steps = higher quality, more compute). See our neural networks primer for the underlying architecture family.
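The forward corruption process has a convenient closed form: you can jump straight to any noise level in a single step, which is what makes training efficient. A minimal numpy sketch (toy image, illustrative schedule values):

```python
import numpy as np

def forward_noise(x0, t, T=1000, beta_min=1e-4, beta_max=0.02, seed=0):
    """Jump straight to noise level t via the closed form
    x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps,
    where abar_t is the cumulative product of (1 - beta_i) over a
    linear noise schedule (these schedule values are common defaults)."""
    rng = np.random.default_rng(seed)
    betas = np.linspace(beta_min, beta_max, T)
    abar = np.cumprod(1.0 - betas)[t]      # fraction of signal surviving at step t
    eps = rng.standard_normal(x0.shape)    # the noise the denoiser must learn to predict
    return np.sqrt(abar) * x0 + np.sqrt(1.0 - abar) * eps, eps

x0 = np.ones((8, 8))                       # toy all-ones "image"
x_early, _ = forward_noise(x0, t=10)       # still mostly signal
x_late, _ = forward_noise(x0, t=999)       # essentially pure Gaussian noise
```

Training then amounts to showing the network `x_t` and `t` and asking it to predict `eps` — the same closed form run in reverse.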
Why diffusion beat GANs
Before diffusion, generative adversarial networks (GANs) dominated image generation. GANs could produce sharp, realistic images — but they were hard to train (prone to mode collapse where the generator produces limited diversity), hard to condition on text, and limited in resolution.
Diffusion models are simpler to train, more stable, naturally produce diverse outputs, and — with text conditioning — scale to high-quality text-to-image generation. The 2022 wave of DALL-E 2, Imagen, and Stable Diffusion all used diffusion, and GAN-based image generation largely faded from frontier research.
Latent diffusion and Stable Diffusion
A key efficiency gain: run the diffusion process in a compressed latent space rather than on raw pixels. The Latent Diffusion paper (Rombach et al., from the CompVis group at LMU Munich) showed that a pre-trained autoencoder could compress images into a smaller latent representation, and that running diffusion there preserved quality while cutting compute 10-100x.
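The savings are easy to see from Stable Diffusion's own shapes: its autoencoder downsamples 8x spatially and keeps 4 latent channels, so each denoising step touches far fewer values than it would in pixel space. A back-of-envelope check:

```python
# Pixel space: the 512x512 RGB image the user ultimately sees.
pixel_values = 512 * 512 * 3      # 786,432 values per denoising step
# Latent space: 8x spatial downsampling, 4 channels (Stable Diffusion's VAE).
latent_values = 64 * 64 * 4       # 16,384 values per denoising step
compression = pixel_values / latent_values
print(compression)                # 48.0 -- each step operates on ~48x fewer values
```

That per-step reduction compounds over the dozens of denoising steps in a full sampling run, which is what brings generation within reach of consumer GPUs.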
Stable Diffusion (released in 2022) was the public release of a latent diffusion model with text conditioning. It runs on consumer GPUs — a decisive moment for accessibility. Thousands of fine-tuned variants, LoRAs, and community projects followed. Stable Diffusion XL (2023), Stable Diffusion 3 (2024), and later variants have improved quality, prompt fidelity, and text rendering.
Text conditioning
The magic of “type a prompt, get an image” requires conditioning the diffusion process on text. Most systems use a frozen text encoder — CLIP’s text encoder in early Stable Diffusion, T5 in Imagen, proprietary transformers in DALL-E 3 — to produce text embeddings. The embeddings are injected into the diffusion network’s cross-attention layers, steering each denoising step toward images that match the prompt.
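Mechanically, cross-attention lets every spatial position in the noisy image latents read from the prompt's token embeddings. A single-head numpy sketch with toy dimensions and random weights (real models use learned, multi-head projections at much larger scale):

```python
import numpy as np

def cross_attention(image_tokens, text_tokens, d=16, seed=0):
    """Single-head cross-attention: image latents query text embeddings.
    Queries come from the image; keys and values come from the prompt,
    so each image position decides which prompt tokens to attend to."""
    rng = np.random.default_rng(seed)
    Wq = rng.standard_normal((image_tokens.shape[-1], d)) / np.sqrt(d)
    Wk = rng.standard_normal((text_tokens.shape[-1], d)) / np.sqrt(d)
    Wv = rng.standard_normal((text_tokens.shape[-1], d)) / np.sqrt(d)
    Q = image_tokens @ Wq                       # (num_image_tokens, d)
    K, V = text_tokens @ Wk, text_tokens @ Wv   # (num_text_tokens, d) each
    scores = Q @ K.T / np.sqrt(d)               # similarity image -> text
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)          # softmax over prompt tokens
    return w @ V                                # each image token reads from the text

img = np.random.default_rng(1).standard_normal((64, 32))  # 64 latent positions
txt = np.random.default_rng(2).standard_normal((7, 24))   # 7 prompt token embeddings
out = cross_attention(img, txt)                           # shape (64, 16)
```

In a real diffusion U-Net or transformer, blocks like this are interleaved with the denoising layers, so the text steers every step of the trajectory.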
Prompt fidelity (how closely the image matches the prompt) improved dramatically from DALL-E 2 to DALL-E 3 and from SD 1.5 to SD 3 / SDXL — primarily through better text encoders and higher-quality training captions. For the broader generative family, see our generative AI primer.
Other diffusion applications
Image editing and inpainting
Providing a partial image or mask and asking the model to complete or modify it. ControlNet, inpainting pipelines, and editing tools like Adobe’s Generative Fill use diffusion for precise editing.
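The core trick in diffusion inpainting is simple: at each denoising step, keep the known pixels pinned to a re-noised copy of the original and let the model fill only the masked region. A schematic numpy sketch (the denoiser and noiser here are stand-in lambdas, not real models):

```python
import numpy as np

def inpaint_step(x_t, known, mask, denoise_fn, renoise_fn, t):
    """One masked denoising step (mask == 1 marks the region to fill).
    The unmasked region is overwritten with the original image re-noised
    to the matching noise level, so only the hole actually changes."""
    x_gen = denoise_fn(x_t, t)            # model's proposal for the next step
    x_known = renoise_fn(known, t - 1)    # original content at the right noise level
    return mask * x_gen + (1 - mask) * x_known

known = np.ones((4, 4))                   # toy original image
mask = np.zeros((4, 4))
mask[:, 2:] = 1.0                         # fill only the right half
out = inpaint_step(np.zeros((4, 4)), known, mask,
                   denoise_fn=lambda z, t: z + 0.5,   # stand-in denoiser
                   renoise_fn=lambda z, t: z,         # stand-in (noise-free) noiser
                   t=10)
```

Running this blend at every step is roughly what basic inpainting pipelines do; ControlNet-style methods go further by feeding structural hints (edges, depth, pose) into the network itself.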
Video generation
Sora (OpenAI), Veo (Google), Runway Gen-3, Kling, Pika, and open-source variants (Wan 2.1, Mochi, CogVideoX) extend diffusion to video. Challenges: temporal consistency (objects staying the same across frames), physics plausibility, and long-duration generation. Quality advanced remarkably in 2024-2025, but production-ready long-form video remains limited.
Audio diffusion
Stable Audio, AudioLDM, and related models apply diffusion to audio — generating music, sound effects, and foley. See our music-generation primer for more.
3D generation
Text-to-3D and image-to-3D models (Meshy, Luma Genie, Tripo) use diffusion to generate 3D meshes or NeRF-like representations. Quality has improved but still trails 2D image generation.
Scientific applications
Molecular generation and docking for drug discovery (DiffDock, MolDiff). Protein design (RFdiffusion). Climate modeling and fluid dynamics. Diffusion is a general framework that has found use far beyond image generation.
Known limitations
Prompt fidelity on complex scenes
Models still struggle with images requiring specific arrangements — “three apples, two red and one green, on the left of a book”. Counting, spatial reasoning, and complex compositional prompts remain harder than they sound.
Text rendering
Images with embedded text often show gibberish or misspellings. SDXL, Ideogram, and DALL-E 3 have improved this; it’s no longer broken but still imperfect.
Hands and anatomy
The classic failure mode — extra fingers, strange joints, bizarre anatomy. Modern models handle this much better than early SD, but edge cases still surface.
Copyright and training-data concerns
Diffusion models are trained on internet-scale image datasets that include copyrighted images. Lawsuits (Getty vs. Stability AI, artist-led class actions) are contesting whether training and generation constitute infringement. The legal landscape is unresolved. See our computer vision coverage for broader context.
Commercial landscape
- Closed commercial: Midjourney, DALL-E 3 (OpenAI), Adobe Firefly, Google Imagen 3, Ideogram, Flux Pro (Black Forest Labs).
- Open weights: Stable Diffusion family (SD 1.5, SDXL, SD 3), Flux, PixArt, Kolors. Self-hostable, customizable, and fine-tunable (subject to each model's license terms).
- Specialized: Leonardo, NightCafe, Lexica, Playground for creator workflows; Scenario for game asset pipelines; RunwayML for video.
Frequently asked questions
How long does it take to generate an image?
Depends on model size, resolution, and compute. Stable Diffusion 1.5 generates a 512×512 image in a few seconds on a consumer GPU. SDXL at 1024×1024 takes around 10-30 seconds. Commercial hosted models (Midjourney, DALL-E) typically respond in 10-60 seconds. Real-time diffusion (under 1 second per image) is an active frontier, using techniques like latent consistency models (LCM) and SDXL Turbo.
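The step count is the main speed lever, and a toy numpy-only DDIM-style loop makes the knob concrete. The "model" here is an oracle that already knows the target image (which is why the loop converges exactly) — a sketch of the sampler's control flow, not a real denoiser; schedule values are illustrative:

```python
import numpy as np

T = 1000
abar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, T))  # cumulative signal fraction

def ddim_sample(eps_model, shape, steps, seed=0):
    """Deterministic DDIM-style sampling: fewer steps = faster, coarser.
    eps_model(x_t, t) predicts the noise component of x_t."""
    x = np.random.default_rng(seed).standard_normal(shape)  # start from pure noise
    ts = np.linspace(T - 1, 0, steps).astype(int)           # strided timestep schedule
    for i, t in enumerate(ts):
        eps = eps_model(x, t)                               # predicted noise at step t
        x0_hat = (x - np.sqrt(1 - abar[t]) * eps) / np.sqrt(abar[t])  # clean estimate
        if i + 1 == len(ts):
            x = x0_hat                                      # final denoised output
        else:
            tn = ts[i + 1]                                  # jump directly to next timestep
            x = np.sqrt(abar[tn]) * x0_hat + np.sqrt(1 - abar[tn]) * eps
    return x

target = np.full((8, 8), 0.7)  # toy "image" the oracle denoiser knows
oracle = lambda x, t: (x - np.sqrt(abar[t]) * target) / np.sqrt(1 - abar[t])
img = ddim_sample(oracle, (8, 8), steps=20)
```

Because the sampler strides over the 1000 training timesteps, `steps=20` runs the network 20 times instead of 1000 — distillation methods like LCM and Turbo push the same idea down to 1-4 steps.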
Can I train my own diffusion model?
Training a frontier model from scratch is out of reach for individuals — tens of thousands of GPU-hours and massive image datasets. Fine-tuning or LoRA-adapting an existing open-weights model is accessible — a LoRA for a specific style or character can be trained in a few hours on a single consumer GPU. Most practical customization of diffusion happens via fine-tuning rather than from-scratch training.
Are diffusion-generated images copyrightable?
Depends on jurisdiction and human contribution. In the US, courts and the Copyright Office have held that purely AI-generated images are not copyrightable (Thaler v. Perlmutter, 2023; the Zarya of the Dawn decision, 2023), but significant human editing or creative selection can qualify. Other jurisdictions (EU, UK, Japan) have different rules. When in doubt, assume fully AI-generated output has weak copyright protection, and preserve evidence of your own creative contributions.