How Diffusion Models Work

A step-by-step, visual explainer of forward noise and reverse denoising in modern diffusion models.

Diffusion Models

Diffusion models are probabilistic generative models that gradually add noise to data and then learn to remove it. By learning to reverse this diffusion process, they synthesize new samples that follow the data distribution.

t = 499 → clean sample

Forward Process

The forward process applies Gaussian noise over a schedule of discrete time steps:

q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\,\sqrt{1-\beta_t}\,x_{t-1},\,\beta_t I\right).

Sampling uses the reparameterization trick:

x = \mu + \sigma\,\varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I),

so a single diffusion step is

x_t = \sqrt{1-\beta_t}\,x_{t-1} + \sqrt{\beta_t}\,\epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0, I).

With the definitions

\alpha_t = 1 - \beta_t, \qquad \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s,

we get the closed form:

x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0, I).
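
As a concrete illustration, here is a minimal NumPy sketch of this closed form; the linear β schedule, the choice of T = 500, and the `q_sample` helper are illustrative assumptions, not taken from any particular library.

```python
import numpy as np

# Illustrative linear beta schedule (T and the endpoints are assumptions).
T = 500
betas = np.linspace(1e-4, 0.02, T)      # beta_t for t = 0..T-1
alphas = 1.0 - betas                    # alpha_t = 1 - beta_t
alpha_bars = np.cumprod(alphas)         # bar{alpha}_t = prod of alpha_s up to t

def q_sample(x0, t, rng=np.random.default_rng()):
    """Draw x_t ~ q(x_t | x_0) in one shot via the closed form."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
```

With this schedule, `q_sample(x0, 0)` barely perturbs the input, while `q_sample(x0, T - 1)` is close to pure noise, matching the high-t end of the animation above.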

Reverse Process

The learned model predicts the reverse transition:

p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\bigl(x_{t-1};\,\mu_\theta(x_t, t),\,\tilde{\beta}_t I\bigr).

Predicting the model mean is equivalent to predicting the noise:

\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\hat{\epsilon}\right).

The same noise estimate also yields a prediction of the clean sample:

\hat{x}_0 = \frac{x_t - \sqrt{1 - \bar{\alpha}_t}\,\hat{\epsilon}}{\sqrt{\bar{\alpha}_t}}.

Using this estimate, the posterior mean is:

\mu_\theta(x_t, t) = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\,\hat{x}_0 + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\,x_t,

with posterior variance

\tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}\,\beta_t.

The reverse step samples:

x_{t-1} = \mu_\theta(x_t, t) + \sqrt{\tilde{\beta}_t}\,z_t, \qquad z_t \sim \mathcal{N}(0, I), \qquad x_T \sim \mathcal{N}(0, I),

starting from pure noise.
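
To make the loop explicit, here is a NumPy sketch of ancestral sampling; `predict_noise(x_t, t)` is a stand-in for the trained network, and the schedule arrays repeat the assumptions from the forward-process sketch.

```python
import numpy as np

# Same illustrative schedule as in the forward-process sketch.
T = 500
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def p_sample_loop(predict_noise, shape, rng=np.random.default_rng()):
    """Ancestral sampling: start from x_T ~ N(0, I) and denoise step by step.
    `predict_noise(x_t, t)` is a placeholder for the trained network."""
    x = rng.standard_normal(shape)                      # x_T ~ N(0, I)
    for t in reversed(range(T)):
        eps_hat = predict_noise(x, t)
        # Posterior mean in the noise parameterization.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:
            beta_tilde = (1.0 - alpha_bars[t - 1]) / (1.0 - alpha_bars[t]) * betas[t]
            x = mean + np.sqrt(beta_tilde) * rng.standard_normal(shape)
        else:
            x = mean                                    # no noise added at the final step
    return x
```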

Neural network view

Under the hood, a neural network predicts the noise at each timestep. The animation below shows a dense network with signals pulsing from input to output.
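
To show what that prediction is trained to do, here is a sketch of the simplified noise-prediction loss for a single sample; `model(x_t, t)` is a placeholder for any such network, and the schedule arrays repeat the earlier assumptions.

```python
import numpy as np

# Same illustrative schedule as before.
T = 500
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

def training_loss(model, x0, rng=np.random.default_rng()):
    """Simplified DDPM-style objective: noise x0 to a random timestep and
    score how well `model` (a placeholder) recovers that noise."""
    t = int(rng.integers(0, T))                         # random timestep
    eps = rng.standard_normal(x0.shape)                 # the noise we add
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    eps_hat = model(x_t, t)                             # network's noise prediction
    return np.mean((eps - eps_hat) ** 2)                # mean squared error
```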

Self-attention over pixels

Modern diffusion U-Nets often mix convolutions with self-attention so distant pixels can influence each other. Click any pixel below to make it the query; the white lines show attention weights fading with distance, controlled by the σ slider.

Click any pixel to make it the query. Higher σ → flatter attention.
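
The widget's weights can be reproduced in a few lines; this is a sketch of the distance-based softmax it visualizes, not the learned query/key attention a real U-Net computes.

```python
import numpy as np

def distance_attention(grid_size, query_rc, sigma=2.0):
    """Softmax over negative squared distance from the query pixel:
    larger sigma flattens the map, smaller sigma concentrates it."""
    rows, cols = np.mgrid[0:grid_size, 0:grid_size]
    qr, qc = query_rc
    dist2 = (rows - qr) ** 2 + (cols - qc) ** 2
    logits = -dist2 / (2.0 * sigma ** 2)
    weights = np.exp(logits - logits.max())             # numerically stable softmax
    return weights / weights.sum()                      # weights sum to 1 over all pixels

weights = distance_attention(8, query_rc=(4, 4), sigma=2.0)   # query at the grid center
```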

U-Net backbone

Diffusion models commonly use a U-Net: an encoder that downsamples to a bottleneck, then a decoder that upsamples while fusing skip connections. Slide to see the forward (left) and reverse (right) halves light up.

Left: downsampling encoder. Right: upsampling decoder with skip links.
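
As a rough sketch of the wiring only, with average pooling and nearest-neighbor upsampling standing in for learned convolution and attention blocks, the skip connections look like this:

```python
import numpy as np

def toy_unet(x, levels=2):
    """Toy U-Net wiring: downsample while stashing skips, pass a bottleneck,
    then upsample and fuse each skip. Only the shape flow mirrors a real
    U-Net; the learned blocks are replaced by simple averaging."""
    skips = []
    h = x
    for _ in range(levels):                             # encoder: downsample
        skips.append(h)
        h = h.reshape(h.shape[0] // 2, 2, h.shape[1] // 2, 2).mean(axis=(1, 3))
    # a real network would transform h at the bottleneck here
    for skip in reversed(skips):                        # decoder: upsample + fuse skip
        h = h.repeat(2, axis=0).repeat(2, axis=1)       # nearest-neighbor upsample
        h = (h + skip) / 2.0                            # skip fusion (concat + conv in practice)
    return h

out = toy_unet(np.zeros((16, 16)))                      # returns a 16x16 array
```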

Training dynamics

Diffusion models are optimized with gradient descent. The plot below shows steps on a simple quadratic: small learning rates creep toward the minimum; larger rates move faster but can overshoot.

Small lr → slow but stable. Large lr → overshoots past the minimum.
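
The same experiment fits in a few lines of Python, using f(x) = x² whose gradient is 2x; the two learning rates below are just illustrative extremes.

```python
def descend(lr, x=1.0, steps=10):
    """Plain gradient descent on f(x) = x^2 (gradient 2x)."""
    path = [x]
    for _ in range(steps):
        x -= lr * 2.0 * x              # gradient step
        path.append(x)
    return path

slow = descend(lr=0.05)   # creeps toward the minimum at 0
fast = descend(lr=0.95)   # overshoots past 0 each step and oscillates
```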

Single gate intuition

A neural network is built from simple gates. Here is one ReLU-style gate with two inputs and one output; signals flow left to right in monochrome.

A single ReLU gate: signals enter from the left, combine in the gate, and produce one output on the right.
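
In code, the whole gate is one weighted sum clipped at zero; the weights and bias below are arbitrary illustrative values.

```python
def relu_gate(x1, x2, w1=0.7, w2=-0.3, b=0.1):
    """One ReLU gate: weight the two inputs, add a bias, clip negatives to zero."""
    z = w1 * x1 + w2 * x2 + b
    return max(0.0, z)

relu_gate(1.0, 2.0)   # 0.7*1.0 - 0.3*2.0 + 0.1 = 0.2
```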