How Diffusion Models Work

A step-by-step, visual explainer of forward noise and reverse denoising in modern diffusion models.

Diffusion Models

Diffusion models are probabilistic generative models that gradually add noise to data and then learn to remove it. By learning to reverse this diffusion process, they synthesize new samples that follow the data distribution.

t = 499 → clean sample

Forward Process

The forward process applies Gaussian noise over a schedule of discrete time steps:

q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\,\sqrt{1-\beta_t}\,x_{t-1},\,\beta_t I\right).

Sampling uses the reparameterization trick:

x = \mu + \sigma\,\varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I),

so a single diffusion step is

x_t = \sqrt{1-\beta_t}\,x_{t-1} + \sqrt{\beta_t}\,\epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0, I).

With the definitions

\alpha_t = 1 - \beta_t, \qquad \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s,

we get the closed form:

x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0, I).
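
As a concrete illustration, here is a minimal NumPy sketch of this closed form; the linear β schedule, the choice of T = 500, and the `q_sample` helper are illustrative assumptions, not taken from any particular library.

```python
import numpy as np

# Illustrative linear beta schedule (T and the endpoints are assumptions).
T = 500
betas = np.linspace(1e-4, 0.02, T)      # beta_t for t = 0..T-1
alphas = 1.0 - betas                    # alpha_t = 1 - beta_t
alpha_bars = np.cumprod(alphas)         # bar{alpha}_t = prod of alpha_s up to t

def q_sample(x0, t, rng=np.random.default_rng()):
    """Draw x_t ~ q(x_t | x_0) in one shot via the closed form."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
```

With this schedule, `q_sample(x0, 0)` barely perturbs the input, while `q_sample(x0, T - 1)` is close to pure noise, matching the high-t end of the animation above.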

Reverse Process

The learned model predicts the reverse transition:

p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\bigl(x_{t-1};\,\mu_\theta(x_t, t),\,\tilde{\beta}_t I\bigr).

Predicting the model mean is equivalent to predicting the noise:

\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\hat{\epsilon}\right).

The same noise estimate also yields a prediction of the clean sample:

\hat{x}_0 = \frac{x_t - \sqrt{1 - \bar{\alpha}_t}\,\hat{\epsilon}}{\sqrt{\bar{\alpha}_t}}.

Using this estimate, the posterior mean is:

\mu_\theta(x_t, t) = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\,\hat{x}_0 + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\,x_t,

with posterior variance

\tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}\,\beta_t.

The reverse step samples:

x_{t-1} = \mu_\theta(x_t, t) + \sqrt{\tilde{\beta}_t}\,z_t, \qquad z_t \sim \mathcal{N}(0, I), \qquad x_T \sim \mathcal{N}(0, I),

starting from pure noise.
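
To make the loop explicit, here is a NumPy sketch of ancestral sampling; `predict_noise(x_t, t)` is a stand-in for the trained network, and the schedule arrays repeat the assumptions from the forward-process sketch.

```python
import numpy as np

# Same illustrative schedule as in the forward-process sketch.
T = 500
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def p_sample_loop(predict_noise, shape, rng=np.random.default_rng()):
    """Ancestral sampling: start from x_T ~ N(0, I) and denoise step by step.
    `predict_noise(x_t, t)` is a placeholder for the trained network."""
    x = rng.standard_normal(shape)                      # x_T ~ N(0, I)
    for t in reversed(range(T)):
        eps_hat = predict_noise(x, t)
        # Posterior mean in the noise parameterization.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:
            beta_tilde = (1.0 - alpha_bars[t - 1]) / (1.0 - alpha_bars[t]) * betas[t]
            x = mean + np.sqrt(beta_tilde) * rng.standard_normal(shape)
        else:
            x = mean                                    # no noise added at the final step
    return x
```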

Neural network view

Under the hood, a neural network predicts the noise at each timestep. The animation below shows a dense network with signals pulsing from input to output.
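
To show what that prediction is trained to do, here is a sketch of the simplified noise-prediction loss for a single sample; `model(x_t, t)` is a placeholder for any such network, and the schedule arrays repeat the earlier assumptions.

```python
import numpy as np

# Same illustrative schedule as before.
T = 500
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

def training_loss(model, x0, rng=np.random.default_rng()):
    """Simplified DDPM-style objective: noise x0 to a random timestep and
    score how well `model` (a placeholder) recovers that noise."""
    t = int(rng.integers(0, T))                         # random timestep
    eps = rng.standard_normal(x0.shape)                 # the noise we add
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    eps_hat = model(x_t, t)                             # network's noise prediction
    return np.mean((eps - eps_hat) ** 2)                # mean squared error
```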

Self-attention over pixels

Modern diffusion U-Nets often mix convolutions with self-attention so distant pixels can influence each other. Click any pixel below to make it the query; the white lines show attention weights fading with distance, controlled by the σ slider.

Click any pixel to make it the query. Higher σ → flatter attention.
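
The widget's weights can be reproduced in a few lines; this is a sketch of the distance-based softmax it visualizes, not the learned query/key attention a real U-Net computes.

```python
import numpy as np

def distance_attention(grid_size, query_rc, sigma=2.0):
    """Softmax over negative squared distance from the query pixel:
    larger sigma flattens the map, smaller sigma concentrates it."""
    rows, cols = np.mgrid[0:grid_size, 0:grid_size]
    qr, qc = query_rc
    dist2 = (rows - qr) ** 2 + (cols - qc) ** 2
    logits = -dist2 / (2.0 * sigma ** 2)
    weights = np.exp(logits - logits.max())             # numerically stable softmax
    return weights / weights.sum()                      # weights sum to 1 over all pixels

weights = distance_attention(8, query_rc=(4, 4), sigma=2.0)   # query at the grid center
```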

U-Net backbone

Diffusion models commonly use a U-Net: an encoder that downsamples to a bottleneck, then a decoder that upsamples while fusing skip connections. Slide to see the forward (left) and reverse (right) halves light up.

Left: downsampling encoder. Right: upsampling decoder with skip links.
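
As a rough sketch of the wiring only, with average pooling and nearest-neighbor upsampling standing in for learned convolution and attention blocks, the skip connections look like this:

```python
import numpy as np

def toy_unet(x, levels=2):
    """Toy U-Net wiring: downsample while stashing skips, pass a bottleneck,
    then upsample and fuse each skip. Only the shape flow mirrors a real
    U-Net; the learned blocks are replaced by simple averaging."""
    skips = []
    h = x
    for _ in range(levels):                             # encoder: downsample
        skips.append(h)
        h = h.reshape(h.shape[0] // 2, 2, h.shape[1] // 2, 2).mean(axis=(1, 3))
    # a real network would transform h at the bottleneck here
    for skip in reversed(skips):                        # decoder: upsample + fuse skip
        h = h.repeat(2, axis=0).repeat(2, axis=1)       # nearest-neighbor upsample
        h = (h + skip) / 2.0                            # skip fusion (concat + conv in practice)
    return h

out = toy_unet(np.zeros((16, 16)))                      # returns a 16x16 array
```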

Training dynamics

Diffusion models are optimized with gradient descent. The plot below shows steps on a simple quadratic: small learning rates creep toward the minimum; larger rates move faster but can overshoot.

Small lr → slow but stable. Large lr → overshoots past the minimum.
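
The same experiment fits in a few lines of Python, using f(x) = x² whose gradient is 2x; the two learning rates below are just illustrative extremes.

```python
def descend(lr, x=1.0, steps=10):
    """Plain gradient descent on f(x) = x^2 (gradient 2x)."""
    path = [x]
    for _ in range(steps):
        x -= lr * 2.0 * x              # gradient step
        path.append(x)
    return path

slow = descend(lr=0.05)   # creeps toward the minimum at 0
fast = descend(lr=0.95)   # overshoots past 0 each step and oscillates
```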

Single gate intuition

A neural network is built from simple gates. Here is one ReLU-style gate with two inputs and one output; signals flow left to right in monochrome.

A single ReLU gate: signals enter from the left, combine in the gate, and produce one output on the right.
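
In code, the whole gate is one weighted sum clipped at zero; the weights and bias below are arbitrary illustrative values.

```python
def relu_gate(x1, x2, w1=0.7, w2=-0.3, b=0.1):
    """One ReLU gate: weight the two inputs, add a bias, clip negatives to zero."""
    z = w1 * x1 + w2 * x2 + b
    return max(0.0, z)

relu_gate(1.0, 2.0)   # 0.7*1.0 - 0.3*2.0 + 0.1 = 0.2
```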