cjyuResearch

Diffusion models are the state-of-the-art method for generating high-quality computer-generated images. They generate images by transforming pure noise into meaningful data through a denoising process that models the noise removed at each step as a Gaussian distribution. The denoising process is represented by a Markov chain that runs from noisy images to clean ones.

I was confused by this for the longest time, but I eventually figured out the key reason for my confusion. We start off with pure noise, which is just a sample from a multivariate Gaussian. At each denoising step, the noise being removed is assumed to follow a Gaussian distribution. So we are effectively subtracting Gaussians from a Gaussian. We remember from statistics (hopefully) that adding or subtracting two Gaussians results in a Gaussian. But if this is the case, wouldn't the final distribution generated by the diffusion model always be Gaussian as well? Something is clearly amiss.

The secret to representing complex nonlinear distributions through the subtraction of Gaussian distributions lies in how those Gaussians are generated. First, we have to clarify: it is adding or subtracting two independent Gaussians that results in a Gaussian. If one of the Gaussians is generated from the other in a nonlinear fashion (for example, its mean is a nonlinear function of a sample drawn from the other), the result is generally no longer Gaussian. So for diffusion to work, we need two things from our noise removal process.

• The Gaussian noise that is removed must be dependent on some prior. It could be the previous noise step or the original image.

• The Gaussian noise must be predicted in a nonlinear fashion, usually by a neural network such as a U-Net or some other architecture.
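To see why both conditions matter, it helps to write out a single step. This is only a sketch in my own notation: the symbols x_t, f, a, σ and z are assumptions for illustration and are not taken from any particular paper. Let x_t ~ N(0, 1) and let the removed noise be Gaussian given x_t, with mean f(x_t) and variance σ²:

$$
x_{t-1} = x_t - \epsilon_t, \qquad \epsilon_t \mid x_t \sim \mathcal{N}\big(f(x_t), \sigma^2\big)
\quad\Longrightarrow\quad
x_{t-1} = x_t - f(x_t) - \sigma z, \qquad z \sim \mathcal{N}(0, 1) \text{ independent of } x_t .
$$

If the dependence is linear, f(x) = a x, the step is a linear combination of independent Gaussians and stays Gaussian:

$$
x_{t-1} = (1 - a)\, x_t - \sigma z \;\sim\; \mathcal{N}\big(0,\ (1 - a)^2 + \sigma^2\big).
$$

If f is nonlinear, x_t - f(x_t) is a nonlinear transformation of a Gaussian, which is in general not Gaussian, and adding the independent Gaussian term σz cannot make it Gaussian again. The conditional distribution of x_{t-1} given x_t is still Gaussian, N(x_t - f(x_t), σ²), but the marginal distribution of x_{t-1} is not.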

If we have only the first condition, we get the following process:

If we have both conditions, we get the resulting process:

Visualizations of Diffusion Distributions
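The two cases can also be checked numerically. Below is a minimal NumPy sketch, not a real diffusion model: `run_chain`, `noise_mean`, and the toy formulas passed to it are hypothetical stand-ins for a trained noise predictor, and the step count and sigma are arbitrary choices. It starts from one-dimensional Gaussian noise, repeatedly subtracts noise that is Gaussian given the current sample, and measures how far the result is from Gaussian using excess kurtosis (roughly 0 for a Gaussian sample).

```python
import numpy as np

def excess_kurtosis(x):
    """Excess kurtosis of a sample; close to 0 when the sample is Gaussian."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z**4) - 3.0)

def run_chain(noise_mean, steps=10, sigma=0.1, n=200_000, seed=0):
    """Start from pure Gaussian noise and subtract Gaussian noise `steps` times.

    noise_mean: hypothetical stand-in for the noise predictor; it maps the
    current samples to the mean of the noise removed at this step.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)  # pure noise: x ~ N(0, 1)
    for _ in range(steps):
        # Given x, the removed noise is Gaussian: N(noise_mean(x), sigma^2).
        noise = noise_mean(x) + sigma * rng.standard_normal(n)
        x = x - noise
    return x

# Condition 1 only: the noise depends on x, but linearly.
linear = run_chain(lambda x: 0.3 * x)

# Conditions 1 and 2: the noise depends on x through a nonlinear function
# (a toy formula chosen for illustration, not a trained network).
nonlinear = run_chain(lambda x: 0.3 * x - 0.6 * np.tanh(2.0 * x))

print("linear predictor, excess kurtosis:   ", excess_kurtosis(linear))
print("nonlinear predictor, excess kurtosis:", excess_kurtosis(nonlinear))
```

With these settings the linear chain should stay close to zero excess kurtosis, since every step is a linear combination of Gaussians. The nonlinear chain should come out clearly negative, because the toy predictor pushes samples away from zero and the distribution splits into two clusters, even though every individual subtraction was Gaussian given the previous sample.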