Project 5: Diffusion Models

Sam Huang

Part A: Power of Diffusion Models

In this part of the project, we play around with a pretrained diffusion model and implement different sampling loops to achieve various effects, such as inpainting and optical illusions. We use the DeepFloyd IF diffusion model, with a global random seed of 180.

Sampling from the Model

DeepFloyd was trained as a text-to-image model that takes text prompts as input and outputs corresponding images. Let's generate some images from the provided prompt embeddings. Below are results after stage 2 of DeepFloyd, which outputs 256×256 three-channel images, using 20 inference steps.

20 inference steps

Now let's try again with 50 inference steps. There is so much more detail in each of the images! The 20-step results look pretty good but are more cartoonish. With 50 steps, although some images still look cartoonish, the model clearly attempts to add more detail.

50 inference steps

Sampling Loop

We first implement the forward process: given a clean image $x_0$, we get a noisy image $x_t$ at timestep $t$.

The forward process is defined by $q(x_t \mid x_0) = \mathcal{N}(x_t;\, \sqrt{\bar{\alpha}_t}\, x_0,\, (1 - \bar{\alpha}_t)\mathbf{I})$, which is equivalent to computing $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$ with $\epsilon \sim \mathcal{N}(0, \mathbf{I})$.
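As a minimal sketch in PyTorch (assuming a precomputed alphas_cumprod tensor of cumulative products, indexed by timestep), the forward process is just a weighted blend of the image with fresh Gaussian noise:

```python
import torch

def forward(x0, t, alphas_cumprod):
    """Sample x_t ~ q(x_t | x_0): scale the clean image down and mix in noise."""
    abar_t = alphas_cumprod[t]                       # cumulative product at step t
    eps = torch.randn_like(x0)                       # eps ~ N(0, I)
    xt = abar_t.sqrt() * x0 + (1 - abar_t).sqrt() * eps
    return xt, eps
```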

Let's see a series of images with increasing noise levels at timestep $t$.

Classical Denoising

Let's first see how a classical approach, Gaussian blur filtering, does on these noisy images. The results technically aren't bad given the amount of noise, but they are certainly not good.

UNet Denoising: One-Step & Iterative

Let's see if we can do better with stage_1.unet. Given a noisy image, we pass it to stage_1.unet to get a noise estimate, then subtract that noise to get an estimated clean image by inverting the forward equation above.
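The one-step estimate simply inverts the forward equation. A sketch, with eps_hat standing in for the noise predicted by stage_1.unet:

```python
def one_step_denoise(xt, t, eps_hat, alphas_cumprod):
    """Solve the forward equation for x_0 given x_t and a noise estimate."""
    abar_t = alphas_cumprod[t]
    return (xt - (1 - abar_t).sqrt() * eps_hat) / abar_t.sqrt()
```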

To further improve UNet denoising, instead of estimating the clean image in one step, we can denoise iteratively. Striding the timesteps by 30, at each step we estimate a cleaner version of the noisy image using the formula $x_{t'} = \frac{\sqrt{\bar{\alpha}_{t'}}\,\beta_t}{1 - \bar{\alpha}_t}\, x_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar{\alpha}_{t'})}{1 - \bar{\alpha}_t}\, x_t + v_\sigma$, where $x_t$ is the image at timestep $t$; $x_{t'}$ is the noisy image at timestep $t'$ with $t' < t$ (less noisy); $\bar{\alpha}_t$ is defined by alphas_cumprod; $\alpha_t = \bar{\alpha}_t / \bar{\alpha}_{t'}$ and $\beta_t = 1 - \alpha_t$; $x_0$ is the current estimate of the clean image; and $v_\sigma$ is random noise.
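One update of this rule might look like the following sketch (same alphas_cumprod convention as before; the $v_\sigma$ term is left out):

```python
def iterative_denoise_step(xt, x0_hat, t, tp, alphas_cumprod):
    """Step from timestep t to a less noisy timestep tp (tp < t),
    blending the clean-image estimate with the current noisy image."""
    abar_t, abar_tp = alphas_cumprod[t], alphas_cumprod[tp]
    alpha_t = abar_t / abar_tp
    beta_t = 1 - alpha_t
    # posterior mean; the v_sigma noise term is omitted in this sketch
    return (abar_tp.sqrt() * beta_t / (1 - abar_t)) * x0_hat \
         + (alpha_t.sqrt() * (1 - abar_tp) / (1 - abar_t)) * xt
```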

We can see that iterative denoising does offer better detail than one-step denoising, and both are much better than Gaussian blur denoising.

Diffusion Model Sampling

Now let's use iterative denoising to generate some images with the prompt embedding of “a high quality photo”. We do this by starting from pure random noise, so the model is as creative as possible.

Classifier Free Guidance

Although we condition on the prompt embedding of “a high quality photo”, the images are not of great quality. We can use a technique called Classifier-Free Guidance (CFG) to strengthen the conditioning and improve image quality at the expense of image diversity. In CFG, we compute a noise estimate conditioned on a text prompt and an unconditional noise estimate, denoted $\epsilon_c$ and $\epsilon_u$, and let our new estimate be $\epsilon = \epsilon_u + \gamma(\epsilon_c - \epsilon_u)$.
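In code, CFG costs one extra UNet call per step. A sketch assuming a diffusers-style UNet signature (details such as DeepFloyd's extra predicted-variance channels are glossed over):

```python
def cfg_noise_estimate(unet, xt, t, cond_emb, uncond_emb, gamma=7.0):
    """Classifier-free guidance: extrapolate from the unconditional
    estimate toward the conditional one by a factor gamma."""
    eps_c = unet(xt, t, encoder_hidden_states=cond_emb).sample
    eps_u = unet(xt, t, encoder_hidden_states=uncond_emb).sample
    return eps_u + gamma * (eps_c - eps_u)
```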

Let's generate 5 images of “a high quality photo” with a CFG scale of $\gamma = 7$. Indeed, they look much better!

Image-to-Image Translation

When we denoise, the more noise we start with, the more “creative” the model has to be, since it must recover the image by adding “generated” details. Here we look at the effect of different starting indices: the less noise we start with, the more closely the result matches the original image.

Let's see some more examples on web images and hand-drawn sketches.

Inpainting

We are going to use the diffusion model to inpaint a masked part of an image. Specifically, at each step we force $x_t$ to have the same pixels as $x_{\text{orig}}$ wherever the binary mask $m$ is 0: $x_t \leftarrow m\, x_t + (1 - m)\,\text{forward}(x_{\text{orig}}, t)$.
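A sketch of the per-step projection, reusing the forward function sketched earlier:

```python
def inpaint_step(xt, x_orig, mask, t, alphas_cumprod):
    """Keep generated pixels where mask == 1; where mask == 0, replace
    them with a re-noised copy of the original image at the same t."""
    noised, _ = forward(x_orig, t, alphas_cumprod)   # forward() from earlier
    return mask * xt + (1 - mask) * noised
```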

Text-Conditioned Image-to-Image

For the previous results we used the prompt “a high quality photo”. Now let's try a different prompt, for example “a rocket ship”. As the starting noise decreases, the images gradually look more like the original photo while still resembling the prompt.

Visual Anagrams

Here we create optical illusions with diffusion models: an image that looks like “an oil painting of an old man”, but when flipped upside down reveals “an oil painting of people around a campfire”. For the second pair, we use the prompts “an oil painting of a snowy mountain village” and “a photo of a man”. For the third pair, we use the prompts “a photo of the Amalfi coast” and “a photo of a dog”. The algorithm is $\epsilon_1 = \text{UNet}(x_t, t, p_1)$, $\epsilon_2 = \text{flip}(\text{UNet}(\text{flip}(x_t), t, p_2))$, and $\epsilon = \frac{\epsilon_1 + \epsilon_2}{2}$.
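A sketch of the combined estimate, flipping along the height axis (same hedges as before on the UNet call):

```python
import torch

def anagram_noise_estimate(unet, xt, t, p1, p2):
    """Average the estimate for prompt p1 with an un-flipped estimate
    for prompt p2 computed on the upside-down image."""
    eps1 = unet(xt, t, encoder_hidden_states=p1).sample
    xt_flip = torch.flip(xt, dims=[-2])              # flip upside down
    eps2 = torch.flip(unet(xt_flip, t, encoder_hidden_states=p2).sample, dims=[-2])
    return (eps1 + eps2) / 2
```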

Hybrid Images

Similar to the above, we now build hybrid images by combining the low frequencies of one noise estimate with the high frequencies of another, creating images that share features from both prompts. We use $\epsilon = f_{\text{lowpass}}(\epsilon_1) + f_{\text{highpass}}(\epsilon_2)$, where $\epsilon_1 = \text{UNet}(x_t, t, p_1)$ and $\epsilon_2 = \text{UNet}(x_t, t, p_2)$.
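A Gaussian blur can serve as the low-pass filter, with the high-pass taken as the residual. A sketch (the kernel size and sigma here are illustrative choices, not fixed by the method):

```python
import torchvision.transforms.functional as TF

def hybrid_noise_estimate(unet, xt, t, p1, p2, ksize=33, sigma=2.0):
    """Low frequencies from the p1 estimate + high frequencies from p2."""
    eps1 = unet(xt, t, encoder_hidden_states=p1).sample
    eps2 = unet(xt, t, encoder_hidden_states=p2).sample
    low = TF.gaussian_blur(eps1, ksize, sigma)
    high = eps2 - TF.gaussian_blur(eps2, ksize, sigma)
    return low + high
```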

The first image uses “a lithograph of a skull” and “a lithograph of waterfalls”. The second uses “a rocket ship” and “a photo of a dog”. The third uses “an oil painting of an old man” and “an oil painting of people around a campfire”.

Part B: Diffusion Models from Scratch

Train a Denoiser using UNet

We will train a denoiser to denoise a noisy image $z$, produced by applying noise with $\sigma = 0.5$ to a clean image $x$. We will be using the MNIST dataset, and we first train an unconditional UNet with the following structure.

Let's first see what noisy images look like at different noise levels.

To train the denoiser, we generate training data pairs $(z, x)$, where each $x$ is a clean MNIST digit and we construct a noisy version $z$ via $z = x + \sigma\epsilon$, $\epsilon \sim \mathcal{N}(0, I)$. We train the denoiser with the loss $\mathcal{L} = \mathbb{E}_{z,x}\,\|D_\theta(z) - x\|^2$.
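The pair construction and loss fit in a couple of lines; a sketch:

```python
import torch
import torch.nn.functional as F

def denoiser_loss(denoiser, x, sigma=0.5):
    """Make a (z, x) pair on the fly and compute ||D_theta(z) - x||^2."""
    z = x + sigma * torch.randn_like(x)   # z = x + sigma * eps
    return F.mse_loss(denoiser(z), x)
```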

Let's see how our denoiser does after the 1st and 5th epochs of training.

after 1 epoch
after 5 epochs

It does seem that after 5 epochs the denoiser performs much better. Since we have been testing at the same $\sigma = 0.5$ used in training, let's see how it performs at other noise levels.

Train a Diffusion Model by Adding Time Conditioning

Now we are ready for diffusion. We will train a UNet model that can iteratively denoise an image. We first change our loss function to $\mathcal{L} = \mathbb{E}_{z,x}\,\|\epsilon_\theta(z) - \epsilon\|^2$. Since we want to condition on time $t$, this eventually becomes $\mathcal{L} = \mathbb{E}_{z,x}\,\|\epsilon_\theta(x_t, t) - \epsilon\|^2$. That is, to denoise $x_t$, we apply our UNet $\epsilon_\theta$ to $x_t$ to estimate the noise $\epsilon$ at each timestep $t$. Specifically, we inject the value $t$ into the network so that our denoiser is conditioned on it.

We will use the following algorithm to train and get a loss curve.
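A rough sketch of that training step (assuming T = 300 total timesteps, a precomputed alphas_cumprod, and a UNet called as unet(x_t, t); these names are mine):

```python
import torch
import torch.nn.functional as F

def time_cond_loss(unet, x0, alphas_cumprod, T=300):
    """Noise a clean batch to random timesteps, then train the UNet
    to predict the injected noise, conditioned on t."""
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)
    abar = alphas_cumprod[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)
    xt = abar.sqrt() * x0 + (1 - abar).sqrt() * eps
    return F.mse_loss(unet(xt, t), eps)
```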

Now let's use our trained diffusion model to sample some images, starting from random noise, using the following algorithm. Let's also compare performance after 5 and 20 epochs of training.
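And a rough sketch of the sampling loop, reusing the iterative update from Part A (the noise-injection choice here is an assumption of this sketch):

```python
import torch

@torch.no_grad()
def sample(unet, alphas_cumprod, shape=(16, 1, 28, 28), T=300):
    """Start from pure noise and denoise step by step down to t = 0."""
    x = torch.randn(shape)
    for t in reversed(range(1, T)):
        tb = torch.full((shape[0],), t)
        abar_t, abar_p = alphas_cumprod[t], alphas_cumprod[t - 1]
        alpha_t = abar_t / abar_p
        eps_hat = unet(x, tb)
        # one-step clean estimate, then the iterative update from Part A
        x0_hat = (x - (1 - abar_t).sqrt() * eps_hat) / abar_t.sqrt()
        x = (abar_p.sqrt() * (1 - alpha_t) / (1 - abar_t)) * x0_hat \
          + (alpha_t.sqrt() * (1 - abar_p) / (1 - abar_t)) * x
        if t > 1:
            x = x + (1 - alpha_t).sqrt() * torch.randn_like(x)  # v_sigma
    return x
```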

after 5 epochs
after 20 epochs

Class-Conditioned

Now, after conditioning on the timestep $t$ to iteratively generate better results, we can also condition the denoising on class to better guide the generation process. Instead of generating through $\epsilon_\theta(x, t)$, we use $\epsilon_\theta(x, t, c)$, where $c$ is a one-hot class vector. We use the algorithm below to train the model to go from random noise to our desired class, guided by $c$.
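A sketch of the class-conditioned training step, extending the time-conditioned loss above (the 10% conditioning dropout is a common choice; names are mine):

```python
import torch
import torch.nn.functional as F

def class_cond_loss(unet, x0, labels, alphas_cumprod, T=300, p_uncond=0.1):
    """Time- and class-conditioned training step; class vectors are
    zeroed with probability p_uncond so the model also learns the
    unconditional estimate needed later for CFG."""
    c = F.one_hot(labels, num_classes=10).float()
    drop = torch.rand(c.shape[0], device=c.device) < p_uncond
    c[drop] = 0.0                                    # "null" class
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)
    abar = alphas_cumprod[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)
    xt = abar.sqrt() * x0 + (1 - abar).sqrt() * eps
    return F.mse_loss(unet(xt, t, c), eps)
```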

The sampling process is pretty much the same as for time-conditioned generation, except that we use classifier-free guidance with $\gamma = 5$. Let's generate some digits!
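The only change inside the sampling loop is the guided noise estimate; a sketch:

```python
import torch

def class_cfg_estimate(unet, x, t, c, gamma=5.0):
    """Class-conditioned CFG: the unconditional branch uses the same
    zero class vector seen during training dropout."""
    eps_c = unet(x, t, c)
    eps_u = unet(x, t, torch.zeros_like(c))
    return eps_u + gamma * (eps_c - eps_u)
```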

5 epochs
20 epochs

The model after 20 epochs of training indeed produces better-shaped digits, but the 5-epoch checkpoint seems acceptable as well.