Project 5: Diffusion Models

Sam Huang

Part A: Power of Diffusion Models

In this part of the project, we play around with a pretrained diffusion model and implement different sampling loops to achieve various effects, such as inpainting and optical illusions. We use the DeepFloyd IF diffusion model, with a global random seed of 180.

Sampling from the Model

DeepFloyd was trained as a text-to-image model that takes text prompts as input and outputs corresponding images. Let's generate some images from the provided prompt embeddings. Below are results after stage 2 of DeepFloyd, which outputs 256×256 three-channel images, using 20 inference steps.

20 inference steps

Now let's try again with 50 inference steps. There is so much more detail in each of the images! The 20-step results look pretty good but are more cartoonish. With 50 steps, although some images still look cartoonish, the model clearly attempts to add more detail.

50 inference steps

Sampling Loop

We first implement the forward process: given a clean image $x_0$, we get a noisy image $x_t$ at timestep $t$.

The forward process is defined by $q(x_t \mid x_0) = \mathcal{N}(x_t;\, \sqrt{\bar{\alpha}_t}\, x_0,\, (1 - \bar{\alpha}_t)\mathbf{I})$, which is equivalent to computing $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$ with $\epsilon \sim \mathcal{N}(0, \mathbf{I})$.
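As a minimal sketch in PyTorch (assuming a precomputed alphas_cumprod tensor of cumulative products, indexed by timestep), the forward process is just a weighted blend of the image with fresh Gaussian noise:

```python
import torch

def forward(x0, t, alphas_cumprod):
    """Sample x_t ~ q(x_t | x_0): scale the clean image down and mix in noise."""
    abar_t = alphas_cumprod[t]                       # cumulative product at step t
    eps = torch.randn_like(x0)                       # eps ~ N(0, I)
    xt = abar_t.sqrt() * x0 + (1 - abar_t).sqrt() * eps
    return xt, eps
```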

Let's see a series of images with increasing noise levels at timestep $t$.

Classical Denoising

Let's first see how a classical approach, Gaussian blur filtering, does on these noisy images. The results technically aren't bad given the amount of noise, but they are certainly not good.

UNet Denoising: One-Step & Iterative

Let's see if we can do better with stage_1.unet. Given a noisy image, we pass it to stage_1.unet to get a noise estimate, then subtract that noise to get an estimated clean image by inverting the forward equation above.
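The one-step estimate simply inverts the forward equation. A sketch, with eps_hat standing in for the noise predicted by stage_1.unet:

```python
def one_step_denoise(xt, t, eps_hat, alphas_cumprod):
    """Solve the forward equation for x_0 given x_t and a noise estimate."""
    abar_t = alphas_cumprod[t]
    return (xt - (1 - abar_t).sqrt() * eps_hat) / abar_t.sqrt()
```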

To further improve UNet denoising, instead of estimating the clean image in one step, we can denoise iteratively. Striding the timesteps by 30, at each step we estimate a cleaner version of the noisy image using the formula $x_{t'} = \frac{\sqrt{\bar{\alpha}_{t'}}\,\beta_t}{1 - \bar{\alpha}_t}\, x_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar{\alpha}_{t'})}{1 - \bar{\alpha}_t}\, x_t + v_\sigma$, where $x_t$ is the image at timestep $t$; $x_{t'}$ is the noisy image at timestep $t'$ with $t' < t$ (less noisy); $\bar{\alpha}_t$ is defined by alphas_cumprod; $\alpha_t = \bar{\alpha}_t / \bar{\alpha}_{t'}$ and $\beta_t = 1 - \alpha_t$; $x_0$ is the current estimate of the clean image; and $v_\sigma$ is random noise.
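One update of this rule might look like the following sketch (same alphas_cumprod convention as before; the $v_\sigma$ term is left out):

```python
def iterative_denoise_step(xt, x0_hat, t, tp, alphas_cumprod):
    """Step from timestep t to a less noisy timestep tp (tp < t),
    blending the clean-image estimate with the current noisy image."""
    abar_t, abar_tp = alphas_cumprod[t], alphas_cumprod[tp]
    alpha_t = abar_t / abar_tp
    beta_t = 1 - alpha_t
    # posterior mean; the v_sigma noise term is omitted in this sketch
    return (abar_tp.sqrt() * beta_t / (1 - abar_t)) * x0_hat \
         + (alpha_t.sqrt() * (1 - abar_tp) / (1 - abar_t)) * xt
```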

We can see that iterative denoising does offer better detail than one-step denoising, and both are much better than Gaussian blur denoising.

Diffusion Model Sampling

Now let's use iterative denoising to generate some images with the prompt embedding of “a high quality photo”. We do this by starting from pure random noise, so the model is as creative as possible.

Classifier Free Guidance

Although we condition on the prompt embedding of “a high quality photo”, the images are not of great quality. We can use a technique called Classifier-Free Guidance (CFG) to strengthen the conditioning and improve image quality at the expense of image diversity. In CFG, we compute a noise estimate conditioned on a text prompt and an unconditional noise estimate, denoted $\epsilon_c$ and $\epsilon_u$, and let our new estimate be $\epsilon = \epsilon_u + \gamma(\epsilon_c - \epsilon_u)$.
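In code, CFG costs one extra UNet call per step. A sketch assuming a diffusers-style UNet signature (details such as DeepFloyd's extra predicted-variance channels are glossed over):

```python
def cfg_noise_estimate(unet, xt, t, cond_emb, uncond_emb, gamma=7.0):
    """Classifier-free guidance: extrapolate from the unconditional
    estimate toward the conditional one by a factor gamma."""
    eps_c = unet(xt, t, encoder_hidden_states=cond_emb).sample
    eps_u = unet(xt, t, encoder_hidden_states=uncond_emb).sample
    return eps_u + gamma * (eps_c - eps_u)
```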

Let's generate 5 images of “a high quality photo” with a CFG scale of $\gamma = 7$. Indeed, they look much better!

Image-to-Image Translation

When we denoise, the more noise we start with, the more “creative” the model has to be, since it must recover the image by adding “generated” details. Here we look at the effect of different starting indices: the less noise we start with, the more closely the result matches the original image.

Let's see some more examples on web images and hand-drawn sketches.

Inpainting

We are going to use the diffusion model to inpaint a masked part of an image. Specifically, at each step we force $x_t$ to have the same pixels as $x_{\text{orig}}$ wherever the binary mask $m$ is 0: $x_t \leftarrow m\, x_t + (1 - m)\,\text{forward}(x_{\text{orig}}, t)$.
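A sketch of the per-step projection, reusing the forward function sketched earlier:

```python
def inpaint_step(xt, x_orig, mask, t, alphas_cumprod):
    """Keep generated pixels where mask == 1; where mask == 0, replace
    them with a re-noised copy of the original image at the same t."""
    noised, _ = forward(x_orig, t, alphas_cumprod)   # forward() from earlier
    return mask * xt + (1 - mask) * noised
```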

Text-Conditioned Image-to-Image

For the previous results we used the prompt “a high quality photo”. Now let's try a different prompt, for example “a rocket ship”. As the starting noise decreases, the images gradually look more like the original photo while still resembling the prompt.

Visual Anagrams

Here we create optical illusions with diffusion models: an image that looks like “an oil painting of an old man”, but when flipped upside down reveals “an oil painting of people around a campfire”. For the second pair, we use the prompts “an oil painting of a snowy mountain village” and “a photo of a man”. For the third pair, we use the prompts “a photo of the Amalfi coast” and “a photo of a dog”. The algorithm is $\epsilon_1 = \text{UNet}(x_t, t, p_1)$, $\epsilon_2 = \text{flip}(\text{UNet}(\text{flip}(x_t), t, p_2))$, and $\epsilon = \frac{\epsilon_1 + \epsilon_2}{2}$.
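A sketch of the combined estimate, flipping along the height axis (same hedges as before on the UNet call):

```python
import torch

def anagram_noise_estimate(unet, xt, t, p1, p2):
    """Average the estimate for prompt p1 with an un-flipped estimate
    for prompt p2 computed on the upside-down image."""
    eps1 = unet(xt, t, encoder_hidden_states=p1).sample
    xt_flip = torch.flip(xt, dims=[-2])              # flip upside down
    eps2 = torch.flip(unet(xt_flip, t, encoder_hidden_states=p2).sample, dims=[-2])
    return (eps1 + eps2) / 2
```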

Hybrid Images

Similar to the above, we now build hybrid images by combining the low frequencies of one noise estimate with the high frequencies of another, creating images that share features from both prompts. We use $\epsilon = f_{\text{lowpass}}(\epsilon_1) + f_{\text{highpass}}(\epsilon_2)$, where $\epsilon_1 = \text{UNet}(x_t, t, p_1)$ and $\epsilon_2 = \text{UNet}(x_t, t, p_2)$.
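A Gaussian blur can serve as the low-pass filter, with the high-pass taken as the residual. A sketch (the kernel size and sigma here are illustrative choices, not fixed by the method):

```python
import torchvision.transforms.functional as TF

def hybrid_noise_estimate(unet, xt, t, p1, p2, ksize=33, sigma=2.0):
    """Low frequencies from the p1 estimate + high frequencies from p2."""
    eps1 = unet(xt, t, encoder_hidden_states=p1).sample
    eps2 = unet(xt, t, encoder_hidden_states=p2).sample
    low = TF.gaussian_blur(eps1, ksize, sigma)
    high = eps2 - TF.gaussian_blur(eps2, ksize, sigma)
    return low + high
```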

The first image uses “a lithograph of a skull” and “a lithograph of waterfalls”. The second uses “a rocket ship” and “a photo of a dog”. The third uses “an oil painting of an old man” and “an oil painting of people around a campfire”.

Part B: Diffusion Models from Scratch

Train a Denoiser using UNet

We will train a denoiser to denoise a noisy image $z$, produced by applying noise with $\sigma = 0.5$ to a clean image $x$. We will be using the MNIST dataset, and we first train an unconditional UNet with the following structure.

Let's first see what noisy images look like at different noise levels.

To train the denoiser, we generate training data pairs $(z, x)$, where each $x$ is a clean MNIST digit and we construct a noisy version $z$ via $z = x + \sigma\epsilon$, $\epsilon \sim \mathcal{N}(0, I)$. We train the denoiser with the loss $\mathcal{L} = \mathbb{E}_{z,x}\,\|D_\theta(z) - x\|^2$.
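The pair construction and loss fit in a couple of lines; a sketch:

```python
import torch
import torch.nn.functional as F

def denoiser_loss(denoiser, x, sigma=0.5):
    """Make a (z, x) pair on the fly and compute ||D_theta(z) - x||^2."""
    z = x + sigma * torch.randn_like(x)   # z = x + sigma * eps
    return F.mse_loss(denoiser(z), x)
```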

Let's see how our denoiser does after the 1st and 5th epochs of training.

after 1 epoch
after 5 epochs

It does seem that after 5 epochs the denoiser performs much better. Since we have been testing at the same $\sigma = 0.5$ used in training, let's see how it performs at other noise levels.

Train a Diffusion Model by Adding Time Conditioning

Now we are ready for diffusion. We will train a UNet model that can iteratively denoise an image. We first change our loss function to $\mathcal{L} = \mathbb{E}_{z,x}\,\|\epsilon_\theta(z) - \epsilon\|^2$. Since we want to condition on time $t$, this eventually becomes $\mathcal{L} = \mathbb{E}_{z,x}\,\|\epsilon_\theta(x_t, t) - \epsilon\|^2$. That is, to denoise $x_t$, we apply our UNet $\epsilon_\theta$ to $x_t$ to estimate the noise $\epsilon$ at each timestep $t$. Specifically, we inject the value $t$ into the network so that our denoiser is conditioned on it.

We will use the following algorithm to train and get a loss curve.
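A rough sketch of that training step (assuming T = 300 total timesteps, a precomputed alphas_cumprod, and a UNet called as unet(x_t, t); these names are mine):

```python
import torch
import torch.nn.functional as F

def time_cond_loss(unet, x0, alphas_cumprod, T=300):
    """Noise a clean batch to random timesteps, then train the UNet
    to predict the injected noise, conditioned on t."""
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)
    abar = alphas_cumprod[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)
    xt = abar.sqrt() * x0 + (1 - abar).sqrt() * eps
    return F.mse_loss(unet(xt, t), eps)
```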

Now let's use our trained diffusion model to sample some images, starting from random noise, using the following algorithm. Let's also compare performance after 5 and 20 epochs of training.
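And a rough sketch of the sampling loop, reusing the iterative update from Part A (the noise-injection choice here is an assumption of this sketch):

```python
import torch

@torch.no_grad()
def sample(unet, alphas_cumprod, shape=(16, 1, 28, 28), T=300):
    """Start from pure noise and denoise step by step down to t = 0."""
    x = torch.randn(shape)
    for t in reversed(range(1, T)):
        tb = torch.full((shape[0],), t)
        abar_t, abar_p = alphas_cumprod[t], alphas_cumprod[t - 1]
        alpha_t = abar_t / abar_p
        eps_hat = unet(x, tb)
        # one-step clean estimate, then the iterative update from Part A
        x0_hat = (x - (1 - abar_t).sqrt() * eps_hat) / abar_t.sqrt()
        x = (abar_p.sqrt() * (1 - alpha_t) / (1 - abar_t)) * x0_hat \
          + (alpha_t.sqrt() * (1 - abar_p) / (1 - abar_t)) * x
        if t > 1:
            x = x + (1 - alpha_t).sqrt() * torch.randn_like(x)  # v_sigma
    return x
```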

after 5 epochs
after 20 epochs

Class-Conditioned

Now, after conditioning on the timestep $t$ to iteratively generate better results, we can also condition the denoising on class to better guide the generation process. Instead of generating through $\epsilon_\theta(x, t)$, we use $\epsilon_\theta(x, t, c)$, where $c$ is a one-hot class vector. We use the algorithm below to train the model to go from random noise to our desired class, guided by $c$.
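A sketch of the class-conditioned training step, extending the time-conditioned loss above (the 10% conditioning dropout is a common choice; names are mine):

```python
import torch
import torch.nn.functional as F

def class_cond_loss(unet, x0, labels, alphas_cumprod, T=300, p_uncond=0.1):
    """Time- and class-conditioned training step; class vectors are
    zeroed with probability p_uncond so the model also learns the
    unconditional estimate needed later for CFG."""
    c = F.one_hot(labels, num_classes=10).float()
    drop = torch.rand(c.shape[0], device=c.device) < p_uncond
    c[drop] = 0.0                                    # "null" class
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)
    abar = alphas_cumprod[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)
    xt = abar.sqrt() * x0 + (1 - abar).sqrt() * eps
    return F.mse_loss(unet(xt, t, c), eps)
```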

The sampling process is pretty much the same as for time-conditioned generation, except that we use classifier-free guidance with $\gamma = 5$. Let's generate some digits!
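The only change inside the sampling loop is the guided noise estimate; a sketch:

```python
import torch

def class_cfg_estimate(unet, x, t, c, gamma=5.0):
    """Class-conditioned CFG: the unconditional branch uses the same
    zero class vector seen during training dropout."""
    eps_c = unet(x, t, c)
    eps_u = unet(x, t, torch.zeros_like(c))
    return eps_u + gamma * (eps_c - eps_u)
```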

5 epochs
20 epochs

The model after 20 epochs of training indeed produces better-shaped digits, but the 5-epoch checkpoint seems acceptable as well.