Project 5: Diffusion Models
Person: Sam Huang
Part A: Power of Diffusion Models
In this part of the project, we play around with diffusion models and implement different sampling loops to achieve various effects, such as inpainting and optical illusions. We use the DeepFloyd IF diffusion model, with a global random seed of 180 throughout this part.
Sampling from the Model
DeepFloyd was trained as a text-to-image model that takes text prompts as input and outputs corresponding images. Let's generate some images from the given prompt embeddings. Here are some results after stage 2 of DeepFloyd, which outputs 256×256 three-channel images, using 20 inference steps.
Now let's try again with 50 inference steps. Wow, there is so much more detail in each of the images! The 20-step results look pretty good but are more cartoonish; with 50 steps, although some images still look cartoonish, the model clearly tries to add more detail.
Sampling Loop
We first implement the forward process: given a clean image $x_0$, we get a noisy image $x_t$ at timestep $t$.
The forward process is defined by

$$q(x_t \mid x_0) = \mathcal{N}\big(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t)\mathbf{I}\big),$$

which is equivalent to computing

$$x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, \mathbf{I}).$$
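A minimal sketch of this forward function, assuming alphas_cumprod is the scheduler's cumulative alpha schedule (e.g. stage_1.scheduler.alphas_cumprod):

```python
import torch

def forward(im, t, alphas_cumprod):
    """Noise a clean image `im` to timestep `t`.

    Implements x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps,
    eps ~ N(0, I), where abar_t comes from the cumulative alpha schedule.
    """
    abar_t = alphas_cumprod[t]
    eps = torch.randn_like(im)
    return torch.sqrt(abar_t) * im + torch.sqrt(1 - abar_t) * eps
```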
Let's see a series of images with increasing noise levels at several timesteps $t$.
Classical Denoising
Let's first see how a classical approach, Gaussian blur filtering, does on these noisy images. Well, the results technically aren't bad given the amount of noise, but they are certainly not good.
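The classical baseline is just a blur; a minimal sketch (the kernel size and sigma here are illustrative, not tuned per noise level):

```python
import torch
import torchvision.transforms.functional as TF

# Stand-in for a noisy image produced by the forward process above.
noisy_im = torch.rand(3, 256, 256)

# Classical baseline: try to smooth away the noise with a Gaussian blur.
denoised = TF.gaussian_blur(noisy_im, kernel_size=5, sigma=2.0)
```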
UNet Denoising: One Step & Iteratively
Let's see if we can do better with stage_1.unet. Given a noisy image, we pass it to stage_1.unet to get a noise estimate, then subtract the noise to get an estimated clean image following the equations above.
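A sketch of the one-step estimate; the stage_1.unet call follows the diffusers API, and keeping only the first three output channels as the noise estimate is an assumption (DeepFloyd's stage-1 UNet also predicts a variance):

```python
import torch

def one_step_denoise(x_t, t, unet, prompt_embeds, alphas_cumprod):
    """Estimate the clean image from x_t in a single step.

    Inverts x_t = sqrt(abar_t) x_0 + sqrt(1 - abar_t) eps once the UNet
    has produced a noise estimate eps.
    """
    with torch.no_grad():
        eps = unet(x_t, t, encoder_hidden_states=prompt_embeds).sample[:, :3]
    abar_t = alphas_cumprod[t]
    return (x_t - torch.sqrt(1 - abar_t) * eps) / torch.sqrt(abar_t)
```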
To further improve the UNet denoising, instead of estimating the clean image in one step, we can do it iteratively. We create strided timesteps that skip 30 steps at a time, and at each stride we estimate a cleaner version of the noisy image following this formula:

$$x_{t'} = \frac{\sqrt{\bar\alpha_{t'}}\,\beta_t}{1 - \bar\alpha_t}\,x_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar\alpha_{t'})}{1 - \bar\alpha_t}\,x_t + v_\sigma,$$

where $x_t$ is the image at timestep $t$; $x_{t'}$ is the noisy image at timestep $t'$ with $t' < t$ (less noisy); $\bar\alpha_t$ is defined by alphas_cumprod; $\alpha_t = \bar\alpha_t / \bar\alpha_{t'}$ and $\beta_t = 1 - \alpha_t$; and $x_0$ is our current estimate of the clean image.
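Here is a minimal sketch of the loop, assuming the same diffusers-style stage_1.unet call as before and dropping the $v_\sigma$ noise term for brevity:

```python
import torch

def iterative_denoise(x_t, strided_timesteps, unet, prompt_embeds, alphas_cumprod):
    """Denoise from the noisiest strided timestep down to clean.

    strided_timesteps: decreasing list of timesteps (e.g. stride 30).
    At each step we blend the current clean-image estimate x0_hat with
    the current noisy image x_t using the formula above.
    """
    for t, t_prev in zip(strided_timesteps[:-1], strided_timesteps[1:]):
        abar_t, abar_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
        alpha = abar_t / abar_prev
        beta = 1 - alpha

        with torch.no_grad():
            # First 3 channels = noise estimate (stage-1 also predicts variance).
            eps = unet(x_t, t, encoder_hidden_states=prompt_embeds).sample[:, :3]
        x0_hat = (x_t - torch.sqrt(1 - abar_t) * eps) / torch.sqrt(abar_t)

        x_t = (torch.sqrt(abar_prev) * beta / (1 - abar_t)) * x0_hat \
            + (torch.sqrt(alpha) * (1 - abar_prev) / (1 - abar_t)) * x_t
    return x_t
```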
We can see that iterative denoising offers better detail than one-step denoising, and both are much better than Gaussian denoising.
Diffusion Model Sampling
Now let's use iterative denoising to generate some images with the prompt embedding of “a high quality photo”. We can do this by starting from pure random noise, so the model is as creative as possible.
Classifier Free Guidance
Although we use the prompt embedding of “a high quality photo”, the images are not of “great quality.” We can use a technique called Classifier-Free Guidance (CFG) to strengthen the conditioning, improving image quality at the expense of image diversity. In CFG, we compute a noise estimate conditioned on a text prompt and an unconditional noise estimate, denoted $\epsilon_c$ and $\epsilon_u$, and we let our new estimate be

$$\epsilon = \epsilon_u + \gamma(\epsilon_c - \epsilon_u),$$

where $\gamma$ is the CFG scale.
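A minimal sketch of the CFG noise estimate, with the same assumed stage_1.unet call as above:

```python
import torch

def cfg_noise_estimate(x_t, t, unet, cond_embeds, uncond_embeds, gamma):
    """Classifier-free guidance: eps = eps_u + gamma * (eps_c - eps_u).

    gamma = 0 recovers the unconditional estimate, gamma = 1 the
    conditional one, and gamma > 1 pushes the sample toward the prompt.
    """
    with torch.no_grad():
        eps_c = unet(x_t, t, encoder_hidden_states=cond_embeds).sample[:, :3]
        eps_u = unet(x_t, t, encoder_hidden_states=uncond_embeds).sample[:, :3]
    return eps_u + gamma * (eps_c - eps_u)
```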
Let's get 5 images of “a high quality photo” with CFG. Indeed, so much better!
Image-to-Image Translation
When we denoise, the more noise we add, the more “creative” the model has to be to recover the image by hallucinating “generated” details. Here we examine the effect of different starting indices: as the starting index increases (i.e., less noise is added), the outputs gradually converge to the original image.
Let's see some more examples on web images and hand-drawn images.
Inpainting
We are going to use the diffusion model to inpaint a masked part of an image. Specifically, after every denoising step we force $x_t$ to have the same pixels as $x_{orig}$ wherever the binary mask $\mathbf{m}$ is 0:

$$x_t \leftarrow \mathbf{m}\,x_t + (1 - \mathbf{m})\,\text{forward}(x_{orig}, t).$$
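This is a one-line correction applied inside the denoising loop; a sketch reusing the forward() helper from earlier:

```python
def inpaint_step(x_t, t, x_orig, mask, alphas_cumprod):
    """Project known pixels back onto the current sample.

    Wherever the binary mask is 0, x_t is replaced with the original
    image noised to the current timestep, so only the masked region
    (mask == 1) is actually generated.
    """
    return mask * x_t + (1 - mask) * forward(x_orig, t, alphas_cumprod)
```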
Text-Conditioned Image-to-Image
For the previous results, we've been using the prompt “a high quality photo”. Now let's try a different prompt, for example “a rocket ship”. The images gradually look more like our original photo while also matching the prompt.
Visual Anagrams
Here we create optical illusions with diffusion models: an image that looks like “an oil painting of an old man”, but when flipped upside down reveals “an oil painting of people around a campfire”. For the second pair, we use the prompts “an oil painting of a snowy mountain village” and “a photo of a man”. For the third pair, we use the prompts “a photo of the Amalfi coast” and “a photo of a dog”. The algorithm we use is

$$\epsilon_1 = \text{UNet}(x_t, t, p_1), \qquad \epsilon_2 = \text{flip}\big(\text{UNet}(\text{flip}(x_t), t, p_2)\big), \qquad \epsilon = \frac{\epsilon_1 + \epsilon_2}{2}.$$
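A sketch of the anagram noise estimate (CFG, which is still applied to each estimate in practice, is omitted here for brevity):

```python
import torch

def anagram_noise_estimate(x_t, t, unet, embeds_1, embeds_2):
    """Average a normal noise estimate with one computed upside down.

    The second estimate is produced on the vertically flipped image and
    flipped back, so both orientations are denoised toward their prompts.
    """
    with torch.no_grad():
        eps_1 = unet(x_t, t, encoder_hidden_states=embeds_1).sample[:, :3]
        flipped = torch.flip(x_t, dims=[-2])  # flip along the height axis
        eps_2 = unet(flipped, t, encoder_hidden_states=embeds_2).sample[:, :3]
        eps_2 = torch.flip(eps_2, dims=[-2])  # flip the estimate back
    return (eps_1 + eps_2) / 2
```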
Hybrid Images
Similar to the above, we now build hybrid images by combining the low frequencies of one noise estimate with the high frequencies of another, creating images that share features from both prompts. We use the algorithm

$$\epsilon_1 = \text{UNet}(x_t, t, p_1), \qquad \epsilon_2 = \text{UNet}(x_t, t, p_2), \qquad \epsilon = f_{\text{lowpass}}(\epsilon_1) + f_{\text{highpass}}(\epsilon_2),$$

where $f_{\text{lowpass}}$ is a Gaussian blur and $f_{\text{highpass}}$ is its complement.
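A sketch with the low-pass implemented as a Gaussian blur (the kernel size and sigma here are illustrative choices, not necessarily the ones behind the results below):

```python
import torch
import torchvision.transforms.functional as TF

def hybrid_noise_estimate(x_t, t, unet, embeds_1, embeds_2):
    """Low frequencies of one estimate plus high frequencies of another."""
    with torch.no_grad():
        eps_1 = unet(x_t, t, encoder_hidden_states=embeds_1).sample[:, :3]
        eps_2 = unet(x_t, t, encoder_hidden_states=embeds_2).sample[:, :3]
    low = TF.gaussian_blur(eps_1, kernel_size=33, sigma=2.0)           # f_lowpass
    high = eps_2 - TF.gaussian_blur(eps_2, kernel_size=33, sigma=2.0)  # f_highpass
    return low + high
```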
The first image uses “a lithograph of a skull” and “a lithograph of waterfalls”. The second image uses “a rocket ship” and “a photo of a dog”. The third image uses “an oil painting of an old man” and “an oil painting of people around a campfire.”
Part B: Diffusion Models from Scratch
Train a Denoiser using UNet
We will train a denoiser to denoise a noisy image $z = x + \sigma\epsilon$, where noise $\sigma\epsilon$ (with $\epsilon \sim \mathcal{N}(0, I)$) is applied to a clean image $x$. We will be using the MNIST dataset. We first train an unconditional UNet according to the following structure.
Let's first see what noisy images look like at different noise levels.
To train the denoiser, we generate training data pairs $(z, x)$, where each $x$ is a clean MNIST digit and we construct a noisy version through $z = x + \sigma\epsilon$. We train our denoiser with the following loss function:

$$L = \mathbb{E}_{z,x}\,\big\|D_\theta(z) - x\big\|^2.$$
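A sketch of a single training step under these definitions (the default $\sigma = 0.5$ is an assumption about the training noise level):

```python
import torch
import torch.nn.functional as F

def denoiser_training_step(denoiser, x, sigma=0.5):
    """One step of denoiser training.

    Builds a (z, x) pair via z = x + sigma * eps, eps ~ N(0, I), and
    returns the MSE loss || D_theta(z) - x ||^2 to backpropagate.
    """
    eps = torch.randn_like(x)
    z = x + sigma * eps
    x_hat = denoiser(z)
    return F.mse_loss(x_hat, x)
```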
Let’s see how our denoiser does after 1st and 5th epoch of training.
It does seem like after 5 epochs the denoiser performs much better. Since we were testing with the same $\sigma$ used in training, let's see how the denoiser performs at other noise levels.
Train a Diffusion Model by Adding Time Conditioning
Now we are ready for diffusion. We will train a UNet model that can iteratively denoise an image. We first change our loss function to predict the noise instead of the clean image:

$$L = \mathbb{E}_{\epsilon,z}\,\big\|\epsilon_\theta(z) - \epsilon\big\|^2.$$

Since we want to condition on time $t$, this eventually becomes

$$L = \mathbb{E}_{\epsilon,x_0,t}\,\big\|\epsilon_\theta(x_t, t) - \epsilon\big\|^2.$$

That is, to denoise $x_t$, we apply our UNet to $x_t$ and obtain the noise estimate at each timestep $t$. Specifically, we inject the scalar value $t$ into the network so our denoiser is conditioned on it.
We will use the following algorithm to train the model and obtain a loss curve.
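A sketch of one training iteration, assuming $T = 300$ total timesteps and a UNet that accepts a normalized timestep (both assumptions about the setup):

```python
import torch
import torch.nn.functional as F

def time_conditioned_training_step(unet, x0, alphas_cumprod, num_ts=300):
    """Sample a random t per image, noise x0 to x_t, and predict the noise.

    Minimizes || eps_theta(x_t, t) - eps ||^2; the timestep is normalized
    to [0, 1] before being injected into the UNet.
    """
    t = torch.randint(1, num_ts, (x0.shape[0],), device=x0.device)
    abar = alphas_cumprod[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = torch.sqrt(abar) * x0 + torch.sqrt(1 - abar) * eps
    eps_hat = unet(x_t, t.float() / num_ts)
    return F.mse_loss(eps_hat, eps)
```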
Now let's use our trained diffusion model to sample some images with the following algorithm, starting from pure random noise. Let's also compare performance after 5 and 20 epochs of training.
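A sketch of the sampling loop under the same assumptions (alphas, betas, and alphas_cumprod are the usual DDPM schedules):

```python
import torch

@torch.no_grad()
def sample(unet, alphas, betas, alphas_cumprod, num_ts=300, shape=(1, 1, 28, 28)):
    """DDPM sampling from pure noise with the time-conditioned UNet."""
    x = torch.randn(shape)
    for t in range(num_ts - 1, 0, -1):
        tt = torch.full((shape[0],), t)
        eps = unet(x, tt.float() / num_ts)
        abar, abar_prev = alphas_cumprod[t], alphas_cumprod[t - 1]
        x0_hat = (x - torch.sqrt(1 - abar) * eps) / torch.sqrt(abar)
        # Posterior mean of x_{t-1} given x_t and the clean estimate.
        x = (torch.sqrt(abar_prev) * betas[t] / (1 - abar)) * x0_hat \
            + (torch.sqrt(alphas[t]) * (1 - abar_prev) / (1 - abar)) * x
        if t > 1:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x
```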
Class-Conditioned
Now, having added conditioning on the timestep $t$ to iteratively generate better results, we can also condition the denoising on class to better guide the generation process. Instead of modeling $\epsilon_\theta(x_t, t)$, we use $\epsilon_\theta(x_t, t, c)$, where $c$ is a one-hot vector encoding the desired class. We use the algorithm below to train the process from random noise to our desired class, guided by the one-hot vector $c$; during training we also drop the class conditioning (setting $c = 0$) a fraction of the time, so the model can still generate unconditionally.
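A sketch of the class-conditioned training step (the 10% conditioning-dropout rate is an assumption about the setup):

```python
import torch
import torch.nn.functional as F

def class_conditioned_training_step(unet, x0, labels, alphas_cumprod,
                                    num_ts=300, p_uncond=0.1):
    """Time- and class-conditioned training step.

    The label is passed as a one-hot vector c; with probability p_uncond
    it is zeroed out so the model also learns the unconditional
    distribution needed for classifier-free guidance at sampling time.
    """
    c = F.one_hot(labels, num_classes=10).float()
    drop = (torch.rand(c.shape[0], device=c.device) < p_uncond).unsqueeze(1)
    c = c * (~drop).float()  # zero the conditioning for dropped samples
    t = torch.randint(1, num_ts, (x0.shape[0],), device=x0.device)
    abar = alphas_cumprod[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = torch.sqrt(abar) * x0 + torch.sqrt(1 - abar) * eps
    eps_hat = unet(x_t, t.float() / num_ts, c)
    return F.mse_loss(eps_hat, eps)
```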
The sampling process is pretty much the same as for time-conditioned generation, except that we use classifier-free guidance, $\epsilon = \epsilon_u + \gamma(\epsilon_c - \epsilon_u)$, where the unconditional estimate comes from zeroing out $c$. Let's generate some digits!
The model after 20 epochs of training indeed shows better shapes for the generated digits, but the 5-epoch checkpoint seems acceptable as well.