You're watching my computer draw something that has never existed until just now. Of course, red pandas exist, and images of them exist, but not this specific one: there's nowhere on the internet that you can find this exact combination of pixels. No, this is a brand new red panda, conjured out of thin air by a diffusion model. These are computational models that generate new images through an iterative process, starting from a sample of Gaussian noise. They do this by learning to reverse what's known as a diffusion process. This is a process where you start from a real image and gradually add more and more samples of Gaussian noise. If you repeat this for long enough, you'll eventually destroy all of the information in the original image and arrive at a pure sample of Gaussian noise. It turns out that this is a sequential process that a neural network can learn to reverse, and neural networks that successfully do so are called diffusion models. By showing it many of these sequences, examples of how to turn Gaussian noise into realistic-looking images, the diffusion model learns to do something very hard, generating brand new images, starting from something very easy, drawing a sample of Gaussian noise. At least, this is how 99% of people will be introduced to diffusion models. But when it's presented this way, I find it very surprising that this works. For one thing, how do diffusion models decide that one noise sample should transform into a fish, but a different noise sample into something else entirely? When diffusion models are presented this way, the decision seems totally arbitrary. This has always left these models under a shroud of mystery, one of the things I'd accepted I would never understand beyond a surface level. But there's another way to view diffusion models that most people never learn, one that I argue is more effective at uncovering the secrets behind why these models are so successful at one of the hardest tasks you could ever ask a computer to do.
To understand this new view of diffusion models, the first concept we need to be comfortable with is what we'll call image space: the space of all possible images, let's say of size 1,000 by 1,000 pixels. This is a one-million-dimensional space, with each axis representing the value of one of the pixels in an image. Of course, what you're seeing now is only a two-dimensional grid, which can only fully represent two pixels, but we'll pretend this represents the entire one-million-dimensional space. We also know that pixels can take a value between 0 and 255, representing the intensity of that pixel. So all possible images live in a confined, box-shaped region, a one-million-dimensional cube, or hypercube to be precise, where each side has length 256 units. The part of one-million-dimensional space that we care about is bounded between the values 0 and 255, instead of the unbounded axes that we usually see in math. Now, each location in image space is a different possible image. For example, this cat playing a piano might be located at this spot, and this image, which is pure nonsense (I sampled the value of each pixel randomly from a Gaussian distribution), might be located here. It might be clear to you that some images in image space look like good images and others don't, but to a computer, all of these images look the same: each of them is just a one-million-dimensional vector, or a matrix with 1,000 rows and 1,000 columns. A good image generator, then, has to understand what makes an image a good one. It has to differentiate between the bad images and the good images, and somehow generate images that are, in some sense, closer to the good images. The first step to solving this problem is to collect a large dataset of good images, the kinds of images you would want your image generator to generate, and see if we can spot any patterns about where good images live in image space. We can do this by plotting the values of each of the one million pixels on this one-million-dimensional grid, one pixel per axis, and just observing where the different images in your dataset land.
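To make the idea of image space concrete, here's a minimal NumPy sketch. The shapes and value ranges match the 1,000 by 1,000 grayscale setup described above; everything else (the variable names, the particular Gaussian parameters) is just an illustration I've chosen, not part of any real diffusion pipeline.

```python
import numpy as np

# Image space for 1,000 x 1,000 grayscale images: each image is a point
# in a 1,000,000-dimensional hypercube whose sides run from 0 to 255.
H, W = 1000, 1000
rng = np.random.default_rng(seed=0)

# A random location in image space: every pixel drawn independently from
# a Gaussian, then clipped to the valid pixel range.
noise_image = np.clip(rng.normal(loc=128, scale=64, size=(H, W)), 0, 255)

# To a computer, this is just a vector with one million coordinates,
# exactly like any "good" image of the same size.
point = noise_image.reshape(-1)
print(point.shape)  # (1000000,)
print(point.min() >= 0 and point.max() <= 255)  # True: inside the hypercube
```

The point of the sketch is just the bookkeeping: an image and a pure-noise sample are the same kind of object, a single point in this bounded, very high-dimensional space.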
When you do this, you'll observe two things. First, you'll find that the vast majority of image space is completely empty; in other words, no good images live there. This is because there are highly specific rules that an image must follow in order to look like a good image to our eyes, for example, that nearby pixels should have highly similar values, and most images, almost all of them in fact, if you can even call them images, simply don't follow those rules. These are exactly the rules that an image generator would have to learn to exploit in order to be good at generating images. Second, an image of a banana looks very different from an image of a cat. So what you'll find is that banana images tend to cluster closely together in image space, and the same goes for cat images, but the banana cluster will be located far away from the cat cluster, at least compared to the distances between one banana image and another. Of course, this is a highly oversimplified characterization of image space, but it's the best we can do, at least for beings who don't live in one-million-dimensional space, and it's good enough for our purposes here. Now that we have some idea of where good images live, what can we say about how diffusion models work in this map of image space? Recall that we start the image generation process by drawing a random sample from a Gaussian distribution. Since it's a random sample, it'll be at some random location in image space, as likely to be at any given location as any other. And since we established that most of image space is empty, there's a very high probability that it's going to land outside of the small pockets where good images live. I know it doesn't look too hard to randomly land in one of the clusters depicted in this diagram, but it's worth emphasizing just how much this 2D diagram underestimates how empty the actual one-million-dimensional space is.
The first thing a diffusion model does is take this randomly generated image as input and return some prediction, which we're going to subtract from our initial image. This is the first iteration in the sequential process that will transform this noise sample into a good image. It turns out that this output from the diffusion model, this thing that we're going to subtract from our random sample, is a very special direction in this map of image space: namely, it is the direction that brings you to the closest cluster from wherever you're located right now. So it's this direction right here. As a minor note, by convention we train diffusion models so that it's the negative of the model output that brings you to the closest cluster; that's why, when you subtract the model output from your initial sample, you get a better image. You're moving from your random sample straight toward one of these clusters where all the good images live. Just to make things super clear, this direction that the diffusion model gives you is a vector direction. In other words, the diffusion model gives you instructions on which direction to move for every one of your one million pixels: make pixel 1 a bit brighter, make pixel 2 a bit darker, and so on, all the way to your one-millionth pixel. We'll get some intuition for how the model knows what this direction should be when we dip our toes into the training process a bit later. But for now, we're really close to understanding how diffusion models generate images. Starting from your initial location, which is a random location in image space, you query, or ask, the model for a direction and take a small step in that direction. Then you ask the model again for a new direction, this time from the new location at which you've just arrived, and you keep doing that. At some point, you'll end up at some location inside some cluster of good images. Remember this animation of how this red panda was generated? It basically depicts the path that our sample of Gaussian noise took through image space, from some random starting location into the red panda cluster, with the diffusion model as its guide.
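The query-step-repeat loop above can be sketched in a few lines. To keep it runnable, I'm using a hypothetical toy "model" in which the closest cluster is just a single fixed point, and the model output follows the sign convention mentioned above: subtracting it moves you toward the cluster. A real diffusion model is a trained neural network, not this stand-in.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Toy stand-in for a trained diffusion model. The "closest cluster" is a
# fixed point, and the output is the NEGATIVE of the direction toward it,
# so subtracting the output moves you toward the cluster.
cluster_center = np.full(16, 200.0)  # tiny 16-dimensional "image space"

def toy_model(x):
    return x - cluster_center

# Start from a random location in image space.
x = rng.normal(loc=128, scale=64, size=16)

# Iterative generation: query the model, take a small step, ask again.
step_size = 0.1
for _ in range(200):
    x = x - step_size * toy_model(x)

print(np.abs(x - cluster_center).max())  # tiny: we've arrived at the cluster
```

The important structural point is that the model is queried afresh at every step, from the sample's current location, rather than being trusted once and followed blindly.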
We're also really close to answering one of the questions we posed at the start of the video: how does a diffusion model decide what to generate? Why does it transform a particular noise sample into a cat, and a different noise sample into a house, and so on? I think this question hits at the core of how diffusion models work, because by definition, noise contains no information related to visual concepts at all. So when a diffusion model transforms a particular noise sample into an image of a cat, it's not that some notion of catness is somehow imperceptibly encoded in that noise sample and the diffusion model is picking up on it. We know this because we were the ones who set up the process of generating that noise sample, and we didn't embed any catness in the noise. In other words, it's impossible for the model to pick up on a signal that isn't there. So how does it make this decision? Well, based on what we just discussed about image space, the diffusion model generated a cat because it just so happened that the noise you sampled was closer to the cat cluster than to any other cluster in this one-million-dimensional image space, not for any profound reason, but due to pure chance. The model pointed you in the direction of the cat cluster because it was the closest to your initial sample. Now, you might notice that the path the initial noise sample takes is sometimes curved, and you might wonder why it isn't completely straight. Putting aside the fact that a diffusion model is a neural network learned from data, and so is prone to some degree of error, the answer is that the diffusion model actually brings you in the direction that most quickly increases the probability of your image in the local region where your current sample lies in image space. This is what the thumbnail was trying to depict, and it might be a different direction than the one that takes you straight toward a cluster of good images.
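The "pure chance" decision described above is easy to simulate. In this toy sketch, the two cluster centers and the dimensionality are made up for illustration; which "concept" gets generated is simply whichever cluster the random noise sample happens to land nearest to.

```python
import numpy as np

rng = np.random.default_rng(seed=2)
D = 1000  # stand-in dimensionality for the demo

# Two hypothetical cluster centers, say "cat" images and "house" images.
cat_center = rng.uniform(0, 255, size=D)
house_center = rng.uniform(0, 255, size=D)

# Draw a fresh noise sample: a random location in image space.
noise = np.clip(rng.normal(128, 64, size=D), 0, 255)

# Which cluster "wins" is just whichever happens to be closer.
dist_cat = np.linalg.norm(noise - cat_center)
dist_house = np.linalg.norm(noise - house_center)
choice = "cat" if dist_cat < dist_house else "house"
print(choice)  # decided by chance, not by anything hidden in the noise
```

Rerunning with a different seed flips the outcome with no change to the model at all, which is exactly the point: the noise carries no catness or houseness, only a random position.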
This is also why generating images via diffusion is an iterative process. You can't just trust the first direction the diffusion model gives you, because it might not actually point to a cluster of good images. You need to take small steps and keep querying the diffusion model as you move through image space, giving it a chance to update and improve its recommendation as you get closer to a cluster. Think of asking a blindfolded person to climb a hill he hasn't visited. He doesn't know where the peak is located, so he can't take you there directly. The best he can do is use his feet to search for the direction of steepest ascent locally, take a small step, then repeat the process, hoping it'll lead him to the top of the hill. In this analogy, the blindfolded person is the diffusion model, and the hill is the probability landscape, with the peak of the hill being one of the small clusters where good images live. This is what I consider to be the easiest way to understand diffusion models: they are models that approximate the gradients of the probability distribution of images, and so generating images via diffusion is just performing gradient ascent on the virtual probability distribution implied by the gradients the diffusion model learns. Now, you might wonder what it even means to increase the probability of an image, but that's getting a bit ahead of ourselves for the purposes of this video. If you're interested in exploring this direction, I'm releasing a second, more technical video, and we can dive into that there. For now, the understanding that I've painted here is a reasonably good approximation of what a diffusion model is doing.
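The blindfolded hill climber can be simulated directly. This is a toy two-dimensional probability landscape I've made up for illustration, a mixture of two Gaussian bumps standing in for two clusters of good images; the climber only ever feels the slope locally, via a numerical gradient, and still finds a peak.

```python
import numpy as np

# A toy 2D "probability landscape" with two peaks (two clusters).
peaks = np.array([[50.0, 50.0], [200.0, 200.0]])

def prob(x):
    # Unnormalized mixture of two Gaussian bumps.
    return sum(np.exp(-np.sum((x - p) ** 2) / (2 * 30.0**2)) for p in peaks)

def local_gradient(x, eps=1e-3):
    # The "blindfolded" part: feel the slope locally, one axis at a time.
    g = np.zeros_like(x)
    for i in range(len(x)):
        d = np.zeros_like(x)
        d[i] = eps
        g[i] = (prob(x + d) - prob(x - d)) / (2 * eps)
    return g

x = np.array([80.0, 90.0])  # arbitrary starting location
for _ in range(500):
    x = x + 500.0 * local_gradient(x)  # small step uphill, then re-check

print(x)  # ends up at the nearer peak, around (50, 50)
```

Swap `prob` for the true (and unknown) distribution of good images, and the learned diffusion model for `local_gradient`, and this loop is, in spirit, the image generation process described above.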
Now the question is: how do we train such a model, one that knows this magical direction from wherever we start in image space? I think you'll be surprised at how simple it is. The main insight is this: you start with an image from your training set; let's say it's one of the images in this dog cluster. Then you backtrack away from it in some random direction. What that means is you sample some noise from a Gaussian distribution, much like how you sampled an initial point in the first step of the image generation process, then simply add that noise to your image. Now, geometrically, since you're adding this noise sample to your image, you can interpret the noise sample as a direction in image space. In particular, since it brought your perfectly nice dog image to this grainy, noisy version, it brought you away from the dog cluster. So you can think of this noise sample as this yellow arrow that takes you from inside the dog cluster to some point outside. Now observe that what we have here is a supervised learning pair: you can treat the noisy dog image as the input to a network and ask it to predict this yellow direction, which is what brought you from inside the dog cluster to outside, but which can also bring you from outside back in. If you give it many different variations of this task, such as different lengths and directions of this yellow vector, you're teaching the model how to get to this good image from different locations in image space. And when you start from a different image each time, so instead of starting with this dog image you choose an image of a different dog, you're teaching the model how to reach the dog cluster from other locations in image space. And if you train it on not only dog images but images of all sorts of things, like cats, humans, cars, and houses, it will eventually learn how to bring you to some cluster of good images starting from any location in image space. So the model is learning a dense vector field: it associates each location in image space with a certain vector, or direction, and that direction is the one that is most helpful for generating good images. It's the direction that brings you to the closest cluster where good images live, from wherever your current location is.
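The pair-construction step above fits in a few lines. The "clean image" here is random data standing in for a real training photo, and the noise scale is arbitrary; the structure of the pair, noisy input and noise-as-target, is the part taken from the description above.

```python
import numpy as np

rng = np.random.default_rng(seed=3)

# Stand-in "clean training image" (in reality, a real photo from the dataset).
clean_image = rng.uniform(0, 255, size=(8, 8))

# Backtrack away from it: sample Gaussian noise and add it to the image.
noise = rng.normal(loc=0.0, scale=25.0, size=clean_image.shape)
noisy_image = clean_image + noise

# The supervised pair is generated automatically, with no human labels:
#   input  -> the noisy image (a point outside the cluster)
#   target -> the noise itself (the direction that led out of the cluster,
#             and, negated, the direction that leads back in)
training_input, training_target = noisy_image, noise

# Sanity check: subtracting the target from the input recovers the clean image.
print(np.allclose(training_input - training_target, clean_image))  # True
```

Varying the noise scale and drawing fresh noise each time gives the "different lengths and directions of the yellow vector" that teach the model the way back from many locations.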
If you're familiar with how image classifiers are trained, it's useful to draw some analogies here. Much like how image classifiers learn to recognize cats by associating different images of cats with a class label, diffusion models learn to associate different noisy images with directions in image space. There are three differences worth talking about. First, in image classification, the output is typically very small in comparison to the size of the input; a class label is a very compact output, and you only need a few bits of information to convey it. In diffusion models, the output is a vector direction in the same space as your input, so the output has to be exactly the same size as the input. Second, in image classification, it doesn't matter which cat the network sees in the input; you always ask the network to predict the same thing, the cat class label. In diffusion models, you instead ask the network to predict a different output each time, based on what direction would bring you closest to a cleaner, less noisy version of the current input image. But perhaps the most striking difference is that you don't need any human labeling to train a diffusion model. The supervised learning pairs are generated fully automatically, by adding Gaussian noise to unlabeled images. So if you know how to draw samples from a Gaussian distribution, you can train a diffusion model without any human labeling required. And this is one of the most exciting implications of diffusion models, at least for me personally. It says that if you can draw samples from a Gaussian distribution, you can get an image generator almost for free; that sampling from a Gaussian and generating images, two tasks that at first glance seem completely unrelated, might share a more fundamental connection; and that the solution to what we thought was one of the hardest computational problems in history has been sitting right under our noses the whole time. We've covered a lot of ground in this video.
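To make the training loop and the "output is the same size as the input" point concrete, here's a toy sketch in which a single linear map stands in for the neural network. Everything here (the linear model, the dimensions, the learning rate) is a made-up simplification; only the objective, predict the added noise from automatically generated pairs, mirrors the description above.

```python
import numpy as np

rng = np.random.default_rng(seed=4)
D = 64  # flattened 8x8 "images", to keep the demo tiny

# Hypothetical stand-in for the network: one linear map from noisy images
# to directions in image space (a real model would be a deep network).
W = np.zeros((D, D))
learning_rate = 5e-6

for step in range(2000):
    # Supervised pair generated automatically, with no human labels.
    clean = rng.uniform(0, 255, size=D)
    noise = rng.normal(0.0, 25.0, size=D)
    noisy = clean + noise

    # Ask the model to predict the noise; unlike a compact class label,
    # the prediction has exactly the same shape as the input.
    pred = W @ noisy
    # Gradient of the mean-squared error (1/D)*||pred - noise||^2 w.r.t. W.
    W -= learning_rate * np.outer(pred - noise, noisy) * (2 / D)

print(pred.shape == noisy.shape)  # True: the output lives in image space
```

Notice that the target changes with every input, and no one ever hand-labeled anything: the noise that was added is itself the training target.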
We talked about image space, the space in which all possible images live, and tried to gain some understanding of its structure: how most of it is empty, and how similar images form tightly knit clusters. We then developed a view of diffusion models as models that can navigate image space very well: they learn a vector field that maps every location in image space to the direction that brings you to the closest cluster of good images, or, almost equivalently, the gradients of the probability distribution of images. Viewed this way, image generation can be cast as starting from an arbitrary location in image space and simply following the vector field that the diffusion model learns, or performing gradient ascent on the probability distribution of images. Finally, we contextualized the training algorithm for diffusion models within the understanding of image space that we built, and developed a geometric intuition for how the training process gives diffusion models the capability they need. In the process, we developed a rough mechanistic understanding of how a diffusion model decides what to generate: it generates an image similar to the images belonging to the cluster that is closest to your initial location in image space. There's still a lot to be said about these models. In the next video, we'll talk about what's missing from the characterization of diffusion models that I've presented here. For example, we'll see that the diffusion process doesn't actually follow this clean path from the initial location to the closest cluster, and that you have to add a small amount of random noise after every step to generate images that look any good; we'll build some intuition for why that is. In doing so, we'll be able to view image generation as the process of sampling from a probability distribution, much like how a coin flip can be seen as a sample from a probability distribution, albeit a very simple one. We'll see how diffusion can serve as a common framework that unifies these two seemingly very different processes.
In fact, we'll use the exact same algorithm, the diffusion algorithm, to perform both of these tasks. It's one of my favorite aha moments in math and AI that I've had in a while. So if you enjoyed this video, keep your eyes peeled for the next one. Thanks for sticking around, and I'll see you there.