You're watching my computer draw something that has never existed until just now. Of course, red pandas exist, and images of them exist, but not this specific one: there's nowhere on the internet that you can find this exact combination of pixels. No, this is a brand new red panda, conjured out of thin air by a diffusion model. These are computational models that generate new images through an iterative process, starting from a sample of Gaussian noise. They do this by learning to reverse what's known as a diffusion process. This is a process where you start from a real image and gradually add more and more samples of Gaussian noise. If you repeat this for long enough, you'll eventually destroy all of the information in the original image and arrive at a pure sample of Gaussian noise. It turns out that this is a sequential process that a neural network can learn to reverse, and neural networks that successfully do so are called diffusion models. By showing it many of these sequences, examples of how to turn Gaussian noise into realistic-looking images, the diffusion model learns to do something very hard, generating brand new images, starting from something very easy, drawing a sample of Gaussian noise. At least, this is how 99% of people will be introduced to diffusion models. But when it's presented this way, I find it very surprising that this works. For one thing, how do diffusion models decide that one noise sample should transform into a fish, but a different noise sample into something else entirely? When diffusion models are presented this way, the decision seems totally arbitrary. This has always left these models under a shroud of mystery, one of the things I'd accepted I would never understand beyond a surface level. But there's another way to view diffusion models that most people never learn, one that I argue is more effective at uncovering the secrets behind why these models are so successful at one of the hardest tasks you could ever ask a computer to do.
To understand this new view of diffusion models, the first concept we need to be comfortable with is what we'll call image space: the space of all possible images, let's say of size 1,000 by 1,000 pixels. This is a one-million-dimensional space, with each axis representing the value of one of the pixels in an image. Of course, what you're seeing now is only a two-dimensional grid, which can only fully represent two pixels, but we'll pretend this represents the entire one-million-dimensional space. We also know that pixels can take a value between 0 and 255, representing the intensity of that pixel. So all possible images live in a confined, box-shaped region, a one-million-dimensional cube, or hypercube to be precise, where each side has length 256 units. The part of one-million-dimensional space that we care about is bounded between the values 0 and 255, instead of the unbounded axes that we usually see in math. Now, each location in image space is a different possible image. For example, this cat playing a piano might be located at this spot, and this image, which is pure nonsense (I sampled the value of each pixel randomly from a Gaussian distribution), might be located here. It might be clear to you that some images in image space look like good images and others don't, but to a computer, all of these images look the same: each of them is just a one-million-dimensional vector, or a matrix with 1,000 rows and 1,000 columns. A good image generator, then, has to understand what makes an image a good one. It has to differentiate between the bad images and the good images, and somehow generate images that are, in some sense, closer to the good images. The first step to solving this problem is to collect a large dataset of good images, the kinds of images you would want your image generator to generate, and see if we can spot any patterns about where good images live in image space. We can do this by plotting the values of each of the one million pixels on this one-million-dimensional grid, one pixel per axis, and just observing where the different images in your dataset land.
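To make the idea of image space concrete, here's a minimal NumPy sketch. The shapes and value ranges match the 1,000 by 1,000 grayscale setup described above; everything else (the variable names, the particular Gaussian parameters) is just an illustration I've chosen, not part of any real diffusion pipeline.

```python
import numpy as np

# Image space for 1,000 x 1,000 grayscale images: each image is a point
# in a 1,000,000-dimensional hypercube whose sides run from 0 to 255.
H, W = 1000, 1000
rng = np.random.default_rng(seed=0)

# A random location in image space: every pixel drawn independently from
# a Gaussian, then clipped to the valid pixel range.
noise_image = np.clip(rng.normal(loc=128, scale=64, size=(H, W)), 0, 255)

# To a computer, this is just a vector with one million coordinates,
# exactly like any "good" image of the same size.
point = noise_image.reshape(-1)
print(point.shape)  # (1000000,)
print(point.min() >= 0 and point.max() <= 255)  # True: inside the hypercube
```

The point of the sketch is just the bookkeeping: an image and a pure-noise sample are the same kind of object, a single point in this bounded, very high-dimensional space.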
When you do this, you'll observe two things. First, you'll find that the vast majority of image space is completely empty; in other words, no good images live there. This is because there are highly specific rules that an image must follow in order to look like a good image to our eyes, for example, that nearby pixels should have highly similar values, and most images, almost all of them in fact, if you can even call them images, simply don't follow those rules. These are exactly the rules that an image generator would have to learn to exploit in order to be good at generating images. Second, an image of a banana looks very different from an image of a cat. So what you'll find is that banana images tend to cluster closely together in image space, and the same goes for cat images, but the banana cluster will be located far away from the cat cluster, at least compared to the distances between one banana image and another. Of course, this is a highly oversimplified characterization of image space, but it's the best we can do, at least for beings who don't live in one-million-dimensional space, and it's good enough for our purposes here. Now that we have some idea of where good images live, what can we say about how diffusion models work in this map of image space? Recall that we start the image generation process by drawing a random sample from a Gaussian distribution. Since it's a random sample, it'll be at some random location in image space, as likely to be at any given location as any other. And since we established that most of image space is empty, there's a very high probability that it's going to land outside of the small pockets where good images live. I know it doesn't look too hard to randomly land in one of the clusters depicted in this diagram, but it's worth emphasizing just how much this 2D diagram underestimates how empty the actual one-million-dimensional space is.
The first thing a diffusion model does is take this randomly generated image as input and return some prediction, which we're going to subtract from our initial image. This is the first iteration in the sequential process that will transform this noise sample into a good image. It turns out that this output from the diffusion model, this thing that we're going to subtract from our random sample, is a very special direction in this map of image space: namely, it is the direction that brings you to the closest cluster from wherever you're located right now. So it's this direction right here. As a minor note, by convention we train diffusion models so that it's the negative of the model output that brings you to the closest cluster; that's why, when you subtract the model output from your initial sample, you get a better image. You're moving from your random sample straight toward one of these clusters where all the good images live. Just to make things super clear, this direction that the diffusion model gives you is a vector direction. In other words, the diffusion model gives you instructions on which direction to move for every one of your one million pixels: make pixel 1 a bit brighter, make pixel 2 a bit darker, and so on, all the way to your one-millionth pixel. We'll get some intuition for how the model knows what this direction should be when we dip our toes into the training process a bit later. But for now, we're really close to understanding how diffusion models generate images. Starting from your initial location, which is a random location in image space, you query, or ask, the model for a direction and take a small step in that direction. Then you ask the model again for a new direction, this time from the new location at which you've just arrived, and you keep doing that. At some point, you'll end up at some location inside some cluster of good images. Remember this animation of how this red panda was generated? It basically depicts the path that our sample of Gaussian noise took through image space, from some random starting location into the red panda cluster, with the diffusion model as its guide.
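The query-step-repeat loop above can be sketched in a few lines. To keep it runnable, I'm using a hypothetical toy "model" in which the closest cluster is just a single fixed point, and the model output follows the sign convention mentioned above: subtracting it moves you toward the cluster. A real diffusion model is a trained neural network, not this stand-in.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Toy stand-in for a trained diffusion model. The "closest cluster" is a
# fixed point, and the output is the NEGATIVE of the direction toward it,
# so subtracting the output moves you toward the cluster.
cluster_center = np.full(16, 200.0)  # tiny 16-dimensional "image space"

def toy_model(x):
    return x - cluster_center

# Start from a random location in image space.
x = rng.normal(loc=128, scale=64, size=16)

# Iterative generation: query the model, take a small step, ask again.
step_size = 0.1
for _ in range(200):
    x = x - step_size * toy_model(x)

print(np.abs(x - cluster_center).max())  # tiny: we've arrived at the cluster
```

The important structural point is that the model is queried afresh at every step, from the sample's current location, rather than being trusted once and followed blindly.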
We're also really close to answering one of the questions we posed at the start of the video: how does a diffusion model decide what to generate? Why does it transform a particular noise sample into a cat, and a different noise sample into a house, and so on? I think this question hits at the core of how diffusion models work, because by definition, noise contains no information related to visual concepts at all. So when a diffusion model transforms a particular noise sample into an image of a cat, it's not that some notion of catness is somehow imperceptibly encoded in that noise sample and the diffusion model is picking up on it. We know this because we were the ones who set up the process of generating that noise sample, and we didn't embed any catness in the noise. In other words, it's impossible for the model to pick up on a signal that isn't there. So how does it make this decision? Well, based on what we just discussed about image space, the diffusion model generated a cat because it just so happened that the noise you sampled was closer to the cat cluster than to any other cluster in this one-million-dimensional image space, not for any profound reason, but due to pure chance. The model pointed you in the direction of the cat cluster because it was the closest to your initial sample. Now, you might notice that the path the initial noise sample takes is sometimes curved, and you might wonder why it isn't completely straight. Putting aside the fact that a diffusion model is a neural network learned from data, and so is prone to some degree of error, the answer is that the diffusion model actually brings you in the direction that most quickly increases the probability of your image in the local region where your current sample lies in image space. This is what the thumbnail was trying to depict, and it might be a different direction than the one that takes you straight toward a cluster of good images.
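The "pure chance" decision described above is easy to simulate. In this toy sketch, the two cluster centers and the dimensionality are made up for illustration; which "concept" gets generated is simply whichever cluster the random noise sample happens to land nearest to.

```python
import numpy as np

rng = np.random.default_rng(seed=2)
D = 1000  # stand-in dimensionality for the demo

# Two hypothetical cluster centers, say "cat" images and "house" images.
cat_center = rng.uniform(0, 255, size=D)
house_center = rng.uniform(0, 255, size=D)

# Draw a fresh noise sample: a random location in image space.
noise = np.clip(rng.normal(128, 64, size=D), 0, 255)

# Which cluster "wins" is just whichever happens to be closer.
dist_cat = np.linalg.norm(noise - cat_center)
dist_house = np.linalg.norm(noise - house_center)
choice = "cat" if dist_cat < dist_house else "house"
print(choice)  # decided by chance, not by anything hidden in the noise
```

Rerunning with a different seed flips the outcome with no change to the model at all, which is exactly the point: the noise carries no catness or houseness, only a random position.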
This is also why generating images via diffusion is an iterative process. You can't just trust the first direction the diffusion model gives you, because it might not actually point to a cluster of good images. You need to take small steps and keep querying the diffusion model as you move through image space, giving it a chance to update and improve its recommendation as you get closer to a cluster. Think of asking a blindfolded person to climb a hill he hasn't visited. He doesn't know where the peak is located, so he can't take you there directly. The best he can do is use his feet to search for the direction of steepest ascent locally, take a small step, then repeat the process, hoping it'll lead him to the top of the hill. In this analogy, the blindfolded person is the diffusion model, and the hill is the probability landscape, with the peak of the hill being one of the small clusters where good images live. This is what I consider to be the easiest way to understand diffusion models: they are models that approximate the gradients of the probability distribution of images, and so generating images via diffusion is just performing gradient ascent on the virtual probability distribution implied by the gradients the diffusion model learns. Now, you might wonder what it even means to increase the probability of an image, but that's getting a bit ahead of ourselves for the purposes of this video. If you're interested in exploring this direction, I'm releasing a second, more technical video, and we can dive into that there. For now, the understanding that I've painted here is a reasonably good approximation of what a diffusion model is doing.
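The blindfolded hill climber can be simulated directly. This is a toy two-dimensional probability landscape I've made up for illustration, a mixture of two Gaussian bumps standing in for two clusters of good images; the climber only ever feels the slope locally, via a numerical gradient, and still finds a peak.

```python
import numpy as np

# A toy 2D "probability landscape" with two peaks (two clusters).
peaks = np.array([[50.0, 50.0], [200.0, 200.0]])

def prob(x):
    # Unnormalized mixture of two Gaussian bumps.
    return sum(np.exp(-np.sum((x - p) ** 2) / (2 * 30.0**2)) for p in peaks)

def local_gradient(x, eps=1e-3):
    # The "blindfolded" part: feel the slope locally, one axis at a time.
    g = np.zeros_like(x)
    for i in range(len(x)):
        d = np.zeros_like(x)
        d[i] = eps
        g[i] = (prob(x + d) - prob(x - d)) / (2 * eps)
    return g

x = np.array([80.0, 90.0])  # arbitrary starting location
for _ in range(500):
    x = x + 500.0 * local_gradient(x)  # small step uphill, then re-check

print(x)  # ends up at the nearer peak, around (50, 50)
```

Swap `prob` for the true (and unknown) distribution of good images, and the learned diffusion model for `local_gradient`, and this loop is, in spirit, the image generation process described above.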
Now the question is: how do we train such a model, one that knows this magical direction from wherever we start in image space? I think you'll be surprised at how simple it is. The main insight is this: you start with an image from your training set; let's say it's one of the images in this dog cluster. Then you backtrack away from it in some random direction. What that means is you sample some noise from a Gaussian distribution, much like how you sampled an initial point in the first step of the image generation process, then simply add that noise to your image. Now, geometrically, since you're adding this noise sample to your image, you can interpret the noise sample as a direction in image space. In particular, since it brought your perfectly nice dog image to this grainy, noisy version, it brought you away from the dog cluster. So you can think of this noise sample as this yellow arrow that takes you from inside the dog cluster to some point outside. Now observe that what we have here is a supervised learning pair: you can treat the noisy dog image as the input to a network and ask it to predict this yellow direction, which is what brought you from inside the dog cluster to outside, but which can also bring you from outside back in. If you give it many different variations of this task, such as different lengths and directions of this yellow vector, you're teaching the model how to get to this good image from different locations in image space. And when you start from a different image each time, so instead of starting with this dog image you choose an image of a different dog, you're teaching the model how to reach the dog cluster from other locations in image space. And if you train it on not only dog images but images of all sorts of things, like cats, humans, cars, and houses, it will eventually learn how to bring you to some cluster of good images starting from any location in image space. So the model is learning a dense vector field: it associates each location in image space with a certain vector, or direction, and that direction is the one that is most helpful for generating good images. It's the direction that brings you to the closest cluster where good images live, from wherever your current location is.
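The pair-construction step above fits in a few lines. The "clean image" here is random data standing in for a real training photo, and the noise scale is arbitrary; the structure of the pair, noisy input and noise-as-target, is the part taken from the description above.

```python
import numpy as np

rng = np.random.default_rng(seed=3)

# Stand-in "clean training image" (in reality, a real photo from the dataset).
clean_image = rng.uniform(0, 255, size=(8, 8))

# Backtrack away from it: sample Gaussian noise and add it to the image.
noise = rng.normal(loc=0.0, scale=25.0, size=clean_image.shape)
noisy_image = clean_image + noise

# The supervised pair is generated automatically, with no human labels:
#   input  -> the noisy image (a point outside the cluster)
#   target -> the noise itself (the direction that led out of the cluster,
#             and, negated, the direction that leads back in)
training_input, training_target = noisy_image, noise

# Sanity check: subtracting the target from the input recovers the clean image.
print(np.allclose(training_input - training_target, clean_image))  # True
```

Varying the noise scale and drawing fresh noise each time gives the "different lengths and directions of the yellow vector" that teach the model the way back from many locations.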
If you're familiar with how image classifiers are trained, it's useful to draw some analogies here. Much like how image classifiers learn to recognize cats by associating different images of cats with a class label, diffusion models learn to associate different noisy images with directions in image space. There are three differences worth talking about. First, in image classification, the output is typically very small in comparison to the size of the input; a class label is a very compact output, and you only need a few bits of information to convey it. In diffusion models, the output is a vector direction in the same space as your input, so the output has to be exactly the same size as the input. Second, in image classification, it doesn't matter which cat the network sees in the input; you always ask the network to predict the same thing, the cat class label. In diffusion models, you instead ask the network to predict a different output each time, based on what direction would bring you closest to a cleaner, less noisy version of the current input image. But perhaps the most striking difference is that you don't need any human labeling to train a diffusion model. The supervised learning pairs are generated fully automatically, by adding Gaussian noise to unlabeled images. So if you know how to draw samples from a Gaussian distribution, you can train a diffusion model without any human labeling required. And this is one of the most exciting implications of diffusion models, at least for me personally. It says that if you can draw samples from a Gaussian distribution, you can get an image generator almost for free; that sampling from a Gaussian and generating images, two tasks that at first glance seem completely unrelated, might share a more fundamental connection; and that the solution to what we thought was one of the hardest computational problems in history has been sitting right under our noses the whole time. We've covered a lot of ground in this video.
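To make the training loop and the "output is the same size as the input" point concrete, here's a toy sketch in which a single linear map stands in for the neural network. Everything here (the linear model, the dimensions, the learning rate) is a made-up simplification; only the objective, predict the added noise from automatically generated pairs, mirrors the description above.

```python
import numpy as np

rng = np.random.default_rng(seed=4)
D = 64  # flattened 8x8 "images", to keep the demo tiny

# Hypothetical stand-in for the network: one linear map from noisy images
# to directions in image space (a real model would be a deep network).
W = np.zeros((D, D))
learning_rate = 5e-6

for step in range(2000):
    # Supervised pair generated automatically, with no human labels.
    clean = rng.uniform(0, 255, size=D)
    noise = rng.normal(0.0, 25.0, size=D)
    noisy = clean + noise

    # Ask the model to predict the noise; unlike a compact class label,
    # the prediction has exactly the same shape as the input.
    pred = W @ noisy
    # Gradient of the mean-squared error (1/D)*||pred - noise||^2 w.r.t. W.
    W -= learning_rate * np.outer(pred - noise, noisy) * (2 / D)

print(pred.shape == noisy.shape)  # True: the output lives in image space
```

Notice that the target changes with every input, and no one ever hand-labeled anything: the noise that was added is itself the training target.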
We talked about image space, the space in which all possible images live, and tried to gain some understanding of its structure: how most of it is empty, and how similar images form tightly knit clusters. We then developed a view of diffusion models as models that can navigate image space very well: they learn a vector field that maps every location in image space to the direction that brings you to the closest cluster of good images, or, almost equivalently, the gradients of the probability distribution of images. Viewed this way, image generation can be cast as starting from an arbitrary location in image space and simply following the vector field that the diffusion model learns, or performing gradient ascent on the probability distribution of images. Finally, we contextualized the training algorithm for diffusion models within the understanding of image space that we built, and developed a geometric intuition for how the training process gives diffusion models the capability they need. In the process, we developed a rough mechanistic understanding of how a diffusion model decides what to generate: it generates an image similar to the images belonging to the cluster that is closest to your initial location in image space. There's still a lot to be said about these models. In the next video, we'll talk about what's missing from the characterization of diffusion models that I've presented here. For example, we'll see that the diffusion process doesn't actually follow this clean path from the initial location to the closest cluster, and that you have to add a small amount of random noise after every step to generate images that look any good; we'll build some intuition for why that is. In doing so, we'll be able to view image generation as the process of sampling from a probability distribution, much like how a coin flip can be seen as a sample from a probability distribution, albeit a very simple one. We'll see how diffusion can serve as a common framework that unifies these two seemingly very different processes.
In fact, we'll use the exact same algorithm, the diffusion algorithm, to perform both of these tasks. It's one of my favorite aha moments in math and AI that I've had in a while. So if you enjoyed this video, keep your eyes peeled for the next one. Thanks for sticking around, and I'll see you there.