>> Great. Thank you for the introduction. >> Cool. Okay. So, when I first received the invitation to talk here, I thought to myself, okay, I'm going to travel from the UK to Berlin, and it's going to be during winter time. I know because it's the winter speaker series, right? So I thought I should come near Christmas time, and if you have any recommendations for a Weihnachtsmarkt, find me after the talk. And right now I'm excited to be here speaking about video understanding. I'm Dara, and I'm a research engineer at Google DeepMind. In the past three years I've focused on video understanding, with some experience in related areas. Today I will deep dive into one of the works I did at Google DeepMind with my collaborators. But first, let's discuss what video understanding entails. Vision is one of our five senses, and it's critically important as we go about our lives. Let's take this kid as an example. Imagine for a moment that he doesn't know how to ride a bike. He sees this girl and thinks, I want to ride a bike as well so that I can move around fast. In that case, in order to learn how to ride a bike, he would observe this girl and understand how different parts of the bike move together to make that possible.
And this is a good example for me, because I was preparing this slide deck last week, working from home, and I found myself in this perfect situation where I needed to rely on my vision. So in this example, I brewed myself a cup of coffee. I went to my desk. I put it aside, turned my head, took a book about Argentina, which is an amazing place, by the way; I visited this year. Then I remembered where my cup was, and I brought it close enough so that I could drink it. So coffee and an amazing book: this is how my perfect day starts. And this is how my worst day starts. I don't know about you, but I hate driving in traffic. This is a video from Istanbul. I lived in Istanbul before; it's a big chaos. As you can see in this video, people sometimes walk on the highway, cars suddenly change lanes, and so on. So surviving in Istanbul traffic requires me to first calm down, and then to understand where everything is moving, like tracking the cars around me and also anticipating how near they are to me. So overall, while we humans are pretty good at vision, there are some cases where we, or at least I, could use some extra help. And that brings us to the topic of today. As humans, we understand how things around us move and where they are. That means we can do 4D understanding: 3D representing space plus 1D representing time. So how do we build systems that do the same? If we look into the literature, or at the applications we see in the world right now, it's usually people training expert models that are pretty good at one task, and then for another task, another model that solves that. That makes sense, because it's easier to train a single model for a single task. But today I want to discuss a different path.
What if we aim to train a model that solves all these tasks, and then we just scale it up? Does that work, and how can we make it work? Today I will share the outcomes of a research work that tackles this question. This is a work I did along with my amazing collaborators listed here at Google DeepMind, and I will share the details of the work shortly. But in summary, we show that we can improve the model's performance as we scale up the model size. So let's zoom out for a second and look at how we can approach this problem. Let's take this video as an example. This is a video of a dancer called Donel, and he is dancing waacking. If you've never heard of waacking, it's a dance style that was born in the 70s in the underground club scene of Los Angeles. And as you can see, he also has some signature moves, like spinning multiple times, etc. So when we look at this video, what are some of the things that we want to understand better? First we might ask, what is he doing? He's dancing, not walking. Any other things? I can take some guesses from around the room. >> Sorry. >> The setting. Where is he? What is he wearing? How many spins is he doing? Okay. So, let's take this first question as an example. What is he doing? To answer this question, we somehow want to use this video and get to this output, right? So let's discuss what kind of things we can do here. On a very high level, we need some kind of vision encoder that takes this input and extracts the information. And as a vision encoder, there are some building blocks. If we look into what we can use, CNNs are pretty standard approaches for extracting features, and they're pretty good at it. If we look at more modern approaches, then transformer encoders would be good solutions as well. Okay, so now we have an idea about the building block of this encoder. But let's see how we actually get from here to there.
And for that, let's first remember that a video is actually a sequence of frames stacked along time, right? So maybe we can simplify this problem by approaching the video understanding problem as an image understanding problem. In that case we would have an image encoder, and we would pass the frames of the video one by one through this image encoder; it would actually be a single image encoder with shared weights. So that's a solution, and this has been explored in the literature as well. For example, there are discriminative approaches like DINO, which learn by distinguishing this image from that image. So DINO-like models are really good at answering semantic questions, like what is here, what is in this scene. And then there are generative approaches like masked autoencoding, abbreviated as MAE; these are reconstructive approaches. They learn by masking out parts of the input and forcing the model to reconstruct the masked-out parts. I will talk more about them in the upcoming parts. But let's remember that we are on a mission to solve all the 4D tasks, right? We started by asking what is he doing, but we want to answer multiple questions. There are a couple of ways to approach this. First of all, we have the features from the image encoder. What we can do is train a readout module that solves a particular task. Readout modules are usually simple, lightweight modules; they could be an MLP or an attention layer, and they give you the relevant output. The advantage: it's lightweight. The disadvantage: if we look at the literature, there are usually different readouts for each task, so it's not very generalizable. The second approach, and the one dominating the headlines right now, is of course using a vision language model, right?
And in that case the advantage is that it's generalizable, because whatever the question is, let's say what does the dancer wear, or anything else, the text encoder gives us the text features, we have the features from the image, and these are somehow fused. There are many ways to do it, but I won't get into the details, and then you get the relevant answer. So that's a pretty good approach. However, there is a catch. Training such models requires a massive amount of data that usually comes from the web, because the web is a great source in that sense. What happens then is that they are trained on image-text pairs, and the text usually describes what's in the image, but only at a very high level. For example, for this image, the caption could be "guy throwing his hand up," but we are unlikely to see "guy lifting his arm by 45 degrees," right? We are missing those little cues. That's one. There is one more disadvantage, actually. I don't know if it's obvious from the slides, but I'd like to get a guess here. >> The frames in the sequence, the movement. >> Yeah, exactly. The point is that we are taking the static frames one by one, and we kind of lose time, right? That's indeed the disadvantage: as we chop the video into frames and feed them into the encoder, we lose time. But to understand motion, we need to have a sense of how things change over time. So this takes us to the second approach, and that is to use a video encoder, because video encoders just take the videos as they are, and they respect the spatiotemporal nature of the videos. So that would solve our issue, and this has been explored in the literature. We know there are 3D convolutions, or other works like ViViT, that could solve our problem.
But the disadvantage is that it's hard to train and scale these models. Another disadvantage is that, while there is a lot of image data available, video data is not as abundant, which is also part of why it is hard to train. So this is where we are. We have image-based solutions. We have video-based solutions. And to solve 4D tasks, image-based solutions are nice. They're easy to scale, they're great at semantics, but they have this nature where they process the frames independently. And video models solve that problem, but they're more expensive to train and scale. So this brings us to the specific gap we wanted to address in our work. The first thing is the evaluation gap, because if we look into the existing literature on video models, we see that they are usually evaluated on how well they describe what's happening in the scene, but not how things are happening in the scene. So the first thing we did was decide to look into tasks that actually require understanding how things work. The second thing is, as I mentioned, video models have been harder to train and scale. So in this work we scaled our video backbone to 22 billion parameters, which is, as far as I know, still the largest video backbone today, and we show that it consistently improves the performance. Okay, now we come to the methodology. So we defined our goal, right? We want to scale up video models and solve the 4D tasks. In the methodology section, I would like to highlight that the keyword here is scalability. We always had this in mind while we were making any decisions. And as we talk about scalability, it is important to talk not just about the number of model parameters but also about the scale of the data. Because if we scale up the model but we don't scale our data, then we are still likely not to end up with a good solution.
So we need to go with methods that allow us to leverage as much data as possible, and that means not going for supervised learning; we go with self-supervised learning. It has been shown that self-supervised learning is scalable, and within self-supervised learning, masked autoencoding has been shown to be a scalable method in many earlier vision works; the MAE paper itself, and the VideoMAE paper, have shown that it learns useful features. On a very high level, here is what happens in a masked autoencoder. Okay, here's my cursor. This is our input image; in this case it's from the original MAE paper, so they use images, while we're going to use videos. This is the input image, and these are the patches. Some patches are masked, usually around 75% of the image, so more than half the image is masked. Then the encoder gets the features for the unmasked patches, and the decoder is forced to reconstruct all of them. So on a high level this is how MAE works, and here is how we make MAE work at scale; we call our method simple MAE, or 4DS. Let's start from the top left. This is our video, right? That would be width, height, and time. And, like the MAE work itself, we mask the input. In the original MAE work they do 75% masking, in the follow-up VideoMAE they do 90% masking, and we do 95% masking, reducing as much as we can, because again we always keep scalability in mind. Then we have the features, which we pass through a self-attention layer, and then the mask tokens are concatenated to help us get the reconstructed video. So what's happening in between here? Here we have a linear decoder. So this is the decoder part, again a decision that came from scalability. If you look at the literature, we usually see a transformer decoder, and a transformer decoder is higher quality.
However, it also comes with more parameters. And since the keyword is scalability, we thought, okay, let's go with a linear decoder and see how it works. Also, as a side note, as much as we train the model on reconstruction, at the end of the day, for downstream evaluation, what we care about is just the video features themselves. So during inference, while we're evaluating the model, we don't use this linear decoder at all. That's why not having a heavy decoder is perhaps also fine: we're interested in those features, and maybe the encoder has already learned useful features. We'll see. So this is the pre-training setup. We use a standard mean squared error loss on RGB pixel values. Again, no labels or anything. We trained on 170 million short videos, and we sub-clipped them, so we ended up with on the order of a few billion clips. The input dimensions and resolution are listed here. And there are a couple of things that are important to highlight here. I'm coming from industry, and I'm speaking at a university right now about scalable models, so I believe it's important to talk about the resources and everything, right? We trained this model on 256 TPU chips, and we also applied some other tricks: we converted the model weights to bfloat16, except for the loss and softmax computations, which stayed in float32. Besides that, data parallelism is an obvious approach for training models at scale; however, in our case that wasn't sufficient, so we applied model sharding and optimizer state sharding as well, and that's how we got to billions of parameters in the end. Okay. So now we know the methodology, and now let's talk about what kind of evaluation tasks we looked into, because, as I mentioned, the ones in the literature are quite focused on semantics, but we want to get to 4D understanding.
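Before getting to the evaluation tasks, the pre-training recipe just described can be sketched in a few lines. This is a toy NumPy illustration under stated assumptions (random 95% patch masking, a single linear layer as the decoder, an MSE loss on pixels); the shapes and the random stand-in encoder are invented for illustration and are not the actual 4DS architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy video as spacetime patches. For scale: 16 frames of 224x224 with
# 2x16x16 patches would give 8*14*14 = 1568 patches; we use a small toy count.
num_patches, patch_dim, feat_dim = 200, 32, 64
patches = rng.normal(size=(num_patches, patch_dim))

# 95% masking: the encoder only ever sees 5% of the patches.
mask_ratio = 0.95
num_keep = int(num_patches * (1 - mask_ratio))          # 10 visible patches
perm = rng.permutation(num_patches)
keep_idx, mask_idx = perm[:num_keep], perm[num_keep:]

# Stand-in "encoder": a single random projection of the visible patches.
W_enc = rng.normal(size=(patch_dim, feat_dim)) * 0.1
visible_feats = patches[keep_idx] @ W_enc               # (10, 64)

# A shared mask token fills in the masked positions before decoding.
mask_token = np.zeros(feat_dim)
full_feats = np.empty((num_patches, feat_dim))
full_feats[keep_idx] = visible_feats
full_feats[mask_idx] = mask_token

# Linear decoder: one matrix mapping features back to raw patch pixels.
W_dec = rng.normal(size=(feat_dim, patch_dim)) * 0.1
recon = full_feats @ W_dec                              # (200, 32)

# MSE reconstruction loss on the pixels. At evaluation time the decoder is
# dropped entirely and only the encoder features are used.
loss = np.mean((recon - patches) ** 2)
print(num_keep, loss)
```

The point of the 95% ratio is visible in the shapes: the expensive encoder runs on 10 patches instead of 200, which is a large part of what makes the recipe cheap enough to scale.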
So this part I'd like to do together, because I think it's fun, and you've already seen some of these examples. In this one, I was mentioning that I need to track everything around me and anticipate how far things are from me. Any guesses on which computer vision tasks this corresponds to? Any guess? Tracking everything around, anticipating how far things are. >> Okay. >> Okay, I'm getting some guesses. Can I hear that again? >> Object tracking. >> Yes. >> Optical flow, depth estimation. >> They all make sense. In this case, I had depth estimation and object tracking in mind. And for assessing those, we use the following datasets. ScanNet is an indoor RGB-D dataset; it contains videos of indoor scenes, and this is what we use for evaluating depth estimation. For object tracking, we use the Waymo Open dataset. This is a dataset that was collected from Waymo cars in urban and suburban areas, and it has lots of different annotations; in this case we use the 2D bounding box annotations. Okay, so in this one: I left my cup, turned my head, remembered where it was, and brought it in to drink my coffee. >> Sorry. >> That makes sense, but not what I had in mind. There are multiple right answers, by the way, so please. >> Object... >> Makes sense. >> Exactly. That was what I had in mind: camera pose estimation, because in this case it's egocentric understanding, from the first-person view; it's the estimation of where the camera is. And for that, we look into RealEstate10K. RE10K is a dataset collected from YouTube; it contains videos of real estate properties, and it also has camera pose annotations. Okay, this one's a bit tricky. This is understanding how different parts of the bike move. But this time it's not object tracking, right? It's a bit more than that. >> Segmentation? >> Segmentation makes sense. Yes, that makes sense.
There are lots of correct answers; I'm trying to get to the mic, though. >> ... >> Makes a lot of sense. Okay. Yeah, I thought this would be a bit tricky, but I had point tracking in mind. Point tracking is the problem of tracking a point through every other frame of the video, so it's a more fine-grained problem than object tracking. For that we use the Perception Test benchmark. It's a multimodal benchmark with lots of real-world videos and lots of different annotations and tasks, and in this case we look into the point tracking annotations. Okay, I think this one is easy. What is he doing? Yes, I heard action recognition somewhere, thank you. So that would be action recognition. For that we looked into two benchmarks: one is Something-Something v2, and the other is Kinetics. Kinetics is a collection of YouTube videos that have been annotated with what's happening in the video. And Something-Something v2 is a benchmark where people, say, put something into something; there are lots of object interactions. We look into both of those results; however, since Something-Something v2 really has those temporal dynamics happening, we look into the SSv2 results a bit more closely. Okay. So these are the tasks, and now let's get into our evaluation protocol. This was one of the approaches, you might remember: we have the video encoder, and then we have a readout module. I mentioned this is usually a lightweight readout module, and it gives us the result, right? When we look into the literature, there are several different ways of doing this. The readout could be very simple: it could be a dense layer, or it could be an attention layer, or it could be a DPT layer,
that is, a dense prediction transformer. The DPT one especially is used mostly for dense tasks, and by a dense task I mean a task where you need to predict a value for each of the pixels; depth estimation is usually considered one of those tasks. In our case, we have lots of different evaluations. What are we going to do? Are we going to try all of them, or combinations of them? How is this going to work? What we decided is to go with the attention layer. It's probably worth highlighting at this point what the queries and keys correspond to. We used a cross-attention layer, and the queries are learnable, except in the tracking cases, where the queries are the object or the point that we're tracking. Another thing I'd like to highlight here, on this figure: we have the video encoder and the readout, right? As we're training the readout, what are we doing with the video encoder? There are several options. We can freeze the video encoder, which means we don't update it; we just train the readout. That's cheaper to do. Amazing. Or we can fine-tune end to end. We did both, and I will show the results for that shortly. So this is, I would say, the main message of the paper, because our main question was: can we solve the 4D tasks better as we increase the model size? And the answer is yes. Here we see the results for five tasks: camera pose estimation, point tracking, depth estimation, object tracking, and action classification. The x-axis is the model size in log scale, and the y-axis is the performance. As you can see, the error-based metrics, like depth estimation or camera pose estimation, have a different trend than the other ones, but the important thing here is that we observe the performance getting better as we scale up.
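To make the attentive readout concrete: a single cross-attention layer with learnable queries can be sketched as below (toy NumPy; all dimensions and the linear head are invented, and the real readout details may differ):

```python
import numpy as np

rng = np.random.default_rng(0)
num_tokens, d = 50, 16            # video feature tokens from the encoder
feats = rng.normal(size=(num_tokens, d))

# Learnable queries. For tracking tasks the query would instead encode the
# object box or the point being tracked.
num_queries = 4
Q = rng.normal(size=(num_queries, d))

# Standard scaled dot-product cross-attention: the queries attend to the
# encoder features, which act as both keys and values here.
scores = Q @ feats.T / np.sqrt(d)                     # (4, 50)
attn = np.exp(scores - scores.max(axis=1, keepdims=True))
attn /= attn.sum(axis=1, keepdims=True)               # softmax over tokens
out = attn @ feats                                    # (4, 16)

# A final linear head maps the attended features to the task output,
# e.g. per-query coordinates or class logits.
W_head = rng.normal(size=(d, 2))
pred = out @ W_head                                   # (4, 2)
print(pred.shape)
```

The appeal of this design is that the same lightweight module shape covers every task: only the number of queries, what the queries encode, and the output head change.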
And there are a few other things to highlight here as well. The improvement is almost always monotonic, although I see some exceptions here. Another thing: you may have noticed there are two lines in each of those plots. There is the frozen one and the fine-tuned one, which is what I just explained, what we do with the video encoder itself. Fine-tuning usually helps, especially with depth estimation and action classification, but the results are a bit mixed: for point tracking, camera pose, and object tracking, frozen is actually doing better. Okay, so this was how the 4DS model does as it scales, but it's also important to compare how it does against previous work. I'm going to show lots of results here, but first let's take a look at what we're looking at. Here on this table I have the tasks, and here we have the models. In the first section we have image models, here are the video baselines, and on the bottom we have the 4DS models, whose results we've just seen. Now we're going to compare against the previous models. There's also an important detail here: some of the models have included language in their pre-training. In our architecture, you may have noticed that I never talked about language; this is pure pixel supervision. But some previous models did have it, and it's worth highlighting what that means in the end. Okay. So, first results: image backbones. Here we see the results for image backbones, as well as some VideoPrism results, because all of those results are relative, right? I need to show both sides to make a comparison.
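As a side sketch of the frozen-versus-fine-tuned distinction: in the frozen regime only the readout weights receive gradient updates, while end-to-end fine-tuning also updates the encoder. A toy NumPy least-squares version (all shapes and the tiny linear "encoder" are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))                 # toy inputs (e.g. videos)
y = rng.normal(size=(100, 1))                 # toy task targets
W_enc = rng.normal(size=(8, 4)) * 0.3         # stand-in "encoder"
W_read = rng.normal(size=(4, 1)) * 0.3        # stand-in "readout"

def loss(W_enc, W_read):
    return np.mean((X @ W_enc @ W_read - y) ** 2)

lr, frozen = 0.01, True                       # flip `frozen` to fine-tune
start = loss(W_enc, W_read)
for _ in range(200):
    err = X @ W_enc @ W_read - y              # (100, 1)
    grad_read = (X @ W_enc).T @ err * 2 / len(X)
    W_read -= lr * grad_read                  # the readout always trains
    if not frozen:                            # fine-tuning also updates encoder
        grad_enc = X.T @ err @ W_read.T * 2 / len(X)
        W_enc -= lr * grad_enc
end = loss(W_enc, W_read)
print(start, end)
```

The frozen regime is cheaper precisely because the encoder's features (and gradients) never need to be recomputed per task; the trade-off shown in the plots is whether that restriction costs accuracy.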
So what we see here is that for non-semantic tasks, like RealEstate10K or Waymo Open, the image models are not doing really well. One thing I would like to highlight here: as we were evaluating the models, we have the encoder and then the readouts, right? For a video model, we know that the temporal nature is handled within the encoder itself. However, for images, it's not handled. And while we were evaluating, we thought, okay, if we evaluate image models on these tasks, they're going to be really bad anyway, because they don't know anything about the temporal nature. So what we did in the attentive readout is add some learnable temporal positional embeddings while training that readout module, in order to put the image models in a less disadvantaged position. These are the results based on that. Even with this addition, the image models are still not doing well on non-semantic tasks. They're not doing badly on semantic tasks; especially on Kinetics they're doing a good job, on SSv2 not so much. One part worth highlighting is the ScanNet results, though. Depth estimation is quite a geometric task, right? It's estimating how far a pixel is from the camera, and we see that image models, especially DINOv2, are doing pretty well. This is probably because, as the DINOv2 paper also shows, these models do well on depth estimation, so it's no surprise. But the reason it works so well compared to the other non-semantic tasks is probably that for depth estimation we just need a single frame, right? We don't need the other frames; we're just estimating the depth of the scene. However, for point tracking or object tracking, we need information from the other frames as well. And for relative camera pose estimation, since it's relative, it's relative to the first frame of the clip.
So this may be why, although ScanNet is quite a geometric task, the image models might still be doing a pretty good job. Okay. Second: these are the models with language pre-training. We didn't do language pre-training, but it's important for us to see how the models with language pre-training do. Are they better? Are they worse? We see the results here. Overall, when we switch from image to video, things improve, and with language pre-training as well, we see that the semantic tasks especially become a lot better. And these are the rest of the baselines, the V-JEPA and other video baselines that we evaluate. Overall, what we observe here is that the video models are doing a pretty good job on both non-semantic and semantic tasks. Okay, now we come to the 4DS results. We have already seen these numbers in the plots with the lines, so I'm not going to go through them again. But now that we have the whole picture, it's important to compare everything against everything. So what do we see here? First, the largest 4DS model obtains the best performance across all models on the non-semantic tasks. When we come to the semantic tasks like SSv2 or Kinetics (my mouse was in the wrong place, I mean here), they're still doing a good job, especially on SSv2; on Kinetics not as well. So the conclusion here would be that, yes, as you scale your model, your performance gets better, and compared to the baselines it's a pretty decent performance, except that on the semantic tasks it's at a bit of a disadvantage. Okay, this part is important as well, because I've been talking about a 22 billion parameter model and how well it's doing, but how are we going to run inference on that, right? It's hard. Therefore, we looked into distillation. Here, we distill a B model from an e model. The B model has 91 million parameters; the e model has close to 4 billion parameters.
And what we see here: the first lesson is that going from B to distilled B improves the performance a lot, so much that it's on par with a model that has more than three times the number of parameters. So yes, distillation in this case improves the numbers quite a lot. And here are some qualitative results. As we go from left to right, the number of parameters of the 4DS model increases, and these are three scenes from the ScanNet dataset. What we see is that as we go from left to right, the predictions get better and sharper, which again shows that scaling helps the performance. These are the results for object tracking, again three cases from the Waymo Open dataset. The dotted rectangles are the predicted bounding boxes, and the regular rectangles are the ground truth. As we scale, we see that they overlap more and more, again showing that scaling improves the performance. And these are the results for point tracking. In this figure, the circles are the ground truth, the connected dots are the predictions, and the line in between represents the error: in all of these predictions, the longer the line, the worse the prediction. We see that the 20 million parameter model has lots of those lines lying around, whereas the largest model is doing a much better job. Okay, we talked about scaling the model size; how about scaling the data? We already had the intuition that such self-supervised methods would need lots of data as we scale the model, and our experiments show the same thing: for all of the evaluations, as we increase the data size, the performance improves, and in almost all of them it's monotonic. So this sort of verifies our assumption that we need lots of data as we scale our models. And there is one more thing that I also want to mention.
So we have this huge encoder, right, 22 billion parameters, and how are we going to get the features? Are we going to take the features from the last layer, or read them from somewhere in the middle? It could be anything. For that, we ran this study: we looked into different layers of the encoder, read the features from them, and then used them for the readout. What we observe is that for most of the non-semantic tasks, like RE10K camera pose, point tracking, and Waymo Open, deeper layers are usually better, and it seems that 95% of the depth is usually a good compromise; this is what we used in the end. For action classification, the deeper layers are actually not as good, so for that one we ended up reading the features from 75% of the encoder's depth. Okay. So far so good. Now we come to the limitations, though. First, regarding the methodology: we used the masked autoencoder because we know it has been shown to be a scalable model, and there are other video works that used MAE, so it made sense to start from there. But obviously the self-supervised learning landscape is huge, so there are lots of other methodologies that could be used, maybe contrastive learning based, or joint embedding predictive architectures. There could be future work to do here. The second thing is that I showed nice results that show that things improve, but I haven't shown you any scaling laws, which would actually be interesting to have from this work. So one limitation is that our results at the moment are empirical, and we don't have any sort of scaling law for these models yet. And this brings me towards the end of the talk. I hope you found this interesting and want to take a look: the paper is on arXiv, and the code, checkpoints (including the distilled model, by the way), and a Colab demo are available on GitHub.
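A small footnote on the layer-selection heuristic mentioned above, expressed as code, with a hypothetical 40-layer encoder (the 95% and 75% fractions are from the talk; the helper name and the layer count are made-up assumptions):

```python
# Picking which encoder layer to read features from, as a fraction of depth.

def readout_layer(num_layers: int, depth_fraction: float) -> int:
    """Map a relative depth in (0, 1] to a 1-based encoder layer index."""
    return max(1, round(num_layers * depth_fraction))

num_layers = 40                              # hypothetical encoder depth
print(readout_layer(num_layers, 0.95))       # non-semantic tasks -> layer 38
print(readout_layer(num_layers, 0.75))       # action classification -> layer 30
```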
So yeah, thank you for joining today, and I'm happy to take any questions, and the best Weihnachtsmarkt suggestions as well. [applause] >> I'm going to be walking around with a mic. I'm trying to be fair, just so we can get to as many people as possible. Keep your questions succinct, and maybe one question per person. So, anybody with a question? >> [...] mind the position of the objects, because objects reside in three dimensions, and images are just the two-dimensional representation of three-dimensional objects; they are basically the shadows of those objects. So yeah, I didn't really understand why you would use this methodology. >> So is the question why we use video data to solve 4D tasks? Is it... >> Video is essentially a bunch of pixels. They do not encode the positions of the objects in three dimensions; they are just shadows. So you are choosing shadows to capture the dense reality. >> Right. So there are different modalities we could use as well, right? We could use point clouds as well, I guess, and that would give the true 3D structure of a scene. However, in this case, as I mentioned, the scale of the model is important, and correspondingly the scale of the data is important as well, and there isn't as much point cloud data as video data available. So that's one factor that makes it harder to use, you know, the ground-truth 3D map of the world. However, there are also works that tackle scene understanding from a point cloud perspective. So that's another option. >> Hi, I'm fascinated by your work. I just finished a program, a three-month online program with MIT, and in my final project I did vision and emotion recognition in still images.
>> Emotion recognition? >> Emotion recognition, in pictures of people's faces. I had some big datasets, and it was fascinating; I loved working with it. But of course my mind was racing to video applications, because it's the obvious next step. So this was wonderful, thank you, I really appreciate it, and I'm going to read your papers and see what I need to do next. However, I'm also curious why you, and I'm sure it's just too much data, but had you considered including the audio? Because I imagine that's a rich source of data. Now, it could also be wrong; it's often totally different from what you're watching. But had you played with that? Is it something you might want to work with in the future? >> This isn't something we have played with, but I think it's a really interesting approach to include audio as well, because then we have multimodality, right? Not just vision, but what we're hearing as well. It's a rich resource, but in this case we haven't included it. >> Thanks. >> Hi. I was wondering a bit about more long-term video understanding tasks. The tasks you're working on here are, from what I understand, like 16 frames, which is a few seconds at best. So what do you think would be necessary to go to longer-term video understanding tasks, like action segmentation? Is it that you would use the same base representation, or should there be a change in the way we think about videos in the first place? Like, should they be 3D tensors, or should you change architectures to work in a more streaming manner? >> Yeah, I think that's a good question. I feel like any new research question starts from how we evaluate it, and when it comes to long-term video understanding, there are some benchmarks that look into longer horizons, like several minutes, or maybe an hour. But in that case it's usually not the 4D tasks we're looking into.
They're usually more towards Q&A-style questions. There are some exceptions as well, but I think that's indeed something worth exploring, and for that, for example in the case of object tracking and point tracking, if there were some benchmarks that assess this, then we could start thinking about what kinds of approaches to take, and then, you know, try and fail, try and fail, and find what actually works. >> Okay, thanks. >> Trying to be a little more fair with the space usage here. Anybody? >> Two quick questions. Basically, using a video encoder, you are really restricted in what you can do with this model, because you always need 16 frames. It seems like an odd choice for a model that you want to be able to use for different things. For example, if you wanted to recognize a single frame from a movie or something like that, that would be kind of difficult with this. So wouldn't it make more sense to build something on top of something like DINO? >> So is the question: we pre-trained this model for, say, 16 frames, but if I wanted to use it for image understanding, what can I do if I want to extend it to more frames? >> One frame. >> Sorry? >> One frame. >> For one frame? Yeah. So for one frame, the easiest thing to apply is: we can still use that encoder. What we can do is just replicate the frame 16 times and get the features, and we would be getting the same features. >> Yeah. For some things that would work, for depth probably, but it wouldn't necessarily work for recognizing a frame from a movie or something like that. >> But it's still a single frame, right? Oh, you mean as a movie it's a long sequence, and that's why? >> It is a long sequence, but let's say you want to recognize one frame from that specific sequence. >> Yeah, so that depends on how you represent the video: it depends on the frame rate of the video you're using as an input.
So, for example, if you have a high frame rate, then you are likely going to end up with frames that look like each other, each shifted just a tiny bit relative to the previous one. But if you have a video that's longer and you sample at a lower frame rate, then you have a different representation. But yeah, this one is limited to 16 frames, and if you wanted to process longer videos, I guess what you can do is run inference on each of the chunks and then aggregate the results, to see if that works for your case. >> The depth estimation, that is not to scale like the single-image ones, because you have scale-invariant depth, right? >> Sorry, could you repeat? >> The depth estimation is not to scale. It's because you have depth in... >> It's scale-invariant depth, yeah. It's probably worth mentioning, because the models don't know anything about metric depth, right? So we predict scale-invariant depth. >> Yeah, I just wanted to say thank you for your presentation, and I wanted to ask: I got really interested in these learned tokens. >> Um... >> These learned tokens, as you named them. So if I'm not mistaken, they get learned globally through the whole training, and in the end they are fixed in place and they don't change, right? >> So at the end, by "at the end" do you mean during the evaluation? >> Yeah, during evaluation they don't change, they are fixed, right? >> So during the evaluation we toss them out, because they are just for the pre-training, during the reconstruction, because our loss is pixel reconstruction: a loss between each corresponding pixel. So the reconstruction and the linear decoder help our encoder learn useful features. Then we put away the linear decoder and use the learned features to train the readout module you see here. >> Yeah. Yeah. >> So it normally comes after the encoder, but while we're evaluating we're not using it. >> Okay. Good. Okay.
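The pre-training flow described in this exchange can be sketched in a few lines of numpy: a learned mask token is appended for each missing patch during pre-training, a linear decoder reconstructs pixels under a per-pixel loss, and at evaluation both the mask tokens and the decoder are discarded in favor of the encoder features. This is a minimal sketch under stated assumptions: the dimensions, the toy linear "encoder", and names like `mask_token` are illustrative, not the actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative, not the real model's).
num_tokens, dim, pixels_per_token = 8, 16, 12
mask_ratio = 0.75

# Toy linear "encoder" and linear pixel decoder; in the real setup these
# would be a video transformer and a learned linear readout.
W_enc = rng.normal(size=(pixels_per_token, dim)) * 0.1
W_dec = rng.normal(size=(dim, pixels_per_token)) * 0.1
mask_token = rng.normal(size=(dim,)) * 0.02  # the learned token discussed above


def pretrain_step(patches):
    """One masked-reconstruction step: encode visible patches only,
    append one mask token per missing patch, decode all back to pixels."""
    n_keep = int(num_tokens * (1 - mask_ratio))
    perm = rng.permutation(num_tokens)
    keep, drop = perm[:n_keep], perm[n_keep:]

    visible = patches[keep] @ W_enc          # encode only the visible patches
    full = np.empty((num_tokens, dim))
    full[keep] = visible
    full[drop] = mask_token                  # as many mask tokens as are missing

    recon = full @ W_dec                     # linear pixel decoder
    loss = np.mean((recon - patches) ** 2)   # per-pixel reconstruction loss
    return loss


def extract_features(patches):
    """Evaluation path: mask tokens and decoder are tossed out;
    only the encoder features feed the downstream readout modules."""
    return patches @ W_enc


patches = rng.normal(size=(num_tokens, pixels_per_token))
loss = pretrain_step(patches)
feats = extract_features(patches)
print(loss >= 0.0, feats.shape)  # a non-negative loss and (8, 16) features
```

The point of the sketch is the asymmetry: the mask tokens and decoder exist only to create the reconstruction objective, so they never appear in `extract_features`.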
>> So, you're just appending as many tokens as there are missing from the masking? >> Yes. >> Did you also do any data ablation during model training, or did you increase your data each time you increased the model size as well? >> So we didn't necessarily limit the data size, but the data size has not changed: in each of those cases it remained the same. We did do an ablation with one of the model sizes, though, to see whether increasing the data amount helps or not. >> Thanks. >> Thank you, and apologies in advance if it's not quite a question, but I'm wondering if you tackled something on the side of motion, using the example of the traffic in Turkey, if I remember correctly: telling apart whether the movement in the frame stems from the camera that's moving, or maybe the actual movement of the objects that are in the frame and that maybe just so happen to occupy the... >> So, what's the question related to this? I think that's the first question: have you tackled differentiating the relative motion? >> No, there's no assumption on what the camera motion is or what the scene is. Obviously there is some motion happening in the videos, because imagine training on 170 million clips of, I don't know, nothing happening; that's not really interesting data. However, we don't have any assumption in the data itself about the camera pose. >> So, for example, the difference between movement from the camera shifting and the actual movement of the objects in the frame is semantically indistinguishable for the model.
>> But then you don't really care whether the movement is only apparent, because the camera is moving, or whether the movement is actual? >> I think the question is whether we thought about this, or put something in the model for it. Is that the question? >> Yeah. >> We didn't have any assumptions on that in the data. >> So you mentioned already JEPA, the joint-embedding architecture. What do you think is going to be the next kind of architecture in video understanding and generation? Would that be brain-inspired, for instance, and how do you think it is going to evolve? >> Well, first of all, I think that's a research question, so that's why it's exciting to be talking about all of this. In this work, we explored MAE, and overall we see that it does a pretty good job. It's still being explored in the community as well, like different ways to implement MAE; there has been a work that came out today about recurrent MAE, for example, that's doing a pretty good job on videos. But apart from that, for the JEPA architectures there has recently been V-JEPA 2, which also includes, for example, robotics tasks, which is quite interesting. So I would say it's not clear at the moment, but that's also what makes research exciting. We are yet to see, I would say. >> All right, thank you very much for the talk. It has been amazing. [applause and cheering]