>> Great. Thank you for the introduction. >> Cool. Okay. So, when I first received the invitation to talk here, I thought to myself, okay, I'm going to travel from the UK to Berlin, and it's going to be during winter time. I know because it's the winter speaker series, right? So I thought I should come near Christmas time, and if you have any recommendations for a Weihnachtsmarkt, find me after the talk. And right now I'm excited to be here speaking about video understanding. I'm Dara, and I'm a research engineer at Google DeepMind. In the past three years I've focused on video understanding, with some experience in related areas. Today I will deep dive into one of the works I did at Google DeepMind with my collaborators. But first, let's discuss what video understanding entails. Vision is one of our five senses, and it's critically important as we go about our lives. Let's take this kid as an example. Imagine for a moment that he doesn't know how to ride a bike. He sees this girl and thinks, I want to ride a bike as well so that I can move around fast. In that case, in order to learn how to ride a bike, he would observe this girl and understand how different parts of the bike move together to make that possible.
And this is a good example for me, because I was preparing this slide deck last week, working from home, and I found myself in this perfect situation where I needed to rely on my vision. So in this example, I brewed myself a cup of coffee. I went to my desk. I put it aside, turned my head, took a book about Argentina, which is an amazing place, by the way; I visited this year. Then I remembered where my cup was, and I brought it close enough so that I could drink it. So coffee and an amazing book: this is how my perfect day starts. And this is how my worst day starts. I don't know about you, but I hate driving in traffic. This is a video from Istanbul. I lived in Istanbul before; it's a big chaos. As you can see in this video, people sometimes walk on the highway, cars suddenly change lanes, and so on. So surviving in Istanbul traffic requires me to first calm down, and then to understand where everything is moving, like tracking the cars around me and also anticipating how near they are to me. So overall, while we humans are pretty good at vision, there are some cases where we, or at least I, could use some extra help. And that brings us to the topic of today. As humans, we understand how things around us move and where they are. That means we can do 4D understanding: 3D representing space plus 1D representing time. So how do we build systems that do the same? If we look into the literature, or at the applications we see in the world right now, it's usually people training expert models that are pretty good at one task, and then for another task, another model that solves that. That makes sense, because it's easier to train a single model for a single task. But today I want to discuss a different path.
What if we aim to train a model that solves all these tasks, and then we just scale it up? Does that work, and how can we make it work? Today I will share the outcomes of a research work that tackles this question. This is a work I did along with my amazing collaborators listed here at Google DeepMind, and I will share the details of the work shortly. But in summary, we show that we can improve the model's performance as we scale up the model size. So let's zoom out for a second and look at how we can approach this problem. Let's take this video as an example. This is a video of a dancer called Donel, and he is dancing waacking. If you've never heard of waacking, it's a dance style that was born in the 70s in the underground club scene of Los Angeles. And as you can see, he also has some signature moves, like spinning multiple times, etc. So when we look at this video, what are some of the things that we want to understand better? First we might ask, what is he doing? He's dancing, not walking. Any other things? I can take some guesses from around the room. >> Sorry. >> The setting. Where is he? What is he wearing? How many spins is he doing? Okay. So, let's take this first question as an example. What is he doing? To answer this question, we somehow want to use this video and get to this output, right? So let's discuss what kind of things we can do here. On a very high level, we need some kind of vision encoder that takes this input and extracts the information. And as a vision encoder, there are some building blocks. If we look into what we can use, CNNs are pretty standard approaches for extracting features, and they're pretty good at it. If we look at more modern approaches, then transformer encoders would be good solutions as well. Okay, so now we have an idea about the building block of this encoder. But let's see how we actually get from here to there.
And for that, let's first remember that a video is actually a sequence of frames stacked along time, right? So maybe we can simplify this problem by approaching the video understanding problem as an image understanding problem. In that case we would have an image encoder, and we would pass the frames of the video one by one through this image encoder; it would actually be a single image encoder with shared weights. So that's a solution, and this has been explored in the literature as well. For example, there are discriminative approaches like DINO, which learn by distinguishing this image from that image. So DINO-like models are really good at answering semantic questions, like what is here, what is in this scene. And then there are generative approaches like masked autoencoding, abbreviated as MAE; these are reconstructive approaches. They learn by masking out parts of the input and forcing the model to reconstruct the masked-out parts. I will talk more about them in the upcoming parts. But let's remember that we are on a mission to solve all the 4D tasks, right? We started by asking what is he doing, but we want to answer multiple questions. There are a couple of ways to approach this. First of all, we have the features from the image encoder. What we can do is train a readout module that solves a particular task. Readout modules are usually simple, lightweight modules; they could be an MLP or an attention layer, and they give you the relevant output. The advantage: it's lightweight. The disadvantage: if we look at the literature, there are usually different readouts for each task, so it's not very generalizable. The second approach, and the one dominating the headlines right now, is of course using a vision language model, right?
And in that case the advantage is that it's generalizable, because whatever the question is, let's say what does the dancer wear, or anything else, the text encoder gives us the text features, we have the features from the image, and these are somehow fused. There are many ways to do it, but I won't get into the details, and then you get the relevant answer. So that's a pretty good approach. However, there is a catch. Training such models requires a massive amount of data that usually comes from the web, because the web is a great source in that sense. What happens then is that they are trained on image-text pairs, and the text usually describes what's in the image, but only at a very high level. For example, for this image, the caption could be "guy throwing his hand up," but we are unlikely to see "guy lifting his arm by 45 degrees," right? We are missing those little cues. That's one. There is one more disadvantage, actually. I don't know if it's obvious from the slides, but I'd like to get a guess here. >> The frames in the sequence, the movement. >> Yeah, exactly. The point is that we are taking the static frames one by one, and we kind of lose time, right? That's indeed the disadvantage: as we chop the video into frames and feed them into the encoder, we lose time. But to understand motion, we need to have a sense of how things change over time. So this takes us to the second approach, and that is to use a video encoder, because video encoders just take the videos as they are, and they respect the spatiotemporal nature of the videos. So that would solve our issue, and this has been explored in the literature. We know there are 3D convolutions, or other works like ViViT, that could solve our problem.
But the disadvantage is that it's hard to train and scale these models. Another disadvantage is that, while there is a lot of image data available, video data is not as abundant, which is also part of why it is hard to train. So this is where we are. We have image-based solutions. We have video-based solutions. And to solve 4D tasks, image-based solutions are nice. They're easy to scale, they're great at semantics, but they have this nature where they process the frames independently. And video models solve that problem, but they're more expensive to train and scale. So this brings us to the specific gap we wanted to address in our work. The first thing is the evaluation gap, because if we look into the existing literature on video models, we see that they are usually evaluated on how well they describe what's happening in the scene, but not how things are happening in the scene. So the first thing we did was decide to look into tasks that actually require understanding how things work. The second thing is, as I mentioned, video models have been harder to train and scale. So in this work we scaled our video backbone to 22 billion parameters, which is, as far as I know, still the largest video backbone today, and we show that it consistently improves the performance. Okay, now we come to the methodology. So we defined our goal, right? We want to scale up video models and solve the 4D tasks. In the methodology section, I would like to highlight that the keyword here is scalability. We always had this in mind while we were making any decisions. And as we talk about scalability, it is important to talk not just about the number of model parameters but also about the scale of the data. Because if we scale up the model but we don't scale our data, then we are still likely not to end up with a good solution.
So we need to go with methods that allow us to leverage as much data as possible, and that means not going for supervised learning; we go with self-supervised learning. It has been shown that self-supervised learning is scalable, and within self-supervised learning, masked autoencoding has been shown to be a scalable method in many earlier vision works; the MAE paper itself, and the VideoMAE paper, have shown that it learns useful features. On a very high level, here is what happens in a masked autoencoder. Okay, here's my cursor. This is our input image; in this case it's from the original MAE paper, so they use images, while we're going to use videos. This is the input image, and these are the patches. Some patches are masked, usually around 75% of the image, so more than half the image is masked. Then the encoder gets the features for the unmasked patches, and the decoder is forced to reconstruct all of them. So on a high level this is how MAE works, and here is how we make MAE work at scale; we call our method simple MAE, or 4DS. Let's start from the top left. This is our video, right? That would be width, height, and time. And, like the MAE work itself, we mask the input. In the original MAE work they do 75% masking, in the follow-up VideoMAE they do 90% masking, and we do 95% masking, reducing as much as we can, because again we always keep scalability in mind. Then we have the features, which we pass through a self-attention layer, and then the mask tokens are concatenated to help us get the reconstructed video. So what's happening in between here? Here we have a linear decoder. So this is the decoder part, again a decision that came from scalability. If you look at the literature, we usually see a transformer decoder, and a transformer decoder is higher quality.
However, it also comes with more parameters. And since the keyword is scalability, we thought, okay, let's go with a linear decoder and see how it works. Also, as a side note, as much as we train the model on reconstruction, at the end of the day, for downstream evaluation, what we care about is just the video features themselves. So during inference, while we're evaluating the model, we don't use this linear decoder at all. That's why not having a heavy decoder is perhaps also fine: we're interested in those features, and maybe the encoder has already learned useful features. We'll see. So this is the pre-training setup. We use a standard mean squared error loss on RGB pixel values. Again, no labels or anything. We trained on 170 million short videos, and we sub-clipped them, so we ended up with on the order of a few billion clips. The input dimensions and resolution are listed here. And there are a couple of things that are important to highlight here. I'm coming from industry, and I'm speaking at a university right now about scalable models, so I believe it's important to talk about the resources and everything, right? We trained this model on 256 TPU chips, and we also applied some other tricks: we converted the model weights to bfloat16, except for the loss and softmax computations, which stayed in float32. Besides that, data parallelism is an obvious approach for training models at scale; however, in our case that wasn't sufficient, so we applied model sharding and optimizer state sharding as well, and that's how we got to billions of parameters in the end. Okay. So now we know the methodology, and now let's talk about what kind of evaluation tasks we looked into, because, as I mentioned, the ones in the literature are quite focused on semantics, but we want to get to 4D understanding.
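Before getting to the evaluation tasks, the pre-training recipe just described can be sketched in a few lines. This is a toy NumPy illustration under stated assumptions (random 95% patch masking, a single linear layer as the decoder, an MSE loss on pixels); the shapes and the random stand-in encoder are invented for illustration and are not the actual 4DS architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy video as spacetime patches. For scale: 16 frames of 224x224 with
# 2x16x16 patches would give 8*14*14 = 1568 patches; we use a small toy count.
num_patches, patch_dim, feat_dim = 200, 32, 64
patches = rng.normal(size=(num_patches, patch_dim))

# 95% masking: the encoder only ever sees 5% of the patches.
mask_ratio = 0.95
num_keep = int(num_patches * (1 - mask_ratio))          # 10 visible patches
perm = rng.permutation(num_patches)
keep_idx, mask_idx = perm[:num_keep], perm[num_keep:]

# Stand-in "encoder": a single random projection of the visible patches.
W_enc = rng.normal(size=(patch_dim, feat_dim)) * 0.1
visible_feats = patches[keep_idx] @ W_enc               # (10, 64)

# A shared mask token fills in the masked positions before decoding.
mask_token = np.zeros(feat_dim)
full_feats = np.empty((num_patches, feat_dim))
full_feats[keep_idx] = visible_feats
full_feats[mask_idx] = mask_token

# Linear decoder: one matrix mapping features back to raw patch pixels.
W_dec = rng.normal(size=(feat_dim, patch_dim)) * 0.1
recon = full_feats @ W_dec                              # (200, 32)

# MSE reconstruction loss on the pixels. At evaluation time the decoder is
# dropped entirely and only the encoder features are used.
loss = np.mean((recon - patches) ** 2)
print(num_keep, loss)
```

The point of the 95% ratio is visible in the shapes: the expensive encoder runs on 10 patches instead of 200, which is a large part of what makes the recipe cheap enough to scale.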
So this part I'd like to do together, because I think it's fun, and you've already seen some of these examples. In this one, I was mentioning that I need to track everything around me and anticipate how far things are from me. Any guesses on which computer vision tasks this corresponds to? Any guess? Tracking everything around, anticipating how far things are. >> Okay. >> Okay, I'm getting some guesses. Can I hear that again? >> Object tracking. >> Yes. >> Optical flow, depth estimation. >> They all make sense. In this case, I had depth estimation and object tracking in mind. And for assessing those, we use the following datasets. ScanNet is an indoor RGB-D dataset; it contains videos of indoor scenes, and this is what we use for evaluating depth estimation. For object tracking, we use the Waymo Open dataset. This is a dataset that was collected from Waymo cars in urban and suburban areas, and it has lots of different annotations; in this case we use the 2D bounding box annotations. Okay, so in this one: I left my cup, turned my head, remembered where it was, and brought it in to drink my coffee. >> Sorry. >> That makes sense, but not what I had in mind. There are multiple right answers, by the way, so please. >> Object... >> Makes sense. >> Exactly. That was what I had in mind: camera pose estimation, because in this case it's egocentric understanding, from the first-person view; it's the estimation of where the camera is. And for that, we look into RealEstate10K. RE10K is a dataset collected from YouTube; it contains videos of real estate properties, and it also has camera pose annotations. Okay, this one's a bit tricky. This is understanding how different parts of the bike move. But this time it's not object tracking, right? It's a bit more than that. >> Segmentation? >> Segmentation makes sense. Yes, that makes sense.
There are lots of correct answers; I'm trying to get to the mic, though. >> ... >> Makes a lot of sense. Okay. Yeah, I thought this would be a bit tricky, but I had point tracking in mind. Point tracking is the problem of tracking a point through every other frame of the video, so it's a more fine-grained problem than object tracking. For that we use the Perception Test benchmark. It's a multimodal benchmark with lots of real-world videos and lots of different annotations and tasks, and in this case we look into the point tracking annotations. Okay, I think this one is easy. What is he doing? Yes, I heard action recognition somewhere, thank you. So that would be action recognition. For that we looked into two benchmarks: one is Something-Something v2, and the other is Kinetics. Kinetics is a collection of YouTube videos that have been annotated with what's happening in the video. And Something-Something v2 is a benchmark where people, say, put something into something; there are lots of object interactions. We look into both of those results; however, since Something-Something v2 really has those temporal dynamics happening, we look into the SSv2 results a bit more closely. Okay. So these are the tasks, and now let's get into our evaluation protocol. This was one of the approaches, you might remember: we have the video encoder, and then we have a readout module. I mentioned this is usually a lightweight readout module, and it gives us the result, right? When we look into the literature, there are several different ways of doing this. The readout could be very simple: it could be a dense layer, or it could be an attention layer, or it could be a DPT layer,
that is, a dense prediction transformer. The DPT one especially is used mostly for dense tasks, and by a dense task I mean a task where you need to predict a value for each of the pixels; depth estimation is usually considered one of those tasks. In our case, we have lots of different evaluations. What are we going to do? Are we going to try all of them, or combinations of them? How is this going to work? What we decided is to go with the attention layer. It's probably worth highlighting at this point what the queries and keys correspond to. We used a cross-attention layer, and the queries are learnable, except in the tracking cases, where the queries are the object or the point that we're tracking. Another thing I'd like to highlight here, on this figure: we have the video encoder and the readout, right? As we're training the readout, what are we doing with the video encoder? There are several options. We can freeze the video encoder, which means we don't update it; we just train the readout. That's cheaper to do. Amazing. Or we can fine-tune end to end. We did both, and I will show the results for that shortly. So this is, I would say, the main message of the paper, because our main question was: can we solve the 4D tasks better as we increase the model size? And the answer is yes. Here we see the results for five tasks: camera pose estimation, point tracking, depth estimation, object tracking, and action classification. The x-axis is the model size in log scale, and the y-axis is the performance. As you can see, the error-based metrics, like depth estimation or camera pose estimation, have a different trend than the other ones, but the important thing here is that we observe the performance getting better as we scale up.
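To make the attentive readout concrete: a single cross-attention layer with learnable queries can be sketched as below (toy NumPy; all dimensions and the linear head are invented, and the real readout details may differ):

```python
import numpy as np

rng = np.random.default_rng(0)
num_tokens, d = 50, 16            # video feature tokens from the encoder
feats = rng.normal(size=(num_tokens, d))

# Learnable queries. For tracking tasks the query would instead encode the
# object box or the point being tracked.
num_queries = 4
Q = rng.normal(size=(num_queries, d))

# Standard scaled dot-product cross-attention: the queries attend to the
# encoder features, which act as both keys and values here.
scores = Q @ feats.T / np.sqrt(d)                     # (4, 50)
attn = np.exp(scores - scores.max(axis=1, keepdims=True))
attn /= attn.sum(axis=1, keepdims=True)               # softmax over tokens
out = attn @ feats                                    # (4, 16)

# A final linear head maps the attended features to the task output,
# e.g. per-query coordinates or class logits.
W_head = rng.normal(size=(d, 2))
pred = out @ W_head                                   # (4, 2)
print(pred.shape)
```

The appeal of this design is that the same lightweight module shape covers every task: only the number of queries, what the queries encode, and the output head change.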
And there are a few other things to highlight here as well. The improvement is almost always monotonic, although I see some exceptions here. Another thing: you may have noticed there are two lines in each of those plots. There is the frozen one and the fine-tuned one, which is what I just explained, what we do with the video encoder itself. Fine-tuning usually helps, especially with depth estimation and action classification, but the results are a bit mixed: for point tracking, camera pose, and object tracking, frozen is actually doing better. Okay, so this was how the 4DS model does as it scales, but it's also important to compare how it does against previous work. I'm going to show lots of results here, but first let's take a look at what we're looking at. Here on this table I have the tasks, and here we have the models. In the first section we have image models, here are the video baselines, and on the bottom we have the 4DS models, whose results we've just seen. Now we're going to compare against the previous models. There's also an important detail here: some of the models have included language in their pre-training. In our architecture, you may have noticed that I never talked about language; this is pure pixel supervision. But some previous models did have it, and it's worth highlighting what that means in the end. Okay. So, first results: image backbones. Here we see the results for image backbones, as well as some VideoPrism results, because all of those results are relative, right? I need to show both sides to make a comparison.
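As a side sketch of the frozen-versus-fine-tuned distinction: in the frozen regime only the readout weights receive gradient updates, while end-to-end fine-tuning also updates the encoder. A toy NumPy least-squares version (all shapes and the tiny linear "encoder" are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))                 # toy inputs (e.g. videos)
y = rng.normal(size=(100, 1))                 # toy task targets
W_enc = rng.normal(size=(8, 4)) * 0.3         # stand-in "encoder"
W_read = rng.normal(size=(4, 1)) * 0.3        # stand-in "readout"

def loss(W_enc, W_read):
    return np.mean((X @ W_enc @ W_read - y) ** 2)

lr, frozen = 0.01, True                       # flip `frozen` to fine-tune
start = loss(W_enc, W_read)
for _ in range(200):
    err = X @ W_enc @ W_read - y              # (100, 1)
    grad_read = (X @ W_enc).T @ err * 2 / len(X)
    W_read -= lr * grad_read                  # the readout always trains
    if not frozen:                            # fine-tuning also updates encoder
        grad_enc = X.T @ err @ W_read.T * 2 / len(X)
        W_enc -= lr * grad_enc
end = loss(W_enc, W_read)
print(start, end)
```

The frozen regime is cheaper precisely because the encoder's features (and gradients) never need to be recomputed per task; the trade-off shown in the plots is whether that restriction costs accuracy.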
So what we see here is that for non-semantic tasks, like RealEstate10K or Waymo Open, the image models are not doing really well. One thing I would like to highlight here: as we were evaluating the models, we have the encoder and then the readouts, right? For a video model, we know that the temporal nature is handled within the encoder itself. However, for images, it's not handled. And while we were evaluating, we thought, okay, if we evaluate image models on these tasks, they're going to be really bad anyway, because they don't know anything about the temporal nature. So what we did in the attentive readout is add some learnable temporal positional embeddings while training that readout module, in order to put the image models in a less disadvantaged position. These are the results based on that. Even with this addition, the image models are still not doing well on non-semantic tasks. They're not doing badly on semantic tasks; especially on Kinetics they're doing a good job, on SSv2 not so much. One part worth highlighting is the ScanNet results, though. Depth estimation is quite a geometric task, right? It's estimating how far a pixel is from the camera, and we see that image models, especially DINOv2, are doing pretty well. This is probably because, as the DINOv2 paper also shows, these models do well on depth estimation, so it's no surprise. But the reason it works so well compared to the other non-semantic tasks is probably that for depth estimation we just need a single frame, right? We don't need the other frames; we're just estimating the depth of the scene. However, for point tracking or object tracking, we need information from the other frames as well. And for relative camera pose estimation, since it's relative, it's relative to the first frame of the clip.
So this may be why, although ScanNet is quite a geometric task, the image models might still be doing a pretty good job. Okay. Second: these are the models with language pre-training. We didn't do language pre-training, but it's important for us to see how the models with language pre-training do. Are they better? Are they worse? We see the results here. Overall, when we switch from image to video, things improve, and with language pre-training as well, we see that the semantic tasks especially become a lot better. And these are the rest of the baselines, the V-JEPA and other video baselines that we evaluate. Overall, what we observe here is that the video models are doing a pretty good job on both non-semantic and semantic tasks. Okay, now we come to the 4DS results. We have already seen these numbers in the plots with the lines, so I'm not going to go through them again. But now that we have the whole picture, it's important to compare everything against everything. So what do we see here? First, the largest 4DS model obtains the best performance across all models on the non-semantic tasks. When we come to the semantic tasks like SSv2 or Kinetics (my mouse was in the wrong place, I mean here), they're still doing a good job, especially on SSv2; on Kinetics not as well. So the conclusion here would be that, yes, as you scale your model, your performance gets better, and compared to the baselines it's a pretty decent performance, except that on the semantic tasks it's at a bit of a disadvantage. Okay, this part is important as well, because I've been talking about a 22 billion parameter model and how well it's doing, but how are we going to run inference on that, right? It's hard. Therefore, we looked into distillation. Here, we distill a B model from an e model. The B model has 91 million parameters; the e model has close to 4 billion parameters.
And what we see here: the first lesson is that going from B to distilled B improves the performance a lot, so much that it's on par with a model that has more than three times the number of parameters. So yes, distillation in this case improves the numbers quite a lot. And here are some qualitative results. As we go from left to right, the number of parameters of the 4DS model increases, and these are three scenes from the ScanNet dataset. What we see is that as we go from left to right, the predictions get better and sharper, which again shows that scaling helps the performance. These are the results for object tracking, again three cases from the Waymo Open dataset. The dotted rectangles are the predicted bounding boxes, and the regular rectangles are the ground truth. As we scale, we see that they overlap more and more, again showing that scaling improves the performance. And these are the results for point tracking. In this figure, the circles are the ground truth, the connected dots are the predictions, and the line in between represents the error: in all of these predictions, the longer the line, the worse the prediction. We see that the 20 million parameter model has lots of those lines lying around, whereas the largest model is doing a much better job. Okay, we talked about scaling the model size; how about scaling the data? We already had the intuition that such self-supervised methods would need lots of data as we scale the model, and our experiments show the same thing: for all of the evaluations, as we increase the data size, the performance improves, and in almost all of them it's monotonic. So this sort of verifies our assumption that we need lots of data as we scale our models. And there is one more thing that I also want to mention.
So we have this huge encoder, right, 22 billion parameters, and how are we going to get the features? Are we going to take the features from the last layer, or read them from somewhere in the middle? It could be anything. For that, we ran this study: we looked into different layers of the encoder, read the features from them, and then used them for the readout. What we observe is that for most of the non-semantic tasks, like RE10K camera pose, point tracking, and Waymo Open, deeper layers are usually better, and it seems that 95% of the depth is usually a good compromise; this is what we used in the end. For action classification, the deeper layers are actually not as good, so for that one we ended up reading the features from 75% of the encoder's depth. Okay. So far so good. Now we come to the limitations, though. First, regarding the methodology: we used the masked autoencoder because we know it has been shown to be a scalable model, and there are other video works that used MAE, so it made sense to start from there. But obviously the self-supervised learning landscape is huge, so there are lots of other methodologies that could be used, maybe contrastive learning based, or joint embedding predictive architectures. There could be future work to do here. The second thing is that I showed nice results that show that things improve, but I haven't shown you any scaling laws, which would actually be interesting to have from this work. So one limitation is that our results at the moment are empirical, and we don't have any sort of scaling law for these models yet. And this brings me towards the end of the talk. I hope you found this interesting and want to take a look: the paper is on arXiv, and the code, checkpoints (including the distilled model, by the way), and a Colab demo are available on GitHub.
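A small footnote on the layer-selection heuristic mentioned above, expressed as code, with a hypothetical 40-layer encoder (the 95% and 75% fractions are from the talk; the helper name and the layer count are made-up assumptions):

```python
# Picking which encoder layer to read features from, as a fraction of depth.

def readout_layer(num_layers: int, depth_fraction: float) -> int:
    """Map a relative depth in (0, 1] to a 1-based encoder layer index."""
    return max(1, round(num_layers * depth_fraction))

num_layers = 40                              # hypothetical encoder depth
print(readout_layer(num_layers, 0.95))       # non-semantic tasks -> layer 38
print(readout_layer(num_layers, 0.75))       # action classification -> layer 30
```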
So yeah, thank you for joining today, and I'm happy to take any questions, and the best Weihnachtsmarkt suggestions as well. [applause] >> I'm going to be walking around with a mic. I'm trying to be fair, just so we can get to as many people as possible. Keep your questions succinct, and maybe one question per person. So, anybody with a question? >> [...] mind the position of the objects, because objects reside in three dimensions, and images are just the two-dimensional representation of three-dimensional objects; they are basically the shadows of those objects. So yeah, I didn't really understand why you would use this methodology. >> So is the question why we use video data to solve 4D tasks? Is it... >> Video is essentially a bunch of pixels. They do not encode the positions of the objects in three dimensions; they are just shadows. So you are choosing shadows to capture the dense reality. >> Right. So there are different modalities we could use as well, right? We could use point clouds as well, I guess, and that would give the true 3D structure of a scene. However, in this case, as I mentioned, the scale of the model is important, and correspondingly the scale of the data is important as well, and there isn't as much point cloud data as video data available. So that's one factor that makes it harder to use, you know, the ground-truth 3D map of the world. However, there are also works that tackle scene understanding from a point cloud perspective. So that's another option. >> Hi, I'm fascinated by your work. I just finished a program, a three-month online program with MIT, and in my final project I did vision and emotion recognition in still images.
>> Emotion recognition? >> Emotion recognition, in pictures of people's faces. I had some big datasets, and it was fascinating; I loved working with it. But of course my mind was racing to video applications, because it's the obvious next step. So this was wonderful, thank you, I really appreciate it, and I'm going to read your papers and see what I need to do next. However, I'm also curious why you, and I'm sure it's just too much data, but had you considered including the audio? Because I imagine that's a rich source of data. Now, it could also be wrong; it's often totally different from what you're watching. But had you played with that? Is it something you might want to work with in the future? >> This isn't something we have played with, but I think it's a really interesting approach to include audio as well, because then we have multimodality, right? Not just vision, but what we're hearing as well. It's a rich resource, but in this case we haven't included it. >> Thanks. >> Hi. I was wondering a bit about more long-term video understanding tasks. The tasks you're working on here are, from what I understand, like 16 frames, which is a few seconds at best. So what do you think would be necessary to go to longer-term video understanding tasks, like action segmentation? Is it that you would use the same base representation, or should there be a change in the way we think about videos in the first place? Like, should they be 3D tensors, or should you change architectures to work in a more streaming manner? >> Yeah, I think that's a good question. I feel like any new research question starts from how we evaluate it, and when it comes to long-term video understanding, there are some benchmarks that look into longer horizons, like several minutes, or maybe an hour. But in that case it's usually not the 4D tasks we're looking into.
They're usually more towards Q&A-style questions. There are some exceptions as well, but I think that's indeed something worth exploring, and for that, for example in the case of object tracking and point tracking, if there were some benchmarks that assess this, then we could start thinking about what kinds of approaches to take, and then, you know, try and fail, try and fail, and find what actually works. >> Okay, thanks. >> Trying to be a little more fair with the space usage here. Anybody? >> Two quick questions. Basically, using a video encoder, you are really restricted in what you can do with this model, because you always need 16 frames. It seems like an odd choice for a model that you want to be able to use for different things. For example, if you wanted to recognize a single frame from a movie or something like that, that would be kind of difficult with this. So wouldn't it make more sense to build something on top of something like DINO? >> So is the question: we pre-trained this model for, say, 16 frames, but if I wanted to use it for image understanding, what can I do if I want to extend it to more frames? >> One frame. >> Sorry? >> One frame. >> For one frame? Yeah. So for one frame, the easiest thing to apply is: we can still use that encoder. What we can do is just replicate the frame 16 times and get the features, and we would be getting the same features. >> Yeah. For some things that would work, for depth probably, but it wouldn't necessarily work for recognizing a frame from a movie or something like that. >> But it's still a single frame, right? Oh, you mean as a movie it's a long sequence, and that's why? >> It is a long sequence, but let's say you want to recognize one frame from that specific sequence. >> Yeah, so that depends on how you represent the video: it depends on the frame rate of the video you're using as an input.
So, for example, if you have a high frame rate, then you are likely going to end up with frames that look like each other, each shifted just a tiny bit relative to the previous one. But if you have a video that's longer and you sample at a lower frame rate, then you have a different representation. But yeah, this one is limited to 16 frames, and if you wanted to process longer videos, I guess what you can do is run inference on each of the chunks and then aggregate the results, to see if that works for your case. >> The depth estimation, that is not to scale like the single-image ones, because you have scale-invariant depth, right? >> Sorry, could you repeat? >> The depth estimation is not to scale. It's because you have depth in... >> It's scale-invariant depth, yeah. It's probably worth mentioning, because the models don't know anything about metric depth, right? So we predict scale-invariant depth. >> Yeah, I just wanted to say thank you for your presentation, and I wanted to ask: I got really interested in these learned tokens. >> Um... >> These learned tokens, as you named them. So if I'm not mistaken, they get learned globally through the whole training, and in the end they are fixed in place and they don't change, right? >> So at the end, by "at the end" do you mean during the evaluation? >> Yeah, during evaluation they don't change, they are fixed, right? >> So during the evaluation we toss them out, because they are just for the pre-training, during the reconstruction, because our loss is pixel reconstruction: a loss between each corresponding pixel. So the reconstruction and the linear decoder help our encoder learn useful features. Then we put away the linear decoder and use the learned features to train the readout module you see here. >> Yeah. Yeah. >> So it normally comes after the encoder, but while we're evaluating we're not using it. >> Okay. Good. Okay.
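The pre-training flow described in this exchange can be sketched in a few lines of numpy: a learned mask token is appended for each missing patch during pre-training, a linear decoder reconstructs pixels under a per-pixel loss, and at evaluation both the mask tokens and the decoder are discarded in favor of the encoder features. This is a minimal sketch under stated assumptions: the dimensions, the toy linear "encoder", and names like `mask_token` are illustrative, not the actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative, not the real model's).
num_tokens, dim, pixels_per_token = 8, 16, 12
mask_ratio = 0.75

# Toy linear "encoder" and linear pixel decoder; in the real setup these
# would be a video transformer and a learned linear readout.
W_enc = rng.normal(size=(pixels_per_token, dim)) * 0.1
W_dec = rng.normal(size=(dim, pixels_per_token)) * 0.1
mask_token = rng.normal(size=(dim,)) * 0.02  # the learned token discussed above


def pretrain_step(patches):
    """One masked-reconstruction step: encode visible patches only,
    append one mask token per missing patch, decode all back to pixels."""
    n_keep = int(num_tokens * (1 - mask_ratio))
    perm = rng.permutation(num_tokens)
    keep, drop = perm[:n_keep], perm[n_keep:]

    visible = patches[keep] @ W_enc          # encode only the visible patches
    full = np.empty((num_tokens, dim))
    full[keep] = visible
    full[drop] = mask_token                  # as many mask tokens as are missing

    recon = full @ W_dec                     # linear pixel decoder
    loss = np.mean((recon - patches) ** 2)   # per-pixel reconstruction loss
    return loss


def extract_features(patches):
    """Evaluation path: mask tokens and decoder are tossed out;
    only the encoder features feed the downstream readout modules."""
    return patches @ W_enc


patches = rng.normal(size=(num_tokens, pixels_per_token))
loss = pretrain_step(patches)
feats = extract_features(patches)
print(loss >= 0.0, feats.shape)  # a non-negative loss and (8, 16) features
```

The point of the sketch is the asymmetry: the mask tokens and decoder exist only to create the reconstruction objective, so they never appear in `extract_features`.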
>> So, you're just appending as many tokens as there are missing from the masking? >> Yes. >> Did you also do any data ablation during model training, or did you increase your data each time you increased the model size as well? >> So we didn't necessarily limit the data size, but the data size has not changed: in each of those cases it remained the same. We did do an ablation with one of the model sizes, though, to see whether increasing the data amount helps or not. >> Thanks. >> Thank you, and apologies in advance if it's not quite a question, but I'm wondering if you tackled something on the side of motion, using the example of the traffic in Turkey, if I remember correctly: telling apart whether the movement in the frame stems from the camera that's moving, or maybe the actual movement of the objects that are in the frame and that maybe just so happen to occupy the... >> So, what's the question related to this? I think that's the first question: have you tackled differentiating the relative motion? >> No, there's no assumption on what the camera motion is or what the scene is. Obviously there is some motion happening in the videos, because imagine training on 170 million clips of, I don't know, nothing happening; that's not really interesting data. However, we don't have any assumption in the data itself about the camera pose. >> So, for example, the difference between movement from the camera shifting and the actual movement of the objects in the frame is semantically indistinguishable for the model.
>> But then you don't really care whether the movement is only apparent, because the camera is moving, or whether the movement is actual? >> I think the question is whether we thought about this, or put something in the model for it. Is that the question? >> Yeah. >> We didn't have any assumptions on that in the data. >> So you mentioned already JEPA, the joint-embedding architecture. What do you think is going to be the next kind of architecture in video understanding and generation? Would that be brain-inspired, for instance, and how do you think it is going to evolve? >> Well, first of all, I think that's a research question, so that's why it's exciting to be talking about all of this. In this work, we explored MAE, and overall we see that it does a pretty good job. It's still being explored in the community as well, like different ways to implement MAE; there has been a work that came out today about recurrent MAE, for example, that's doing a pretty good job on videos. But apart from that, for the JEPA architectures there has recently been V-JEPA 2, which also includes, for example, robotics tasks, which is quite interesting. So I would say it's not clear at the moment, but that's also what makes research exciting. We are yet to see, I would say. >> All right, thank you very much for the talk. It has been amazing. [applause and cheering]