Okay, is this working? It's complicated to go live on stream, but I think we're getting there. Okay, this is good — seems to work. Hello everyone. Hey man, how are you? Chill out. [laughter] Chill out. How's everybody? I think my mic is good. Someone says, "Your channel has been so helpful as I embark on my thesis and research" — man, that's so kind. You know, I'm making this basically for myself: when I was doing a PhD, I was desperately trying to understand what was going on, and it was hard to ask questions and find your way a bit. So yeah, if you have any questions, don't hesitate. For those who don't know the format: there's a paper, and I'm going to look at it and dive deep into it. We also have the first author of this paper coming on stream, and he'll be able to answer all my dumb questions. [laughter] So if you also have dumb questions, don't hesitate to let us know. If you have general questions too, go ahead — the first hour is just going to be me explaining the RLM, and after that, at 11 I believe, he'll come in and we'll be able to chat with him about everything related to the RLM. I just need to make sure I'm not messing anything up. I have a new setup now, so I'm able to stream a bit everywhere at the same time. I still don't quite have the hang of it, but it's getting there. So, okay, this one is there, and I have the chat over here, so this is good. I think the chat on Twitter isn't working... wait a second... there we go, Twitter is working. Okay, perfect, we're doing great here. So, the paper we're going to review is the RLM paper. You might have seen it float around a bit. It's a great one.
I really, really liked how it was written, and I really like the general flow of it. Let me show you the paper over here — share screen. Right, so it's this one. It's a pretty good paper, and it's basically a harness paper: it shows a new way of plugging an LLM into a long-context type of harness, so that it can ingest more context without the context-rot problem. Overall it's great. It's a lot of pages, but what matters most is the first eight pages, so it's not a terribly intimidating paper to read. What's cool is that they added a lot of information about what went right and what went wrong — qualitative information about literally what the RLM does as it ingests context in different situations. So no, it's a pretty nice one. We're going to check it out in its entirety, and like I said, don't hesitate to ask any question about anything deep-learning related. Let me also check over here... okay, we're doing good on all fronts. Perfect. So let's start with a nice little overview of the whole thing. Here it is. Alex Zhang will hop on in a few, and we'll be able to ask him a bunch of questions — I have about 20 questions of varying usefulness, so we'll be able to check those out. If you have others, again, don't hesitate. I just need to calibrate my stuff. Okay, so this paper actually comes from an idea that was postulated in a blog post in October. He has been actively working on it and refining the idea, and now we have a paper — it's exactly like the blog post, just pushed a bit further.
So if you want to understand a bit more about how research ideas are formed, take a look at the blog post, because it's much rawer in its intuition — you see what went through his head while he was thinking about long context, and what he followed in order to do the actual, bigger experiments; you see what scaled from there to here. So it's pretty cool. I suggest you read both: one for the intuition, one for the rigor of how it's written — and it's very approachable. This is a long-context paper, so it deals with having models ingest much more context than they theoretically can ingest normally — or they can ingest it, but near the edge of the context window, the quality kind of deteriorates. You've most likely already encountered this when chatting with ChatGPT or whatever: as the discussion goes longer and longer, it just gets dumber and dumber. The usual intuition is that you summarize the information, or you take two chat windows and merge them into a bigger one, trying to distill out some of the information. It's that kind of idea, but coupled with a fact about LLMs: they're very, very good at coding. That's the core intuition here, and what they did in order to ingest an even bigger context than the window allows is very interesting.
They put the prompt — which is absolutely massive in some cases, like 10 million tokens — inside an environment variable, in a REPL setup. So the gigantic prompt is there, but the model doesn't have it in context; it just digs into it again and again through this type of workflow. It prints, say, the first 100 lines of the prompt, then continues like that, slicing it up a bit in a Jupyter-notebooky way — you have the dataset and you play with it to understand it. It's the same idea. What's interesting is that it can delegate some of the tasks to another agent: the agent gets a prompt and can interact with the same data in a different way, but with a subtask. So here it's something like "in chapter one, find all items listed as belonging to whoever", and that LM will do it, return the response, and they do this back and forth. It doesn't recurse to depth two or three or four in practice; it comes back to the main root node. At some point it needs to give a final answer, and this is all happening in a REPL environment, so the information is getting stored in variables that get created — part one, part two, whatever — and it uses these variables: "okay, this is what's in this variable, so I can use it". It can create for-loops and a whole bunch of stuff. It's really free-form coding happening here. At the end it outputs a response, and that's it. So this big thing is the RLM: the LLM with the environment attached to it.
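The mechanics described above can be sketched in a few lines. This is a minimal toy, not the paper's actual implementation — the names `RLMEnv` and `llm_query` are my assumptions, and the sub-LM call is stubbed out so the sketch runs:

```python
# Minimal sketch of the RLM harness idea: the huge prompt lives as a
# variable inside a Python REPL, and the model only ever sees the
# *printed output* of code it writes against that variable.
import io
import contextlib

def llm_query(sub_prompt: str) -> str:
    """Stand-in for a recursive sub-LM call with fresh context.
    A real harness would send sub_prompt to a model endpoint."""
    return f"[sub-LM answer over {len(sub_prompt)} chars of input]"

class RLMEnv:
    def __init__(self, giant_prompt: str):
        # The full context is stored here, NOT in the model's window.
        self.ns = {"prompt": giant_prompt, "llm_query": llm_query}

    def run(self, code: str) -> str:
        """Execute model-written code; only the printed output
        (usually a small slice or summary) goes back into context."""
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf):
            exec(code, self.ns)
        return buf.getvalue()

env = RLMEnv("chapter 1\n" + "filler line\n" * 100_000 + "the needle\n")
peek = env.run("print(prompt[:9])")          # probe the start
hit = env.run("print('needle' in prompt)")   # cheap programmatic check
```

The point is that `prompt` can be arbitrarily large: the root model's context only ever grows by what `run` returns, which the model itself keeps small.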
Now, if you look at the performance, it's actually pretty interesting. There's a figure here showing S-NIAH — needle in a haystack — then OOLONG, and then another version of it that's a bit harder. NIAH is like linear complexity, while the harder one is like quadratic complexity in the length of the prompt, so it's much harder even for the same amount of input. GPT-5 is good at needle in a haystack because, yes, it's a lot of context, but it's not complicated: you can do one scan through it and you're done. These other tasks require doing a lot of computation, and you have to do it in one pass, which is hard. When you take the same model — no post-training or anything — and put it into the RLM harness, what you get is: needle in a haystack is still good, and it's even good outside the context window, which is super interesting — it's past the theoretical limit of what the model can ingest. [clears throat] If you've seen my stuff for the past year, I've been talking a lot about quality long context, because I really think that if you can crack the quality long-context problem, a lot of slightly more complicated things immediately become available to the models. And this seems to be a way to do it almost for free — there's a cost, but it's not crazy.
So this whole region that was not possible before is now possible — all of this. And on top of that, if you look at the curves, they're straightening out: OOLONG is getting better, and OOLONG-Pairs — which is quadratic in complexity per input length — is also doing great. At 1 million tokens we're still good, and I think they pushed it to 10 million, which is insane, and it's still working fine. And if you look at the average API cost in this figure, it's not terribly more costly. You see the RLM with GPT-5 on OOLONG-Pairs: that one scales up a bit because the task is quadratic in complexity, so it has to do more processing, but it's possible in this regime, which is just impossible for the other setups. The RLM on OOLONG is still doing super fine, while GPT-5 alone is struggling — GPT-5 alone is cheaper, but for lower performance. So it is more costly in general to use this. Things get a bit murkier at smaller contexts, because there you don't even need sub-agent calls or anything; the LM can just answer straight up. But this shows that, yes, it's more costly, but compared to the other methods we're going to see, it's not terribly costly. It seems to be a paradigm that makes a lot of sense. From what I understood from the paper, you basically attach more LMs on top of the main one and use them kind of as storage — but it's a bit more complicated than that. The main storage is really this environment running on the main node, and then you can spawn agents to just go do something.
I mean, you can call it an agent, but it's literally just another RLM, right? And that RLM can do stuff: it has a REPL, and it can do some analysis and work in there. So that's what's going on: you can spawn them at multiple points and then aggregate their information to various degrees. Sometimes it doesn't spawn the sub-agents at all — and in some instances it's actually better not to spawn them. But in other cases, if you have to do some semantic-search type of stuff, semantic understanding, you're much better off delegating it to an LM; the LM can take the information, ingest it, and do something with it. So that's the element. And if you look here, you see it splits the prompt on chapter two into part one and part two, and it feeds the context of part one into the sub-call. You see that? The sub-call gets a tighter slice of what it's going to look at. So it's more like sub-processing of the information than pure storage — and then afterwards you take all the information and put it together. That's the general idea. Hey man, how are you? You doing good? We have a lot of old faces here. It's a fun paper; I really like this one. Cool effect: we have iMuffin saying he's working on this as we speak — cool man, let us know what's up. There are also the folks from Prime Intellect, who did a whole bunch of experiments on it, so if you want to take a look at their paper, it's also a good place to start. I think they did a bunch of ablations and looked at other models, the GLM and stuff like that. Okay.
So, look at the long-context benchmarks they're using. For those who aren't familiar: there aren't terribly many of them. And it's not just about having massive tasks; it's also about having tasks that are a bit more complicated. With more complicated tasks, you can actually understand better how the models are handling it, because if you look here, all of these tasks have the same amount of input context, but this one is much easier — the difficulty is kind of constant; it never gets that much harder. That's why you see Llama 4 Scout being able to do it at 10 million: it's an easy task. But as soon as you have a slightly more complicated task, it gets more difficult, and there are even harder ones among the benchmarks they used. [snorts] Wait, we have a question here: how does this solve the context issue — won't the LM eventually run out? It's a bit different than that: the prompt is not loaded in context. So you can jam 100 million tokens in there; it's not in the model's context. What is in context is the system prompt and the tooling to run the REPL on the data; the model then takes slices of it and spawns the sub-agents. The system prompt is actually what tells the model to do all that. Maybe if we push it to 100 million it will run out, because even the slices it's taking and the steps will be too much — but it works up to 10 million right now. It's taking slices of the information; it knows what the question is, and it's being very careful not to run out of context.
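One way to see why the root model never runs out: it only ever holds metadata and bounded slices, not the raw text. Here's a toy sketch of that budgeting idea — the 8,000-character budget and the chunking scheme are my illustrative assumptions, not values from the paper:

```python
# Sketch of how the root model stays within its window: it never loads
# the raw prompt, only sizes and bounded (start, end) slices it can
# later hand to sub-LMs with fresh context.
BUDGET = 8_000  # illustrative per-slice budget, not the paper's value

def plan_chunks(text: str, budget: int = BUDGET):
    """Return (start, end) slices small enough for one sub-LM call."""
    return [(i, min(i + budget, len(text))) for i in range(0, len(text), budget)]

giant = "needle at position 123\n" * 50_000   # ~1.15M characters, never printed
chunks = plan_chunks(giant)

# The root model's context only ever holds small numbers like these:
meta = (len(giant), len(chunks))
```

So whether the stored prompt is 1 million or 100 million characters, what enters the root model's window is a constant-size plan, not the text itself.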
Then it spawns these sub-agents, which always have fresh context, so you can always spawn more of them to do that work, and their results get aggregated. You're getting the final answer by digging programmatically through the stuff. Okay, that's good. Now, the benchmarks that are used. First, S-NIAH — needle in a haystack — everybody kind of knows about it: you have a document that's pretty big, a super mega-long text, and you have the needles. Usually it's just Paul Graham essays, but you shuffle them and jam the needles into whatever the long-context filler is. Then you ask a question; the model gets the question and the document, has to parse through it and find the right stuff, and that's the answer. So that's the idea of this one — the easy long-context benchmark. "Alex Zhang, MIT guy?" Yes, he's an MIT CSAIL [clears throat] PhD guy. "So the main model just orchestrates the communication between multiple agents with fresh memory?" Yes, it does that, but not just that: it can also not spawn the sub-agents, and it will still be good — with just the REPL environment it still does pretty well, actually.
Next there's BrowseComp-Plus. BrowseComp is a benchmark for browsing agents, and what they're doing is augmenting it: they augment the corpus with gold information — information retrieved by o3 and human-verified — and with hard negatives, which is information that makes sense in some way but is not the information you need. So you have these hard negatives and this gold information all piled into the BrowseComp-Plus benchmark, and that's what's used here, and you can go up to 1K documents. The benchmark provides a verified offline corpus of 100K documents, guaranteed to contain gold evidence and hard-negative documents for each task. There's a whole bunch of tasks, but they use 150 randomly sampled tasks as the evaluation set, and provide 100 randomly chosen documents to the model. So 100 docs are given, with the guarantee that the gold evidence is in there, so the task is actually doable. This is a bit more difficult than NIAH — there's a gradation of difficulty here. Then we have OOLONG: there's a whole bunch of documents, and you have to piece the information together across all of them. You're not just going to find it in one doc; you have to piece information together from multiple places. Same thing with, say, a transcript of 12 hours of dialogue in some game. So that's OOLONG.
Then they decided to make it even harder: they manually modified the trec-coarse split of OOLONG to include 20 new queries that specifically require aggregating pairs of chunks to construct the final answer. A task will be like: "in the above data, list all pairs of user IDs where both users have at least one instance with a numeric value or allocation", where each question can be labeled as one of the label descriptions — abbreviation, concept, entity, blah blah blah. So you have to take pairs of information and find them in this mesh of data before you can output something useful. A bit harder — this is supposed to be quadratic in difficulty. The last one is LongBench v2. For LongBench, there's a whole bunch of long documents; data annotators go over them, there are some revisions, and then a manual review at the end. So: massive, mega-big documents, with single-document QA and multi-document QA — and what's interesting for us is the code repository understanding slice. It's QA — 50 questions — on literal code repositories, and you have to dig into the code in order to understand it and answer the questions. That's the slice they're taking. It's a bit weird, but it's actually fairly difficult to make good long-context benchmarks, even when you have the multi-turn type of thing.
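Why "pairs" makes the task quadratic is easy to see in code: instead of one scan, you have to compare every pair of per-chunk facts. A toy sketch — the data and predicate here are made up, not the benchmark's actual schema:

```python
# Why OOLONG-Pairs scales quadratically: answering requires checking
# every pair of extracted facts, not a single linear scan.
from itertools import combinations

# Suppose each chunk yields (user_id -> has_numeric_value) facts.
facts = {"u1": True, "u2": False, "u3": True, "u4": True}

pairs = [
    (a, b)
    for a, b in combinations(sorted(facts), 2)  # O(n^2) candidate pairs
    if facts[a] and facts[b]                    # both must satisfy the predicate
]
```

With n users there are n·(n−1)/2 candidate pairs, so doubling the input roughly quadruples the work — which is exactly why a one-pass read of the context struggles here.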
There's a big corpus the model needs to dig into, but surprisingly — at least there's more now — there weren't that many benchmarks already structured in a way where you can test and train models on them. That's maybe also why some models are not that great on long context: they were never trained with a long enough context at all. Okay, that's good. Hey Arman, hope you're doing great. So the method is like this: there's the RLM, the environment holds the prompt, it's loaded as a variable, and the model can play with it. The context is loaded into a variable and the model has access to it — that's the situation we have here. It can spawn these agents, and the system prompt looks roughly like this: "You are tasked with answering a query with an associated context. You can access, transform, and analyze this context interactively in an environment that can recursively query sub-LMs" — in the ablation they remove this part — "which you are strongly encouraged to use as much as possible... iterate until you provide the final answer." And then there's a whole bunch more. They use roughly the same system prompt for both GPT-5 and Qwen3-Coder, but for Qwen they had to add this [laughter] — they had to say this little thing: "be very careful about using llm_query as it incurs high runtime cost; always batch as much information as reasonably possible into each call" — because Qwen was always spawning agents, for every line or something like that. So they had to restrain it a bit. GPT-5 doesn't do this out of the box; it just uses it properly.
But Qwen just doesn't care at all [laughter] — which I found kind of funny. We have Noah saying, "I didn't know LinkedIn had live streams." I didn't know either; it just works. [laughter] [clears throat] So don't hesitate to ask questions on whatever platform you're on. So this is the system prompt — you can take a look, it's in the paper — and it's the only thing guiding the LM. Technically the RLM is just this. There's no RL being done here whatsoever; it's literally just a good system prompt with the REPL environment. That's it. So, this is good; let's take a closer look over here. There are a bunch of patterns that emerge in the models' behavior. They're not told explicitly what to do, but they end up doing a whole bunch of things. For example: they can probe the context and then interact with the probe, with regexes or with semantic sub-agent calls. They can defer some of the reasoning over large contexts by creating recursive LM calls — they'll write a function and recursively ask a bunch of questions, everything in code, building each prompt and shoving it in there. And they can stitch recursive LM outputs together to form longer composite outputs, which is pretty interesting. I really like the way it's set up; it's very natural. It feels a bit like the model is discovering what it can and cannot do. One pattern they see is filtering the input using code execution based on model priors — the ability to filter input context without explicitly seeing it.
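The "defer reasoning through recursive LM calls, then stitch the outputs" pattern mentioned above can be sketched like this. `llm_query` is a stand-in for the harness's sub-LM call, stubbed here so the sketch runs; the chunking scheme is my assumption:

```python
# Sketch of the map-over-chunks pattern: chunk the stored context,
# ask a fresh sub-LM about each chunk, and stitch the answers
# together in code to form one composite output.
def llm_query(prompt: str) -> str:
    # Stub: a real call would return a model completion.
    return f"summary({prompt.splitlines()[0]})"

def map_over_chunks(context: str, question: str, chunk_size: int = 1000) -> str:
    chunks = [context[i:i + chunk_size] for i in range(0, len(context), chunk_size)]
    partials = [llm_query(f"{question}\n\nCONTEXT:\n{c}") for c in chunks]
    return "\n".join(partials)  # composite answer, built in code

ctx = "chapter-A\n" + "x" * 1500 + "\nchapter-B\n" + "y" * 1500
out = map_over_chunks(ctx, "List items belonging to people")
```

Each sub-call starts from fresh context, so the stitched result can be far longer than any single model's window would allow.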
They don't know what's in the prompt, but they filter whatever they need to filter based on prior information and prior ideas about what should be in there. The model doesn't need to load much in order to filter and get what it needs — things like running regexes for information that should technically be there, or, as you see here, "find all items that are listed as belonging to people". It does a lot of inferring what should be in the data, then actually goes in and does it, or has another model do it. They also do a lot of chunking, like we saw: splitting the prompt into chapters and asking a sub-LM to go fetch information within a chapter. And they do a lot of verification [laughter] — this is the part I find funny. Sometimes they just panic and verify, verify, verify. There's a failure case that's super funny, with Qwen3-Coder I think: it verifies forever, gets the right answer, then [laughter] discards everything it did and returns the wrong answer. So the model can verify its information through sub-LM calls, and it can pass recursive LM outputs into variables for long-output tasks. That's a way to safeguard its context: this output might be gigantic — you don't know, maybe it's 20K tokens — but to the model it's literally just one variable. It uses these variables and stitches them together; it doesn't have to see the contents. It can just trust that it's good, or maybe verify a bit.
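That variable-encapsulation trick looks something like this in practice — again a stubbed sketch, with `llm_query` as an assumed name and made-up sizes:

```python
# Sketch of "variable encapsulation": a sub-call's huge output stays
# in the REPL as a variable; the root model only puts cheap metadata
# (length, a short head) back into its own context before deciding
# whether to open it.
def llm_query(prompt: str) -> str:
    # Stub standing in for a sub-LM that returns a very long answer.
    return "ITEM: lamp\n" * 5_000

part1 = llm_query("List every item in chapter 1")   # never printed whole

# What the root model actually sees in its context:
report = f"part1 holds {len(part1)} chars; head: {part1[:11]!r}"
safe_to_open = len(part1) < 2_000                   # budget check first
```

So a 55,000-character result costs the root model one short report line, not 55,000 characters of window — that's the safeguard.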
Or the model can just look at the length of a variable and decide whether it's a good idea to open it up and load it into context. So variable encapsulation is a very interesting way for it to safeguard its context — it's always operating in a healthier region of the context window than raw LM calls would. Now, the baselines. We have the RLM with the REPL, and the RLM with no sub-calls — in the no-sub-call setting, all the sub-agent spawning is just gone; you do everything in the REPL directly, and already that's better in some cases. They also benchmark against a summary agent, which is another common methodology: basically summarization. There are different ways of doing it, but it's always "whatever happened in the past is now summarized into a smaller format, then you add the next piece of context and keep going, and once it gets too big you compact it — you compress — and keep doing that", a bit like what happens with Claude and such, the compaction type of methodology. The issue is that if the information is very sensitive to its exact format, it can get lost after too many compactions — it's lossy long context. So that's what they do here: in iterative fashion, the agent is given input until its context is full, at which point it's instructed to summarize all relevant information and continue — and they use a smaller agent for the summarization.
The last baseline they benchmark is CodeAct, which is in the ReAct family of frameworks — I always forget exactly what ReAct stands for, but it's: you think, you take an action in the environment, and you get a response, back and forth. In CodeAct, you think, and then the action is code: you shove the code into the environment and you get a response, and you can do that multiple times. So it's similar to what happens with the RLM, but within the ReAct framework for agents — the key element being that it uses code as the action instead of doing a bunch of tool calls with JSON in, JSON out. The main result is what we saw earlier: it works, which is fantastic to see, and the pricing is not too crazy. It's highly variable, though — some calls are massively more expensive than others. In the 50th to 75th percentiles it's pretty similar all across, but prices start to explode with GPT-5 at the 95th percentile: CodeAct and the summary agent get a massive multiplier there, while the RLM is roughly double — still not crazy high, and much cheaper than those, with fewer biases about what it can be doing. Qwen is a similar story, but Qwen seems to be more pricey — and remember, Qwen is the one with a tendency to just go ahead and not care too much about compute cost. Now let's look at the exact numbers in more detail. We have these benchmarks — CodeQA, BrowseComp-Plus, OOLONG, and OOLONG-Pairs — with the rough task lengths.
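To make the CodeAct comparison concrete, here's a toy version of its think → code → observe loop. The `pick_action` policy is a hard-coded stub standing in for the model, so this is an illustration of the loop's shape, not a real agent:

```python
# Toy sketch of a CodeAct-style loop: each turn the "action" is a
# Python snippet executed in an environment, and the printed result
# is the observation fed back for the next turn.
import io
import contextlib

def pick_action(observations):
    # Stand-in for the model: two scripted turns, then stop.
    script = ["print(2 + 2)", "print('done')"]
    return script[len(observations)] if len(observations) < len(script) else None

def codeact_loop(env_ns):
    observations = []
    while (code := pick_action(observations)) is not None:
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf):
            exec(code, env_ns)               # action = run code
        observations.append(buf.getvalue())  # observation = its output
    return observations

obs = codeact_loop({})
```

The structural difference from the RLM is what lives in `env_ns`: the RLM additionally stores the entire prompt there as a variable and exposes recursive sub-LM calls, whereas plain CodeAct is just this act-observe cycle.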
You see here BrowseComp-Plus can go up to something like 11 million tokens, which is kind of wild. The base model just isn't able to do this at all — that part is impossible for it; it's too big a context to ingest, so you need a harness to do anything. If you put on the CodeAct harness or the summary harness, it gets better — you have 12% here, 30%, 8% — and it does better on CodeQA generally. OOLONG-Pairs is quadratic, so it's a bit more difficult; CodeAct is doing not too badly on OOLONG. But the RLM with no sub-calls, with Qwen, is actually doing better on these two benchmarks, and it's still doing really fine here with no sub-calls at all — so that's just one LLM in a harness, more similar to CodeAct than anything else. When you give it the sub-agents, funnily enough, Qwen3-Coder's performance drops a bit on this one, but increases on these others, and it's fairly similar over there. OOLONG and OOLONG-Pairs are more semantically oriented, while CodeQA is more exact-match type stuff — so it kind of makes sense that no-sub-call is better there. This suggests some models have more trouble deciding when to use which capability, for which type of case. And then GPT-5 is just generally much stronger than Qwen — except in the no-sub-call case here — but overall its performance is better across the board, and better with sub-calls, so it seems better at deciding when to use them. So, we have a bunch of observations here. If you have a question, by the way, just ask; I'm going to keep going through the whole thing. Wait a second... yeah, good. The first observation: you can scale the RLM to the 10-million-token regime, and it outperforms the base LM.
10 million is crazy, right? But it's just able to do it, kind of out of the box, without even sub-calls — just by having the REPL environment. Second observation: for long inputs, the REPL is necessary, while recursive sub-calling provides strong benefits on information-dense inputs. If it's coding, sub-calling is a bit less relevant — you see it even with GPT-5, because the task isn't that semantically complicated; it's a bit more sparse. Third: LM performance degrades as a function of input length and problem complexity. You can clearly see the degradation here — it's not just one axis of length; it's an axis of length and an axis of task complexity, and together they tell you how much degradation to expect. So it's really task-specific. That's something to keep in mind if you're working on long-context stuff: if you can make your task easier, it's always better — generally, just always better. Fourth: the inference costs of the RLM remain comparable to a base-model call, but are high-variance due to differences in trajectory length. We saw the high variance over here. But look — I didn't catch it at first — this is the cost, and this is the variance. The variance is kind of high for no-sub-call, and kind of high here too. So you have to check those, but it's generally not crazy costly. That's about it. And I think they did an experiment where they scale the number of documents.
So here you have 100, you have 10,000 documents in context, and in this experiment in the appendix you see what happens with the degradation. Because remember, on BrowseComp-Plus the base models are just not able to do that one at all. But if you just increase the number of documents, you can see the degradation: GPT-5 just nosedives straight down — with 10 documents it's good, with 50 barely, and at 100 it's almost nonexistent, right? While the RLM is doing okay all across, and even better when it has all the information over here, for some reason I'm not sure about. And without the crazy cost: here you see ReAct is much more expensive at 1,000 documents, but the RLM is still relatively cheap. So they're able to scale well without performance degradation, and the inference cost scales not too bad. Yeah, that's kind of that. So that's it — that's the RLM. Any questions, folks, on what we just saw? We see there's a comment here. I believe it's that, plus modifying the prompt, plus the REPL. The REPL is doing a lot of the heavy lifting, right? Because the context is stored inside this sandbox environment, and it's getting processed and manipulated. That's kind of the magic of it. And then there are also the sub-agent calls being fired up over and over again, especially in Qwen3 [laughter] — we're going to see what's going on. And so there's a bunch of trajectories that they put in the paper, which are pretty cool, right? You have this one with GPT-5, which is a happy path: you have a thousand documents over there, and you can literally see what the model is doing in code, because that's how it interacts with the environment. So it's searching for specific keywords and looking at very specific steps.
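The kind of REPL trajectory being described might look roughly like this. All names here (`llm_query`, the stub documents) are invented for illustration — this is a sketch of the interaction pattern, not the paper's actual API:

```python
import re

# Hypothetical sketch of an RLM-style REPL trajectory.
# The long context lives as a plain Python string in the sandbox;
# the model writes code like this instead of reading it all at once.

def llm_query(prompt: str) -> str:
    """Stand-in for a recursive sub-LM call; a real RLM would call a model here."""
    return "It is about the town festival honoring Maria Dalmasio, held in 1912."

# Imagine this is ~10M tokens of concatenated documents.
context = (
    "doc 001: unrelated filler text ... "
    "doc 002: The annual town festival honoring Maria Dalmasio was held in 1912. "
    "doc 003: more unrelated filler ..."
)

# Step 1: probe the context -- how big is it?
print(len(context))

# Step 2: regex search for keywords the model chose itself
hits = [m.start() for m in re.finditer(r"festival", context)]
print(hits)

# Step 3: pull a window around a promising hit
snippet = context[max(0, hits[0] - 40): hits[0] + 120]

# Step 4: fire a sub-LM call on just that small snippet
print(llm_query("Extract: what festival is this about, and what year? " + snippet))
```

The root model only ever loads small snippets into its own context; everything else stays in the sandbox as data.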
You see the window here, and it's trying to find snippets that make sense. And these are, I don't know [laughter], the keywords it's looking for — so it created the keywords, it's looking for them, running the regex query, you get the response — and I think this one is making a sub-LM call to find the answer. This is what it's asking of one of the sub-LMs: the root node is asking a sub-agent to "extract the following from this article: what festival in town is this about, and in what year is this specific celebration held," and so on. And then the sub-agent will respond with all of this, and based on that it will check the information, right? That's what it's going to have. And then it's able to figure out that, yeah, this is the answer: it's Maria Dalmasio. And this is what it will output at the end, right — print winner, first and last name. So it's cool: none of this is hardcoded or tooled; it just goes and does Python. Yeah. "What's the visualization platform where these screenshots were collected? Looks cool." They actually mention it in the paper: they just vibe coded it. Literally, they vibe coded it and then they're using it. Maybe it's in the GitHub repository, you can check it out — all the code for the RLM is open source. "So I haven't read the paper, but how well does this work with other modalities, particularly audio and visual?" That's a great question — we're going to ask him that. Well, Alex is there. Alex, you can answer this one if you want to, or you can wait like 8 minutes before you come on stream. But this is mainly text-based, yeah. Is this the funny one? This is the funny one. We're going to take a look at this one. Right, this last one, right?
So this one cost a dollar [laughter], and this is the question, and it's on OOLONG-Pairs, so it's kind of difficult — you have to double-check and verify a whole bunch of stuff. It says the model begins by probing the context with various code snippets, right? It's probing it. And then it decides to check things semantically and classify the data using a sub-agent call. Then it goes and processes this in batches — you see it created this function which does the LM query, and it's continuously calling this function — and then the recursive LM calls do their thing. Then the root LM checks whether a given instance satisfies the query, in this case, and it has the pair and the information. And then what it does here: it continuously verifies its answer. This is Qwen3 Coder, right? [laughter] So it repeats this process, attempts again to regenerate the answer, does that five times over and over, and it returns the same answer each time. And when it finally has to give an actual final answer, because it's starting to run out of context, it's just the root LM generating an answer out of nowhere and spitting it out — which is the wrong answer. [laughter] So, I don't know, I found this part absolutely — it tells you a lot, right? Qwen3 Coder will do that, while GPT-5 seems to do this neurotic double-checking a bit less. Just the fact that it doesn't trust the stuff being produced by, technically, itself — these sub-agents — and keeps verifying again and again tells you a lot about the benchmark, the difficulty of
it, but also about what the LLMs are actually doing when they get to difficult stuff like that. So I think there's maybe something that can be done to steer the model into being more sure about what it's looking at. But it really looked like, I don't know, a student with a thousand documents open, double-checking and double-checking and just not trusting itself. Yeah, this one was funny. There are other ones, but you can read the paper — it's all there, and it gives you a sense of what's going on. And I really like that about the paper: the quantitative information is fine, it's all good, but I really like seeing the exact qualitative vibe of how these things actually run. By the way, you can also try to run it on your own — it's literally a harness that you put around the model. And the last part I really liked is the negative results. Honestly, I think all papers should have that: you found something cool, perfect — now just tell us what didn't work so we don't try it again. So: using the exact same RLM system prompt across all models can be problematic. This is the Qwen3 thing — they had to make that change, otherwise Qwen3 was just spinning like crazy. [clears throat] And this part is interesting: models without sufficient coding capability struggle as RLMs. The corollary — the converse — also seems to be true: models that are good at coding seem to be good at long context, because they can manipulate the REPL environment efficiently. Thinking models without sufficient output tokens struggle as RLMs — that's also an interesting one. So they tested a whole bunch. You can also check out this other blog post from Prime, "Recursive Language Models: The Paradigm of 2026," by Sebastian Mame.
He tried a whole bunch of other LLMs too, so you get a good overview of what's going on. RLMs without asynchronous LM calls are slow — in this case the calls are blocking, so you're doing things step by step. Maybe doing it asynchronously would help, but then you get into the weird setup of: what happens, how do you continue, if stuff isn't done yet with another sub-agent? Which is an interesting problem, I think. And, depending on the model, distinguishing between a final answer and a thought is brittle for RLMs, which is also super interesting — and maybe it tells us a lot about those failure cases, right? Because maybe it couldn't recognize that it had the final answer for real this time, so it should just go and commit, and it still thinks it's maybe just a thought, and it needs to get the final answer again and again. I don't know — this whole interaction is really grad-student-coded, which is a bit fun. There are a bunch of limitations here that you can read about, what the approach holds up to. But we're going to actually ask Alex a whole bunch of questions right about now. So yeah, we can have you hop on whenever you want, and then we can start this up. Okay, I think you're all set up to come over here, and I'm going to pull out my 20-plus questions. Hello. >> Hey. How are you? >> Good, good. How are you? >> Good. Not too bad. Well, thank you for coming and [laughter] answering questions for all of us. I think it's really cool — I really liked the paper. >> Thank you. >> It also reminds me of the TRM paper, actually — very, I don't know, simple, not in the sense that it's nothing impressive, but simple in the sense that it was easy to follow. >> Um, >> So we like that.
To start off, can you tell us a bit about yourself — your background and research interests — for everybody here? >> 100%. Yeah. And I'll preface by saying: hopefully the paper was easy to follow. Actually, in some of the earlier iterations of the paper I had wanted to write in some theory-related motivation — there was a whole thing about why certain problems are harder than others, which I will actually talk about today — but I scrapped it from the paper, mainly because I think we didn't have strong enough evidence to support that what I was claiming was actually true. But intuitively you can see a lot of it. Anyway, for context, I'm a first-year PhD student at MIT. I think it's very funny when people say I'm an "MIT researcher" — I've only been there for like three months. I graduated from my undergrad at Princeton back in 2024. I originally wanted to do math, funnily enough — I guess it's what the school is known for. But I did a bunch of really random stuff: I did some exploration with a friend into blockchain schemes, and all this random stuff; I did a bunch of more core RL work earlier on in my undergrad; and then eventually, near the end, I started working with the SWE-bench team on SWE-bench-related stuff, mainly SWE-bench Multimodal, which came out last year. And then I got really involved in GPU MODE, which is also really random — I was really interested in GPU programming, and I currently help host all the competitions, which is not at all related to RLMs.
But the point I'm trying to make is that a lot of my current research interests and ideas are kind of motivated by lots of random things I've done in the past. And the field moves super fast, so there isn't really one thing to be fixated or focused on. But yeah, currently RLMs are my main research focus — and it's not necessarily about long-context problems, which I'll also elaborate on as we go. But I think that, in general, this year there's a lot of really, really interesting research we can do for language models that isn't necessarily systems work or infra work. >> Cool, yeah, it's really cool. And I see that you've done a whole bunch of benchmarking. >> Yes. >> That's very — can you tell us a bit about what motivated that work too? >> So I will say — I'm not going to lie and say, you know, I love making benchmarks. I don't think making benchmarks is fun. Everyone wants to do the flashy thing, right? You want to train a 100B model and do all this cool stuff. And for context, the benchmarks being referred to: SWE-bench Multimodal is the first one, with the SWE-bench team; then KernelBench, which is LLM-generated GPU kernels and evaluating those; and the most recent one is this thing called VideoGameBench, which is trying to evaluate vision-language models that can play an assortment of games — games from the '90s like Doom, Mario, Kirby, these kinds of things. But I think — you actually talked about it earlier — this RLM work kind of highlights, too, that we don't have a lot of good benchmarks.
And the reason I ended up working on benchmarks every single time was because I wanted to work on a particular problem, but there was just no eval for it. >> That's it. >> And I think this is actually a really big problem in the field in general. I have strong opinions on what evals should look like. We kind of have a problem where current evals are not, you know, a great indicator of how good a model actually is, even on the task you're evaling. So, for example, there are a lot of evals for math and coding, but they're not even great evals for whether or not a model is good at math or coding. So this is something that, you know, needs to change. And I don't think a lot of people like working on evals — I'll be honest, I don't think it's fun — but I think it's really important work. And, unfortunately, I might have to come up with evals for RLMs and future works like this, but we'll see. >> Yeah, I think generally long context is worse because you really have to think hard, right? There are a lot of documents — then what are you going to do, verify each one? No, man. So you have to kind of build the eval and do it properly.
I mean, actually, the people that have a lot of long-context stuff that's relevant for LLMs are the closed labs, and why would they publish their internal data, right? So that's a bit of the problem. And I realized that when I was reading the MiniMax paper, MiniMax-Text-01, because they were talking about this long-context translation task where there's a dead language only a hundred people speak, and there's this book that you give in context, and then you check if the model can do the translation. On paper it makes a lot of sense, but then what happens is that at some point even the newer models are getting good at the language without the book, because they're seeing the book at some point in training. So it's even harder to come up with good long-context evals. You don't really know if the model is actually able to process the longer input. So when you actually give it your stuff, it just sucks, and then you're like, why does it suck? It doesn't have the stuff in context; it's never seen that big of an input. And that's kind of the end of the story — there's nothing you can do with it. Which is actually what interested me a lot with this paper: instead of trying to bring all this information into context and come up with massive amounts of long-context training data, the model can have a small context — that's all fine — but it gets the tools it needs to mine the information, a bit like, I don't know, a PhD or a graduate student. They don't have everything in context; they have the data sitting there, they have their Jupyter notebook, and they just go, and they learn, and at the end of a session with the notebook >> they get some insight out of it.
Maybe it's wrong, but they get some insight out of it, and [laughter] then they can >> move on. >> Yeah. Also, funny story: in the paper there are only four benchmarks, and even the code QA one — I think you mentioned it — is part of a larger benchmark, LongBench v2, which I think is arguably the most difficult long-context benchmark right now. But we actually evaluated on a lot more benchmarks than are shown in the paper, and the problem we kept finding with a lot of them was either (a) the RLM can just solve it, basically — but we don't report this, because the way it solves it is by using the code environment, without even needing the sub-LM calls, which is kind of a silly thing. So one of the examples: another task in that benchmark is computing a long arithmetic sequence, but if you plug this into Python, of course you can just do it — it's not a hard problem. But GPT-5 can't do it — you can't get everything correct — so the scores you'd report are like, oh, the RLM gets 100% and GPT-5 gets 0%. But that's not really an interesting result, because, well, of course, right? >> And the other problem is (b) the base model can just do it — it can actually just solve the task. One of the issues we ran into that was really silly: we had examples where the task would be about a book or something, and you take the book away — I tested this, you just remove the book — and the model can still solve the task. It doesn't need to read the book, because it already knows what's in there. So it's really, really hard to find good evals.
And I think for anyone that's interested in doing research right now, this is an open problem. It's genuinely low-hanging fruit: you just need to find a good eval that's realistic and that people, you know, actually want to see models solve. >> Yeah, I think that part is important — that you actually want the model to be able to solve it — because toy problems are cool, but when you actually bring the model to whatever you're building, already knowing that it can solve a related task is something. Okay, I think we got through the high-level overview. Can you talk to me a bit about the intuition >> Yeah. >> that led to this? Because if you look at the related work, you could have gone all sorts of different ways, but you decided on a REPL environment. What triggered this? >> Yeah, so this is actually a really important question. I don't want to say that, you know, I'm a genius and this is a completely new idea no one's ever thought of — and I have seen a lot of that online, like, "doesn't Claude Code do this," "doesn't XYZ method do this," "doesn't OpenAI already do this in Codex." And I think, to some extent, yes. So the way this idea came about: basically, nowadays, I think models are really, really good. And I want to preface with this because I think this is a very timely idea. If you tried to do this maybe last year — we've tried this with DeepSeek-R1, for example — it's actually not very good at doing this whole thing.
And I think a lot of the fundamental model architecture and training research was really, really important to lead up to this point. I think the best way to think about this is: methods like Claude Code and Codex do this kind of very smart codebase-management thing where, instead of feeding the codebase to a model, they use special tools. I'd actually say this started with SWE-agent and OpenHands last year — this idea of not feeding all of the information directly to the model, and using specialized tools to navigate that information. But this idea was very exclusive to code, or, like, software engineering tasks. And I think the intuition came from: a programmer is not going to read the whole codebase at once. And the core intuition behind this idea is: well, you can actually do this for any task. And actually, Claude Code and all these scaffolds are highly specific to code — you can use them for other tasks, but the models themselves are post-trained specifically to solve coding tasks. But I think the nuance here is: yes, with RLMs the idea is really good, it can do long-context things, and this is really important — you can slot in an existing model and show that it works. For the purposes of the paper, this was the most important experiment we wanted to show. But actually the more subtle thing here — and why I'm excited beyond this initial paper — is the implication that you can actually start post-training models to do this kind of paradigm. And this is a lot cheaper than trying to extend the context window of a model, or build a larger model.
And I think we're starting to enter an era where these models are actually really good — these transformer-based neural networks are really, really powerful. But — and I want to be careful with my words here, because, you know, we should continue to improve models — it's exponentially more expensive to even double the context window of a model. The point I'm trying to make is: they're already so powerful at transforming text of a certain context window size into text of a certain context window size, and you can actually chain these things together and produce a significantly more interesting system without incurring really, really expensive scaling costs. So in my mind, this is another axis of scale that is very interesting. Now, we don't talk about this a lot in the paper, because as a paper we can't claim things we can't prove. >> Yeah. >> But I think this is a really important point — I think it's actually why, for example, Prime Intellect is really interested in this approach. It's not necessarily just the long-context part; there's a piece of it that's really, really important, which is that maybe all future models will actually interact with your context in this way. >> Yeah, for me, what sparked my attention here is that it has a similar shape, in terms of the thinking, to the chain-of-thought kind of trick, right? Because it's simple enough, it works across the board, but then you can actually RL it so that it's even better, right? And we haven't done this yet, but I think it will, since it has the same characteristics as that setup.
And the other thing: yes, coding is economically super important, I think, for the whole field, because if this starts to work even better on bigger codebases with just a simple framework anyone can use, then there will be more tools that can work on bigger codebases, and you get into enterprise, and the money is there, right? This I understand. But >> in my view it unlocks the whole scientific-research-agent type of stuff — agents able to go for longer amounts of time >> and dig into things, because fundamentally this is a grad student. [laughter] That's the feeling I had: hey, I have this big task, and it's complicated, right? It's super long, there's a lot of stuff I need to take a look at, so I'm going to parse through it a bit, figure out "this is relevant, this is less so," and give myself a task for next week — and my task for next week is to figure out this chunk, so you go and figure out that chunk and get it out, and then you keep working. And this flow — and literally it's almost a notebook, a Jupyter notebook — this flow leads to some amount of facts or information being found that can then be used to answer a bigger question, right? >> This I really liked — it has this setup. The other thing I really like, and I want to check with you whether it was on purpose: it's very minimalistic. There's no massive harness, no bells and whistles — it's a system prompt and a REPL. Was this done on purpose, or did you try a whole bunch of stuff before getting to that? >> Yeah, so this was intentional, but we also tried a lot of stuff before that.
So I think what we ended up settling on — I don't love calling this a scaffold, because, to be clear, it totally is a scaffold, but I think "scaffold" makes it sound like this is a new type of agent we're building that we want you to use. More fundamentally, what this is is a very particular way to do model inference. And keeping it as minimalistic as possible is very important for that, because you can't afford to train a model just to be used as an agent — unless you're Anthropic and your goal is to sell Claude Code. If we're more interested in general model capabilities, we want something as, you know, thin as possible on top of the model. And I think that was the guiding principle for how we designed this. >> That makes a whole bunch of sense. Wait, there's a question here. >> Yeah. >> "Just wondering, are there problems where the context expands instead of contracting at the meta level, which you might want when it all can't fit in the LM's context?" I think it's talking about fact generation and stuff like that — do you have a... >> I think I understand the question. There was a similar question earlier about what happens if the context window of one of the models fills up — one of the intermediate models. Yeah, this is definitely a concern. I think in the long term, the hope here — one of the core ideas of an RLM — is that no single model call should ever exceed a certain length. That's kind of the hope right now; how you actually guarantee this is not super easy. I think what we found in our experiments is that if we just let this thing run as is, it actually just never fills up. It never even gets close to filling up.
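The invariant Alex describes here — no single model call ever exceeding a certain context length — can be sketched as a toy recursion. This is an illustration of the idea, not the paper's implementation; all names are hypothetical, and the binary split is just the simplest possible splitting strategy:

```python
# Toy sketch: an over-long context is split in half, and each half is
# handled by a child RLM, so the per-call context budget holds at every
# level of the recursion tree. The parent only ever sees the short
# answers returned by its children.

MAX_CHARS = 200   # per-call context budget (toy number)
max_seen = 0      # track the largest context any single call received

def llm_call(prompt: str, context: str) -> str:
    """Stand-in for one bounded model call."""
    global max_seen
    max_seen = max(max_seen, len(context))
    return f"partial answer from {len(context)} chars"

def rlm(prompt: str, context: str) -> str:
    if len(context) <= MAX_CHARS:
        return llm_call(prompt, context)
    # Recurse: each child is itself an RLM, so the budget is respected
    # at every depth; the merged child answers stay short as well.
    mid = len(context) // 2
    merged = rlm(prompt, context[:mid]) + " | " + rlm(prompt, context[mid:])
    return llm_call(prompt, merged)

print(rlm("find the answer", "x" * 5000))
print("largest single-call context:", max_seen)
```

However deep the input forces the recursion to go, `max_seen` never exceeds the budget — that's the whole point of the splitting.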
But you can imagine that on harder and harder tasks, maybe it does fill up. I think what this should ultimately look like, in its full form, is a recursive language model spawning another recursive language model — and in that sense, the actual intermediate model calls will never exceed a certain length. You also could implement tricks like compaction, which Claude Code currently does. I don't love this, because I think it kind of takes away from the core idea, which is that >> this entire process should be a, quote-unquote, no-information-loss process. And what I mean by that: the reason why we store everything in the REPL is that the model should technically, in theory, have access to all the information in its purest form, not in a compressed way — and you kind of want to maintain that throughout the trajectory of the model as well. But yeah — sorry, the short answer is: we haven't run into this issue in the current experiments we've run, but it's totally a plausible issue. And I think it does get solved by deeper and deeper recursion, because the idea is you keep splitting up the context. But we don't have a strong, robust guarantee here. [clears throat] Maybe that's a future-work kind of thing as well. >> And in my view, compaction is kind of interesting in the sense that if you're working toward a specific goal, then compaction kind of makes sense, right? But I come back to this grad-student type of workflow. Imagine the grad student cleaned the data in a specific way and then threw away the raw data — you'd be kicked out of the lab ASAP, right? You do this and you mess up, I don't know, the filtering or whatever it is.
Well, congratulations, right? What the heck are we going to do now, right? >> And I always come back to the scientific-discovery shape, because I think this has high potential in that realm: you have the thing in its raw form, and you're digging through it. Right. >> Right. >> And I think what Mataki is hinting at is that there are these experiments now, for scientific discovery, where the model runs for days or whatever, and at some point it creates a lot of context — say the REPL transcript is literally a hundred meters long in terms of back and forth — so at some theoretical point it should fill up the whole context, even if the output should just be one answer, a name, whatever it is. >> Yeah. >> But having the raw data and digging through it is already a big piece of the puzzle of being able to generate insights and facts out of it. I have a literally dumb proposition here, which [laughter] amounts to literally jamming a MySQL database in there and storing a bunch of facts. You already have the REPL there — why not add a MySQL instance and store a bunch of stuff? But we're going to get to this. Before we dive into that, I just want to get your raw thoughts about long context in the model itself, right? I've done a video on this and I've been digging through it — it's just literally a hard problem, because you always get into this kind of trade-off, right? And to make it good, say you linearize the attention or whatever the heck, you do block stuff, you do lightning attention, you do all sorts of weird, wacky things, and then you optimize GPUs to make it actually faster, right?
Do you think there's still something there to juice, or is it just the wrong way of thinking about the long-context problem? >> Yeah. So I think it is the wrong way to think about it, and I'll explain why. Scaling the context window of a language model has two main challenges. The first is the systems-level challenge. Attention is actually not even the issue, usually, but let's say attention is quadratic; maybe you can use linear attention, sliding windows, stuff like this, and maybe you need more GPUs to train a larger model, et cetera. Maybe you're 10x-ing the cost of your training run. This is definitely a challenge, but my take is that if this were the only challenge, we would be able to keep improving the models significantly; we would be able to extend the context window much further beyond what we currently have. I believe this, and I think if you ask anyone in the systems community, they would likely agree. Maybe there's some really strange reason why you can't, but in its current form, scaling compute and scaling model size is purely a cost issue; I don't think we've hit that wall yet. I think the more subtle issue is the data, and this is a core reason why I think RLMs are so cool. We often take for granted that the way we've trained language models is effectively using the internet, using naturally occurring language, and learning that distribution. But this naturally occurring language distribution is not unbounded in length.
The sequences we observe in the wild tend to be distributed with some mean length and variance, right? And I think we have gotten away with language models continuing to improve because we have these naturally occurring sequences that capture the distribution we want. The way we have done longer and longer context is by generating synthetically long sequences and training on them. The problem, though, is that it's not fully convincing to me that doing this nets you any longer-term benefit. I think the greatest example of this is the practical failure of reasoning. Reasoning models are really good, don't get me wrong; they're amazing, a great breakthrough of the past year. But a lot of papers have come out recently that basically show experimentally that reasoning is this really silly thing, because the actual content of the reasoning trace is almost irrelevant to the final answer.
Part of the reason this happens is that, at this scale, as you get longer and longer sequences you need exponentially more data to fit a proper distribution; there's an entropy argument here. [snorts] And I think what's going on with these long reasoning chains is that the good part of a long chain is that it conditions your model well to produce the right output: you can think of it as a way to pick out the correct output from among the candidates, and we've seen reasoning as a way to do this. But what ends up happening, and we saw this with the Qwen experiments and the RLVR stuff, is that you can post-train on essentially random reasoning and still get good answers. This is a really odd thing. The beauty of the RLM approach is that we can keep the language model's input and output distributions within a length that actually occurs naturally. It's a weird thing to wrap your head around, but I think this is the whole context-rot phenomenon we've been observing: you make the sequence really, really long and all of a sudden the model's performance just tanks, and it's like, why? Part of the reason is, and I don't love thinking of models in an anthropomorphized way, a human obviously would not make these kinds of mistakes, but a human also learns in a very different way. So we should think of these models as, yes, very impressive.
They generalize well, all these things, but at the end of the day we do have to think in a mathematically principled way about how they were trained and what they are doing. If you think this way, it's very obvious that just taking the transformer and training it on huge context sequences is a really difficult thing to do. With RLMs, the idea is that we actually don't have to: we can do long-context things without training in a long-context way. And there are a lot of benefits to doing it this way. But yeah, that's my take on that. >> No, I agree. And my other view of long context, if you want to go play around in the internals of attention or how everything is set up: the issue is also that these methods are theoretically better, theoretically faster. >> Yes. >> But in practice, FlashAttention v-whatever is much better on every metric, because it's optimized for the GPUs. So you get these theoretically fantastic advancements that are absolutely dinky and worthless in practice, and then nobody is going to do the work of building the GPU kernels you'd need to use them efficiently.
So then it's like, why did we even do this stuff? And I think if there is a method that can sidestep this, for sure we're not going to push further into long context; the slice of context the models already see is enough. Also, reasoning for me, >> I see it in two ways. The first way is that it just helps the model filter out wrong kinds of answers and bias it toward where the answer should be, because technically you could just inference-scale a first shot and the right answer would maybe be somewhere in that mess; if you were able to pluck it out, you could do it either way. But also, if you look at it from an activations perspective, you get these traces of activations happening in the model, because the model doesn't have a state. One activation, combined with another input, gives you the next activation, and at some point the state gets the right shape to produce the right activation for the task. So reasoning is nice, but the fact that it's externalized is a bit weird, and as it goes, it's just using the useful part of the context inside. >> Yeah. >> Okay, cool. This is good. We have a lot of ground to cover. >> I'm happy to stay on for longer, by the way. >> Jeez, guys, this is going to be a marathon. We're going to be here the whole day. Wait, there's a bunch of questions. Somebody says: you mentioned Claude Code and other agents already do intelligent context retrieval and management to some extent; the exciting part is perhaps more on the post-training side. So this is a question about whether the interesting part of the RLM is the RL we can do on it. What's your take on that? >> Yeah.
So, I want to be careful here, because I'm still getting used to, as a new PhD student, what I can claim and what I cannot claim. I think the interesting part of the RLM paper is still probably the main result. One of the things we wanted to solidify in this paper, which I wanted to do in math but maybe the best way is just plain English, is that long-context tasks are not equal. I don't know why this wasn't made clear in the past, but obviously needle-in-the-haystack is a very easy thing to do, whereas if you're given a really dense long context, it's really hard to process. And the main contribution is that even with no training, you can use this really simple, task-agnostic method to take current models and scale their performance on really long sequences, and it can handle both really dense and really sparse inputs very well. We have the results, and on its own this is already a really cool result; even if there were no RL, no future-training angle, this is a cool paper, and I'm very happy to publish something like this. The RL part, though, the training part, is more about why I think this paper is interesting beyond this year: why is my research still focused on RLMs even after this paper has come out, and we'll probably try to publish it somewhere. That's where all this other speculation comes in, and a lot of it is grounded in intuition.
A lot of people are also seeing that this is a really interesting bet to make, similar to chain of thought and some of the other things that have worked in the past. But yeah. >> Yeah. This paper also reminds me of, I think it's the Meta paper, where they jammed >> a coding environment into the world model of the thing and it got better. I just want to bring this up. I still don't want to dunk on anybody working at Meta, right? But at this point, when I saw Llama 4 Scout with that whole block about needle-in-a-haystack to 10 million tokens, I knew it was absolutely worthless; it's not just a needle-in-a-haystack problem, it's much more complicated than that. I think you put it well, and I would have liked this framing pushed even further, this angle between the size of the context and the difficulty of the task. I think this is something that is generally not well explained in the discourse. And the other part is the average useful window size of the model: if we have these three axes, then it's a bit easier to say, roughly speaking, for this specific task, how hard it will be for this specific model to interact with it. Okay, cool. I had a question about the RLM structure, because I think there's a whole bunch of questions about it. My first question: you chose a REPL for this, right? Which I think makes a lot of sense, but you did an ablation on the sub-agents, right? So you removed the sub-agents.
Do you think you could do the ablation the other way around, where there is no REPL and it's just a whole bunch of sub-agents working on the context without it being fully loaded? This part, I think, was maybe missing, in the sense that you don't have to load the full thing; the context is still an environment variable somewhere, but the sub-agents aren't writing code, they're just working on it. Did you think about this, or is it useless compared to the other setup? >> Good question. No, this is something that we missed, and actually we're running it right now. There are two new things I'm adding to the paper, which I'm not going to make a big announcement about; we're also submitting it to a conference. One of them is exactly that: we need a baseline that is effectively, can you take ReAct or CodeAct or something and give it sub-agents, but without this offloading-into-a-REPL thing. And the point of showing this is, another thing I want to be clear on with the RLM work: the idea of taking a model and giving it access to sub-models is not new. We're not the first people to do it; there are a few other works that have tried this, and obviously Claude Code intrinsically does it. I think there's an argument to be made that the sub-agent way of doing it will get phased out. What I mean is, the way they do sub-agents is you define the sub-agent and then Claude Code will be smart about using it. Ultimately, in the long run, I think this will just be completely removed, and Claude will decide what sub-agents it wants to use.
But even ignoring that, the thing I want to be clear about is that there are two key parts of the RLM. One, obviously, is the recursion part; the second is how you actually do the recursion. This is a non-obvious thing, and the REPL is one way to do it. There have been some other proposed ways I've seen online, like using a file system and bash commands; also great. The reason we chose the REPL, as was mentioned earlier, is that these models are pretty good at coding. Maybe Claude, maybe Opus 4.5, can also do file-system management really well, and that's great; we should definitely try it. That's one of the things we want to implement in the open-source library if people want to use it. But the REPL, I think, is the most intuitive way: Python is really easy to read, and it's really easy to take something said in English and write it out in Python. So yeah, this baseline is very important. We're currently running it, and the results are probably what you'd expect. The main thing is that this setup cannot handle long context, for obvious reasons: it still has to ingest the full prompt. But yeah. >> Okay, that makes a lot of sense. We have a bunch of questions about sub-agents. My first one is twofold: does the model know how many sub-agents it is spawning, and does a sub-agent know that it's a sub-agent rather than just running a task? >> Yeah, so in the current setup, no; we actually provide as little information as possible. The reason being, if it can work without this information, that's great, and people can experiment with it, tune it, if they think it'll work better.
The model implicitly knows how many sub-agents it's spawning, because the code it generates should tell it how many. But the Qwen 3 experiments clearly show it maybe doesn't have the greatest grasp of that, especially if it writes a for loop, right? It writes a for loop over... >> Yeah, I think that's it. And I think it might also get confused because it's writing a for loop and encapsulating the LM query inside a function call; now you have multiple layers of abstraction around what the heck you're doing. Poor dude is confused out of his mind, >> right? Yeah. So I think a lot of these things can get baked in, and a lot of them can also get post-trained out, to be honest. As for whether the sub-agent knows it's a sub-agent, I actually think it shouldn't know. The reason we did it this way is that a big part of the thesis here is that you can run an RLM with a single model. Yes, you can use an RLM to spawn other models; you can use GPT-5 to spawn Gemini 3. That's fine, that's great, and it's likely what this will look like for maybe the next few months. But ultimately, what we really want is a single model that acts as both a regular model and an RLM: it should still be a regular model, but it should also be usable as an RLM. And when it spawns itself as a sub-agent, it should treat that like a regular call; it shouldn't need the prior that it's a sub-agent. It's just being asked a question, and it has to answer it. So I think a key thing moving forward is: how do you train an RLM so that it still maintains its performance as a regular model, but also has the ability to act as an RLM?
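The for-loop pattern being described might look something like this inside the REPL. `llm_query`, `summarize_chunk`, and the chunk sizes are illustrative stand-ins, not the paper's exact API; the point is that the fan-out count is only implicit in the loop bounds:

```python
# Illustrative: code an RLM might write in its REPL. The number of sub-agents
# spawned is implicit in the loop bounds, which is easy for the model to lose
# track of, especially with the call wrapped inside a helper function.

def llm_query(prompt: str) -> str:
    """Stand-in for a sub-agent call on a bounded prompt."""
    return f"summary of {len(prompt)} chars"

def summarize_chunk(chunk: str) -> str:
    # Extra layer of abstraction: the actual sub-agent call is hidden here.
    return llm_query("Summarize this section:\n" + chunk)

context = "x" * 100_000                       # stand-in for the long prompt
chunks = [context[i:i + 8_000] for i in range(0, len(context), 8_000)]

# One line of code, but it fans out into len(chunks) sub-agent calls.
summaries = [summarize_chunk(c) for c in chunks]
```

Nothing in this snippet surfaces "you just spawned 13 sub-agents" back to the model, which is exactly the kind of implicit cost the conversation suggests the model should be told about.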
These are interesting questions. >> I'm saying this because, for Qwen 3, in order for it to work, what happened is that you had to literally tell it in the system prompt: >> my guy, watch out for the compute cost, because this is too much, right? But if it knew how deep it currently is in the sub-agent calls, it would have that information without you having to tweak the system prompt; you'd get one clean system prompt that just works everywhere. And you'd be giving the model real information, because implicitly the number of agents it is spawning is correlated with the input cost. If it's 50 agents deep, >> yeah, >> it should know at that point that it's messing up. Also, I was reading their blog post, and they gave it hints on the difficulty too. I think this is another part that is super important, because these models seem to be pretty poor at assessing, >> roughly speaking, the difficulty of a task, and how much compute they should spend to solve it. So that's what prompted this thought. The other thought is that if we want to do recursive sub-agent calls, in my view, those axes you were talking about, how long the input is and how hard the task is, should maybe be something the model knows about, right? Like: look how much I've spent so far, and look at what I'm giving you; you're agent number 52 and you're three layers deep right now, but you have an easy task. This task is supposed to be easy. In that specific scenario, the chances that this sub-agent will think about spawning another one are low: no man, I can just solve it, this is supposed to be easy for my breed of LLMs, right?
I think the part that is hard here is being able to implicitly know how hard the task is. But if you can give the model that, it should steer it properly. GPT-5 seems to have better reasoning about compute cost and about how hard things are, right? Qwen 3 has absolutely no idea >> about that stuff. >> Yeah, this is a great point. Honestly, this is something that should be experimented with, but what you're saying could very well end up being the way it's done. >> And we have a question here that fits with the one I wanted to ask: in the paper you said the system prompt is fixed across all experiments. So the sub-agent doesn't even see the RLM system prompt, right? It doesn't know it's inside an RLM-style setup. Okay, cool. >> Yeah, because we're doing depth equals one, the sub-agents are just models, base models, >> fixed on one specific task, and that's it. >> And it doesn't have access to its ancestor tree; it doesn't know anything about that. Okay. >> In this case, I think that's because the tree is not interesting; it's just a root and then a bunch of leaves. If you start thinking about higher recursion depths, then yes, maybe we should start thinking about telling the model where on the tree it is, giving it a little more context about its parent node. >> I think that's because, and I go back to the lab analogy, right?
It's like somebody getting handed a project: hey, can you do this? And the grad student says okay, takes a look at it, and hands it to another one: hey dude, can you do this? And just hands over the actual thing as is, >> without the next person knowing they're the sixth one it's been passed to. That's one thing. The other thing is that if it knows the task is easy, and it is arguably easy, the chances that it will just go and do it are a bit higher. I've pulled up a bunch of neuro-inspired research on this: knowing the difficulty of a test ahead of time does change human behavior. If you know it in advance, the chances that you do great on the exam are really high; if you only find out when you have to take it, with no time to prepare, it's different, and it depends on whether you're the anxious type or not. In this case, I think Qwen 3 is pretty chill. >> But it has an impact on organic, human-type intelligence, and I also found some papers showing it has an impact on LLMs' abilities; they're just not that great at assessing the complexity of a test. >> This is interesting. Yeah. The last thing I'll say, and this is for [clears throat] the more theoretically inclined people, is that this is actually a really interesting problem of local versus global observations. I don't know how related this is to POMDPs in RL, but in general, what we're dealing with here is a system where not every model, not every actor, has all the information about what's going on, which is important, because the thesis here is that it can't.
But there are maybe some things to be said about how much information you should give each of the models at every layer; there is likely a way to characterize this very well. But anyway, not that important. >> Yeah. >> Yeah. But no, I think it's actually super duper important, especially if we're thinking about asynchronous versus synchronous. >> Yes. >> In the asynchronous case, I think it doesn't matter too much, because the sub-agent just goes, does its stuff, and that's it. But in the synchronous case, I think there's a chunk missing, which is where we store all of the context, or the facts, or whatever it is we're directly mining right now, so it can be used to double-check facts or, in some shape or form, align the rest of the model's behavior. Okay, anyway. I had a question about, yeah, the hardness question; I think we already touched on it. What's your raw intuition about why Qwen 3 Coder makes so many sub-agent calls? I've already said a whole bunch of stuff, but roughly speaking, it's still big; it's a 400-something-B model. What's your take? >> I think the short answer is, honestly, I don't have a fully principled way to answer this, but we have seen that some of these models, Qwen 3 especially, are heavily benchmark-maxed post-trained models. And as much as we like to make fun of OpenAI and all these companies, I think ChatGPT, or GPT-5, and Claude, and Gemini tend to be pretty good even at newer tasks they haven't really seen before. They tend to make more principled decisions.
I think Qwen 3 Coder is just a case of not being explicitly trained to do this kind of thing, so it makes very poor decisions. That's my speculation; I don't know. It could also have to do with the tasks it was trained on in the past; maybe it's just used to spawning. I don't know. >> Right. Right. >> I think this is pretty important, because if these models are kind of fried by RL and we need to RL them some more, this may need to happen a bit earlier in the post-training of the model, if we need to RL >> the model on the RLM. >> Also, just raw intuition here: why do you think all of the models are repeatedly verifying [clears throat] their information? Because this is something else: okay, spawning sub-agents is one thing, but then you have the answers and you're verifying them again and again, right? >> Yeah. >> Why is this, do you think? >> Yeah. So, based on my experience using even coding IDEs, I think trajectories are a very unnatural form of text. A trajectory is a concatenated sequence of inputs and outputs from a model, and these haven't really existed until recently; this wasn't a natural thing you would find on the internet, for example. And one of the frustrating things, and I still don't fully know a way around this other than maybe post-training, is that when the model comes up with the answer really quickly and the trajectory is really small, the model tends to just finish right there. It tends to just say: I'm done, there's nothing here.
And again, I don't want to anthropomorphize this argument. The argument I'm making is not that as the sequence gets longer, the model becomes more uncertain or something. I think what it really boils down to is that when the sequence is really long, the models make suboptimal decisions; they're just not very good in this setting. We've seen this in the past with the jokes we've made about Cursor: when you have a really long history, it starts making really odd decisions. And I think this is similar: for whatever reason, the high-probability action is just to retry what it just did and verify that it's correct, and it gets stuck in this loop. Qwen 3 is the biggest offender here; it's a known issue with Qwen 3, it tends to repeat things. But it actually happens with GPT-5 as well. And this goes back to the earlier point about training on long-context things: even at a smaller context window, maybe 100K or 50K, it still makes these really silly decisions. So yeah, I would say it's probably a training issue. >> Yeah. In my mind it might also have to do with the fact that these models are stateless, >> and, like you said, the long history isn't moving them out of the distribution of "I'm going to have to retry this again." For us, who have state, what we see is: you dumb piece of trash, it's been four times already, it's enough, it's most likely fine. For the model it's still uncertain, but the fact that you've tried it four times should make you more certain that this thing is most likely okay.
Which brings it back to my idea of this kind of fact database, right? You generated a fact, and two other sub-agents generated the same exact fact; you took different trajectories, but the fact is the same. >> Mhm. >> Theoretically speaking, you should take this into consideration and store it somewhere, in some shape or form. I wanted to ask about the REPL flow, because from what I understood, it's not Jupyter; it's literally a straight-up REPL, where in order to output some text you need to print it. >> Am I correct here? >> Yes, exactly. >> So have you thought about leaving it room to, I don't know, write markdown or something like that? I come back to the grad-student thing: if I had no room in my workflow to write my thoughts about what I just saw, that would be a bit limiting. Yes, I'm going to engineer and do stuff, but at some point I'm not going to write print statements that contain my thoughts; I'd much rather switch to markdown and start sketching it out. What's your general take here? >> So, okay, I think the interface the model interacts with is super important. Whether it's a REPL, or a notebook with markdown where it can also plot things, this all really matters. The caveat, though, is that technically the REPL environment can represent almost anything the Jupyter notebook can. For example, if you want to store markdown, you could store it in a variable. It's silly, it's not a natural thing to do, but it can do this. And when developing the paper, we decided we wanted it to be as simple as possible; a REPL is the simplest possible thing, so let's just stick with that.
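A minimal sketch of that plain-REPL contract: model-written code is executed against a persistent namespace, and only what it explicitly prints comes back to the model. The function names and the namespace setup are my own illustration, assuming a simple `exec`-based environment rather than the paper's actual implementation:

```python
# Minimal sketch of the plain-REPL contract described here: the model's code is
# exec'd against a persistent namespace, and only printed output is returned.
import io
from contextlib import redirect_stdout

namespace = {"context": "the very long prompt lives here as a variable"}

def repl_step(code: str) -> str:
    """Run one block of model-written code; return only its printed output."""
    buf = io.StringIO()
    with redirect_stdout(buf):
        exec(code, namespace)  # state (variables) persists across steps
    return buf.getvalue()

# Assigning to a variable produces no visible output...
out1 = repl_step("notes = '# Findings\\n- section 3 looks relevant'")
# ...the model has to print explicitly to see anything, markdown included.
out2 = repl_step("print(notes)")
```

This is why "store markdown in a variable" works but feels unnatural: the notes exist in `namespace`, yet they're invisible to the model until it prints them.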
But in the long term, say for people who want to use this in production or want to squeeze out performance, yeah, it's a good idea: storing stuff in a Jupyter-style environment. There's another advantage to using a Jupyter notebook. Also, the reason we didn't use Jupyter is that a REPL is really easy to set up; with a Jupyter notebook you have to do a bunch of stuff, and if you were to write a library for it, it's a little bit nasty. But the other advantage is that in a Jupyter notebook you can print out images, you can plot things. There was a question earlier about multimodal stuff, and the answer is: yes, you can actually do multimodal stuff. The problem in the current form is that we pass everything around as text; we have no way of passing around images. But it's a really easy change in the code, and I think one of the open research things, if people are interested, is multimodal RLMs, looking at RLMs in multimodal settings. The reason this is even cooler is that I think code interacting with images is a very underexplored thing: how a model can interact with image stuff, or even non-image stuff, like generating plots and using them. GPT-5 has some tools that let it do this, but doing it in a more principled way is a super interesting topic. I am planning on adding support for this in the RLM library.
If anyone wants to add it themselves, feel free, open a PR. But yeah, the representation matters a ton, and this comes back to what is considered in-distribution and what is not; these are all important things. >> Yeah, that was my thought, because I've had this grad-student image in my mind. My thought was: okay, what does in-distribution data look like for analysis on long inputs, where you have to do some sort of semi-mini-analysis? You have to programmatically interact with the substrate, but then you have to think about it and write your thoughts, and those become the anchor you use for the next step, and you keep doing this, so that when you hand it off to somebody, they can just read your thoughts, which are kind of a summarization of all the code that ran. And inherently the models have seen these Jupyter notebooks and this structure, so maybe they would be pushed toward that same kind of analysis behavior by being able to do that.
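The REPL-with-variables point above can be sketched in miniature. This is an illustrative toy, not the paper's actual environment: the root model emits code, the environment exec()s it in a persistent namespace, and only printed text would flow back into the model's context, so markdown "thoughts" can live in variables without consuming context.

```python
import contextlib
import io

class SimpleREPL:
    """A persistent namespace that executes model-emitted code."""

    def __init__(self):
        self.namespace = {}

    def run(self, code: str) -> str:
        """Execute code and return whatever it printed -- only printed
        text would be fed back into the root model's context."""
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf):
            exec(code, self.namespace)
        return buf.getvalue()

repl = SimpleREPL()

# The model can stash markdown notes in a variable instead of printing
# them, keeping them out of its own context until it needs them...
repl.run('notes = "## Findings\\n- chunk 3 names the culprit"')

# ...and it only sees text when it explicitly prints.
observed = repl.run("print(notes.splitlines()[0])")
```

In this setup, "writing markdown" and "storing a string in a variable" really are the same operation, which is the equivalence being claimed above.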
I also saw some research, from Microsoft I think, on enhancing LLM data-analysis capability with notebooks and inference-time value-guided search; they're doing Monte Carlo search-type stuff. Basically, they literally trained the model to do data analysis with Jupyter-style tooling, and it seems to be working well. So I don't know, it just sparked this thought. In general, um, wait a second, okay: so you did some ablations without sub-calls, right? >> Yes. >> But in some cases, the variant without sub-calls is able to perform better than the one that can do sub-calls. So what's the issue here? Is it that the RLM doesn't really know when it should be doing one or the other? What do you think? >> Yeah, I think it's a mix of things. One of them being, like you said, that it makes suboptimal decisions with the recursive sub-call. This is also another reason why it's very important to add the baseline you talked about before to the final version of the paper, because we do want to see what happens if you strip out the two most important parts, or independently strip out one of them. One of the big points of that ablation is that a really important part of this paper is not actually the recursion, which is funny because that's the name; it's really the offloading of the context somewhere else that is really, really important. Another big part, sadly, is noise. The annoying thing, and I will always criticize my own papers this way, is that I don't have standard-deviation bars and such. Sadly, I just cannot afford to run everything with them. So in a lot of these instances, it likely is also due to some kind of noise.
Uh which applies even to like comparing to the other baselines as well. Um but I think like generally yeah sub-optimal decision-m uh noise and then also the fact that like on the benchmarks where um it does perform worse. Uh these are ones where it actually can kind of get away with not using the recursive calls because those tasks are not very information dense. >> True. Um, so it can just find the thing it needs and then the main model can reason through what that information is. Like it doesn't need to do the sub calls. Um, so that's another explanation for like why there's a big a bigger gap for the other um for like Ulong and and Ulong Paris. But yeah, >> that makes a lot of sense. >> Yeah. Okay. So, um I'm going to spare you the database question. [laughter] Um >> I mean ask it. I'm happy to answer too. >> But but I mean like um um you just need to test it. I mean like how can we know um and I think this is like also adding some complexity to the system which like I think it it align with the other question which is do you think this could act as a replacement for like a fullblown rag system like if we we push it to the extreme here? Um, so I I don't think so. The reason I say this is I think the uh the usefulness of rag and and other retrieval methods is that or a big part of them is that you pre-index stuff like you pre-index like the things you're searching for which is not cheap right like it's and and it's a big reason why we actually don't compare to rag I mean also in our in our baselines rag just doesn't even make sense like the only the only setting where it makes sense is a browse uh plus, but in their paper they actually do rag and it doesn't do that well compared to BM25. It just wasn't even worth doing. Um but I I think like um I I still think there is value in methods that pre-index stuff. There is also value in equipping RLMs with tool calls and also equipping them with with rag as like a as as an extra thing. 
And in that way, I think RAG, or retrieval methods in general, are still very relevant in specific settings. Where RLMs really shine is where you cannot afford to pre-index, or you're just given something new on the spot, which often happens in a long agentic trajectory. One of the things I do want to explore in the future is a task where the long-context part doesn't come from the prompt; it comes from the trajectory itself. You can imagine a really, really hard retrieval problem where you need to piece everything together; I think BrowseComp-Plus is an example of this, but maybe even harder, like these deep-research-style things. The model is given a retriever, some kind of BM25 or RAG thing, and so is the RLM. The difficult part is that as it retrieves more stuff, the trajectory gets really long, and an RLM is actually very well suited for this setting. This is something I think is really interesting to explore. And it goes back to this idea of replacing a basic LM call with an RLM in your system and seeing what happens. But yeah. >> Yeah, okay, that makes a lot of sense. I also do think RAG still makes sense, unless you equip this thing with the database and the RAG as well.
[laughter] The difference is that with RAG you have a one-shot type of situation, but in this case it's actually mining for the information, which is the most interesting part. Whether you add RAG, or tool calls, or whatever library you want into the system, literally allow it to browse the internet and send an agent to browse whatever, this is just adding onto the same core, a bit like how chain-of-thought models now also do tool calling and other stuff on top of the same core. >> Yeah. >> There's an element that was interesting here: passing recursive LM output through variables for long-output tasks. If I understand correctly, it's offloading this to another sub-agent, right? The sub-agent does a bunch of stuff, and the result goes into the variable; the root model is not looking at what's inside. >> Yeah. >> It uses it to do the rest of the work, so it's saving its context a bit. It's kind of trusting that this is fine, right? >> That is literally what's happening here. This is actually a really cool part of this approach that I think is highly underrated. Actually, Prime Intellect's implementation of RLMs doesn't even allow the model to produce a final answer directly; it has to output a variable, and that string is the final answer of the RLM. That's an extreme version of what this part of the paper is describing. The point is that another large limitation of large language models is their output context window, which doesn't get talked about a lot; it's not infinite either.
One of the really cool things you can do with an RLM is output nearly unbounded sequence lengths, and you can do this in various ways. The trick we use is that the model can pick a variable and choose that variable as its actual final output. In the silliest case, you can imagine what it does: you give the RLM a prompt, it takes the prompt as a variable, passes that to a recursive model, the recursive model answers it and stores the answer in a variable, and then the RLM just outputs that variable. That is the same as doing a single model call; these are equivalent. The more powerful part is, for example, if your task is: I have a one-trillion-token Excel sheet and I want you to transform every row into a new Excel sheet. The RLM can actually do this. It would chunk up the Excel sheet, spawn a recursive model on each chunk, save the outputs to variables, concatenate all the output variables into one final one, which is maybe also a trillion tokens, and output that. And this is very, very cool because it also mixes in programmatic things; you don't have to use the language model itself to produce the final answer. This feature is actually what broke a lot of the benchmarks, because the model is so flexible. For example, all of these benchmarks that ask whether a model can do 30-digit multiplication, which by the way I think is kind of silly; sometimes the answer is no, and it's like, why are we even evaluating this? In this setting, it will just compute it in a variable and output that.
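The chunk-transform-concatenate pattern described above can be sketched in miniature. Everything here is illustrative: `recursive_lm` is a stub standing in for a real sub-model call, not the paper's API. The point is that each sub-call sees only one chunk, results accumulate in variables, and the concatenated variable is returned directly as the final answer, so the output can be far larger than any single context window.

```python
def recursive_lm(prompt: str, chunk: str) -> str:
    # Stub: a real implementation would call a sub-model here;
    # uppercasing stands in for whatever per-row transform was asked for.
    return chunk.upper()

def transform_rows(rows, chunk_size=2):
    # Split the giant input into pieces small enough for a sub-call.
    chunks = [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]
    outputs = []
    for chunk in chunks:
        # Each sub-call sees only one chunk; its result goes into a variable.
        outputs.append([recursive_lm("transform this row", r) for r in chunk])
    # The concatenated variable IS the final answer -- the root model never
    # needs to hold the full output in its own context.
    return [r for chunk_out in outputs for r in chunk_out]

result = transform_rows(["alpha", "beta", "gamma"])
```

The concatenation step is purely programmatic, which is exactly the "you don't have to use the language model itself to produce the final answer" point.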
And I think you can do a lot of really cool things: effectively, what the model can do in the REPL is form an entire workflow for how it's going to generate the final answer, and this includes both code and language-model calls. It's almost building its own agent scaffold within itself, which is very interesting. It's part of the reason why OOLONG-Pairs is so hard: OOLONG-Pairs asks the model to generate all the pairs that satisfy some property, and you can do this in a programmatic way by passing the outputs across variables. So yeah. >> Yeah. My intuition here is that it's the right idea at the right time, because we already know that, say, Opus 4.5 is a fantastic coding agent. Now you put it into a setup where the only thing it has to do is code, literally. Okay, it can't do 30-digit multiplication? It can write the script, run it, and it's done, and then it can just move on to the next task and stitch it up. It doesn't need massive context; it can spawn six versions of itself and just go. So I feel like it's the right idea at the right time. There's this line I just wanted to get your rough idea on; I know you might be working on this right now: you hypothesize that RLM trajectories can be used as a form of reasoning, which can be trained by bootstrapping existing trajectories. What do you have in mind in order to actually do that? >> Yeah, so this is actually really tricky in practice. But let me lay out the core idea of what I was trying to say there.
Now, I want to say this in a way that's not confusing to people, so if people find this confusing I can reframe it. In the last year, the way we have done reasoning models: what is a reasoning model? A reasoning model is just a model that has been post-trained such that, when it's given a question, it will output a long reasoning trace, and this trace also gets fed back into the model, so it's a form of conditioning. Given this reasoning trace it came up with, plus the original prompt, it will come up with a better, more informed answer to what it was trying to do. This is what I like to call reasoning in token space, because quite literally it is just outputting tokens to come up with an answer. And the way these were trained, with RL for example, though it doesn't have to be, is basically a version of rejection sampling, if that's the right word: you get the model to produce these long sequences, and if it gets the question correct, you give it a positive signal and do the update, yada yada. It's simple, but obviously in practice there are a lot of really nasty parts. The reason this works so well is that it's still the same as just training a model; you're just training a model with RL, and the sequence is still fed back into the model, so the whole pipeline is the same as if you were training it in a non-reasoning way. There's actually no difference for the most part. The difficulty with the RLM case, and why I think this is also so cool, is that the RLM trajectory is way longer than what fits into the model's context window.
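The rejection-sampling loop he describes can be sketched roughly as below, in the STaR style: sample traces, keep only the ones that reach the right answer, and fine-tune on those. The sampler is a deterministic stub standing in for a real model, purely for illustration.

```python
import itertools

def make_sampler(answers):
    """Stub 'model' that cycles through canned answers -- a real model
    would generate a chain of thought plus an answer."""
    it = itertools.cycle(answers)
    def sample_trace(question):
        guess = next(it)
        return {"trace": f"let me think about {question['q']}...", "answer": guess}
    return sample_trace

def collect_training_data(questions, sample_trace, samples_per_q=4):
    kept = []
    for q in questions:
        for _ in range(samples_per_q):
            t = sample_trace(q)
            if t["answer"] == q["answer"]:  # positive signal: keep correct traces only
                kept.append((q["q"], t["trace"], t["answer"]))
    return kept  # fine-tune on these (question, trace, answer) triples

sampler = make_sampler(["wrong", "4"])  # stub alternates wrong/right
data = collect_training_data([{"q": "2+2", "answer": "4"}], sampler)
```

The catch he raises next is that an RLM trajectory spans multiple model calls and code execution, so it cannot simply be dropped into this loop as one training sequence.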
So you can't just naively train an RLM the way you would a plain reasoning model; the backpropagation is really awkward here, and even the reward is really awkward. We now have what we usually call in the RL community a credit-assignment problem. The other really weird thing is that we're not reasoning purely in token space. If that's a confusing term I can explain it better, but we are reasoning in code and in token space, and not only that, we're reasoning across multiple model calls, which is really weird; it's a really awkward thing to do. How you actually train this model, such that it never actually uses the full trajectory, to do this reasoning training, is kind of tricky. It's an ongoing thing; we're looking into it. I also would be happy if frontier labs were interested in this as well. I don't care who ends up having the best model; I just want to see if it works. >> It could be you, man. It could be you. >> Maybe, maybe. But yeah, I guess that's kind of what that means. >> Yeah. Have you thought about evolutionary strategies here? >> Yeah. >> I say this because I was talking to the EGGROLL guy, and another researcher who is also working on this, and it's comparable to doing GRPO on some of the benchmarks. You just have to make sure you're doing it as optimized as possible on the GPU side.
But if you can pull it off, then it doesn't matter what's happening in the middle: it can literally spawn a hundred sub-agents, recursively, whatever; you just wiggle the parameters, look at the output, and say, this is good, we're going to make more of this and less of the other stuff. >> Yeah. [laughter] >> So one question I had: do you think that post-training the model to be an RLM will have an impact on the number of sub-agents being spawned, and generally on its understanding of the task difficulty? Basically, bringing Qwen 3 closer to GPT-5's level of understanding, of not being silly. >> Yeah, so I would recommend reading this paper called Context Folding, something something; I think it's a ByteDance paper, though I actually don't remember if it's ByteDance or a different Chinese company. They do something a little bit different from what RLMs do, but it's a similar core idea: we have multiple model calls and we want to train a model with RL to do this kind of thing. And they do a lot of really interesting tricks with the GRPO loss, with the goal of reducing the number of sub-agent calls, reducing the length of the root language model's trajectory, things like that. My answer is that it honestly depends a lot on what your loss is. And, this is just speculation, but I think naively training with how we've done it in the past, with GRPO or maybe some modified version, is not going to work that well unless you have a lot of data. I think we are inevitably going to need to bake in some things, at least in the beginning. In the future things will eventually simplify, and maybe it will just return to GRPO.
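The evolution-strategies idea raised a moment ago, treat the whole trajectory as a black box, perturb the parameters, and score only the final outputs, can be sketched on a toy two-parameter objective. This is generic ES with antithetic sampling, not anything from the paper; the quadratic `score` stands in for "run the full RLM trajectory and grade the answer".

```python
import random

def score(params):
    # Black-box reward on the final output only. Optimum is at (3, -1).
    x, y = params
    return -((x - 3.0) ** 2 + (y + 1.0) ** 2)

def es_step(params, rng, pop=30, sigma=0.1, lr=0.1):
    """One ES update: perturb, score, and move toward the perturbations
    that scored better (antithetic pairs reduce gradient-estimate variance)."""
    grad = [0.0] * len(params)
    for _ in range(pop):
        eps = [rng.gauss(0.0, 1.0) for _ in params]
        r_plus = score([p + sigma * e for p, e in zip(params, eps)])
        r_minus = score([p - sigma * e for p, e in zip(params, eps)])
        adv = (r_plus - r_minus) / (2.0 * sigma)
        for i, e in enumerate(eps):
            grad[i] += adv * e / pop
    return [p + lr * g for p, g in zip(params, grad)]

rng = random.Random(0)
params = [0.0, 0.0]
for _ in range(150):
    params = es_step(params, rng)
# params should now be close to the optimum (3, -1)
```

Nothing in the loop ever looks inside the "trajectory", which is why ES sidesteps the credit-assignment awkwardness, at the cost of needing many rollouts per update.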
I don't know. But for the time being, if we want to see some initial cool results with RLMs, we'll probably have to guide them in certain ways. For example, if we want to post-train Qwen 3, we kind of have to add a little knob that says, hey, don't do that many sub-agent calls. I think this is just what's going to happen. >> Yeah, that makes a lot of sense. And I also had this other thought, but I think it goes in the same direction: if it knows whether the task is easy or hard, it can decide what type of sub-agent to call and just use less compute. If it knows this is a dumb-dumb task it doesn't want to do, it can just spawn a Llama 3 and be done, right? But if it knows it has to do big thinking here: GPT-5.2, go for it, we'll wait 20 minutes, it doesn't matter, because this is too complicated a problem, and then this thing can orchestrate the rest. But here we don't know yet. There's a funny question in the chat. [laughter] How does someone enjoy doing this all the time? Are you just sitting in front of your computer all day? How is this fun? What's your take? >> Yeah, well, what I will say is, I actually don't think I work that much. This might be surprising to some people, but I think the hardest I've ever worked was during my undergrad; I genuinely think that, and I joke about it a lot with my friends. School can be hard; you can really make it hard for yourself. But for me personally, doing that was actually really helpful, because I spent most of my time in undergrad not doing deep learning or machine learning stuff.
Other than some research, the courses I took were mostly math, or systems, or physics-type stuff. And that honestly built up enough of a foundation for me to explore what I think are really simple ideas. I think the RLM idea, for example, is really simple; I don't think it's some crazy novel thing. I do think it's quite clever, and that's why I think it's a little popular now. But in general, honestly, I am not of the opinion that people should work, you know, 15 hours a day; I think that's kind of crazy. I do think that if you enjoy it, you will naturally just spend time doing these things. I also think I've been very lucky, so I will say that too; things work out differently for different people. But for me, I would credit my successful streak of research ideas to when I started doing GPU MODE stuff, which is really weird because it's not related to most of the research I do. But that's when I started getting involved and seeing what kinds of problems exist out there. So yeah, I think a lot of problems in ML are still, not low-hanging fruit, but more like: there are a lot of clever ideas that haven't really been articulated very well. And the funny thing about AI stuff in general is that I don't think we need crazy ideas. A lot of ideas already exist and float around, but the way they become interesting is when somebody formalizes or articulates them in a way where people understand what's going on. STaR and Quiet-STaR, Eric Zelikman's work that underpins all the reasoning-model stuff, are a great example of this.
The idea of bootstrapping reasoning traces, I'm sure many people thought of it at the same time or even earlier, but it's his papers that made it clear to people that this is actually a really good idea. And I think there are a lot of ideas like that still out there. Some of them are rooted in more theoretically minded individuals, people who like to think in math, and some of them are just super simple. So yeah, I don't think there's any secret recipe for these kinds of things. It really is just: you spend time in the field, and these ideas kind of float around. >> Yeah. Also, for those who are not aware of how research works, it's not necessarily that you just sit at the computer and the idea will come from the computer. The computer is just for doing, or for getting information. For the idea, you need to build an intuition: you read a bunch of stuff, you can take it outside, print your papers and read them out there, and you chat with researchers. In my view, chatting with researchers is the best way to get to the core of it. You can read the paper, yes, it's all formal, but getting the background intuition behind the idea also gives you some sense of where the work is maybe going. So actually, it's just a lot of chatting around. [laughter] At some point you have to code something, right? >> Yeah, of course. I mean, fundamentals are always even more important, obviously. And honestly, maybe this is a hot take, but I think the fundamentals for the AI field are quite shallow.
Like, if you wanted to get into pure math or physics research, it's quite difficult; it takes years. But in AI there's so much to do, and you can verify things yourself: okay, you have this dumbass idea, just try it out, man, let me know. >> And then you'll see if it's worth it or not. If it needs a massive amount of compute and it's going to be super complicated, realistically it won't happen at all, right? So you just have to gravitate toward less compute-insane ideas, or get an internship at, like, OpenAI. >> I actually think a lot of ideas don't require compute. The funny thing is, I think all of the boring ideas require compute, in the sense that it's the easy way out: oh, of course you can train on this thing. [laughter] But there are a lot of ideas like RLMs, which do require compute, but whose core part actually doesn't. There really are a lot of things that are missing currently, and these ideas can come from anywhere, genuinely; you don't need to be super established or anything like that. >> Yeah. Well, there's a good follow-up question on this; I think it will be the last one. How would one look for novelty when looking for research to publish? Somebody is working on their master's thesis. How do you get novel ideas? >> So I think novel ideas come from understanding what's going on in the field really well. That doesn't necessarily mean reading a thousand papers; there are some people who do that, and that's great. I used to do that; I don't anymore. Part of the reason is that a lot of ideas get recycled, and I don't think that's a bad thing, by the way.
But I think the way to think about it is: once you read into a field enough, you will get frustrated with certain things. Some things will just feel like, this doesn't make sense, why is it done this way? And honestly, the answer is usually that maybe someone hasn't explored it thoroughly; it's generally not true that it's done this way because it's the best. As for what to pick: if you look at my history of research, it's all over the place, genuinely a bunch of random stuff. I am not a specialist in post-training, for example, or in context engineering. But as you read into certain fields, you will naturally have questions. You had lots of questions for me today, and honestly, a lot of those questions are research projects of their own; they could very well be things to explore. And oftentimes that's how it goes: this RLM project started basically when, and I should give credit to my adviser here, a very great guy, Omar, the guy that did DSPy, he basically said, hey, what if we look at models that tool-call other models? I just don't know what would happen; let's see. And initially we did a bunch of stuff and it did not work. It was very silly, very dumb, and I'm sure a lot of people have tried it too, before we settled on the final idea. But it's things like that; it's just, oh, why hasn't this been done before? There are some subfields where this is a lot harder to do.
Systems, for example: it's a lot harder to pull this off there, because generally in systems it's not as much of a research question; it's more that someone needs to go do it, and you need to learn how. FlashAttention is a great example of this. But yeah, I think there are lots of great ideas still to be discovered. >> No, that's exactly it, honestly. Same here. At some point, among all the ideas, you just have to commit to one and push it through. And it's really true that there are a lot of trains of thought that just stopped four years ago because the only guy who worked on it graduated and is now working at McKinsey or whatever. [laughter] Not all directions of the human edge are being pushed at the same time. So, what's next for this research direction, and how can the community be involved? >> Yes, okay, this is important. I would say the obvious next direction is training, and I don't think this is something that can be done that easily in the open, unless there is a community, things like EleutherAI and others that have open compute, more centralized communities where they can train stuff. A few companies, Prime Intellect most notably, are working on this now. Just in general: can we solve a lot of the problems we talked about today through post-training, and can we get a model that can actually boost its own performance by post-training on the scaffold? Very interesting problem. I think we will likely see some results maybe in the next six months, maybe even earlier, since some people are already working on this.
I think another big direction, maybe the more open-source part that I've been thinking about, is going back to this Jupyter notebook thing, and more broadly: what is the actual interface we want to end up with? I say this because I think that to make progress on this problem as a community, there need to be some standards. If everybody works on this concurrently with different ideas of what to do, it's just going to be a mess, because everything in ML is about being in distribution now, right? Let's be honest: at least for language models, it's about being in distribution and trying to mold things into a form the model likes to see. So thinking about how this is designed, for this open-source library we have, is super important. Number two is this whole asynchrony thing. We want this to be really fast, so I can imagine that in the near future we might develop another type of inference engine specifically for RLMs, around how it minimizes the longest depth of chained language-model calls, how we design these systems to be used on your local server, and how we design the sandboxes. What is this REPL going to be equipped with? What is it even going to be? Is it going to be a Docker container, or a Docker image that runs on your machine? Is it going to be a sandbox you hook up to Modal, or to your own kind of cluster? I think these are all open questions that do not involve a lot of compute, and that can be discussed and solved in the open. So I think these two things are super important.
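The asynchrony point can be sketched with plain asyncio: independent sub-calls fan out concurrently, so wall-clock time tracks the depth of chained calls rather than their total count. The sub-call below is a stub with fake latency, not a real inference engine.

```python
import asyncio
import time

async def sub_call(chunk: str) -> str:
    await asyncio.sleep(0.1)  # stand-in for one sub-model request's latency
    return chunk.upper()

async def summarize(chunks):
    # All chunks are independent, so fan them out in parallel (depth 1)
    # instead of awaiting them one at a time (depth == len(chunks)).
    results = await asyncio.gather(*(sub_call(c) for c in chunks))
    return " ".join(results)

start = time.perf_counter()
answer = asyncio.run(summarize(["a", "b", "c", "d"]))
elapsed = time.perf_counter() - start
# Four 0.1s sub-calls complete in roughly 0.1s, not 0.4s.
```

An RLM-specific scheduler would do the same thing one level up: analyze which recursive calls the emitted code makes independent of each other, and batch those across the serving engine.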
Number three, which I forgot about, is evals, which goes back all the way to the beginning: we need evals. And I don't even mean just long-context evals; of course those are important, but I genuinely think this is a great one if you're looking for something to do. With SWE-bench, I was fortunate enough to be there when John and Carlos and Ofir were developing it, and it genuinely just came from: hey, this is a naturally occurring problem, can we get a model to solve it? I think we need more benchmarks like this, where models just don't do that well and that try to reflect realistic tasks. They will get hill-climbed, of course, but I think we need more diverse evals, because that's probably the single most important driver of model progress these days. >> Yeah, 100%. >> Yeah. >> No, I have nothing to add, man. Thank you very much, and also for staying so long afterward. >> Yeah. >> Folks, go follow him on Twitter; all the links are in the description. Read the paper, it's a really good one. And thank you very much, Alex, for coming. >> Of course. Thank you so much. >> Perfect. See you, man. The recording will be available on YouTube, folks, so you'll be able to take a look at it. Honestly, I really like this idea. I really do believe it has the same characteristics we saw early on with reasoning models. There's a lot of stuff to do with it; the frontier here is kind of boundless. So if you want to get involved, this is a very good type of project to be involved in, because you don't require training. You don't train these models; you set up the harness and then you tweak a bunch of stuff.
So if you like projects where there isn't a lot of demanding compute, and it's about understanding qualitatively what's going on and thinking creatively about how to set things up, it's a good place to start. The code is actually open source; I'll put the GitHub link up. You can take a look at it, start to tinker with it, and you're going to have some ideas that you can then share with the community. So thank you very much, everybody. It was super fun, and I wish you all a fantastic rest of the week. Bye-bye.