Okay, is this working? It's complicated to go live on stream, but I think we're getting there. Okay, this is good — seems to work. Hello everyone. Hey man, how are you? Chill out. [laughter] Chill out. How's everybody? I think my mic is good. Someone says, "Your channel has been so helpful as I embark on my thesis and research" — man, that's so kind. You know, I'm making this basically for myself: when I was doing a PhD, I was desperately trying to understand what was going on, and it was hard to ask questions and find your way a bit. So yeah, if you have any questions, don't hesitate. For those who don't know the format: there's a paper, and I'm going to look at it and dive deep into it. We also have the first author of this paper coming on stream, and he'll be able to answer all my dumb questions. [laughter] So if you also have dumb questions, don't hesitate to let us know. If you have general questions too, go ahead — the first hour is just going to be me explaining the RLM, and after that, at 11 I believe, he'll come in and we'll be able to chat with him about everything related to the RLM. I just need to make sure I'm not messing anything up. I have a new setup now, so I'm able to stream a bit everywhere at the same time. I still don't quite have the hang of it, but it's getting there. So, okay, this one is there, and I have the chat over here, so this is good. I think the chat on Twitter isn't working... wait a second... there we go, Twitter is working. Okay, perfect, we're doing great here. So, the paper we're going to review is the RLM paper. You might have seen it float around a bit. It's a great one.
I really, really liked how it was written, and I really like the general flow of it. Let me show you the paper over here — share screen. Right, so it's this one. It's a pretty good paper, and it's basically a harness paper: it shows a new way of plugging an LLM into a long-context type of harness, so that it can ingest more context without the context-rot problem. Overall it's great. It's a lot of pages, but what matters most is the first eight pages, so it's not a terribly intimidating paper to read. What's cool is that they added a lot of information about what went right and what went wrong — qualitative information about literally what the RLM does as it ingests context in different situations. So no, it's a pretty nice one. We're going to check it out in its entirety, and like I said, don't hesitate to ask any question about anything deep-learning related. Let me also check over here... okay, we're doing good on all fronts. Perfect. So let's start with a nice little overview of the whole thing. Here it is. Alex Zhang will hop on in a few, and we'll be able to ask him a bunch of questions — I have about 20 questions of varying usefulness, so we'll be able to check those out. If you have others, again, don't hesitate. I just need to calibrate my stuff. Okay, so this paper actually comes from an idea that was postulated in a blog post in October. He has been actively working on it and refining the idea, and now we have a paper — it's exactly like the blog post, just pushed a bit further.
So if you want to understand a bit more about how research ideas are formed, take a look at the blog post, because it's much rawer in its intuition — you see what went through his head while he was thinking about long context, and what he followed in order to do the actual, bigger experiments; you see what scaled from there to here. So it's pretty cool. I suggest you read both: one for the intuition, one for the rigor of how it's written — and it's very approachable. This is a long-context paper, so it deals with having models ingest much more context than they theoretically can ingest normally — or they can ingest it, but near the edge of the context window, the quality kind of deteriorates. You've most likely already encountered this when chatting with ChatGPT or whatever: as the discussion goes longer and longer, it just gets dumber and dumber. The usual intuition is that you summarize the information, or you take two chat windows and merge them into a bigger one, trying to distill out some of the information. It's that kind of idea, but coupled with a fact about LLMs: they're very, very good at coding. That's the core intuition here, and what they did in order to ingest an even bigger context than the window allows is very interesting.
They put the prompt — which is absolutely massive in some cases, like 10 million tokens — inside an environment variable, in a REPL setup. So the gigantic prompt is there, but the model doesn't have it in context; it just digs into it again and again through this type of workflow. It prints, say, the first 100 lines of the prompt, then continues like that, slicing it up a bit in a Jupyter-notebooky way — you have the dataset and you play with it to understand it. It's the same idea. What's interesting is that it can delegate some of the tasks to another agent: the agent gets a prompt and can interact with the same data in a different way, but with a subtask. So here it's something like "in chapter one, find all items listed as belonging to whoever", and that LM will do it, return the response, and they do this back and forth. It doesn't recurse to depth two or three or four in practice; it comes back to the main root node. At some point it needs to give a final answer, and this is all happening in a REPL environment, so the information is getting stored in variables that get created — part one, part two, whatever — and it uses these variables: "okay, this is what's in this variable, so I can use it". It can create for-loops and a whole bunch of stuff. It's really free-form coding happening here. At the end it outputs a response, and that's it. So this big thing is the RLM: the LLM with the environment attached to it.
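The mechanics described above can be sketched in a few lines. This is a minimal toy, not the paper's actual implementation — the names `RLMEnv` and `llm_query` are my assumptions, and the sub-LM call is stubbed out so the sketch runs:

```python
# Minimal sketch of the RLM harness idea: the huge prompt lives as a
# variable inside a Python REPL, and the model only ever sees the
# *printed output* of code it writes against that variable.
import io
import contextlib

def llm_query(sub_prompt: str) -> str:
    """Stand-in for a recursive sub-LM call with fresh context.
    A real harness would send sub_prompt to a model endpoint."""
    return f"[sub-LM answer over {len(sub_prompt)} chars of input]"

class RLMEnv:
    def __init__(self, giant_prompt: str):
        # The full context is stored here, NOT in the model's window.
        self.ns = {"prompt": giant_prompt, "llm_query": llm_query}

    def run(self, code: str) -> str:
        """Execute model-written code; only the printed output
        (usually a small slice or summary) goes back into context."""
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf):
            exec(code, self.ns)
        return buf.getvalue()

env = RLMEnv("chapter 1\n" + "filler line\n" * 100_000 + "the needle\n")
peek = env.run("print(prompt[:9])")          # probe the start
hit = env.run("print('needle' in prompt)")   # cheap programmatic check
```

The point is that `prompt` can be arbitrarily large: the root model's context only ever grows by what `run` returns, which the model itself keeps small.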
Now, if you look at the performance, it's actually pretty interesting. There's a figure here showing S-NIAH — needle in a haystack — then OOLONG, and then another version of it that's a bit harder. NIAH is like linear complexity, while the harder one is like quadratic complexity in the length of the prompt, so it's much harder even for the same amount of input. GPT-5 is good at needle in a haystack because, yes, it's a lot of context, but it's not complicated: you can do one scan through it and you're done. These other tasks require doing a lot of computation, and you have to do it in one pass, which is hard. When you take the same model — no post-training or anything — and put it into the RLM harness, what you get is: needle in a haystack is still good, and it's even good outside the context window, which is super interesting — it's past the theoretical limit of what the model can ingest. [clears throat] If you've seen my stuff for the past year, I've been talking a lot about quality long context, because I really think that if you can crack the quality long-context problem, a lot of slightly more complicated things immediately become available to the models. And this seems to be a way to do it almost for free — there's a cost, but it's not crazy.
So this whole region that was not possible before is now possible — all of this. And on top of that, if you look at the curves, they're straightening out: OOLONG is getting better, and OOLONG-Pairs — which is quadratic in complexity per input length — is also doing great. At 1 million tokens we're still good, and I think they pushed it to 10 million, which is insane, and it's still working fine. And if you look at the average API cost in this figure, it's not terribly more costly. You see the RLM with GPT-5 on OOLONG-Pairs: that one scales up a bit because the task is quadratic in complexity, so it has to do more processing, but it's possible in this regime, which is just impossible for the other setups. The RLM on OOLONG is still doing super fine, while GPT-5 alone is struggling — GPT-5 alone is cheaper, but for lower performance. So it is more costly in general to use this. Things get a bit murkier at smaller contexts, because there you don't even need sub-agent calls or anything; the LM can just answer straight up. But this shows that, yes, it's more costly, but compared to the other methods we're going to see, it's not terribly costly. It seems to be a paradigm that makes a lot of sense. From what I understood from the paper, you basically attach more LMs on top of the main one and use them kind of as storage — but it's a bit more complicated than that. The main storage is really this environment running on the main node, and then you can spawn agents to just go do something.
I mean, you can call it an agent, but it's literally just another RLM, right? And that RLM can do stuff: it has a REPL, and it can do some analysis and work in there. So that's what's going on: you can spawn them at multiple points and then aggregate their information to various degrees. Sometimes it doesn't spawn the sub-agents at all — and in some instances it's actually better not to spawn them. But in other cases, if you have to do some semantic-search type of stuff, semantic understanding, you're much better off delegating it to an LM; the LM can take the information, ingest it, and do something with it. So that's the element. And if you look here, you see it splits the prompt on chapter two into part one and part two, and it feeds the context of part one into the sub-call. You see that? The sub-call gets a tighter slice of what it's going to look at. So it's more like sub-processing of the information than pure storage — and then afterwards you take all the information and put it together. That's the general idea. Hey man, how are you? You doing good? We have a lot of old faces here. It's a fun paper; I really like this one. Cool effect: we have iMuffin saying he's working on this as we speak — cool man, let us know what's up. There are also the folks from Prime Intellect, who did a whole bunch of experiments on it, so if you want to take a look at their paper, it's also a good place to start. I think they did a bunch of ablations and looked at other models, the GLM and stuff like that. Okay.
So, look at the long-context benchmarks they're using. For those who aren't familiar: there aren't terribly many of them. And it's not just about having massive tasks; it's also about having tasks that are a bit more complicated. With more complicated tasks, you can actually understand better how the models are handling it, because if you look here, all of these tasks have the same amount of input context, but this one is much easier — the difficulty is kind of constant; it never gets that much harder. That's why you see Llama 4 Scout being able to do it at 10 million: it's an easy task. But as soon as you have a slightly more complicated task, it gets more difficult, and there are even harder ones among the benchmarks they used. [snorts] Wait, we have a question here: how does this solve the context issue — won't the LM eventually run out? It's a bit different than that: the prompt is not loaded in context. So you can jam 100 million tokens in there; it's not in the model's context. What is in context is the system prompt and the tooling to run the REPL on the data; the model then takes slices of it and spawns the sub-agents. The system prompt is actually what tells the model to do all that. Maybe if we push it to 100 million it will run out, because even the slices it's taking and the steps will be too much — but it works up to 10 million right now. It's taking slices of the information; it knows what the question is, and it's being very careful not to run out of context.
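One way to see why the root model never runs out: it only ever holds metadata and bounded slices, not the raw text. Here's a toy sketch of that budgeting idea — the 8,000-character budget and the chunking scheme are my illustrative assumptions, not values from the paper:

```python
# Sketch of how the root model stays within its window: it never loads
# the raw prompt, only sizes and bounded (start, end) slices it can
# later hand to sub-LMs with fresh context.
BUDGET = 8_000  # illustrative per-slice budget, not the paper's value

def plan_chunks(text: str, budget: int = BUDGET):
    """Return (start, end) slices small enough for one sub-LM call."""
    return [(i, min(i + budget, len(text))) for i in range(0, len(text), budget)]

giant = "needle at position 123\n" * 50_000   # ~1.15M characters, never printed
chunks = plan_chunks(giant)

# The root model's context only ever holds small numbers like these:
meta = (len(giant), len(chunks))
```

So whether the stored prompt is 1 million or 100 million characters, what enters the root model's window is a constant-size plan, not the text itself.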
Then it spawns these sub-agents, which always have fresh context, so you can always spawn more of them to do that work, and their results get aggregated. You're getting the final answer by digging programmatically through the stuff. Okay, that's good. Now, the benchmarks that are used. First, S-NIAH — needle in a haystack — everybody kind of knows about it: you have a document that's pretty big, a super mega-long text, and you have the needles. Usually it's just Paul Graham essays, but you shuffle them and jam the needles into whatever the long-context filler is. Then you ask a question; the model gets the question and the document, has to parse through it and find the right stuff, and that's the answer. So that's the idea of this one — the easy long-context benchmark. "Alex Zhang, MIT guy?" Yes, he's an MIT CSAIL [clears throat] PhD guy. "So the main model just orchestrates the communication between multiple agents with fresh memory?" Yes, it does that, but not just that: it can also not spawn the sub-agents, and it will still be good — with just the REPL environment it still does pretty well, actually.
Next there's BrowseComp-Plus. BrowseComp is a benchmark for browsing agents, and what they're doing is augmenting it: they augment the corpus with gold information — information retrieved by o3 and human-verified — and with hard negatives, which is information that makes sense in some way but is not the information you need. So you have these hard negatives and this gold information all piled into the BrowseComp-Plus benchmark, and that's what's used here, and you can go up to 1K documents. The benchmark provides a verified offline corpus of 100K documents, guaranteed to contain gold evidence and hard-negative documents for each task. There's a whole bunch of tasks, but they use 150 randomly sampled tasks as the evaluation set, and provide 100 randomly chosen documents to the model. So 100 docs are given, with the guarantee that the gold evidence is in there, so the task is actually doable. This is a bit more difficult than NIAH — there's a gradation of difficulty here. Then we have OOLONG: there's a whole bunch of documents, and you have to piece the information together across all of them. You're not just going to find it in one doc; you have to piece information together from multiple places. Same thing with, say, a transcript of 12 hours of dialogue in some game. So that's OOLONG.
Then they decided to make it even harder: they manually modified the trec-coarse split of OOLONG to include 20 new queries that specifically require aggregating pairs of chunks to construct the final answer. A task will be like: "in the above data, list all pairs of user IDs where both users have at least one instance with a numeric value or allocation", where each question can be labeled as one of the label descriptions — abbreviation, concept, entity, blah blah blah. So you have to take pairs of information and find them in this mesh of data before you can output something useful. A bit harder — this is supposed to be quadratic in difficulty. The last one is LongBench v2. For LongBench, there's a whole bunch of long documents; data annotators go over them, there are some revisions, and then a manual review at the end. So: massive, mega-big documents, with single-document QA and multi-document QA — and what's interesting for us is the code repository understanding slice. It's QA — 50 questions — on literal code repositories, and you have to dig into the code in order to understand it and answer the questions. That's the slice they're taking. It's a bit weird, but it's actually fairly difficult to make good long-context benchmarks, even when you have the multi-turn type of thing.
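Why "pairs" makes the task quadratic is easy to see in code: instead of one scan, you have to compare every pair of per-chunk facts. A toy sketch — the data and predicate here are made up, not the benchmark's actual schema:

```python
# Why OOLONG-Pairs scales quadratically: answering requires checking
# every pair of extracted facts, not a single linear scan.
from itertools import combinations

# Suppose each chunk yields (user_id -> has_numeric_value) facts.
facts = {"u1": True, "u2": False, "u3": True, "u4": True}

pairs = [
    (a, b)
    for a, b in combinations(sorted(facts), 2)  # O(n^2) candidate pairs
    if facts[a] and facts[b]                    # both must satisfy the predicate
]
```

With n users there are n·(n−1)/2 candidate pairs, so doubling the input roughly quadruples the work — which is exactly why a one-pass read of the context struggles here.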
There's a big corpus the model needs to dig into, but surprisingly — at least there's more now — there weren't that many benchmarks already structured in a way where you can test and train models on them. That's maybe also why some models are not that great on long context: they were never trained with a long enough context at all. Okay, that's good. Hey Arman, hope you're doing great. So the method is like this: there's the RLM, the environment holds the prompt, it's loaded as a variable, and the model can play with it. The context is loaded into a variable and the model has access to it — that's the situation we have here. It can spawn these agents, and the system prompt looks roughly like this: "You are tasked with answering a query with an associated context. You can access, transform, and analyze this context interactively in an environment that can recursively query sub-LMs" — in the ablation they remove this part — "which you are strongly encouraged to use as much as possible... iterate until you provide the final answer." And then there's a whole bunch more. They use roughly the same system prompt for both GPT-5 and Qwen3-Coder, but for Qwen they had to add this [laughter] — they had to say this little thing: "be very careful about using llm_query as it incurs high runtime cost; always batch as much information as reasonably possible into each call" — because Qwen was always spawning agents, for every line or something like that. So they had to restrain it a bit. GPT-5 doesn't do this out of the box; it just uses it properly.
But Qwen just doesn't care at all [laughter] — which I found kind of funny. We have Noah saying, "I didn't know LinkedIn had live streams." I didn't know either; it just works. [laughter] [clears throat] So don't hesitate to ask questions on whatever platform you're on. So this is the system prompt — you can take a look, it's in the paper — and it's the only thing guiding the LM. Technically the RLM is just this. There's no RL being done here whatsoever; it's literally just a good system prompt with the REPL environment. That's it. So, this is good; let's take a closer look over here. There are a bunch of patterns that emerge in the models' behavior. They're not told explicitly what to do, but they end up doing a whole bunch of things. For example: they can probe the context and then interact with the probe, with regexes or with semantic sub-agent calls. They can defer some of the reasoning over large contexts by creating recursive LM calls — they'll write a function and recursively ask a bunch of questions, everything in code, building each prompt and shoving it in there. And they can stitch recursive LM outputs together to form longer composite outputs, which is pretty interesting. I really like the way it's set up; it's very natural. It feels a bit like the model is discovering what it can and cannot do. One pattern they see is filtering the input using code execution based on model priors — the ability to filter input context without explicitly seeing it.
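The "defer reasoning through recursive LM calls, then stitch the outputs" pattern mentioned above can be sketched like this. `llm_query` is a stand-in for the harness's sub-LM call, stubbed here so the sketch runs; the chunking scheme is my assumption:

```python
# Sketch of the map-over-chunks pattern: chunk the stored context,
# ask a fresh sub-LM about each chunk, and stitch the answers
# together in code to form one composite output.
def llm_query(prompt: str) -> str:
    # Stub: a real call would return a model completion.
    return f"summary({prompt.splitlines()[0]})"

def map_over_chunks(context: str, question: str, chunk_size: int = 1000) -> str:
    chunks = [context[i:i + chunk_size] for i in range(0, len(context), chunk_size)]
    partials = [llm_query(f"{question}\n\nCONTEXT:\n{c}") for c in chunks]
    return "\n".join(partials)  # composite answer, built in code

ctx = "chapter-A\n" + "x" * 1500 + "\nchapter-B\n" + "y" * 1500
out = map_over_chunks(ctx, "List items belonging to people")
```

Each sub-call starts from fresh context, so the stitched result can be far longer than any single model's window would allow.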
They don't know what's in the prompt, but they filter whatever they need to filter based on prior information and prior ideas about what should be in there. The model doesn't need to load much in order to filter and get what it needs — things like running regexes for information that should technically be there, or, as you see here, "find all items that are listed as belonging to people". It does a lot of inferring what should be in the data, then actually goes in and does it, or has another model do it. They also do a lot of chunking, like we saw: splitting the prompt into chapters and asking a sub-LM to go fetch information within a chapter. And they do a lot of verification [laughter] — this is the part I find funny. Sometimes they just panic and verify, verify, verify. There's a failure case that's super funny, with Qwen3-Coder I think: it verifies forever, gets the right answer, then [laughter] discards everything it did and returns the wrong answer. So the model can verify its information through sub-LM calls, and it can pass recursive LM outputs into variables for long-output tasks. That's a way to safeguard its context: this output might be gigantic — you don't know, maybe it's 20K tokens — but to the model it's literally just one variable. It uses these variables and stitches them together; it doesn't have to see the contents. It can just trust that it's good, or maybe verify a bit.
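That variable-encapsulation trick looks something like this in practice — again a stubbed sketch, with `llm_query` as an assumed name and made-up sizes:

```python
# Sketch of "variable encapsulation": a sub-call's huge output stays
# in the REPL as a variable; the root model only puts cheap metadata
# (length, a short head) back into its own context before deciding
# whether to open it.
def llm_query(prompt: str) -> str:
    # Stub standing in for a sub-LM that returns a very long answer.
    return "ITEM: lamp\n" * 5_000

part1 = llm_query("List every item in chapter 1")   # never printed whole

# What the root model actually sees in its context:
report = f"part1 holds {len(part1)} chars; head: {part1[:11]!r}"
safe_to_open = len(part1) < 2_000                   # budget check first
```

So a 55,000-character result costs the root model one short report line, not 55,000 characters of window — that's the safeguard.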
Or the model can just look at the length of a variable and decide whether it's a good idea to open it up and load it into context. So variable encapsulation is a very interesting way for it to safeguard its context — it's always operating in a healthier region of the context window than raw LM calls would. Now, the baselines. We have the RLM with the REPL, and the RLM with no sub-calls — in the no-sub-call setting, all the sub-agent spawning is just gone; you do everything in the REPL directly, and already that's better in some cases. They also benchmark against a summary agent, which is another common methodology: basically summarization. There are different ways of doing it, but it's always "whatever happened in the past is now summarized into a smaller format, then you add the next piece of context and keep going, and once it gets too big you compact it — you compress — and keep doing that", a bit like what happens with Claude and such, the compaction type of methodology. The issue is that if the information is very sensitive to its exact format, it can get lost after too many compactions — it's lossy long context. So that's what they do here: in iterative fashion, the agent is given input until its context is full, at which point it's instructed to summarize all relevant information and continue — and they use a smaller agent for the summarization.
The last baseline they benchmark is CodeAct, which is in the ReAct family of frameworks — I always forget exactly what ReAct stands for, but it's: you think, you take an action in the environment, and you get a response, back and forth. In CodeAct, you think, and then the action is code: you shove the code into the environment and you get a response, and you can do that multiple times. So it's similar to what happens with the RLM, but within the ReAct framework for agents — the key element being that it uses code as the action instead of doing a bunch of tool calls with JSON in, JSON out. The main result is what we saw earlier: it works, which is fantastic to see, and the pricing is not too crazy. It's highly variable, though — some calls are massively more expensive than others. In the 50th to 75th percentiles it's pretty similar all across, but prices start to explode with GPT-5 at the 95th percentile: CodeAct and the summary agent get a massive multiplier there, while the RLM is roughly double — still not crazy high, and much cheaper than those, with fewer biases about what it can be doing. Qwen is a similar story, but Qwen seems to be more pricey — and remember, Qwen is the one with a tendency to just go ahead and not care too much about compute cost. Now let's look at the exact numbers in more detail. We have these benchmarks — CodeQA, BrowseComp-Plus, OOLONG, and OOLONG-Pairs — with the rough task lengths.
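To make the CodeAct comparison concrete, here's a toy version of its think → code → observe loop. The `pick_action` policy is a hard-coded stub standing in for the model, so this is an illustration of the loop's shape, not a real agent:

```python
# Toy sketch of a CodeAct-style loop: each turn the "action" is a
# Python snippet executed in an environment, and the printed result
# is the observation fed back for the next turn.
import io
import contextlib

def pick_action(observations):
    # Stand-in for the model: two scripted turns, then stop.
    script = ["print(2 + 2)", "print('done')"]
    return script[len(observations)] if len(observations) < len(script) else None

def codeact_loop(env_ns):
    observations = []
    while (code := pick_action(observations)) is not None:
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf):
            exec(code, env_ns)               # action = run code
        observations.append(buf.getvalue())  # observation = its output
    return observations

obs = codeact_loop({})
```

The structural difference from the RLM is what lives in `env_ns`: the RLM additionally stores the entire prompt there as a variable and exposes recursive sub-LM calls, whereas plain CodeAct is just this act-observe cycle.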
You see here BrowseComp-Plus can go up to something like 11 million tokens, which is kind of wild. The base model just isn't able to do this at all — that part is impossible for it; it's too big a context to ingest, so you need a harness to do anything. If you put on the CodeAct harness or the summary harness, it gets better — you have 12% here, 30%, 8% — and it does better on CodeQA generally. OOLONG-Pairs is quadratic, so it's a bit more difficult; CodeAct is doing not too badly on OOLONG. But the RLM with no sub-calls, with Qwen, is actually doing better on these two benchmarks, and it's still doing really fine here with no sub-calls at all — so that's just one LLM in a harness, more similar to CodeAct than anything else. When you give it the sub-agents, funnily enough, Qwen3-Coder's performance drops a bit on this one, but increases on these others, and it's fairly similar over there. OOLONG and OOLONG-Pairs are more semantically oriented, while CodeQA is more exact-match type stuff — so it kind of makes sense that no-sub-call is better there. This suggests some models have more trouble deciding when to use which capability, for which type of case. And then GPT-5 is just generally much stronger than Qwen — except in the no-sub-call case here — but overall its performance is better across the board, and better with sub-calls, so it seems better at deciding when to use them. So, we have a bunch of observations here. If you have a question, by the way, just ask; I'm going to keep going through the whole thing. Wait a second... yeah, good. The first observation: you can scale the RLM to the 10-million-token regime, and it outperforms the base LM.
10 million is crazy, right? But it's just able to do it, kind of out of the box, without even sub-calls — just by having the REPL environment. Second observation: for long inputs, the REPL is necessary, while recursive sub-calling provides strong benefits on information-dense inputs. If it's coding, sub-calling is a bit less relevant — you see it even with GPT-5, because the task isn't that semantically complicated; it's a bit more sparse. Third: LM performance degrades as a function of input length and problem complexity. You can clearly see the degradation here — it's not just one axis of length; it's an axis of length and an axis of task complexity, and together they tell you how much degradation to expect. So it's really task-specific. That's something to keep in mind if you're working on long-context stuff: if you can make your task easier, it's always better — generally, just always better. Fourth: the inference costs of the RLM remain comparable to a base-model call, but are high-variance due to differences in trajectory length. We saw the high variance over here. But look — I didn't catch it at first — this is the cost, and this is the variance. The variance is kind of high for no-sub-call, and kind of high here too. So you have to check those, but it's generally not crazy costly. That's about it. And I think they did an experiment where they scale the number of documents.
So here you have 100, you have 10,000 documents in context, and in this experiment in the appendix you see what happens with the degradation. Because remember, on BrowseComp-Plus the base models are just not able to do that one at all. But if you just increase the number of documents, you can see the degradation: GPT-5 just nosedives straight down — with 10 documents it's good, with 50 barely, and at 100 it's almost nonexistent, right? While the RLM is doing okay all across, and even better when it has all the information over here, for some reason I'm not sure about. And without the crazy cost: here you see ReAct is much more expensive at 1,000 documents, but the RLM is still relatively cheap. So they're able to scale well without performance degradation, and the inference cost scales not too bad. Yeah, that's kind of that. So that's it — that's the RLM. Any questions, folks, on what we just saw? We see there's a comment here. I believe it's that, plus modifying the prompt, plus the REPL. The REPL is doing a lot of the heavy lifting, right? Because the context is stored inside this sandbox environment, and it's getting processed and manipulated. That's kind of the magic of it. And then there are also the sub-agent calls being fired up over and over again, especially in Qwen3 [laughter] — we're going to see what's going on. And so there's a bunch of trajectories that they put in the paper, which are pretty cool, right? You have this one with GPT-5, which is a happy path: you have a thousand documents over there, and you can literally see what the model is doing in code, because that's how it interacts with the environment. So it's searching for specific keywords and looking at very specific steps.
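The kind of REPL trajectory being described might look roughly like this. All names here (`llm_query`, the stub documents) are invented for illustration — this is a sketch of the interaction pattern, not the paper's actual API:

```python
import re

# Hypothetical sketch of an RLM-style REPL trajectory.
# The long context lives as a plain Python string in the sandbox;
# the model writes code like this instead of reading it all at once.

def llm_query(prompt: str) -> str:
    """Stand-in for a recursive sub-LM call; a real RLM would call a model here."""
    return "It is about the town festival honoring Maria Dalmasio, held in 1912."

# Imagine this is ~10M tokens of concatenated documents.
context = (
    "doc 001: unrelated filler text ... "
    "doc 002: The annual town festival honoring Maria Dalmasio was held in 1912. "
    "doc 003: more unrelated filler ..."
)

# Step 1: probe the context -- how big is it?
print(len(context))

# Step 2: regex search for keywords the model chose itself
hits = [m.start() for m in re.finditer(r"festival", context)]
print(hits)

# Step 3: pull a window around a promising hit
snippet = context[max(0, hits[0] - 40): hits[0] + 120]

# Step 4: fire a sub-LM call on just that small snippet
print(llm_query("Extract: what festival is this about, and what year? " + snippet))
```

The root model only ever loads small snippets into its own context; everything else stays in the sandbox as data.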
You see the window here, and it's trying to find snippets that make sense. And these are, I don't know [laughter], the keywords it's looking for — so it created the keywords, it's looking for them, running the regex query, you get the response — and I think this one is making a sub-LM call to find the answer. This is what it's asking of one of the sub-LMs: the root node is asking a sub-agent to "extract the following from this article: what festival in town is this about, and in what year is this specific celebration held," and so on. And then the sub-agent will respond with all of this, and based on that it will check the information, right? That's what it's going to have. And then it's able to figure out that, yeah, this is the answer: it's Maria Dalmasio. And this is what it will output at the end, right — print winner, first and last name. So it's cool: none of this is hardcoded or tooled; it just goes and does Python. Yeah. "What's the visualization platform where these screenshots were collected? Looks cool." They actually mention it in the paper: they just vibe coded it. Literally, they vibe coded it and then they're using it. Maybe it's in the GitHub repository, you can check it out — all the code for the RLM is open source. "So I haven't read the paper, but how well does this work with other modalities, particularly audio and visual?" That's a great question — we're going to ask him that. Well, Alex is there. Alex, you can answer this one if you want to, or you can wait like 8 minutes before you come on stream. But this is mainly text-based, yeah. Is this the funny one? This is the funny one. We're going to take a look at this one. Right, this last one, right?
So this one cost a dollar [laughter], and this is the question, and it's on OOLONG-Pairs, so it's kind of difficult — you have to double-check and verify a whole bunch of stuff. It says the model begins by probing the context with various code snippets, right? It's probing it. And then it decides to check things semantically and classify the data using a sub-agent call. Then it goes and processes this in batches — you see it created this function which does the LM query, and it's continuously calling this function — and then the recursive LM calls do their thing. Then the root LM checks whether a given instance satisfies the query, in this case, and it has the pair and the information. And then what it does here: it continuously verifies its answer. This is Qwen3 Coder, right? [laughter] So it repeats this process, attempts again to regenerate the answer, does that five times over and over, and it returns the same answer each time. And when it finally has to give an actual final answer, because it's starting to run out of context, it's just the root LM generating an answer out of nowhere and spitting it out — which is the wrong answer. [laughter] So, I don't know, I found this part absolutely — it tells you a lot, right? Qwen3 Coder will do that, while GPT-5 seems to do this neurotic double-checking a bit less. Just the fact that it doesn't trust the stuff being produced by, technically, itself — these sub-agents — and keeps verifying again and again tells you a lot about the benchmark, the difficulty of
it, but also about what the LLMs are actually doing when they get to difficult stuff like that. So I think there's maybe something that can be done to steer the model into being more sure about what it's looking at. But it really looked like, I don't know, a student with a thousand documents open, double-checking and double-checking and just not trusting itself. Yeah, this one was funny. There are other ones, but you can read the paper — it's all there, and it gives you a sense of what's going on. And I really like that about the paper: the quantitative information is fine, it's all good, but I really like seeing the exact qualitative vibe of how these things actually run. By the way, you can also try to run it on your own — it's literally a harness that you put around the model. And the last part I really liked is the negative results. Honestly, I think all papers should have that: you found something cool, perfect — now just tell us what didn't work so we don't try it again. So: using the exact same RLM system prompt across all models can be problematic. This is the Qwen3 thing — they had to make that change, otherwise Qwen3 was just spinning like crazy. [clears throat] And this part is interesting: models without sufficient coding capability struggle as RLMs. The corollary — the converse — also seems to be true: models that are good at coding seem to be good at long context, because they can manipulate the REPL environment efficiently. Thinking models without sufficient output tokens struggle as RLMs — that's also an interesting one. So they tested a whole bunch. You can also check out this other blog post from Prime, "Recursive Language Models: The Paradigm of 2026," by Sebastian Mame.
He tried a whole bunch of other LLMs too, so you get a good overview of what's going on. RLMs without asynchronous LM calls are slow — in this case the calls are blocking, so you're doing things step by step. Maybe doing it asynchronously would help, but then you get into the weird setup of: what happens, how do you continue, if stuff isn't done yet with another sub-agent? Which is an interesting problem, I think. And, depending on the model, distinguishing between a final answer and a thought is brittle for RLMs, which is also super interesting — and maybe it tells us a lot about those failure cases, right? Because maybe it couldn't recognize that it had the final answer for real this time, so it should just go and commit, and it still thinks it's maybe just a thought, and it needs to get the final answer again and again. I don't know — this whole interaction is really grad-student-coded, which is a bit fun. There are a bunch of limitations here that you can read about, what the approach holds up to. But we're going to actually ask Alex a whole bunch of questions right about now. So yeah, we can have you hop on whenever you want, and then we can start this up. Okay, I think you're all set up to come over here, and I'm going to pull out my 20-plus questions. Hello. >> Hey. How are you? >> Good, good. How are you? >> Good. Not too bad. Well, thank you for coming and [laughter] answering questions for all of us. I think it's really cool — I really liked the paper. >> Thank you. >> It also reminds me of the TRM paper, actually — very, I don't know, simple, not in the sense that it's nothing impressive, but simple in the sense that it was easy to follow. >> Um, >> So we like that.
To start off, can you tell us a bit about yourself — your background and research interests — for everybody here? >> 100%. Yeah. And I'll preface by saying: hopefully the paper was easy to follow. Actually, in some of the earlier iterations of the paper I had wanted to write in some theory-related motivation — there was a whole thing about why certain problems are harder than others, which I will actually talk about today — but I scrapped it from the paper, mainly because I think we didn't have strong enough evidence to support that what I was claiming was actually true. But intuitively you can see a lot of it. Anyway, for context, I'm a first-year PhD student at MIT. I think it's very funny when people say I'm an "MIT researcher" — I've only been there for like three months. I graduated from my undergrad at Princeton back in 2024. I originally wanted to do math, funnily enough — I guess it's what the school is known for. But I did a bunch of really random stuff: I did some exploration with a friend into blockchain schemes, and all this random stuff; I did a bunch of more core RL work earlier on in my undergrad; and then eventually, near the end, I started working with the SWE-bench team on SWE-bench-related stuff, mainly SWE-bench Multimodal, which came out last year. And then I got really involved in GPU MODE, which is also really random — I was really interested in GPU programming, and I currently help host all the competitions, which is not at all related to RLMs.
But the point I'm trying to make is that a lot of my current research interests and ideas are kind of motivated by lots of random things I've done in the past. And the field moves super fast, so there isn't really one thing to be fixated or focused on. But yeah, currently RLMs are my main research focus — and it's not necessarily about long-context problems, which I'll also elaborate on as we go. But I think that, in general, this year there's a lot of really, really interesting research we can do for language models that isn't necessarily systems work or infra work. >> Cool, yeah, it's really cool. And I see that you've done a whole bunch of benchmarking. >> Yes. >> That's very — can you tell us a bit about what motivated that work too? >> So I will say — I'm not going to lie and say, you know, I love making benchmarks. I don't think making benchmarks is fun. Everyone wants to do the flashy thing, right? You want to train a 100B model and do all this cool stuff. And for context, the benchmarks being referred to: SWE-bench Multimodal is the first one, with the SWE-bench team; then KernelBench, which is LLM-generated GPU kernels and evaluating those; and the most recent one is this thing called VideoGameBench, which is trying to evaluate vision-language models that can play an assortment of games — games from the '90s like Doom, Mario, Kirby, these kinds of things. But I think — you actually talked about it earlier — this RLM work kind of highlights, too, that we don't have a lot of good benchmarks.
And the reason I ended up working on benchmarks every single time was because I wanted to work on a particular problem, but there was just no eval for it. >> That's it. >> And I think this is actually a really big problem in the field in general. I have strong opinions on what evals should look like. We kind of have a problem where current evals are not, you know, a great indicator of how good a model actually is, even on the task you're evaling. So, for example, there are a lot of evals for math and coding, but they're not even great evals for whether or not a model is good at math or coding. So this is something that, you know, needs to change. And I don't think a lot of people like working on evals — I'll be honest, I don't think it's fun — but I think it's really important work. And, unfortunately, I might have to come up with evals for RLMs and future works like this, but we'll see. >> Yeah, I think generally long context is worse because you really have to think hard, right? There are a lot of documents — then what are you going to do, verify each one? No, man. So you have to kind of build the eval and do it properly.
I mean, actually, the people that have a lot of long-context stuff that's relevant for LLMs are the closed labs, and why would they publish their internal data, right? So that's a bit of the problem. And I realized that when I was reading the MiniMax paper, MiniMax-Text-01, because they were talking about this long-context translation task where there's a dead language only a hundred people speak, and there's this book that you give in context, and then you check if the model can do the translation. On paper it makes a lot of sense, but then what happens is that at some point even the newer models are getting good at the language without the book, because they're seeing the book at some point in training. So it's even harder to come up with good long-context evals. You don't really know if the model is actually able to process the longer input. So when you actually give it your stuff, it just sucks, and then you're like, why does it suck? It doesn't have the stuff in context; it's never seen that big of an input. And that's kind of the end of the story — there's nothing you can do with it. Which is actually what interested me a lot with this paper: instead of trying to bring all this information into context and come up with massive amounts of long-context training data, the model can have a small context — that's all fine — but it gets the tools it needs to mine the information, a bit like, I don't know, a PhD or a graduate student. They don't have everything in context; they have the data sitting there, they have their Jupyter notebook, and they just go, and they learn, and at the end of a session with the notebook >> they get some insight out of it.
Maybe it's wrong, but they get some insight out of it, and [laughter] then they can >> move on. >> Yeah. Also, funny story: in the paper there are only four benchmarks, and even the code QA one — I think you mentioned it — is part of a larger benchmark, LongBench v2, which I think is arguably the most difficult long-context benchmark right now. But we actually evaluated on a lot more benchmarks than are shown in the paper, and the problem we kept finding with a lot of them was either (a) the RLM can just solve it, basically — but we don't report this, because the way it solves it is by using the code environment, without even needing the sub-LM calls, which is kind of a silly thing. So one of the examples: another task in that benchmark is computing a long arithmetic sequence, but if you plug this into Python, of course you can just do it — it's not a hard problem. But GPT-5 can't do it — you can't get everything correct — so the scores you'd report are like, oh, the RLM gets 100% and GPT-5 gets 0%. But that's not really an interesting result, because, well, of course, right? >> And the other problem is (b) the base model can just do it — it can actually just solve the task. One of the issues we ran into that was really silly: we had examples where the task would be about a book or something, and you take the book away — I tested this, you just remove the book — and the model can still solve the task. It doesn't need to read the book, because it already knows what's in there. So it's really, really hard to find good evals.
And I think for anyone that's interested in doing research right now, this is an open problem. It's genuinely low-hanging fruit: you just need to find a good eval that's realistic and that people, you know, actually want to see models solve. >> Yeah, I think that part is important — that you actually want the model to be able to solve it — because toy problems are cool, but when you actually bring the model to whatever you're building, already knowing that it can solve a related task is something. Okay, I think we got through the high-level overview. Can you talk to me a bit about the intuition >> Yeah. >> that led to this? Because if you look at the related work, you could have gone all sorts of different ways, but you decided on a REPL environment. What triggered this? >> Yeah, so this is actually a really important question. I don't want to say that, you know, I'm a genius and this is a completely new idea no one's ever thought of — and I have seen a lot of that online, like, "doesn't Claude Code do this," "doesn't XYZ method do this," "doesn't OpenAI already do this in Codex." And I think, to some extent, yes. So the way this idea came about: basically, nowadays, I think models are really, really good. And I want to preface with this because I think this is a very timely idea. If you tried to do this maybe last year — we've tried this with DeepSeek-R1, for example — it's actually not very good at doing this whole thing.
And I think a lot of the fundamental model architecture and training research was really, really important to lead up to this point. I think the best way to think about this is: methods like Claude Code and Codex do this kind of very smart codebase-management thing where, instead of feeding the codebase to a model, they use special tools. I'd actually say this started with SWE-agent and OpenHands last year — this idea of not feeding all of the information directly to the model, and using specialized tools to navigate that information. But this idea was very exclusive to code, or, like, software engineering tasks. And I think the intuition came from: a programmer is not going to read the whole codebase at once. And the core intuition behind this idea is: well, you can actually do this for any task. And actually, Claude Code and all these scaffolds are highly specific to code — you can use them for other tasks, but the models themselves are post-trained specifically to solve coding tasks. But I think the nuance here is: yes, with RLMs the idea is really good, it can do long-context things, and this is really important — you can slot in an existing model and show that it works. For the purposes of the paper, this was the most important experiment we wanted to show. But actually the more subtle thing here — and why I'm excited beyond this initial paper — is the implication that you can actually start post-training models to do this kind of paradigm. And this is a lot cheaper than trying to extend the context window of a model, or build a larger model.
And I think we're starting to enter an era where these models are actually really good — these transformer-based neural networks are really, really powerful. But — and I want to be careful with my words here, because, you know, we should continue to improve models — it's exponentially more expensive to even double the context window of a model. The point I'm trying to make is: they're already so powerful at transforming text of a certain context window size into text of a certain context window size, and you can actually chain these things together and produce a significantly more interesting system without incurring really, really expensive scaling costs. So in my mind, this is another axis of scale that is very interesting. Now, we don't talk about this a lot in the paper, because as a paper we can't claim things we can't prove. >> Yeah. >> But I think this is a really important point — I think it's actually why, for example, Prime Intellect is really interested in this approach. It's not necessarily just the long-context part; there's a piece of it that's really, really important, which is that maybe all future models will actually interact with your context in this way. >> Yeah, for me, what sparked my attention here is that it has a similar shape, in terms of the thinking, to the chain-of-thought kind of trick, right? Because it's simple enough, it works across the board, but then you can actually RL it so that it's even better, right? And we haven't done this yet, but I think it will, since it has the same characteristics as that setup.
And the other thing: yes, coding is economically super important, I think, for the whole field, because if this starts to work even better on bigger codebases with just a simple framework anyone can use, then there will be more tools that can work on bigger codebases, and you get into enterprise, and the money is there, right? This I understand. But >> in my view it unlocks the whole scientific-research-agent type of stuff — agents able to go for longer amounts of time >> and dig into things, because fundamentally this is a grad student. [laughter] That's the feeling I had: hey, I have this big task, and it's complicated, right? It's super long, there's a lot of stuff I need to take a look at, so I'm going to parse through it a bit, figure out "this is relevant, this is less so," and give myself a task for next week — and my task for next week is to figure out this chunk, so you go and figure out that chunk and get it out, and then you keep working. And this flow — and literally it's almost a notebook, a Jupyter notebook — this flow leads to some amount of facts or information being found that can then be used to answer a bigger question, right? >> This I really liked — it has this setup. The other thing I really like, and I want to check with you whether it was on purpose: it's very minimalistic. There's no massive harness, no bells and whistles — it's a system prompt and a REPL. Was this done on purpose, or did you try a whole bunch of stuff before getting to that? >> Yeah, so this was intentional, but we also tried a lot of stuff before that.
So I think what we ended up settling on — I don't love calling this a scaffold, because, to be clear, it totally is a scaffold, but I think "scaffold" makes it sound like this is a new type of agent we're building that we want you to use. More fundamentally, what this is is a very particular way to do model inference. And keeping it as minimalistic as possible is very important for that, because you can't afford to train a model just to be used as an agent — unless you're Anthropic and your goal is to sell Claude Code. If we're more interested in general model capabilities, we want something as, you know, thin as possible on top of the model. And I think that was the guiding principle for how we designed this. >> That makes a whole bunch of sense. Wait, there's a question here. >> Yeah. >> "Just wondering, are there problems where the context expands instead of contracting at the meta level, which you might want when it all can't fit in the LM's context?" I think it's talking about fact generation and stuff like that — do you have a... >> I think I understand the question. There was a similar question earlier about what happens if the context window of one of the models fills up — one of the intermediate models. Yeah, this is definitely a concern. I think in the long term, the hope here — one of the core ideas of an RLM — is that no single model call should ever exceed a certain length. That's kind of the hope right now; how you actually guarantee this is not super easy. I think what we found in our experiments is that if we just let this thing run as is, it actually just never fills up. It never even gets close to filling up.
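The invariant Alex describes here — no single model call ever exceeding a certain context length — can be sketched as a toy recursion. This is an illustration of the idea, not the paper's implementation; all names are hypothetical, and the binary split is just the simplest possible splitting strategy:

```python
# Toy sketch: an over-long context is split in half, and each half is
# handled by a child RLM, so the per-call context budget holds at every
# level of the recursion tree. The parent only ever sees the short
# answers returned by its children.

MAX_CHARS = 200   # per-call context budget (toy number)
max_seen = 0      # track the largest context any single call received

def llm_call(prompt: str, context: str) -> str:
    """Stand-in for one bounded model call."""
    global max_seen
    max_seen = max(max_seen, len(context))
    return f"partial answer from {len(context)} chars"

def rlm(prompt: str, context: str) -> str:
    if len(context) <= MAX_CHARS:
        return llm_call(prompt, context)
    # Recurse: each child is itself an RLM, so the budget is respected
    # at every depth; the merged child answers stay short as well.
    mid = len(context) // 2
    merged = rlm(prompt, context[:mid]) + " | " + rlm(prompt, context[mid:])
    return llm_call(prompt, merged)

print(rlm("find the answer", "x" * 5000))
print("largest single-call context:", max_seen)
```

However deep the input forces the recursion to go, `max_seen` never exceeds the budget — that's the whole point of the splitting.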
But you can imagine that on harder and harder tasks, maybe it does fill up. I think what this should ultimately look like, in its full form, is a recursive language model spawning another recursive language model — and in that sense, the actual intermediate model calls will never exceed a certain length. You also could implement tricks like compaction, which Claude Code currently does. I don't love this, because I think it kind of takes away from the core idea, which is that >> this entire process should be a, quote-unquote, no-information-loss process. And what I mean by that: the reason why we store everything in the REPL is that the model should technically, in theory, have access to all the information in its purest form, not in a compressed way — and you kind of want to maintain that throughout the trajectory of the model as well. But yeah — sorry, the short answer is: we haven't run into this issue in the current experiments we've run, but it's totally a plausible issue. And I think it does get solved by deeper and deeper recursion, because the idea is you keep splitting up the context. But we don't have a strong, robust guarantee here. [clears throat] Maybe that's a future-work kind of thing as well. >> And in my view, compaction is kind of interesting in the sense that if you're working toward a specific goal, then compaction kind of makes sense, right? But I come back to this grad-student type of workflow. Imagine the grad student cleaned the data in a specific way and then threw away the raw data — you'd be kicked out of the lab ASAP, right? You do this and you mess up, I don't know, the filtering or whatever it is.
Well, congratulations, right? What the heck are we going to do now, right? >> And I always come back to the scientific-discovery shape, because I think this has high potential in that realm: you have the thing in its raw form, and you're digging through it. Right. >> Right. >> And I think what Mataki is hinting at is that there are these experiments now, for scientific discovery, where the model runs for days or whatever, and at some point it creates a lot of context — say the REPL transcript is literally a hundred meters long in terms of back and forth — so at some theoretical point it should fill up the whole context, even if the output should just be one answer, a name, whatever it is. >> Yeah. >> But having the raw data and digging through it is already a big piece of the puzzle of being able to generate insights and facts out of it. I have a literally dumb proposition here, which [laughter] amounts to literally jamming a MySQL database in there and storing a bunch of facts. You already have the REPL there — why not add a MySQL instance and store a bunch of stuff? But we're going to get to this. Before we dive into that, I just want to get your raw thoughts about long context in the model itself, right? I've done a video on this and I've been digging through it — it's just literally a hard problem, because you always get into this kind of trade-off, right? And to make it good, say you linearize the attention or whatever the heck, you do block stuff, you do lightning attention, you do all sorts of weird, wacky things, and then you optimize GPUs to make it actually faster, right?
Do you think there's still something there to juice, or is it just the wrong way of thinking about the long-context problem? >> Yeah. So I think it is the wrong way to think about it, and I'll explain why. Scaling the context window of a language model has two main challenges. The first is the systems-level challenge. Attention is actually not even the issue, usually, but let's say attention is quadratic; maybe you can use linear attention, sliding windows, stuff like this, and maybe you need more GPUs to train a larger model, et cetera. Maybe you're 10x-ing the cost of your training run. This is definitely a challenge, but my take is that if this were the only challenge, we would be able to keep improving the models significantly; we would be able to extend the context window much further beyond what we currently have. I believe this, and I think if you ask anyone in the systems community, they would likely agree. Maybe there's some really strange reason why you can't, but in its current form, scaling compute and scaling model size is purely a cost issue; I don't think we've hit that wall yet. I think the more subtle issue is the data, and this is a core reason why I think RLMs are so cool. We often take for granted that the way we've trained language models is effectively using the internet, using naturally occurring language, and learning that distribution. But this naturally occurring language distribution is not unbounded in length.
The sequences we observe in the wild tend to be distributed with some mean length and variance, right? And I think we have gotten away with language models continuing to improve because we have these naturally occurring sequences that capture the distribution we want. The way we have done longer and longer context is by generating synthetically long sequences and training on them. The problem, though, is that it's not fully convincing to me that doing this nets you any longer-term benefit. I think the greatest example of this is the practical failure of reasoning. Reasoning models are really good, don't get me wrong; they're amazing, a great breakthrough of the past year. But a lot of papers have come out recently that basically show experimentally that reasoning is this really silly thing, because the actual content of the reasoning trace is almost irrelevant to the final answer.
Part of the reason this happens is that, at this scale, as you get longer and longer sequences you need exponentially more data to fit a proper distribution; there's an entropy argument here. [snorts] And I think what's going on with these long reasoning chains is that the good part of a long chain is that it conditions your model well to produce the right output: you can think of it as a way to pick out the correct output from among the candidates, and we've seen reasoning as a way to do this. But what ends up happening, and we saw this with the Qwen experiments and the RLVR stuff, is that you can post-train on essentially random reasoning and still get good answers. This is a really odd thing. The beauty of the RLM approach is that we can keep the language model's input and output distributions within a length that actually occurs naturally. It's a weird thing to wrap your head around, but I think this is the whole context-rot phenomenon we've been observing: you make the sequence really, really long and all of a sudden the model's performance just tanks, and it's like, why? Part of the reason is, and I don't love thinking of models in an anthropomorphized way, a human obviously would not make these kinds of mistakes, but a human also learns in a very different way. So we should think of these models as, yes, very impressive.
They generalize well, all these things, but at the end of the day we do have to think in a mathematically principled way about how they were trained and what they are doing. If you think this way, it's very obvious that just taking the transformer and training it on huge context sequences is a really difficult thing to do. With RLMs, the idea is that we actually don't have to: we can do long-context things without training in a long-context way. And there are a lot of benefits to doing it this way. But yeah, that's my take on that. >> No, I agree. And my other view of long context, if you want to go play around in the internals of attention or how everything is set up: the issue is also that these methods are theoretically better, theoretically faster. >> Yes. >> But in practice, FlashAttention v-whatever is much better on every metric, because it's optimized for the GPUs. So you get these theoretically fantastic advancements that are absolutely dinky and worthless in practice, and then nobody is going to do the work of building the GPU kernels you'd need to use them efficiently.
So then it's like, why did we even do this stuff? And I think if there is a method that can sidestep this, for sure we're not going to push further into long context; the slice of context the models already see is enough. Also, reasoning for me, >> I see it in two ways. The first way is that it just helps the model filter out wrong kinds of answers and bias it toward where the answer should be, because technically you could just inference-scale a first shot and the right answer would maybe be somewhere in that mess; if you were able to pluck it out, you could do it either way. But also, if you look at it from an activations perspective, you get these traces of activations happening in the model, because the model doesn't have a state. One activation, combined with another input, gives you the next activation, and at some point the state gets the right shape to produce the right activation for the task. So reasoning is nice, but the fact that it's externalized is a bit weird, and as it goes, it's just using the useful part of the context inside. >> Yeah. >> Okay, cool. This is good. We have a lot of ground to cover. >> I'm happy to stay on for longer, by the way. >> Jeez, guys, this is going to be a marathon. We're going to be here the whole day. Wait, there's a bunch of questions. Somebody says: you mentioned Claude Code and other agents already do intelligent context retrieval and management to some extent; the exciting part is perhaps more on the post-training side. So this is a question about whether the interesting part of the RLM is the RL we can do on it. What's your take on that? >> Yeah.
So, I want to be careful here, because I'm still getting used to, as a new PhD student, what I can claim and what I cannot claim. I think the interesting part of the RLM paper is still probably the main result. One of the things we wanted to solidify in this paper, which I wanted to do in math but maybe the best way is just plain English, is that long-context tasks are not equal. I don't know why this wasn't made clear in the past, but obviously needle-in-the-haystack is a very easy thing to do, whereas if you're given a really dense long context, it's really hard to process. And the main contribution is that even with no training, you can use this really simple, task-agnostic method to take current models and scale their performance on really long sequences, and it can handle both really dense and really sparse inputs very well. We have the results, and on its own this is already a really cool result; even if there were no RL, no future-training angle, this is a cool paper, and I'm very happy to publish something like this. The RL part, though, the training part, is more about why I think this paper is interesting beyond this year: why is my research still focused on RLMs even after this paper has come out, and we'll probably try to publish it somewhere. That's where all this other speculation comes in, and a lot of it is grounded in intuition.
A lot of people are also seeing that this is a really interesting bet to make, similar to chain of thought and some of the other things that have worked in the past. But yeah. >> Yeah. This paper also reminds me of, I think it's the Meta paper, where they jammed >> a coding environment into the world model of the thing and it got better. I just want to bring this up. I still don't want to dunk on anybody working at Meta, right? But at this point, when I saw Llama 4 Scout with that whole block about needle-in-a-haystack to 10 million tokens, I knew it was absolutely worthless; it's not just a needle-in-a-haystack problem, it's much more complicated than that. I think you put it well, and I would have liked this framing pushed even further, this angle between the size of the context and the difficulty of the task. I think this is something that is generally not well explained in the discourse. And the other part is the average useful window size of the model: if we have these three axes, then it's a bit easier to say, roughly speaking, for this specific task, how hard it will be for this specific model to interact with it. Okay, cool. I had a question about the RLM structure, because I think there's a whole bunch of questions about it. My first question: you chose a REPL for this, right? Which I think makes a lot of sense, but you did an ablation on the sub-agents, right? So you removed the sub-agents.
Do you think you could do the ablation the other way around, where there is no REPL and it's just a whole bunch of sub-agents working on the context without it being fully loaded? This part, I think, was maybe missing, in the sense that you don't have to load the full thing; the context is still an environment variable somewhere, but the sub-agents aren't writing code, they're just working on it. Did you think about this, or is it useless compared to the other setup? >> Good question. No, this is something that we missed, and actually we're running it right now. There are two new things I'm adding to the paper, which I'm not going to make a big announcement about; we're also submitting it to a conference. One of them is exactly that: we need a baseline that is effectively, can you take ReAct or CodeAct or something and give it sub-agents, but without this offloading-into-a-REPL thing. And the point of showing this is, another thing I want to be clear on with the RLM work: the idea of taking a model and giving it access to sub-models is not new. We're not the first people to do it; there are a few other works that have tried this, and obviously Claude Code intrinsically does it. I think there's an argument to be made that the sub-agent way of doing it will get phased out. What I mean is, the way they do sub-agents is you define the sub-agent and then Claude Code will be smart about using it. Ultimately, in the long run, I think this will just be completely removed, and Claude will decide what sub-agents it wants to use.
But even ignoring that, the thing I want to be clear about is that there are two key parts of the RLM. One, obviously, is the recursion part; the second is how you actually do the recursion. This is a non-obvious thing, and the REPL is one way to do it. There have been some other proposed ways I've seen online, like using a file system and bash commands; also great. The reason we chose the REPL, as was mentioned earlier, is that these models are pretty good at coding. Maybe Claude, maybe Opus 4.5, can also do file-system management really well, and that's great; we should definitely try it. That's one of the things we want to implement in the open-source library if people want to use it. But the REPL, I think, is the most intuitive way: Python is really easy to read, and it's really easy to take something said in English and write it out in Python. So yeah, this baseline is very important. We're currently running it, and the results are probably what you'd expect. The main thing is that this setup cannot handle long context, for obvious reasons: it still has to ingest the full prompt. But yeah. >> Okay, that makes a lot of sense. We have a bunch of questions about sub-agents. My first one is twofold: does the model know how many sub-agents it is spawning, and does a sub-agent know that it's a sub-agent rather than just running a task? >> Yeah, so in the current setup, no; we actually provide as little information as possible. The reason being, if it can work without this information, that's great, and people can experiment with it, tune it, if they think it'll work better.
The model implicitly knows how many sub-agents it's spawning, because the code it generates should tell it how many. But the Qwen 3 experiments clearly show it maybe doesn't have the greatest grasp of that, especially if it writes a for loop, right? It writes a for loop over... >> Yeah, I think that's it. And I think it might also get confused because it's writing a for loop and encapsulating the LM query inside a function call; now you have multiple layers of abstraction around what the heck you're doing. Poor dude is confused out of his mind, >> right? Yeah. So I think a lot of these things can get baked in, and a lot of them can also get post-trained out, to be honest. As for whether the sub-agent knows it's a sub-agent, I actually think it shouldn't know. The reason we did it this way is that a big part of the thesis here is that you can run an RLM with a single model. Yes, you can use an RLM to spawn other models; you can use GPT-5 to spawn Gemini 3. That's fine, that's great, and it's likely what this will look like for maybe the next few months. But ultimately, what we really want is a single model that acts as both a regular model and an RLM: it should still be a regular model, but it should also be usable as an RLM. And when it spawns itself as a sub-agent, it should treat that like a regular call; it shouldn't need the prior that it's a sub-agent. It's just being asked a question, and it has to answer it. So I think a key thing moving forward is: how do you train an RLM so that it still maintains its performance as a regular model, but also has the ability to act as an RLM?
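The for-loop pattern being described might look something like this inside the REPL. `llm_query`, `summarize_chunk`, and the chunk sizes are illustrative stand-ins, not the paper's exact API; the point is that the fan-out count is only implicit in the loop bounds:

```python
# Illustrative: code an RLM might write in its REPL. The number of sub-agents
# spawned is implicit in the loop bounds, which is easy for the model to lose
# track of, especially with the call wrapped inside a helper function.

def llm_query(prompt: str) -> str:
    """Stand-in for a sub-agent call on a bounded prompt."""
    return f"summary of {len(prompt)} chars"

def summarize_chunk(chunk: str) -> str:
    # Extra layer of abstraction: the actual sub-agent call is hidden here.
    return llm_query("Summarize this section:\n" + chunk)

context = "x" * 100_000                       # stand-in for the long prompt
chunks = [context[i:i + 8_000] for i in range(0, len(context), 8_000)]

# One line of code, but it fans out into len(chunks) sub-agent calls.
summaries = [summarize_chunk(c) for c in chunks]
```

Nothing in this snippet surfaces "you just spawned 13 sub-agents" back to the model, which is exactly the kind of implicit cost the conversation suggests the model should be told about.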
These are interesting questions. >> I'm saying this because, for Qwen 3, in order for it to work, what happened is that you had to literally tell it in the system prompt: >> my guy, watch out for the compute cost, because this is too much, right? But if it knew how deep it currently is in the sub-agent calls, it would have that information without you having to tweak the system prompt; you'd get one clean system prompt that just works everywhere. And you'd be giving the model real information, because implicitly the number of agents it is spawning is correlated with the input cost. If it's 50 agents deep, >> yeah, >> it should know at that point that it's messing up. Also, I was reading their blog post, and they gave it hints on the difficulty too. I think this is another part that is super important, because these models seem to be pretty poor at assessing, >> roughly speaking, the difficulty of a task, and how much compute they should spend to solve it. So that's what prompted this thought. The other thought is that if we want to do recursive sub-agent calls, in my view, those axes you were talking about, how long the input is and how hard the task is, should maybe be something the model knows about, right? Like: look how much I've spent so far, and look at what I'm giving you; you're agent number 52 and you're three layers deep right now, but you have an easy task. This task is supposed to be easy. In that specific scenario, the chances that this sub-agent will think about spawning another one are low: no man, I can just solve it, this is supposed to be easy for my breed of LLMs, right?
I think the part that is hard here is being able to implicitly know how hard the task is. But if you can give the model that, it should steer it properly. GPT-5 seems to have better reasoning about compute cost and about how hard things are, right? Qwen 3 has absolutely no idea >> about that stuff. >> Yeah, this is a great point. Honestly, this is something that should be experimented with, but what you're saying could very well end up being the way it's done. >> And we have a question here that fits with the one I wanted to ask: in the paper you said the system prompt is fixed across all experiments. So the sub-agent doesn't even see the RLM system prompt, right? It doesn't know it's inside an RLM-style setup. Okay, cool. >> Yeah, because we're doing depth equals one, the sub-agents are just models, base models, >> fixed on one specific task, and that's it. >> And it doesn't have access to its ancestor tree; it doesn't know anything about that. Okay. >> In this case, I think that's because the tree is not interesting; it's just a root and then a bunch of leaves. If you start thinking about higher recursion depths, then yes, maybe we should start thinking about telling the model where on the tree it is, giving it a little more context about its parent node. >> I think that's because, and I go back to the lab analogy, right?
It's like somebody getting handed a project: hey, can you do this? And the grad student says okay, takes a look at it, and hands it to another one: hey dude, can you do this? And just hands over the actual thing as is, >> without the next person knowing they're the sixth one it's been passed to. That's one thing. The other thing is that if it knows the task is easy, and it is arguably easy, the chances that it will just go and do it are a bit higher. I've pulled up a bunch of neuro-inspired research on this: knowing the difficulty of a test ahead of time does change human behavior. If you know it in advance, the chances that you do great on the exam are really high; if you only find out when you have to take it, with no time to prepare, it's different, and it depends on whether you're the anxious type or not. In this case, I think Qwen 3 is pretty chill. >> But it has an impact on organic, human-type intelligence, and I also found some papers showing it has an impact on LLMs' abilities; they're just not that great at assessing the complexity of a test. >> This is interesting. Yeah. The last thing I'll say, and this is for [clears throat] the more theoretically inclined people, is that this is actually a really interesting problem of local versus global observations. I don't know how related this is to POMDPs in RL, but in general, what we're dealing with here is a system where not every model, not every actor, has all the information about what's going on, which is important, because the thesis here is that it can't.
But there are maybe some things to be said about how much information you should give each of the models at every layer; there is likely a way to characterize this very well. But anyway, not that important. >> Yeah. >> Yeah. But no, I think it's actually super duper important, especially if we're thinking about asynchronous versus synchronous. >> Yes. >> In the asynchronous case, I think it doesn't matter too much, because the sub-agent just goes, does its stuff, and that's it. But in the synchronous case, I think there's a chunk missing, which is where we store all of the context, or the facts, or whatever it is we're directly mining right now, so it can be used to double-check facts or, in some shape or form, align the rest of the model's behavior. Okay, anyway. I had a question about, yeah, the hardness question; I think we already touched on it. What's your raw intuition about why Qwen 3 Coder makes so many sub-agent calls? I've already said a whole bunch of stuff, but roughly speaking, it's still big; it's a 400-something-B model. What's your take? >> I think the short answer is, honestly, I don't have a fully principled way to answer this, but we have seen that some of these models, Qwen 3 especially, are heavily benchmark-maxed post-trained models. And as much as we like to make fun of OpenAI and all these companies, I think ChatGPT, or GPT-5, and Claude, and Gemini tend to be pretty good even at newer tasks they haven't really seen before. They tend to make more principled decisions.
I think Qwen 3 Coder is just a case of not being explicitly trained to do this kind of thing, so it makes very poor decisions. That's my speculation; I don't know. It could also have to do with the tasks it was trained on in the past; maybe it's just used to spawning. I don't know. >> Right. Right. >> I think this is pretty important, because if these models are kind of fried by RL and we need to RL them some more, this may need to happen a bit earlier in the post-training of the model, if we need to RL >> the model on the RLM. >> Also, just raw intuition here: why do you think all of the models are repeatedly verifying [clears throat] their information? Because this is something else: okay, spawning sub-agents is one thing, but then you have the answers and you're verifying them again and again, right? >> Yeah. >> Why is this, do you think? >> Yeah. So, based on my experience using even coding IDEs, I think trajectories are a very unnatural form of text. A trajectory is a concatenated sequence of inputs and outputs from a model, and these haven't really existed until recently; this wasn't a natural thing you would find on the internet, for example. And one of the frustrating things, and I still don't fully know a way around this other than maybe post-training, is that when the model comes up with the answer really quickly and the trajectory is really small, the model tends to just finish right there. It tends to just say: I'm done, there's nothing here.
And again, I don't want to anthropomorphize this argument. The argument I'm making is not that as the sequence gets longer, the model becomes more uncertain or something. I think what it really boils down to is that when the sequence is really long, the models make suboptimal decisions; they're just not very good in this setting. We've seen this in the past with the jokes we've made about Cursor: when you have a really long history, it starts making really odd decisions. And I think this is similar: for whatever reason, the high-probability action is just to retry what it just did and verify that it's correct, and it gets stuck in this loop. Qwen 3 is the biggest offender here; it's a known issue with Qwen 3, it tends to repeat things. But it actually happens with GPT-5 as well. And this goes back to the earlier point about training on long-context things: even at a smaller context window, maybe 100K or 50K, it still makes these really silly decisions. So yeah, I would say it's probably a training issue. >> Yeah. In my mind it might also have to do with the fact that these models are stateless, >> and, like you said, the long history isn't moving them out of the distribution of "I'm going to have to retry this again." For us, who have state, what we see is: you dumb piece of trash, it's been four times already, it's enough, it's most likely fine. For the model it's still uncertain, but the fact that you've tried it four times should make you more certain that this thing is most likely okay.
Which brings it back to my idea of this kind of fact database, right? You generated a fact, and two other sub-agents generated the same exact fact; you took different trajectories, but the fact is the same. >> Mhm. >> Theoretically speaking, you should take this into consideration and store it somewhere, in some shape or form. I wanted to ask about the REPL flow, because from what I understood, it's not Jupyter; it's literally a straight-up REPL, where in order to output some text you need to print it. >> Am I correct here? >> Yes, exactly. >> So have you thought about leaving it room to, I don't know, write markdown or something like that? I come back to the grad-student thing: if I had no room in my workflow to write my thoughts about what I just saw, that would be a bit limiting. Yes, I'm going to engineer and do stuff, but at some point I'm not going to write print statements that contain my thoughts; I'd much rather switch to markdown and start sketching it out. What's your general take here? >> So, okay, I think the interface the model interacts with is super important. Whether it's a REPL, or a notebook with markdown where it can also plot things, this all really matters. The caveat, though, is that technically the REPL environment can represent almost anything the Jupyter notebook can. For example, if you want to store markdown, you could store it in a variable. It's silly, it's not a natural thing to do, but it can do this. And when developing the paper, we decided we wanted it to be as simple as possible; a REPL is the simplest possible thing, so let's just stick with that.
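A minimal sketch of that plain-REPL contract: model-written code is executed against a persistent namespace, and only what it explicitly prints comes back to the model. The function names and the namespace setup are my own illustration, assuming a simple `exec`-based environment rather than the paper's actual implementation:

```python
# Minimal sketch of the plain-REPL contract described here: the model's code is
# exec'd against a persistent namespace, and only printed output is returned.
import io
from contextlib import redirect_stdout

namespace = {"context": "the very long prompt lives here as a variable"}

def repl_step(code: str) -> str:
    """Run one block of model-written code; return only its printed output."""
    buf = io.StringIO()
    with redirect_stdout(buf):
        exec(code, namespace)  # state (variables) persists across steps
    return buf.getvalue()

# Assigning to a variable produces no visible output...
out1 = repl_step("notes = '# Findings\\n- section 3 looks relevant'")
# ...the model has to print explicitly to see anything, markdown included.
out2 = repl_step("print(notes)")
```

This is why "store markdown in a variable" works but feels unnatural: the notes exist in `namespace`, yet they're invisible to the model until it prints them.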
But in the long term, say for people who want to use this in production or want to squeeze out performance, yeah, it's a good idea: storing stuff in a Jupyter-style environment. There's another advantage to using a Jupyter notebook. Also, the reason we didn't use Jupyter is that a REPL is really easy to set up; with a Jupyter notebook you have to do a bunch of stuff, and if you were to write a library for it, it's a little bit nasty. But the other advantage is that in a Jupyter notebook you can print out images, you can plot things. There was a question earlier about multimodal stuff, and the answer is: yes, you can actually do multimodal stuff. The problem in the current form is that we pass everything around as text; we have no way of passing around images. But it's a really easy change in the code, and I think one of the open research things, if people are interested, is multimodal RLMs, looking at RLMs in multimodal settings. The reason this is even cooler is that I think code interacting with images is a very underexplored thing: how a model can interact with image stuff, or even non-image stuff, like generating plots and using them. GPT-5 has some tools that let it do this, but doing it in a more principled way is a super interesting topic. I am planning on adding support for this in the RLM library.
If anyone wants to add it themselves, feel free, open a PR. But yeah, the representation matters a ton, and this comes back to what is considered in-distribution and what is not; these are all important things. >> Yeah, that was my thought, because I've had this grad-student image in my mind. My thought was: okay, what does in-distribution data look like for analysis on long inputs, where you have to do some sort of semi-mini-analysis? You have to programmatically interact with the substrate, but then you have to think about it and write your thoughts, and those become the anchor you use for the next step, and you keep doing this, so that when you hand it off to somebody, they can just read your thoughts, which are kind of a summarization of all the code that ran. And inherently the models have seen these Jupyter notebooks and this structure, so maybe they would be pushed toward that same kind of analysis behavior by being able to do that.
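The REPL-with-variables point above can be sketched in miniature. This is an illustrative toy, not the paper's actual environment: the root model emits code, the environment exec()s it in a persistent namespace, and only printed text would flow back into the model's context, so markdown "thoughts" can live in variables without consuming context.

```python
import contextlib
import io

class SimpleREPL:
    """A persistent namespace that executes model-emitted code."""

    def __init__(self):
        self.namespace = {}

    def run(self, code: str) -> str:
        """Execute code and return whatever it printed -- only printed
        text would be fed back into the root model's context."""
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf):
            exec(code, self.namespace)
        return buf.getvalue()

repl = SimpleREPL()

# The model can stash markdown notes in a variable instead of printing
# them, keeping them out of its own context until it needs them...
repl.run('notes = "## Findings\\n- chunk 3 names the culprit"')

# ...and it only sees text when it explicitly prints.
observed = repl.run("print(notes.splitlines()[0])")
```

In this setup, "writing markdown" and "storing a string in a variable" really are the same operation, which is the equivalence being claimed above.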
I also saw some research, from Microsoft I think, on enhancing LLM data-analysis capability with notebooks and inference-time value-guided search; they're doing Monte Carlo search-type stuff. Basically, they literally trained the model to do data analysis with Jupyter-style tooling, and it seems to be working well. So I don't know, it just sparked this thought. In general, um, wait a second, okay: so you did some ablations without sub-calls, right? >> Yes. >> But in some cases, the variant without sub-calls is able to perform better than the one that can do sub-calls. So what's the issue here? Is it that the RLM doesn't really know when it should be doing one or the other? What do you think? >> Yeah, I think it's a mix of things. One of them being, like you said, that it makes suboptimal decisions with the recursive sub-call. This is also another reason why it's very important to add the baseline you talked about before to the final version of the paper, because we do want to see what happens if you strip out the two most important parts, or independently strip out one of them. One of the big points of that ablation is that a really important part of this paper is not actually the recursion, which is funny because that's the name; it's really the offloading of the context somewhere else that is really, really important. Another big part, sadly, is noise. The annoying thing, and I will always criticize my own papers this way, is that I don't have standard-deviation bars and such. Sadly, I just cannot afford to run everything with them. So in a lot of these instances, it likely is also due to some kind of noise.
Uh which applies even to like comparing to the other baselines as well. Um but I think like generally yeah sub-optimal decision-m uh noise and then also the fact that like on the benchmarks where um it does perform worse. Uh these are ones where it actually can kind of get away with not using the recursive calls because those tasks are not very information dense. >> True. Um, so it can just find the thing it needs and then the main model can reason through what that information is. Like it doesn't need to do the sub calls. Um, so that's another explanation for like why there's a big a bigger gap for the other um for like Ulong and and Ulong Paris. But yeah, >> that makes a lot of sense. >> Yeah. Okay. So, um I'm going to spare you the database question. [laughter] Um >> I mean ask it. I'm happy to answer too. >> But but I mean like um um you just need to test it. I mean like how can we know um and I think this is like also adding some complexity to the system which like I think it it align with the other question which is do you think this could act as a replacement for like a fullblown rag system like if we we push it to the extreme here? Um, so I I don't think so. The reason I say this is I think the uh the usefulness of rag and and other retrieval methods is that or a big part of them is that you pre-index stuff like you pre-index like the things you're searching for which is not cheap right like it's and and it's a big reason why we actually don't compare to rag I mean also in our in our baselines rag just doesn't even make sense like the only the only setting where it makes sense is a browse uh plus, but in their paper they actually do rag and it doesn't do that well compared to BM25. It just wasn't even worth doing. Um but I I think like um I I still think there is value in methods that pre-index stuff. There is also value in equipping RLMs with tool calls and also equipping them with with rag as like a as as an extra thing. 
And in that way, I think RAG, or retrieval methods in general, are still very relevant in specific settings. Where RLMs really shine is where you cannot afford to pre-index, or you're just given something new on the spot, which often happens in a long agentic trajectory. One of the things I do want to explore in the future is a task where the long-context part doesn't come from the prompt; it comes from the trajectory itself. You can imagine a really, really hard retrieval problem where you need to piece everything together; I think BrowseComp-Plus is an example of this, but maybe even harder, like these deep-research-style things. The model is given a retriever, some kind of BM25 or RAG thing, and so is the RLM. The difficult part is that as it retrieves more stuff, the trajectory gets really long, and an RLM is actually very well suited for this setting. This is something I think is really interesting to explore. And it goes back to this idea of replacing a basic LM call with an RLM in your system and seeing what happens. But yeah. >> Yeah, okay, that makes a lot of sense. I also do think RAG still makes sense, unless you equip this thing with the database and the RAG as well.
[laughter] The difference is that with RAG you have a one-shot type of situation, but in this case it's actually mining for the information, which is the most interesting part. Whether you add RAG, or tool calls, or whatever library you want into the system, literally allow it to browse the internet and send an agent to browse whatever, this is just adding onto the same core, a bit like how chain-of-thought models now also do tool calling and other stuff on top of the same core. >> Yeah. >> There's an element that was interesting here: passing recursive LM output through variables for long-output tasks. If I understand correctly, it's offloading this to another sub-agent, right? The sub-agent does a bunch of stuff, and the result goes into the variable; the root model is not looking at what's inside. >> Yeah. >> It uses it to do the rest of the work, so it's saving its context a bit. It's kind of trusting that this is fine, right? >> That is literally what's happening here. This is actually a really cool part of this approach that I think is highly underrated. Actually, Prime Intellect's implementation of RLMs doesn't even allow the model to produce a final answer directly; it has to output a variable, and that string is the final answer of the RLM. That's an extreme version of what this part of the paper is describing. The point is that another large limitation of large language models is their output context window, which doesn't get talked about a lot; it's not infinite either.
One of the really cool things you can do with an RLM is output nearly unbounded sequence lengths, and you can do this in various ways. The trick we use is that the model can pick a variable and choose that variable as its actual final output. In the silliest case, you can imagine what it does: you give the RLM a prompt, it takes the prompt as a variable, passes that to a recursive model, the recursive model answers it and stores the answer in a variable, and then the RLM just outputs that variable. That is the same as doing a single model call; these are equivalent. The more powerful part is, for example, if your task is: I have a one-trillion-token Excel sheet and I want you to transform every row into a new Excel sheet. The RLM can actually do this. It would chunk up the Excel sheet, spawn a recursive model on each chunk, save the outputs to variables, concatenate all the output variables into one final one, which is maybe also a trillion tokens, and output that. And this is very, very cool because it also mixes in programmatic things; you don't have to use the language model itself to produce the final answer. This feature is actually what broke a lot of the benchmarks, because the model is so flexible. For example, all of these benchmarks that ask whether a model can do 30-digit multiplication, which by the way I think is kind of silly; sometimes the answer is no, and it's like, why are we even evaluating this? In this setting, it will just compute it in a variable and output that.
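The chunk-transform-concatenate pattern described above can be sketched in miniature. Everything here is illustrative: `recursive_lm` is a stub standing in for a real sub-model call, not the paper's API. The point is that each sub-call sees only one chunk, results accumulate in variables, and the concatenated variable is returned directly as the final answer, so the output can be far larger than any single context window.

```python
def recursive_lm(prompt: str, chunk: str) -> str:
    # Stub: a real implementation would call a sub-model here;
    # uppercasing stands in for whatever per-row transform was asked for.
    return chunk.upper()

def transform_rows(rows, chunk_size=2):
    # Split the giant input into pieces small enough for a sub-call.
    chunks = [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]
    outputs = []
    for chunk in chunks:
        # Each sub-call sees only one chunk; its result goes into a variable.
        outputs.append([recursive_lm("transform this row", r) for r in chunk])
    # The concatenated variable IS the final answer -- the root model never
    # needs to hold the full output in its own context.
    return [r for chunk_out in outputs for r in chunk_out]

result = transform_rows(["alpha", "beta", "gamma"])
```

The concatenation step is purely programmatic, which is exactly the "you don't have to use the language model itself to produce the final answer" point.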
And I think you can do a lot of really cool things: effectively, what the model can do in the REPL is form an entire workflow for how it's going to generate the final answer, and this includes both code and language-model calls. It's almost building its own agent scaffold within itself, which is very interesting. It's part of the reason why OOLONG-Pairs is so hard: OOLONG-Pairs asks the model to generate all the pairs that satisfy some property, and you can do this in a programmatic way by passing the outputs across variables. So yeah. >> Yeah. My intuition here is that it's the right idea at the right time, because we already know that, say, Opus 4.5 is a fantastic coding agent. Now you put it into a setup where the only thing it has to do is code, literally. Okay, it can't do 30-digit multiplication? It can write the script, run it, and it's done, and then it can just move on to the next task and stitch it up. It doesn't need massive context; it can spawn six versions of itself and just go. So I feel like it's the right idea at the right time. There's this line I just wanted to get your rough idea on; I know you might be working on this right now: you hypothesize that RLM trajectories can be used as a form of reasoning, which can be trained by bootstrapping existing trajectories. What do you have in mind in order to actually do that? >> Yeah, so this is actually really tricky in practice. But let me lay out the core idea of what I was trying to say there.
Now, I want to say this in a way that's not confusing to people, so if people find this confusing I can reframe it. In the last year, the way we have done reasoning models: what is a reasoning model? A reasoning model is just a model that has been post-trained such that, when it's given a question, it will output a long reasoning trace, and this trace also gets fed back into the model, so it's a form of conditioning. Given this reasoning trace it came up with, plus the original prompt, it will come up with a better, more informed answer to what it was trying to do. This is what I like to call reasoning in token space, because quite literally it is just outputting tokens to come up with an answer. And the way these were trained, with RL for example, though it doesn't have to be, is basically a version of rejection sampling, if that's the right word: you get the model to produce these long sequences, and if it gets the question correct, you give it a positive signal and do the update, yada yada. It's simple, but obviously in practice there are a lot of really nasty parts. The reason this works so well is that it's still the same as just training a model; you're just training a model with RL, and the sequence is still fed back into the model, so the whole pipeline is the same as if you were training it in a non-reasoning way. There's actually no difference for the most part. The difficulty with the RLM case, and why I think this is also so cool, is that the RLM trajectory is way longer than what fits into the model's context window.
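The rejection-sampling loop he describes can be sketched roughly as below, in the STaR style: sample traces, keep only the ones that reach the right answer, and fine-tune on those. The sampler is a deterministic stub standing in for a real model, purely for illustration.

```python
import itertools

def make_sampler(answers):
    """Stub 'model' that cycles through canned answers -- a real model
    would generate a chain of thought plus an answer."""
    it = itertools.cycle(answers)
    def sample_trace(question):
        guess = next(it)
        return {"trace": f"let me think about {question['q']}...", "answer": guess}
    return sample_trace

def collect_training_data(questions, sample_trace, samples_per_q=4):
    kept = []
    for q in questions:
        for _ in range(samples_per_q):
            t = sample_trace(q)
            if t["answer"] == q["answer"]:  # positive signal: keep correct traces only
                kept.append((q["q"], t["trace"], t["answer"]))
    return kept  # fine-tune on these (question, trace, answer) triples

sampler = make_sampler(["wrong", "4"])  # stub alternates wrong/right
data = collect_training_data([{"q": "2+2", "answer": "4"}], sampler)
```

The catch he raises next is that an RLM trajectory spans multiple model calls and code execution, so it cannot simply be dropped into this loop as one training sequence.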
So you can't just naively train an RLM the way you would a plain reasoning model; the backpropagation is really awkward here, and even the reward is really awkward. We now have what we usually call in the RL community a credit-assignment problem. The other really weird thing is that we're not reasoning purely in token space. If that's a confusing term I can explain it better, but we are reasoning in code and in token space, and not only that, we're reasoning across multiple model calls, which is really weird; it's a really awkward thing to do. How you actually train this model, such that it never actually uses the full trajectory, to do this reasoning training, is kind of tricky. It's an ongoing thing; we're looking into it. I also would be happy if frontier labs were interested in this as well. I don't care who ends up having the best model; I just want to see if it works. >> It could be you, man. It could be you. >> Maybe, maybe. But yeah, I guess that's kind of what that means. >> Yeah. Have you thought about evolutionary strategies here? >> Yeah. >> I say this because I was talking to the EGGROLL guy, and another researcher who is also working on this, and it's comparable to doing GRPO on some of the benchmarks. You just have to make sure you're doing it as optimized as possible on the GPU side.
But if you can pull it off, then it doesn't matter what's happening in the middle: it can literally spawn a hundred sub-agents, recursively, whatever; you just wiggle the parameters, look at the output, and say, this is good, we're going to make more of this and less of the other stuff. >> Yeah. [laughter] >> So one question I had: do you think that post-training the model to be an RLM will have an impact on the number of sub-agents being spawned, and generally on its understanding of the task difficulty? Basically, bringing Qwen 3 closer to GPT-5's level of understanding, of not being silly. >> Yeah, so I would recommend reading this paper called Context Folding, something something; I think it's a ByteDance paper, though I actually don't remember if it's ByteDance or a different Chinese company. They do something a little bit different from what RLMs do, but it's a similar core idea: we have multiple model calls and we want to train a model with RL to do this kind of thing. And they do a lot of really interesting tricks with the GRPO loss, with the goal of reducing the number of sub-agent calls, reducing the length of the root language model's trajectory, things like that. My answer is that it honestly depends a lot on what your loss is. And, this is just speculation, but I think naively training with how we've done it in the past, with GRPO or maybe some modified version, is not going to work that well unless you have a lot of data. I think we are inevitably going to need to bake in some things, at least in the beginning. In the future things will eventually simplify, and maybe it will just return to GRPO.
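The evolution-strategies idea raised a moment ago, treat the whole trajectory as a black box, perturb the parameters, and score only the final outputs, can be sketched on a toy two-parameter objective. This is generic ES with antithetic sampling, not anything from the paper; the quadratic `score` stands in for "run the full RLM trajectory and grade the answer".

```python
import random

def score(params):
    # Black-box reward on the final output only. Optimum is at (3, -1).
    x, y = params
    return -((x - 3.0) ** 2 + (y + 1.0) ** 2)

def es_step(params, rng, pop=30, sigma=0.1, lr=0.1):
    """One ES update: perturb, score, and move toward the perturbations
    that scored better (antithetic pairs reduce gradient-estimate variance)."""
    grad = [0.0] * len(params)
    for _ in range(pop):
        eps = [rng.gauss(0.0, 1.0) for _ in params]
        r_plus = score([p + sigma * e for p, e in zip(params, eps)])
        r_minus = score([p - sigma * e for p, e in zip(params, eps)])
        adv = (r_plus - r_minus) / (2.0 * sigma)
        for i, e in enumerate(eps):
            grad[i] += adv * e / pop
    return [p + lr * g for p, g in zip(params, grad)]

rng = random.Random(0)
params = [0.0, 0.0]
for _ in range(150):
    params = es_step(params, rng)
# params should now be close to the optimum (3, -1)
```

Nothing in the loop ever looks inside the "trajectory", which is why ES sidesteps the credit-assignment awkwardness, at the cost of needing many rollouts per update.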
I don't know. But for the time being, if we want to see some initial cool results with RLMs, we'll probably have to guide them in certain ways. For example, if we want to post-train Qwen 3, we kind of have to add a little knob that says, hey, don't do that many sub-agent calls. I think this is just what's going to happen. >> Yeah, that makes a lot of sense. And I also had this other thought, but I think it goes in the same direction: if it knows whether the task is easy or hard, it can decide what type of sub-agent to call and just use less compute. If it knows this is a dumb-dumb task it doesn't want to do, it can just spawn a Llama 3 and be done, right? But if it knows it has to do big thinking here: GPT-5.2, go for it, we'll wait 20 minutes, it doesn't matter, because this is too complicated a problem, and then this thing can orchestrate the rest. But here we don't know yet. There's a funny question in the chat. [laughter] How does someone enjoy doing this all the time? Are you just sitting in front of your computer all day? How is this fun? What's your take? >> Yeah, well, what I will say is, I actually don't think I work that much. This might be surprising to some people, but I think the hardest I've ever worked was during my undergrad; I genuinely think that, and I joke about it a lot with my friends. School can be hard; you can really make it hard for yourself. But for me personally, doing that was actually really helpful, because I spent most of my time in undergrad not doing deep learning or machine learning stuff.
Other than some research, the courses I took were mostly math, or systems, or physics-type stuff. And that honestly built up enough of a foundation for me to explore what I think are really simple ideas. I think the RLM idea, for example, is really simple; I don't think it's some crazy novel thing. I do think it's quite clever, and that's why I think it's a little popular now. But in general, honestly, I am not of the opinion that people should work, you know, 15 hours a day; I think that's kind of crazy. I do think that if you enjoy it, you will naturally just spend time doing these things. I also think I've been very lucky, so I will say that too; things work out differently for different people. But for me, I would credit my successful streak of research ideas to when I started doing GPU MODE stuff, which is really weird because it's not related to most of the research I do. But that's when I started getting involved and seeing what kinds of problems exist out there. So yeah, I think a lot of problems in ML are still, not low-hanging fruit, but more like: there are a lot of clever ideas that haven't really been articulated very well. And the funny thing about AI stuff in general is that I don't think we need crazy ideas. A lot of ideas already exist and float around, but the way they become interesting is when somebody formalizes or articulates them in a way where people understand what's going on. STaR and Quiet-STaR, Eric Zelikman's work that underpins all the reasoning-model stuff, are a great example of this.
The idea of bootstrapping reasoning traces, I'm sure many people thought of it at the same time or even earlier, but it's his papers that made it clear to people that this is actually a really good idea. And I think there are a lot of ideas like that still out there. Some of them are rooted in more theoretically minded individuals, people who like to think in math, and some of them are just super simple. So yeah, I don't think there's any secret recipe for these kinds of things. It really is just: you spend time in the field, and these ideas kind of float around. >> Yeah. Also, for those who are not aware of how research works, it's not necessarily that you just sit at the computer and the idea will come from the computer. The computer is just for doing, or for getting information. For the idea, you need to build an intuition: you read a bunch of stuff, you can take it outside, print your papers and read them out there, and you chat with researchers. In my view, chatting with researchers is the best way to get to the core of it. You can read the paper, yes, it's all formal, but getting the background intuition behind the idea also gives you some sense of where the work is maybe going. So actually, it's just a lot of chatting around. [laughter] At some point you have to code something, right? >> Yeah, of course. I mean, fundamentals are always even more important, obviously. And honestly, maybe this is a hot take, but I think the fundamentals for the AI field are quite shallow.
Like, if you wanted to get into pure math or physics research, it's quite difficult; it takes years. But in AI there's so much to do, and you can verify things yourself: okay, you have this dumbass idea, just try it out, man, let me know. >> And then you'll see if it's worth it or not. If it needs a massive amount of compute and it's going to be super complicated, realistically it won't happen at all, right? So you just have to gravitate toward less compute-insane ideas, or get an internship at, like, OpenAI. >> I actually think a lot of ideas don't require compute. The funny thing is, I think all of the boring ideas require compute, in the sense that it's the easy way out: oh, of course you can train on this thing. [laughter] But there are a lot of ideas like RLMs, which do require compute, but whose core part actually doesn't. There really are a lot of things that are missing currently, and these ideas can come from anywhere, genuinely; you don't need to be super established or anything like that. >> Yeah. Well, there's a good follow-up question on this; I think it will be the last one. How would one look for novelty when looking for research to publish? Somebody is working on their master's thesis. How do you get novel ideas? >> So I think novel ideas come from understanding what's going on in the field really well. That doesn't necessarily mean reading a thousand papers; there are some people who do that, and that's great. I used to do that; I don't anymore. Part of the reason is that a lot of ideas get recycled, and I don't think that's a bad thing, by the way.
But I think the way to think about it is: once you read into a field enough, you will get frustrated with certain things. Some things will just feel like, this doesn't make sense, why is it done this way? And honestly, the answer is usually that maybe someone hasn't explored it thoroughly; it's generally not true that it's done this way because it's the best. As for what to pick: if you look at my history of research, it's all over the place, genuinely a bunch of random stuff. I am not a specialist in post-training, for example, or in context engineering. But as you read into certain fields, you will naturally have questions. You had lots of questions for me today, and honestly, a lot of those questions are research projects of their own; they could very well be things to explore. And oftentimes that's how it goes: this RLM project started basically when, and I should give credit to my adviser here, a very great guy, Omar, the guy that did DSPy, he basically said, hey, what if we look at models that tool-call other models? I just don't know what would happen; let's see. And initially we did a bunch of stuff and it did not work. It was very silly, very dumb, and I'm sure a lot of people have tried it too, before we settled on the final idea. But it's things like that; it's just, oh, why hasn't this been done before? There are some subfields where this is a lot harder to do.
Systems, for example: it's a lot harder to pull this off there, because generally in systems it's not as much of a research question; it's more that someone needs to go do it, and you need to learn how. FlashAttention is a great example of this. But yeah, I think there are lots of great ideas still to be discovered. >> No, that's exactly it, honestly. Same here. At some point, among all the ideas, you just have to commit to one and push it through. And it's really true that there are a lot of trains of thought that just stopped four years ago because the only guy who worked on it graduated and is now working at McKinsey or whatever. [laughter] Not all directions of the human edge are being pushed at the same time. So, what's next for this research direction, and how can the community be involved? >> Yes, okay, this is important. I would say the obvious next direction is training, and I don't think this is something that can be done that easily in the open, unless there is a community, things like EleutherAI and others that have open compute, more centralized communities where they can train stuff. A few companies, Prime Intellect most notably, are working on this now. Just in general: can we solve a lot of the problems we talked about today through post-training, and can we get a model that can actually boost its own performance by post-training on the scaffold? Very interesting problem. I think we will likely see some results maybe in the next six months, maybe even earlier, since some people are already working on this.
I think another big direction, maybe the more open-source part that I've been thinking about, is going back to this Jupyter notebook thing, and more broadly: what is the actual interface we want to end up with? I say this because I think that to make progress on this problem as a community, there need to be some standards. If everybody works on this concurrently with different ideas of what to do, it's just going to be a mess, because everything in ML is about being in distribution now, right? Let's be honest: at least for language models, it's about being in distribution and trying to mold things into a form the model likes to see. So thinking about how this is designed, for this open-source library we have, is super important. Number two is this whole asynchrony thing. We want this to be really fast, so I can imagine that in the near future we might develop another type of inference engine specifically for RLMs, around how it minimizes the longest depth of chained language-model calls, how we design these systems to be used on your local server, and how we design the sandboxes. What is this REPL going to be equipped with? What is it even going to be? Is it going to be a Docker container, or a Docker image that runs on your machine? Is it going to be a sandbox you hook up to Modal, or to your own kind of cluster? I think these are all open questions that do not involve a lot of compute, and that can be discussed and solved in the open. So I think these two things are super important.
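The asynchrony point can be sketched with plain asyncio: independent sub-calls fan out concurrently, so wall-clock time tracks the depth of chained calls rather than their total count. The sub-call below is a stub with fake latency, not a real inference engine.

```python
import asyncio
import time

async def sub_call(chunk: str) -> str:
    await asyncio.sleep(0.1)  # stand-in for one sub-model request's latency
    return chunk.upper()

async def summarize(chunks):
    # All chunks are independent, so fan them out in parallel (depth 1)
    # instead of awaiting them one at a time (depth == len(chunks)).
    results = await asyncio.gather(*(sub_call(c) for c in chunks))
    return " ".join(results)

start = time.perf_counter()
answer = asyncio.run(summarize(["a", "b", "c", "d"]))
elapsed = time.perf_counter() - start
# Four 0.1s sub-calls complete in roughly 0.1s, not 0.4s.
```

An RLM-specific scheduler would do the same thing one level up: analyze which recursive calls the emitted code makes independent of each other, and batch those across the serving engine.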
Number three, which I forgot about, is evals, which goes back all the way to the beginning: we need evals. And I don't even mean just long-context evals; of course those are important, but I genuinely think this is a great one if you're looking for something to do. With SWE-bench, I was fortunate enough to be there when John and Carlos and Ofir were developing it, and it genuinely just came from: hey, this is a naturally occurring problem, can we get a model to solve it? I think we need more benchmarks like this, where models just don't do that well and that try to reflect realistic tasks. They will get hill-climbed, of course, but I think we need more diverse evals, because that's probably the single most important driver of model progress these days. >> Yeah, 100%. >> Yeah. >> No, I have nothing to add, man. Thank you very much, and also for staying so long afterward. >> Yeah. >> Folks, go follow him on Twitter; all the links are in the description. Read the paper, it's a really good one. And thank you very much, Alex, for coming. >> Of course. Thank you so much. >> Perfect. See you, man. The recording will be available on YouTube, folks, so you'll be able to take a look at it. Honestly, I really like this idea. I really do believe it has the same characteristics we saw early on with reasoning models. There's a lot of stuff to do with it; the frontier here is kind of boundless. So if you want to get involved, this is a very good type of project to be involved in, because you don't require training. You don't train these models; you set up the harness and then you tweak a bunch of stuff.
So if you like projects where there isn't a lot of demanding compute, and it's about understanding qualitatively what's going on and thinking creatively about how to set things up, it's a good place to start. The code is actually open source; I'll put the GitHub link up. You can take a look at it, start to tinker with it, and you're going to have some ideas that you can then share with the community. So thank you very much, everybody. It was super fun, and I wish you all a fantastic rest of the week. Bye-bye.