nielstron (u/nielstron)

nielstron commented on Evaluating AGENTS.md: are they helpful for coding agents? arxiv.org/abs/2602.11988... · Posted by u/mustaphah

deaux · 23 days ago

Thank you for turning up here and replying!

> The LLM-generated prompts follow the coding agent recommendations. We also show an ablation over different prompt types, and none have consistently better performance.

I think the coding agent recommended LLM-generated AGENTS.md files are almost without exception really bad. Because the AGENTS.md, to perform well, needs to point out the _non_-obvious. Every single LLM-generated AGENTS.md I've seen - including by certain vendors who at one point in time out-of-the-box included automatic AGENTS.md generation - wrote about the obvious things! The literal opposite of what you want. Indeed a complete and utter waste of tokens that does nothing but induce context rot.

I believe this is because creating a good one consumes a massive amount of resources and some engineering for any non-trivial codebase. You'd need multiple full-context iterations, and a large number of thinking tokens.

On top of that, and I've said this elsewhere, most of the best stuff to put in AGENTS.md is things that can't be inferred from the repo. Things like "Is this intentional?", "Why is this the case?" and so on. Obviously, neither the LLM nor a new-to-the-project human could know this or add it to the file. And the gains from this are also hard to capture by your performance metric, because they're not really about the solving of issues, they're often about direction, or about the how rather than the what.

As for the extra tokens, the right AGENTS.md can save lots of tokens, but it requires thinking hard about them. Which system/business logic would take the agent 5 different file reads to properly understand, but can be summarized in 3 sentences?

nielstron · 23 days ago

Yes that's a great summary and I agree broadly.

Note that by different prompt types I mean different types of meta-prompts used to generate the AGENTS.md. All of these are quite useless. Some additional experiments not in the paper showed that other automated approaches are also useless ("memory"-creating methods, broadly speaking).

nielstron commented on Evaluating AGENTS.md: are they helpful for coding agents? arxiv.org/abs/2602.11988... · Posted by u/mustaphah

regularfry · 23 days ago

> Regarding the 4% improvement for human written AGENTS.md: this would be huge indeed if it were a _consistent_ improvement. However, for example on Sonnet 4.5, performance _drops_ by over 2%. Qwen3 benefits most and GPT-5.2 improves by 1-2%.

Ok so that's interesting in itself. Apologies if you go into this in the paper, not had time to read it yet, but does this tell us something about the models themselves? Is there a benchmark lurking here? It feels like this is revealing something about the training, but I'm not sure exactly what.

nielstron · 23 days ago

It could... but as pointed out by others, the significance is unclear and per-model results have even fewer samples than the benchmark average. So: maybe :)
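To illustrate the sample-size point, here is a rough sketch of a normal-approximation (Wald) confidence interval for a difference in resolve rates. The counts are hypothetical and not taken from the paper, and the sketch assumes two independent samples, which is a simplification since the same tasks are typically run with and without the context file.

```python
import math

def diff_confidence_interval(successes_a, n_a, successes_b, n_b, z=1.96):
    """Wald interval (about 95% for z=1.96) for the rate difference b - a."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Standard error of the difference of two independent proportions.
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_b - p_a
    return diff - z * se, diff + z * se

# Hypothetical counts (NOT from the paper): the same 4-point gap at two sample sizes.
small = diff_confidence_interval(40, 100, 44, 100)
large = diff_confidence_interval(4000, 10000, 4400, 10000)
print(small, large)
```

With the smaller hypothetical sample the interval spans zero, while with the larger one it does not, which is why a per-model gap of a few points is hard to interpret on its own.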

nielstron commented on Evaluating AGENTS.md: are they helpful for coding agents? arxiv.org/abs/2602.11988... · Posted by u/mustaphah

deaux · 24 days ago

I read the study. I think it does the opposite of what the authors suggest - it's actually vouching for good AGENTS.md files.

> Surprisingly, we observe that developer-provided files only marginally improve performance compared to omitting them entirely (an increase of 4% on average), while LLM-generated context files have a small negative effect on agent performance (a decrease of 3% on average).

This "surprisingly", and the framing, seem misplaced.

For the developer-made ones: 4% improvement is massive! 4% improvement from a simple markdown file means it's a must-have.

> while LLM-generated context files have a small negative effect on agent performance (a decrease of 3% on average)

This should really be "while the prompts used to generate AGENTS files in our dataset..". It's a proxy for prompts, who knows if the ones generated through a better prompt show improvement.

The biggest use case for AGENTS.md files is domain knowledge that the model is not aware of and cannot instantly infer from the project. That is gained slowly over time from seeing the agents struggle due to this deficiency. Exactly the kind of thing very common in closed-source, yet incredibly rare in public GitHub projects that have an AGENTS.md file - the huge majority of which are recent small vibecoded projects centered around LLMs. If 4% gains are seen on the latter kind of project, which will have a very mixed quality of AGENTS files in the first place, then for bigger projects with high-quality .md's they're invaluable when working with agents.

nielstron · 23 days ago

Hey thanks for your review, a paper author here.

Regarding the 4% improvement for human written AGENTS.md: this would be huge indeed if it were a _consistent_ improvement. However, for example on Sonnet 4.5, performance _drops_ by over 2%. Qwen3 benefits most and GPT-5.2 improves by 1-2%.

The LLM-generated prompts follow the coding agent recommendations. We also show an ablation over different prompt types, and none have consistently better performance.

But ultimately I agree with your post. In fact we do recommend writing good AGENTS.md files, manually and in a targeted way. This is emphasized for example at the end of our abstract and conclusion.

nielstron commented on Evaluating AGENTS.md: are they helpful for coding agents? arxiv.org/abs/2602.11988... · Posted by u/mustaphah

theLiminator · 24 days ago

I'd take any paper like this with a grain of salt. I imagine what holds true for models in time period X could be drastically different just given a little more time.

Doesn't mean it's not worth studying this kind of stuff, but this conclusion is already so "old" that it's hard to say it's valid anymore with the latest batch of models.

nielstron · 24 days ago

This is the life of an LLM researcher. We literally ran the last experiments only a month ago on what were the latest models back then...

nielstron commented on Evaluating AGENTS.md: are they helpful for coding agents? arxiv.org/abs/2602.11988... · Posted by u/mustaphah

rmnclmnt · 24 days ago

Yesterday while I was adding some nitpicks to a CLAUDE.md/AGENTS.md file, I thought « this file could be renamed CONTRIBUTING.md and be done with it ».

Maybe I'm wrong but it sure feels like we might soon drop all of this extra cruft for more rational practices.

nielstron · 24 days ago

Exactly my thoughts... the model should just auto ingest README and CONTRIBUTING when started.
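A minimal sketch of what that could look like, assuming a hypothetical `collect_context` helper that an agent harness calls at startup. The file names, the character budget, and the section format are all illustrative assumptions, not how any existing agent does it.

```python
from pathlib import Path

# Hypothetical default: files to read at startup, in priority order.
DEFAULT_CONTEXT_FILES = ("README.md", "CONTRIBUTING.md")

def collect_context(repo_root, names=DEFAULT_CONTEXT_FILES, max_chars=20_000):
    """Concatenate the listed files that exist in repo_root, within a character budget."""
    root = Path(repo_root)
    parts = []
    remaining = max_chars
    for name in names:
        path = root / name
        # Skip missing files, and stop adding text once the budget is spent.
        if not path.is_file() or remaining <= 0:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")[:remaining]
        parts.append(f"## {name}\n{text}")
        remaining -= len(text)
    return "\n\n".join(parts)
```

Calling `collect_context(".")` from the repository root would return the combined text, ready to prepend to the agent's initial prompt. The budget is there because long README files would otherwise crowd out the task itself.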

nielstron commented on Evaluating AGENTS.md: are they helpful for coding agents? arxiv.org/abs/2602.11988... · Posted by u/mustaphah

tartakovsky · 24 days ago

Well, task == Resolving real GitHub Issues

Languages == Python only

Libraries (um looks like other LLM generated libraries -- I mean definitely not pure human: like Ragas, FastMCP, etc)

So seems like a highly skewed sample and who knows what can / can't be generalized. Does make for a compelling research paper though!

nielstron · 24 days ago

Hey, paper author here. We did try to get an even sample - we include both SWE-bench repos (which are large, popular and mostly human-written) and a sample of smaller, more recent repositories with existing AGENTS.md (these tend to contain LLM written code of course). Our findings generalize across both these samples. What is arguably missing are small repositories of completely human-written code, but this is quite difficult to obtain nowadays.

nielstron commented on Evaluating AGENTS.md: are they helpful for coding agents? arxiv.org/abs/2602.11988... · Posted by u/mustaphah

energy123 · 24 days ago

Their definition of context excludes prescriptive specs/requirements files. They are only talking about a file that summarizes what exists in the codebase, which is information that's otherwise discoverable by the agent through CLI (ripgrep, etc), and it's been trained to do that as efficiently as possible.

Also important to note that human-written context did help according to them, if only a little bit.

Effectively what they're saying is that inputting an LLM generated summary of the codebase didn't help the agent. Which isn't that surprising.

nielstron · 24 days ago

Hey, a paper author here :) I agree, if you know LLMs well it shouldn't be too surprising that autogenerated context files are not helping - yet this is the default recommendation by major AI companies, which we wanted to scrutinize.

> Their definition of context excludes prescriptive specs/requirements files.

Can you explain a bit what you mean here? If the context file specifies a desired behavior, we do check whether the LLM follows it, and this seems generally to work (Section 4.3).

nielstron commented on K2-Think: A Parameter-Efficient Reasoning System huggingface.co/LLM360/K2-... · Posted by u/Anon84

nielstron · 6 months ago

Debunking the Claims of K2-Think https://www.sri.inf.ethz.ch/blog/k2think

nielstron commented on K2-Think: A Parameter-Efficient Reasoning System arxiviq.substack.com/p/k2... · Posted by u/che_shr_cat

nielstron · 6 months ago

Debunking the Claims of K2-Think https://www.sri.inf.ethz.ch/blog/k2think