deaux · 25 days ago
I read the study. I think it does the opposite of what the authors suggest - it's actually vouching for good AGENTS.md files.

> Surprisingly, we observe that developer-provided files only marginally improve performance compared to omitting them entirely (an increase of 4% on average), while LLM-generated context files have a small negative effect on agent performance (a decrease of 3% on average).

This "surprisingly", and the framing in general, seem misplaced.

For the developer-made ones: 4% improvement is massive! 4% improvement from a simple markdown file means it's a must-have.

> while LLM-generated context files have a small negative effect on agent performance (a decrease of 3% on average)

This should really be "while the prompts used to generate AGENTS.md files in our dataset...". It's a proxy for prompts; who knows whether files generated through a better prompt would show an improvement.

The biggest use case for AGENTS.md files is domain knowledge that the model is not aware of and cannot instantly infer from the project. That is gained slowly over time from seeing the agents struggle due to this deficiency. It's exactly the kind of thing that's very common in closed-source codebases, yet incredibly rare in public GitHub projects that have an AGENTS.md file - the huge majority of which are recent, small, vibecoded projects centered around LLMs. If 4% gains are seen on the latter kind of project, which will have very mixed-quality AGENTS.md files in the first place, then for bigger projects with high-quality .md's they're invaluable when working with agents.

nielstron · 25 days ago
Hey thanks for your review, a paper author here.

Regarding the 4% improvement for human written AGENTS.md: this would be huge indeed if it were a _consistent_ improvement. However, for example on Sonnet 4.5, performance _drops_ by over 2%. Qwen3 benefits most and GPT-5.2 improves by 1-2%.

The LLM-generated prompts follow the coding agent recommendations. We also show an ablation over different prompt types, and none have consistently better performance.

But ultimately I agree with your post. In fact, we do recommend writing good AGENTS.md files, manually and in a targeted way. This is emphasized, for example, at the end of our abstract and in the conclusion.

vidarh · 25 days ago
Without measuring quality of output, this seems irrelevant to me.

My use of CLAUDE.md is to get Claude to avoid making stupid mistakes that will require subsequent refactoring or cleanup passes.

Performance is not a consideration.

If anything, beyond CLAUDE.md I add agent harnesses that often increase the time and tokens used many times over, because my time is more expensive than the agents.

sdenton4 · 25 days ago
You're measuring binary outcomes, so you can use a beta distribution to understand the distribution of possible success rates given your observations, and thereby put a confidence interval on the observed success rates. This would help us see whether that 4% difference is statistically significant, or if it is likely to be noise.
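A minimal sketch of that approach, using only Python's standard library to sample from the Beta posterior (the success counts here are made up for illustration, not taken from the paper):

```python
import random

def beta_credible_interval(successes, failures, alpha=0.05, draws=100_000, seed=0):
    """Monte Carlo (1 - alpha) credible interval for a success rate,
    based on a Beta(successes + 1, failures + 1) posterior (uniform prior)."""
    rng = random.Random(seed)
    samples = sorted(
        rng.betavariate(successes + 1, failures + 1) for _ in range(draws)
    )
    return samples[int(alpha / 2 * draws)], samples[int((1 - alpha / 2) * draws)]

# Made-up counts: say 52/100 tasks solved with an AGENTS.md, 48/100 without.
lo_with, hi_with = beta_credible_interval(52, 48)
lo_without, hi_without = beta_credible_interval(48, 52)
# If the two intervals overlap heavily, a 4-point gap could easily be noise.
```

With samples of this size the two 95% intervals overlap substantially, which is exactly the kind of thing this check makes visible.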
regularfry · 25 days ago
> Regarding the 4% improvement for human written AGENTS.md: this would be huge indeed if it were a _consistent_ improvement. However, for example on Sonnet 4.5, performance _drops_ by over 2%. Qwen3 benefits most and GPT-5.2 improves by 1-2%.

Ok so that's interesting in itself. Apologies if you go into this in the paper, not had time to read it yet, but does this tell us something about the models themselves? Is there a benchmark lurking here? It feels like this is revealing something about the training, but I'm not sure exactly what.

deaux · 25 days ago
Thank you for turning up here and replying!

> The LLM-generated prompts follow the coding agent recommendations. We also show an ablation over different prompt types, and none have consistently better performance.

I think the coding-agent-recommended, LLM-generated AGENTS.md files are almost without exception really bad, because an AGENTS.md, to perform well, needs to point out the _non_-obvious. Every single LLM-generated AGENTS.md I've seen - including those from certain vendors who at one point shipped automatic AGENTS.md generation out of the box - wrote about the obvious things! The literal opposite of what you want. Indeed a complete and utter waste of tokens that does nothing but induce context rot.

I believe this is because creating a good one consumes a massive amount of resources and some engineering for any non-trivial codebase. You'd need multiple full-context iterations, and a large number of thinking tokens.

On top of that, and I've said this elsewhere, most of the best stuff to put in AGENTS.md is things that can't be inferred from the repo - answers to "Is this intentional?", "Why is this the case?" and so on. Obviously, neither the LLM nor a human new to the project could know these or add them to the file. The gains from this are also hard to capture with your performance metric, because they're not really about solving issues; they're often about direction, or about the how rather than the what.

As for the extra tokens: the right AGENTS.md can save lots of tokens, but it requires thinking hard about it. Which system/business logic would take the agent 5 different file reads to properly understand, but could we summarize in 3 sentences?

SerCe · 25 days ago
In Theory There Is No Difference Between Theory and Practice, While In Practice There Is.

In large projects, having a specific AGENTS.md makes the difference between the agent spending half of its context window searching for the right commands, navigating the repo, understanding what is what, etc., and being extremely useful. The larger the repository, the more things it needs to be aware of and the more important the AGENTS.md is. At least that's what I have observed in practice.

giancarlostoro · 25 days ago
> The biggest usecase for AGENTS.md files is domain knowledge that the model is not aware of and cannot instantly infer from the project. That is gained slowly over time from seeing the agents struggle due to this deficiency.

This. I have Claude write about the codebase because I get tired of it grepping files constantly. I'd rather it just know “these files are for x, these files have y methods”, and I even have it break down larger files so the result fits in the context window several times over.

Funnily enough this makes it easier for humans to parse.

belval · 25 days ago
My pet peeve with AI is that it tends to work better in codebases where humans do well, and for the same reasons.

Large orchestration package without any tests that relies on a bunch of microservices to work? Claude Code will be as confused as our SDEs.

This in turn leads to a broader effort to refactor our antiquated packages in the name of "making them compatible with AI", which actually means compatible with humans.

bootsmann · 25 days ago
This reads a lot like the bargaining stage. If agentic AI makes me 10 times more productive as a developer, surely a 4% improvement is barely worth the token cost.
koiueo · 25 days ago
> If agentic AI makes me a 10 times more productive

I'm not sure what you are suggesting exactly, but wanted to highlight this humongous "if".

staticassertion · 25 days ago
If something makes you 10x as effective and then you improve that thing by 4%...
zero_k · 25 days ago
It's not only about the token cost! It's also my TIME cost! Much-much more expensive than tokens, it turns out ;)

croes · 25 days ago
10x is that quantity or quality?
wolfejam · 17 days ago
Well said. And it's potentially a 7% swing when you think about it — +4% with good human-written context vs. -3% with LLM-generated noise. That's a significant delta from just the quality of the information.

The real value is exactly what you described: the tribal knowledge, the "we tried X and it broke because Y", the constraints that live in someone's head and nowhere in the code. LLM-generated files miss this because the LLM is just restating what it can already see. Of course that doesn't help.

zero_k · 25 days ago
Honestly, the more research papers I read, the more suspicious I become. This "surprisingly" and other hyperbole is just there to make reviewers think the authors actually did something interesting/exciting. But the more "surprises" there are in a paper, the more suspicious of it I am. Such hyperbole ought to be ignored at best; at worst, the exact opposite of the claim needs to be examined.

It seems like the best students/people eventually end up doing CS research in their spare time while working as engineers. This is not the case for many other disciplines, where you need e.g. a lab to do research. But in CS, you can just do it from your basement, all you need is a laptop.

MITSardine · 25 days ago
Well, you still need time (and permission from your employer)! Research is usually a more than full time job on its own.
pgt · 25 days ago
4% is yuuuge. In hard projects, 1% is the difference between getting it right with an elegant design or going completely off the rails.
pamelafox · 25 days ago
This is why I only add information to AGENTS.md when the agent has failed at a task. Then, once I've added the information, I revert the desired changes, re-run the task, and see if the output has improved. That way, I can have more confidence that AGENTS.md has actually improved coding agent success, at least with the given model and agent harness.

I do not do this for all repos, but I do it for the repos where I know that other developers will attempt very similar tasks, and I want them to be successful.

viraptor · 25 days ago
You can also save time/tokens if you see that every request starts looking for the same information. You can front-load it.
sebazzz · 25 days ago
Also, take the randomness out of it. Sometimes the agent executes tests one way, sometimes another.
NicoJuicy · 25 days ago
Don't forget to update it regularly then
imiric · 25 days ago
That's a sensible approach, but it still won't give you 100% confidence. These tools produce different output even when given the same context and prompt. You can't really be certain that the output difference is due to isolating any single variable.
pamelafox · 25 days ago
So true! I've also set up automated evaluations using the GitHub Copilot SDK so that I can re-run the same prompt and measure results. I only use that when I want even more confidence, typically when I want to compare models more precisely. I do find that the results have been fairly similar across runs for the same model/prompt/settings, even though we cannot set a seed for most models/agents.
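The SDK details aren't shown here, so as a rough sketch of the re-run-and-measure idea, with a hypothetical run_agent_task callable standing in for the actual agent invocation:

```python
def measure_success_rate(run_agent_task, prompt, runs=5):
    """Re-run the same prompt `runs` times and report the pass rate.
    `run_agent_task` is a hypothetical callable wrapping whatever agent
    SDK is in use; it should return True when the task succeeded."""
    passes = sum(1 for _ in range(runs) if run_agent_task(prompt))
    return passes / runs

# Usage with a stand-in stub instead of a real agent call:
rate = measure_success_rate(lambda p: bool(p), "add a unit test for foo()", runs=3)
# rate == 1.0 for this trivial stub
```

In practice you would diff the rate measured with the AGENTS.md entry against the rate without it, on an otherwise identical setup.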
ChrisGreenHeur · 25 days ago
Same with people: no matter what info you give a person, you can't be sure they'll follow it the same way every time.
averrous · 25 days ago
Agree. I've also found that a rule-discovery approach like this performs better. It's like teaching a student: they probably already perform well on some tasks, and feeding in an extra rule they're already well versed in can hinder their creativity.
avhception · 25 days ago
When an agent just plows ahead with a wrong interpretation or understanding of something, I like to ask them why they didn't stop to ask for clarification. Just a few days ago, while refactoring minor stuff, I had an agent replace all sqlite-related code in that codebase with MariaDB-based code. Asked why that happened, the answer was that there was a confusion about MariaDB vs. sqlite because the code in question is dealing with, among other things, MariaDB Docker containers. So the word MariaDB pops up a few times in code and comments.

I then asked if there is anything I could do to prevent misinterpretations from producing wild results like this. So I got the advice to put an instruction in AGENTS.md that would urge agents to ask for clarification before proceeding. But I didn't add it. Out of the 25 lines of my AGENTS.md, many are already variations of that. The first three:

- Do not try to fill gaps in your knowledge with overzealous assumptions.

- When in doubt: Slow down, double-check context, and only touch what was explicitly asked for.

- If a task seems to require extra changes, pause and ask before proceeding.

If these are not enough to prevent stuff like that, I don't know what could.

Sevii · 25 days ago
Are agents actually capable of answering why they did things? An LLM can review the previous context, add your question about why it did something, and then use next token prediction to generate an answer. But is that answer actually why the agent did what it did?
gas9S9zw3P9c · 25 days ago
It depends. If you have an LLM that uses reasoning, the explanation for why a decision was made can often be found in the reasoning token output. So if the agent later has access to that context, it can see why the decision was made.
bananapub · 25 days ago
of course not, but it can often give a plausible answer, and it's possible that answer will happen to be correct - not because it did any introspection, or is capable of any, but because its token outputs in response to the question might semi-coincidentally be token inputs that change the future outputs in the same way.
Onavo · 25 days ago
Well, the entire field of explainable AI has mostly thrown in the towel...
bandrami · 25 days ago
Isn't that question a category error? The "why" the agent did that is that it was the token that best matched the probability distribution of the context and the most recent output (modulo a bit of randomness). The response to that question will, again, be the tokens that best match the probability distribution of the context (now including the "why?" question and the previous failed attempt).
tibbar · 25 days ago
if the agent can review its reasoning traces, which i think is often true in this era of 1M token context, then it may be able to provide a meaningful answer to the question.
tomashubelbauer · 25 days ago
Just this morning I ran across an even narrower case of how AGENTS.md (in this case with GPT-5.3 Codex) can be completely ignored even when filled with explicit instructions.

I have a line there that says Codex should never use Node APIs where Bun APIs exist for the same thing. Routinely, Claude Code and now Codex would ignore this.

I just replaced that rule with a TypeScript-compiler-powered AST based deterministic rule. Now the agent can attempt to commit code with banned Node API usage and the pre-commit script will fail, so it is forced to get it right.

I've found myself migrating more and more of my AGENTS.md instructions to compiler-based checks like these - where possible. I feel as though this shouldn't be needed if the models were good, but it seems to be and I guess the deterministic nature of these checks is better than relying on the LLM's questionable respect of the rules.
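The rule described above uses the TypeScript compiler's AST; as a much simpler illustration of the same deterministic-gate idea, a regex-based sketch (with a made-up banned-module list) might look like:

```python
import re

# Simplified stand-in for the AST-based rule described above: it only
# catches literal `node:` import specifiers, e.g. `from "node:fs"`.
# The banned-module list here is illustrative, not exhaustive.
BANNED = re.compile(r'''from\s+["']node:(?:fs|path|child_process)["']''')

def find_banned_imports(source):
    """Return 1-based line numbers that import a banned Node module."""
    return [i for i, line in enumerate(source.splitlines(), start=1)
            if BANNED.search(line)]

# A pre-commit wrapper would run this over staged files and exit
# non-zero when any line numbers come back.
hits = find_banned_imports('import fs from "node:fs";\nimport { file } from "bun";\n')
# hits == [1]: the first line uses a banned Node import.
```

A real check would want the compiler-level approach, since regexes miss dynamic imports, require calls, and aliased specifiers.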

iamflimflam1 · 25 days ago
Not that much different from humans.

We have pre-commit hooks to prevent people doing the wrong thing. We have all sorts of guardrails to help people.

And the “modern” approach when someone does something wrong is not to blame the person, but to ask “how did the system allow this mistake? What guardrails are missing?”

MITSardine · 25 days ago
I wonder if some of these could be embedded in the write tool calls?

geraneum · 25 days ago
> So I got the advice to put an instruction in AGENTS.md that would urge agents to ask for clarification before proceeding.

You may want to ask the next LLM versions the same question after they feed this paper through training.

sensanaty · 25 days ago
I really hate that the anthropomorphizing of these systems has successfully taken hold in people's brains. Asking it why it did something is completely useless because you aren't interrogating a person with a memory or a rationale, you’re querying a statistical model that is spitting out a justification for a past state it no longer occupies.

Even the "thinking" blocks in newer models are an illusion. There is no functional difference between the text in a thought block and the final answer. To the model, they are just more tokens in a linear sequence. It isn't "thinking" before it speaks, the "thought" is the speech.

Treating those thoughts as internal reflection of some kind is a category error. There is no "privileged" layer of reasoning happening in the silicon that then gets translated into the thought block. It’s a specialized output where the model is forced to show its work because that process of feeding its own generated strings back into its context window statistically increases the probability of a correct result. The chatbot providers just package this in a neat little window to make the model's "thinking" part of the gimmick.

I also wouldn't be surprised if asking it stuff like this was actually counterproductive, though here I'm going off vibes. The logic being that by asking, you're poisoning the context - similar to how, if you try to generate an image by saying "It should not have a crocodile in it", it will put a crocodile in the image. By asking it why it did something wrong, it'll treat that as ground truth, and all future generation will have that snippet in context, nudging the output so that the wrong thing itself influences it to keep doing the wrong thing more and more.

Bolwin · 25 days ago
You're entirely correct in that it's a different model with every message, every token. There's no past memory for it to reference.

That said, it can still be useful, because you have some weird behavior and 199k tokens of context, with no idea where the info is that's nudging it to do the weird thing.

In this case you can think of it less as "why did you do this?" And more "what references to doing this exist in this pile of files and instructions?"

bavell · 25 days ago
Agreed. I wish more people understood the difference between tokens, embeddings, and latent space encodings. The actual "thinking" if you can call it that, happens in latent space. But many (even here on HN) believe the thinking tokens are the thoughts themselves. Silly meatbags!
Majromax · 25 days ago
> I really hate that the anthropomorphizing of these systems has successfully taken hold in people's brains. Asking it why it did something is completely useless because you aren't interrogating a person with a memory or a rationale, you’re querying a statistical model that is spitting out a justification for a past state it no longer occupies.

"Thinking meat! You're asking me to believe in thinking meat!"

While next-token prediction based on matrix math is certainly a literal, mechanistic truth, it is not a useful framing in the same sense that "synapses fire causing people to do things" is not a useful framing for human behaviour.

The "theory of mind" for LLMs sounds a bit silly, but taken in moderation it's also a genuine scientific framework in the sense of the scientific method. It allows one to form hypothesis, run experiments that can potentially disprove the hypothesis, and ultimately make skillful counterfactual predictions.

> By asking it why it did something wrong, it'll treat that as the ground truth and all future generation will have that snippet in it, nudging the output in such a way that the wrong thing itself will influence it to keep doing the wrong thing more and more.

In my limited experience, this is not the right use of introspection. Instead, the idea is to interrogate the model's chain of reasoning to understand the origins of a mistake (the 'theory of mind'), then adjust agents.md / documentation so that the mistake is avoided for future sessions, which start from an otherwise blank slate.

I do agree, however, that the 'theory of mind' is very close to the more blatantly incorrect kind of misapprehension about LLMs, that since they sound humanlike they have long-term memory like humans. This is why LLM apologies are a useless sycophancy trap.

seanmcdirmid · 25 days ago
> Asking it why it did something is completely useless because you aren't interrogating a person with a memory or a rationale, you’re querying a statistical model that is spitting out a justification for a past state it no longer occupies.

Asking it why it did something isn’t useless, it just isn’t foolproof. If you really think it’s useless, you are way too heavily into binary thinking to be using AI.

Perfect is the enemy of useful in this case.

lebuin · 25 days ago
It seems like LLMs in general still have a very hard time with the concepts of "doubt" and "uncertainty". In the early days this was very visible in the form of hallucinations, but it feels like they fixed that mostly by having better internal fact-checking. The underlying problem of treating assumptions as truth is still there, just hidden better.
hnbad · 25 days ago
LLMs are basically improv theater. If the agent starts out with a wildly wrong assumption it will try to stick to it and adapt it rather than starting over. It can only do "yes and", never "actually nevermind, let me try something else".

I once had an agent come up with what seemed like a pointlessly convoluted solution as it tried to fit its initial approach (likely sourced from framework documentation overemphasizing the importance of doing it "the <framework> way" when possible) to a problem for which it to me didn't really seem like a good fit. It kept reassuring me that this was the way to go and my concerns were invalid.

When I described the solution and the original problem to another agent running the same model, it would instantly dismiss it and point out the same concerns I had raised - and it would insist on those being deal breakers, the same way the other agent had dismissed them as invalid.

In the past I've often found LLMs to be extremely opinionated while also flipping their positions on a dime once met with any doubt or resistance. It feels like I'm now seeing the opposite: the LLM just running with whatever it picked up first from the initial prompt and then being extremely stubborn and insisting on rationalizing its choice no matter how much time it wastes trying to make it work. It's sometimes better to start a conversation over than to try and steer it in the right direction at that point.

avhception · 25 days ago
Doubt and uncertainty are left for us humans.
mustaphah · 25 days ago
This is like trying to fix hallucination by telling the LLM not to hallucinate.
delaminator · 25 days ago
so many times I have ended up here:

"You're absolutely correct. I should have checked my skills before doing that. I'll make sure I do it in the future."

amluto · a month ago
My personal experience is that it’s worthwhile to put instructions, user-manual style, into the context. These are things like:

- How to build.

- How to run tests.

- How to work around the incredible crappiness of the codex-rs sandbox.

I also like to put in basic style-guide things like “the minimum Python version is 3.12.” Sadly I seem to also need “if you find yourself writing TypeVar, think again” because (unscientifically) it seems that putting the actual keyword that the agent should try not to use makes it more likely to remember the instructions.

mlaretallack · a month ago
I also try to avoid negative instructions. No scientific proof, just a feeling, same as yours: "do not delete the tmp file" too often leads to deleting the tmp file.
strokirk · 25 days ago
It’s like instructing a toddler.
likium · 25 days ago
For TypeVar I’d reach for a lint warning instead.
amluto · 25 days ago
Then little toddler LLM will announce something like “I implemented what you requested and we’re all done. You can run the lint now.” And I’ll reply “do it yourself.”

I can only assume that everyone reporting amazing success with agent swarms and very long running tasks are using a different harness than I am :)

bonesss · 25 days ago
I have also felt like these kinds of efforts at instructions and agent files were worthwhile, but I am increasingly of the opinion that such feelings are self-delusion: seeing and expecting certain things, aided by a tool that always agrees with my, or its, take on utility. The agent file looks like it'd work, it looks how you'd expect, but then it fails over and over. And the process of tweaking is pleasant chatting with supportive supposed insights and solutions, which means hours of fiddling with meta-documentation without clear rewards, because of partial adherence.

The paper's conclusions align with my personal experiments at managing a small knowledge base with LLM rules. The application of rules was inconsistent, the execution of them fickle, and fundamental changes in processing would happen from week to week as the model usage was tweaked. But rule tweaking always felt good. The LLM said it would work better, and the LLM said it had read and understood the instructions, and the LLM said it would apply them... I felt like I understood how best to deliver data to the LLMs, only to see recurrent failures.

LLMs lie. They have no idea, no data, and no insights into specific areas, but they'll produce pleasant, reality-adjacent fiction. Since chatting is seductive, and our time sense is affected by talking, I think the normal time-versus-productivity sense is pulled further out of whack. Devs are notoriously bad at estimating where they're spending time; long feedback loops filled with phone time and slow-ass conversation don't help.

medler · a month ago
Quite a surprising result: “across multiple coding agents and LLMs, we find that context files tend to reduce task success rates compared to providing no repository context, while also increasing inference cost by over 20%.”
tartakovsky · 25 days ago
Well, task == Resolving real GitHub Issues

Languages == Python only

Libraries == looks like other LLM-generated libraries -- definitely not purely human-written: Ragas, FastMCP, etc.

So seems like a highly skewed sample and who knows what can / can't be generalized. Does make for a compelling research paper though!

nielstron · 25 days ago
Hey, paper author here. We did try to get an even sample - we include both SWE-bench repos (which are large, popular and mostly human-written) and a sample of smaller, more recent repositories with existing AGENTS.md (these tend to contain LLM written code of course). Our findings generalize across both these samples. What is arguably missing are small repositories of completely human-written code, but this is quite difficult to obtain nowadays.
locknitpicker · 25 days ago
I think that is a rather fitting approach to the problem domain. A task being a real GitHub issue is a solid definition by any measure, and I see no problem picking language A over B or C.

If you feel strongly about the topic, you are free to write your own article.

bootsmann · 25 days ago
> Libraries (um looks like other LLM generated libraries -- I mean definitely not pure human: like Ragas, FastMCP, etc)

How does this invalidate the result? Aren't AGENTS.md files put exactly into those repos that are partly generated using LLMs?

rmnclmnt · 25 days ago
Yesterday, while I was adding some nitpicks to a CLAUDE.md/AGENTS.md file, I thought "this file could be renamed CONTRIBUTING.md and be done with it".

Maybe I'm wrong, but it sure feels like we might soon drop all of this extra cruft for more rational practices

nielstron · 25 days ago
Exactly my thoughts... the model should just auto ingest README and CONTRIBUTING when started.
delaminator · 25 days ago
You could have claude --init create this hook, and then it gets injected into the context on start and resume.

Or create it in some other way:

    {
      "hookSpecificOutput": {
        "hookEventName": "SessionStart",
        "additionalContext": "<contents of your file here>"
      }
    }
I thought it was such a good suggestion that I built this just now and made it global, injecting the README at startup / resume / post-compact - I'll see how it works out

https://gist.github.com/lawless-m/fa5d261337dfd4b5daad4ac964...

    #!/bin/bash
    # ~/.claude/hooks/inject-readme.sh

    README="$(pwd)/README.md"

    if [ -f "$README" ]; then
      # let jq build (and escape) the JSON itself; --rawfile reads the
      # README verbatim into $content (requires jq 1.6+)
      jq -n --rawfile content "$README" \
        '{hookSpecificOutput: {hookEventName: "SessionStart", additionalContext: $content}}'
      exit 0
    else
      echo "README.md not found" >&2
      exit 1
    fi
with this hook

    {
      "hooks": {
        "SessionStart": [
          {
            "matcher": "startup|clear|compact",
            "hooks": [
              {
                "type": "command",
                "command": "~/.claude/hooks/inject-readme.sh"
              }
            ]
          }
        ]
      }
    }

rmnclmnt · 25 days ago
And that makes total sense. Honestly, after working with Opus 4.6 for a few days, it really feels like a competent coworker, but one that needs some explicit conventions to follow... exactly like when onboarding a new IC! So I think there is a bright light to be seen: this will force proper and explicit contribution rules and conventions, both for humans and robots.
gordonhart · 25 days ago
Exactly, it's the same documentation any contributor would need, just actually up-to-date and pared down to the essentials because it's "tested" continuously. If I were starting out on a new codebase, AGENTS.md is the first place I'd look to get my bearings.
prodigycorp · 25 days ago
LLMs are generally bad at writing non-noisy prompts and instructions. It's better to have it write instructions post hoc. For instance, I paste this prompt into the end of most conversations:

  If there’s a nugget of knowledge learned at any point in this conversation (not limited to the most recent exchange), please tersely update AGENTS.md so future agents can access it. If nothing durable was learned, no changes are needed. Do not add memories just to add memories.
  
  Update AGENTS.md **only** if you learned a durable, generalizable lesson about how to work in this repo (e.g., a principle, process, debugging heuristic, or coding convention). Do **not** add bug- or component-specific notes (for example, “set .foo color in bar.css”) unless they reflect a broader rule.
  
  If the lesson cannot be stated without referencing a specific selector or file, skip the memory and make no changes. Keep it to **one short bullet** under an appropriate existing section, or add a new short section only if absolutely necessary.

It hardly creates rules, but when it does, it affects rules in a way that positively affects behavior. This works very well.

Another common mistake is to have very long AGENTS.md files. The file should not be long. If it's longer than 200 lines, you're certainly doing it wrong.

joquarky · 25 days ago
> If nothing durable was learned, no changes are needed.

Off topic, but oh my god if you don't do this, it will always do the thing you conditionally requested it to do. Not sure what to call this but it's my one big annoyance with LLMs.

It's like going to a sub shop and asking for just a tiny bit of extra mayo and they heap it on.

Bolwin · 25 days ago
LLMs generally seem trained with the assumption that if you mention it, you want it.

I don't think the instruction-following benches test for this much, and I don't know how you'd measure it well.

wolfejam · 17 days ago
This paper validates what we've been building toward. The core issue isn't the idea of context files — it's that prose is the wrong format for structured facts.

AI crushes structured data like package.json but struggles with free-form markdown. Two developers describe the same repo completely differently. There's no schema, no validation, no scoring.

Our paper on CERN's Zenodo proposes FAF — a structured YAML format (IANA-registered as application/vnd.faf+yaml) that replaces prose with validated fields. One .faf file generates native outputs for CLAUDE.md, AGENTS.md, .cursorrules, and GEMINI.md. The instruction files stay — they just sit on top of a structured foundation instead of floating independently.

Paper: https://zenodo.org/records/18251362