RugnirViking 11 hours ago [-]
yes, they do. I think people overindex on this paper; I remember when it came out we had a lot of discussion in my company about it. But it's clear to see they do at least change the agent's behavior, and things like telling it "always use xyz version of java, use gradle to build the project, use this command to run the tests" are really important instead of letting it fumble about trying to find the right thing every time you ask it anything
I think the trap some people fall into, and especially LLM-authored files (which is where they see the documents not helping here), is describing the code, or the structure of the code, instead. I don't think that helps much: the agent can already see you have 4 modules called a, b, c and d, and can read the readmes inside of them just fine if it has questions.
One more marginal thing I find helpful, but I'm less sure has a positive impact, is describing the right terminology for the agent so it can be smarter at communicating with the developer. Things like different names for the product, products it interfaces with, resource names in infra, terms from the customer and product team. I don't think it helps the agent code (much) but it does help communication if it knows what we mean when we speak (and naming things is, as we know, one of the hard problems in CS)
Overall, most of my agents.md files are now a list of useful bash commands for working with and testing the project (here's how to spin up docker services, here's how to update the libraries, here's how to run a command against the local db, here's how to insert a document to be run, etc.)
and then at the end a terminology blob, which I find myself referencing too.
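A minimal sketch of the layout described above (a command list followed by a terminology blob), in case it helps to see it concretely. The helper name `render_agents_md`, the section headings, and every entry in the example are placeholders for illustration, not anyone's actual file.

```python
# Hypothetical helper: renders a short AGENTS.md made of commands plus a glossary.
def render_agents_md(commands: dict[str, str], glossary: dict[str, str]) -> str:
    lines = ["# AGENTS.md", "", "## Commands", ""]
    for purpose, command in commands.items():
        # One bullet per task, with the exact command in backticks.
        lines.append(f"- {purpose}: `{command}`")
    lines += ["", "## Terminology", ""]
    for term, meaning in glossary.items():
        lines.append(f"- {term}: {meaning}")
    return "\n".join(lines) + "\n"


# Placeholder entries; substitute the project's real commands and terms.
example = render_agents_md(
    commands={
        "Build": "./gradlew build",
        "Run tests": "./gradlew test",
        "Start local services": "docker compose up -d",
    },
    glossary={"Ledger": "internal name for the billing service"},
)
print(example)
```

Keeping the source of truth as two small mappings makes it easy to see at a glance when a command or term has gone stale.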
nicolailolansen 10 hours ago [-]
Yeah. It's very easy to give a definitive "yes" if you've ever worked with coding agents in any capacity. I use them in much the same way as you: there are a bunch of things that would be nice for the agent to know, specific to me, the project, or how I would like it to run, for example git or code review loops (I use roborev). Asking the agent to do these things every time is very time consuming.
iLoveOncall 9 hours ago [-]
Maybe read the paper... It says that the only things that are helpful are indeed what you describe here: basic commands and context about running / working on the project, rather than information about the business or technological aspects of the project itself.
ch4s3 9 hours ago [-]
Yeah I wrote my own language and obviously it’s not in any training data, which ended up being an interesting experiment with agents. I’ve found that a few concise skills and an agents.md make a huge difference in guiding the LLMs. Specifically getting them to use the all-in-one build tool, which the LLMs won’t use without direction.
hnlmorg 8 hours ago [-]
As a language author myself, I'd be interested to learn more about how you utilised agents.md and skills for this language of yours.
ch4s3 8 hours ago [-]
The core design decision that supports everything is having a sort of multi-tool that does a lot for the language, called forge. It builds, tests, runs, initializes, formats, manages deps, and so on. It also has a search tool that apparently works kind of like Haskell's Hoogle (which I discovered later), and that search tool helps agents find code.
I have some usage instructions about this tool in my various agents files, which work well enough alongside a syntax.md that can be copied around.
From there I have some concise skills under /.claude/commands/ like build-test, forge-search, and a few things for working on the compiler. The specific skills include short snippets, descriptions, and some concise usage guidelines.
It all works reasonably well.
My biggest issue is that as I build out libraries I keep finding edge cases with my Perceus GC[1] implementation.
That is all the stuff that should be in the README.md of the project in the first place though.
RugnirViking 8 hours ago [-]
right! I don't disagree. README and agents.md probably will end up looking similar (or being the same) in the long run - readmes should probably have MORE information about the structure of the code if anything
swingboy 7 hours ago [-]
Yes, but harnesses don’t automatically include the README.md in the system prompt like they do AGENTS.md.
Paper was discussed here 4 months ago, and the linked tweet on this post doesn't add any insights and completely misses the huge caveats that come with the found result: the main benefits of using AGENTS.md files are inherently opposed to the characteristics of _median_ "public github project that has an AGENTS.md file".
prodigycorp 11 hours ago [-]
Yeah it was a good discussion. I'm definitely on the side of "a well written agents.md is very good for the agent". The file should not be static. The agents.md from a year ago is not the right agents.md for today. Many of them are overly long, overly instructive, and include a lot of unnecessary code bits.
I think it's useful for people using an agent harness to regularly evaluate their skills, agent rules, and memory implementation to ensure there's no conflict across them all. Also, best to rely as little as possible on the agent to write its own agents.md.
It can be tedious, but that's why agentic coding can still be considered a "skill".
nikcub 9 hours ago [-]
> "a well written agents.md is very good for the agent"
while even a mildly bad agents.md can be _very_ bad for the agent. they rot very quickly which is why human curation is essential.
same with memory - a lot of the self-learning tools that are becoming popular now degrade agents over time - which is why you end up being able to run an eval with no context and it performs better
> but that's why agentic coding can still be considered a "skill".
yes - far too many cases of throwing a kitchen sink of prompts, skills, tools etc. thinking the llm will sort it out. you need to constantly prune, eval, tweak, observe, update etc. in a loop
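One way to catch some of the rot mentioned above is a small check that flags file paths mentioned in the context file that no longer exist. This is only a sketch under assumptions: `missing_paths` is a made-up helper, and treating any backticked token with a slash or a file extension as a path is a rough heuristic that will miss things and occasionally misfire.

```python
import re
import tempfile
from pathlib import Path


def missing_paths(agents_md: str, repo_root: Path) -> list[str]:
    """Return backticked, path-like tokens that do not exist under repo_root."""
    # Only backticked tokens without whitespace are considered.
    candidates = re.findall(r"`([^`\s]+)`", agents_md)
    missing = []
    for token in candidates:
        # Heuristic (an assumption): a slash or a file extension means "path".
        looks_like_path = "/" in token or Path(token).suffix != ""
        if looks_like_path and not (repo_root / token).exists():
            missing.append(token)
    return missing


# Demo against a throwaway directory with one real doc and one stale reference.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "docs").mkdir()
    (root / "docs" / "backend.md").write_text("backend notes\n")
    text = "Backend docs: `docs/backend.md`. Frontend docs: `docs/frontend.md`."
    stale = missing_paths(text, root)
    print(stale)
```

A check like this only covers dead references; whether the remaining instructions still help is something only an eval or a human read-through can tell.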
sebra 11 hours ago [-]
The tweet misses the conclusion from the paper that handcrafted AGENTS.md might help. To me it's no surprise that 100% vibed AGENTS.md are unproductive. Not reviewing your design docs is probably even worse than not reviewing your code? I've seen some AI-generated agents.md which were just plain wrong. No surprise agents perform worse after reading those.
I use AGENTS.md to make sure my agents loop effectively (tests, quality, etc). Not to describe the code / architecture.
graeber_28927 10 hours ago [-]
> To me its no surprise that 100% vibed AGENTS.md are unproductive.
To me it is! I thought if an AGENTS.md is researched with much effort once, next time reading the brief conclusion would save on effort and time and cost. Would have seemed logical to me. Like a glossary, looking up a manual, getting briefed before starting.
But I accept the research findings, and I have even had bad experiences with an outdated AGENTS.md derailing the agent.
So now I'm trying to keep to the basics, e.g. just a sentence or two saying "use bun instead of npm" only because I've had to manually prompt that too many times.
It's remarkable how good agents are at exploring, and at getting a good picture from relatively sparse code reads nevertheless.
theshrike79 10 hours ago [-]
Don't describe the code / architecture in AGENTS.md - BUT tell the agent where the documentation for both can be found for each module etc.
This way it can refer to the docs only when relevant instead of dragging around the full knowledge of the front-end structure while it's working on the backend.
kandros 11 hours ago [-]
The amount of cargo-culting around AI tooling and practices is so weird to me.
Why not just try and see? The fast feedback loop allows testing all kinds of weird theories in a matter of 30m-1h during normal working sessions, and most results are obvious
CodesInChaos 10 hours ago [-]
The configuration and workflow space is big, and exploring it isn't fun for most people.
netdevphoenix 10 hours ago [-]
>Why not just try and see?
There is a difference between trying something and performing a test whose results are meant to be representative of most setups. A lot of people (in tech, surprisingly) who see themselves as practitioners of applied science apply flawed methodologies and try to generalise the results.
Just because testing is fast, it does not mean that it is free or cheap. Plus time is money and verifying every LLM setup every koolaid dev is proclaiming on bluesky could easily be a full time job.
popcorncowboy 9 hours ago [-]
Agents.md is just a prompt pre-pend. This is like asking "do prompts help coding agents".
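To make the "prompt pre-pend" point concrete, here is a rough sketch of what a harness might do with the file. It is an assumption for illustration only: `build_system_prompt` is a hypothetical function, and real harnesses differ in where and how they inject the file.

```python
import tempfile
from pathlib import Path


def build_system_prompt(base_prompt: str, repo_root: Path,
                        filename: str = "AGENTS.md") -> str:
    """Prepend the context file, if present, to the base system prompt."""
    context_file = repo_root / filename
    if not context_file.is_file():
        # No context file: the agent runs on the base prompt alone.
        return base_prompt
    context = context_file.read_text(encoding="utf-8").strip()
    return f"{context}\n\n{base_prompt}"


# Demo: the same call before and after the file exists.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    without_file = build_system_prompt("You are a coding agent.", root)
    (root / "AGENTS.md").write_text("Use bun instead of npm.\n", encoding="utf-8")
    with_file = build_system_prompt("You are a coding agent.", root)

print(with_file)
```

Seen this way, every line in the file is paid for on every request, which is part of why shorter files keep coming up in this thread.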
dofm 9 hours ago [-]
But that is an open question. For example some smaller, local-friendly LLMs (Gemma, Qwen) clearly do better without a long system prompt. Faster and just no worse at code. Both seem eminently capable of writing high quality, eloquent code without being specifically prompted to, both need little instruction on tool calling. The consensus is leading us towards very long system prompts, agent files, stuffing with skills, when it might be cargo-culting (again).
A culture of really long preambles certainly benefits cloud AI companies though!
avadodin 8 hours ago [-]
You can just ask any model to go do a code review and it will probably do better than no review at all, but with rules you can at least make sure it processed the inputs enough to produce a (possibly hallucinated) reason why a rule doesn't apply, which you can then check.
To me, that's worth losing some theoretical model performance and token efficiency unless you're using the small local models which basically lose all of that while trying to follow the rules.
dofm 4 hours ago [-]
Definitely. Though my own experimentation is showing that with the small models it is worth working through the fewest words. I don't think Gemma will need too much prompting to not sound like a web dev blogger.
zuzululu 12 hours ago [-]
Not enough on its own; you'd need artifacts to store contexts/TOC/lists
I think the shorter the better.
also a strange finding from my own experiences: specific empirical formats seem to yield much better results. For example people often say "get this done to 100%" but I say "get this to 88.47%".
wiseowise 12 hours ago [-]
You putting “you’re an expert jerk off master” in agents.md is the same as shaman burning a bone to predict a future.
dofm 9 hours ago [-]
I am equally cynical!
Some of it is cosmic ordering, some of it is cargo culting, some of it is clearly FUD.
The trick presumably is to end up on the words that actively shift the “culture” of the subsequent interpretation of words.
I don’t know how you do that, really, but I do think it’s extremely instructive to see what a model can do with an empty prompt.
rsfern 7 hours ago [-]
If the explicit role-playing prompt is just to identify multi-valent terms, then revising the question to include more specific context without a role-play prompt should work just as well right? I’d be really interested if anyone has evaluated that hypothesis
A fun (frustrating) feature of language is that we get these name collisions even within a single domain. One that I have to remember to revise myself fairly often these days when chatting with other experts in my field is “diffusion model” which can either mean generative deep learning or a differential equation describing mass transport.
dofm 7 hours ago [-]
LoRA/LoRa always makes me chuckle. Feels like a turf war. And in the photography world, "AI model" is overloaded.
And indeed — beyond specific guidance around tool management, you have to wonder if a very minimal prompt and a /grill-me or /ubiquitous-language approach, narrowed to focus on what matters this time, wouldn't do better. I have been playing with this idea as I try to learn all this stuff in a local model context.
SeriousM 9 hours ago [-]
Yes. In my case (and I guess everyone's use case is subjective) my system prompt states to read the AGENT.md file when possible.
On a new project I usually set up the context of the model (language to use, reason for the product/prototype, etc.) and then I tell the LLM to write an AGENT.md, STATE.md and ROADMAP.md.
I don't tell the LLM what's in there because the model has its own directive and flavor for what should be in these files. The models already know the purpose of these files by themselves! On a new session, I let the agent read the markdown files in order to continue with the work. Before a session ends, I let the LLM update the markdown files.
Maybe one word of caution: don't switch models - it's like putting another person at a workstation and asking them to continue the work of others.
Easy setup, really good outcome!
weddpros 12 hours ago [-]
If adding something to the context doesn't help, it only proves you're not adding the right stuff.
I'm adding pointers to specification documents, and it saves me from the /new dumb coding agent that sees your code base for the first time and knows nothing about architecture, concepts, code organisation, etc...
I'm using no cookie cutter directives though (except maybe "do not attempt to deploy, we're using CI CD to deploy" to avoid an automatic "wrangler deploy" to Cloudflare)
hyperpape 9 hours ago [-]
We need a companion to "IN MICE", which is "IN EVALS".
I don't think this is bad research, but you have to understand how far it generalizes. I'm not saying that evals are useless, we need to do our best to produce good benchmarks. But benchmarks are always going to lag pretty far behind real world applications.
cdogukank 8 hours ago [-]
Anecdotally yes, but with diminishing returns — a short, specific agents.md helped more than a long one in my experience. Past a point the agent stops respecting the extra context. Curious if others have found a sweet spot for length.
asp_hornet 12 hours ago [-]
As the author notes in the end, it would be really interesting to do these again on more recent models. I wonder if the no context file being cheaper still stands. But then how much does the harness influence the results. It can be frustrating trying to gauge what’s influencing what and if something suddenly starts working against you.
OutOfHere 5 hours ago [-]
Fwiw, don't put in AGENTS.md what belongs in README.md.
Also, there is such a thing as excessive or useless context. Personally, I like to control what the AI reads by customizing the prompt with only what it needs and nothing more.
simianwords 9 hours ago [-]
Is this one of those times you discard "scientific" tests and trust your instinct instead? OpenAI is doing it well using curated context files [1]. Maybe these "scientific" explanations are not useful, or are even misleading.
The paper actually says:
"We find that all context files consistently increase the number of steps required to complete tasks. LLM-generated context files have a marginal negative effect on task success rates, while developer-written ones provide a marginal performance gain."
"Overall, our results suggest that context files have only marginal effect on agent behavior, and are likely only desirable when manually written."
verminator468 11 hours ago [-]
I say Nes.
Drakim 12 hours ago [-]
You are absolutely right, thank you for pushing back. Upon further examination, I've confirmed that the referenced paper says no.
/s
sam_lowry_ 11 hours ago [-]
On second thought... The result is counter-intuitive, because writing AGENTS.md (and reading it, if it's generated) contributes to the context in the human. So yes, AGENTS.md probably helps more than the paper says, but not the way we initially frame it.
Dunno if there's a way to test whether AGENTS.md helps the human more than the machine.
phplovesong 10 hours ago [-]
From my tests agents.md does NOT work with copilot. I have a custom language and copilot thinks it's Rust.
comboy 9 hours ago [-]
This is relevant to my interests, did you maybe test which models handle custom languages best? It also seems like a good proxy for them being able to stick to important instructions and not being carried away with things that are lookalikes.
[1] https://www.microsoft.com/en-us/research/wp-content/uploads/...
[1] https://openai.com/index/harness-engineering/