We (the Princeton SWE-bench team) built an agent in ~100 lines of code that does pretty well on SWE-bench, you might enjoy it too: https://github.com/SWE-agent/mini-swe-agent
> Your task: {{task}}. Please reply with a single shell command in triple backticks. To finish, the first line of the output of the shell command must be 'COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT'.
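For anyone wondering what the loop around a prompt like that looks like, here is a minimal sketch. It is illustrative only - it assumes an OpenAI-compatible client, the names are placeholders, and it is not the actual mini-swe-agent code:

```python
import re
import subprocess
from openai import OpenAI  # assumes an OpenAI-compatible endpoint; any chat API works

client = OpenAI()
SUBMIT_MARKER = "COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT"

def run_agent(task: str, model: str = "gpt-4o", max_steps: int = 50) -> str | None:
    # Ask for exactly one shell command per reply, wrapped in triple backticks.
    messages = [
        {"role": "system", "content": "Reply with a single shell command in triple backticks."},
        {"role": "user", "content": f"Your task: {task}"},
    ]
    for _ in range(max_steps):
        reply = client.chat.completions.create(model=model, messages=messages)
        text = reply.choices[0].message.content or ""
        messages.append({"role": "assistant", "content": text})

        # The "tool call" is just the first triple-backtick block in the reply.
        match = re.search(r"```(?:\w*\n)?(.*?)```", text, re.DOTALL)
        if not match:
            messages.append({"role": "user", "content": "Please reply with exactly one command in triple backticks."})
            continue

        result = subprocess.run(match.group(1), shell=True, capture_output=True, text=True)
        output = result.stdout + result.stderr

        # The agent signals completion by making the first output line the submit marker.
        if output.splitlines() and output.splitlines()[0].strip() == SUBMIT_MARKER:
            return output
        messages.append({"role": "user", "content": f"Output:\n{output}"})
    return None
```

That whole "observe output, append it, ask again" cycle is essentially the entire agent.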
> 1. Analyze the codebase by finding and reading relevant files
2. Create a script to reproduce the issue
3. Edit the source code to resolve the issue
4. Verify your fix works by running your script again
5. Test edge cases to ensure your fix is robust
This prompt snippet from your instance template is quite useful. I use something like this for getting out of debug loops:
> Analyse the codebase and brainstorm a list of potential root causes for the issue, and rank them from most likely to least likely.
Then create scripts or add debug logging to confirm whether your hypothesis is correct. Rule out root causes from most likely to least by executing your scripts and observing the output in order of likelihood.
A very similar how-to guide, written by Thorsten Ball, can be found here: https://ampcode.com/how-to-build-an-agent. In general Amp is quite interesting - obviously no hidden gem anymore ;-) but it's great to see more tooling around agentic coding being published, especially since similar agentic approaches will be part of (certain? many?) software suites in the future.
Can someone confirm my understanding of how tool use works behind the scenes? Claude, ChatGPT, etc. offer "tools" through the API and give responses that ask for tool invocations, which you then perform, sending the result back. However, the underlying model is a strictly text-based medium, so I'm wondering how exactly the model APIs turn the model's response into these different sorts of API responses. I'm assuming there's been a fine-tuning step with lots of examples which put desired tool invocations into some sort of delineated block or something, which the Claude/ChatGPT server understands? Is there any documentation about how this works exactly, and what those internal delineation tokens and such are? How do they ensure that the user text doesn't mess with it and inject "semantic" markers like that?
The disconnect here is that models aren't really "text"-based but token-based - like how compilers don't work on the code itself but on a series of tokens that can include keywords, brackets, and other things. The output can include words, but also metadata.
You have the right picture of what’s going on. Roughly:
* The only true interface with an LLM is tokens. (No separation between control and data channels.)
* The model api layer injects instructions on tool calling and a list of available tools into the base prompt, with documentation on what those tools do.
* Tool calling is delineated by special tokens. When a model wants to call a tool, it adds a special block to the response that contains the magic token(s) along with the name of the tool and any params. The api layer then extracts this and forms a structured json response in some tool_calls parameter or whatever that is sent in the api response to the user. The result of the tool coming back from the user through the tool calling api is then encoded with special tokens and injected.
* Presumably, the api layer prevents the user from injecting such tokens themselves.
* SotA models are good at tool calls because they have been heavily fine-tuned on them, with all sorts of tasks that involve tool calls, like bash invocations. The fine-tuning is both to make them good at tool calling in general and probably also covers specific tools that the model provider wants them to be good at, such as Claude Sonnet being fine-tuned on the specific tools Claude Code uses.
Sometimes it amazes me that this all works so well, but it does. You are right to put your finger on the fine-tuning, as it’s critical for making tool calling work well. Tool calling works without fine-tuning, but it’s going to be more hit-or-miss.
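To make the developer-facing half of this concrete, here is a small sketch using the OpenAI-style chat-completions tool format (the field names come from that public API; the special tokens stay hidden behind it, and the tool output below is a made-up example):

```python
import json
from openai import OpenAI

client = OpenAI()

# The "tools" list is what the API layer turns into prompt-level documentation
# the model can read; the special tokens that wrap an actual call stay hidden.
tools = [{
    "type": "function",
    "function": {
        "name": "bash",
        "description": "Run a shell command and return its output.",
        "parameters": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    },
}]

messages = [{"role": "user", "content": "Is there a README in this repo?"}]
resp = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
msg = resp.choices[0].message

if msg.tool_calls:
    # The API layer has already parsed the model's special-token block into structured JSON.
    call = msg.tool_calls[0]
    args = json.loads(call.function.arguments)
    print(call.function.name, args)  # e.g. bash {'command': 'ls'}

    # We execute the tool ourselves and send the result back, tagged with the call id,
    # so the API layer can re-encode it with the right special tokens for the model.
    messages.append(msg)
    messages.append({"role": "tool", "tool_call_id": call.id, "content": "README.md  src  tests"})
    followup = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
    print(followup.choices[0].message.content)
```

The tool result string is hard-coded here; a real agent would run the command and feed back its actual output.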
> I'm assuming there's been a fine-tuning step with lots of examples which put desired tool invocations into some sort of delineated block or something, which the Claude/ChatGPT server understands?
As far as I know that's what's happening. They train it to return tool responses when it's unsure about the answer or instructed to do so. There is generic tool training for just following the response format, and then probably some tool-specific training. For instance, gpt-oss loves to use the search tool, even if it's not mentioned anywhere. Anthropic lists well-known tools in their documentation (e.g. text_editor, bash); those have likely been trained specifically to follow deeper semantics than just generic tool usage.
The whole thing is pretty brittle and tool invocations are just taking place via in-band signalling, delineated by special tokens or token sequences.
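One public example of that in-band signalling: several open models (e.g. the Qwen and Hermes families) wrap calls in <tool_call> tags directly in the generated text, and the serving layer parses them back out into the structured API response, roughly like this:

```python
import json
import re

# Raw assistant text as it might come out of such a model (made-up example);
# other model families use different, often undocumented, special tokens.
raw = (
    "Let me check the repository layout.\n"
    '<tool_call>\n{"name": "bash", "arguments": {"command": "ls"}}\n</tool_call>'
)

# The serving layer extracts every <tool_call> block from the token stream and
# turns it into the structured tool_calls field of the API response.
tool_calls = [
    json.loads(m.group(1))
    for m in re.finditer(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", raw, re.DOTALL)
]
print(tool_calls)  # [{'name': 'bash', 'arguments': {'command': 'ls'}}]
```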
Late to the party, but thanks to the author for this: I learned a lot from this article, although I have mixed feelings about it.
The good: cool to know more about agent loops, different types of LLMs, and ideas for prompting. I definitely wanna try it - it would be cool to prompt the agent to build some feature, have it run in a loop of building, testing and reviewing, go have breakfast, come back and only have to tweak reasonably legible and working code.
The bad: some of these concepts - maybe they aren't meant to mislead, but they really trigger my 'snake oil alert'. The AI compass? Agentic vs non-agentic LLMs? People who are getting work done between meetings? Maybe this is more of a vibe thing, so it's not trivial/logical to explain, but in this space there are so many loosely defined concepts that they really trigger skepticism in me (and others).
The ugly: 1 word slides ;p
Technically speaking, you can get away with just a Bash tool, and I had some success with this. It's actually quite interesting to take tools away from agents and see how creative they get with what's left.
One of the reasons you get better performance if you give them the other tools is that there has been some reinforcement learning on Sonnet with all these tools. The model is aware of how these tools work, it is more token-efficient, and it is generally much more successful at performing those actions. The Bash tool, for instance, at times gets confused by bashisms, not escaping arguments correctly, not handling whitespace correctly, etc.
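A toy illustration of the quoting problem (not from the article): a structured tool that receives the path as a separate argument has nothing to escape, while a raw shell string does.

```python
import subprocess

filename = "release notes (v2).md"  # spaces and parentheses: easy to mangle in a shell string

# What a bash-only agent might emit if it forgets to quote: the shell word-splits
# on the spaces and chokes on the parentheses.
subprocess.run(f"cat {filename}", shell=True)

# A structured tool can pass the argument through untouched, with nothing to escape.
subprocess.run(["cat", filename], check=False)
```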
> The model is aware of how these tools work, it is more token-efficient and it is generally much more successful at performing those actions.
Interesting! This didn't seem to be the case in the OP's examples - for instance, using a list_files tool and then checking whether the JSON result included README, vs just [ -f README ] in bash.
Separate tools is simpler than having everything go through bash.
If everything goes through bash then you need some way to separate always safe commands that don't need approval (such as listing files), from all other potentially unsafe commands that require user approval.
If you have listing files as a separate tool then you can also enforce that the agent doesn't list any files outside of the project directory.
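A rough sketch of that separation, with hypothetical tool functions rather than any particular agent's API: listing files is auto-approved and confined to the project root, while arbitrary bash goes through user approval.

```python
import subprocess
from pathlib import Path

PROJECT_ROOT = Path("/workspace/my-project").resolve()  # hypothetical project location

def list_files(relative_dir: str = ".") -> list[str]:
    """'Safe' tool: auto-approved, but confined to the project directory."""
    target = (PROJECT_ROOT / relative_dir).resolve()
    if not target.is_relative_to(PROJECT_ROOT):  # blocks escapes like "../../etc"
        raise PermissionError(f"{target} is outside the project directory")
    return sorted(p.name for p in target.iterdir())

def run_bash(command: str) -> str:
    """'Unsafe' tool: every command goes through user approval first."""
    if input(f"Run `{command}`? [y/N] ").strip().lower() != "y":
        return "Command rejected by user."
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return result.stdout + result.stderr
```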
> you need some way to separate always safe commands that don't need approval (such as listing files), from all other potentially unsafe commands that require user approval.
This is a very strong argument for more specific tools, thanks!
Yeah, you could get away with a coding agent using just the Bash tool and the Edit tool (tbh the Edit tool is somewhat optional, but not having it would be highly inefficient). I haven't tried it, but it might struggle with the code search functionality; that should be possible with the right prompting, though. For example, you could just prompt the LLM: "If you need to search the source code, use ripgrep with the Bash tool."
Why do humans need an IDE when we could do everything in a shell?
An interface gives you the information you need at a given moment and the actions you can take.
To me a better analogy would be: if you're a household of 2 who own 3 reliable cars, why would you need a 4th car with smaller cargo & passenger capacities, higher fuel consumption, worse off-road performance and lower top speed?
Who says that tokens are money? Local models are getting really good. For now, yes, if you want the best outcomes, you need to purchase tokens. But in the future, that may not be the case.
I'd argue that local models still cost money, albeit less than the vendors would charge. Unless you happen to live off-grid and get your own electricity for free. I suppose there are free tiers available that work for some things as well.
But edge-case exceptions aside, yes, tokens cost money.
Local models are great for basic tasks like summarization and translation, but for the best results from coding agents - and for the 90% of so-called AI startups that are using these APIs - everyone is purchasing tokens.
It's no different from operating a slot machine aimed at vibe-coders, who are the AI companies' favourite type of customer - spending endless amounts of money on tokens for another spin at fixing an error they don't understand.
Has this agent fully built anything? That is a pretty straightforward question that you should be expected to answer when submitting something like this to HN.
You haven't built anything. You are just a grifter spinning words in desperate need for attention. No one will ever use your "product" because it's useless. You know this and yet you keep trying to hustle the ignorant. Keep boosting yourself with alts.
The whole thing runs on these prompts: https://github.com/SWE-agent/mini-swe-agent/blob/7e125e5dd49...
https://github.com/SWE-agent/mini-swe-agent/blob/7e125e5dd49...
That's not the case with a codebase, where things are littered around according to whatever specific model of organisation the developer had in mind.
https://en.wikipedia.org/wiki/Lumpers_and_splitters
You wish
I've built a SWE agent too (for fun), check it out => https://github.com/myriade-ai/autocode
The lack of tools in mini-swe-agent is a feature: you can run it with any LLM, no matter how big or small.
https://docs.anthropic.com/en/docs/agents-and-tools/tool-use...
Surely listing files, searching a repo, and editing a file can all be achieved with bash?
Or is this what's demonstrated by https://news.ycombinator.com/item?id=45001234?
If you need to edit the source, just use patch with the bash tool.
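For illustration, this is roughly what that looks like when the agent only has a bash-style tool: it emits a unified diff and pipes it into patch (the file and diff below are made up).

```python
import subprocess

# A unified diff the model might emit (file and contents are hypothetical);
# the agent simply pipes it into `patch` instead of needing an Edit tool.
diff = """\
--- a/app/config.py
+++ b/app/config.py
@@ -1,3 +1,3 @@
-DEBUG = True
+DEBUG = False
 TIMEOUT = 30
 RETRIES = 3
"""

result = subprocess.run(
    ["patch", "-p1"], input=diff, capture_output=True, text=True, cwd="/workspace/my-project"
)
print(result.stdout or result.stderr)
```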
What's the efficiency issue?
My best guess is that they started out with a limited subset of tools and realised they could just give it bash later.
Money. Replace "tokens" with "money". You just keep throwing money at the loop, and then you've got yourself an agent.
And remember to avoid feeding the trolls.