*What I did:*
- Built a multi-agent AI system with three specialised agents:
  - Orchestrator: the brain - never touches code, just delegates and coordinates
  - Explorer agents: read/run-only investigators that gather intel
  - Coder agents: the ones who actually implement stuff
- Created a "Context Store": persistent memory that lets agents share their discoveries (see the sketch after this list).
- Tested on TerminalBench with both Claude Sonnet-4 and Qwen3-Coder-480B.
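To give a feel for the shape of it, here's a minimal sketch of the delegation loop plus Context Store. All the names (`ContextStore`, `launch_subagent`, the artifact keys) are illustrative, not the repo's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class ContextStore:
    """Persistent memory shared across subagent launches."""
    artifacts: dict[str, str] = field(default_factory=dict)

    def save(self, key: str, content: str) -> None:
        self.artifacts[key] = content

    def fetch(self, keys: list[str]) -> dict[str, str]:
        return {k: self.artifacts[k] for k in keys if k in self.artifacts}

def launch_subagent(role: str, instructions: str, context: dict[str, str]) -> str:
    """Stub: the real system spins up an LLM agent with role-specific
    tools and returns its final report as a knowledge artifact."""
    return f"[{role} report for: {instructions}]"

def run_task(task: str, store: ContextStore) -> str:
    # Explorer gathers intel (read & run only) and returns an artifact.
    intel = launch_subagent("explorer", f"Investigate: {task}", store.fetch([]))
    store.save("exploration_summary", intel)
    # Coder implements, primed with the explorer's findings.
    result = launch_subagent("coder", f"Implement: {task}",
                             store.fetch(["exploration_summary"]))
    store.save("implementation_report", result)
    return result
```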
*Key results:*
- Orchestrator + Sonnet-4: 36.0% success rate (#12 on the leaderboard, ahead of Claude Code!)
- Orchestrator + Qwen-3-Coder: 19.25% success rate
- Sonnet-4 consumed 93.2M tokens vs Qwen's 14.7M tokens to complete all tasks!
- The orchestrator's explicit task delegation + intelligent context sharing between subagents seem to be the secret sauce.
*(Kind of) Technical details:*
- The orchestrator can't read/write code directly - this forces proper delegation patterns and strategic planning
- Each agent gets precise instructions about which "knowledge artifacts" to return; these artifacts are stored and can be provided to future subagents at launch.
- Adaptive trust calibration: simple tasks = high autonomy, complex tasks = iterative decomposition
- Each agent has its own set of tools it can use (see the sketch after this list).
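To make the delegation constraint concrete, here's roughly how the per-role tool scoping looks. The tool names are made up for the example; the real set is in the repo:

```python
# The orchestrator's toolset has no file or shell access, so the only
# way it can make progress is to delegate to subagents.
TOOLS_BY_ROLE = {
    "orchestrator": ["launch_explorer", "launch_coder", "read_context_store"],
    "explorer":     ["read_file", "run_command", "write_artifact"],  # read & run only
    "coder":        ["read_file", "write_file", "run_command", "write_artifact"],
}

def allowed(role: str, tool: str) -> bool:
    return tool in TOOLS_BY_ROLE[role]

assert not allowed("orchestrator", "write_file")  # delegation is structural
```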
*More details:*
My GitHub repo has all the code, system messages, and way more technical details if you're interested! (GitHub handle is danau5tin).
Thanks for reading!
Dan
*What I did:*
- I trained a 14B orchestrator model to better coordinate explorer & coder subagents (subagents are tool calls for the orchestrator)
- Scaled to 32x H100s pushed to their limits across 4 bare-metal nodes
- Scaled to 256 Docker environments rolling out simultaneously, automatically distributed across the cluster
*Key results:*
- Qwen3-14B jumped from *7% → 18.25%* on TerminalBench after training
- The model is now within striking distance of Qwen3-Coder-480B (19.7%)
- Training was stable, with a smooth entropy decrease and healthy gradient norms
*Training approach:*
Reward design and biggest learning: kept it simple - *just unit tests*. Every "smart" reward signal I tried to craft led to policy collapse.
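The reward is essentially binary pass/fail on the task's unit tests. A minimal sketch of that shape - the container handle and `pytest` command here are placeholder assumptions, not the exact harness:

```python
import subprocess

def reward(container_id: str, test_cmd: str = "pytest -q") -> float:
    """Run the task's unit tests inside the rollout's Docker container;
    reward 1.0 only if every test passes, else 0.0."""
    proc = subprocess.run(
        ["docker", "exec", container_id, "sh", "-c", test_cmd],
        capture_output=True,
        timeout=600,  # don't let a hung rollout stall training
    )
    return 1.0 if proc.returncode == 0 else 0.0
```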
Curriculum learning:
- Stage 1: tasks where the base model succeeded 1-2 times out of 3 attempts (41 tasks)
- Stage 2: tasks where the Stage-1 model succeeded 1-4 times out of 5 attempts
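In code, each curriculum stage is just a filter on per-task success counts. A sketch, with illustrative names:

```python
def select_stage_tasks(results: dict[str, list[bool]],
                       min_wins: int, max_wins: int) -> list[str]:
    """Keep tasks the current policy solves sometimes but not always,
    so every batch of rollouts carries learning signal.
    results maps task_id -> outcomes of n sampled attempts."""
    return [task for task, outcomes in results.items()
            if min_wins <= sum(outcomes) <= max_wins]

# Stage 1: keep tasks the base model solved 1-2 times out of 3 attempts.
stage1 = select_stage_tasks({"task_a": [True, False, False]}, 1, 2)
```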
Dataset: Used synthetically generated RL environments and unit tests
*More details:*
I have added lots more details in the repo linked to this submission, including training code, model weights, and datasets.
Huge thanks to:
- Tara for providing the compute
- The Prime Intellect team for building prime-rl and dealing with my endless questions
- Alex Dimakis for the conversation that sparked training the orchestrator model
Thanks for reading!
Dan
(Evaluated on the excellent TerminalBench benchmark by Stanford & Laude Institute)