Looking at the benchmark, https://www.swebench.com/, about half of the scored submissions get under 1/3 correct. So they're either not cheating, or not cheating effectively?
1. Coding assistants based on o1 and Sonnet are pretty great at coding with <50k tokens of context, but degrade rapidly beyond that.
2. Coding agents do massively better when they have a test-driven reward signal.
3. If a problem can be framed so that a coding agent can solve it, that speeds up development at least 10x over the base case of human + assistant.
4. From (1)-(3), if you can get all the necessary context into 50k tokens and measure progress via tests, you can speed up development by 10x (rough sketch below).
5. Therefore all new development should be microservices written from scratch and interacting via cleanly defined APIs.
Sure enough, I see HN projects evolving in that direction.
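To make (2) and (4) concrete, here's a minimal Python sketch of the loop I have in mind - `agent.propose_patch()` is a stand-in for whatever model call you use, and pytest is just one possible reward signal:

    # Minimal sketch of the test-driven reward loop from points (2) and (4).
    # `agent.propose_patch()` is a hypothetical LLM call, not a real API; the
    # reward is simply whether the repo's test suite passes after the patch.
    import subprocess

    def apply_patch(diff: str, repo_dir: str) -> bool:
        """Apply a unified diff with `git apply`; False if it doesn't apply cleanly."""
        proc = subprocess.run(
            ["git", "apply", "-"], input=diff, text=True,
            cwd=repo_dir, capture_output=True,
        )
        return proc.returncode == 0

    def tests_pass(repo_dir: str) -> bool:
        """The reward signal: does `pytest` exit cleanly?"""
        proc = subprocess.run(["pytest", "-q"], cwd=repo_dir, capture_output=True)
        return proc.returncode == 0

    def solve(agent, task_context: str, repo_dir: str, max_attempts: int = 5) -> bool:
        """Sample patches until the tests pass or attempts run out."""
        for _ in range(max_attempts):
            diff = agent.propose_patch(task_context)  # hypothetical: <50k tokens in, diff out
            if apply_patch(diff, repo_dir) and tests_pass(repo_dir):
                return True
            # Revert any partial changes before the next attempt.
            subprocess.run(["git", "checkout", "--", "."], cwd=repo_dir)
        return False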
I had a very similar impression (wrote more in https://hua.substack.com/p/are-longer-context-windows-all-yo...).
One framing is that the effective context window (i.e. the length the model can actually reason over) determines how useful the model is. A human new-grad programmer might effectively reason over hundreds or thousands of tokens but not millions - which is why we carefully scope their work and point them only at the relevant context. But a principal engineer might reason over many millions of tokens of context - code, yes, but also organizational and business context.
Trying to carefully select those 50k tokens is extremely difficult for LLMs/RAG today. I expect models to get much longer effective context windows but there are hardware / cost constraints which make this more difficult.
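As a toy illustration of that selection problem (not how any particular RAG stack works - `embed()` is a placeholder and the token count is a crude estimate), greedy packing under a budget might look like:

    # Toy sketch of context selection: rank code chunks by similarity to the
    # query, then greedily pack them into a fixed token budget. `embed()` is a
    # placeholder for any embedding model; tokens are estimated by whitespace.
    import math
    from typing import Callable, List

    def cosine(a: List[float], b: List[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb + 1e-9)

    def select_context(query: str, chunks: List[str],
                       embed: Callable[[str], List[float]],
                       budget_tokens: int = 50_000) -> List[str]:
        q = embed(query)
        ranked = sorted(chunks, key=lambda c: cosine(embed(c), q), reverse=True)
        selected, used = [], 0
        for chunk in ranked:
            cost = len(chunk.split())  # crude token estimate
            if used + cost <= budget_tokens:
                selected.append(chunk)
                used += cost
        return selected

The hard part, of course, is that the most relevant chunks for a real issue are often not the ones nearest the query embedding.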
1. perplexity filtering - a small LLM scores how in-distribution the data is relative to its own distribution. if the perplexity is too high (gibberish like this) or too low (likely already LLM-generated at low temperature, or already memorized), toss it out (rough sketch after this list).
2. models can learn to prioritize/deprioritize data just based on the domain name it came from. essentially they can learn 'wikipedia good, your random website bad' without any other explicit labels. https://arxiv.org/abs/2404.05405 and also another recent paper that I don't recall...
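A rough sketch of what (1) could look like mechanically - GPT-2 is used purely as an example small model and the thresholds are made up:

    # Rough sketch of perplexity filtering. A small causal LM scores each
    # document; implausibly high perplexity (gibberish) or implausibly low
    # perplexity (likely memorized or low-temperature LLM output) gets dropped.
    # GPT-2 and the thresholds are illustrative choices, not recommendations.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    @torch.no_grad()
    def perplexity(text: str) -> float:
        enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
        out = model(**enc, labels=enc["input_ids"])  # loss = mean token NLL
        return float(torch.exp(out.loss))

    def keep(text: str, low: float = 5.0, high: float = 200.0) -> bool:
        ppl = perplexity(text)
        return low <= ppl <= high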
that library used `viterin/vek` for SIMD math: https://github.com/viterin/vek/
However, from a product point of view I wouldn't necessarily want to pipe that into an LLM and have it reply directly; I think in a lot of use-cases there needs to be a tool/function-calling step before a reply (rough sketch at the end of this comment). Down to chat with anyone reading this who is working along these lines!
edit: tincans as mentioned below looks excellent too
editedit: noooo, apparently tincans development has ended - there's 10000% space for something in this direction. Chris, if you read this, please let me pitch you on the product/business use-cases this solves regardless of how good LLMs get...
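For concreteness, roughly what I mean by a tool/function-calling step before the reply - the tool names and `call_llm` are placeholders, not any specific vendor API:

    # Illustrative pipeline: transcribed speech -> pick and run a tool -> only
    # then compose the reply. `call_llm` and both tools are placeholders/stubs.
    import json
    from typing import Any, Callable, Dict

    TOOLS: Dict[str, Callable[..., Any]] = {
        "lookup_order": lambda order_id: {"order_id": order_id, "status": "shipped"},  # stub
        "book_meeting": lambda when: {"confirmed": when},                              # stub
    }

    def call_llm(prompt: str) -> str:
        raise NotImplementedError("plug in whichever model/API you use")

    def handle_turn(transcript: str) -> str:
        # 1. Ask the model which tool (if any) to call, as JSON.
        plan = json.loads(call_llm(
            f"User said: {transcript}\n"
            f"Available tools: {list(TOOLS)}\n"
            'Answer with JSON: {"tool": <name or "none">, "args": {...}}'
        ))
        # 2. Run the tool before generating any reply.
        result = TOOLS[plan["tool"]](**plan["args"]) if plan["tool"] in TOOLS else None
        # 3. Now compose the spoken reply, grounded in the tool result.
        return call_llm(f"User said: {transcript}\nTool result: {result}\nReply briefly.")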
I built that almost exactly a year ago :) it was good but not fast enough - hence building the joint model.
anyways, another interpretation is that the model also needs to decide whether the code in the issue is a reliable fix or not