What I really want is to be able to run something like this locally for, say, less than $2000 in computer hardware. Is this feasible now, or any time soon? Anyone out there using agents with local models for coding?
A lot of people are excited about the Qwen3-Coder family of models: https://huggingface.co/collections/Qwen/qwen3-coder-687fc861...
For running locally, there are tools like Ollama and LM Studio. Your hardware needs will vary depending on what size/quantization of model you try to run, but $2k in hardware is reasonable for running a lot of models. Some people have good experiences with the M-series Macs, which are probably good bang for the buck if you're exclusively interested in inference.
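For a rough sense of what that looks like once a model is pulled, here's a minimal sketch against Ollama's local HTTP API (port 11434 is Ollama's default; the model name below is just a placeholder for whatever you've actually pulled):

    # Minimal sketch: call a locally served model through Ollama's HTTP API.
    # Assumes Ollama is running and you've pulled a coder model beforehand,
    # e.g. `ollama pull qwen2.5-coder` (model name here is a placeholder).
    import requests

    resp = requests.post(
        "http://localhost:11434/api/chat",  # Ollama's default local endpoint
        json={
            "model": "qwen2.5-coder",
            "messages": [{"role": "user",
                          "content": "Write a Python function that merges two sorted lists."}],
            "stream": False,  # one JSON response instead of a token stream
        },
        timeout=300,
    )
    print(resp.json()["message"]["content"])

Ollama also exposes an OpenAI-compatible endpoint, so most coding agents that let you set a custom base URL can usually be pointed at the same local server.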
I'd recommend checking out the LocalLlamas subreddit for more: https://www.reddit.com/r/LocalLLaMA/
Getting results on par with the big labs isn't feasible, but if you prefer to run everything locally, it's a fun and doable project.
So, for example, by and large the orgs I've seen chucking Claude PRs over the wall with little review were previously chucking 100% human-written PRs over the wall with little review.
Similarly, the teams I see effectively using test suites to guide their code generation are the same teams that effectively use test suites to guide their general software engineering workflows.
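As a concrete (entirely made-up) illustration of that loop: the humans write the test suite up front, and generated code is only accepted once it passes. Something like:

    # test_slugify.py -- the test suite acts as the spec the generated code
    # has to satisfy. Everything here (slugify, slugify_impl) is a made-up
    # example, not a reference to any particular project.
    from slugify_impl import slugify  # the module the model is asked to write

    def test_lowercases_and_hyphenates():
        assert slugify("Hello World") == "hello-world"

    def test_strips_punctuation():
        assert slugify("C'est la vie!") == "cest-la-vie"

    def test_collapses_whitespace():
        assert slugify("  too   many   spaces ") == "too-many-spaces"

The loop is then: generate, run pytest, feed the failures back, repeat until green, and only then send it for human review.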
Tough to use them as proof that this "doesn't have anything to do with economics" when their entire social life was defined by the economics of coal mining.
I probably could get an LLM to do so, but I won't....
Much of their work is focused on discovering "circuits" that occur between layer activations as they process data, which correspond to dynamics the model has learned. So, as a simple hypothetical example, instead of embedding the answer to 1 million arbitrary addition problems in the weights, models might learn a circuit that approximates the operation of addition.
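To make the hypothetical concrete (this is just a toy contrast, not how the interpretability work itself is done):

    # Toy contrast: memorizing answers vs. learning the operation.
    # A lookup table built from training pairs only covers what it has seen;
    # a general rule (hand-written here, standing in for a learned circuit)
    # keeps working outside that range.
    train_pairs = [(a, b) for a in range(100) for b in range(100)]
    lookup = {(a, b): a + b for a, b in train_pairs}   # "answers baked into weights"

    def circuit_add(a, b):                             # "learned addition circuit"
        return a + b

    a, b = 12345, 67890                                # far outside the training range
    print(lookup.get((a, b), "not memorized"))         # memorization fails here
    print(circuit_add(a, b))                           # the rule generalizes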
At their core (and as far as I understand), LLMs are based on pre-existing texts and use statistical algorithms to stitch together text that is consistent with them.
An original research manuscript will not have formed part of any LLM's training dataset, so there is no conceivable way for the model to evaluate it, regardless of claims that LLMs "understand" anything or not.
Reviewers who use LLMs are likely deluding themselves that they are now more productive due to AI, when in fact they are just polluting science through their own ignorance of epistemology.
You can then create datasets out of these traces, and use them to benchmark improvements you make to your application.
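As a rough, library-agnostic sketch of that idea (tools like Opik handle this for you; all names and file paths below are illustrative):

    # Capture traces of LLM calls, then freeze them into a benchmark dataset.
    import json, time, functools

    TRACE_FILE = "traces.jsonl"  # illustrative path

    def traced(fn):
        """Append each call's input, output, and latency as one JSON line."""
        @functools.wraps(fn)
        def wrapper(prompt):
            start = time.time()
            output = fn(prompt)
            record = {"input": prompt, "output": output,
                      "latency_s": round(time.time() - start, 3)}
            with open(TRACE_FILE, "a") as f:
                f.write(json.dumps(record) + "\n")
            return output
        return wrapper

    @traced
    def summarize(prompt):
        return "summary of: " + prompt  # stand-in for a real LLM call

    summarize("Q3 incident report ...")

    # Later: load the traces as a dataset and re-run them against a new
    # prompt/model version, comparing outputs to the recorded ones.
    with open(TRACE_FILE) as f:
        dataset = [json.loads(line) for line in f]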
This niche of the field has come a very long way just over the last 12 months, and the tooling is so much better than it used to be. Trying to do this from scratch, beyond a "kinda sorta good enough for now" project, is a full-time engineering effort in and of itself.
I'm a maintainer of Opik, but you have plenty of options in the space these days for whatever your particular needs are: https://github.com/comet-ml/opik