Developing aider, I've seen this problem with gpt-4o, Sonnet, DeepSeek, etc. Many aider users report this too. It's perhaps the #1 problem users have, so I created a dedicated help page [0].
Very large context may be useful for certain tasks with lots of "low value" context. But for coding, it seems to lure users into a problematic regime.
[0] https://aider.chat/docs/troubleshooting/edit-errors.html#don...
Many conflicting ideas are harder for models to follow than one large unified idea.
[0]: https://jeremyberman.substack.com/p/how-i-got-a-record-536-o...
His current work with "the Ruliad" and the hypergraph model of all possible rules is actually interesting. Whether it will yield results as a framework for finding a TOE, who knows? (It helps that the hypergraph edges have no physical length, which means Lorentz contraction and continuous space can still be modeled. It does seem to require discrete time, relative to some starting node, though. Could just be my limited understanding.)
Also, his derisive term for people who think in terms of computation is likely a backhanded reference to Wolfram.
For more up-to-date thoughts on thermodynamics I'd start here: https://writings.stephenwolfram.com/2023/02/computational-fo...
https://x.com/cognition_labs/status/1834292718174077014
I'd expect a very different experience with Devin vs the IDE-forks -- it provides status updates in Slack, runs CI, and when it's done it puts up a pull request in GitHub.
Slack integration, automatically pushing to CI, etc., are relatively low value compared to the questions that matter: “does it write better code than alternatives?”, “can I depend on it to solve hard problems?”, “will I still need a Cursor and/or ChatGPT Pro subscription to debug Devin’s mistakes?”
In my own experience using Cursor with Claude 3.5 Sonnet (new) and o1-preview, Claude is sufficient for most things, but there are times when Claude gets stumped. Invariably that means I asked it to do too much. But sometimes, maybe 10-20% of the time, o1-preview is able to do what Claude couldn’t.
I haven’t signed up for o1 Pro because going from Cursor to copy/pasting from ChatGPT is a big DevX downgrade. But from what I’ve heard o1 Pro can solve harder coding problems that would stump Claude or o1-preview.
My solution is just to split the problem into smaller chunks that make it tractable for Claude. I assume this is what Devin’s doing. Or is Devin using custom models or an early version of the o1 (full or pro) API?
After about an hour with Windsurf, I find myself frustrated with how it deals with context. If you add a directory to your Cascade, it's reluctant to actually read all the files in the directory.
I understand that they don't want to pay for a ton of long-context queries, but please, let users control the context, and pass the costs to the user.
It's very annoying to have the LLM try to create a file that already exists simply because it didn't know the file was there.
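My workaround is to build the file inventory myself and paste it into the prompt, so the model can't "miss" files it was never shown. A minimal sketch of that (the `demo_src` directory and `context.txt` file are hypothetical names for illustration):

```shell
# Set up a tiny hypothetical project so the example is self-contained.
mkdir -p demo_src
touch demo_src/app.py demo_src/utils.py

# Enumerate every file and save the sorted listing; prepend this to the
# prompt as explicit context so the model knows what already exists.
find demo_src -type f | sort > context.txt
cat context.txt
```

This is the control I wish the tool exposed directly: let me decide exactly which files go into context, even if the query gets long.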
Also, the comments about terminal management reflect a real issue. One solution is to expose the Cascade terminal to the user, who can then get it into a working state, with access to the correct dependencies and a properly sourced PATH.
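The idea can be sketched as a small wrapper the agent's terminal runs commands through, so everything executes with the user's environment sourced first. All file names here (`agent_env.sh`, `agent_shell.sh`, the `DEMO_DEP_READY` variable) are hypothetical, just to illustrate the pattern:

```shell
# Hypothetical user-configured environment: the PATH entries and
# dependency setup the agent's terminal should inherit.
cat > agent_env.sh <<'EOF'
export PATH="$HOME/.local/bin:$PATH"
export DEMO_DEP_READY=1
EOF

# Hypothetical wrapper shell: sources the environment, then runs
# whatever command the agent asked for inside it.
cat > agent_shell.sh <<'EOF'
#!/bin/sh
. ./agent_env.sh   # load the user-configured environment
exec "$@"          # then run the agent's command inside it
EOF
chmod +x agent_shell.sh

# Any command the agent runs through the wrapper now sees the setup.
./agent_shell.sh sh -c 'echo "DEMO_DEP_READY=$DEMO_DEP_READY"'
```

If the tool let users point Cascade's terminal at a wrapper like this, the "wrong PATH, missing dependencies" class of failures would mostly disappear.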
Getting the UX to work well enough is a major challenge. I’m currently redesigning it after negative feedback from early testers on my initial experimental UX. There’s a balance to be struck between giving users a low-latency response, giving the models time to work together and call tools, and not overloading the user with too much information.