Microsoft has clearly taken notice. They're already starting to lock down the upstream VSCode codebase, as seen with recent changes to the C/C++ extension [0]. It's not hard to imagine that future features like TypeScript 7.0 might be limited or even withheld from forks entirely. At the same time, Microsoft will likely replicate Windsurf and Cursor's features within a year, and deliver them with far greater stability and polish.
Both Windsurf and Cursor are riddled with bugs that don't exist upstream, _especially_ in their AI assistant features beyond the VSCode core. Context management, which is supposed to be the core feature these forks add, is itself incredibly poorly implemented [1].
Ultimately, the future isn't about a smarter editor; it's about a smarter teammate. Tools like GitHub Copilot or future agents will handle entire engineering tickets: generating PRs with tests, taking feedback, and iterating like a real collaborator.
[0] https://www.theregister.com/2025/04/24/microsoft_vs_code_sub...
[1] https://www.reddit.com/r/cursor/comments/1kbt790/rules_in_49...
If autonomous agents were just around the corner, then why wouldn't OpenAI bet on their own Codex product obviating (most of) the need for an IDE and save themselves the $3 billion?
Claude and Grok 4 did reasonably well (within CPA baselines) for the first few months, but tended to degrade as more data came in. Interestingly, the failures aren't exclusively a context-length problem: we reset the context monthly (with past decisions, accruals/deferrals, and comments available via tool calls), and the errors look more like reward hacking than pure hallucination.
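To make the reset concrete, the tool surface looks roughly like the sketch below; the class and method names here are invented for illustration, not our actual harness:

```python
class LedgerTools:
    """Illustrative tool surface available to the model after a reset.

    Assumes a simple in-memory store; the real harness's storage and
    tool names differ. The point: prior months' state is reachable
    only through explicit tool calls, never through the fresh prompt.
    """

    def __init__(self) -> None:
        self._decisions: dict[str, list[str]] = {}  # month -> decision notes
        self._accruals: dict[str, list[dict]] = {}  # month -> accruals/deferrals

    def record_decision(self, month: str, note: str) -> None:
        self._decisions.setdefault(month, []).append(note)

    def record_accrual(self, month: str, accrual: dict) -> None:
        self._accruals.setdefault(month, []).append(accrual)

    def past_decisions(self, month: str) -> list[str]:
        # Tool call: fetch what the model booked in an earlier month.
        return self._decisions.get(month, [])

    def past_accruals(self, month: str) -> list[dict]:
        # Tool call: fetch accruals/deferrals carried from an earlier month.
        return self._accruals.get(month, [])
```

Because each month starts from an empty transcript, failures that persist across resets point at something other than context exhaustion, which is what pushed us toward the reward-hacking interpretation.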
Accounting is very interesting in an RL-first world because it is pretty easy to develop intermediate rewards for training models. We are pretty sure that we can juice the performance more with a far more rigid scaffold, but that's less relevant from a capabilities research perspective. We're pushing on this research direction and will see how it goes.
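To give a flavor of what we mean by intermediate rewards, here's a minimal sketch; the `Entry` representation, the balance check, and the 50/50 weighting are all illustrative assumptions for this comment, not our actual training setup:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entry:
    account: str  # e.g. "cash", "accounts_receivable"
    debit: float = 0.0
    credit: float = 0.0

def intermediate_reward(proposed: list[Entry], reference: list[Entry]) -> float:
    """Dense reward for a model-proposed journal entry.

    Half the reward comes from a mechanical invariant (the entry must
    balance), half from overlap with a known-good reference entry, so
    the signal is graded instead of all-or-nothing.
    """
    if not proposed or not reference:
        return 0.0

    # Hard bookkeeping invariant: total debits must equal total credits.
    balanced = abs(sum(e.debit for e in proposed)
                   - sum(e.credit for e in proposed)) < 0.01

    # Soft signal: fraction of reference lines the model reproduced.
    def key(e: Entry) -> tuple:
        return (e.account, round(e.debit, 2), round(e.credit, 2))

    overlap = len({key(e) for e in proposed}
                  & {key(e) for e in reference}) / len(reference)

    return (0.5 if balanced else 0.0) + 0.5 * overlap

# A $500 cash sale booked correctly earns the full reward:
sale = [Entry("cash", debit=500.0), Entry("revenue", credit=500.0)]
assert intermediate_reward(sale, sale) == 1.0
```

Invariants like this are what make accounting unusually RL-friendly: you get dense, cheap-to-verify partial credit long before you check the final statements.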
Let us know if you have any questions!
Can you comment on the variance? It's impressive that models are able to do this consistently with 100% accuracy in the early months, but it would be less so if there were any significant degree of variance among the three runs (e.g., 90%, 95%, 100%).