Readit News
timabdulla commented on AccountingBench: Evaluating LLMs on real long-horizon business tasks   accounting.penrose.com/... · Posted by u/rickcarlino
yunyu · a month ago
Hey all, member of the benchmark team here! The goal for this project was to see how well LLMs could do bookkeeping without an overly opinionated scaffold. We gave them access to processed transaction records and code execution tools, but it was up to them to choose exactly how to use those.

Claude and Grok 4 did reasonably well (within CPA baselines) for the first few months, but tended to degrade as more data came in. Interestingly, the failures aren’t exclusively a context length problem, as we reset the context monthly (with past decisions, accruals/deferrals, and comments available via tool calls), and the errors look more like reward hacking than pure hallucination.

Accounting is very interesting in an RL-first world as it is pretty easy to develop intermediate rewards for training models. We are pretty sure that we can juice the performance more with a far more rigid scaffold, but that’s less relevant from a capabilities research perspective. We’re pushing down this research direction and will see how it goes.

Let us know if you have any questions!

timabdulla · a month ago
> We conducted three runs per experiment and selected the run with the highest final accuracy for inclusion in the chart (though illustrative examples and anecdotes may be drawn from any of the runs).

Can you comment on the variance? It's impressive that models are able to do this consistently with 100% accuracy in the early months, but it would be less so if there was any significant degree of variance amongst the three runs (e.g. 90%, 95%, 100%.)
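To make the concern concrete, here is a minimal sketch of how a best-of-three figure can hide run-to-run spread. The three accuracies are the hypothetical numbers from the example above, not results reported by the benchmark.

```python
import statistics

# Hypothetical final accuracies (percent) for three runs of one experiment.
# These are the illustrative figures from the comment, not benchmark data.
runs = [90.0, 95.0, 100.0]

best = max(runs)                 # what a best-of-three chart would display
mean = statistics.mean(runs)     # average across the three runs
spread = statistics.stdev(runs)  # sample standard deviation (n - 1 denominator)

print(f"best={best:.1f} mean={mean:.1f} stdev={spread:.1f}")
```

With these made-up inputs the chart would show 100%, while the mean is 95% with a standard deviation of 5 points, which is why the per-run numbers matter.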

Deleted Comment

timabdulla commented on OpenAI reaches agreement to buy Windsurf for $3B   bloomberg.com/news/articl... · Posted by u/swyx
Androider · 4 months ago
Windsurf and Cursor feel like temporary stopgaps, products of a narrow window in time before the landscape shifts again.

Microsoft has clearly taken notice. They're already starting to lock down the upstream VSCode codebase, as seen with recent changes to the C/C++ extension [0]. It's not hard to imagine that future features like TypeScript 7.0 might be limited or even withheld from forks entirely. At the same time, Microsoft will likely replicate Windsurf and Cursor's features within a year. And deliver them with far greater stability and polish.

Both Windsurf and Cursor are riddled with bugs that don't exist upstream, _especially_ in their AI assistant features beyond the VSCode core. Context management, which is supposed to be the core feature added, is itself incredibly poorly implemented [1].

Ultimately, the future isn't about a smarter editor, it's about a smarter teammate. Tools like GitHub Copilot or future agents will handle entire engineering tickets: generating PRs with tests, taking feedback, and iterating like a real collaborator.

[0] https://www.theregister.com/2025/04/24/microsoft_vs_code_sub...

[1] https://www.reddit.com/r/cursor/comments/1kbt790/rules_in_49...

timabdulla · 4 months ago
I mean, the fact that OpenAI, at the bleeding edge of it all, has decided to buy an IDE is a rather strong hint that the future of agents handling entire engineering tickets might be further out than many believe.

If autonomous agents were just around the corner, then why wouldn't OpenAI bet on their own Codex product obviating (most) need for an IDE and save themselves the $3 billion?

timabdulla commented on PaperBench   openai.com/index/paperben... · Posted by u/meetpateltech
timabdulla · 5 months ago
What were the human PhDs able to do after more than 48 hours of effort? Presumably given that these are top-level PhDs, the replication success rate would be close to 100%?
timabdulla commented on Ace: Realtime Computer Autopilot   generalagents.com/ace/... · Posted by u/huerne
sherjilozair · 5 months ago
I'm the founder and CEO of General Agents. Happy to answer questions!
timabdulla · 5 months ago
How does it perform on e.g. WebVoyager, WebArena, or OSWorld? These seem to be the oft-cited benchmarks when comparing computer-use agents.
timabdulla commented on Why Anthropic's Claude still hasn't beaten Pokémon   arstechnica.com/ai/2025/0... · Posted by u/Workaccount2
kibwen · 5 months ago
> Claude will readily notice when the game tells it that an attack from an electric-type Pokémon is “not very effective” against a rock-type opponent, for instance. Claude will then squirrel that factoid away in a massive written knowledge base for future reference later in the run.

But these models already know all this information??? Surely it's ingested Bulbapedia, along with a hundred zillion terabytes of every other Pokemon resource on the internet, so why does it need to squirrel this information away? What's the point of ruining the internet with all this damn parasitic crawling if the models can't even recall basic facts like "thunderbolt is an electric-type move", "geodude is a rock-type pokemon", "electric-type moves are ineffective against rock-type pokemon"?

timabdulla · 5 months ago
This is the most interesting aspect to me. I had Claude generate a guide to all the gyms in Pokemon Red and instructions for how to quickly execute a play through [0].

It obviously knows the game through and through. Yet even with encyclopedic knowledge of the game, it's still a struggle for it to play. Imagine giving it a game of which it knows nothing at all.

[0] https://claude.site/artifacts/d127c740-b0ab-43ba-af32-3402e6...

timabdulla commented on Ask HN: Is Cursor deleting working code for you too or is it just me?    · Posted by u/namanyayg
zsoltkacsandi · 6 months ago
By unavailable I meant unavailable. I was talking about 3.6, not 3.5.
timabdulla · 6 months ago
There is no 3.6. There is 3.5 and 3.5 (New), both of which remain available.
timabdulla commented on Ask HN: Is Cursor deleting working code for you too or is it just me?    · Posted by u/namanyayg
mort96 · 6 months ago
What do the numbers '3-5' in that name refer to if it's version 3.6?
timabdulla · 6 months ago
There was never a Sonnet 3.6. They released what is commonly known as 3.6 as "Sonnet 3.5 (New)". Then, because so many folks ended up referring to it as 3.6, they decided to call this new model 3.7, as the mental territory for 3.6 was already occupied by 3.5 (New). Not confusing in the slightest!
timabdulla commented on Ask HN: Is Cursor deleting working code for you too or is it just me?    · Posted by u/namanyayg
zsoltkacsandi · 6 months ago
Sonnet 3.7 is terrible; it was a huge mistake for Anthropic to release it and make 3.6 unavailable.

It does things I didn’t ask for while deleting random things I had just asked it to add in the previous prompt. It’s a mess.

However it does one thing very well: it can make me angry very quickly, like a real human.

timabdulla · 6 months ago
My feeling (totally unproven) is that in the drive to make Sonnet 3.7 more "agentic", they've lost some of its ability to actually just stick to what you asked it to do. It seems that it "wants" (I know, it's not sentient!) to be more in the driver's seat now.

Definitely can be very annoying if you do just want it to execute on a set of instructions.

timabdulla commented on Hyperspace   hypercritical.co/2025/02/... · Posted by u/tobr
NoToP · 6 months ago
The fact that copying doesn't copy seems dangerous. Like what if I wanted to copy for the purpose of modifying the file while retaining the original. A trivial example of this might be I have a meme template and I want to write text in it while still keeping a blank copy of the template.

There's a place for alias file pointers, but lying to the user and pretending like an alias is a copy is bound to lead to unintended and confusing results.

timabdulla · 6 months ago
It's copy on write.
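
In other words, the copy shares storage with the original only until one of them is modified. A toy in-memory sketch of the idea follows. The `SharedBuffer` and `CowFile` names are hypothetical, writes replace the whole file for simplicity, and this is not how any real filesystem implements clones (those typically work per block).

```python
class SharedBuffer:
    """Storage that several file handles may point at (hypothetical helper)."""

    def __init__(self, data: bytes):
        self.data = bytearray(data)
        self.refs = 1  # number of CowFile objects sharing this buffer


class CowFile:
    """Toy copy-on-write file: clones share storage until one is written."""

    def __init__(self, data: bytes = b"", _buf: "SharedBuffer | None" = None):
        self._buf = _buf if _buf is not None else SharedBuffer(data)

    def clone(self) -> "CowFile":
        # Cloning copies no data; it only adds a reference to the same buffer.
        self._buf.refs += 1
        return CowFile(_buf=self._buf)

    def read(self) -> bytes:
        return bytes(self._buf.data)

    def write(self, data: bytes) -> None:
        if self._buf.refs > 1:
            # Storage is shared: detach and write into a private buffer,
            # leaving the other holders' data untouched.
            self._buf.refs -= 1
            self._buf = SharedBuffer(data)
        else:
            # Sole owner: safe to overwrite in place.
            self._buf.data = bytearray(data)


template = CowFile(b"blank meme template")
meme = template.clone()            # shares storage with the template
meme.write(b"meme with caption")   # triggers the actual copy

print(template.read())
print(meme.read())
```

In this sketch, writing the caption into the clone leaves the blank template intact, which is the meme-template scenario from the parent comment.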

u/timabdulla

Karma: 199 · Cake day: September 27, 2016