timabdulla (u/timabdulla)

timabdulla commented on AccountingBench: Evaluating LLMs on real long-horizon business tasks accounting.penrose.com/... · Posted by u/rickcarlino

yunyu · a month ago

Hey all, member of the benchmark team here! The goal for this project was to see how well LLMs could do bookkeeping without an overly opinionated scaffold. We gave them access to processed transaction records and code execution tools, but it was up to them to choose exactly how to use those.

Claude and Grok 4 did reasonably well (within CPA baselines) for the first few months, but tended to degrade as more data came in. Interestingly, the failures aren’t exclusively a context length problem, as we reset the context monthly (with past decisions, accruals/deferrals, and comments available via tool calls) and the types of errors appear to be more reward hacking vs pure hallucinations.

Accounting is very interesting in an RL-first world as it is pretty easy to develop intermediate rewards for training models. We are pretty sure that we can juice the performance more with a far more rigid scaffold, but that’s less relevant from a capabilities research perspective. We’re pushing down this research direction and will see how it goes.

Let us know if you have any questions!

timabdulla · a month ago

> We conducted three runs per experiment and selected the run with the highest final accuracy for inclusion in the chart (though illustrative examples and anecdotes may be drawn from any of the runs).

Can you comment on the variance? It's impressive that models are able to do this consistently with 100% accuracy in the early months, but it would be less so if there was any significant degree of variance amongst the three runs (e.g. 90%, 95%, 100%.)

timabdulla commented on OpenAI reaches agreement to buy Windsurf for $3B bloomberg.com/news/articl... · Posted by u/swyx

Androider · 4 months ago

Windsurf and Cursor feel like temporary stopgaps, products of a narrow window in time before the landscape shifts again.

Microsoft has clearly taken notice. They're already starting to lock down the upstream VSCode codebase, as seen with recent changes to the C/C++ extension [0]. It's not hard to imagine that future features like TypeScript 7.0 might be limited or even withheld from forks entirely. At the same time, Microsoft will likely replicate Windsurf and Cursor's features within a year. And deliver them with far greater stability and polish.

Both Windsurf and Cursor are riddled with bugs that don't exist upstream, _especially_ in their AI assistant features beyond the VSCode core. Context management, which is supposed to be the core feature they add, is itself incredibly poorly implemented [1].

Ultimately, the future isn't about a smarter editor, it's about a smarter teammate. Tools like GitHub Copilot or future agents will handle entire engineering tickets: generating PRs with tests, taking feedback, and iterating like a real collaborator.

[0] https://www.theregister.com/2025/04/24/microsoft_vs_code_sub...

[1] https://www.reddit.com/r/cursor/comments/1kbt790/rules_in_49...

timabdulla · 4 months ago

I mean, the fact that OpenAI, at the bleeding edge of it all, has decided to buy an IDE is a rather strong hint that the future of agents handling entire engineering tickets might be further out than many believe.

If autonomous agents were just around the corner, then why wouldn't OpenAI bet on their own Codex product obviating (most) need for an IDE and save themselves the $3 billion?

timabdulla commented on PaperBench openai.com/index/paperben... · Posted by u/meetpateltech

timabdulla · 5 months ago

What were the human PhDs able to do after more than 48 hours of effort? Presumably given that these are top-level PhDs, the replication success rate would be close to 100%?

timabdulla commented on Ace: Realtime Computer Autopilot generalagents.com/ace/... · Posted by u/huerne

sherjilozair · 5 months ago

I'm the founder and CEO of General Agents. Happy to answer questions!

timabdulla · 5 months ago

How does it perform on e.g. WebVoyager, WebArena, or OSWorld? These seem to be the oft-cited benchmarks when comparing computer-use agents.

timabdulla commented on Why Anthropic's Claude still hasn't beaten Pokémon arstechnica.com/ai/2025/0... · Posted by u/Workaccount2

kibwen · 5 months ago

> Claude will readily notice when the game tells it that an attack from an electric-type Pokémon is “not very effective” against a rock-type opponent, for instance. Claude will then squirrel that factoid away in a massive written knowledge base for future reference later in the run.

But these models already know all this information??? Surely it's ingested Bulbapedia, along with a hundred zillion terabytes of every other Pokemon resource on the internet, so why does it need to squirrel this information away? What's the point of ruining the internet with all this damn parasitic crawling if the models can't even recall basic facts like "thunderbolt is an electric-type move", "geodude is a rock-type pokemon", "electric-type moves are ineffective against rock-type pokemon"?

timabdulla · 5 months ago

This is the most interesting aspect to me. I had Claude generate a guide to all the gyms in Pokemon Red and instructions for how to quickly execute a playthrough [0].

It obviously knows the game through and through. Yet even with encyclopedic knowledge of the game, it's still a struggle for it to play. Imagine giving it a game about which it knows nothing at all.

[0] https://claude.site/artifacts/d127c740-b0ab-43ba-af32-3402e6...

timabdulla commented on Ask HN: Is Cursor deleting working code for you too or is it just me? · Posted by u/namanyayg

zsoltkacsandi · 6 months ago

By unavailable I meant unavailable. I was talking about 3.6, not 3.5.

timabdulla · 6 months ago

There is no 3.6. There is 3.5 and 3.5 (New), both of which remain available.

timabdulla commented on Ask HN: Is Cursor deleting working code for you too or is it just me? · Posted by u/namanyayg

mort96 · 6 months ago

What do the numbers '3-5' in that name refer to if it's version 3.6?

timabdulla · 6 months ago

There was never a Sonnet 3.6. They released what is commonly known as 3.6 as "Sonnet 3.5 (New)". Then, because so many folks ended up referring to it as 3.6, they decided to call this new model 3.7, as the mental territory for 3.6 was already occupied by 3.5 (New). Not confusing in the slightest!

timabdulla commented on Ask HN: Is Cursor deleting working code for you too or is it just me? · Posted by u/namanyayg

zsoltkacsandi · 6 months ago

Sonnet 3.7 is terrible. It was a huge mistake for Anthropic to release it and make 3.6 unavailable.

It does things I didn’t ask for while deleting random things I had just asked it to add in the previous prompt. It’s a mess.

However it does one thing very well: it can make me angry very quickly, like a real human.

timabdulla · 6 months ago

My feeling (totally unproven) is that in the drive to make Sonnet 3.7 more "agentic", they've lost some of its ability to actually just stick to what you asked it to do. It seems that it "wants" (I know, it's not sentient!) to be more in the driver's seat now.

Definitely can be very annoying if you do just want it to execute on a set of instructions.

timabdulla commented on Hyperspace hypercritical.co/2025/02/... · Posted by u/tobr

NoToP · 6 months ago

The fact that copying doesn't copy seems dangerous. Like what if I wanted to copy for the purpose of modifying the file while retaining the original. A trivial example of this might be I have a meme template and I want to write text in it while still keeping a blank copy of the template.

There's a place for alias file pointers, but lying to the user and pretending that an alias is a copy is bound to lead to unintended and confusing results.

timabdulla · 6 months ago

It's copy on write.
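
For anyone unfamiliar with the term, here is a minimal sketch of the idea in Python. It is a toy in-memory model, not how any real filesystem implements clones, and the names (`Block`, `CowFile`, `clone`) are made up for illustration. The point is that a clone shares storage with the original only until one of them is written to, so the meme template case above works as expected.

```python
class Block:
    """A chunk of stored data plus a count of how many files point at it."""

    def __init__(self, data: bytes):
        self.data = bytearray(data)
        self.refs = 1


class CowFile:
    """Hypothetical file handle with copy-on-write clone semantics."""

    def __init__(self, block: Block):
        self._block = block

    @classmethod
    def create(cls, data: bytes) -> "CowFile":
        return cls(Block(data))

    def clone(self) -> "CowFile":
        # "Copying" only adds another reference to the same block.
        self._block.refs += 1
        return CowFile(self._block)

    def read(self) -> bytes:
        return bytes(self._block.data)

    def write(self, data: bytes) -> None:
        if self._block.refs > 1:
            # The block is shared: detach and write to a private block,
            # leaving the other file's data untouched.
            self._block.refs -= 1
            self._block = Block(data)
        else:
            # Sole owner: safe to overwrite in place.
            self._block.data = bytearray(data)

    def shares_storage_with(self, other: "CowFile") -> bool:
        return self._block is other._block


template = CowFile.create(b"blank meme template")
meme = template.clone()
shared_before_write = meme.shares_storage_with(template)

meme.write(b"meme with text")
shared_after_write = meme.shares_storage_with(template)
```

After the write, `template.read()` still returns the blank template and the two files no longer share storage. The space saving lasts only as long as both copies stay identical.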