Claude 4 - Readit News

An important note not mentioned in this announcement is that Claude 4's training cutoff date is March 2025, which is the latest of any recent model. (Gemini 2.5 has a cutoff of January 2025)

https://docs.anthropic.com/en/docs/about-claude/models/overv...

lxgr · 3 months ago

With web search being available in all major user-facing LLM products now (and I believe in some APIs as well, sometimes unintentionally), I feel like the exact month of cutoff is becoming less and less relevant, at least in my personal experience.

The models I'm regularly using are usually smart enough to figure out that they should be pulling in new information for a given topic.

bredren · 3 months ago

It still matters for software packages. Particularly python packages that have to do with programming with AI!

They are evolving quickly, with deprecation and updated documentation. Having to correct for this in system prompts is a pain.

It would be great if the models were updating portions of their content more recently than others.

For the tailwind example in parent-sibling comment, should absolutely be as up to date as possible, whereas the history of the US civil war can probably be updated less frequently.

jacob019 · 3 months ago

Valid. I suppose the most annoying thing related to the cutoffs, is the model's knowledge of library APIs, especially when there are breaking changes. Even when they have some knowledge of the most recent version, they tend to default to whatever they have seen the most in training, which is typically older code. I suspect the frontier labs have all been working to mitigate this. I'm just super stoked, been waiting for this one to drop.

drogus · 3 months ago

In my experience it really depends on the situation. For stable APIs that have been around for years, sure, it doesn't really matter that much. But if you try to use a library that had significant changes after the cutoff, the models tend to do things the old way, even if you provide a link to examples with new code.

myfonj · 3 months ago

For the recent resources it might matter: unless the training data are curated meticulously, they may be "spoiled" by the output of other LLM, or even the previous version of the one that is being trained. That's something what is generally considered dangerous, because it could potentially produce unintentional echo-chamber or even somewhat "incestuously degenerated" new model.

jgalt212 · 3 months ago

> The models I'm regularly using are usually smart enough to figure out that they should be pulling in new information for a given topic.

Fair enough, but information encoded in the model is return in milliseconds, information that needs to be scraped is returned in 10s of seconds.

GardenLetter27 · 3 months ago

I've had issues with Godot and Rustls - where it gives code for some ancient version of the API.

iLoveOncall · 3 months ago

Web search isn't desirable or even an option in a lot of use cases that involve GenAI.

It seems people have turned GenAI into coding assistants only and forget that they can actually be used for other projects too.

guywithahat · 3 months ago

I was thinking that too, grok can comment on things that have only just broke out hours earlier, cutoff dates don't seem to matter much

Kostic · 3 months ago

It's relevant from an engineering perspective. They have a way to develop a new model in months now.

dzhiurgis · 3 months ago

Ditto. Twitter's Grok is especially good at this.

lobochrome · 3 months ago

It knows uv now

tzury · 3 months ago

web search is an immediate limited operation training is a petabytes long term operation

BeetleB · 3 months ago

Web search is costlier.

tristanb · 3 months ago

Nice - it might know about Svelte 5 finally...

brulard · 3 months ago

It knows about Svelte 5 for some time, but it particularly likes to mix it with Svelte 4 in very weird and broken ways.

liorn · 3 months ago

I asked it about Tailwind CSS (since I had problems with Claude not aware of Tailwind 4):

> Which version of tailwind css do you know?

> I have knowledge of Tailwind CSS up to version 3.4, which was the latest stable version as of my knowledge cutoff in January 2025.

threeducks · 3 months ago

> Which version of tailwind css do you know?

LLMs can not reliably tell whether they know or don't know something. If they did, we would not have to deal with hallucinations.

SparkyMcUnicorn · 3 months ago

Interesting. It's claiming different knowledge cutoff dates depending on the question asked.

"Who is president?" gives a "April 2024" date.

dawnerd · 3 months ago

I did the same recently with copilot and it of course lied and said it knew about v4. Hard to trust any of them.

PeterStuer · 3 months ago

Did you try giving it the relevant parts of the tailwind 4 documentation in the prompt context?

Phelinofist · 3 months ago

Why can't it be trained "continuously"?

cma · 3 months ago

Catastrophic forgetting

https://en.wikipedia.org/wiki/Catastrophic_interference

AlexCoventry · 3 months ago

It's really not necessary, with retrieval-augmented generation. It can be trained to just check what the latest version is.

m3kw9 · 3 months ago

Even that, we don’t know what got updated and what didn’t. Can we assume everything that can be updated is updated?

diggan · 3 months ago

> Can we assume everything that can be updated is updated?

What does that even mean? Of course an LLM doesn't know everything, so it we wouldn't be able to assume everything got updated either. At best, if they shared the datasets they used (which they won't, because most likely it was acquired illegally), you could make some guesses what they tried to update.

simlevesque · 3 months ago

You might be able to ask it what it knows.

indigodaddy · 3 months ago

Should we not necessarily assume that it would have some FastHTML training with that March 2025 cutoff date? I'd hope so but I guess it's more likely that it still hasn't trained on FastHTML?

jph00 · 3 months ago

Claude 4 actually knows FastHTML pretty well! :D It managed to one-shot most basic tasks I sent its way, although it makes a lot of standard minor n00b mistakes that make its code a bit longer and more complex than needed.

I've nearly finished writing a short guide which, when added to a prompt, gives quite idiomatic FastHTML code.

VectorLock · 3 months ago

I'm starting to wonder if having more recent cut-off dates is more a bug than a feature.

dvfjsdhgfv · 3 months ago

One thing I'm 100% is that a cut off date doesn't exist for any large model, or rather there is no single date since it's practically almost impossible to achieve that.

sib · 3 months ago

But I think the general meaning of a cutoff date, D, is:

The model includes nothing AFTER date D

and not

The model includes everything ON OR BEFORE date D

Right? Definitionally, the model can't include anything that happened after training stopped.

koolba · 3 months ago

Indeed. It’s not possible stop the world and snapshot the entire internet in a single day.

Or is it?

tonyhart7 · 3 months ago

its not a definitive "date" you cut off information, but more a "recent" material you can feed, training takes times

if you waiting for a new information, of course you are not going ever to train

cma · 3 months ago

When I asked the model it told me January (for sonnet 4). Doesn't it normally get that in its system prompt?

SparkyMcUnicorn · 3 months ago

Although I believe it, I wish there was some observability into what data is included here.

Both Sonnet and Opus 4 say Joe Biden is president and claim their knowledge cutoff is "April 2024".

Tossrock · 3 months ago

Are you sure you're using 4? Mine says January 2025: https://claude.ai/share/9d544e4c-253e-4d61-bdad-b5dd1c2f1a63

I can't be the only one who thinks this version is no better than the previous one, and that LLMs have basically reached a plateau, and all the new releases "feature" are more or less just gimmicks.

TechDebtDevin · 3 months ago

I think they are just getting better at the edges, MCP/Tool Calls, structured output. This definitely isn't increased intelligence, but it an increase in the value add, not sure the value added equates to training costs or company valuations though.

In all reality, I have zero clue how any of these companies remain sustainable. I've tried to host some inference on cloud GPUs and its seems like it would be extremely cost prohibitive with any sort of free plan.

layoric · 3 months ago

> how any of these companies remain sustainable

They don't, they have a big bag of money they are burning through, and working to raise more. Anthropic is in a better position cause they don't have the majority of the public using their free-tier. But, AFAICT, none of the big players are profitable, some might get there, but likely through verticals rather than just model access.

yahoozoo · 3 months ago

https://www.wheresyoured.at/reality-check/

hijodelsol · 3 months ago

If you read any work from Ed Zitron [1], they likely cannot remain sustainable. With OpenAI failing to convert into a for-profit, Microsoft being more interested in being a multi-modal provider and competing openly with OpenAI (e.g., open-sourcing Copilot vs. Windsurf, GitHub Agent with Claude as the standard vs. Codex) and Google having their own SOTA models and not relying on their stake in Anthropic, tarrifs complicating Stargate, explosion in capital expenditure and compute, etc., I would not be surprised to see OpenAI and Anthropic go under in the next years.

1: https://www.wheresyoured.at/oai-business/

NitpickLawyer · 3 months ago

> and that LLMs have basically reached a plateau

This is the new stochastic parrots meme. Just a few hours ago there was a story on the front page where an LLM based "agent" was given 3 tools to search e-mails and the simple task "find my brother's kid's name", and it was able to systematically work the problem, search, refine the search, and infer the correct name from an e-mail not mentioning anything other than "X's favourite foods" with a link to a youtube video. Come on!

That's not to mention things like alphaevolve, microsoft's agentic test demo w/ copilot running a browser, exploring functionality and writing playright tests, and all the advances in coding.

sensanaty · 3 months ago

And we also have a showcase from a day ago [1] of these magical autonomous AI agents failing miserably in the PRs unleashed on the dotnet codebase, where it kept reiterating it fixed tests it wrote that failed without fixing them. Oh, and multiple blatant failures that happened live on stage [2], with the speaker trying to sweep the failures under the rug on some of the simplest code imaginable.

But sure, it managed to find a name buried in some emails after being told to... Search through emails. Wow. Such magic

[1] https://news.ycombinator.com/item?id=44050152 [2] https://news.ycombinator.com/item?id=44056530

hsn915 · 3 months ago

Is this something that the models from 4 months ago were not able to do?

morepedantic · 3 months ago

The LLMs have reached a plateau. Successive generations will be marginally better.

We're watching innovation move into the use and application of LLMs.

strangescript · 3 months ago

I have used claude code a ton and I agree, I haven't noticed a single difference since updating. Its summaries I guess a little cleaner, but its has not surprised me at all in ability. I find I am correcting it and re-prompting it as much as I didn't with 3.7 on a typescript codebase. In fact I was kind of shocked how badly it did in a situation where it was editing the wrong file and it never thought to check that more specifically until I forced it to delete all the code and show that nothing changed with regards to what we were looking at.

hsn915 · 3 months ago

I'd go so far as to say Sonnet 3.5 was better than 3.7

At least I personally liked it better.

jug · 3 months ago

This is my feeling too, across the board. Nowadays, benchmark wins seem to come from tuning, but then causing losses in other areas. o3, o4-mini also hallucinates more than o1 in SimpleQA, PersonQA. Synthetic data seems to cause higher hallucination rates. Reasoning models at even higher risk due to hallucinations risking to throw the model off track at each reasoning step.

LLM’s in a generic use sense are done since already earlier this year. OpenAI discovered this when they had to cancel GPT-5 and later released the ”too costly for gains” GPT-4.5 that will be sunset soon.

I’m not sure the stock market has factored all this in yet. There needs to be a breakthrough to get us past this place.

voiper1 · 3 months ago

The benchmarks in many ways seem to be very similar to claude 3.7 for most cases.

That's nowhere near enough reason to think we've hit a plateau - the pace has been super fast, give it a few more months to call that...!

I think the opposite about the features - they aren't gimmicks at all, but indeed they aren't part of the core AI. Rather it's important "tooling" that adjacent to the AI that we need to actually leverage it. The LLM field in popular usage is still in it's infancy. If the models don't improve (but I expect they will), we have a TON of room with these features and how we interact, feed them information, tool calls, etc to greatly improve usability and capability.

fintechie · 3 months ago

It's not that it isn't better, it's actually worse. Seems like the big guys are stuck on a race to overfit for benchmarks, and this is becoming very noticeable.

sanex · 3 months ago

Well to be fair it's only .3 difference.

pantsforbirds · 3 months ago

It seems MUCH better at tool usage. Just had an example where I asked Sonnet 4 to split a PR I had after we had to revert an upstream commit.

I didn't want to lose the work I had done, and I knew it would be a pain to do it manually with git. The model did a fantastic job of iterating through the git commits and deciding what to put into each branch. It got everything right except for a single test that I was able to easily move to the correct branch myself.

brookst · 3 months ago

How much have you used Claude 4?

hsn915 · 3 months ago

I asked it a few questions and it responded exactly like all the other models do. Some of the questions were difficult / very specific, and it failed in the same way all the other models failed.

illegally · 3 months ago

Yes.

They just need to put out a simple changelog for these model updates, no need to make a big announcement everytime to make it look like it's a whole new thing. And the version numbers are even worse.

flixing · 3 months ago

i think you are.

go_elmo · 3 months ago

I feel like the model making a memory file to store context is more than a gimmick, no?

make3 · 3 months ago

the increases are not as fast, but they're still there. the models are already exceptionally strong, I'm not sure that basic questions can capture differences very well

hsn915 · 3 months ago

Hence, "plateau"

minimaxir · 3 months ago

jasonthorsness · 3 months ago

“GitHub says Claude Sonnet 4 soars in agentic scenarios and will introduce it as the base model for the new coding agent in GitHub Copilot.”

Maybe this model will push the “Assign to CoPilot” closer to the dream of having package upgrades and other mostly-mechanical stuff handled automatically. This tech could lead to a huge revival of older projects as the maintenance burden falls.

rco8786 · 3 months ago

It could be! But that's also what people said about all the models before it!

kmacdough · 3 months ago

And they might all be right!

> This tech could lead to...

I don't think he's saying this is the version that will suddenly trigger a Renaissance. Rather, it's one solid step that makes the path ever more promising.

Sure, everyone gets a bit overexcited each release until they find the bounds. But the bounds are expanding, and the need for careful prompt engineering is diminishing. Ever since 3.7, Claude has been a regular part of my process for the mundane. And so far 4.0 seems to take less fighting for me.

A good question would be when can AI take a basic prompt, gather its own requirements and build meaningful PRs off basic prompt. I suspect it's still at least a couple of paradigm shifts away. But those seem to be coming every year or faster.

max_on_hn · 3 months ago

I am incredibly eager to see what affordable coding agents can do for open source :) in fact, I should really be giving away CheepCode[0] credits to open source projects. Pending any sort of formal structure, if you see this comment and want free coding agent runs, email me and I’ll set you up!

[0] My headless coding agents product, similar to “assign to copilot” but works from your task board (Linear, Jira, etc) on multiple tasks in parallel. So far simple/routine features are already quite successful. In general the better the tests, the better the resulting code (and yes, it can and does write its own tests).

troupo · 3 months ago

> I am incredibly eager to see what affordable coding agents can do for open source :)

Oh, we know exactly what they will do: they will drive devs insane: https://www.reddit.com/r/ExperiencedDevs/comments/1krttqo/my...

dr_dshiv · 3 months ago

Especially since the EU just made open source contributors liable for cybersecurity (Cyber Resilience Act). Just let AI contribute and ur good

ModernMech · 3 months ago

That's kind of my benchmark for whether or not these models are useful. I've got a project that needs some extensive refactoring to get working again. Mostly upgrading packages, but also it will require updating the code to some new language semantics that didn't exist when it was written. So far, current AI models can make essentially zero progress on this task. I'll keep trying until they can!

yosito · 3 months ago

Personally, I don't believe AI is ever going to get to that level. I'd love to be proven wrong, but I really don't believe that an LLM is the right tool for a job that requires novel thinking about out of the ordinary problems like all the weird edge cases and poor documentation that comes up when trying to upgrade old software.

tmpz22 · 3 months ago

And IMO it has a long way to go. There is a lot of nuance when orchestrating dependencies that can cause subtle errors in an application that are not easily remedied.

For example a lot of llms (I've seen it in Gemini 2.5, and Claude 3.7) will code non-existent methods in dynamic languages. While these runtime errors are often auto-fixable, sometimes they aren't, and breaking out of an agentic workflow to deep dive the problem is quite frustrating - if mostly because agentic coding entices us into being so lazy.

mewpmewp2 · 3 months ago

I think this type of thing needs agent which has access to the documentation to read about nuances of the language and package versions, definitely a way to investigate types, interfaces. Problem is that training data has so much mixed data it can easily confuse the AI to mix up versions, APIs etc.

epolanski · 3 months ago

> having package upgrades and other mostly-mechanical stuff handled automatically

Those are already non-issues mostly solved by bots.

In any case, where I think AI could help here would be by summarizing changes, conflicts, impact on codebase and possibly also conduct security scans.

BaculumMeumEst · 3 months ago

Anyone see news of when it’s planned to go live in copilot?

vinhphm · 3 months ago

The option just shown up in Copilot settings page for me

The keynote confirms it is available now.

phito · 3 months ago

I don't see how a LLM could do better than a bot, eg renovate

ed_elliott_asc · 3 months ago

Until it pushes a severe vulnerability which takes a big service doen

Doohickey-d · 3 months ago

> Users requiring raw chains of thought for advanced prompt engineering can contact sales

So it seems like all 3 of the LLM providers are now hiding the CoT - which is a shame, because it helped to see when it was going to go down the wrong track, and allowing to quickly refine the prompt to ensure it didn't.

In addition to openAI, Google also just recently started summarizing the CoT, replacing it with an, in my opinion, overly dumbed down summary.

a_bonobo · 3 months ago

Could the exclusion of CoT that be because of this recent Anthropic paper?

https://assets.anthropic.com/m/71876fabef0f0ed4/original/rea...

>We evaluate CoT faithfulness of state-of-the-art reasoning models across 6 reasoning hints presented in the prompts and find: (1) for most settings and models tested, CoTs reveal their usage of hints in at least 1% of examples where they use the hint, but the reveal rate is often below 20%, (2) outcome-based reinforcement learning initially improves faithfulness but plateaus without saturating, and (3) when reinforcement learning increases how frequently hints are used (reward hacking), the propensity to verbalize them does not increase, even without training against a CoT monitor. These results suggest that CoT monitoring is a promising way of noticing undesired behaviors during training and evaluations, but that it is not sufficient to rule them out.

I.e., chain of thought may be a confabulation by the model, too. So perhaps there's somebody at Anthropic who doesn't want to mislead their customers. Perhaps they'll come back once this problem is solved.

whimsicalism · 3 months ago

i think it is almost certainly to prevent distillation

andrepd · 3 months ago

I have no idea what this means, can someone give the eli5?

42lux · 3 months ago

Because it's alchemy and everyone believes they have an edge on turning lead into gold.

elcritch · 3 months ago

I've been thinking for a couple of months now that prompt engineering, and therefore CoT, is going to become the "secret sauce" companies want to hold onto.

If anything that is where the day to day pragmatic engineering gets done. Like with early chemistry, we didn't need to precisely understand chemical theory to produce mass industrial processes by making a good enough working model, some statistical parameters, and good ole practical experience. People figured out steel making and black powder with alchemy.

The only debate now is whether the prompt engineering models are currently closer to alchemy or modern chemistry? I'd say we're at advanced alchemy with some hints of rudimentary chemistry.

Also, unrelated but with CERN turning lead into gold, doesn't that mean the alchemists were correct, just fundamentally unprepared for the scale of the task? ;)

viraptor · 3 months ago

We won't know without an official answer leaking, but a simple answer could be - people spend too much time trying to analyse those without understanding the details. There was a lot of talk on HN about the thinking steps second guessing and contradicting itself. But in practice that step is both trained by explicitly injecting the "however", "but" and similar words and they do more processing than simply interpreting the thinking part as text we read. If the content is commonly misunderstood, why show it?

pja · 3 months ago

IIRC RLHF inevitably compromises model accuracy in order to train the model not to give dangerous responses.

It would make sense if the model used for train-of-though was trained differently (perhaps a different expert from an MoE?) from the one used to interact with the end user, since the end user is only ever going to see its output filtered through the public model the chain-of-thought model can be closer to the original, more pre-rlhf version without risking the reputation of the company.

This way you can get the full performance of the original model whilst still maintaining the necessary filtering required to prevent actual harm (or terrible PR disasters).

landl0rd · 3 months ago

Yeah we really should stop focusing on model alignment. The idea that it's more important that your AI will fucking report you to the police if it thinks you're being naughty than that it actually works for more stuff is stupid.

Wowfunhappy · 3 months ago

Correct me if I'm wrong--my understanding is that RHLF was the difference between GPT 3 and GPT 3.5, aka the original ChatGPT.

If you never used GPT 3, it was... not good. Well, that's not fair, it was revolutionary in its own right, but it was very much a machine for predicting the most likely next word, it couldn't talk to you the way ChatGPT can.

Which is to say, I think RHLF is important for much more than just preventing PR disasters. It's a key part of what makes the models useful.

sunaookami · 3 months ago

Guess we have to wait till DeepSeek mops the floor with everyone again.

datpuz · 3 months ago

DeepSeek never mopped the floor with anyone... DeepSeek was remarkable because it is claimed that they spent a lot less training it, and without Nvidia GPUs, and because they had the best open weight model for a while. The only area they mopped the floor in was open source models, which had been stagnating for a while. But qwen3 mopped the floor with DeepSeek R1.

infecto · 3 months ago

Do people actually believe this? While I agree their open source contribution was impressive, I never got the sense they mopped the floor. Perhaps firms in China may be using some of their models but beyond learnings in the community, no dents in the market were made for the West.

> because it helped to see when it was going to go down the wrong track

It helped me tremendously learning Zig.

Seeing his chain of thought when asking it stuff about Zig and implementations let me widen the horizon a lot.

whiddershins · 3 months ago

The trend towards opaque is inexorable.

https://noisegroove.substack.com/p/somersaulting-down-the-sl...

it just makes it too easy to distill the reasoning into a separate model I guess. though I feel like o3 shows useful things about the reasoning while it's happening

Aeolun · 3 months ago

The Google CoT is so incredibly dumb. I thought my models had been lobotomized until I realized they must be doing some sort of processing on the thing.

user_7832 · 3 months ago

You are referring to the new (few days old-ish) CoT right? It’s bizzare as to why google did it, it was very helpful to see where the model was making assumptions or doing something wrong. Now half the time it feels better to just use flash with no thinking mode but ask it to manually “think”.

it’s fake cot, just like oai

cube2222 · 3 months ago

Sooo, I love Claude 3.7, and use it every day, I prefer it to Gemini models mostly, but I've just given Opus 4 a spin with Claude Code (codebase in Go) for a mostly greenfield feature (new files mostly) and... the thinking process is good, but 70-80% of tool calls are failing for me.

And I mean basic tools like "Write", "Update" failing with invalid syntax.

5 attempts to write a file (all failed) and it continues trying with the following comment

> I keep forgetting to add the content parameter. Let me fix that.

So something is wrong here. Fingers crossed it'll be resolved soon, because right now, at least Opus 4, is unusable for me with Claude Code.

The files it did succeed in creating were high quality.

Alright, I think I found the reason, clearly a bug: https://github.com/anthropics/claude-code/issues/1236#issuec...

Basically it seems to be hitting the max output token count (writing out a whole new file in one go), stops the response, and the invalid tool call parameters error is a red herring.

jasondclinton · 3 months ago

Thanks for the report! We're addressing it urgently.

_peregrine_ · 3 months ago

Already test Opus 4 and Sonnet 4 in our SQL Generation Benchmark (https://llm-benchmark.tinybird.live/)

Opus 4 beat all other models. It's good.

XCSme · 3 months ago

It's weird that Opus4 is the worst at one-shot, it requires on average two attempts to generate a valid query.

If a model is really that much smarter, shouldn't it lead to better first-attempt performance? It still "thinks" beforehand, right?

riwsky · 3 months ago

Don’t talk to Opus before it’s had its coffee. Classic high-performer failure mode.

stadeschuldt · 3 months ago

Interestingly, both Claude-3.7-Sonnet and Claude-3.5-Sonnet rank better than Claude-Sonnet-4.

yeah that surprised me too

Workaccount2 · 3 months ago

This is a pretty interesting benchmark because it seems to break the common ordering we see with all the other benchmarks.

Yeah I mean SQL is pretty nuanced - one of the things we want to improve in the benchmark is how we measure "success", in the sense that multiple correct SQL results can look structurally dissimilar while semantically answering the prompt.

There's some interesting takeaways we learned here after the first round: https://www.tinybird.co/blog-posts/we-graded-19-llms-on-sql-...

ineedaj0b · 3 months ago

i pay for claude premium but actually use grok quite a bit, the 'think' function usually gets me where i want more often than not. odd you don't have any xAI models listed. sure grok is a terrible name but it surprises me more often. i have not tried the $250 chatgpt model yet though, just don't like openAI practices lately.

timmytokyo · 3 months ago

Not saying you're wrong about "OpenAI practices", but that's kind of a strange thing to complain about right after praising an LLM that was only recently inserting claims of "white genocide" into every other response.

gkfasdfasdf · 3 months ago

Just curious, how do you know your questions and the SQL aren't in the LLM training data? Looks like the benchmark questions w/SQL are online (https://ghe.clickhouse.tech/).

zarathustreal · 3 months ago

“Your model has memorized all knowledge, how do you know it’s smart?”

sagarpatil · 3 months ago

Sonnet 3.7 > Sonnet 4? Interesting.

dcreater · 3 months ago

How does Qwen3 do on this benchmark?

mritchie712 · 3 months ago

looks like this is one-shot generation right?

I wonder how much the results would change with a more agentic flow (e.g. allow it to see an error or select * from the_table first).

sonnet seems particularly good at in-session learning (e.g. correcting it's own mistakes based on a linter).

Actually no, we have it up to 3 attempts. In fact, Opus 4 failed on 36/50 tests on the first attempt, but it was REALLY good at nailing the second attempt after receiving error feedback.

jpau · 3 months ago

Interesting!

Is there anything to read into needing twice the "Avg Attempts", or is this column relatively uninteresting in the overall context of the bench?

No it's definitely interesting. It suggests that Opus 4 actually failed to write proper syntax on the first attempt, but given feedback it absolutely nailed the 2nd attempt. My takeaway is that this is great for peer-coding workflows - less "FIX IT CLAUDE"

That's a really useful benchmark, could you add 4.1-mini?

Yeah we're always looking for new models to add

jjwiseman · 3 months ago

Please add GPT o3.

Noted, also feel free to add an issue to the GitHub repo: https://github.com/tinybirdco/llm-benchmark

varunneal · 3 months ago

Why is o3-mini there but not o3?

We should definitely add o3 - probably will soon. Also looking at testing the Qwen models

joelthelion · 3 months ago

Did you try Sonnet 4?

vladimirralev · 3 months ago

It's placed at 10. Below claude-3.5-sonnet, GPT 4.1 and o3-mini.

kadushka · 3 months ago

what about o3?

We need to add it

tptacek · 3 months ago

Have they documented the context window changes for Claude 4 anywhere? My (barely informed) understanding was one of the reasons Gemini 2.5 has been so useful is that it can handle huge amounts of context --- 50-70kloc?

Context window is unchanged for Sonnet. (200k in/64k out): https://docs.anthropic.com/en/docs/about-claude/models/overv...

In practice, the 1M context of Gemini 2.5 isn't that much of a differentiator because larger context has diminishing returns on adherence to later tokens.

Rudybega · 3 months ago

I'm going to have to heavily disagree. Gemini 2.5 Pro has super impressive performance on large context problems. I routinely drive it up to 4-500k tokens in my coding agent. It's the only model where that much context produces even remotely useful results.

I think it also crushes most of the benchmarks for long context performance. I believe on MRCR (multi round coreference resolution) it beats pretty much any other model's performance at 128k at 1M tokens (o3 may have changed this).

zamadatix · 3 months ago

The amount of degradation at a given context length isn't constant though so a model with 5x the context can either be completely useless or still better depending on the strength of the models you're comparing. Gemini actually does really great in both regards (context length and quality at length) but I'm not sure what a hard numbers comparison to the latest Claude models would look like.

A good deep dive on the context scaling topic in general https://youtu.be/NHMJ9mqKeMQ

Gemini's real strength is that it can stay on the ball even at 100k tokens in context.

michaelbrave · 3 months ago

I've had a lot of fun using Gemini's large context. I scrape a reddit discussion with 7k responses, and have gemini synthesize it and categorize it, and by the time it's done and I have a few back and fourths with it I've gotten half of a book written.

That said I have noticed that if I try to give it additional threads to compare and contrast once it hits around the 300-500k tokens it starts to hallucinate more and forget things more.

ashirviskas · 3 months ago

It's closer to <30k before performance degrades too much for 3.5/3.7. 200k/64k is meaningless in this context.

Yeah, but why aren't they attacking that problem? Is it just impossible, because it would be a really simple win with regards to coding. I am huge enthusiast, but I am starting to feel a peak.

VeejayRampay · 3 months ago

that is just not correct, it's a big differentiator

fblp · 3 months ago

I wish they would increase the context window or better handle it when the prompt gets too long. Currently users get "prompt is too long" warnings suddenly which makes it a frustrating model to work with for long conversations, writing etc.

Other tools may drop some prior context, or use RAG to help but they don't force you to start a new chat without warning.

jbellis · 3 months ago

not sure wym, it's in the headline of the article that Opus 4 has 200k context

(same as sonnet 3.7 with the beta header)

esafak · 3 months ago

There's the nominal context length, and the effective one. You need a benchmark like the needle-in-a-haystack or RULER to determine the latter.

https://github.com/NVIDIA/RULER

We might be looking at different articles? The string "200" appears nowhere in this one --- or I'm just wrong! But thanks!

keeganpoppen · 3 months ago

context window size is super fake. if you don't have the right context, you don't get good output.

a2128 · 3 months ago

> Finally, we've introduced thinking summaries for Claude 4 models that use a smaller model to condense lengthy thought processes. This summarization is only needed about 5% of the time—most thought processes are short enough to display in full. Users requiring raw chains of thought for advanced prompt engineering can contact sales about our new Developer Mode to retain full access.

I don't want to see a "summary" of the model's reasoning! If I want to make sure the model's reasoning is accurate and that I can trust its output, I need to see the actual reasoning. It greatly annoys me that OpenAI and now Anthropic are moving towards a system of hiding the models thinking process, charging users for tokens they cannot see, and providing "summaries" that make it impossible to tell what's actually going on.

There are several papers pointing towards 'thinking' output is meaningless to the final output, and using dots, or pause tokens enabling the same additional rounds of throughput result in similar improvements.

So in a lot of regards the 'thinking' is mostly marketing.

- "Think before you speak: Training Language Models With Pause Tokens" - https://arxiv.org/abs/2310.02226

- "Let's Think Dot by Dot: Hidden Computation in Transformer Language Models" - https://arxiv.org/abs/2404.15758

- "Do LLMs Really Think Step-by-step In Implicit Reasoning?" - https://arxiv.org/abs/2411.15862

- Video by bycloud as an overview -> https://www.youtube.com/watch?v=Dk36u4NGeSU

Davidzheng · 3 months ago

Lots of papers are insane. You can test it on competition math problems with s local AI and replace its thinking process with dots and see the result yourself.

cayley_graph · 3 months ago

Wow, my first ever video on AI! I'm rather disappointed. That was devoid of meaningful content save for the two minutes where they went over the Anthropic blog post on how LLMs (don't) do addition. Importantly, they didn't remotely approach what those other papers are about, or why thinking tokens aren't important for chain-of-thought. Is all AI content this kind of slop? Sorry, no offense to the above comment, it was just a total waste of 10 minutes that I'm not used to.

So, to anyone more knowledgeable than the proprietor of that channel: can you outline why it's possible to replace thinking tokens with garbage without a decline in output quality?

edit: Section J of the first paper seems to offer some succint explanations.

kovezd · 3 months ago

Don't be so concerned. There's ample evidence that thinking is often disassociated from the output.

My take is that this is a user experience improvement, given how little people actually goes on to read the thinking process.

padolsey · 3 months ago

If we're paying for reasoning tokens, we should be able to have access to these, no? Seems reasonable enough to allow access, and then we can perhaps use our own streaming summarization models instead of relying on these very generic-sounding ones they're pushing.

> There's ample evidence that thinking is often disassociated from the output.

What kind of work do use LLMs for? For the semi technical “find flaws in my argument” thing, I find it generally better at not making common or expected fallacies or assumptions.

then provide it as an option?

skerit · 3 months ago

Are they referring to their own chat interface? Because the API still streams the thinking tokens immediately.

khimaros · 3 months ago

i believe Gemini 2.5 Pro also does this

izabera · 3 months ago

I am now focusing on checking your proposition. I am now fully immersed in understanding your suggestion. I am now diving deep into whether Gemini 2.5 pro also does this. I am now focusing on checking the prerequisites.

patates · 3 months ago

It does now, but I think it wasn't the case before?