An important note not mentioned in this announcement is that Claude 4's training cutoff date is March 2025, which is the latest of any recent model. (Gemini 2.5 has a cutoff of January 2025)
With web search being available in all major user-facing LLM products now (and I believe in some APIs as well, sometimes unintentionally), I feel like the exact month of cutoff is becoming less and less relevant, at least in my personal experience.
The models I'm regularly using are usually smart enough to figure out that they should be pulling in new information for a given topic.
It still matters for software packages, particularly Python packages that have to do with programming with AI!
They are evolving quickly, with deprecation and updated documentation. Having to correct for this in system prompts is a pain.
It would be great if the models were updating portions of their content more recently than others.
For the tailwind example in parent-sibling comment, should absolutely be as up to date as possible, whereas the history of the US civil war can probably be updated less frequently.
Valid. I suppose the most annoying thing related to the cutoffs is the model's knowledge of library APIs, especially when there are breaking changes. Even when they have some knowledge of the most recent version, they tend to default to whatever they have seen the most in training, which is typically older code. I suspect the frontier labs have all been working to mitigate this. I'm just super stoked, been waiting for this one to drop.
In my experience it really depends on the situation. For stable APIs that have been around for years, sure, it doesn't really matter that much. But if you try to use a library that had significant changes after the cutoff, the models tend to do things the old way, even if you provide a link to examples with new code.
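To make that concrete, here is a sketch of the kind of breaking change I mean, using Pydantic's v1-to-v2 rename as an example (the model class is purely illustrative); models that saw mostly v1 code in training keep reaching for the deprecated calls:

    from pydantic import BaseModel

    class User(BaseModel):
        name: str
        age: int

    user = User(name="Ada", age=36)

    # What LLMs often generate (Pydantic v1 style, deprecated in v2):
    #   user.dict()
    #   User.parse_obj({"name": "Ada", "age": 36})

    # The Pydantic v2 equivalents:
    print(user.model_dump())                                # replaces .dict()
    print(User.model_validate({"name": "Ada", "age": 36}))  # replaces .parse_obj()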
For recent resources it might matter: unless the training data is curated meticulously, it may be "spoiled" by the output of other LLMs, or even by the previous version of the very model being trained. That is generally considered dangerous, because it could potentially produce an unintentional echo chamber or even a somewhat "incestuously degenerated" new model.
> Can we assume everything that can be updated is updated?
What does that even mean? Of course an LLM doesn't know everything, so we wouldn't be able to assume everything got updated either. At best, if they shared the datasets they used (which they won't, because most likely they were acquired illegally), you could make some guesses about what they tried to update.
Can we assume that it would have some FastHTML training with that March 2025 cutoff date? I'd hope so, but I guess it's more likely that it still hasn't been trained on FastHTML?
Claude 4 actually knows FastHTML pretty well! :D It managed to one-shot most basic tasks I sent its way, although it makes a lot of standard minor n00b mistakes that make its code a bit longer and more complex than needed.
I've nearly finished writing a short guide which, when added to a prompt, gives quite idiomatic FastHTML code.
One thing I'm 100% sure of is that a cutoff date doesn't really exist for any large model - or rather, there is no single date, since it's practically impossible to achieve that.
“GitHub says Claude Sonnet 4 soars in agentic scenarios and will introduce it as the base model for the new coding agent in GitHub Copilot.”
Maybe this model will push the “Assign to CoPilot” closer to the dream of having package upgrades and other mostly-mechanical stuff handled automatically. This tech could lead to a huge revival of older projects as the maintenance burden falls.
I don't think he's saying this is the version that will suddenly trigger a Renaissance. Rather, it's one solid step that makes the path ever more promising.
Sure, everyone gets a bit overexcited each release until they find the bounds. But the bounds are expanding, and the need for careful prompt engineering is diminishing. Ever since 3.7, Claude has been a regular part of my process for the mundane. And so far 4.0 seems to take less fighting for me.
A good question would be when can AI take a basic prompt, gather its own requirements and build meaningful PRs off basic prompt. I suspect it's still at least a couple of paradigm shifts away. But those seem to be coming every year or faster.
I am incredibly eager to see what affordable coding agents can do for open source :) in fact, I should really be giving away CheepCode[0] credits to open source projects. Pending any sort of formal structure, if you see this comment and want free coding agent runs, email me and I’ll set you up!
[0] My headless coding agents product, similar to “assign to copilot” but works from your task board (Linear, Jira, etc) on multiple tasks in parallel. So far simple/routine features are already quite successful. In general the better the tests, the better the resulting code (and yes, it can and does write its own tests).
That's kind of my benchmark for whether or not these models are useful. I've got a project that needs some extensive refactoring to get working again. Mostly upgrading packages, but also it will require updating the code to some new language semantics that didn't exist when it was written. So far, current AI models can make essentially zero progress on this task. I'll keep trying until they can!
Personally, I don't believe AI is ever going to get to that level. I'd love to be proven wrong, but I really don't believe that an LLM is the right tool for a job that requires novel thinking about out of the ordinary problems like all the weird edge cases and poor documentation that comes up when trying to upgrade old software.
And IMO it has a long way to go. There is a lot of nuance when orchestrating dependencies that can cause subtle errors in an application that are not easily remedied.
For example, a lot of LLMs (I've seen it in Gemini 2.5 and Claude 3.7) will code non-existent methods in dynamic languages. While these runtime errors are often auto-fixable, sometimes they aren't, and breaking out of an agentic workflow to deep-dive the problem is quite frustrating - if mostly because agentic coding entices us into being so lazy.
I think this type of thing needs an agent that has access to the documentation, so it can read about the nuances of the language and package versions, and definitely a way to investigate types and interfaces. The problem is that the training data mixes so many versions that it can easily confuse the AI into mixing up versions, APIs, etc.
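A minimal sketch of what such a tool could look like: something the agent can call to check the installed version of a package and whether an attribute it wants to use actually exists, including its real signature. The function name and return format here are just assumptions for illustration:

    import importlib
    import importlib.metadata
    import inspect

    def inspect_package(package: str, attr_path: str | None = None) -> dict:
        """Report the installed version of `package` and, optionally, whether a
        dotted attribute (e.g. "Session.request") exists on it."""
        info = {"package": package, "version": importlib.metadata.version(package)}
        if attr_path:
            obj = importlib.import_module(package)
            for part in attr_path.split("."):
                obj = getattr(obj, part, None)
                if obj is None:
                    info["exists"] = False
                    return info
            info["exists"] = True
            if callable(obj):
                # Expose the real parameters so the agent doesn't have to guess them.
                try:
                    info["signature"] = str(inspect.signature(obj))
                except (TypeError, ValueError):
                    pass
        return info

    # Example: is `requests.Session.request` real, and what does it take?
    print(inspect_package("requests", "Session.request"))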
> Users requiring raw chains of thought for advanced prompt engineering can contact sales
So it seems like all 3 of the LLM providers are now hiding the CoT - which is a shame, because it helped to see when the model was going to go down the wrong track, and allowed you to quickly refine the prompt to ensure it didn't.
In addition to OpenAI, Google also just recently started summarizing the CoT, replacing it with what is, in my opinion, an overly dumbed-down summary.
https://assets.anthropic.com/m/71876fabef0f0ed4/original/rea...
>We evaluate CoT faithfulness of state-of-the-art reasoning models across 6 reasoning hints presented in the prompts and find: (1) for most settings and models tested, CoTs reveal their usage of hints in at least 1% of examples where they use the hint, but the reveal rate is often below 20%, (2) outcome-based reinforcement learning initially improves faithfulness but plateaus without saturating, and (3) when reinforcement learning increases how frequently hints are used (reward hacking), the propensity to verbalize them does not increase, even without training against a CoT monitor. These results suggest that CoT monitoring is a promising way of noticing undesired behaviors during training and evaluations, but that it is not sufficient to rule them out.
I.e., chain of thought may be a confabulation by the model, too. So perhaps there's somebody at Anthropic who doesn't want to mislead their customers. Perhaps they'll come back once this problem is solved.
I've been thinking for a couple of months now that prompt engineering, and therefore CoT, is going to become the "secret sauce" companies want to hold onto.
If anything that is where the day to day pragmatic engineering gets done. Like with early chemistry, we didn't need to precisely understand chemical theory to produce mass industrial processes by making a good enough working model, some statistical parameters, and good ole practical experience. People figured out steel making and black powder with alchemy.
The only debate now is whether the prompt engineering models are currently closer to alchemy or modern chemistry. I'd say we're at advanced alchemy with some hints of rudimentary chemistry.
Also, unrelated but with CERN turning lead into gold, doesn't that mean the alchemists were correct, just fundamentally unprepared for the scale of the task? ;)
We won't know without an official answer leaking, but a simple answer could be: people spend too much time trying to analyse those steps without understanding the details. There was a lot of talk on HN about the thinking steps second-guessing and contradicting themselves. But in practice that step is both trained by explicitly injecting "however", "but" and similar words, and processed in ways that go beyond simply reading the thinking part as the text we see. If the content is commonly misunderstood, why show it?
IIRC RLHF inevitably compromises model accuracy in order to train the model not to give dangerous responses.
It would make sense if the model used for chain-of-thought was trained differently (perhaps a different expert from an MoE?) from the one used to interact with the end user. Since the end user is only ever going to see its output filtered through the public model, the chain-of-thought model can be closer to the original, more pre-RLHF version without risking the reputation of the company.
This way you can get the full performance of the original model whilst still maintaining the necessary filtering required to prevent actual harm (or terrible PR disasters).
Yeah we really should stop focusing on model alignment. The idea that it's more important that your AI will fucking report you to the police if it thinks you're being naughty than that it actually works for more stuff is stupid.
Correct me if I'm wrong--my understanding is that RLHF was the difference between GPT-3 and GPT-3.5, aka the original ChatGPT.
If you never used GPT 3, it was... not good. Well, that's not fair, it was revolutionary in its own right, but it was very much a machine for predicting the most likely next word, it couldn't talk to you the way ChatGPT can.
Which is to say, I think RLHF is important for much more than just preventing PR disasters. It's a key part of what makes the models useful.
DeepSeek never mopped the floor with anyone... DeepSeek was remarkable because it is claimed that they spent a lot less training it, and without Nvidia GPUs, and because they had the best open weight model for a while. The only area they mopped the floor in was open source models, which had been stagnating for a while. But qwen3 mopped the floor with DeepSeek R1.
Do people actually believe this? While I agree their open source contribution was impressive, I never got the sense they mopped the floor. Perhaps firms in China are using some of their models, but beyond learnings for the community, they didn't make a dent in the Western market.
It just makes it too easy to distill the reasoning into a separate model, I guess. Though I feel like o3 shows useful things about the reasoning while it's happening.
The Google CoT is so incredibly dumb. I thought my models had been lobotomized until I realized they must be doing some sort of processing on the thing.
You are referring to the new (few days old-ish) CoT, right? It's bizarre as to why Google did it; it was very helpful to see where the model was making assumptions or doing something wrong. Now half the time it feels better to just use Flash with no thinking mode but ask it to manually "think".
I can't be the only one who thinks this version is no better than the previous one, and that LLMs have basically reached a plateau, and all the new releases "feature" are more or less just gimmicks.
I think they are just getting better at the edges: MCP/tool calls, structured output. This definitely isn't increased intelligence, but it is an increase in the value add - not sure the value added equates to the training costs or company valuations though.
In all reality, I have zero clue how any of these companies remain sustainable. I've tried to host some inference on cloud GPUs and it seems like it would be extremely cost prohibitive with any sort of free plan.
They don't, they have a big bag of money they are burning through, and working to raise more. Anthropic is in a better position cause they don't have the majority of the public using their free-tier. But, AFAICT, none of the big players are profitable, some might get there, but likely through verticals rather than just model access.
If you read any work from Ed Zitron [1], they likely cannot remain sustainable. With OpenAI failing to convert into a for-profit, Microsoft being more interested in being a multi-modal provider and competing openly with OpenAI (e.g., open-sourcing Copilot vs. Windsurf, GitHub Agent with Claude as the standard vs. Codex), Google having their own SOTA models and not relying on their stake in Anthropic, tariffs complicating Stargate, the explosion in capital expenditure and compute, etc., I would not be surprised to see OpenAI and Anthropic go under in the next few years.
1: https://www.wheresyoured.at/oai-business/
This is the new stochastic parrots meme. Just a few hours ago there was a story on the front page where an LLM based "agent" was given 3 tools to search e-mails and the simple task "find my brother's kid's name", and it was able to systematically work the problem, search, refine the search, and infer the correct name from an e-mail not mentioning anything other than "X's favourite foods" with a link to a youtube video. Come on!
That's not to mention things like AlphaEvolve, Microsoft's agentic test demo w/ Copilot running a browser, exploring functionality and writing Playwright tests, and all the advances in coding.
And we also have a showcase from a day ago [1] of these magical autonomous AI agents failing miserably in the PRs unleashed on the dotnet codebase, where the agent kept reiterating that it had fixed the failing tests it wrote, without actually fixing them. Oh, and multiple blatant failures that happened live on stage [2], with the speaker trying to sweep the failures under the rug on some of the simplest code imaginable.
But sure, it managed to find a name buried in some emails after being told to... Search through emails. Wow. Such magic
[1] https://news.ycombinator.com/item?id=44050152 [2] https://news.ycombinator.com/item?id=44056530
I have used Claude Code a ton and I agree; I haven't noticed a single difference since updating. Its summaries are, I guess, a little cleaner, but it has not surprised me at all in ability. I find I am correcting it and re-prompting it as much as I did with 3.7 on a TypeScript codebase. In fact, I was kind of shocked how badly it did in a situation where it was editing the wrong file, and it never thought to check that more specifically until I forced it to delete all the code and show that nothing changed with regards to what we were looking at.
This is my feeling too, across the board. Nowadays, benchmark wins seem to come from tuning, but then cause losses in other areas. o3 and o4-mini also hallucinate more than o1 on SimpleQA and PersonQA. Synthetic data seems to cause higher hallucination rates, and reasoning models are at even higher risk, since hallucinations can throw the model off track at each reasoning step.
LLMs in the generic-use sense have been done since earlier this year already. OpenAI discovered this when they had to cancel GPT-5 and later released the ”too costly for the gains” GPT-4.5, which will be sunset soon.
I’m not sure the stock market has factored all this in yet. There needs to be a breakthrough to get us past this place.
The benchmarks in many ways seem to be very similar to Claude 3.7 for most cases.
That's nowhere near enough reason to think we've hit a plateau - the pace has been super fast, give it a few more months to call that...!
I think the opposite about the features - they aren't gimmicks at all, but indeed they aren't part of the core AI. Rather, it's important "tooling" adjacent to the AI that we need to actually leverage it. The LLM field in popular usage is still in its infancy. If the models don't improve (but I expect they will), we have a TON of room with these features and how we interact, feed them information, make tool calls, etc., to greatly improve usability and capability.
It's not that it isn't better, it's actually worse. Seems like the big guys are stuck on a race to overfit for benchmarks, and this is becoming very noticeable.
It seems MUCH better at tool usage. Just had an example where I asked Sonnet 4 to split a PR I had after we had to revert an upstream commit.
I didn't want to lose the work I had done, and I knew it would be a pain to do it manually with git. The model did a fantastic job of iterating through the git commits and deciding what to put into each branch. It got everything right except for a single test that I was able to easily move to the correct branch myself.
I asked it a few questions and it responded exactly like all the other models do. Some of the questions were difficult / very specific, and it failed in the same way all the other models failed.
They just need to put out a simple changelog for these model updates; no need to make a big announcement every time to make it look like it's a whole new thing. And the version numbers are even worse.
The increases are not as fast, but they're still there. The models are already exceptionally strong; I'm not sure that basic questions can capture the differences very well.
Sooo, I love Claude 3.7, and use it every day, I prefer it to Gemini models mostly, but I've just given Opus 4 a spin with Claude Code (codebase in Go) for a mostly greenfield feature (new files mostly) and... the thinking process is good, but 70-80% of tool calls are failing for me.
And I mean basic tools like "Write", "Update" failing with invalid syntax.
Five attempts to write a file (all failed), and it keeps trying with the following comment:
> I keep forgetting to add the content parameter. Let me fix that.
So something is wrong here. Fingers crossed it'll be resolved soon, because right now Opus 4, at least, is unusable for me with Claude Code.
The files it did succeed in creating were high quality.
Basically it seems to be hitting the max output token count (writing out a whole new file in one go), stops the response, and the invalid tool call parameters error is a red herring.
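If that's the case, it should be visible in the raw API response: the Messages API sets stop_reason to "max_tokens" when output gets cut off. A minimal sketch of checking for that (model id and token limit are placeholders):

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    response = client.messages.create(
        model="claude-opus-4-20250514",  # placeholder model id
        max_tokens=8192,                 # placeholder output limit
        messages=[{"role": "user", "content": "Write out the full contents of main.go"}],
    )

    if response.stop_reason == "max_tokens":
        # The reply was truncated, so any tool-call parameters inside it may be
        # incomplete - the "missing parameter" error would be a symptom, not the cause.
        print("Output hit the token limit; retry with smaller chunks or a higher limit.")
    else:
        print(response.content)

If the token limit is the real stop reason, asking for smaller per-file edits (or raising the output limit) is more likely to help than re-prompting.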
Yeah I mean SQL is pretty nuanced - one of the things we want to improve in the benchmark is how we measure "success", in the sense that multiple correct SQL results can look structurally dissimilar while semantically answering the prompt.
I pay for Claude premium but actually use Grok quite a bit; the 'think' function usually gets me where I want more often than not. Odd you don't have any xAI models listed. Sure, Grok is a terrible name, but it surprises me more often. I have not tried the $250 ChatGPT model yet though; I just don't like OpenAI's practices lately.
Not saying you're wrong about "OpenAI practices", but that's kind of a strange thing to complain about right after praising an LLM that was only recently inserting claims of "white genocide" into every other response.
Just curious, how do you know your questions and the SQL aren't in the LLM training data? Looks like the benchmark questions w/SQL are online (https://ghe.clickhouse.tech/).
Actually no, we have it up to 3 attempts. In fact, Opus 4 failed on 36/50 tests on the first attempt, but it was REALLY good at nailing the second attempt after receiving error feedback.
No it's definitely interesting. It suggests that Opus 4 actually failed to write proper syntax on the first attempt, but given feedback it absolutely nailed the 2nd attempt. My takeaway is that this is great for peer-coding workflows - less "FIX IT CLAUDE"
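That error-feedback loop is easy to reproduce outside the benchmark too. A rough sketch of the shape of it; `ask_model` and `run_sql` are hypothetical helpers standing in for the LLM call and the database client:

    MAX_ATTEMPTS = 3

    def generate_sql_with_feedback(question: str, ask_model, run_sql) -> str | None:
        """Ask the model for SQL, execute it, and feed any error back for a retry.
        `ask_model(prompt) -> str` and `run_sql(sql)` are assumed helpers."""
        prompt = f"Write a SQL query to answer: {question}"
        for _ in range(MAX_ATTEMPTS):
            sql = ask_model(prompt)
            try:
                run_sql(sql)   # raises on syntax/semantic errors
                return sql     # first query that executes cleanly wins
            except Exception as err:
                # Second attempts tend to land once the model sees the error text.
                prompt = (
                    f"The previous query failed with: {err}\n"
                    f"Original question: {question}\n"
                    f"Previous query:\n{sql}\n"
                    "Please return a corrected query."
                )
        return None  # counted as a failure after MAX_ATTEMPTS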
Have they documented the context window changes for Claude 4 anywhere? My (barely informed) understanding was one of the reasons Gemini 2.5 has been so useful is that it can handle huge amounts of context --- 50-70kloc?
In practice, the 1M context of Gemini 2.5 isn't that much of a differentiator because larger context has diminishing returns on adherence to later tokens.
I'm going to have to heavily disagree. Gemini 2.5 Pro has super impressive performance on large-context problems. I routinely drive it up to 400-500k tokens in my coding agent. It's the only model where that much context produces even remotely useful results.
I think it also crushes most of the benchmarks for long-context performance. I believe on MRCR (multi-round coreference resolution) its performance at 1M tokens beats pretty much any other model's performance at 128k (o3 may have changed this).
The amount of degradation at a given context length isn't constant though, so a model with 5x the context can either be completely useless or still better, depending on the strength of the models you're comparing. Gemini actually does really great in both regards (context length and quality at length), but I'm not sure what a hard-numbers comparison to the latest Claude models would look like.
I've had a lot of fun using Gemini's large context. I scrape a Reddit discussion with 7k responses and have Gemini synthesize and categorize it, and by the time it's done and I've had a few back and forths with it, I've gotten half of a book written.
That said I have noticed that if I try to give it additional threads to compare and contrast once it hits around the 300-500k tokens it starts to hallucinate more and forget things more.
Yeah, but why aren't they attacking that problem? Is it just impossible? Because it would be a really simple win with regards to coding. I am a huge enthusiast, but I am starting to feel we're near a peak.
I wish they would increase the context window or better handle it when the prompt gets too long. Currently users get "prompt is too long" warnings suddenly which makes it a frustrating model to work with for long conversations, writing etc.
Other tools may drop some prior context, or use RAG to help but they don't force you to start a new chat without warning.
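A client-side stopgap until that happens: estimate the prompt size yourself and drop (or summarize) the oldest turns before the API rejects the request. A rough sketch; the ~4-characters-per-token ratio and the budget number are just assumptions:

    def trim_history(messages: list[dict], max_tokens: int = 180_000) -> list[dict]:
        """Drop the oldest turns until a rough token estimate fits the budget.
        Assumes ~4 characters per token, which is only a coarse approximation."""
        def estimate(msgs):
            return sum(len(m["content"]) for m in msgs) // 4

        trimmed = list(messages)
        # Keep the first message (often the system/context setup) and the newest turns.
        while len(trimmed) > 2 and estimate(trimmed) > max_tokens:
            trimmed.pop(1)
        return trimmed

    history = [{"role": "user", "content": "..."}]  # running conversation
    history = trim_history(history)                 # then send `history` as usual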
> Finally, we've introduced thinking summaries for Claude 4 models that use a smaller model to condense lengthy thought processes. This summarization is only needed about 5% of the time—most thought processes are short enough to display in full. Users requiring raw chains of thought for advanced prompt engineering can contact sales about our new Developer Mode to retain full access.
I don't want to see a "summary" of the model's reasoning! If I want to make sure the model's reasoning is accurate and that I can trust its output, I need to see the actual reasoning. It greatly annoys me that OpenAI and now Anthropic are moving towards a system of hiding the models thinking process, charging users for tokens they cannot see, and providing "summaries" that make it impossible to tell what's actually going on.
There are several papers pointing towards 'thinking' output being meaningless to the final output, and showing that using dots or pause tokens to enable the same additional rounds of throughput results in similar improvements.
So in a lot of regards the 'thinking' is mostly marketing.
- "Think before you speak: Training Language Models With Pause Tokens" - https://arxiv.org/abs/2310.02226
- "Let's Think Dot by Dot: Hidden Computation in Transformer Language Models" - https://arxiv.org/abs/2404.15758
- "Do LLMs Really Think Step-by-step In Implicit Reasoning?" - https://arxiv.org/abs/2411.15862
- Video by bycloud as an overview -> https://www.youtube.com/watch?v=Dk36u4NGeSU
Lots of papers are insane. You can test it on competition math problems with a local AI, replace its thinking process with dots, and see the result yourself.
Wow, my first ever video on AI! I'm rather disappointed. That was devoid of meaningful content save for the two minutes where they went over the Anthropic blog post on how LLMs (don't) do addition. Importantly, they didn't remotely approach what those other papers are about, or why thinking tokens aren't important for chain-of-thought. Is all AI content this kind of slop? Sorry, no offense to the above comment, it was just a total waste of 10 minutes that I'm not used to.
So, to anyone more knowledgeable than the proprietor of that channel: can you outline why it's possible to replace thinking tokens with garbage without a decline in output quality?
edit: Section J of the first paper seems to offer some succinct explanations.
If we're paying for reasoning tokens, we should be able to have access to these, no? Seems reasonable enough to allow access, and then we can perhaps use our own streaming summarization models instead of relying on these very generic-sounding ones they're pushing.
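For what it's worth, the extended-thinking API already exposes thinking as separate content blocks, so piping them into your own summarizer is mostly plumbing. A sketch of the idea; the model id is a placeholder, and I'm assuming the `thinking` request parameter and the `thinking`/`text` block types behave for Claude 4 the way they did for 3.7:

    import anthropic

    client = anthropic.Anthropic()

    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder model id
        max_tokens=16000,
        thinking={"type": "enabled", "budget_tokens": 8000},  # assumed, as in 3.7
        messages=[{"role": "user", "content": "Plan a migration from REST to gRPC."}],
    )

    thinking_text = "".join(
        block.thinking for block in response.content if block.type == "thinking"
    )
    answer_text = "".join(
        block.text for block in response.content if block.type == "text"
    )

    # Feed `thinking_text` to your own (smaller, cheaper) summarizer here,
    # instead of relying on the provider's built-in summaries.
    print(len(thinking_text), "chars of thinking;", len(answer_text), "chars of answer")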
> There's ample evidence that thinking is often disassociated from the output.
What kind of work do you use LLMs for? For the semi-technical "find flaws in my argument" thing, I find it generally better at not making common or expected fallacies or assumptions.
I am now focusing on checking your proposition. I am now fully immersed in understanding your suggestion. I am now diving deep into whether Gemini 2.5 pro also does this. I am now focusing on checking the prerequisites.
https://docs.anthropic.com/en/docs/about-claude/models/overv...
Fair enough, but information encoded in the model is returned in milliseconds; information that needs to be scraped is returned in tens of seconds.
It seems people have turned GenAI into coding assistants only and forget that they can actually be used for other projects too.
> Which version of tailwind css do you know?
> I have knowledge of Tailwind CSS up to version 3.4, which was the latest stable version as of my knowledge cutoff in January 2025.
LLMs can not reliably tell whether they know or don't know something. If they did, we would not have to deal with hallucinations.
"Who is president?" gives a "April 2024" date.
https://en.wikipedia.org/wiki/Catastrophic_interference
A cutoff date D means "The model includes nothing AFTER date D", and not "The model includes everything ON OR BEFORE date D".
Right? Definitionally, the model can't include anything that happened after training stopped.
Or is it?
If you keep waiting for new information, of course you're never going to train.
Both Sonnet and Opus 4 say Joe Biden is president and claim their knowledge cutoff is "April 2024".
Oh, we know exactly what they will do: they will drive devs insane: https://www.reddit.com/r/ExperiencedDevs/comments/1krttqo/my...
Those are already non-issues mostly solved by bots.
In any case, where I think AI could help here would be by summarizing changes, conflicts, and impact on the codebase, and possibly also conducting security scans.
It helped me tremendously learning Zig.
Seeing its chain of thought when asking it stuff about Zig and implementations helped me widen my horizons a lot.
https://noisegroove.substack.com/p/somersaulting-down-the-sl...
We're watching innovation move into the use and application of LLMs.
At least I personally liked it better.
Opus 4 beat all other models. It's good.
If a model is really that much smarter, shouldn't it lead to better first-attempt performance? It still "thinks" beforehand, right?
There's some interesting takeaways we learned here after the first round: https://www.tinybird.co/blog-posts/we-graded-19-llms-on-sql-...
I wonder how much the results would change with a more agentic flow (e.g. allow it to see an error or select * from the_table first).
Sonnet seems particularly good at in-session learning (e.g., correcting its own mistakes based on a linter).
Is there anything to read into needing twice the "Avg Attempts", or is this column relatively uninteresting in the overall context of the bench?
A good deep dive on the context scaling topic in general https://youtu.be/NHMJ9mqKeMQ
(same as sonnet 3.7 with the beta header)
https://github.com/NVIDIA/RULER
My take is that this is a user-experience improvement, given how few people actually go on to read the thinking process.