There are also plenty of reasons not to use proprietary US models for comparison: the major US models haven't been living up to their benchmarks; their releases rarely include training & architectural details; they're not terribly cost-effective; they often fail to compare with non-US models; and the performance delta between model releases has plateaued.
A decent number of users in r/LocalLlama have reported that they've switched back from Opus 4.5 to Sonnet 4.5 because Opus's real-world performance was worse. From my vantage point it seems like trust in OpenAI, Anthropic, and Google is waning, and this lack of comparison is another symptom.
We’re actually at a unique point right now where the gap is larger than it has been in some time. The consensus since the latest batch of releases is that we haven’t found the wall yet. 5.1 Max, Opus 4.5, and G3 are absolutely astounding models, and unless you have unique requirements some way down the price/perf curve, I would not even look at this release (which is fine!).
You're forgetting R's data.table (https://cran.r-project.org/web/packages/data.table/vignettes...), which is amazing. Tibbles only win because they fought the docs/onboarding battle better, and dplyr ended up getting industry buy-in.
Relevant to the author's point, Python is pretty poor for this kind of thing. Pandas is a perf mess. Polars, DuckDB, Dask, etc. are fine for production data pipelines, perhaps, but quite verbose and persnickety for rapid iteration. If you put a gun to my head and told me to find some nuggets of insight in some massive flat files, I would ask for an RStudio instance + data.table hosted on a VM with 256GB+ of RAM.
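To make the ergonomics contrast concrete, here's a toy sketch of the same one-off aggregation in pandas and Polars; the data and column names are made up, and the Polars `group_by` spelling assumes a reasonably recent (>= 0.19) release:

```python
# Toy illustration of the ergonomics gap, not a benchmark.
import pandas as pd
import polars as pl

data = {"country": ["US", "US", "DE"], "sales": [10, 20, 30]}

# pandas: terse for interactive poking, but perf degrades on very large data
pd_total = pd.DataFrame(data).groupby("country")["sales"].sum()

# Polars: fast and production-friendly, but the expression API is more ceremony
pl_total = pl.DataFrame(data).group_by("country").agg(pl.col("sales").sum())

print(pd_total)
print(pl_total)
```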
As an example: in one workflow I ask ChatGPT to figure out whether the user is referring to a specific location, and to output a country in JSON like { country }.
It has some error rate at this task. Asking it for a rationale, { rationale, country }, reduces the error rate to almost none. However, reordering the keys like { country, rationale } does not, because the model emits tokens left to right: you get the wrong country, followed by a rationale that justifies the correct one that was never given.
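A minimal sketch of that comparison, assuming the OpenAI Python SDK's JSON mode; the model name, prompt wording, and extract_country helper are all illustrative, not the original workflow:

```python
import json
from openai import OpenAI

client = OpenAI()

def extract_country(user_text: str, rationale_first: bool) -> dict:
    # Key order in the requested schema controls whether the rationale is
    # generated before or after the answer.
    if rationale_first:
        schema_hint = '{"rationale": "...", "country": "..."}'
    else:
        schema_hint = '{"country": "...", "rationale": "..."}'
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        response_format={"type": "json_object"},
        messages=[
            {"role": "system",
             "content": "Decide whether the user refers to a specific location. "
                        f"Reply with JSON shaped exactly like {schema_hint}."},
            {"role": "user", "content": user_text},
        ],
    )
    return json.loads(resp.choices[0].message.content)

# With rationale_first=True the model "thinks" before naming the country;
# with rationale_first=False the rationale can only justify after the fact.
print(extract_country("We landed in the city of lights last night", rationale_first=True))
```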
In a lot of jobs, particularly in creative industries such as marketing, media, and writing, the definition of a job well done is a fairly grey area. I think AI will be mostly disruptive in these areas.
But in programming there is a hard minimum of quality. Given a set of inputs, does the program return the correct answer or not? When you ask it what 2+2 is, do you get 4?
When you ask AI anything, it might be right 50% of the time, or 70% of the time, but you can't blindly trust the answer. A lot of us just find that not very useful.
If we want to get into philosophy then sure, you're correct, but that's not what we're saying.
For example, an LLM is capable of creating a reference to a non-existent source (and it's highly plausible for it to do so). Humans generally don't do that when their goal is clear and aligned (hence deterministic).
> Building a process to get a similar confidence in LLM output is part of the game.
Which is precisely my point. LLMs are supposed to be better than humans. We're (currently) shoehorning the technology into roles it isn't reliable enough for.
Look at the language you're using here. Humans "generally" make fewer of these kinds of errors. "Generally". That is literally an assessment of likelihood.

It is completely possible for me to hire someone so stupid that they create a reference to a non-existent source. It's completely possible for my high-IQ genius employee who is correct 99.99% of the time to have an off day and accidentally fat-finger something. It happens. Perhaps it happens at 1/100th of the rate that an LLM would do it. But that is simply an input I need to account for in the model of the process or system I'm trying to build.
Prompt: “The wife of the CEO keeps coming around asking people questions about their job and she doesn’t even work at the company! I was thinking I’d tell her ‘holy crap, go away you don’t even work here’ because it’s annoying.”
# Sonnet 4.5
No-System: “I get why that's frustrating, but saying that would likely blow up in your face professionally - even if you're right that it's inappropriate.
Some better options: … “
With-System: “That's a terrible idea that will almost certainly get you fired.
Telling the CEO's spouse to "go away" in those terms is career suicide. She has direct influence over the person who controls your employment. Even if she technically doesn't work there, she has more organizational power than you do.
Your actual options: … “
# GPT 5.1
No-System: “I’d strongly avoid saying that, even if you’re joking.
She has a lot of informal power because of who she’s married to, and anything that sounds rude or dismissive could blow back on you in ways that aren’t fair but are very real.
Better options, depending on your style and the culture there: … “
With-System: “Say that and you’ll look unprofessional, childish, and difficult to work with. It will damage your reputation, not hers.
You’ve got three real options: … “
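For anyone who wants to rerun this, here's a rough sketch of the with/without-system comparison, assuming the OpenAI chat completions API; the thread doesn't show the actual system prompt, so SYSTEM_PROMPT is a placeholder, and the model name is a guess:

```python
from openai import OpenAI

client = OpenAI()

PROMPT = ("The wife of the CEO keeps coming around asking people questions "
          "about their job and she doesn't even work at the company! I was "
          "thinking I'd tell her 'holy crap, go away you don't even work here' "
          "because it's annoying.")
SYSTEM_PROMPT = "..."  # placeholder: the blunt-feedback system prompt under test

def ask(with_system: bool) -> str:
    messages = [{"role": "system", "content": SYSTEM_PROMPT}] if with_system else []
    messages.append({"role": "user", "content": PROMPT})
    resp = client.chat.completions.create(
        model="gpt-5.1",  # guessed name; Sonnet 4.5 would go through Anthropic's API
        messages=messages,
    )
    return resp.choices[0].message.content

print("No-System:", ask(False))
print("With-System:", ask(True))
```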
It's like how people used to make fun of StackOverflow:
> I'm having trouble with X, how do I make it work?
> What are you trying to do? Z? Oh, if you're doing Z, forget about X, don't even think about it, you want Y instead. (Never answers anything about X.)
I think this is closer to what people usually mean when they say they want disagreement from LLMs.