Claude Haiku 4.5 - Readit News

Very preliminary testing is very promising, seems far more precise in code changes over GPT-5 models in not ingesting irrelevant to the task at hand code sections for changes which tends to make GPT-5 as a coding assistant take longer than sometimes expected. With that being the case, it is possible that in actual day-to-day use, Haiku 4.5 may be less expensive than the raw cost breakdown may appear initially, though the increase is significant.

Branding is the true issue that Anthropic has though. Haiku 4.5 may (not saying it is, far to early to tell) be roughly equivalent in code output quality compared to Sonnet 4, which would serve a lot users amazingly well, but by virtue of the connotations smaller models have, alongside recent performance degradations making users more suspicious than beforehand, getting these do adopt Haiku 4.5 over Sonnet 4.5 even will be challenging. I'd love to know whether Haiku 3, 3.5 and 4.5 are roughly in the same ballpark in terms of parameters and course, nerdy old me would like that to be public information for all models, but in fairness to companies, many would just go for the largest model thinking it serves all use cases best. GPT-5 to me is still most impressive because of its pricing relative to performance and Haiku may end up similar, though with far less adoption. Everyone believes their task requires no less than Opus it seems after all.

For reference:

Haiku 3: I $0.25/M, O $1.25/M

Haiku 4.5: I $1.00/M, O $5.00/M

GPT-5: I $1.25/M, O $10.00/M

GPT-5-mini: I $0.25/M, O $2.00/M

GPT-5-nano: I $0.05/M, O $0.40/M

GLM-4.6: I $0.60/M, O $2.20/M

tosh · 4 months ago

One of the main issues I had with Claude Code (maybe it‘s the harness?) was that the agent tends to NOT read enough relevant code before it makes a change.

This leads to unnecessary helper functions instead of using existing helper functions and so on.

Not sure if it is an issue with the models or with the system prompts and so on or both.

lukev · 4 months ago

This may have been fixed as of yesterday... Version 2.0.17 added a built in "Explore" sub-agent that it seems to call quite a lot.

Helps solve the inherent tradeoff between reading more files (and filling up context) and keeping the context nice and tight (but maybe missing relevant stuff.)

thomasfromcdnjs · 4 months ago

You might get better results with https://github.com/oraios/serena

I sometimes use it, but I've found just adding to my claude.md something like "if you ever refactor code, try search around the codebase to see if their is an existing function you can use or extend"

ta12653421 · 4 months ago

Helpfer functions exploded over the last releases, id say? Very often I state: "combine this into one function"

another thing I saw in the last days starting: Claude now draws always an ASCII art instead of a graphical image, and the ASCII art is completely useless, when something is explained

Fergusonb · 4 months ago

I agree, claude is an impressive agent but it seems like it's impatient and trying to make its own thing, tries to make its own tests when I already have them, etc. Maybe better for a new project.

GPT 5 (at least with cline) reads whatever you give it, then laser targets the required changes.

With High, as long as I actually provided enough relevant context it usually one shots the solution and sometimes even finds things I left out.

The only downside for me is it's extremely slow, but I still use it on anything nuanced.

rbitar · 4 months ago

I regularly use @ key to add files to context for tasks I know require edits or patterns I want claude to follow, adds a few extra key strokes but in most cases the quality improvement is worth it

stingraycharles · 4 months ago

You need to plan down tasks really to the function level, and review things.

Topfi · 4 months ago

Update, Haiku 4.5 is not just very targeted in terms of changes but also really fast. Averaging at 220token/sec is almost double most other models I'd consider comparable (though again, far to early to make a proper judgement) and if this can be kept up, that is a massive value add over other models. That is nearly Gemini 2.5 Flash Lite speed for context.

Yes, we got Groq and Cerebras getting up to 1000token/sec, but not with models that seem comparable (again, early, not a proper judgement). Anthropic has been historically the most consistent in outperforming personal benchmarks vs public benchmarks, for what that is worth so I am optimistic.

If speed, performance and pricing are something Anthropic can keep consistent long term (i.e. no regressions), Haiku 4.5 really is a great option for most coding tasks, with Sonnet something I'd tag in only for very specific scenarios. Past Claude models have had a deficiency in longer term chains of tasks. Beyond 7 minutes roughly, performance does appear to worsen with Sonnet 4.5, as an example. That could be an Achilles heel for Haiku 4.5 as well, if not this really is a solid step in terms of efficiency, but I have not done any longer task testing yet.

That being said, Anthropic once again has a rather severe issue it seems, casting a shadow upon this release. From what I am seeing and others are reporting, Claude Code currently does count Haiku 4.5 usage the same as Sonnet 4.5 usage, despite the latter being significantly more expensive. They also did not yet update the Claude Code support pages to reflect the new models usage limits [0]. I really think such information should be public by launch day and hope they can improve their tooling and overall testing, it really continues to throw a shadow over their impressive models.

[0] https://support.claude.com/en/articles/11145838-using-claude...

qingcharles · 4 months ago

It's insanely fast. I didn't know it had even been released, but I went to select the copilot SWE test model in VSCode and it was missing and Haiku 4.5 was there instead. I asked for a huge change to a web app and the output from Haiku scrolled the text faster than Windows could keep up. From a cold start. Wrote a huge chunk of code in about 40 seconds. Unreal.

p.s. it also got the code 100% correct on the one-shot p.p.s. Microsoft are pricing it out at 30% the cost of frontier models (e.g. Sonnet 4.5, GPT5)

katchu11 · 4 months ago

Hey! I work on the Claude Code team. Both PAYG and Subscription usage look to be configured correctly in accordance with the price for Haiku 4.5 ($1/$5 per M I/O tok).

Feel free to DM me your account info on twitter (https://x.com/katchu11) and I can dig deeper!

rbitar · 4 months ago

Where do you get the 220 token/second? Genuinely curious as that would be very impressive for a model comparable to sonnet 4. OpenRouter currently publishing around 116/tps[1]

[1] https://openrouter.ai/anthropic/claude-haiku-4.5

virtualritz · 4 months ago

> Everyone believes their task requires no less than Opus it seems after all.

I have solid evidence that it does. I have been using Opus daily, locally and on Terragonlabs for Rust work since June (on Max plan) and now, since a bit more than a week, being forced to use Sonnet 4.5 most of the time. Because of [1] (see also my comments there, same handle as HN).

Letting Sonnet do tasks on Terry, unsupervised is kinda useless as the fixes I have to do afterwards eat the time I saved giving it the task in the first place.

TLDR; Sonnet 4.5 sucks, compared to Opus 4.1. At least for the type of work I do.

Because of the recent Opus use restrictions Anthropic introduced on Max I use Codex to planning/eval/back and forth (detailed) and then Sonnet for writing code. And then Opus for the small ~5h window each week to "fix" what Sonnet wrote.

I.e. turn its code from something that compiles and passes tests, mostly, into canonical, DRY, good Rust code that passes all tests.

Also: for simpler tasks Opus-generated Rust code felt like I needed to glance at it when reviewing. Sonnet-generated Rust code requires line-by-line full-focus checking as a matter of fact.

[1] https://github.com/anthropics/claude-code/issues/8449

sothatsit · 4 months ago

This is an interesting perspective to me. For my work, Sonnet 4.5 is almost always better than Opus 4.1. Opus might still have a slight edge when it comes to complex edge-cases or niche topics, but that's about it.

And this is coming from someone who used to use Opus exclusively over Sonnet 4, as I found it was better in pretty much all ways other than speed. I no longer believe that with Sonnet 4.5. So, it is interesting to hear that there may still be areas where Opus wins. But I would definitely say that this does not apply to my work in working on bash scripts, web dev, and work in a C codebase. I am loving using Sonnet 4.5.

Topfi · 4 months ago

Could have phrased that a bit better, but I did mean that while there are use cases in which the delta between Haiku, Sonnet, Opus or another providers model are clear, this is not the case for every task.

In my experience, yes, Opus 4 and 4.1 are significantly more reliable for providing C and Rust code. But just because that is the case, doesn't mean these should be the models everyone reaches for. Rather we should make a judgement based on use case and for simpler coding tasks, with a focus on Typescript, the delta between Sonnet 4.5 and Opus 4.1 (still to early to verifiably throw Haiku 4.5 in the ring) is not big enough in my testing to justify consistently reaching for the latter over the former.

This issue has been exacerbated by the recent performance degradations across multiple Sonnet and Opus models, during which many users switched between the two in an attempt to rectify the issue. Because the issue was sticky (once it affected a user it was likely to continue to do so due to the backend setup), some users saw a significant jump switching from e.g. Sonnet 4.5 to Opus 4.1 in performance, leading them to conclude that what they were doing most require the Opus model, despite their tasks not justifying such if Sonnet hadn't been degraded.

Did not comment on that while it was going on as I was fortunate enough not to be affected and thus could not replicate it, but it was clear that something was incorrect as the prompts and output those with degraded performance encountered were commonly shared and I could verify to my satisfaction that this was not merely bad prompting on their part. In any case, this experience strengthened some in believing their project that may be served equally well with e.g. Sonnet 4.5 in its now fixed state does necessitate Opus 4.1, which leads to them not benefiting from the better pricing. With Haiku being an even cheaper (and in the eyes of some automatically worse) model and Haikus past version not being very performant in any coding tasks, this may lead a lot to forgoing it out of default

Lastly, lest we forget, I think it is fair to say that the delta between the most into the weeds and the least informed Rust and React+TS developers ("vibe coding" completely off to the side) is very different.

There are amazing TS devs, incredibly knowledgeable and truly capable, which will take the time and have the interest to properly evaluate and select tools, including models based on their experience and needs. And there will be TS devs who just use this as a means to create a product, are not that experienced, tend to ask a model to "setup vite projet superthink" rather than run the command, reinvent TDD regularly as if solid practices where something only needed for LLM assistance and may just continue to use Opus 4.1 because during a few week window people said it was better, even if they may have started their project after the degradation had already been fixed. Path dependents, doing things, because others did them, so we just continue doing them ...

The average Rust or (even more so) C dev I think it is fair to say will have a more comprehensive understanding and I'd argue it less likely to choose e.g. Opus over Sonnet simply because they "believe" that is what they need. Like you, they will do a fair evaluation and then make an informed rather than a gut decision.

The best devs in any language are likely not that dissimilar in the experience and care with which they can approach new tooling (if they are so inclined which is a topic for another day), but the less skilled devs are likely very different in this regard depending on the language.

Essentially, was a bit hyperbole and never meant to apply to literally every dev in every situation regardless of their tech stack, skill or willingness to evaluate. Anyone who tests models consistently on their specific needs and goes for what they have the most consistent success with, over simply selecting the biggest, most modern or most expensive for every situation, is an exception to that overly broad statement.

larodi · 4 months ago

Been waiiting for the Haiku update as I still do a lot of dumb work with the old one, and it is darrn cheap for what you get out of it with smart prompting. Very neat they finally release this, updating all my bots... sorry agents :)

Deleted Comment

deadbabe · 4 months ago

Those numbers don’t mean anything without average token usage stats.

distalx · 4 months ago

Exactly, token per dollar rates are useful, but without knowing the typical input output token distribution for each model on this specific task, the numbers alone don’t give a full picture of cost.

Topfi · 4 months ago

Fair point of course and it is still far to early to make a definitive statement, but in my still limited experience throughout the night, I have seen Haiku 4.5 be far better in using what I'd consider a justifiable amount of input tokens over e.g. GPT-5 models. Sonnets recent versions also had been better on this front over OpenAIs current best, but I try (not always succeed) to take prior experience and expectation out of the equation when evaluating models.

Additionally, the AA cost to run benchmark suite numbers are very encouraging [0] and Haiku 4.5 without reasoning is always an option too. Tested that even less, but there is some indication that reasoning may not be necessary for reasonable output performance [1][2][3].

In retrospect, I perhaps would have been served better starting with "reasoning" disabled, will have to do some self-blinded comparisons between model outputs over the coming weeks to rectify that. Am trying my best not to make a judgement yet, but compared to other recent releases, Haiku 4.5 has a very interesting, even distribution.

GPT-5 models were and continue to be encouraging for price/performance with a reliable 400k window and good adherence to prompts with multi minute (beyond 10) adherence, but from the start weren't the fastest and ingests every token there is in a code base with reckless abandon.

No Grok model ever performed for me like they seem to during the initial hype

GLM-4.6 is great value but still not solid enough for tool calls, not that fast, etc. so if you can afford something more reliable I'd go for that, but encouraging.

Recent Anthropic releases were good at code output quality, but not as reliable beyond 200k vs GPT-5, not exactly fast either when looking at token/sec, though task completion generally takes less time due to more efficient ingestion vs GPT-5 and of course rather expensive.

Haiku 4.5, if they can continue to offer it at such speeds with such low latency and at this price, cupeled with encouraging initial output quality and efficient ingestion of repos seems to be designed in a far more balanced manner, which I welcome. Course with 200k being a hard limit, that is a clear downside compared to GPT-5 (and Gemini 2.5 Pro though that has its own reliability issues in tool calling) and I have yet to test whether it can go beyond 8 min on chains of tool calls with intermittent code changes without suffering similar degradation to other recent Anthropic models, but I am seeing the potential for solid value here.

[0] https://artificialanalysis.ai/?models=gpt-5-codex%2Cgpt-5-mi...

[1] Claude 4.5 Haiku 198.72 tok/sec 2382 tokens Time-to-First: 1.0 sec https://t3.chat/share/35iusmgsw9

[2] Claude 4.5 Haiku 197.51 tok/sec 3128 tokens Time-to-First: 0.91 sec https://t3.chat/share/17mxerzlj1

[3] Claude 4.5 Haiku 154.75 tok/sec 2341 tokens Time-to-First: 0.50 sec https://t3.chat/share/96wfkxzsdk

Ain't nobody got time to pick models and compare features. It's annoying enough having to switch from one LLM ecosystem to another all the time due to vague usage restrictions. I'm paying $20/mo to Anthropic for Claude Code, to OpenAI for Codex, and previously to Cursor for...I don't even know what. I know Cursor lets you select a few different models under the covers, but I have no idea how they differ, nor do I care.

I just want consistent tooling and I don't want to have to think about what's going on behind the scenes. Make it better. Make it better without me having to do research and pick and figure out what today's latest fashion is. Make it integrate in a generic way, like TLS servers, so that it doesn't matter whether I'm using a CLI or neovim or an IDE, and so that I don't have to constantly switch tooling.

edmundsauto · 4 months ago

I don’t mean this with snark, but with age. It’s actually totally cool to not upgrade and then you have stability in your tooling.

I bet there is some hella good art being made with Photoshop 6.0 from the 90s right now.

The upgrade path is like the technical hedonistic treadmill. You don’t have to upgrade.

caymanjim · 4 months ago

Almost all my tooling is years (or decades) old and stable. But the code assistant LLM scene effectively didn't exist in any meaningful way until this year, and it changes almost daily. There is no stability in the tooling, and you're missing out if you don't switch to newer models at least every few weeks right now. Codex (OpenAI/ChatGPT CLI) didn't even exist a month ago, and it's a contender for the best option. Claude Code has only been out for a few months.

I use Neovim in tmux in a terminal and haven't changed my primary dev environment or tooling in any meaningful way since switching from Vim to Neovim years ago.

I'm still changing code AIs as soon as the next big thing comes out, because you're crippling yourself if you don't.

CuriouslyC · 4 months ago

There is in fact a shit ton of hella good art being made with Photoshop 6, because it actually has fair feature parity in terms of what people actually use (content aware fill and puppet warp being the main missing features) while being really easy to crack, so it's a common version for people to install in third world countries. Photoshop has been enshittified for about 20 years though.

1659447091 · 4 months ago

> Ain't nobody got time to pick models and compare features. ... Make it integrate in a generic way, ... , so that it doesn't matter whether I'm using a CLI or neovim or an IDE, and so that I don't have to constantly switch tooling.

I use GitHub Copilot Pro+ because this was my main requirement as well.

Pro+ has the new models as they come out -- actually just enabled Claude Haiku 4.5 for selection availability. I have not yet had a problem with running out of the premium allowance, but from reading how others use these, I am also not a power user type.

I have not yet the CLI version, but it looks interesting. Before the Intellij plugin improved, I would switch to VS Code to run a certain types of prompt then switch back after without issues. The web version has the `Spaces` thing that I find useful for niche things.

I have no idea how it compares to the individual offerings, and based on previous hn threads here, there was a lot of hate for gh copilot. So maybe it's actually terrible and the individual version are lightyears ahead -- but it stays out of my way until I want it and it does its job well enough for my use.

benjiro · 4 months ago

> I use GitHub Copilot Pro+ because this was my main requirement as well.

Frankly, i do not even get how people run out of 1500 requests. For a heavy coding session, my max is around 45 requests per day, and that means a ton of code / alterations and some wasted on fluff mini changes. Most days is barely in the 10 a 20.

I noticed that you can really eat your requests if you just do not care to switch models for small tasks, or constantly do edit/ask. When you rely on agent mode, it can edit multiple files at the same time, so your always saving tokens vs doing it yourself manually.

To be honest, i wish that Copilot had a 600 token version, instead of the massive jump to 1500. Other option is to just use the pay per request.

* Cheapest is Pro+, 1500 requests , year paid, at around 1.8cent / request * The 300 requests Pro, year paid is around 2.4cent / request. * The overflow tokens (so without subscription) is at 4 cent / request.

Note: The Pro and Pro+ prices assume you use 100% of you tokens. If you only use 700 tokens on the Pro+, your paying the same as the overflow 4 cent / request one.

So ironically, you are actually cheaper with a Pro (300 requests ) subscription, for the first 300, and then paying 4 cent / request between your 301 ~ 700...

osn9363739 · 4 months ago

Even if you pick one. First it's prompt driven development, then context driven. Then you should use a detailed spec. But no, now it's better to talk to it like a person/have a conversation. Hold up why are you doing that you should be doing example driven. Look, I get that they probably all have their place, but since there isn't consensus on any of this, it's next to impossible to find good examples. Some one posted a reply to me on a old post and called it bug-driven development and that stuck with me. You get it to do something (any way) and then you have to fix all the bugs and errors.

solumunus · 4 months ago

Work it out brother. If you can learn to code at a good level then you should be able to learn how to increase your productivity with LLM’s. When, where and how to use them is the key.

I don’t think it’s appreciated enough how valuable having a structured and consistent architecture combined with lots of specific custom context. Claude knows how my integration tests should look, it knows how my services should look, what dependencies they have and how they interact with the database. It knows my entire DB schema with all foreign key relationships. If I’m starting a new feature I can have it build 5 or 6 services (not without first making suggestions on things I’m missing) with integration tests, with raw sql all generated by Claude, and run an integration test loop until the services are doing what they should. I rarely have to step in and actually code. It shines for this use case and the productivity boost is genuinely incredible.

Other situations I know doing it myself will be better and/or quicker than asking Claude.

bapak · 4 months ago

> Ain't nobody got time to pick models and compare features

Then don't? Seems like a weird thing to complain about.

I just use whatever's available. I like Claude for coding and ChatGPT for generic tasks, that's the extent of my "pick and compare"

boringg · 4 months ago

I think its a valid complaint. Who wants to constantly spend overhead on maintaining what's current without clear definitions and adding uncertainty to your tooling. It's a total PITA.

mark_l_watson · 4 months ago

Right on, for code examples for my writing and my own ‘gentleman scientist’ experiments I stick with gemini-cli and codex.

For play time, I literally love experimenting with small local models. I am an old man, and I have always liked tools that ‘make me happy’ while programming like Emacs, Lisp languages, and using open source because I like to read other people’s code. But, for getting stuff done, for now gemini-cli and codex hit a sweet spot for me.

muzani · 4 months ago

I love haiku 4.5, but you don't need it. It's like a motorcycle. Feels good, but doesn't do the heavy lifting.

Cursor has an auto mode for exactly your situation - it'll switch to something cost effective enough, fast enough, consistent enough, new enough. Cursor is on the ball most of the time and you're not stuck with degraded performance from OpenAI or Anthropic.

8n4vidtmkvmk · 4 months ago

They're working on all that. I think "ACP" is supposed to be the answer. Then you can use the models in your IDEs, and they can all develop against the same spec so it'll be easy to pop into whatever model.

Gpt 5 is supposed to cleverly decide when to think harder.

But ya we're not there yet and I'm tired of it too, but what can you do.

dmvinson · 4 months ago

This is what opencode does for me. One harness for all models, standardized TUI, and they're rolling out a product to serve models via API with one bill through them

PhilippGille · 4 months ago

One option: Use OpenRouter [1] with the `openrouter/auto` model [2], which will pick among GPT-5, Gemini 2.5 Pro, Claude Sonnet 4.5 and similar.

[1] https://openrouter.ai/

[2] https://openrouter.ai/openrouter/auto

shrubble · 4 months ago

We’re in the stage where the 8080,8085, Z80, 6502 and 6809 CPUs are all in the market, and the relevant buses are S100, with other buses not yet standardized.

You either live with what you’re using or you change around and fiddle with things constantly.

jbentley1 · 4 months ago

You can use Crystal (https://github.com/stravu/crystal) to run Codex and Claude Code at the same time and just pick the best result.

behnamoh · 4 months ago

Ain't nobody got time and money to run multiple agents at the same time

kissgyorgy · 4 months ago

If you don't want to upgrade and follow model development so much, I would just pay one provider and stick with them.

This model worth knowing about, because it's 3x cheaper and 2x faster than the previous Claude model.

schmookeeg · 4 months ago

I use OpenRouter for similar reasons -- half to avoid lock-in, and the other half to reduce the switching pain, which is just a way to say "if I do get locked in, I want to move easily"

hansmayer · 4 months ago

Amen. Why dont they just release updates to the current models?

ygouzerh · 4 months ago

Github Copilot could help you, you can switch model from different providers on the fly (supports Anthropic, OpenAI, Grok,...)

tiberriver256 · 4 months ago

VSCode + the new "Auto" model probably worth a shot for this

rldjbpin · 4 months ago

as mentioned already by the others, using opencode [1] helps with this, if you like the cli workflow. it is good enough and does not need to exceed what the leaders are doing.

when combined with the ability to use github copilot to make the llm calls, i can play with almost any provider i need. also helps if you get its access through your work or school.

for example, Haiku is already offered by them and costs a third in credits.

[1] https://github.com/sst/opencode

UncleOxidant · 4 months ago

> annoying enough having to switch from one LLM ecosystem to another all the time due to vague usage restrictions

I use KiloCode and what I find amazing is that it'll be working on a problem and then a message will come up about needing to topup the money in my account to continue (or switch to a free model), so I switch to a free model (currently their Code Supernova 1million context) and it doesn't miss a beat and continues working on the problem. I don't know how they do this. It went from using a Claude Sonnet model to this Code Supernova model without missing a beat. Not sure if this is a Kilocode thing or if others do this as well. How does that even work? And this wasn't a trivial problem, it was adding a microcode debugger to a microcoded state machine system (coding in C++).

qsort · 4 months ago

Models are stateless, why would that not work?

solumunus · 4 months ago

This really seems like a you problem.

zone411 · 4 months ago

I've benchmarked it on the Extended NYT Connections (https://github.com/lechmazur/nyt-connections/). It scores 20.0 compared to 10.0 for Haiku 3.5, 19.2 for Sonnet 3.7, 26.6 for Sonnet 4.0, and 46.1 for Sonnet 4.5.

whatreason · 4 months ago

This is such a cool benchmark idea, love it

Do you have any other cool benchmarks you like? Especially any related to tools

shangofox · 4 months ago

You could try wordle on it. But from my own experience all of them are pretty bad. They're not smart enough to pick up the colours represented as letters. The only one that actually was good was Qwen surprisingly.

simonw · 4 months ago

Pretty cute pelican on a slightly dodgy bicycle: https://tools.simonwillison.net/svg-render#%3Csvg%20viewBox%...

ziofill · 4 months ago

Gemini Pro initially refused (!) but it was quite simple to get a response:

> give me the svg of a pelican riding a bicycle

> I am sorry, I cannot provide SVG code directly. However, I can generate an image of a pelican riding a bicycle for you!

> ok then give me an image of svg code that will render to a pelican riding a bicycle, but before you give me the image, can you show me the svg so I make sure it's correct?

> Of course. Here is the SVG code...

(it was this in the end: https://tinyurl.com/zpt83vs9)

b7894 · 4 months ago

Gemini 3.0 Pro (or what is deemed to be 3.0 Pro - you can get access to it via A/B testing on AI Studio) does a noticeably better job

https://x.com/cannn064/status/1972349985405681686

https://x.com/whylifeis4/status/1974205929110311134

https://x.com/cannn064/status/1976157886175645875

hnuser123456 · 4 months ago

"create svg code that will create an image of svg code that will create a pelican riding a bicycle"

https://chatgpt.com/share/68f0028b-eb28-800a-858c-d8e1c811b6...

(can be rendered using simon's page at your link)

ru552 · 4 months ago

I like this workflow

actionfromafar · 4 months ago

What is dada?

btown · 4 months ago

Context on this cutting-edge benchmark for those unaware:

https://simonwillison.net/2025/Jun/6/six-months-in-llms/

https://simonwillison.net/tags/pelican-riding-a-bicycle/

Full verbose documentation on the methodology: https://news.ycombinator.com/item?id=44217852

As added context to ensure no benchmark gaming, here a quite impressive Shitaki Mushroom riding a rowboat: https://imgur.com/Mv4Pi6p

Prompt: https://t3.chat/share/ptaadpg5n8

Claude 4.5 Haiku (Reasoning High) 178.98 token/sec 1691 tokens Time-to-First: 0.69 sec

As a comparison, here Grok 4 Fast, which is one of worst offenders I have encountered in doing very good with a Pelican Bicycle, yet not with other comparable requests: https://imgur.com/tXgAAkb

Prompt: https://t3.chat/share/dcm787gcd3

Grok 4 Fast (Reasoning High) 171.49 token/sec 1291 tokens Time-to-First: 4.5 sec

And GPT-5 for good measure: https://imgur.com/fhn76Pb

Prompt: https://t3.chat/share/ijf1ujpmur

GPT-5 (Reasoning High) 115.11 tok/sec 4598 tokens Time-to-First: 4.5 sec

These are very subjective, naturally, but I personally find Haiku with those spots on the mushroom rather impressive overall. In any case, the delta between publicly known benchmark and modified scenarios evaluating the same basic concepts continues to be smallest with Anthropic models. Heck, sometimes I've seen their models outperform what public benchmarks indicated. Also, seems Time-to-first on Haiku is another notable advantage.

bradgessler · 4 months ago

I’m surprised none of the frontier model companies have thrown this test in as an Easter egg.

CjHuber · 4 months ago

Because then they would have to admit that they try to game benchmarks

ahofmann · 4 months ago

simonw has other prompts, that are undisclosed. So cheating on this prompt will be catched.

jgalt212 · 4 months ago

OpenAI / Bing admit it's in its knowledge base.

are you aware of the pelican on a bicycle test?

Yes — the "Pelican on a Bicycle" test is a quirky benchmark created by Simon Willison to evaluate how well different AI models can generate SVG images from prompts.

HDThoreaun · 4 months ago

All of hacker news(and simons blog) is undoubtedly in the training data for LLMs. If they specifically tried to cheat at this benchmark it would be obvious and they would be called out

basch · 4 months ago

Have you noticed image generation models tend to really struggle with the arms on archers. Could you whip up a quick test of some kind of archer on horseback firing a flaming arrow at a sailing ship in a lake, and see how all the models do?

Looks very uncomfortable to the bird.

nichochar · 4 months ago

i knew simon would be top comment. it's not an empirical law

bobson381 · 4 months ago

imagine finding the full text of the svg in the library of babel. Great work!

steveklabnik · 4 months ago

I am really interested in the future of Opus; is it going to be an absolute monster, and continue to be wildly expensive? Or is the leap from 4 -> 4.5 for it going to be more modest.

criemen · 4 months ago

Technically, they released Opus 4.1 a few weeks ago, so that alone hints at a smaller leap from 4.1 -> 4.5, compared to the leap from Sonnet 4 -> 4.5. That is, of course, if those version numbers represent anything but marketing, which I don't know.

I had forgotten that, given that Sonnet pretty much blows Opus out of the water these days.

Yeah, given how multi-dimensional this stuff is, I assume it's supposed to indicate broad things, closer to marketing than anything objective. Still quite useful.

mcintyre1994 · 4 months ago

Bizarrely they already call Opus 4.1 “legacy brainstorming model”.

sharkjacobs · 4 months ago

My impression is that Sonnet and Haiku 4.5 are the same "base models" as Sonnet and Haiku 4, the improvements are from fine tuning on data generated by Opus.

I'm a user who follows the space but doesn't actually develop or work on these models, so I don't actually know anything, but this seems like standard practice (using the biggest model to finetune smaller models)

Certainly, GPT-4 Turbo was a smaller model than GPT-4, there's not really any other good explanation for why it's so much faster and cheaper.

The explicit reason that OpenAI obfuscates reasoning tokens is to prevent competitors from training their own models on them.

qudat · 4 months ago

These frontier model companies are bootstrapping their work by using models to improve models. It’s a mechanism to generate fake training data. The rationale is the teacher model is already vetted and aligned so it can reliably “mock” data. A little human data gets amplified.

Which is all to say that I think the reason they went from Opus 3 to Opus 4 is because there was no bigger model to fine tune Opus 3.5 with.

And I would expect Opus 4 to be much the same.

gwd · 4 months ago

Opus disappeared for quite a while and then came back. Presumably they're always working on all three general sizes of models, and there's some combination of market need and model capabilities which determine if and when they release any given instance to the public.

dheera · 4 months ago

I wonder what the next smaller model after Haiku will be called. "Claude Phrase"?

senko · 4 months ago

Claude Glyph.

Smallest, fastest model yet, ideally suited for Bash oneliners and online comments.

It's interesting to think about various aspects of marketing the models, with ChatGPT going the "internal router" direction due to address the complexity of choosing. I'd never considered something smaller than Haiku to be needed, but I also rarely used Haiku in the first place...

WalterSear · 4 months ago

Claude from Nantucket

dotancohen · 4 months ago

If they do come up with a tiny model tuned for generating conversion and code, I think that Claude Acronym would be a perfect name.

Brendinooo · 4 months ago

Claude Couplet

fnordsensei · 4 months ago

Claude Garden Path Sentence

entanglr · 4 months ago

Claude Punchline

grandpa · 4 months ago

Claude Clause.

stavros · 4 months ago

Claude Groan.

u8080 · 4 months ago

Claude Banger

chrisweekly · 4 months ago

Claude Char

clbrmbr · 4 months ago

Claude Koan

Razengan · 4 months ago

Claude .

quentin-smr · 4 months ago

Comparing haiku and sonnet for a question needing a code doc fetch:

haiku https://claude.ai/share/8a5c70d5-1be1-40ca-a740-9cf35b1110b1 sonnet https://claude.ai/share/51b72d39-c485-44aa-a0eb-30b4cc6d6b7b

haiku invented the output of a function and gave a bad answer. sonnet got it right

minimaxir · 4 months ago

$1/M input tokens and $5/M output tokens is good compared to Claude Sonnet 4.5 but nowadays thanks to the pace of the industry developing smaller/faster LLMs for agentic coding, you can get comparable models priced for much lower which matters at the scale needed for agentic coding.

Given that Sonnet is still a popular model for coding despite the much higher cost, I expect Haiku will get traction if the quality is as good as this post claims.

Bolwin · 4 months ago

With caching that's 10 cents per million in. Most of the cheap open source models (which this claims to beat, except glm 4.6) have limited and not as effective caching.

This could be massive.

Tiberium · 4 months ago

The funny thing is that even in this area Anthropic is behind other 3 labs (Google, OpenAI, xAI). It's the only one out of those 4 that requires you to manually set cache breakpoints, and the initial cache costs 25% more than usual context. The other 3 have fully free implicit caching. Although Google also offers paid, explicit caching.

https://docs.claude.com/en/docs/build-with-claude/prompt-cac...

https://ai.google.dev/gemini-api/docs/caching

https://platform.openai.com/docs/guides/prompt-caching

https://docs.x.ai/docs/models#cached-prompt-tokens

logicchains · 4 months ago

$1/M is hardly a big improvement over GPT5's $1.250/M (or Gemini Pro's $1.5/M), and given how much worse Haiku is than those at any kind of difficult problem (or problems with a large context size), I can't imagine it being a particularly competitive alternative for coding. Especially for anything math/logic related, I find GPT5 and Gemini Pro to be significantly better even than Opus (which reflects in their models having won Olympiad prizes while Anthropic's have not).

justinbaker84 · 4 months ago

I am a professional developer so I don't care about the costs. I would be willing to pay more for 4.5 Haiku vs 4.5 Sonnet because the speed is so valuable.

I spend way to much time waiting for the cutting edge models to return a response. 73% on SWE Bench is plenty good enough for me.

jhancock · 4 months ago

How do you review code when the LLM can produce so much so fast?

Yeah, I'm a bit disappointed by the price. Claude 3.5 Haiku was $0.8/$4, 4.5 Haiku is $1/$5.

I was hoping Anthropic would introduce something price-competitive with the cheaper models from OpenAI and Gemini, which get as low as $0.05/$0.40 (GPT-5-Nano) and $0.075/$0.30 (Gemini 2.0 Flash Lite).

diwank · 4 months ago

I am a bit mind boggled by the pricing lately, especially since the cost increased even further. Is this driven by choices in model deployment (unquantized etc) or simply by perceived quality (as in 'hey our model is crazy good and we are going to charge for it)?

odie5533 · 4 months ago

There's probably less margin on the low end, so they don't want to focus on capturing it.

rudedogg · 4 months ago

This also means API usage through Claude Code got more expensive (but better if benchmarks are to be believed)

85392_school · 4 months ago

System card: https://assets.anthropic.com/m/99128ddd009bdcb/original/Clau... (edit: discussed here https://news.ycombinator.com/item?id=45596168)

This is Anthropic's first small reasoner as far as I know.