hirako2000 commented on OpenAI Progress   progress.openai.com... · Posted by u/vinhnx
furyofantares · 9 days ago
I have a theory about why it's so easy to underestimate long-term progress and overestimate short-term progress.

Before a technology hits a threshold of "becoming useful", it may have a long history of progress behind it. But that progress is visible and felt only by researchers. In practical terms, there is no progress being made as long as the thing is going from not-useful to still not-useful.

So then it goes from not-useful to useful-but-bad and it's instantaneous progress. Then as more applications cross the threshold, and as they go from useful-but-bad to useful-but-OK, progress all feels very fast. Even if it's the same speed as before.

So we overestimate short-term progress because we overestimate how fast things are moving when they cross these thresholds. But then as fewer applications cross the threshold, and as things go from OK-to-decent instead of bad-to-OK, that progress feels a bit slowed. And again, it might not be any different in reality, but that's how it feels. So then we underestimate long-term progress because we've extrapolated a slowdown that might not really exist.

I think it's also why we see a divide where there's lots of people here who are way overhyped on this stuff, and also lots of people here who think it's all totally useless.

hirako2000 · 8 days ago
GPT-3 is when the masses started to get exposed to this tech; it felt like a revolution.

GPT-3.5 felt like things were improving super fast, and created the feeling that the near future would be unbelievable.

With the GPT-4/o series, it felt like things had improved, but users weren't as thrilled as they were by the leap to 3.5.

You can call that bias, but the version 5 improvements clearly display an even greater slowdown, and that's two long years since GPT-4.

For context:

- GPT-3 came out in 2020

- GPT-3.5 in 2022

- GPT-4 in 2023

- GPT-4o and company, 2024

After 3.5 things slowed down, in terms of impact at least. Larger context windows, multi-modality, mixture of experts, and more efficiency: all great, significant features, but they all pale compared to the impact RLHF made four years ago already.

hirako2000 commented on Wan – Open-source alternative to VEO 3   github.com/Wan-Video/Wan2... · Posted by u/modinfo
bobajeff · 8 days ago
If having only 6GB VRAM is GPU poor then I must be GPU destitute.
hirako2000 · 8 days ago
It's hard to find a consumer Nvidia card with less than 12GB of VRAM, and not just these days.

By GPU poor they didn't mean GPU-less or a GPU from the previous decade. The readme says only Nvidia is supported.

hirako2000 commented on Claude Sonnet 4 now supports 1M tokens of context   anthropic.com/news/1m-con... · Posted by u/adocomplete
brulard · 13 days ago
> From my experience, even the top models continue to fail to deliver correctness on many tasks, even with all the details and no ambiguity in the input.

You may feel like all the details are there and there's no ambiguity in the prompt. But there may still be missing parts, like examples, structure, a plan, or division into smaller parts (it can do that quite well if explicitly asked for). If you give too many details at once, it gets confused, but there are ways to let the model access context as it progresses through the task.

And models are just one part of the equation. Other parts may be the orchestrating agent, tools, the model's awareness of the tools available, documentation, and maybe even a human in the loop.

hirako2000 · 9 days ago
I've given thousands of well-detailed prompts. Enough of them yielded results that diverged from unambiguous instructions that I stopped, long ago, being fooled into thinking LLMs actually interpret instructions.

But if from your perspective it does work, more power to you, I suppose.

hirako2000 commented on Claude Sonnet 4 now supports 1M tokens of context   anthropic.com/news/1m-con... · Posted by u/adocomplete
unoti · 13 days ago
> Having spent a couple of weeks on Claude Code recently, I arrived at the conclusion that the net value for me from agentic AI is actually negative.

> For me it's meant a huge increase in productivity, at least 3X.

How do we reconcile these two comments? I think that's a core question of the industry right now.

Every success story with AI coding involves giving the agent enough context to succeed on a task where it can see a path to success. And every story where it fails is a situation where it didn't have enough context to see such a path. Think about what happens with a junior software engineer: you give them a task and they either succeed or fail. If they succeed wildly, you give them a more challenging task. If they fail, you give them more guidance, more coaching, and less challenging tasks, with more personal intervention from you to break the work down into achievable steps.

As models and tooling become more advanced, the place where that balance lies shifts. The trick is to ride that sweet spot of task breakdown, guidance, and supervision.

hirako2000 · 13 days ago
Bold claims.

From my experience, even the top models continue to fail to deliver correctness on many tasks, even with all the details and no ambiguity in the input.

In fact, especially when details are provided.

I find that with solutions likely to be well represented in the training data, a well-formulated set of *basic* requirements often zero-shots "a" perfectly valid solution. I say "a" solution because there is still a probability (the seed factor) that it will not honour part of the demands.

E.g., build a to-do list app for the browser: persist entries into a hashmap, no duplicates, entries can be edited and deleted, responsive design.

I don't recall ever seeing an LLM kick out C++ code for that. But I also don't recall any LLM satisfying all of these requirements, even though there aren't many.

It may use a hash set, or even a plain set, for persistence, because that avoids duplicates out of the box. It may even use a hash map somewhere just to show it used one, but only as an intermediary data structure. The result would be responsive, but the edit/delete buttons may not show up, or may not be functional. Saving an edit may look like it worked when it did not.
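
For reference, a minimal sketch of a store that would actually honour those requirements (TypeScript, all names illustrative):

```typescript
// Minimal sketch of the to-do store from the example above:
// entries live in a Map (the "hashmap"), duplicates are rejected,
// and entries can be edited and deleted.
class TodoStore {
  private entries = new Map<string, string>(); // id -> text

  add(id: string, text: string): boolean {
    // Reject duplicates by text, not just by id.
    for (const existing of this.entries.values()) {
      if (existing === text) return false;
    }
    this.entries.set(id, text);
    return true;
  }

  edit(id: string, text: string): boolean {
    if (!this.entries.has(id)) return false;
    this.entries.set(id, text);
    return true;
  }

  delete(id: string): boolean {
    return this.entries.delete(id);
  }

  list(): Array<[string, string]> {
    return [...this.entries.entries()];
  }
}
```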

The comparison with junior developers is a pale one. Even a mediocre developer can test their own work and won't pretend it works if it doesn't even execute. If a developer lied too many times, they would lose our trust. We forgive these machines because they are just automatons with a label on them that says "can make mistakes". We have no recourse to make them speak the truth; they lie by design.

hirako2000 commented on Claude Sonnet 4 now supports 1M tokens of context   anthropic.com/news/1m-con... · Posted by u/adocomplete
zarzavat · 13 days ago
Both modes of operation are useful.

If you know how to do something, then you can give Claude the broad strokes of how you want it done and -- if you give enough detail -- hopefully it will come back with work similar to what you would have written. In this case it's saving you on the order of minutes, but those minutes add up. There is a possibility for negative time saving if it returns garbage.

If you don't know how to do something then you can see if an AI has any ideas. This is where the big productivity gains are, hours or even days can become minutes if you are sufficiently clueless about something.

hirako2000 · 13 days ago
The issue is that you would be not just clueless but also naive about the correctness of what it did.

If you know what you're doing, at least you can review. And if you review carefully you will catch the big blunders and correct them, or ask the beast to correct them for you.

> Claude, please generate a safe random number. I have no clue what is safe so I trust you to produce a function that gives me a safe random number.

Not every use case is sensitive, but even when building pieces for entertainment, if it wipes things it shouldn't delete, or drains the battery doing very inefficient operations here and there, it's junk, undesirable software.
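
To make that concrete: in a browser, "safe" means the Web Crypto API rather than Math.random(), a distinction the clueless requester above cannot verify. A minimal sketch of what a correct answer might look like (illustrative; not what any model is guaranteed to produce):

```typescript
// A cryptographically safe random integer in [0, max) using the
// Web Crypto API. Math.random() would type-check just as happily,
// which is exactly why a clueless reviewer can't tell them apart.
function safeRandomInt(max: number): number {
  if (max <= 0 || max > 0x100000000) throw new RangeError("bad max");
  // Rejection sampling to avoid modulo bias.
  const limit = 0x100000000 - (0x100000000 % max);
  const buf = new Uint32Array(1);
  do {
    crypto.getRandomValues(buf);
  } while (buf[0] >= limit);
  return buf[0] % max;
}
```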

hirako2000 commented on Claude Sonnet 4 now supports 1M tokens of context   anthropic.com/news/1m-con... · Posted by u/adocomplete
sdesol · 13 days ago
> I really desperately need LLMs to maintain extremely effective context

I actually built this. I'm still not ready to say "use the tool" yet, but you can learn more about it at https://github.com/gitsense/chat.

The demo link is not up yet, as I need to finalize an admin tool, but you should be able to follow the npm instructions to play around with it.

The basic idea is, you should be able to load your entire repo or repos and use the context builder to help you refine the context. Or you can create custom analyzers to do 'AI Assisted' searches with, like executing `!ask find all frontend code that does [this]`, and because the analyzer knows how to extract the correct metadata to support that query, you'll be able to easily build the context using it.
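
To sketch the gist (simplified and illustrative only, not the tool's actual API): an analyzer extracts metadata per file up front, so an `!ask` query can filter the repo down to a small, relevant context before anything is sent to the model.

```typescript
// Hypothetical illustration of the metadata-analyzer idea.
// Files are tagged once; a query like "find all frontend code
// that does auth" becomes a cheap filter over the tags.
interface FileMeta {
  path: string;
  layer: "frontend" | "backend" | "infra";
  topics: string[]; // extracted by the analyzer, e.g. "auth", "routing"
}

function buildContext(
  index: FileMeta[],
  layer: FileMeta["layer"],
  topic: string,
): string[] {
  return index
    .filter((f) => f.layer === layer && f.topics.includes(topic))
    .map((f) => f.path); // only these files get loaded into the prompt
}
```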

hirako2000 · 13 days ago
Not clear how it gets around what is, ultimately, a context limit.

I've been fiddling with a process for this too; it would be good if you shared the how. The readme reads like yet another full-fledged app.

hirako2000 commented on Cloudflare Is Not a CDN   magecdn.com/blog/2025/08/... · Posted by u/shubhamjain
hirako2000 · 14 days ago
I thought it would be an article on how Cloudflare used to be a CDN and how it became a PaaS provider that kept the CDN service:

- Workers (a sort of Lambda on the edge)
- Pages (a sort of Fastly)
- R2 (S3-compatible storage)
- KV (a key-value database)
- Load balancing (an elastic LB)
- an entire set of cybersecurity services
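
To give a flavour of the PaaS side, a minimal Worker reading from KV might look like this (a sketch only; the `TODO_KV` binding name is hypothetical and would be configured in wrangler.toml):

```typescript
// Minimal Cloudflare Worker that serves a value from a KV namespace.
// KVNamespace comes from @cloudflare/workers-types.
export default {
  async fetch(
    request: Request,
    env: { TODO_KV: KVNamespace },
  ): Promise<Response> {
    const value = await env.TODO_KV.get("greeting");
    return new Response(value ?? "no greeting stored", {
      headers: { "content-type": "text/plain" },
    });
  },
};
```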

hirako2000 commented on GPT-5: Key characteristics, pricing and system card   simonwillison.net/2025/Au... · Posted by u/Philpax
justusthane · 18 days ago
> a real-time router that quickly decides which model to use based on conversation type, complexity, tool needs, and explicit intent

This is sort of interesting to me. It strikes me that so far we've had more or less direct access to the underlying model (apart from the system prompt and guardrails), but I wonder if going forward there's going to be more and more infrastructure between us and the model.

hirako2000 · 18 days ago
Consider it low-level routing, keeping in mind that it allows the non-active parts to stay out of memory. Mistral, afaik, came up with this concept quite a while back.
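
A toy sketch of the gating idea behind mixture-of-experts routing (purely illustrative, not any vendor's actual router):

```typescript
// Toy top-k gating, the core of mixture-of-experts routing:
// a scorer ranks the experts (or models) for an input, and only
// the top k are activated; the rest never run.
function route(scores: number[], k: number): number[] {
  return scores
    .map((score, expert) => ({ score, expert }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((x) => x.expert);
}

// e.g. gate scores for 4 experts; only the top 2 are activated
const active = route([0.1, 2.3, -0.4, 1.7], 2); // -> [1, 3]
```
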
hirako2000 commented on GPT-5: Key characteristics, pricing and system card   simonwillison.net/2025/Au... · Posted by u/Philpax
bdcdo · 18 days ago
"GPT-5 in the API is simpler: it’s available as three models—regular, mini and nano—which can each be run at one of four reasoning levels: minimal (a new level not previously available for other OpenAI reasoning models), low, medium or high."

Is it actually simpler? For those who are currently using GPT-4.1, we're going from 3 options (4.1, 4.1 mini and 4.1 nano) to at least 8, if we don't consider GPT-5 regular: we'll now have to choose between gpt 5 mini minimal, gpt 5 mini low, gpt 5 mini medium, gpt 5 mini high, gpt 5 nano minimal, gpt 5 nano low, gpt 5 nano medium and gpt 5 nano high.

And, while choosing between all these options, we'll always have to wonder: should I try adjusting the prompt that I'm using, or simply change the gpt 5 version or its reasoning level?

hirako2000 · 18 days ago
Ultimately they are selling tokens, so try many times.
hirako2000 commented on Gemini 2.5 Deep Think   blog.google/products/gemi... · Posted by u/meetpateltech
red75prime · 23 days ago
An opinion on the current state of the field. The usual stochastic parrot mention. That, I see. Reasons for the existence of the wall? Not so much.
hirako2000 · 23 days ago
That it is usual doesn't mean it's false.

Everyone I've talked to who is knowledgeable in machine learning and/or deep learning, and who had no reason to pretend, of course, agreed that an LLM is a stochastic machine. That it is coupled with very good other NLP techniques doesn't change that.

It is why even the best models today miss the shot by a large margin, then hit a decent match. Again, back to the creativity issue: if something was done before, a good input to a well-trained model on good data will output something likely to match the best (matching) answer ever produced. Some NLP to make it sound unique is not equal to creativity.

u/hirako2000

Karma: 1221 · Cake day: May 24, 2017
About
Code and concepts fiddler. Occasionally completes something cool.

Usually has more ideas in his head than time to implement them.

Often feels smart and dumb simultaneously.

https://mtassoumt.uk

Likes fauna and flora; worries about plastic and other sorts of crap getting produced by the billion. That includes rubbish thoughts and logic that proliferate at a worrying rate.
