henriquegodoy (u/henriquegodoy)

henriquegodoy commented on Launch HN: Design Arena (YC S25) – Head-to-head AI benchmark for aesthetics · Posted by u/grace77

henriquegodoy · 4 months ago

This is actually really needed, current ai design tools are so predictable and formulaic, like every output feels like the same purple gradients with rounded corners and that one specific sans serif font that every model seems obsessed with, it's gotten to the point where you can spot ai-generated designs from a mile away because they all have this weird sterile aesthetic that screams "made by a model"

henriquegodoy commented on Show HN: Omnara – Run Claude Code from anywhere github.com/omnara-ai/omna... · Posted by u/kmansm27

henriquegodoy · 4 months ago

This is pretty cool and feels like we're heading in the right direction, the whole idea of being able to hop between devices while claude code is thinking through problems is neat, but honestly what excites me more is the broader pattern here, like we're moving toward a world where coding isn't really about sitting down and grinding out syntax for hours, it's becoming more about organizing tasks and letting ai agents figure out the implementation details.

I can already see how this evolves into something where you're basically managing a team of specialized agents rather than doing the actual coding, you set up some high-level goals, maybe break them down into chunks, and then different agents pick up different pieces and coordinate with each other, the human becomes more like a project manager making decisions when the agents get stuck or need direction, imho tools like omnara are just the first step toward that, right now it's one agent that needs your input occasionally, but eventually it'll probably be orchestrating multiple agents working in parallel, way better than sitting there watching progress bars for 10 minutes.

henriquegodoy commented on Evaluating LLMs playing text adventures entropicthoughts.com/eval... · Posted by u/todsacerdoti

henriquegodoy · 4 months ago

Looking at this evaluation it's pretty fascinating how badly these models perform even on decades old games that almost certainly have walkthroughs scattered all over their training data. Like, you'd think they'd at least brute force their way through the early game mechanics by now, but honestly this kinda validates something I've been thinking about like real intelligence isn't just about having seen the answers before, it's about being good at games and specifically new situations where you can't just pattern match your way out

This is exactly why something like arc-agi-3 feels so important right now. Instead of static benchmarks that these models can basically brute force with enough training data, like designing around interactive environments where you actually need to perceive, decide, and act over multiple steps without prior instructions, that shift from "can you reproduce known patterns" to "can you figure out new patterns" seems like the real test of intelligence.

What's clever about the game environment approach is that it captures something fundamental about human intelligence that static benchmarks miss entirely, like, when humans encounter a new game, we explore, form plans, remember what worked, adjust our strategy all that interactive reasoning over time that these text adventure results show llms are terrible at, we need systems that can actually understand and adapt to new situations, not just really good autocomplete engines that happen to know a lot of trivia.

henriquegodoy commented on Claude Sonnet 4 now supports 1M tokens of context anthropic.com/news/1m-con... · Posted by u/adocomplete

henriquegodoy · 4 months ago

Thats incredible to see how ai models are improving, i'm really happy with this news. (imo it's more impactful than the release of gpt5) now, we need more tokens per second, and then the self-improvement of the model will accelerate.

henriquegodoy commented on GPT-5 openai.com/gpt-5/... · Posted by u/rd

henriquegodoy · 4 months ago

That SWE-bench chart with the mismatched bars (52.8% somehow appearing larger than 69.1%) was emblematic of the entire presentation - rushed and underwhelming. It's the kind of error that would get flagged in any internal review, yet here it is in a billion-dollar product launch. Combined with the Bernoulli effect demo confidently explaining how airplane wings work incorrectly (the equal transit time fallacy that NASA explicitly debunks), it doesn't inspire confidence in either the model's capabilities or OpenAI's quality control.

The actual benchmark improvements are marginal at best - we're talking single-digit percentage gains over o3 on most metrics, which hardly justifies a major version bump. What we're seeing looks more like the plateau of an S-curve than a breakthrough. The pricing is competitive ($1.25/1M input tokens vs Claude's $15), but that's about optimization and economics, not the fundamental leap forward that "GPT-5" implies. Even their "unified system" turns out to be multiple models with a router, essentially admitting that the end-to-end training approach has hit diminishing returns.

The irony is that while OpenAI maintains their secretive culture (remember when they claimed o1 used tree search instead of RL?), their competitors are catching up or surpassing them. Claude has been consistently better for coding tasks, Gemini 2.5 Pro has more recent training data, and everyone seems to be converging on similar performance levels. This launch feels less like a victory lap and more like OpenAI trying to maintain relevance while the rest of the field has caught up. Looking forward to seeing what Gemini 3.0 brings to the table.

henriquegodoy commented on GPT-5 for Developers openai.com/index/introduc... · Posted by u/6thbit

henriquegodoy · 4 months ago

I dont think there's so much difference from opus 4.1 and gpt-5, probably just the context size, waiting for the gemini 3.0