namnnumbr (u/namnnumbr)

namnnumbr commented on Regressions on benchmark scores suggest frontier LLMs ~3-5T params aimlbling-about.ninerealm... · Posted by u/namnnumbr

namnnumbr · a month ago

I tried backing out proprietary model sizes from benchmark scores, inspired by a Latent Space podcast where Artificial Analysis noted their Omniscience Accuracy numbers track parameter count better than anything else they measure.

I trained a bunch of simple linear regressions. Omniscience Accuracy had the best fit (R²: 0.98), but it predicted absurdly large sizes (Gemini 3 Pro at ~1,254T total parameters). Artificial Analysis' Intelligence Index gave more plausible results:

- Gemini 3 Pro: 3.4T

- Claude 4.5 Sonnet: 1.4T

- Claude 4.5 Opus: 4.1T

- GPT-5.x series: 2.9-5.3T range (all figures are total parameters)

Interesting notes:

- task benchmarks (Tau²/GDPVal) aren't predictive of model size

- adding price made the fit worse

- sparsity or parameter activation ratios did not influence predicted sizes at all.
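For anyone who wants to try this kind of fit, here is a minimal sketch of a one-feature regression from a benchmark score to model size. It is not the code behind the numbers above. Regressing on log10 of the parameter count is an assumption on my part, the function names are made up for illustration, and the toy arrays are placeholder values, not real benchmark scores or model sizes.

```python
import numpy as np

def fit_log_params(scores, params_b):
    """Least-squares fit of log10(params in billions) = slope * score + intercept.

    Returns (slope, intercept, r2). The log target is an assumption of this sketch.
    """
    x = np.asarray(scores, dtype=float)
    y = np.log10(np.asarray(params_b, dtype=float))
    # np.polyfit returns coefficients highest degree first: [slope, intercept]
    slope, intercept = np.polyfit(x, y, 1)
    pred = slope * x + intercept
    ss_res = np.sum((y - pred) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return slope, intercept, r2

def predict_params_b(score, slope, intercept):
    """Invert the log transform to get a size estimate in billions of parameters."""
    return 10 ** (slope * score + intercept)

# Placeholder values only: NOT real benchmark scores or real model sizes.
toy_scores = [20.0, 30.0, 40.0, 50.0]
toy_params_b = [10.0, 31.6, 100.0, 316.0]

slope, intercept, r2 = fit_log_params(toy_scores, toy_params_b)
# Extrapolating beyond the fitted range, which is exactly where such estimates get shaky.
estimate = predict_params_b(60.0, slope, intercept)
print(f"slope={slope:.3f} r2={r2:.3f} estimate={estimate:.0f}B")
```

A high R² on the open-weight models used for fitting says nothing about how well the line extrapolates to larger proprietary models, which is one way a 0.98 fit can still produce implausible sizes.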

namnnumbr commented on ChatGPT Atlas chatgpt.com/atlas... · Posted by u/easton

namnnumbr · 4 months ago

I can't wait for ads to prompt inject my Agentic Browser.

namnnumbr commented on Designing agentic loops simonwillison.net/2025/Se... · Posted by u/simonw

simonw · 5 months ago

Nice, thanks for sharing. The lack of an equivalent on macOS (sandbox-exec is similar but mostly undocumented and described as "deprecated" by Apple) is really frustrating.

namnnumbr · 4 months ago

Would something like dagger.io work for sandboxing? I'm not sure about the security side of things, but I very much liked the presentation they did at the AI Engineering conference (San Fran, earlier this year) about how they can build branching containers to support branching or parallelized development workflows.

namnnumbr commented on Say farewell to the AI bubble, and get ready for the crash latimes.com/business/stor... · Posted by u/taimurkazmi

toomuchtodo · 6 months ago

95 per cent of organisations are getting zero return from AI according to MIT - https://news.ycombinator.com/item?id=44956648 - August 2025

State of AI in Business 2025 [pdf] - https://news.ycombinator.com/item?id=44941374 - August 2025

https://web.archive.org/web/20250818145714/https://nanda.med...

> Despite $30–40 billion in enterprise investment into GenAI, this report uncovers a surprising result in that 95% of organizations are getting zero return. The outcomes are so starkly divided across both buyers (enterprises, mid-market, SMBs) and builders (startups, vendors, consultancies) that we call it the GenAI Divide. Just 5% of integrated AI pilots are extracting millions in value, while the vast majority remain stuck with no measurable P&L impact. This divide does not seem to be driven by model quality or regulation, but seems to be determined by approach.

namnnumbr · 6 months ago

This is not new - the quote was "87% of data science projects fail" in 2019.

https://venturebeat.com/ai/why-do-87-of-data-science-project...

namnnumbr commented on Ask HN: Go deep into AI/LLMs or just use them as tools? · Posted by u/pella_may

mikedelfino · 9 months ago

Thank you for sharing. Do you recommend any courses or books for following that path?

namnnumbr · 9 months ago

For SWEs interested in "AI Engineering" (either getting involved in how models work, or building applications on them), there's a critical paradigm shift in that using "AI" requires more of an experimental mindset than software engineering typically does.

- I strongly recommend Chip Huyen's books ("Designing Machine Learning Systems" and "AI Engineering") and blog (https://huyenchip.com/blog/).

- Andreessen Horowitz's "AI Canon" is a good reference listicle (https://a16z.com/ai-canon/)

- "12 factor agents" (https://github.com/humanlayer/12-factor-agents)

namnnumbr commented on Show HN: New Agentic AI Framework in CNCF github.com/dapr/dapr-agen... · Posted by u/yaronsc

yaronsc · a year ago

Benchmarks are WIP. We're thinking about durability, task latency, agent throughput. What else would you like to see?

namnnumbr · a year ago

Pass^k and not Pass@k (see https://www.philschmid.de/agents-pass-at-k-pass-power-k). Would be a great twofer to see the code used to run the benchmarks as examples.
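To make the difference concrete, here is a minimal sketch of both metrics, assuming each task is run n times and c of those runs succeed. Pass@k is the chance that at least one of k sampled runs succeeds; Pass^k is the chance that all k succeed, which is closer to what reliability means for an agent. The function names are mine, not from any benchmark harness.

```python
from math import comb

def _check(n: int, c: int, k: int) -> None:
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k runs drawn from n (c successes) passes."""
    _check(n, c, k)
    # comb(n - c, k) is 0 when there are fewer than k failures to draw from
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Probability that all k runs drawn from n (c successes) pass."""
    _check(n, c, k)
    return comb(c, k) / comb(n, k)

# Example: a task that succeeds in 7 of 10 runs, evaluated at k = 3.
at_k = pass_at_k(10, 7, 3)    # 1 - 1/120
hat_k = pass_hat_k(10, 7, 3)  # 35/120
print(f"pass@3={at_k:.3f} pass^3={hat_k:.3f}")
```

The same 70% per-run success rate looks near-perfect under Pass@3 and poor under Pass^3, which is why the latter is the more useful number for agents that have to work every time.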

namnnumbr commented on Ask HN: What $500-2500 product improved your 2022 · Posted by u/awillen

technics256 · 3 years ago

A bit unrelated, but I've been a runner my whole life and have been injured the past few months from plantar fasciitis.

Curious if you have any recommendations there.

namnnumbr · 3 years ago

I had chronic Achilles tendonitis and ended up getting shockwave therapy from my orthopedist, who mentioned it's also frequently used for plantar fasciitis if PT does not help.