Readit News
namnnumbr commented on Regressions on benchmark scores suggest frontier LLMs ~3-5T params   aimlbling-about.ninerealm... · Posted by u/namnnumbr
namnnumbr · a month ago
I tried backing out proprietary model sizes from benchmark scores, inspired by a Latent Space podcast where Artificial Analysis noted their Omniscience Accuracy numbers track parameter count better than anything else they measure.

I trained a bunch of simple linear regressions - while Omniscience Accuracy had the best fit (R²: 0.98), it predicted absurd multi‑trillion param sizes (Gemini 3 Pro ~1,254T total parameters). Artificial Analysis' Intelligence Index provided more plausible results:

- Gemini 3 Pro: 3.4T

- Claude 4.5 Sonnet: 1.4T

- Claude 4.5 Opus: 4.1T

- GPT-5.x series: in the 2.9-5.3T range (total parameters)

Interesting notes:

- task benchmarks (Tau²/GDPVal) aren't predictive of model size

- adding price made the fit worse

- sparsity or parameter activation ratios did not influence predicted sizes at all.
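
For anyone who wants to try the same kind of back-of-envelope estimate, here is a minimal sketch of the general idea, not the code behind the post. It assumes the regression is fit on log10 of total parameter count (the post only says "simple linear regressions"), and every score and size in `known_models` is a made-up placeholder, not real benchmark data.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def r_squared(xs, ys, slope, intercept):
    """Coefficient of determination of the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

# HYPOTHETICAL (benchmark score, total params in billions) pairs standing in
# for open-weight models whose sizes are public. Not real measurements.
known_models = [(20.0, 8.0), (30.0, 32.0), (40.0, 120.0), (50.0, 400.0), (60.0, 1000.0)]

scores = [score for score, _ in known_models]
log_params = [math.log10(params) for _, params in known_models]  # assumed log target

slope, intercept = fit_line(scores, log_params)
fit_quality = r_squared(scores, log_params, slope, intercept)

def predict_params_billions(score):
    """Extrapolate a total parameter count (in billions) from a benchmark score."""
    return 10 ** (intercept + slope * score)
```

Calling `predict_params_billions` with a proprietary model's score extrapolates beyond the fitted range, which is one way a high-R² fit can still produce implausible sizes.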

namnnumbr commented on ChatGPT Atlas   chatgpt.com/atlas... · Posted by u/easton
namnnumbr · 4 months ago
I can't wait for ads to prompt inject my Agentic Browser.
namnnumbr commented on Designing agentic loops   simonwillison.net/2025/Se... · Posted by u/simonw
simonw · 5 months ago
Nice, thanks for sharing. The lack of an equivalent on macOS (sandbox-exec is similar but mostly undocumented and described as "deprecated" by Apple) is really frustrating.
namnnumbr · 4 months ago
Would something like dagger.io work for sandboxing? I'm not sure on the security side of things, but I very much liked the presentation they did at the AI Engineering conference (San Fran, earlier this year) about how they can build branching containers to support branching or parallelized development workflows.
namnnumbr commented on Say farewell to the AI bubble, and get ready for the crash   latimes.com/business/stor... · Posted by u/taimurkazmi
toomuchtodo · 6 months ago
95 per cent of organisations are getting zero return from AI according to MIT - https://news.ycombinator.com/item?id=44956648 - August 2025

State of AI in Business 2025 [pdf] - https://news.ycombinator.com/item?id=44941374 - August 2025

https://web.archive.org/web/20250818145714/https://nanda.med...

> Despite $30–40 billion in enterprise investment into GenAI, this report uncovers a surprising result in that 95% of organizations are getting zero return. The outcomes are so starkly divided across both buyers (enterprises, mid-market, SMBs) and builders (startups, vendors, consultancies) that we call it the GenAI Divide. Just 5% of integrated AI pilots are extracting millions in value, while the vast majority remain stuck with no measurable P&L impact. This divide does not seem to be driven by model quality or regulation, but seems to be determined by approach.

namnnumbr · 6 months ago
This is not new - the quote was "87% of data science projects fail" in 2019.

https://venturebeat.com/ai/why-do-87-of-data-science-project...

namnnumbr commented on Ask HN: Go deep into AI/LLMs or just use them as tools?    · Posted by u/pella_may
mikedelfino · 9 months ago
Thank you for sharing. Do you recommend any courses or books for following that path?
namnnumbr · 9 months ago
For SWEs interested in "AI Engineering" (either getting involved in how models work, or building applications on them), there's a critical paradigm shift in that using "AI" requires more of an experimental mindset than software engineering typically does.

- I strongly recommend Chip Huyen's books ("Designing Machine Learning Systems" and "AI Engineering") and blog (https://huyenchip.com/blog/).

- Andreessen Horowitz' "AI Canon" is a good reference listicle (https://a16z.com/ai-canon/)

- "12 factor agents" (https://github.com/humanlayer/12-factor-agents)

namnnumbr commented on Show HN: New Agentic AI Framework in CNCF   github.com/dapr/dapr-agen... · Posted by u/yaronsc
yaronsc · a year ago
Benchmarks are WIP. We're thinking about durability, task latency, agent throughput. What else would you like to see?
namnnumbr · a year ago
Pass^k and not Pass@k (see https://www.philschmid.de/agents-pass-at-k-pass-power-k). Would be a great twofer to see the code used to run the benchmarks as examples.
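
To make the distinction concrete, here is a minimal sketch of the two metrics as they are commonly estimated from `n` trials of a task with `c` successes: pass@k is the chance that at least one of `k` sampled trials succeeds, and pass^k is the chance that all `k` succeed. The function names are my own, and this is not taken from the linked post or from any benchmark harness.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate P(at least one of k trials succeeds) from c successes in n trials."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) counts the k-subsets made up entirely of failures.
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_hat_k(n, c, k):
    """Estimate P(all k trials succeed) from c successes in n trials."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(c, k) counts the k-subsets made up entirely of successes.
    return comb(c, k) / comb(n, k)
```

With 7 successes in 10 trials and k = 3, pass@k comes out at 119/120 while pass^k is 35/120, so an agent that looks nearly perfect on pass@k can still be unreliable when it has to succeed every time.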

namnnumbr commented on Ask HN: What $500-2500 product improved your 2022    · Posted by u/awillen
technics256 · 3 years ago
A bit unrelated, but I've been a runner my whole life and have been injured the past few months from plantar fasciitis.

Curious if you have any recommendations there.

namnnumbr · 3 years ago
I had chronic Achilles tendonitis and ended up getting shockwave therapy from my orthopedist, who mentioned it's also frequently used for plantar fasciitis if PT does not help.

u/namnnumbr

Karma: 14 · Cake day: November 16, 2021