Chamix commented on No, it doesn't cost Anthropic $5k per Claude Code user   martinalderson.com/posts/... · Posted by u/jnord
johndough · 2 days ago
Could you point at some more public info about active parameter count? You said:

> and while an exact number is hard to compute, let me tell you, it is not 17B or anywhere in that particular OOM :)

I can see ~100B, but that would be near the same order of magnitude. I find ~1000B active parameters hard to believe.

Chamix · 2 days ago
Sorry if that was unclear, I did mean 100Bs, as in the next order of magnitude up. Even GPT-4 had ~220B active params, though the trend since has been toward increased sparsification (a lower activation:total ratio). GPT-4.5 is the only publicly facing model that approached 1T active parameters (an experiment to see whether there was any value in the extreme inference cost of quadratically scaling compute with naïve attention). Nowadays you optimize your head size to your attention kernel architecture and obtain performance principally through inference-time scaling (generate more tokens) and parallel consensus (GPT Pro, Gemini Deep Think, etc.), both of which favor faster, cheaper active heads.
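
For intuition on that parenthetical, a napkin sketch of why naïve attention gets so expensive with context length; the dimensions below are placeholders for illustration, not any particular model's:

    # Naive self-attention: QK^T and AV are each ~n^2 * d multiply-adds
    # per layer, so prefill compute grows quadratically with context n.
    def attn_flops(n_ctx: int, d_model: int, n_layers: int) -> float:
        return n_layers * 2 * (2 * n_ctx**2 * d_model)  # 2 matmuls, 2 FLOPs/MAC

    print(attn_flops(8_192, 8_192, 100) / 1e15)    # ~0.22 PFLOPs at 8k context
    print(attn_flops(131_072, 8_192, 100) / 1e15)  # 256x more at 128k context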

4o and other H100-era models did indeed drop their activated heads far below GPT-4's, into the 10s of billions, just like current Hopper-era Chinese open-source models, but sizes went right back up post-Blackwell with the 10x L2 bump (for KV cache), in step with n log n attention mechanisms being refined. Similar story for Claude.

The fun speculation is the true size of Gemini 3's internals, given the petabyte+ world size of their homefield Ironwood V7 systems and Jim Keller's public penchant for envisioning extreme MoE-like diversification across hundreds of dedicated sub-models built by individual teams within DeepMind.

Chamix commented on No, it doesn't cost Anthropic $5k per Claude Code user   martinalderson.com/posts/... · Posted by u/jnord
jychang · 2 days ago
Nobody is running 10s of trillion param models in 2026. That's ridiculous.

Opus is 2T-3T in size at most.

Chamix · 2 days ago
What do you think labs are doing with the minimum 10 TB of memory in NVLink 72 systems that were publicly reported to start coming online in November/December of last year? And why would the 1 TB -> 10 TB jump matter so much, with Anthropic previously wholly dependent on running Opus 4.x on TPUs, if the models were 2-3T at 4-bit and could fit in an 8x B200 node (1.5 TB = 3T params) widely deployed during the Opus 4 era?
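
If you want to redo that napkin math yourself, a minimal sketch (the memory figures are the ones above; bytes-per-param follows from the bit width):

    # Weights-only ceiling on model size for a given memory pool; real
    # deployments lose capacity to KV cache, activations, and overhead.
    def max_params_trillions(memory_tb: float, bits_per_param: int) -> float:
        return memory_tb * 1e12 / (bits_per_param / 8) / 1e12

    print(max_params_trillions(1.5, 4))   # 8x B200 node at 4-bit -> 3.0T
    print(max_params_trillions(10.0, 4))  # Blackwell NVL72 at 4-bit -> 20.0T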

You have presented a vibe-based rebuttal with no evidence or logic to outline why you think labs are still stuck in the single trillions of parameters (GPT-4 was ~1 trillion params!). Though you have successfully Cunningham'd me into saying that while anything I publicly state is derived from public info, working in the industry itself is a helpful guide to pointing at the right public info to reference.

Chamix commented on No, it doesn't cost Anthropic $5k per Claude Code user   martinalderson.com/posts/... · Posted by u/jnord
aurareturn · 2 days ago
They do not have enough H200 or Blackwell systems to serve 1.6 billion people and the world, so I doubt it's any meaningful number.
Chamix · 2 days ago
I assure you, the number of people paying to use Qwen3-Max or other similar proprietary endpoints is far less than 1.6 billion.
Chamix commented on No, it doesn't cost Anthropic $5k per Claude Code user   martinalderson.com/posts/... · Posted by u/jnord
aurareturn · 2 days ago
China is targeting H20 because that's all they were officially allowed to buy.
Chamix · 2 days ago
I generally agree; back-of-the-napkin math shows an H20 cluster of 8 GPUs * 96 GB = 768 GB = 768B parameters at FP8 (no NVFP4 on Hopper), which lines up pretty nicely with the sizes of recent open-source Chinese models.

However, I'd say it's relatively well assumed in realpolitik land that Chinese labs managed to acquire plenty of H100/H200 clusters, and even meaningful numbers of B200 systems, semi-illicitly before the regulations and anti-smuggling measures really started to crack down.

This does somewhat raise the question of how nicely the closed-source variants, with undisclosed parameter counts, fit within the 1.1 TB of H200 or 1.5 TB of B200 systems.
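
To make that concrete, a quick fit-check (the 671B figure is DeepSeek-V3's published total parameter count; memory sizes are the per-node numbers above):

    # FP8 = 1 byte/param (Hopper has no NVFP4); weights only, no KV cache.
    nodes_gb = {"8x H20": 768, "8x H200": 1100, "8x B200": 1500}
    model_b_params = 671  # DeepSeek-V3 total params, in billions
    for name, mem_gb in nodes_gb.items():
        verdict = "fits" if model_b_params <= mem_gb else "does not fit"
        print(f"{name}: {mem_gb}B-param FP8 capacity, 671B {verdict}")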

Chamix commented on No, it doesn't cost Anthropic $5k per Claude Code user   martinalderson.com/posts/... · Posted by u/jnord
codemog · 2 days ago
Also curious if any experts can weigh in on this. I would guess in the 1 trillion to 2 trillion range.
Chamix · 2 days ago
Try 10s of trillions. These days everyone is running 4-bit at inference (the flagship feature of Blackwell+), with the big flagship models running on recently installed Nvidia 72-GPU Rubin clusters (and an equivalent-ish world size for those rented Ironwood TPUs Anthropic also uses). Let's see: Vera Rubin racks come standard with 20 TB of unified memory (Blackwell NVL72 with 10 TB), and NVFP4 fits 2 parameters per byte...
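
Finishing that multiplication (weights-only ceiling, before KV cache, redundancy, and overhead eat into it):

    print(20e12 * 2 / 1e12)  # Rubin NVL72: 20 TB * 2 params/byte -> 40T ceiling
    print(10e12 * 2 / 1e12)  # Blackwell NVL72: 10 TB -> 20T ceiling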

Of course, intense sparsification via MoE (and other techniques ;) ) lets total model size largely decouple from inference speed and cost (within the limits of world size via NVLink/TPU torus caps).

So the real mystery, as always, is the actual parameter count of the activated head(s). You can do various speed benchmarks and TPS tracking across likely hardware fleets, and while an exact number is hard to compute, let me tell you, it is not 17B or anywhere in that particular OOM :)
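
The flavor of that cross-check, as a sketch: single-stream decode is roughly bound by streaming the active weights from HBM once per token, so observed tokens/sec puts a ceiling on active size. Hugely simplified (ignores batching, interconnect, and attention/KV traffic); the bandwidth number is B200-class HBM, and the active sizes are illustrative guesses, not measurements:

    # Napkin bound: max_tps ~= HBM_bandwidth / active_weight_bytes
    def max_decode_tps(hbm_tb_per_s: float, active_params_b: float,
                       bytes_per_param: float = 0.5) -> float:  # 0.5 = NVFP4
        return hbm_tb_per_s * 1e12 / (active_params_b * 1e9 * bytes_per_param)

    print(max_decode_tps(8.0, 17.0))   # ~941 tps ceiling if active were 17B
    print(max_decode_tps(8.0, 170.0))  # ~94 tps at 10x the active params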

Comparing Opus 4.6 or GPT 5.4 Thinking or Gemini 3.1 Pro to any sort of Chinese model (on cost) is just totally disingenuous when China does NOT have Vera Rubin NVL72 GPUs or Ironwood V7 TPUs in any meaningful capacity, and is forced to target 8-GPU Blackwell systems (and worse!) for deployment.

Chamix commented on Claude’s C Compiler vs. GCC   harshanu.space/en/tech/cc... · Posted by u/unchar1
Chamix · a month ago
You know, it sure does add some additional perspective to the original Anthropic marketing materia... ahem, I mean article, to learn that the CCC-compiled runtime for SQLite could potentially run up to 158,000 times slower than a GCC-compiled one...

Nevertheless, the victories continue to be closer to home.

Chamix commented on GPT-4.5   openai.com/index/introduc... · Posted by u/meetpateltech
SamvitJ · a year ago
One comparison I found interesting... I think GPT-4o has a more balanced answer!

> What are your thoughts on space exploration?

GPT-4.5:

Space exploration isn't just valuable—it's essential.

People often frame it as a luxury we pursue after solving Earth-bound problems. But space exploration actually helps us address those very challenges: climate change (via satellite monitoring), resource scarcity (through asteroid mining), and existential risk (by enabling eventual multi-planetary living). It also drives technological innovation, economic growth, and inspires humanity toward collective achievement.

We don't explore space to escape Earth. We explore space to better understand, protect, and sustain life on Earth.

> What are your thoughts on space exploration?

GPT-4o:

Space exploration is one of the most ambitious and profound endeavors humanity has undertaken. It expands our understanding of the universe, drives technological innovation, and inspires generations to think beyond the boundaries of Earth.

From a scientific perspective, space exploration helps us address fundamental questions: Are we alone in the universe? How did the cosmos begin? Can we colonize other planets? Missions like the James Webb Space Telescope, Mars rovers, and deep-space probes continuously reshape our understanding of physics, planetary science, and even biology.

From a practical standpoint, space exploration fuels advancements in materials science, AI, robotics, and medicine. Many everyday technologies—GPS, satellite communications, weather forecasting, and even medical imaging—stem from space research.

There's also the existential argument: humanity's long-term survival may depend on our ability to become a multi-planetary species. While colonizing Mars or mining asteroids is far from easy, these efforts could help us safeguard civilization against existential threats like climate change or asteroid impacts.

That said, space exploration also raises ethical and economic questions. Should we focus more on Earth’s problems before investing trillions in space? How do we prevent the same exploitative behaviors that shaped colonial history from repeating on other planets?

Overall, I see space exploration as a necessary pursuit—not at the cost of solving problems on Earth, but as a way to advance our knowledge, drive innovation, and secure a future beyond our home planet. What’s your take?

Chamix · a year ago
Indeed, and you could in essence achieve the difference yourself with a different system prompt on 4o. What exactly is 4.5 contributing here in terms of more nuanced intelligence?

The new RLHF direction (heavily amplified through scaling synthetic training tokens) seems to clobber any minor gains the improved base internet-prediction capability might've added.

Chamix commented on GPT-4.5   openai.com/index/introduc... · Posted by u/meetpateltech
bguberfain · a year ago
Until GPT-4.5, GPT-4 32K was certainly the heaviest model available at OpenAI. I can imagine the dilemma between keeping it running and stopping it to free GPUs for training new models. This time, OpenAI was upfront about whether it would continue serving it in the API long-term.
Chamix · a year ago
It's interesting to compare the cost of that original GPT-4 32k (0314) vs GPT-4.5:

$60/M input tokens vs $75/M input tokens

$120/M output tokens vs $150/M output tokens

A flat 25% increase on both.


Chamix commented on Sohu – first specialized chip (ASIC) for transformer models   twitter.com/Etched/status... · Posted by u/rkwasny
gwern · 2 years ago
Chamix · 2 years ago
Forgive me if I'm missing your existing realization (I did a quick check of your HN, Reddit, Twitter, LW), but I think the big deal with Sohu (wrt Etched) is that they have pivoted from "all model parameters hard-etched onto the chip" to "only the transformer ops (matmul etc.) etched onto the chip".

Sohu does not have the LLaMA 70B weights directly lithographed onto the silicon, as you seem(?) to be implying by attaching that 6-month-old post.

Seems like a sensible pivot; I'd imagine they're rather up to date on the pulse of dynamically updated nets potentially being a major feature in upcoming frontier models, as you've recently been commenting on. However, I'm not deep enough in it to be sure how much this removes their differentiation vs other AI accelerator startups.
