Readit News
charleshn commented on How to Think About GPUs   jax-ml.github.io/scaling-... · Posted by u/alphabetting
aschleck · 6 days ago
It's been a while since I thought about this, but isn't the reason providers advertise only 3.2 Tbps simply that it's the limit of a single node's connection to the IB network? DGX is spec'ed to pair each H100 with a ConnectX-7 NIC, and those cap out at 400 Gbps. 8 GPUs * 400 Gbps per GPU = 3.2 Tbps.

Quiz 2 is confusingly worded but is, iiuc, referring to intranode GPU connections rather than internode networking.

charleshn · 6 days ago
Yes, 450 GB/s is the per-GPU bandwidth in the NVLink domain; 3.2 Tbps is the per-host bandwidth in the scale-out IB/Ethernet domain.
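A quick back-of-the-envelope sketch (Python, just restating the figures quoted above: 8 NICs at 400 Gbps per node, 450 GB/s of NVLink per GPU) of how the two domains compare:

    # Back-of-the-envelope comparison of the two bandwidth figures quoted above.
    # Assumes 8 H100s per node with one 400 Gbps ConnectX-7 NIC each (DGX spec)
    # and 450 GB/s of NVLink bandwidth per GPU.
    NICS_PER_NODE = 8
    NIC_GBPS = 400       # per-NIC line rate, gigabits per second
    NVLINK_GBYTES = 450  # per-GPU NVLink bandwidth, gigabytes per second

    # Scale-out (IB/Ethernet) domain: bandwidth leaving the node.
    node_scaleout_tbps = NICS_PER_NODE * NIC_GBPS / 1000    # 3.2 Tbps per host
    per_gpu_scaleout_gbytes = NIC_GBPS / 8                   # ~50 GB/s per GPU

    # Scale-up (NVLink) domain: bandwidth between GPUs inside the node.
    per_gpu_nvlink_tbps = NVLINK_GBYTES * 8 / 1000           # ~3.6 Tbps per GPU

    print(f"per-host scale-out: {node_scaleout_tbps} Tbps")
    print(f"per-GPU scale-out:  {per_gpu_scaleout_gbytes:.0f} GB/s")
    print(f"per-GPU NVLink:     {per_gpu_nvlink_tbps} Tbps "
          f"(~{NVLINK_GBYTES / per_gpu_scaleout_gbytes:.0f}x the NIC)")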
charleshn commented on The Surprising gRPC Client Bottleneck in Low-Latency Networks   blog.ydb.tech/the-surpris... · Posted by u/eivanov89
lacop · a month ago
Yeah, that was my understanding too, hence I filed the bug (actually a duplicate of an older bug that was closed because the poster didn't provide a reproduction).

Still not sure if this is a Linux network configuration issue or a gRPC issue, but something is definitely broken if I can't send a ~1MB request and get a response within roughly network RTT + server processing time.

charleshn · a month ago
Could you check the value of your kernel's net.ipv4.tcp_slow_start_after_idle sysctl, and if it's non-zero, set it to 0?
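For illustration, a minimal Linux-only sketch of checking that knob via procfs (the sysctl name is the real one above; the rest is just plumbing, and setting it normally goes through sysctl -w):

    # Read net.ipv4.tcp_slow_start_after_idle via procfs (Linux). A value of 1
    # means the kernel collapses the congestion window after an idle period, so
    # the next burst on an otherwise warm connection pays slow start again.
    from pathlib import Path

    knob = Path("/proc/sys/net/ipv4/tcp_slow_start_after_idle")
    value = knob.read_text().strip()
    print(f"tcp_slow_start_after_idle = {value}")

    if value != "0":
        # Equivalent to: sysctl -w net.ipv4.tcp_slow_start_after_idle=0
        # (needs root; persist it under /etc/sysctl.d/ to survive reboots)
        try:
            knob.write_text("0\n")
            print("set to 0")
        except PermissionError:
            print("re-run as root (or use sysctl -w) to change it")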
charleshn commented on AI capex is so big that it's affecting economic statistics   paulkedrosky.com/honey-ai... · Posted by u/throw0101c
charleshn · a month ago
I'm always surprised by the number of people posting here who are dismissive of AI and its obvious, unstoppable progress.

Just looking at what happened with chess, Go, strategy games, protein folding, etc., it's obvious that pretty much any field or problem that can be formalised and cheaply verified - e.g. mathematics, algorithms - will be solved, and that it's only a matter of time before we have domain-specific ASI.

I strongly encourage everyone to read about the bitter lesson [0] and verifier's law [1].

[0] http://www.incompleteideas.net/IncIdeas/BitterLesson.html

[1] https://www.jasonwei.net/blog/asymmetry-of-verification-and-...

charleshn · a month ago
You can now add getting gold at IMO [0] to the above list.

[0] https://x.com/alexwei_/status/1946477742855532918

charleshn commented on AI capex is so big that it's affecting economic statistics   paulkedrosky.com/honey-ai... · Posted by u/throw0101c
kadushka · a month ago
I love how people are transitioning from “LLMs can’t reason” to “LLMs can’t reliably reason”.
charleshn · a month ago
Frontier models went from not being able to count the number of 'r's in "strawberry" to getting gold at the IMO in under two years [0], and people keep repeating the same clichés such as "LLMs can't reason" or "they're just next-token predictors".

At this point, I think it can only be explained by ignorance, bad faith, or fear of becoming irrelevant.

[0] https://x.com/alexwei_/status/1946477742855532918

charleshn commented on AI capex is so big that it's affecting economic statistics   paulkedrosky.com/honey-ai... · Posted by u/throw0101c
oytis · a month ago
It's very different from chess etc. If we could formalise and "solve" software engineering precisely, it would be really cool, and probably indeed just lift programming to a new level of abstraction.

I don't mind software jobs moving from writing software to verifying software either, if it makes the whole process more efficient and the software becomes better as a result. Again, that's not what is happening here.

What is happening, at least in the minds of AI-optimist CEOs, is "disruption": drop the quality while cutting costs dramatically.

charleshn · a month ago
I mentioned algorithms, not software engineering, precisely for that reason.

But the next step is obviously increased formalism via formal methods, deterministic simulators, etc., basically so that one can define an environment for an RL agent.
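For illustration, a toy sketch of what such an environment could look like - all names here are hypothetical, the point is only that a cheap, deterministic verifier supplies the reward signal:

    # Toy, hypothetical shape of a verifiable RL environment: the reward is
    # whatever a deterministic checker says about a candidate solution.
    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class VerifiableEnv:
        problem: str                          # task statement (spec, theorem, ...)
        verifier: Callable[[str, str], bool]  # cheap, deterministic check

        def step(self, candidate: str) -> float:
            # Binary reward from the verifier; no human judgement in the loop.
            return 1.0 if self.verifier(self.problem, candidate) else 0.0

    # Example: a trivially checkable "problem" and its verifier.
    env = VerifiableEnv(
        problem="sort the list [3, 1, 2]",
        verifier=lambda problem, answer: answer == "[1, 2, 3]",
    )
    print(env.step("[1, 2, 3]"))  # 1.0
    print(env.step("[3, 2, 1]"))  # 0.0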

charleshn commented on AI capex is so big that it's affecting economic statistics   paulkedrosky.com/honey-ai... · Posted by u/throw0101c
oytis · a month ago
I just hope that when (if) the hype is over, we can repurpose the capacity for something useful (e.g. drug discovery).
charleshn · a month ago
I'm always surprised by the number of people posting here who are dismissive of AI and its obvious, unstoppable progress.

Just looking at what happened with chess, Go, strategy games, protein folding, etc., it's obvious that pretty much any field or problem that can be formalised and cheaply verified - e.g. mathematics, algorithms - will be solved, and that it's only a matter of time before we have domain-specific ASI.

I strongly encourage everyone to read about the bitter lesson [0] and verifier's law [1].

[0] http://www.incompleteideas.net/IncIdeas/BitterLesson.html

[1] https://www.jasonwei.net/blog/asymmetry-of-verification-and-...

charleshn commented on AI capex is so big that it's affecting economic statistics   paulkedrosky.com/honey-ai... · Posted by u/throw0101c
mikewarot · a month ago
I'm waiting for the shoe to drop when someone comes out with an FPGA optimized for reconfigurable computing and lowers the cost of LLM compute by 90% or better.
charleshn · a month ago
We already have ASICs - see Google's TPUs for some cost estimates.

HBM is also very expensive.

charleshn commented on Aeron: Efficient reliable UDP unicast, UDP multicast, and IPC message transport   github.com/aeron-io/aeron... · Posted by u/todsacerdoti
lll-o-lll · a month ago
> Relative latency savings cross-DC become less interesting the longer the distance, so there's nothing wrong with TCP there.

A long fat pipe sees dramatic throughput drops with TCP under even relatively small packet loss. Possibly we were holding it wrong; I'd love to know if there is a definitive guide to doing it right. We had good success with UDT.

charleshn · a month ago
You might want to look into TCP BBR [0]; it might help, and it's easy to try on Linux with a simple sysctl.

[0] https://en.m.wikipedia.org/wiki/TCP_congestion_control#TCP_B...
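For reference, a small Linux-only sketch of checking whether BBR is available and switching to it (the procfs paths are the standard kernel ones; the one-liner equivalent is sysctl -w net.ipv4.tcp_congestion_control=bbr):

    # Inspect the current and available TCP congestion control algorithms and
    # switch to BBR if the kernel offers it (Linux; writing needs root).
    from pathlib import Path

    ipv4 = Path("/proc/sys/net/ipv4")
    current = (ipv4 / "tcp_congestion_control").read_text().strip()
    available = (ipv4 / "tcp_available_congestion_control").read_text().split()
    print(f"current: {current}, available: {available}")

    if "bbr" not in available:
        # The tcp_bbr module may need loading first: modprobe tcp_bbr
        print("bbr not listed; try loading the tcp_bbr kernel module")
    elif current != "bbr":
        try:
            (ipv4 / "tcp_congestion_control").write_text("bbr\n")
            print("switched to bbr")
        except PermissionError:
            print("re-run as root (or use sysctl -w) to switch")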

charleshn commented on Caching is an abstraction, not an optimization   buttondown.com/jaffray/ar... · Posted by u/samuel246
charleshn · 2 months ago
As can be seen from other comments, people tend to focus on the consistency implications, but something not often discussed in the context of distributed systems is that caches tend to introduce bimodality and metastability [0][1]. See e.g. DynamoDB for an example of a design that takes this into account [2].

[0] https://brooker.co.za/blog/2021/08/27/caches.html

[1] https://sigops.org/s/conferences/hotos/2021/papers/hotos21-s...

[2] https://brooker.co.za/blog/2022/07/12/dynamodb.html
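As a toy illustration of the bimodality (made-up numbers, not taken from any of the references above): latency behind a cache is a mixture of a fast hit path and a slow miss path, so steady-state behaviour says little about what happens when the cache goes cold:

    # Toy model: per-request latency is a mixture of a fast cache-hit path and
    # a slow miss path, so the cold-cache regime looks nothing like steady state.
    # All numbers are made up for illustration.
    import random

    HIT_MS, MISS_MS = 1.0, 50.0

    def mean_latency(hit_rate: float, n: int = 100_000) -> float:
        return sum(HIT_MS if random.random() < hit_rate else MISS_MS
                   for _ in range(n)) / n

    for hit_rate in (0.99, 0.90, 0.0):  # steady state, degraded, cold cache
        print(f"hit rate {hit_rate:>4.0%}: mean latency {mean_latency(hit_rate):5.1f} ms")

    # A backend sized for the 99% hit-rate regime sees ~50x more work per request
    # when the cache goes cold - the metastable failure mode discussed in [0][1].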

u/charleshn · Karma: 154 · Cake day: November 22, 2023