jacobaustin123 commented on How to Think About GPUs   jax-ml.github.io/scaling-... · Posted by u/alphabetting
hackrmn · 5 days ago
I appreciate your response. I made a point of not revising my comment after posting it, even after finding the following in a subsequent paragraph, quoting:

> Each SM is broken up into 4 identical quadrants, which NVIDIA calls SM subpartitions, each containing a Tensor Core, 16k 32-bit registers, and a SIMD/SIMT vector arithmetic unit called a Warp Scheduler, whose lanes (ALUs) NVIDIA calls CUDA Cores.

And right after:

> CUDA Cores: each subpartition contains a set of ALUs called CUDA Cores that do SIMD/SIMT vector arithmetic.
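
Taken together, those two sentences actually pin the per-SM numbers down. As a quick sanity check -- a minimal Python sketch, with the 32-lanes-per-Warp-Scheduler figure taken from the article's first diagram and the rest from the quote above:

    # Per-SM arithmetic implied by the quoted text (H100 figures)
    subpartitions_per_sm = 4                 # "4 identical quadrants"
    lanes_per_warp_scheduler = 32            # CUDA Cores per subpartition
    registers_per_subpartition = 16 * 1024   # "16k 32-bit registers"

    cuda_cores_per_sm = subpartitions_per_sm * lanes_per_warp_scheduler
    registers_per_sm = subpartitions_per_sm * registers_per_subpartition
    print(cuda_cores_per_sm)  # 128 CUDA Cores per SM
    print(registers_per_sm)   # 65536 32-bit registers per SM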

So, in your defense and to my shame -- you *did* do better than I was able to infer at first glance. And I can take absolutely no issue with a piece elaborating on an originally "vague" sentence later on -- we need to read top to bottom, after all.

Much of the difficulty with laying out knowledge in the written word comes from inherent constraints, like choosing between deferring detail to "further down" and giving the "bird's eye view" up front. There is a reason writing is hard, and technical writing perhaps more so. You're doing much better than a lot of other material I've had to learn from, so I can only thank you for doing as much as you already have.

To be more constructive still: I agree the border between clarity and utility isn't always clearly drawn. But think of it as a service to your readers -- go with precision, I say. If you really presuppose the reader knows SIMD, chances are they can grok a new term like "SIMD lane" if you define it _once_ and _well_. You don't need to be "unwieldy" through repetition -- the first time may be hard, but you only need to do it once.

I am rambling. I do believe there are better and worse ways to impart this kind of knowledge in writing, but I obviously don't have the answers either, so my criticism was in part unconstructive -- a sheer outcry of mild frustration from when I started conflating things from the get-go, before I decided to give the piece a more thorough read.

One last thing though: I always like it when a follow-up article starts with a preamble along the lines of "In the previous part of the series..." so new visitors can simultaneously become aware there's prior knowledge that may be assumed _and_ navigate their way to the desired point in the series, perhaps all the way to the start. That frees you from e.g. having to annotate abbreviations in every part, if you want to avoid doing that.

jacobaustin123 · 3 days ago
Thank you for taking the time to write this reply. Agree with "in the previous part of this series" comment. I'll try to find a way to highlight this more.

What I'd like to add to this page is some sort of highly clear glossary that defines all the terms at the top (but in some kind of collapsible fashion) so I can define everything with full clarity without disrupting the flow. I'll play with the HTML and see what I can do.

jacobaustin123 commented on How to Think About GPUs   jax-ml.github.io/scaling-... · Posted by u/alphabetting
charleshn · 5 days ago
Yes, 450GB/s is the per-GPU bandwidth in the NVLink domain. 3.2Tbps is the per-host bandwidth in the scale-out IB/Ethernet domain.
jacobaustin123 · 5 days ago
I believe this is correct. For an H100, the 4 NVLink switches each have 64 ports supporting 25GB/s each, and each GPU uses a total of 18 ports. This gives us 450GB/s of bandwidth within the node. But once you start trying to leave the node, you're limited by the per-node InfiniBand cabling, which only gives you 400GB/s out of the entire node (50GB/s per GPU).
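
As a back-of-the-envelope check, here's a minimal Python sketch of that arithmetic (the 8-GPUs-per-node and one-400Gbps-NIC-per-GPU figures are my assumptions for a standard 8-GPU H100 node; the port counts are the ones above):

    # Scale-up: each H100 GPU drives 18 NVLink ports at 25 GB/s each.
    nvlink_ports_per_gpu = 18
    gb_per_nvlink_port = 25
    print(nvlink_ports_per_gpu * gb_per_nvlink_port)  # 450 GB/s per GPU in-node

    # Scale-out: assume 8 GPUs per node, each paired with a 400 Gbps IB NIC.
    gpus_per_node = 8
    nic_gbps = 400
    per_node_gbps = gpus_per_node * nic_gbps  # 3200 Gbps = 3.2 Tbps per host
    print(per_node_gbps / 8)                  # 400 GB/s out of the whole node
    print(per_node_gbps / 8 / gpus_per_node)  # 50 GB/s per GPU
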
jacobaustin123 commented on How to Think About GPUs   jax-ml.github.io/scaling-... · Posted by u/alphabetting
hackrmn · 5 days ago
I find the piece, much like a lot of other documentation, "imprecise". Like most such efforts, it likely caters to a group of people expected to benefit from having a GPU explained to them, but it fumbles its terms, e.g. (the first image with burned-in text):

> The "Warp Scheduler" is a SIMD vector unit like the TPU VPU with 32 lanes, called "CUDA Cores"

It's not clear from the above what a "CUDA core" (singular) _is_ -- this is the archetypical "let me explain things to you" error most people make, usually in good faith -- if I don't know the material and am out to understand it, you have gotten me to read all of it without making clear the very objects of your explanation.

And so, because of these kinds of "compounding errors", the people the piece was likely targeted at are none the wiser, really, while those who already have a good grasp of the concepts being explained, like what a CUDA core actually is, already know most of what the piece is trying to convey anyway.

My advice to everyone who starts out with a back-of-the-envelope cheatsheet and then decides to publish it "for the good of mankind", e.g. on GitHub: please be surgically precise with your terms -- the terms are your trading cards, then come the verbs, etc. This is all writing 101, but it's a rare thing, evidently. Don't mix and match terms, don't conflate them (the reader will do it for you many times over for free if you're sloppy), and be diligent with analogies.

Evidently, the piece may have been written to help those already familiar with TPU terminology -- it mentions "MXU" but there's no telling what that is.

I understand I am asking a tall order, but the piece is long, and all the effort that was put in could have been complemented with minimal extra hypertext, such as annotated abbreviations like "MXU".

I can always ask $AI to do the equivalent for me, which is a tragedy according to some.

jacobaustin123 · 5 days ago
Shamelessly responding as the author. I (mostly) agree with you here.

> please be surgically precise with your terms

There's always a tension between precision in every explanation and the "moral" truth. I can say "a SIMD (Single Instruction Multiple Data) vector unit like the TPU VPU with 32 ALUs (SIMD lanes) which NVIDIA calls CUDA Cores", which starts to get unwieldy and even then leaves terms like vector units undefined. I try to use footnotes liberally, but you have to believe the reader will click on them. Sidenotes are great, but hard to make work in HTML.

For terms like MXU, I was intending this to be a continuation of the previous several chapters which do define the term, but I agree it's maybe not reasonable to assume people will read each chapter.

There are other imprecisions here, like the term "Warp Scheduler", which is itself overloaded to mean the scheduler, the dispatch unit, and the SIMD ALUs -- kind of wrong, but also morally true, since NVIDIA doesn't have a name for the combined unit. :shrug:

I agree with your points and will try to improve this more. It's just a hard set of compromises.
