Readit News
yvdriess commented on Memory optimizations to reduce CPU costs   ayende.com/blog/203011-A/... · Posted by u/jbjbjbjb
Panzer04 · 3 days ago
That seems like a huge burden, surely not? How often would a GC typically check for hanging references?
yvdriess · 3 days ago
That's most of the work performed by a marking GC.

How much of total CPU cost a GC accounts for depends entirely on the application, the GC implementation and the language. It's famously hard to measure memory-management overhead; GC in production is anywhere between 7% and 82% (Cai et al., ISPASS 2022). I measured about 19% geomean overhead in accurate simulation by ignoring the instructions involved in GC/MM in Python's pyperf benchmarks.

yvdriess commented on Memory optimizations to reduce CPU costs   ayende.com/blog/203011-A/... · Posted by u/jbjbjbjb
Panzer04 · 3 days ago
That's not even the point they're really making here, IMO.

The significant decrease they talk about is a side effect of their chosen language having a GC. This means the strings take more work to deal with than expected.

This speaks more to the fact that the often-small costs associated with certain operations do eventually add up. It's not entirely clear in the post where and when the cost from the GC is incurred, though; I'd presume on creation and destruction?

yvdriess · 3 days ago
The cost of a string array is paid on every GC phase. That array may contain references, so the GC has to check each element every time to see if anything changed. An int array cannot contain references, so it can be skipped.

edit: There are tricks to not traverse a compound object every time, but assume that at least one of the 80M objects in that giant array gets modified in between GC activations.

Deleted Comment

yvdriess commented on FFmpeg 8.0 adds Whisper support   code.ffmpeg.org/FFmpeg/FF... · Posted by u/rilawa
londons_explore · 16 days ago
Does this have the ability to edit historic words as more info becomes available?

Eg. If I say "I scream", it sounds phonetically identical to "Ice cream".

Yet the transcription of "I scream is the best dessert" makes a lot less sense than "Ice cream is the best dessert".

Doing this seems necessary to have both low latency and high accuracy. Transcription on Android does this: you can see the guesses adjust as you talk.

yvdriess · 16 days ago
A good opportunity to point people to the paper with my favorite title of all time:

"How to wreck a nice beach you sing calm incense"

https://dl.acm.org/doi/10.1145/1040830.1040898

yvdriess commented on Cerebras launches Qwen3-235B, achieving 1.5k tokens per second   cerebras.ai/press-release... · Posted by u/mihau
aurareturn · a month ago
If this is the full fp16 quant, you'd need 2TB of memory to use with the full 131k context.

With 44GB of SRAM per Cerebras chip, you'd need 45 chips chained together. $3m per chip. $135m total to run this.

For comparison, you can buy a DGX B200 with 8x B200 Blackwell chips and 1.4TB of memory for around $500k. Two systems would give you 2.8TB memory which is enough for this. So $1m vs $135m to run this model.

It's not very scalable unless you have some ultra-high-value task that needs super fast inference speed. Maybe hedge funds or some sort of financial markets?

PS. The reason why I think we're only in the beginning of the AI boom is because I can't imagine what we can build if we can run models as good as Claude Opus 4 (or even better) at 1500 tokens/s for a very cheap price and tens of millions of context tokens. We're still a few generations of hardware away I'm guessing.

yvdriess · a month ago
> With 44GB of SRAM per Cerebras chip, you'd need 45 chips chained together. $3m per chip. $135m total to run this.

That on-chip SRAM is purely temporary working memory and does not need to hold the entire model weights. The Cerebras chip works on a sparse weight representation: it streams the non-zeros off their external memory servers, and the cores operate in a transport-triggered dataflow manner.

yvdriess commented on Algorithms for Modern Processor Architectures   lemire.github.io/talks/20... · Posted by u/matt_d
DrNosferatu · a month ago
I’m surprised so much branching isn’t more costly.
yvdriess · a month ago
Branch predictors have gotten really good, and it often makes more sense now to rely on them rather than working away the branches.

For example, modern compilers will very rarely introduce conditional moves (cmov) on x86 because they are nearly always slower than simply branching. It might be counterintuitive, but a branch prediction breaks the dependencies between the micro-ops of the conditional and those of the clause. So if your cmov's condition depends on a load, you need to wait for that load to complete before the cmov can execute.

Always benchmark with at-scale data and measure.

yvdriess commented on Algorithms for Modern Processor Architectures   lemire.github.io/talks/20... · Posted by u/matt_d
benob · a month ago
Are there efforts to include the neccessary context in compilers to autovectorize?
yvdriess · a month ago
What do you mean by necessary context?

Modern compilers all autovectorize really well. Usually writing plain canonical loops with plain C arrays is a good way to write portable optimal SIMD code. The usual workflow I use is to translate the vector notation (RIP Cilk+ array syntax) in my paper notes to plain C loops. The compiler's optimization report (-qopt-report for icx, gcc has -fopt-info-vec and -fopt-info-vec-missed) gives feedback on what optimizations it considered and why it did not apply them. In more complex scenarios it can be helpful to add `#pragma omp simd` pragmas or similar to overrule the C semantics.

yvdriess commented on Why is AI so slow to spread?   economist.com/finance-and... · Posted by u/1vuio0pswjnm7
the_duke · a month ago
I'd disagree with that, if given enough compute LLMs can have impressive capabilities in finding bugs and implementing new logic, if guided right.

It's hit or miss at the moment, but it's definitely way more than "UML code generators".

yvdriess · a month ago
It's not the same, but boy it sure does rhyme.

Typical prose in the late '90s/early '00s: designing the right UML meta-schema and UML diagram will generate bug-free source code for the program, enabling even non-programmers to create applications and business logic. Programs can check the UML diagram beforehand for logic errors, prove security properties and more.

yvdriess commented on Intel's retreat is unlike anything it's done before in Oregon   oregonlive.com/silicon-fo... · Posted by u/cbzbc
vachina · a month ago
Lip Bu Tan is here for some spring cleaning.
yvdriess · a month ago
Can you still call it a spring cleaning when you take out support walls in the process?
yvdriess commented on Intel's Lion Cove P-Core and Gaming Workloads   chipsandcheese.com/p/inte... · Posted by u/zdw
fschutze · 2 months ago
Interesting. You realize this by identifying the offending assembly instructions and then seeing that one operand comes from memory?
yvdriess · 2 months ago
There's no single good way, but yes, as you said: logical deduction based on the surrounding instructions and their hardware counters is one way to do it. Instruction B might be collecting a ton of hardware-counted cycles, but that could be because instruction A, which it depends on, is slow. Sometimes those dependencies are even implicit: since x86 commits in order, some instructions like locked/atomic ops have implicit and dynamic dependencies based on what is in the reorder buffer at the time.

To give a concrete example I encountered analysing a GC: traversing the object graph in a loop means calculating the address of an object, loading that object, doing some work on it, and then grabbing the bits to calculate which children to visit next. This creates a long, brittle chain of data-dependent conditionals, each depending on a calculation that ultimately came from a much earlier load. That conditional branch might be 30/70 taken/untaken, so the branch predictor often does not speculate, reducing the ILP and making it harder to hide the loads' latencies. Now, dear Watson, would you say the blame lies with the front end? There are no stalls when all the loads hit fast cache, only when there is the occasional remote LLC hit, DRAM access or, god forbid, cross-NUMA hit. And what if I tell you that there's an atomic operation to mark the object as visited, which is fast in itself but can only issue when all prior loads have completed, and stops newer instructions from issuing until it has committed?

You need to look at a whole bunch of surrounding instructions and a variety of hardware counters to start forming a picture. Insert Always Sunny in Philadelphia meme with the red wire crime board here.

u/yvdriess

Karma: 1496 · Cake day: July 20, 2011