You may also want to look at other sorting algorithms - common CPU sorting algorithms don't map well to GPU hardware. A network sort like bitonic sort does more total work (and you have to pad the input to a power of two), but often runs much faster on parallel hardware.
I had a fairly naive implementation that would sort 10M elements in around 10ms on an H100. I'm sure with more work it could get quite a bit faster, but inputs need to be fairly large to make up for the kernel launch overhead.
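A minimal CPU sketch of the bitonic sorting network structure (an illustration, not the GPU kernel described above): every compare-exchange within an inner pass is independent of the others, which is exactly what lets the network saturate parallel hardware, and the power-of-two length requirement is why padding is needed.

```python
def bitonic_sort(a):
    """In-place bitonic sort; len(a) must be a power of two."""
    n = len(a)
    assert n & (n - 1) == 0, "bitonic networks need a power-of-two length"
    k = 2
    while k <= n:            # size of the bitonic sequences being merged
        j = k // 2
        while j > 0:         # compare-exchange distance for this pass
            # Every iteration of this loop is independent - on a GPU,
            # each (i, i ^ j) pair would be handled by its own thread.
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a
```

Note the total work is O(n log² n) compare-exchanges versus O(n log n) for a comparison sort, which is the "more work" trade-off: the extra comparisons buy a fixed, data-independent schedule that parallelizes trivially.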
How many people regularly experience power outages? (OK, if you're an American relying on Canadian electricity, you might have a right to be concerned.)
I'm surprised you're not touting the "save on your power bill" benefits. Couldn't this store power when rates are low and draw from the battery when rates are higher, while keeping a minimum reserve so power is still available should the grid go out?
I'd think it could be quite smart about this: look at weather patterns and other factors to estimate the likelihood of an outage, and hold more backup in reserve when the risk is high.
From a selling standpoint, isn't saving money every day a better feature than "just in case the electricity goes out"?
At ~$600/kWh for capacity, the ROI isn't great. I have a pretty big differential on my rates because I have an EV, and even then I'd need over a decade to make the $1,000 back assuming I fully discharged it every day.
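The payback arithmetic behind that "over a decade" estimate can be sketched as follows. The battery cost and $/kWh figures come from the comment; the peak/off-peak rate differential is an assumed placeholder, since the actual spread depends on the utility's EV tariff.

```python
# Rough payback estimate for rate-arbitrage with a small home battery.
battery_cost = 1000.00          # dollars (from the comment)
cost_per_kwh_capacity = 600.00  # dollars per kWh of storage (from the comment)
rate_differential = 0.15        # $/kWh peak-vs-off-peak spread (assumed)

capacity_kwh = battery_cost / cost_per_kwh_capacity   # ~1.67 kWh of storage
daily_savings = capacity_kwh * rate_differential      # one full cycle per day
payback_years = battery_cost / (daily_savings * 365)

print(f"{capacity_kwh:.2f} kWh, ${daily_savings:.2f}/day, {payback_years:.1f} years")
```

With these assumptions the payback works out to roughly 11 years, consistent with the comment - and that's the optimistic case of a full discharge every single day with no round-trip losses or battery degradation.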
Their paper [1] only mentions using PTX in a few areas, to optimize data-transfer operations so they don't blow up the L2 cache. This makes intuitive sense to me, since the main limitation of the H800 vs. the H100 is reduced NVLink bandwidth, which would necessitate tricks like this that may not be common for people with access to H100s.
How can I leverage that experience into earning the huge amounts of money that AI companies seem to be paying? Most job listings I've looked at require a PhD in specifically AI/math stuff and 15 years of experience (I have a master's in CS, and nowhere close to 15 years of experience).
I'd think things like optimizing for occupancy/memory throughput, ensuring coalesced memory accesses, tuning block sizes, using fast-math alternatives, writing parallel algorithms, working with profiling tools like Nsight, and things like that are fairly transferable?
https://i.imgur.com/WdMPX8S.jpeg
According to this, Zen 4's FP register file is almost as big as its FP execution units. It's a pretty sizable chunk of silicon.
The register file size makes sense. I didn't think register files took up that much of the die on those processors, but I guess they had to be pretty aggressive to meet power goals?
The richest and most "powerful" people still have meat-based assistants do all their shit: take their notes, check their calendars, make their appointments, toast their bread...
And it shows: This is how you get features like "Edge Light" and an Invites app before fixing basic functionality that the peasants rely upon. Like how we get the weird iOS Journal app even though Notes could have done all that if they had improved it a bit.
Steve Jobs was probably one of the few people in charge who actually used his company's own products. You need someone who's annoyed with the status quo enough to make a company to solve it, not just someone elected by a board.