CaliforniaKarl · a year ago
More context: This is related to today's release of the Spring Top 500 list (https://news.ycombinator.com/item?id=40346788). Aurora rated 1,012.00 PetaFLOPS/second Rmax, and is in 2nd place, behind Frontier.

In the November 2023 list, Aurora was also in second place, with an Rmax of 585.34 PetaFLOPS/second.

See https://www.top500.org/system/180183/ for the specs on Aurora, and https://www.top500.org/system/180047/ for the specs on Frontier.

See https://www.top500.org/project/top500_description/ and https://www.top500.org/project/linpack/ for a description of Rmax and the LINPACK benchmark, by which supercomputers are generally ranked. The Top 500 list only includes supercomputers that are able to run the LINPACK benchmark, and where the owner is willing to publish the results.
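
For intuition, here's a toy single-node sketch of the quantity HPL measures: solve a dense Ax = b and divide the standard HPL operation count by the wall time. (This is nothing like the real distributed benchmark; the tiny n below is just for illustration.)

  import time
  import numpy as np

  n = 4096                                  # real HPL runs use n in the millions
  A = np.random.rand(n, n)
  b = np.random.rand(n)

  t0 = time.time()
  x = np.linalg.solve(A, b)                 # LU factorization + triangular solves
  elapsed = time.time() - t0

  flops = (2.0 / 3.0) * n**3 + 2.0 * n**2   # standard HPL operation count
  print(f"~{flops / elapsed / 1e9:.1f} GFlop/s on this one node")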

The jump in Aurora's Rmax score is explained by Aurora's difficult birth. https://morethanmoore.substack.com/p/5-years-late-only-2 (published when the November 2023 list came out) has a good explanation of what's been going on.

imrehg · a year ago
Looking at the two specs, interesting to see how Frontier (the first, running AMD CPUs) has much better power efficiency than Aurora (the second, running Intel), 18.89 kW/PFLOPS vs 38.24 kW/PFLOPS respectively... Good advertisement for AMD? :)
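
Those figures are just the listed power draw divided by Rmax; a quick sketch (the power and Rmax numbers below are my reading of the spec pages at the time and should be treated as approximate):

  # kW per PFlop/s = listed power draw (kW) / Rmax (PFlop/s)
  systems = {
      # name: (Rmax in PFlop/s, power in kW) -- roughly as listed on the spec pages
      "Frontier": (1206.0, 22786.0),
      "Aurora":   (1012.0, 38698.0),
  }
  for name, (rmax, power) in systems.items():
      print(f"{name}: {power / rmax:.2f} kW/PFlop/s")
  # -> Frontier: 18.89 kW/PFlop/s, Aurora: 38.24 kW/PFlop/s
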
nolok · a year ago
These days this is true from top to bottom: desktop, servers, ... Even in gaming, the 7800X3D is cheaper than the 14700K, is also more performant, and yet uses roughly 20% less power at idle, and the gap only grows under full load.

AMD's current architecture is very power efficient, and Intel has more or less resorted to feeding in extra watts to catch back up in performance.

pyrale · a year ago
Also the delta between theoretical performance and benchmarked performance is much smaller for Frontier (AMD) than for Aurora (Intel).

That being said, note that the software is also different on the two computers.

p1esk · a year ago
Note that all mentions of FLOPS in this thread refer to FP64 (double precision), unlike the more popular "AI OPS" specified for modern GPUs, which are typically INT8.
kkielhofner · a year ago
> which are typically INT8

These systems are used for training, which is VERY rarely INT8. On Frontier, for example, it's recommended to use bfloat16, or float32 if that doesn't work for you/your application.

Nvidia has FP8 with >=Hopper, and supposedly the AMD MI300 has it as well, although I have no experience with the MI300 so I can't speak to that.
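
For what it's worth, the bfloat16 recommendation usually just means running mixed precision in your framework. A minimal PyTorch sketch (assuming a recent PyTorch; on a ROCm build the device string is still "cuda", and the model here is a placeholder, not anything Frontier-specific):

  import torch

  model = torch.nn.Linear(1024, 1024).to("cuda")   # placeholder model
  x = torch.randn(8, 1024, device="cuda")

  # Weights stay in float32; matmuls run in bfloat16 under autocast.
  with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
      y = model(x)

  print(y.dtype)   # torch.bfloat16
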

klysm · a year ago
What does FLOPS/second mean? Isn’t FLOPS already per second? Are they accelerating?
falcor84 · a year ago
I'd actually be interested in an estimate of the world's overall flop/s^2. Could someone please run a back of the envelope calculation for me, e.g. looking at last year's data?
dragonwriter · a year ago
Yeah, the top500 pages cited use Flop/s (apparently using "Flop" for "Floating point operations" – not sure which "o" and "p" are used). I could swear that when I first encountered "FLOPS" it was expanded specifically as "FLoating point Operations Per Second". "FLOPS/s" seems to be using "FLOPS" like the "Flop" above (probably as "FLoating point OPerationS"), in which case the "/s" makes sense.
ngneer · a year ago
Made me chuckle. F=ma, where a is the derivative of FLOPS with respect to time.
kortilla · a year ago
Some people treat FLOPS as “FLoating point OPerationS”.


verandaguy · a year ago
I don't know anything about supercomputer architecture; are lifetime upgrades that double the performance typical, let alone YoY?

What do those kinds of upgrades entail from a hardware side? Software side? Is this just a horizontal scaling of a cluster?

CaliforniaKarl · a year ago
This isn’t really an upgrade, it’s the system still being commissioned.

See the last paragraph of my post for a link to more info.

Harmohit · a year ago
Serious question: my understanding of HPC is that there are many workloads running on a given supercomputer at any time. There is no singular workload that takes up the entire or most of the resources of a supercomputer.

Is my understanding correct? If yes, then why is it important to build supercomputers with more and more compute? Wouldn't it be better to build smaller systems that focus more on power/cost/space efficiency?

dekhn · a year ago
There are many variables that go into supercomputers, of which "company/country propaganda" is just one.

Supercomputer admins would love to have a single code that used the whole machine, both the compute elements and the network elements, at close to 100%. In fact they spend a significant fraction on network elements to unblock the compute elements, but few codes are really so light on networking that the program scales to the full core count of the machine. So, instead they usually have several codes which can scale up to a significant fraction of the machine and then backfill with smaller jobs to keep the utilization up (because the acquisition cost and the running cost are so high).
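
Loosely, the backfill idea looks like this (a toy sketch, not Slurm's actual algorithm; the node counts are invented, and real backfill also reasons about job runtimes so small jobs can't delay the big ones):

  # Toy backfill: a couple of big jobs hold most of the machine; smaller queued
  # jobs are slotted into whatever nodes are left idle to keep utilization up.
  TOTAL_NODES = 10_000
  queue = [("hero_run", 8000), ("md_sweep", 1500), ("qcd_mid", 512),
           ("postproc", 128), ("viz", 64)]

  idle = TOTAL_NODES
  running = []
  for name, nodes in queue:                 # greedy first-fit in queue order
      if nodes <= idle:
          running.append(name)
          idle -= nodes

  print("running:", running)                # qcd_mid waits; smaller jobs backfill
  print(f"utilization: {100 * (TOTAL_NODES - idle) / TOTAL_NODES:.1f}%")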

Supercomputers have limited utility: beyond country bragging rights, only a few problems really justify spending this kind of resource. I intentionally switched my own research in molecular dynamics away from supercomputers (where I'd run one job on 64-128 processors for a 96X speedup) to closet clusters, where I'd run 128 independent jobs for a 128X speedup, but then have to do a bunch of post-processing to make the results comparable to the long, large runs on the supercomputer (https://research.google/pubs/cloud-based-simulations-on-goog...). I actually was really relieved when my work no longer depended on expensive resources with little support, as my scientific productivity went up and my costs went way down.

I feel that supercomputers are good at one thing: if you need to make your country's flagship submarine about 10% faster/quieter than the competition.

alephnerd · a year ago
Ever used GPT-3, DALL-E, or other LLMs?

The GPUs used to train them only existed because the DoE explicitly worked with Nvidia on a decade-long roadmap for delivery in its various supercomputers, and would often work in tandem with private sector players to coordinate purchases and R&D (for example, protein folding work with just about every Big Pharma company).

Hell, AMD EPYC only exists for the same reason.

CaliforniaKarl · a year ago
> I feel that supercomputers are good at one thing: if you need to make your country's flagship submarine about 10% faster/quieter than the competition.

Or doing Numerical Weather Prediction. :-)

But seriously, as a cluster sysadmin, the “128 jobs, followed by post-processing” is great for me, because it lets those separate jobs be scheduled as soon as resources are available.

> expensive resources with little support

Unfortunately, there isn't as much funding available in places for user training and consultation. Good writing & education is a skill, and folks aren't always interested in a job that is term-limited, or whose future is otherwise unclear.

trueismywork · a year ago
Not all jobs can be parallelized in that manner without communication.
Harmohit · a year ago
Thanks for the great reply and for linking the research paper - I am going to check it out. Could you suggest a good review paper for someone new to get into modern HPC?
bitfilped · a year ago
Most of the time yes, HPC systems are shared among many users. Sometimes though the whole system (or near it) will be used in a single run. These are sometimes referred to as "hero runs" and while they're more common for benchmarking and burn-ins there are some tightly-coupled workloads that perform well in that style of execution. It really depends on a number of factors like the workloads being run, the number of users, and what the primary business purpose of the HPC resource is. Sites that have to run both types of jobs will typically allow any user to schedule jobs most of the time but then pre-reserve blocks of time for hero runs to take place where other user jobs are held until the primary scheduled run is over.
Harmohit · a year ago
Thanks for the reply! Can you give some examples of these "hero runs"?
kkielhofner · a year ago
I have a project on Frontier. Generally these systems (including Frontier) use slurm[0] for scheduling and workload management.

The OLCF Frontier user guide[1] has some information on scheduling and Frontier specific quirks (very minor).

Current status of jobs on Frontier:

[kkielhofner@login11.frontier ~]$ squeue -h -t running -r | wc -l

137

[kkielhofner@login11.frontier ~]$ squeue -h -t pending -r | wc -l

1016

The running jobs are relatively low because there are some massive jobs using a significant number of nodes ATM.

[0] - https://slurm.schedmd.com/documentation.html

[1] - https://docs.olcf.ornl.gov/systems/frontier_user_guide.html

EDIT: I give up on HN code formatting

SushiHippie · a year ago
> EDIT: I give up on HN code formatting

Just FYI: https://news.ycombinator.com/formatdoc

> Text after a blank line that is indented by two or more spaces is reproduced verbatim. (This is intended for code.)

  [kkielhofner@login11.frontier ~]$ squeue -h -t running -r | wc -l

  137

  [kkielhofner@login11.frontier ~]$ squeue -h -t pending -r | wc -l

  1016

Xcelerate · a year ago
> There is no singular workload that takes up the entire or most of the resources of a supercomputer.

I performed molecular dynamics simulations on the Titan supercomputer at ORNL during grad school. At the time, this supercomputer was the fastest in the world.

At least back then around 2012, ORNL really wanted projects that uniquely showcased the power of the machine. Many proposals for compute time were turned down for workloads that were “embarrassingly parallel” because these computations could be split up across multiple traditional compute clusters. However, research that involved MD simulations or lattice QCD required the fast Infiniband interconnects and the large amount of memory that Titan had, so these efforts were more likely to be approved.

The lab did in fact want projects that utilized the whole machine at once to take maximum advantage of its capabilities. It’s just that oftentimes this wasn’t possible, and smaller jobs would be slotted into the “gaps” between the bigger ones.

alephnerd · a year ago
> my understanding correct

Yes

> why is it important to build supercomputers with more and more compute

A mix of

- research in distributed systems (there are plenty of open questions in Concurrency, Parallelization, Computer Architecture, etc)

- a way to maintain an ecosystem of large vendors (Intel, AMD, Nvidia and plenty of smaller vendors all get a piece of the pie to subsidize R&D)

- some problems are EXTREMELY computationally and financially expensive, so they require large on-prem compute capabilities (e.g. protein folding, machine learning when I was in undergrad [DGX-100s were subsidized by Aurora], etc.)

- some problems are extremely sensitive for national security reasons and it's best to keep all personnel in a single region (e.g. nuclear simulations, turbine simulations, some niche ML work, etc.)

In reality you need to do both, and planners have known this for decades.

sseagull · a year ago
You are generally correct; however, there are workloads that do use larger portions of a supercomputer and wouldn't be feasible on smaller systems.

Also, I guess I'm not sure what you mean by "smaller systems that focus more on power/cost/space". A proper queueing system generally allocates the resources of a large supercomputer to smaller tasks efficiently, while also making larger tasks possible in the first place. And I imagine there's some economy of scale in a large installation like this.

There are, of course, many many smaller supercomputers, such as at most medium to large universities. But even those often have 10-50k cores or so.

(In general, efficiency is a consideration when building/running, but not when using. Scientists want the most computational power they can get, power usage be damned :) )

edit: A related topic is capacity vs. capability: https://en.wikipedia.org/wiki/Supercomputer#Capability_versu...

ThinkBeat · a year ago
From my experience they are running whole-cluster dedicated jobs quite frequently. Climate models can use whatever resources they get, and nuclear weapons modelling, especially for old warheads, can use a lot.
MaxikCZ · a year ago
What is being calculated with nuclear weapons? I understand it must have been computationally expensive to get them working, but once completed, what is there left to calculate?
trueismywork · a year ago
Bigger systems when utilized at 100% are more efficient than multiple smaller systems when utilized at 100%, in terms of engineering work, software, etc.

But also, bigger systems have more opportunities to achieve higher utilization than smaller systems due to the dynamics of the bin packing problem.
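
A toy illustration of that bin-packing point (job and machine sizes invented; real schedulers are far smarter):

  # Same total hardware, same queued jobs: one 1024-node machine vs eight
  # 128-node machines. Simple first-fit placement.
  jobs = [100] * 12                          # twelve 100-node jobs waiting

  def packed_utilization(machines, jobs):
      free = list(machines)
      placed = 0
      for j in jobs:
          for i, f in enumerate(free):       # first machine with enough room
              if j <= f:
                  free[i] -= j
                  placed += j
                  break
      return 100 * placed / sum(machines)

  print(packed_utilization([1024], jobs))    # one big machine   -> ~97.7%
  print(packed_utilization([128] * 8, jobs)) # eight small ones  -> ~78.1%
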

tkuraku · a year ago
In general there can be many smaller workloads running in parallel. However, periodically the whole supercomputer can be reserved for a "hero" run.


prpl · a year ago
There are also Gordon Bell Prize submissions, which are the only time some machines get completely reserved.
modeless · a year ago
Off topic but "Break the ___ barrier" has got to be my least favorite expression that PR people love. It's a "tell" that an article was written as meaningless pop science fluff instead of anything serious. The sound barrier is a real physical phenomenon. This is not. There's no barrier! Nothing was broken!
alephnerd · a year ago
The Exascale barrier is an actual barrier in HPC/Distributed Systems.

It took 15-20 years to reach this point [0]

A lot of innovations in the GPU and distributed ML space were subsidized by this research.

Concurrent and Parallel Computing are VERY hard problems.

[0] - http://helper.ipam.ucla.edu/publications/nmetut/nmetut_19423...

dghlsakjg · a year ago
The comment isn't saying that the benchmark isn't useful. They are saying that there is no 'barrier' to be broken.

The sound barrier was relevant because there were significant physical effects to overcome specifically when going transonic. It wasn't a question of just adding more powerful engines to existing aircraft. They didn't choose the sound barrier because it's a nice number; it was a big deal because all sorts of things behaved outside of their understanding of aerodynamics at that point. People died in the pursuit of understanding the sound barrier.

The 'exascale barrier', afaict, is just another number chosen specifically because it is a round(ish) number. It didn't turn computer scientists into smoking holes in the desert when it went wrong. This is an incremental improvement in an incredible field, but not a world changing watershed moment.

theandrewbailey · a year ago
100% agreed. I might put my thoughts about it into an essay and name it "Breaking barriers considered harmful".
jp57 · a year ago
Was there actually a barrier at exascale? I mean, was this like the sound barrier in flight where there is some discontinuity that requires some qualitatively different approaches? Or is the headline just a fancy way of saying, "look how big/fast it is!"
magicalhippo · a year ago
One thing is having a bunch of computers. Another thing is to get them working efficiently on the same problem.

While I'm not sure exascale was something like the sound barrier, I do know a lot of hard work has been done to be able to efficiently utilize such large clusters.

The interconnects and network topology especially can make a huge difference in efficiency[1], and Cray's Slingshot interconnect[2], used in Aurora[3], is an important part of that[4].

[1]: https://www.hpcwire.com/2019/07/15/super-connecting-the-supe...

[2]: https://www.nextplatform.com/2022/01/31/crays-slingshot-inte...

[3]: https://www.alcf.anl.gov/aurora

[4]: https://arxiv.org/abs/2008.08886

hi-v-rocknroll · a year ago
Diagonalization of working set and scaling-up and -out coordination. Some programs (algorithms) just have >= O(n) time and space, temporal-dependent "map" or "reduce" steps that require enormous amounts of "shared" storage and/or "IPC".


tame3902 · a year ago
It isn't comparable to the sound barrier, but it was still a challenge.

It took significantly longer than it should have if it was just business as usual: "At a supercomputing conference in 2009, Computerworld projected exascale implementation by 2018." [1]

We got the first true exascale system with Frontier in 2022.

Part of the problem was the power consumption, and the difficulty of keeping a purely CPU-based system online, for an exascale job. From slide 12 of [2]: "Aggressive design is at 70 MW" and "HW failure every 35 minutes".
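
That failure rate falls out of simple series-reliability arithmetic; a sketch with purely illustrative assumptions (the per-node MTBF and node count are mine, not from the slides):

  # With N parts in series, system MTBF ~ per-part MTBF / N.
  node_mtbf_hours = 25 * 365 * 24            # assume ~25 years MTBF per node
  n_nodes = 375_000                          # assume a huge CPU-only machine

  system_mtbf_min = node_mtbf_hours / n_nodes * 60
  print(f"~{system_mtbf_min:.0f} minutes between hardware failures")   # ~35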

[1]: https://en.wikipedia.org/wiki/Exascale_computing

[2]: https://wgropp.cs.illinois.edu/bib/talks/tdata/2011/aramco-k...

colechristensen · a year ago
It's a milestone, not a barrier.
chungy · a year ago
Barrier sounds cooler in a press release than "we made it fast enough to surpass an arbitrary benchmark"
saltcured · a year ago
Yeah. Back in the 90s, "terascale" was the same kind of milestone buzzword that was being thrown around all the time.

Because of that, I felt a bit of nostalgia when I first saw consumer-accessible GPUs hitting the 1 TFLOP performance level, which now I suppose qualifies as a cheap iGPU.

AzzyHN · a year ago
It's like a speedrunning barrier
JonChesterfield · a year ago
On target to be somewhat slower than Frontier at double the power consumption.
dralley · a year ago
And very, very late
edward28 · a year ago
Just use more efficiency cores that aren't efficient.
gerdesj · a year ago
"and is the fastest AI system in the world dedicated to AI for open science"

Cool. Please ask it to sue for peace in several parts of the world, in an open way. Whilst it is at it, get it to work out how to be realistically carbon neutral.

I'm all in favour of willy waving when you have something to wave but in the end this beast will not favour humanity as a whole. It will suck up useful resources and spit out some sort of profit somewhere for someone else to enjoy.

cmdrk · a year ago
They’re simply latching onto the AI buzzwords for the good press. Leadership class HPCs have been designed around GPUs for over a decade now, it just so happens they can use those GPUs to run the AI models in addition to the QCD or Black hole simulations etc that they’ve been doing for ages.
kkielhofner · a year ago
It's much more than that.

For example, I have an "AI" project on Frontier. The process was remarkably simple and easy: a couple of Google Meets, a two-page screed on what we're doing, a couple of forms, and a key fob showed up. The entire process took about a month, and a good chunk of that was them waiting on me.

Probably half a day's work total for 20k node hours (four MI250x per node) on Frontier, for free, which is an incredible amount of compute my early, resource-constrained startup would never have been able to fathom on some cloud, etc. It was like pulling teeth to even get a single H100 x8 on GCP, which would cost at least $250k for what we're doing. And that's with "knowing people" there...

These press releases talking about AI are intended to encourage these kinds of applications and partnerships. It's remarkable to me how many orgs, startups, etc don't realize these systems exist (or even consider them) and go out and spend money or burn "credits" that could be applied to more suitable things that make more sense.

They're saying "Hey, just so you know these things can do AI too. Come talk to us."

As an added bonus you get to say you're working with a national lab on the #1 TOP500 supercomputer in the world. That has remarkable marketing, PR, and clout value well beyond "yeah we spent X$ on $BIGCLOUD just like everyone else".

alephnerd · a year ago
Nvidia's entire DGX and Maxwell product lines were subsidized by Aurora's precursor, and Nvidia worked very closely with Argonne to solve a number of problems in GPU concurrency.

A lot of the foundational models used today were trained on Aurora and its predecessors, which also drove tangential research such as containerization (e.g. in the early 2010s, a joint research project between ANL's Computing team, one of the world's largest Pharma companies, and Nvidia became one of the largest customers of Docker and sponsored a lot of its development).

melling · a year ago
We are in an age of incessant whining


doctor_eval · a year ago
…and, apparently, unintended irony.
mperham · a year ago
Every single paragraph contains the word "AI".
gary_0 · a year ago
According to Wikipedia[0] it uses 38.7 MW of power, beating Fugaku (29.9 MW) to be #1 in the TOP500 for power consumption.

[0] https://en.wikipedia.org/wiki/Aurora_(supercomputer)