More context: this is related to today's release of the Spring Top 500 list (https://news.ycombinator.com/item?id=40346788). Aurora is rated at 1,012.00 PetaFLOPS/second Rmax and is in 2nd place, behind Frontier.
In the November 2023 list, Aurora was also in second place, with an Rmax of 585.34 PetaFLOPS/second.
See https://www.top500.org/system/180183/ for the specs on Aurora, and https://www.top500.org/system/180047/ for the specs on Frontier.
See https://www.top500.org/project/top500_description/ and https://www.top500.org/project/linpack/ for a description of Rmax and the LINPACK benchmark, by which supercomputers are generally ranked. The Top 500 list only includes supercomputers that are able to run the LINPACK benchmark, and where the owner is willing to publish the results.
The jump in Aurora's Rmax score is explained by Aurora's difficult birth. https://morethanmoore.substack.com/p/5-years-late-only-2 (published when the November 2023 list came out) has a good explanation of what's been going on.
What do those kinds of upgrades entail from a hardware side? Software side? Is this just a horizontal scaling of a cluster?
See the last paragraph of my post for a link to more info.
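For anyone curious what the LINPACK (HPL) benchmark behind Rmax actually measures: it times the solution of a single big dense linear system Ax = b and converts the elapsed time into a FLOPS figure using a fixed operation count. A toy single-node sketch in Python/NumPy (the problem size here is tiny and this is nothing like a tuned, distributed HPL run):

  import time
  import numpy as np

  n = 4000                                 # toy size; real HPL runs use n in the millions
  rng = np.random.default_rng(0)
  A = rng.standard_normal((n, n))
  b = rng.standard_normal(n)

  t0 = time.perf_counter()
  x = np.linalg.solve(A, b)                # LU factorization + triangular solves, like HPL
  elapsed = time.perf_counter() - t0

  flops = (2 / 3) * n**3 + 2 * n**2        # operation count HPL uses for its FLOPS figure
  print(f"residual: {np.linalg.norm(A @ x - b):.2e}")
  print(f"achieved: {flops / elapsed / 1e9:.1f} GFLOP/s on this machine")

Rmax is the best sustained figure a machine achieves on that kind of run at full scale, versus Rpeak, the theoretical maximum.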
Looking at the two specs, it's interesting to see how Frontier (the first, running AMD CPUs) has much better power efficiency than Aurora (the second, running Intel): 18.89 kW/PFLOPS vs 38.24 kW/PFLOPS respectively... Good advertisement for AMD? :)
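That ratio is just the reported power draw divided by Rmax; a quick check, using approximate figures as I read them off the Top500 pages linked above (treat the MW numbers as ballpark):

  # Rmax in PFLOP/s, reported power in MW (approximate figures from the Top500 pages).
  systems = {
      "Frontier": (1206.0, 22.78),
      "Aurora":   (1012.0, 38.70),
  }
  for name, (rmax_pflops, power_mw) in systems.items():
      print(f"{name:8s}: {power_mw * 1000 / rmax_pflops:5.2f} kW/PFLOPS")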
These days this is true from top to bottom: desktop, servers, ... Even in gaming, the 7800X3D is cheaper than the 14700K; it is also more performant, and yet it uses roughly 20% less power at idle, and the gap only grows at full load.
AMD's current architecture is very power efficient, and Intel has more or less resorted to overfeeding watts to catch back up in performance.
That being said, note that the software is also different on the two computers.
Note that all mentions of FLOPS in this thread refer to FP64 (double precision), unlike the more popular "AI OPS" figures quoted for modern GPUs, which are typically INT8.
These systems are used for training, which is VERY rarely INT8. On Frontier, for example, it's recommended to use bfloat16, or float32 if that doesn't work for you/your application.
Nvidia has had FP8 since Hopper, and supposedly the AMD MI300 has it as well, although I have no experience with the MI300 so I can't speak to that.
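A quick way to see the trade-off behind that recommendation: bfloat16 keeps float32's exponent range (max) but gives up most of the mantissa (eps), while FP64 - what the Top500 FLOPS above refer to - keeps far more of both. A small sketch, assuming PyTorch is available:

  import torch

  # Compare dynamic range (max) and relative precision (eps) of the formats discussed above.
  for dt in (torch.float64, torch.float32, torch.bfloat16, torch.float16):
      fi = torch.finfo(dt)
      print(f"{str(dt):15s} eps={fi.eps:9.3e}  max={fi.max:10.3e}")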
I'd actually be interested in an estimate of the world's overall flop/s^2. Could someone please run a back of the envelope calculation for me, e.g. looking at last year's data?
Yeah, the top500 pages cited use Flop/s (apparently using "Flop" for "Floating point operations" - not sure which "o" and "p" are used). I could swear I've seen it written FLOPS and expanded specifically as "FLoating point Operations Per Second" when I first encountered it. "FLOPS/s" seems to be using "FLOPS" like the "Flop" above (probably as "FLoating point OPerationS"), in which case the "/s" makes sense.
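For the back-of-the-envelope request a couple of comments up: whatever you call the unit, the calculation is just a finite difference between two aggregate figures. A template with placeholder numbers (these are NOT real values; substitute the aggregate Rmax from two successive Top500 lists, which is of course only a lower bound on the world's overall compute):

  r_old = 7.0e18                          # PLACEHOLDER: aggregate Rmax of previous list, flop/s
  r_new = 8.0e18                          # PLACEHOLDER: aggregate Rmax of current list, flop/s
  dt_seconds = 0.5 * 365.25 * 24 * 3600   # roughly six months between lists
  print(f"~{(r_new - r_old) / dt_seconds:.2e} flop/s^2")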
Serious question: my understanding of HPC is that there are many workloads running on a given supercomputer at any time. There is no single workload that takes up all or most of the resources of a supercomputer.
Is my understanding correct? If yes, then why is it important to build supercomputers with more and more compute? Wouldn't it be better to build smaller systems that focus more on power/cost/space efficiency?
There are many variables that go into supercomputers, of which "company/country propaganda" is just one.
Supercomputer admins would love to have a single code that used the whole machine, both the compute elements and the network elements, at close to 100%. In fact they spend a significant fraction on network elements to unblock the compute elements, but few codes are really so light on networking that the program scales to the full core count of the machine. So, instead they usually have several codes which can scale up to a significant fraction of the machine and then backfill with smaller jobs to keep the utilization up (because the acquisition cost and the running cost are so high).
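A toy illustration of that backfilling idea (node counts are made up; real schedulers like Slurm also weigh requested wall time, priorities, and reservations):

  TOTAL_NODES = 1000
  big_job = 800                              # hypothetical code that scales to most of the machine
  small_queue = [64, 32, 128, 16, 8, 48, 4]  # hypothetical backfill candidates

  free = TOTAL_NODES - big_job
  running = [big_job]
  for job in small_queue:
      if job <= free:                        # greedy backfill: start anything that fits right now
          running.append(job)
          free -= job

  print(f"running jobs: {running}")
  print(f"utilization : {sum(running) / TOTAL_NODES:.0%}")

The big code gets its allocation, and the small jobs soak up the nodes it can't use.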
Supercomputers have limited utility - beyond country bragging rights, only a few problems really justify spending this kind of resource. I intentionally switched my own research in molecular dynamics away from supercomputers (where I'd run one job on 64-128 processors for a 96X speedup) to closet clusters, where I'd run 128 independent jobs for a 128X speedup, but then have to do a bunch of post-processing to make the results comparable to the long, large runs on the supercomputer (https://research.google/pubs/cloud-based-simulations-on-goog...). I actually was really relieved when my work no longer depended on expensive resources with little support, as my scientific productivity went up and my costs went way down.
I feel that supercomputers are good at one thing: if you need to make your country's flagship submarine about 10% faster/quieter than the competition.
The GPUs used to train modern AI models only existed because the DoE explicitly worked with Nvidia on a decade-long roadmap for delivery in its various supercomputers, and would often work in tandem with private sector players to coordinate purchases and R&D (for example, protein folding and just about every Big Pharma company).
Hell, the only reason AMD EPYC exists is for the same reason.
> I feel that supercomputers are good at one thing: if you need to make your country's flagship submarine about 10% faster/quieter than the competition.
Or doing Numerical Weather Prediction. :-)
But seriously, as a cluster sysadmin, the “128 jobs, followed by post-processing” is great for me, because it lets those separate jobs be scheduled as soon as resources are available.
> expensive resources with little support
Unfortunately, there isn't as much funding available in places for user training and consultation. Good writing & education is a skill, and folks aren't always interested in a job that is term-limited, or whose future is otherwise unclear.
Thanks for the great reply and for linking the research paper - I am going to check it out. Could you suggest a good review paper for someone new to get into modern HPC?
Most of the time yes, HPC systems are shared among many users. Sometimes though the whole system (or near it) will be used in a single run. These are sometimes referred to as "hero runs" and while they're more common for benchmarking and burn-ins there are some tightly-coupled workloads that perform well in that style of execution. It really depends on a number of factors like the workloads being run, the number of users, and what the primary business purpose of the HPC resource is. Sites that have to run both types of jobs will typically allow any user to schedule jobs most of the time but then pre-reserve blocks of time for hero runs to take place where other user jobs are held until the primary scheduled run is over.
The OLCF Frontier user guide[1] has some information on scheduling and Frontier specific quirks (very minor).
Current status of jobs on Frontier:

  [kkielhofner@login11.frontier ~]$ squeue -h -t running -r | wc -l
  137
  [kkielhofner@login11.frontier ~]$ squeue -h -t pending -r | wc -l
  1016

The running jobs are relatively low because there are some massive jobs using a significant number of nodes ATM.
[0] - https://slurm.schedmd.com/documentation.html
[1] - https://docs.olcf.ornl.gov/systems/frontier_user_guide.html
EDIT: I give up on HN code formatting
Just FYI: https://news.ycombinator.com/formatdoc
> Text after a blank line that is indented by two or more spaces is reproduced verbatim. (This is intended for code.)
> There is no singular workload that takes up the entire or most of the resources of a supercomputer.
I performed molecular dynamics simulations on the Titan supercomputer at ORNL during grad school. At the time, this supercomputer was the fastest in the world.
At least back then around 2012, ORNL really wanted projects that uniquely showcased the power of the machine. Many proposals for compute time were turned down for workloads that were "embarrassingly parallel" because these computations could be split up across multiple traditional compute clusters. However, research that involved MD simulations or lattice QCD required the fast interconnect and the large amount of memory that Titan had, so these efforts were more likely to be approved.
The lab did in fact want projects that utilized the whole machine at once to take maximum advantage of its capabilities. It’s just that oftentimes this wasn’t possible, and smaller jobs would be slotted into the “gaps” between the bigger ones.
Yes
> why is it important to build supercomputers with more and more compute
A mix of
- research in distributed systems (there are plenty of open questions in Concurrency, Parallelization, Computer Architecture, etc)
- a way to maintain an ecosystem of large vendors (Intel, AMD, Nvidia and plenty of smaller vendors all get a piece of the pie to subsidize R&D)
- some problems are EXTREMELY computationally and financially expensive, so they require large On-Prem compute capabilities (eg. Protein folding, machine learning when I was in undergrad [DGX-100s were subsidized by Aurora], etc)
- some problems are extremely sensitive for national security reasons and it's best to keep all personnel in a single region (eg. Nuclear simulations, turbine simulations, some niche ML work, etc)
In reality you need to do both, and planners know this; they have known it for decades.
You are generally correct; however, there are workloads that do use larger portions of a supercomputer and that wouldn't be feasible on smaller systems.
Also, I guess I'm not sure what you mean by "smaller systems that focus more on power/cost/space". A proper queueing system generally allocates the resources of a large supercomputer efficiently across smaller tasks, while also making larger tasks possible in the first place. And I imagine there's somewhat of an economy of scale in a large installation like this.
There are, of course, many many smaller supercomputers, such as at most medium to large universities. But even those often have 10-50k cores or so.
(In general, efficiency is a consideration when building/running, but not when using. Scientists want the most computational power they can get, power usage be damned :) )
edit: A related topic is capacity vs. capability: https://en.wikipedia.org/wiki/Supercomputer#Capability_versu...
In my experience they are running whole-cluster dedicated jobs quite frequently.
Climate models can use whatever resources they get, and nuclear weapons modelling, especially for old warheads, can use a lot.
What is being calculated with nuclear weapons? I understand it must have been computationally expensive to get them working, but once completed, what is there left to calculate?
Bigger systems when utilized at 100% are more efficient than multiple smaller systems when utilized at 100%, in terms of engineering work, software, etc.
But also, bigger systems have more opportunities to achieve higher utilization than smaller systems, due to the dynamics of the bin packing problem.
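A toy first-fit illustration of that point: the same (made-up) job mix is offered to one large system and to several small systems with the same total node count, and the single large system typically packs tighter because the leftover capacity isn't fragmented across machines:

  import random

  random.seed(1)
  jobs = [random.randint(300, 450) for _ in range(100)]   # made-up job sizes in nodes

  def utilization(cluster_sizes):
      """First-fit: place each job in the first cluster with enough free nodes."""
      free = list(cluster_sizes)
      used = 0
      for job in jobs:
          for i, f in enumerate(free):
              if job <= f:
                  free[i] -= job
                  used += job
                  break
      return used / sum(cluster_sizes)

  print(f"one 4000-node system  : {utilization([4000]):.0%}")
  print(f"eight 500-node systems: {utilization([500] * 8):.0%}")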
In general there can be many smaller workloads running in parallel. However, periodically the whole supercomputer can be reserved for a "hero" run.
Off topic but "Break the ___ barrier" has got to be my least favorite expression that PR people love. It's a "tell" that an article was written as meaningless pop science fluff instead of anything serious. The sound barrier is a real physical phenomenon. This is not. There's no barrier! Nothing was broken!
The comment isn't saying that the benchmark isn't useful. They are saying that there is no 'barrier' to be broken.
The sound barrier was relevant because there were significant physical effects to overcome specifically when going trans-sonic. It wasn't a question of just adding more powerful engines to existing aircraft. They didn't choose the sound barrier because it's a nice number; it was a big deal because all sorts of things behaved outside of their understanding of aerodynamics at that point. People died in the pursuit of understanding the sound barrier.
The 'exascale barrier', afaict, is just another number chosen specifically because it is a round(ish) number. It didn't turn computer scientists into smoking holes in the desert when it went wrong. This is an incremental improvement in an incredible field, but not a world changing watershed moment.
Was there actually a barrier at exascale? I mean, was this like the sound barrier in flight where there is some discontinuity that requires some qualitatively different approaches? Or is the headline just a fancy way of saying, "look how big/fast it is!"
It took 15-20 years to reach this point [0]
A lot of innovations in the GPU and distributed ML space were subsidized by this research.
Concurrent and Parallel Computing are VERY hard problems.
[0] - http://helper.ipam.ucla.edu/publications/nmetut/nmetut_19423...
One thing is having a bunch of computers. Another thing is to get them working efficiently on the same problem.
While I'm not sure exascale was something like the sound barrier, I do know a lot of hard work has been done to be able to efficiently utilize such large clusters.
Especially the interconnects and network topology can make a huge difference in efficiency[1], and Cray's Slingshot interconnect[2], used in Aurora[3], is an important part of that[4].
[1]: https://www.hpcwire.com/2019/07/15/super-connecting-the-supe...
[2]: https://www.nextplatform.com/2022/01/31/crays-slingshot-inte...
[3]: https://www.alcf.anl.gov/aurora
[4]: https://arxiv.org/abs/2008.08886
Diagonalization of the working set, plus coordinating scaling up and scaling out. Some programs (algorithms) just have >= O(n) time and space, with temporally dependent "map" or "reduce" steps that require enormous amounts of "shared" storage and/or IPC.
It isn't comparable to the sound barrier, but it was still a challenge.
It took significantly longer than it would have if it were just business as usual: "At a supercomputing conference in 2009, Computerworld projected exascale implementation by 2018." [1]
We got the first true exascale system with Frontier in 2022.
Part of the problem was the power consumption, and the reliability, of keeping a purely CPU-based system online for an exascale job. From slide 12 of [2]: "Aggressive design is at 70 MW" and "HW failure every 35 minutes".
[1]: https://en.wikipedia.org/wiki/Exascale_computing
[2]: https://wgropp.cs.illinois.edu/bib/talks/tdata/2011/aramco-k...
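As a side note on figures like "HW failure every 35 minutes": assuming independent failures, the system-level MTBF is roughly the per-node MTBF divided by the node count, so at exascale node counts even very reliable parts fail constantly. Illustrative numbers only (not the ones from the cited slides):

  node_mtbf_years = 10.0      # hypothetical per-node mean time between failures
  n_nodes = 150_000           # hypothetical node count for a CPU-only exascale design
  system_mtbf_minutes = node_mtbf_years * 365.25 * 24 * 60 / n_nodes
  print(f"system MTBF ~= {system_mtbf_minutes:.0f} minutes")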
Yeah. Back in the 90s, "terascale" was the same kind of milestone buzzword that was being thrown around all the time.
Because of that, I felt a bit of nostalgia when I first saw consumer-accessible GPUs hitting the 1 TFLOP performance level, which now I suppose qualifies as a cheap iGPU.
"and is the fastest AI system in the world dedicated to AI for open science"
Cool. Please ask it to sue for peace in several parts of the world, in an open way. Whilst it is at it, get it to work out how to be realistically carbon neutral.
I'm all in favour of willy waving when you have something to wave but in the end this beast will not favour humanity as a whole. It will suck up useful resources and spit out some sort of profit somewhere for someone else to enjoy.
They’re simply latching onto the AI buzzwords for the good press. Leadership class HPCs have been designed around GPUs for over a decade now, it just so happens they can use those GPUs to run the AI models in addition to the QCD or Black hole simulations etc that they’ve been doing for ages.
For example, I have an "AI" project on Frontier. The process was remarkably simple and easy - a couple of Google Meets, a two page screed on what we're doing, a couple of forms, key fob showed up. Entire process took about a month and a good chunk of that was them waiting on me.
Probably half a day's work total for 20k node hours (four MI250x per node) on Frontier, for free, which is an incredible amount of compute my early resource-constrained startup would never have been able to even fathom on some cloud, etc. It was like pulling teeth to even get a single H100 x8 on GCP for what would cost at least $250k for what we're doing. And that's with "knowing people" there...
These press releases talking about AI are intended to encourage these kinds of applications and partnerships. It's remarkable to me how many orgs, startups, etc don't realize these systems exist (or even consider them) and go out and spend money or burn "credits" that could be applied to more suitable things.
They're saying "Hey, just so you know these things can do AI too. Come talk to us."
As an added bonus you get to say you're working with a national lab on the #1 TOP500 supercomputer in the world. That has remarkable marketing, PR, and clout value well beyond "yeah we spent X$ on $BIGCLOUD just like everyone else".
Nvidia's entire DGX and Maxwell product lines were subsidized by Aurora's precursor, and Nvidia worked very closely with Argonne to solve a number of problems in GPU concurrency.
A lot of the foundational models used today were trained on Aurora and its predecessors, and a lot of tangential research, such as containerization, happened there too (e.g. in the early 2010s, a joint research project between ANL's Computing team, one of the world's largest Pharma companies, and Nvidia became one of the largest customers of Docker and sponsored a lot of its development).