More context: this is related to today's release of the Spring Top 500 list (https://news.ycombinator.com/item?id=40346788). Aurora is rated at 1,012.00 PetaFLOPS/second Rmax and is in 2nd place, behind Frontier.
In the November 2023 list, Aurora was also in second place, with an Rmax of 585.34 PetaFLOPS/second.
See https://www.top500.org/system/180183/ for the specs on Aurora, and https://www.top500.org/system/180047/ for the specs on Frontier.
See https://www.top500.org/project/top500_description/ and https://www.top500.org/project/linpack/ for a description of Rmax and the LINPACK benchmark, by which supercomputers are generally ranked. The Top 500 list only includes supercomputers that are able to run the LINPACK benchmark, and where the owner is willing to publish the results.
The jump in Aurora's Rmax score is explained by Aurora's difficult birth. https://morethanmoore.substack.com/p/5-years-late-only-2 (published when the November 2023 list came out) has a good explanation of what's been going on.
What do those kinds of upgrades entail from a hardware side? Software side? Is this just a horizontal scaling of a cluster?
See the last paragraph of my post for a link to more info.
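For anyone curious what the LINPACK (HPL) benchmark behind Rmax actually measures: it times the solution of a single big dense linear system Ax = b and converts the elapsed time into a FLOPS figure using a fixed operation count. A toy single-node sketch in Python/NumPy (the problem size here is tiny and this is nothing like a tuned, distributed HPL run):

  import time
  import numpy as np

  n = 4000                                 # toy size; real HPL runs use n in the millions
  rng = np.random.default_rng(0)
  A = rng.standard_normal((n, n))
  b = rng.standard_normal(n)

  t0 = time.perf_counter()
  x = np.linalg.solve(A, b)                # LU factorization + triangular solves, like HPL
  elapsed = time.perf_counter() - t0

  flops = (2 / 3) * n**3 + 2 * n**2        # operation count HPL uses for its FLOPS figure
  print(f"residual: {np.linalg.norm(A @ x - b):.2e}")
  print(f"achieved: {flops / elapsed / 1e9:.1f} GFLOP/s on this machine")

Rmax is the best sustained figure a machine achieves on that kind of run at full scale, versus Rpeak, the theoretical maximum.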
Looking at the two specs, it's interesting to see how Frontier (the first, running AMD CPUs) has much better power efficiency than Aurora (the second, running Intel): 18.89 kW/PFLOPS vs 38.24 kW/PFLOPS respectively... Good advertisement for AMD? :)
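That ratio is just the reported power draw divided by Rmax; a quick check, using approximate figures as I read them off the Top500 pages linked above (treat the MW numbers as ballpark):

  # Rmax in PFLOP/s, reported power in MW (approximate figures from the Top500 pages).
  systems = {
      "Frontier": (1206.0, 22.78),
      "Aurora":   (1012.0, 38.70),
  }
  for name, (rmax_pflops, power_mw) in systems.items():
      print(f"{name:8s}: {power_mw * 1000 / rmax_pflops:5.2f} kW/PFLOPS")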
These days this is true from top to bottom: desktop, servers, ... Even in gaming, the 7800X3D is cheaper than the 14700K; it is also more performant, and yet it uses roughly 20% less power at idle, and the gap only grows at full load.
AMD's current architecture is very power efficient, and Intel has more or less resorted to overfeeding watts to catch back up in performance.
That being said, note that the software is also different on the two computers.
Note that all mentions of FLOPS in this thread refer to FP64 (double precision), unlike the more popular "AI OPS" figures quoted for modern GPUs, which are typically INT8.
These systems are used for training, which is VERY rarely INT8. On Frontier, for example, it's recommended to use bfloat16, or float32 if that doesn't work for you/your application.
Nvidia has had FP8 since Hopper, and supposedly the AMD MI300 has it as well, although I have no experience with the MI300 so I can't speak to that.
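A quick way to see the trade-off behind that recommendation: bfloat16 keeps float32's exponent range (max) but gives up most of the mantissa (eps), while FP64 - what the Top500 FLOPS above refer to - keeps far more of both. A small sketch, assuming PyTorch is available:

  import torch

  # Compare dynamic range (max) and relative precision (eps) of the formats discussed above.
  for dt in (torch.float64, torch.float32, torch.bfloat16, torch.float16):
      fi = torch.finfo(dt)
      print(f"{str(dt):15s} eps={fi.eps:9.3e}  max={fi.max:10.3e}")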
I'd actually be interested in an estimate of the world's overall flop/s^2. Could someone please run a back of the envelope calculation for me, e.g. looking at last year's data?
Yeah, the top500 pages cited use Flop/s (apparently using "Flop" for "Floating point operations" - not sure which "o" and "p" are used). I could swear I've seen it written FLOPS and expanded specifically as "FLoating point Operations Per Second" when I first encountered it. "FLOPS/s" seems to be using "FLOPS" like the "Flop" above (probably as "FLoating point OPerationS"), in which case the "/s" makes sense.
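For the back-of-the-envelope request a couple of comments up: whatever you call the unit, the calculation is just a finite difference between two aggregate figures. A template with placeholder numbers (these are NOT real values; substitute the aggregate Rmax from two successive Top500 lists, which is of course only a lower bound on the world's overall compute):

  r_old = 7.0e18                          # PLACEHOLDER: aggregate Rmax of previous list, flop/s
  r_new = 8.0e18                          # PLACEHOLDER: aggregate Rmax of current list, flop/s
  dt_seconds = 0.5 * 365.25 * 24 * 3600   # roughly six months between lists
  print(f"~{(r_new - r_old) / dt_seconds:.2e} flop/s^2")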
Serious question: my understanding of HPC is that there are many workloads running on a given supercomputer at any time. There is no single workload that takes up all or most of the resources of a supercomputer.
Is my understanding correct? If yes, then why is it important to build supercomputers with more and more compute? Wouldn't it be better to build smaller systems that focus more on power/cost/space efficiency?
There are many variables that go into supercomputers, of which "company/country propaganda" is just one.
Supercomputer admins would love to have a single code that used the whole machine, both the compute elements and the network elements, at close to 100%. In fact they spend a significant fraction on network elements to unblock the compute elements, but few codes are really so light on networking that the program scales to the full core count of the machine. So, instead they usually have several codes which can scale up to a significant fraction of the machine and then backfill with smaller jobs to keep the utilization up (because the acquisition cost and the running cost are so high).
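A toy illustration of that backfilling idea (node counts are made up; real schedulers like Slurm also weigh requested wall time, priorities, and reservations):

  TOTAL_NODES = 1000
  big_job = 800                              # hypothetical code that scales to most of the machine
  small_queue = [64, 32, 128, 16, 8, 48, 4]  # hypothetical backfill candidates

  free = TOTAL_NODES - big_job
  running = [big_job]
  for job in small_queue:
      if job <= free:                        # greedy backfill: start anything that fits right now
          running.append(job)
          free -= job

  print(f"running jobs: {running}")
  print(f"utilization : {sum(running) / TOTAL_NODES:.0%}")

The big code gets its allocation, and the small jobs soak up the nodes it can't use.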
Supercomputers have limited utility - beyond country bragging rights, only a few problems really justify spending this kind of resource. I intentionally switched my own research in molecular dynamics away from supercomputers (where I'd run one job on 64-128 processors for a 96X speedup) to closet clusters, where I'd run 128 independent jobs for a 128X speedup, but then have to do a bunch of post-processing to make the results comparable to the long, large runs on the supercomputer (https://research.google/pubs/cloud-based-simulations-on-goog...). I actually was really relieved when my work no longer depended on expensive resources with little support, as my scientific productivity went up and my costs went way down.
I feel that supercomputers are good at one thing: if you need to make your country's flagship submarine about 10% faster/quieter than the competition.
The GPUs used to train modern AI models only existed because the DoE explicitly worked with Nvidia on a decade-long roadmap for delivery in its various supercomputers, and would often work in tandem with private sector players to coordinate purchases and R&D (for example, protein folding and just about every Big Pharma company).
Hell, the only reason AMD EPYC exists is for the same reason.
> I feel that supercomputers are good at one thing: if you need to make your country's flagship submarine about 10% faster/quieter than the competition.
Or doing Numerical Weather Prediction. :-)
But seriously, as a cluster sysadmin, the “128 jobs, followed by post-processing” is great for me, because it lets those separate jobs be scheduled as soon as resources are available.
> expensive resources with little support
Unfortunately, there isn't as much funding available in places for user training and consultation. Good writing & education is a skill, and folks aren't always interested in a job that is term-limited, or whose future is otherwise unclear.
Thanks for the great reply and for linking the research paper - I am going to check it out. Could you suggest a good review paper for someone new to get into modern HPC?
Most of the time yes, HPC systems are shared among many users. Sometimes though the whole system (or near it) will be used in a single run. These are sometimes referred to as "hero runs" and while they're more common for benchmarking and burn-ins there are some tightly-coupled workloads that perform well in that style of execution. It really depends on a number of factors like the workloads being run, the number of users, and what the primary business purpose of the HPC resource is. Sites that have to run both types of jobs will typically allow any user to schedule jobs most of the time but then pre-reserve blocks of time for hero runs to take place where other user jobs are held until the primary scheduled run is over.
The OLCF Frontier user guide[1] has some information on scheduling and Frontier specific quirks (very minor).
Current status of jobs on Frontier:

  [kkielhofner@login11.frontier ~]$ squeue -h -t running -r | wc -l
  137
  [kkielhofner@login11.frontier ~]$ squeue -h -t pending -r | wc -l
  1016

The running jobs are relatively low because there are some massive jobs using a significant number of nodes ATM.
[0] - https://slurm.schedmd.com/documentation.html
[1] - https://docs.olcf.ornl.gov/systems/frontier_user_guide.html
EDIT: I give up on HN code formatting
Just FYI: https://news.ycombinator.com/formatdoc
> Text after a blank line that is indented by two or more spaces is reproduced verbatim. (This is intended for code.)
> There is no singular workload that takes up the entire or most of the resources of a supercomputer.
I performed molecular dynamics simulations on the Titan supercomputer at ORNL during grad school. At the time, this supercomputer was the fastest in the world.
At least back then around 2012, ORNL really wanted projects that uniquely showcased the power of the machine. Many proposals for compute time were turned down for workloads that were "embarrassingly parallel" because these computations could be split up across multiple traditional compute clusters. However, research that involved MD simulations or lattice QCD required the fast interconnect and the large amount of memory that Titan had, so these efforts were more likely to be approved.
The lab did in fact want projects that utilized the whole machine at once to take maximum advantage of its capabilities. It’s just that oftentimes this wasn’t possible, and smaller jobs would be slotted into the “gaps” between the bigger ones.
Yes
> why is it important to build supercomputers with more and more compute
A mix of
- research in distributed systems (there are plenty of open questions in Concurrency, Parallelization, Computer Architecture, etc)
- a way to maintain an ecosystem of large vendors (Intel, AMD, Nvidia and plenty of smaller vendors all get a piece of the pie to subsidize R&D)
- some problems are EXTREMELY computationally and financially expensive, so they require large On-Prem compute capabilities (eg. Protein folding, machine learning when I was in undergrad [DGX-100s were subsidized by Aurora], etc)
- some problems are extremely sensitive for national security reasons and it's best to keep all personnel in a single region (eg. Nuclear simulations, turbine simulations, some niche ML work, etc)
In reality you need to do both, and planners know this; they have known it for decades.
You are generally correct; however, there are workloads that do use larger portions of a supercomputer and that wouldn't be feasible on smaller systems.
Also, I guess I'm not sure what you mean by "smaller systems that focus more on power/cost/space". A proper queueing system generally allocates the resources of a large supercomputer efficiently across smaller tasks, while also making larger tasks possible in the first place. And I imagine there's somewhat of an economy of scale in a large installation like this.
There are, of course, many many smaller supercomputers, such as at most medium to large universities. But even those often have 10-50k cores or so.
(In general, efficiency is a consideration when building/running, but not when using. Scientists want the most computational power they can get, power usage be damned :) )
edit: A related topic is capacity vs. capability: https://en.wikipedia.org/wiki/Supercomputer#Capability_versu...
In my experience they are running whole-cluster dedicated jobs quite frequently.
Climate models can use whatever resources they get, and nuclear weapons modelling, especially for old warheads, can use a lot.
What is being calculated with nuclear weapons? I understand it must have been computationally expensive to get them working, but once completed, what is there left to calculate?
Bigger systems when utilized at 100% are more efficient than multiple smaller systems when utilized at 100%, in terms of engineering work, software, etc.
But also, bigger systems have more opportunities to achieve higher utilization than smaller systems, due to the dynamics of the bin packing problem.
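A toy first-fit illustration of that point: the same (made-up) job mix is offered to one large system and to several small systems with the same total node count, and the single large system typically packs tighter because the leftover capacity isn't fragmented across machines:

  import random

  random.seed(1)
  jobs = [random.randint(300, 450) for _ in range(100)]   # made-up job sizes in nodes

  def utilization(cluster_sizes):
      """First-fit: place each job in the first cluster with enough free nodes."""
      free = list(cluster_sizes)
      used = 0
      for job in jobs:
          for i, f in enumerate(free):
              if job <= f:
                  free[i] -= job
                  used += job
                  break
      return used / sum(cluster_sizes)

  print(f"one 4000-node system  : {utilization([4000]):.0%}")
  print(f"eight 500-node systems: {utilization([500] * 8):.0%}")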
In general there can be many smaller workloads running in parallel. However, periodically the whole supercomputer can be reserved for a "hero" run.
Off topic but "Break the ___ barrier" has got to be my least favorite expression that PR people love. It's a "tell" that an article was written as meaningless pop science fluff instead of anything serious. The sound barrier is a real physical phenomenon. This is not. There's no barrier! Nothing was broken!
The comment isn't saying that the benchmark isn't useful. They are saying that there is no 'barrier' to be broken.
The sound barrier was relevant because there were significant physical effects to overcome specifically when going trans-sonic. It wasn't a question of just adding more powerful engines to existing aircraft. They didn't choose the sound barrier because it's a nice number; it was a big deal because all sorts of things behaved outside of their understanding of aerodynamics at that point. People died in the pursuit of understanding the sound barrier.
The 'exascale barrier', afaict, is just another number chosen specifically because it is a round(ish) number. It didn't turn computer scientists into smoking holes in the desert when it went wrong. This is an incremental improvement in an incredible field, but not a world changing watershed moment.
Was there actually a barrier at exascale? I mean, was this like the sound barrier in flight where there is some discontinuity that requires some qualitatively different approaches? Or is the headline just a fancy way of saying, "look how big/fast it is!"
It took 15-20 years to reach this point [0]
A lot of innovations in the GPU and distributed ML space were subsidized by this research.
Concurrent and Parallel Computing are VERY hard problems.
[0] - http://helper.ipam.ucla.edu/publications/nmetut/nmetut_19423...
One thing is having a bunch of computers. Another thing is to get them working efficiently on the same problem.
While I'm not sure exascale was something like the sound barrier, I do know a lot of hard work has been done to be able to efficiently utilize such large clusters.
Especially the interconnects and network topology can make a huge difference in efficiency[1], and Cray's Slingshot interconnect[2], used in Aurora[3], is an important part of that[4].
[1]: https://www.hpcwire.com/2019/07/15/super-connecting-the-supe...
[2]: https://www.nextplatform.com/2022/01/31/crays-slingshot-inte...
[3]: https://www.alcf.anl.gov/aurora
[4]: https://arxiv.org/abs/2008.08886
Diagonalization of the working set, plus coordinating scaling up and scaling out. Some programs (algorithms) just have >= O(n) time and space, with temporally dependent "map" or "reduce" steps that require enormous amounts of "shared" storage and/or IPC.
It isn't comparable to the sound barrier, but it was still a challenge.
It took significantly longer than it would have if it were just business as usual: "At a supercomputing conference in 2009, Computerworld projected exascale implementation by 2018." [1]
We got the first true exascale system with Frontier in 2022.
Part of the problem was the power consumption, and the reliability, of keeping a purely CPU-based system online for an exascale job. From slide 12 of [2]: "Aggressive design is at 70 MW" and "HW failure every 35 minutes".
[1]: https://en.wikipedia.org/wiki/Exascale_computing
[2]: https://wgropp.cs.illinois.edu/bib/talks/tdata/2011/aramco-k...
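As a side note on figures like "HW failure every 35 minutes": assuming independent failures, the system-level MTBF is roughly the per-node MTBF divided by the node count, so at exascale node counts even very reliable parts fail constantly. Illustrative numbers only (not the ones from the cited slides):

  node_mtbf_years = 10.0      # hypothetical per-node mean time between failures
  n_nodes = 150_000           # hypothetical node count for a CPU-only exascale design
  system_mtbf_minutes = node_mtbf_years * 365.25 * 24 * 60 / n_nodes
  print(f"system MTBF ~= {system_mtbf_minutes:.0f} minutes")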
Yeah. Back in the 90s, "terascale" was the same kind of milestone buzzword that was being thrown around all the time.
Because of that, I felt a bit of nostalgia when I first saw consumer-accessible GPUs hitting the 1 TFLOP performance level, which now I suppose qualifies as a cheap iGPU.
"and is the fastest AI system in the world dedicated to AI for open science"
Cool. Please ask it to sue for peace in several parts of the world, in an open way. Whilst it is at it, get it to work out how to be realistically carbon neutral.
I'm all in favour of willy waving when you have something to wave but in the end this beast will not favour humanity as a whole. It will suck up useful resources and spit out some sort of profit somewhere for someone else to enjoy.
They’re simply latching onto the AI buzzwords for the good press. Leadership class HPCs have been designed around GPUs for over a decade now, it just so happens they can use those GPUs to run the AI models in addition to the QCD or Black hole simulations etc that they’ve been doing for ages.
For example, I have an "AI" project on Frontier. The process was remarkably simple and easy - a couple of Google Meets, a two page screed on what we're doing, a couple of forms, key fob showed up. Entire process took about a month and a good chunk of that was them waiting on me.
Probably half a day's work total for 20k node hours (four MI250x per node) on Frontier, for free, which is an incredible amount of compute my early resource-constrained startup would never have been able to even fathom on some cloud, etc. It was like pulling teeth to even get a single H100 x8 on GCP for what would cost at least $250k for what we're doing. And that's with "knowing people" there...
These press releases talking about AI are intended to encourage these kinds of applications and partnerships. It's remarkable to me how many orgs, startups, etc don't realize these systems exist (or even consider them) and go out and spend money or burn "credits" that could be applied to more suitable things.
They're saying "Hey, just so you know these things can do AI too. Come talk to us."
As an added bonus you get to say you're working with a national lab on the #1 TOP500 supercomputer in the world. That has remarkable marketing, PR, and clout value well beyond "yeah we spent X$ on $BIGCLOUD just like everyone else".
Nvidia's entire DGX and Maxwell product lines were subsidized by Aurora's precursor, and Nvidia worked very closely with Argonne to solve a number of problems in GPU concurrency.
A lot of the foundational models used today were trained on Aurora and its predecessors, and a lot of tangential research, such as containerization, happened there too (e.g. in the early 2010s, a joint research project between ANL's Computing team, one of the world's largest Pharma companies, and Nvidia became one of the largest customers of Docker and sponsored a lot of its development).