Surely another case for the "you're doing it wrong" team at Nvidia.
But the decision to add Julia/CUDA.jl here was not the best idea for a comparison that otherwise has really interesting insights about Fortran vs. Julia. The CUDA part of the code is nearly a one-to-one copy of the MPI code. One section in particular[1] (there are several similar, comparable blocks) explains why the code does not perform very well.
For me the beginner's mistake is this: on a CPU any 'if' costs you about a clock cycle (branch misprediction aside); on a GPU it costs you a sync. In the best case the threads reconverge after a few loop iterations; more often some threads drift further and further out of sync until everyone has to wait for the last one. Worse, memory accesses stop coalescing and become blocking, and that costs all threads thousands of cycles.
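To make the divergence point concrete, here is a toy Python model of SIMT branch cost (the cycle numbers are made up for illustration, not taken from the paper): a warp of 32 threads executes in lockstep, so when threads disagree on a branch the hardware runs both sides serially with inactive threads masked off.

```python
# Toy model of SIMT branch divergence: a warp executes in lockstep,
# so a data-dependent if/else makes the warp pay for *every* path
# that at least one thread takes.

def warp_branch_cost(predicates, then_cost, else_cost):
    """Cycles for one warp to execute an if/else.

    predicates: per-thread booleans (which side each thread takes).
    If all threads agree, only one path is executed; otherwise the
    warp runs both paths back to back.
    """
    takes_then = any(predicates)            # someone enters the 'then' side
    takes_else = not all(predicates)        # someone enters the 'else' side
    return then_cost * takes_then + else_cost * takes_else

uniform = warp_branch_cost([True] * 32, then_cost=100, else_cost=100)
divergent = warp_branch_cost([i % 2 == 0 for i in range(32)], 100, 100)
print(uniform, divergent)  # 100 vs 200: the divergent warp pays for both paths
```

On real hardware the penalty compounds across nested branches and loop iterations, which is why a near-verbatim port of branchy MPI-style code tends to underperform on a GPU.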
OK, a not-very-representative tl;dr was provided elsewhere in the thread, so I'm adding mine here.
1. Serial CPU performance of unoptimized Julia was around 12x that of Python/Numpy.
2. Optimizing the Julia code gave another 10-12x improvement, for a total of ~140x over Python/Numpy.
(Good pointers on speeding up the Python code beyond Numpy are listed, but were not tried or benchmarked.)
3. Even though the GPU (an A100) has a listed peak of 9.7 TFlops, the code performed around the same as the optimized CPU version running on a machine with a listed peak of 2.5 TFlops across 64 cores. A surprising result; there are some pointers to why that might be.
4. Comparing Julia-MPI with Fortran-MPI: at low processor counts both match on total time; at middle counts (16-128) Fortran takes more time than Julia; at high counts Julia becomes slower than Fortran. Julia's fall-off is driven mainly by a less efficient implementation of the communication at higher processor counts: computation time scales well for both languages, but communication time takes over in the Julia case.
Some nice optimization tips are provided to speed up Julia code. No plans to use Julia yet but it is good for prototyping and Fortran is good enough for now.
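The communication behavior in point 4 fits the usual surface-to-volume argument for halo-exchange codes. A toy strong-scaling model (all constants below are invented for illustration, nothing is taken from the paper) shows why communication's share of runtime grows with the rank count, so any constant-factor inefficiency in the communication layer hurts most at high processor counts:

```python
# Toy strong-scaling model for a 2D halo-exchange code: on a fixed
# n x n grid split over p ranks, per-rank compute work shrinks like
# n*n/p while the halo each rank communicates shrinks only like
# n/sqrt(p), so communication's share of the runtime grows with p.
import math

def time_breakdown(n, p, flop_time=1e-9, byte_time=1e-8):
    compute = (n * n / p) * flop_time             # interior cells per rank
    comm = 4 * (n / math.sqrt(p)) * byte_time     # halo perimeter per rank
    return compute, comm

for p in (16, 64, 256, 1024):
    compute, comm = time_breakdown(4096, p)
    print(p, round(comm / (compute + comm), 2))   # comm fraction rises with p
```

The model says nothing about why MPI.jl's communication is slower than Fortran's, only why that difference becomes the dominant term exactly where the benchmark sees Julia fall behind.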
Yes, definitely very impressive results, especially considering how much of an improvement over the Python/Numpy code they could achieve. If you are writing something from scratch, that's a very good reason to work in Julia.
Is the communication issue something intrinsic to Julia, or something in the way the author of the code implemented it? (I guess MPI must expect arrays stored in exactly the same format Fortran uses, since MPI grew up in Fortran-and-C land; does Julia have to do some extra marshaling or something?)
The authors seem to be using Julia's MPI package [1], so it's not something they did on their own. I didn't see any discussion of why MPI.jl scales differently than Fortran-MPI; something for the MPI.jl developers to investigate. (As for marshaling: Julia arrays are contiguous and column-major, just like Fortran's, so plain arrays can be handed to MPI without copying.)
The key elements that I consistently see missing when comparing languages are the cost/effort of maintenance and how easy it is to write fast/slow code.
The former boils down to this: there's little point saving time writing new code when new versions of the language hinder the use of old code. Julia, being a very young language, simply does not have the legacy of backwards compatibility that Fortran has, so libraries written today are unlikely to work in 10 years' time, partly because of Julia's age and partly because backwards compatibility is not valued as much as it is in Fortran. The GNU Fortran compiler can still compile code written nearly 50 years ago (https://gcc.gnu.org/fortran/).
The second point is that benchmarks like this tend to focus on tuned implementations, which is really important, but another important aspect is how easy it is to write fast code. With Fortran it is very easy to write very fast code, since a lot is abstracted away from the programmer so they can focus on FORmula TRANslation. My experience with Julia has not been the same, and thus for people who don't have the time to perform such optimisations, Fortran may still be the better choice.
I do look forward to watching the development of Julia but for now I'll stick with using Fortran as the high performance numerical language in my tech stack.
I'll finish with something I've read but can't remember where:
Over the years there have been many examples of "X will replace Fortran", and the only thing that ends up getting replaced is X.
This is a little reductive. The speed comparison was between the highly parallel, optimized CPU version and the GPU, and the GPU's advertised FLOPS were only about 4x the CPU setup's. That the performance is so similar isn't all that surprising and could come down to low-level optimization details. There are likely areas of Julia's GPU stack that could be made better, but it's certainly already pretty performant.
It’s also comparing against Fortran. Their Python comparison is 100x+ slower. That a high level language even comes close is impressive.
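For the record, the "only 4x" figure is just the ratio of the peak numbers quoted upthread (9.7 TFlops for the A100, 2.5 TFlops for the 64-core CPU setup):

```python
# Ratio of the quoted peak FLOPS figures from the thread above.
gpu_peak_tflops = 9.7   # A100 listed peak
cpu_peak_tflops = 2.5   # 64-core CPU setup listed peak
print(round(gpu_peak_tflops / cpu_peak_tflops, 1))  # prints 3.9
```

So matching the well-tuned CPU code means the GPU port is running at roughly a quarter of the relative efficiency one might naively hope for, not at zero.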
Compiled high-level languages have been pretty fast in general for decades; it was only about 20 years ago that it became trendy to use interpreted scripting languages for this kind of work.
Anyone used to Modula-2, Object Pascal, QuickBasic, Clipper, AMOS, Lisp, Scheme, Dylan, Smalltalk, SELF, Oberon, VB, Delphi, C++, SML, Miranda, Haskell, or Prolog from the 1980s-1990s isn't going to be surprised by speedups over Python.
[1] https://github.com/robertstrauss/MPAS_Ocean_Julia/blob/main/...
Damn, it would have been a nice benchmark if they had done a comparison against Numba/Cython.
- Julia nice for prototyping
- but zero speed up from using GPU
- sometimes faster sometimes slower on CPU
- no plans to use at the moment