I found this super interesting - especially as all the data I've written code to manipulate has been small enough that I haven't needed to optimize my code, so I've never had to think in this direction.
I think my favorite part was the very first section, where he got baseline measurements with `cat`, `wc`, and friends. I wouldn't have thought to do that, and it's such an easy way to get perspective on what's "reasonable".
It also underscores just how insane raw disk bandwidth is on modern SSDs. Most software is not designed around a world where you have gigabytes per second of sequential scan on a laptop.
A few months ago, I had to quickly bang out a script to output about 20 million lines of text, each the output of a hash function. My naive solution took more than a few minutes; simple optimizations such as writing every 10k lines cut the time significantly. Threading would have helped quite a bit as well.
I’m kind of interested in the opposite problem: what is the simplest solution, using a well-known library or DB, that approaches the fastest hand-optimized solution to this problem?
Java is often slightly faster than Go: it has similar constructs (its Map implementations are older and better optimized), a perhaps better GC (also older and more optimized, though I don't think the GC is the challenge here), but slower startup times. So I'd say roughly the same as the idiomatic Go version?
Sounds very reasonable. In the blog post, about 20s were shaved off by assuming we don't need complicated string parsing. An off-the-shelf library can't make that assumption, so it will always have to pay the extra cost.
True in general, but some (especially libs aimed at larger datasets processed in batch) are taking advantage of benchmarks like this to do things like:
- Try the fast way; if it works, great
- Fall back to the slow way if the fast way fails
This makes the slow path 2x slower at worst (and you can advise always using the slow way via optional params), but the fast path can be 10x faster.
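This fast-path/fallback pattern can be sketched in Go against the 1BRC temperature field. The function names and the strict format assumption (`-?d?d.d`, as in the challenge data) are mine, not from any particular library:

```go
// Sketch of the fast-path/slow-path pattern: try a parser specialized for the
// strict 1BRC temperature format, fall back to the general parser on failure.
package main

import (
	"fmt"
	"strconv"
)

// parseTempFast returns the value in tenths of a degree, or ok=false when the
// input doesn't match the strict "d.d" / "dd.d" (optionally negative) format.
// For brevity it assumes the non-'.' bytes are digits, as in the challenge data.
func parseTempFast(b []byte) (tenths int, ok bool) {
	neg := false
	if len(b) > 0 && b[0] == '-' {
		neg = true
		b = b[1:]
	}
	switch {
	case len(b) == 3 && b[1] == '.': // "d.d"
		tenths = int(b[0]-'0')*10 + int(b[2]-'0')
	case len(b) == 4 && b[2] == '.': // "dd.d"
		tenths = int(b[0]-'0')*100 + int(b[1]-'0')*10 + int(b[3]-'0')
	default:
		return 0, false // anything else goes to the slow path
	}
	if neg {
		tenths = -tenths
	}
	return tenths, true
}

func parseTemp(s string) (float64, error) {
	if t, ok := parseTempFast([]byte(s)); ok {
		return float64(t) / 10, nil
	}
	return strconv.ParseFloat(s, 64) // slow path: full float grammar
}

func main() {
	for _, s := range []string{"12.3", "-99.9", "3.14159"} {
		v, err := parseTemp(s)
		fmt.Println(v, err)
	}
}
```

At worst the slow path pays one extra scan of the input; when the strict format holds, all the branching and error handling of the general parser is skipped.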
For anyone looking for more examples of 1BRC in Go, we had a friendly competition at work and collected the results here: https://github.com/dhartunian/1brcgo/
In addition to the loop-unrolling and bit-twiddling tricks that also show up in the fastest Java and C++ versions, some Go-specific things I learned were:
- `unsafe.Pointer` can be used to read memory without bounds checks
- many functions in the `bytes` and `bits` packages in the standard library are written in assembly
- `debug.SetGCPercent` and `debug.SetMemoryLimit` can be used to turn off GC
- `runtime.LockOSThread` locks a goroutine to an OS thread
- `print` is slightly faster than `fmt.Printf` (but writes to stderr)
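A minimal sketch of two of the tricks above, disabling GC and reading bytes through `unsafe.Pointer`; this is my own illustration, not code from the competition repo:

```go
// Demonstrates debug.SetGCPercent/SetMemoryLimit to turn off GC, and an
// unsafe.Pointer walk over a byte slice that skips per-element bounds checks.
package main

import (
	"fmt"
	"math"
	"runtime/debug"
	"unsafe"
)

// sumUnsafe adds up b by advancing a raw pointer (unsafe.Add, Go 1.17+)
// instead of indexing, so no bounds check runs per element. Assumes b is
// non-empty.
func sumUnsafe(b []byte) int {
	p := unsafe.Pointer(&b[0])
	total := 0
	for i := 0; i < len(b); i++ {
		total += int(*(*byte)(unsafe.Add(p, i)))
	}
	return total
}

func main() {
	debug.SetGCPercent(-1)              // disable the garbage collector
	debug.SetMemoryLimit(math.MaxInt64) // and lift the soft memory limit (Go 1.19+)
	fmt.Println(sumUnsafe([]byte("1brc")))
}
```

In practice the compiler often eliminates bounds checks in simple loops anyway, so this only pays off in hot paths where profiling shows the checks surviving.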
Oh, I'd missed those solutions, thanks. You guys got way more hardcore than I did -- nice work! Looking forward to reading the code for those solutions this week.
I think we all ended up using unsafe, though there were some solutions without mmap. It would have been interesting if we had adhered to the same constraints you did!
> I find this report confusing. Why does `if items[hashIndex].key == nil` show as taking 5.01s, but the call to `bytes.Equal` shows as only 390ms. Surely a slice lookup is much cheaper than a function call? If you are a Go performance expert and can help me interpret it, I’m all ears!
These two lines are both conditionals, so the time reported is sensitive to branch mispredictions. If the timings are not intuitive based on the complexity of the associated lines, then it may be explained by the data being not very predictable and the branch predictor having a bad time.
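For context, the two lines in question sit in an open-addressing probe loop. A reconstructed sketch (not the post's exact code; names and sizes are illustrative) looks roughly like this, with both conditionals executed for every record:

```go
// Reconstructed open-addressing lookup containing the two profiled lines.
// The first load of items[hashIndex] is where a cache miss would land, which
// can make that conditional look far more expensive than the bytes.Equal call.
package main

import (
	"bytes"
	"fmt"
)

type item struct {
	key []byte
	sum int
}

const numBuckets = 1 << 4 // power of two so we can mask instead of mod

// lookup finds or inserts key, probing linearly. Assumes the table never fills.
func lookup(items []item, key []byte, hash uint64) *item {
	hashIndex := int(hash & (numBuckets - 1))
	for {
		if items[hashIndex].key == nil { // empty slot: insert
			items[hashIndex].key = append([]byte(nil), key...)
			return &items[hashIndex]
		}
		if bytes.Equal(items[hashIndex].key, key) { // occupied: compare keys
			return &items[hashIndex]
		}
		hashIndex = (hashIndex + 1) & (numBuckets - 1) // linear probing
	}
}

func main() {
	items := make([]item, numBuckets)
	it := lookup(items, []byte("Lagos"), 12345)
	it.sum += 301
	fmt.Println(lookup(items, []byte("Lagos"), 12345).sum)
}
```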
Yeah, someone emailed me privately after they'd dug into this. They mentioned that "items[hashIndex]" was a significant source of cache misses. They used "perf record -e cache-misses" and found it was the largest source of cache misses. They also found (by digging into the assembly) that the "bytes.Equal" line has some sort of source-line attribution issue.
I love the nerdery around 1BRC. My axe to grind is that unless you do dangerous stuff, DBs are just as fast, less complicated, and more resilient to data updates than application code [0]. Do more in the database!
I agree with doing more in the database; you're closer to the data (disk, disk cache, L2 cache) than the application code is. At the same time, I get really nervous about doing work in the database, because you have to be really disciplined that the in-database code (functions, views, etc.) matches the code in source control, and that your testing/QA database contains all the same code and enough test data to actually exercise the performance bounds of that code.
With application code I can easily step through it in a debugger and verify the deployed code matches what's in the repo. It's more difficult to do in the database because it requires more organizational discipline.
Yeah, you need some different tools when working with data. I've been recommending dbt [0] as kind of an on-ramp for SREs to data work. Among other things, it allows you to keep your work in a VCS, and it has a testing framework.
I'm not sure this part of the TOS is valid in many jurisdictions. But there is a better reason to not publish benchmarks: they do not deserve free advertisement. We should just collectively forget they exist and use other tools.
It's worth noting that if you're messing around with large text files from the CLI, `awk`, `grep`, etc. will be an order of magnitude faster if you opt out of Unicode parsing.
I'm pretty confident adding LC_ALL=C to the awk solution would get it easily under a minute.
I just want to go on record that, given the simplistic (i.e. fun) problem, a shell developer would have had the answer to a first, specific set of a billion rows done while all the other languages were still putting on their shoes.
https://questdb.io/blog/billion-row-challenge-step-by-step/
https://github.com/gunnarmorling/1brc?tab=readme-ov-file#run...
The Python version:
https://github.com/gunnarmorling/1brc/blob/main/src/main/pyt...
It seems they are not running against the full dataset:
> Moving on to the 100 million file to see if size makes a difference.
One would also have to run both approaches on the same hardware for a meaningful comparison?
Update: for reference, Jason Chu's solution (https://github.com/dhartunian/1brcgo/blob/494eabd6ea958cc193...) seems to be the fastest on my machine, and runs in about 1.3s!
0: https://geraldonit.com/2024/01/31/1-billion-row-challenge-in...
[0]: getdbt.com
https://www.youtube.com/watch?v=10KEr3sEG80
I wanted to see how the temperature was changing over time for specific regions using a map-based interface. The following chart was particularly eye-opening:
https://www.youtube.com/watch?v=iEtvf9xzRB4&t=164s
The software won a couple of awards and was heavily optimized to produce reports in under a minute. Kudos to the author for getting a parse time of a billion records down to mere seconds.