EdSchouten commented on I made my own Git   tonystr.net/blog/git_immi... · Posted by u/TonyStr
oconnor663 · 13 days ago
As far as I know, most CDC schemes require a single-threaded pass over the whole file to find the chunk boundaries? (You can try to "jump to the middle", but usually there's an upper bound on chunk length, so you might need to backtrack depending on what you learn later about the last chunk you skipped?) The more cores you have, the more of a bottleneck that becomes.
EdSchouten · 12 days ago
You can always use a divide-and-conquer strategy to compute the chunks. Chunk both halves of the file independently. Once that’s done, redo the chunking from around the midpoint of the file forward, until the boundaries start to match the chunks obtained previously.
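
A rough sketch of that reconciliation in Go (hypothetical code; nextBoundary stands in for one step of whatever CDC scheme is in use, e.g. FastCDC or MaxCDC):

    package chunking

    import (
        "slices"
        "sync"
    )

    // nextBoundary stands in for one step of the CDC algorithm in use:
    // given data that starts at a chunk boundary, it returns the offset
    // of the next boundary (or len(data) if the input ends first).
    func nextBoundary(data []byte) int {
        // ... FastCDC, MaxCDC, etc. ...
        return len(data)
    }

    // chunkAll records every boundary found when chunking data, whose
    // first byte sits at absolute offset base within the file.
    func chunkAll(data []byte, base int64, out *[]int64) {
        for len(data) > 0 {
            n := nextBoundary(data)
            base += int64(n)
            *out = append(*out, base)
            data = data[n:]
        }
    }

    // parallelChunk chunks both halves of the file concurrently, then
    // re-chunks across the midpoint until the boundaries line up with
    // those already computed for the right half.
    func parallelChunk(data []byte) []int64 {
        mid := len(data) / 2
        var left, right []int64
        var wg sync.WaitGroup
        wg.Add(2)
        go func() { defer wg.Done(); chunkAll(data[:mid], 0, &left) }()
        go func() { defer wg.Done(); chunkAll(data[mid:], int64(mid), &right) }()
        wg.Wait()

        // Boundaries strictly inside the left half only depend on the
        // bytes before them, so they are valid as-is. The final
        // "boundary" at mid is an artifact of cutting the file there.
        out := left
        if len(out) > 0 {
            out = out[:len(out)-1]
        }

        // Redo the chunking from the last valid boundary, across the
        // midpoint, until we land on a boundary the right half also
        // found. From then on its remaining boundaries are identical.
        pos := int64(0)
        if len(out) > 0 {
            pos = out[len(out)-1]
        }
        for pos < int64(len(data)) {
            pos += int64(nextBoundary(data[pos:]))
            out = append(out, pos)
            if i, ok := slices.BinarySearch(right, pos); ok {
                return append(out, right[i+1:]...)
            }
        }
        return out
    }

Applied recursively, this spreads the work over all cores while still producing exactly the same boundaries as a single sequential pass.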
EdSchouten commented on I made my own Git   tonystr.net/blog/git_immi... · Posted by u/TonyStr
jrockway · 13 days ago
sha256 is a very slow algorithm, even with hardware acceleration. BLAKE3 would probably make a noticeable performance difference.

Some reading from 2021: https://jolynch.github.io/posts/use_fast_data_algorithms/

It is really hard to describe how slow sha256 is. Go sha256 some big files. Do you think it's disk IO that's making it take so long? It's not, you have a super fast SSD. It's sha256 that's slow.
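
A quick way to measure this yourself (a sketch; the 1 GiB figure is arbitrary):

    package main

    import (
        "crypto/sha256"
        "fmt"
        "time"
    )

    func main() {
        buf := make([]byte, 64<<20) // 64 MiB block, hashed 16 times = 1 GiB
        h := sha256.New()
        start := time.Now()
        for i := 0; i < 16; i++ {
            h.Write(buf)
        }
        h.Sum(nil)
        secs := time.Since(start).Seconds()
        fmt.Printf("SHA-256: %.0f MB/s\n", float64(16*len(buf))/secs/1e6)
    }

Compare the number it prints against what your SSD can actually deliver.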

EdSchouten · 13 days ago
It depends on the architecture. On ARM64, SHA-256 tends to be faster than BLAKE3. The reason is that most modern ARM64 CPUs have native SHA-256 instructions, but lack an equivalent of AVX-512.

Furthermore, if your input files are large enough that parallelizing across multiple cores makes sense, then it's generally better to change your data model to eliminate such large inputs altogether.

For example, Git is somewhat primitive in that every file is a single object. In retrospect it would have been smarter to decompose large files into chunks using a Content Defined Chunking (CDC) algorithm and model each large file as a manifest of chunks. That way you get better deduplication, and the resulting chunks can be hashed in parallel even though the hash function itself is single-threaded.
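
For illustration, a minimal sketch of such a manifest (hypothetical types, not Git's actual object model), assuming the file has already been split into chunks:

    package manifest

    import (
        "crypto/sha256"
        "sync"
    )

    // ChunkRef identifies one chunk of a large file.
    type ChunkRef struct {
        Hash [sha256.Size]byte
        Size int64
    }

    // Manifest models a large file as an ordered list of chunks, so
    // that unchanged chunks deduplicate across file versions.
    type Manifest struct {
        Chunks []ChunkRef
    }

    // buildManifest hashes all chunks in parallel; the hash function
    // itself stays single-threaded.
    func buildManifest(chunks [][]byte) Manifest {
        refs := make([]ChunkRef, len(chunks))
        var wg sync.WaitGroup
        for i, c := range chunks {
            wg.Add(1)
            go func(i int, c []byte) {
                defer wg.Done()
                refs[i] = ChunkRef{Hash: sha256.Sum256(c), Size: int64(len(c))}
            }(i, c)
        }
        wg.Wait()
        return Manifest{Chunks: refs}
    }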

EdSchouten commented on Germany Forces Lexus to Remotely Kill Car Heating in Dead of Winter   gadgetreview.com/germany-... · Posted by u/josephcsible
EdSchouten · 19 days ago
That’s great! People who do that are often inconsiderate of how it affects others. First of all, it generates unnecessary noise, which is annoying for neighbors who are still trying to sleep. Pedestrians/cyclists also need to breathe those exhaust gases.
EdSchouten commented on Spinlocks vs. Mutexes: When to Spin and When to Sleep   howtech.substack.com/p/sp... · Posted by u/birdculture
EdSchouten · 2 months ago
I don’t understand why I would need to care about this. Can’t my operating system and/or pthread library sort this out by itself?
EdSchouten commented on CDC File Transfer   github.com/google/cdc-fil... · Posted by u/GalaxySnail
quotemstr · 4 months ago
I wonder whether there's a role for AI here.

(Please don't hurt me.)

AI turns out to be useful for data compression (https://statusneo.com/creating-lossless-compression-algorith...) and RF modulation optimization (https://www.arxiv.org/abs/2509.04805).

Maybe it'd be useful to train a small model (probably of the SSM variety) to find optimal chunking boundaries.

EdSchouten · 4 months ago
Yeah, that's true. Having some kind of chunking algorithm that's content/file format aware could make it work even better. For example, it makes a lot of sense to chunk source files at function/scope boundaries.

In my case I need to ensure that all producers of data use exactly the same algorithm, as I need to look up build cache results based on Merkle tree hashes. That's why I'm intentionally focusing on having algorithms that are not only easy to implement, but also easy to implement consistently. I think that MaxCDC implementation that I shared strikes a good balance in that regard.

EdSchouten commented on CDC File Transfer   github.com/google/cdc-fil... · Posted by u/GalaxySnail
rokkamokka · 4 months ago
What would you estimate the performance implications of using go-cdc instead of fastcdc in their cdc_rsync to be?
EdSchouten · 4 months ago
In my case I observed a ~2% reduction in data storage when attempting to store and deduplicate various versions of the Linux kernel source tree (see link above). But that also includes the space needed to store the original version.

If we take that out of the equation and only measure the size of the additional chunks being transferred, it's a reduction of about 3.4%. So it's not an order of magnitude difference, but not bad for a relatively small change.

EdSchouten commented on CDC File Transfer   github.com/google/cdc-fil... · Posted by u/GalaxySnail
Scaevolus · 4 months ago
This lookahead is very similar to the "lazy matching" used in Lempel-Ziv compressors! https://fastcompression.blogspot.com/2010/12/parsing-level-1...

Did you compare it to Buzhash? I assume gearhash is faster given the simpler per-iteration structure. (Also, rand/v2's seeded generators might be better for gear init than mt19937.)

EdSchouten · 4 months ago
Yeah, GEAR hashing is simple enough that I haven't considered using anything else.

Regarding the RNG used to seed the GEAR table: I don't think it actually makes that much of a difference. You only use it once to generate 2 KB of data (256 64-bit constants). My suspicion is that using some nothing-up-my-sleeve numbers (e.g., the first 16384 binary digits of π) would work just as well.
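
For instance, with math/rand/v2 (as suggested above; the two seed constants here are arbitrary placeholders):

    package gear

    import "math/rand/v2"

    // gearTable deterministically generates the 256 64-bit constants
    // (2 KiB) used by a GEAR rolling hash. All producers must use the
    // same seeds to obtain identical chunk boundaries.
    func gearTable() [256]uint64 {
        rng := rand.New(rand.NewPCG(0x0123456789abcdef, 0xfedcba9876543210))
        var table [256]uint64
        for i := range table {
            table[i] = rng.Uint64()
        }
        return table
    }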

EdSchouten commented on CDC File Transfer   github.com/google/cdc-fil... · Posted by u/GalaxySnail
EdSchouten · 4 months ago
I’ve also been doing lots of experimenting with Content Defined Chunking since last year (for https://bonanza.build/). One of the things I discovered is that the most commonly used algorithm, FastCDC (also used by this project), can be improved significantly by looking ahead. An implementation of that can be found here:

https://github.com/buildbarn/go-cdc
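
For reference, the GEAR inner loop that FastCDC-style chunkers build on looks roughly like this (a sketch of the general technique, not go-cdc's actual code; real chunkers also enforce minimum/maximum chunk sizes, and FastCDC adds normalized chunking on top):

    // findBoundary scans data with a GEAR rolling hash and declares a
    // chunk boundary where the masked hash bits are all zero.
    func findBoundary(data []byte, table *[256]uint64, mask uint64) int {
        var hash uint64
        for i, b := range data {
            hash = (hash << 1) + table[b]
            if hash&mask == 0 {
                return i + 1 // boundary just after this byte
            }
        }
        return len(data) // no boundary found; end the chunk at EOF
    }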

EdSchouten commented on Which NPM package has the largest version number?   adamhl.dev/blog/largest-n... · Posted by u/genshii
EdSchouten · 5 months ago
So 19494 is the largest? That's far lower than I expected. There's nobody out there who has put a date in a version number (e.g., 20250915)?
EdSchouten commented on Show HN: What country you would hit if you went straight where you're pointing   apps.apple.com/us/app/lea... · Posted by u/brgross
flowardnut · 6 months ago
I wrote one of these, but it only works for residents of San Marino.
EdSchouten · 6 months ago
Soon available for people in the Vatican as well?
