Some reading from 2021: https://jolynch.github.io/posts/use_fast_data_algorithms/
It is really hard to overstate how slow SHA-256 is. Go sha256 some big files. Do you think it's disk I/O that's making it take so long? It's not; you have a super fast SSD. It's SHA-256 that's slow.
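If you want to see it for yourself, here's a rough Go sketch that just times crypto/sha256 over a file (pass it something multi-gigabyte; the MB/s arithmetic is only illustrative):

    package main

    import (
        "crypto/sha256"
        "fmt"
        "io"
        "os"
        "time"
    )

    func main() {
        f, err := os.Open(os.Args[1])
        if err != nil {
            panic(err)
        }
        defer f.Close()

        h := sha256.New()
        start := time.Now()
        // The single-threaded hashing dominates here, not the read from disk.
        n, err := io.Copy(h, f)
        if err != nil {
            panic(err)
        }
        secs := time.Since(start).Seconds()
        fmt.Printf("%x  %.0f MB/s\n", h.Sum(nil), float64(n)/1e6/secs)
    }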
Furthermore, if your input files are large enough that parallelizing across multiple cores makes sense, then it's generally better to change your data model to eliminate the large inputs altogether.
For example, Git is somewhat primitive in that every file is a single object. In retrospect it would have been smarter to decompose large files into chunks using a Content-Defined Chunking (CDC) algorithm, and model large files as a manifest of chunks. That way you get better deduplication. The resulting chunks can then be hashed in parallel, each with a single-threaded algorithm.
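To make that concrete, a toy Gear-style chunker could look like the sketch below. This isn't Git's model or any particular library's code; the table fill, size bounds, and mask are arbitrary placeholders, and a real manifest would record (offset, length, hash) per chunk:

    package main

    import "fmt"

    // 256 fixed 64-bit constants shared by every producer; the xorshift fill
    // here is just a placeholder (see the seeding discussion further down).
    var gearTable [256]uint64

    func init() {
        s := uint64(0x9E3779B97F4A7C15)
        for i := range gearTable {
            s ^= s << 13
            s ^= s >> 7
            s ^= s << 17
            gearTable[i] = s
        }
    }

    const (
        minSize = 64 << 10      // no cut before 64 KiB
        maxSize = 1 << 20       // hard cut at 1 MiB
        mask    = (1 << 16) - 1 // cut when the low 16 bits are zero
    )

    // nextCut returns the length of the next chunk starting at data[0].
    func nextCut(data []byte) int {
        if len(data) <= minSize {
            return len(data)
        }
        end := len(data)
        if end > maxSize {
            end = maxSize
        }
        var h uint64
        for i := minSize; i < end; i++ {
            h = (h << 1) + gearTable[data[i]]
            if h&mask == 0 {
                return i + 1
            }
        }
        return end
    }

    func main() {
        // Stand-in for file contents; boundaries depend only on the bytes,
        // so identical content always chunks identically.
        data := make([]byte, 4<<20)
        for i := range data {
            data[i] = byte(i % 251)
        }
        for off := 0; off < len(data); {
            n := nextCut(data[off:])
            fmt.Printf("chunk @ %8d, %7d bytes\n", off, n)
            off += n
        }
    }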
(Please don't hurt me.)
AI turns out to be useful for data compression (https://statusneo.com/creating-lossless-compression-algorith...) and RF modulation optimization (https://www.arxiv.org/abs/2509.04805).
Maybe it'd be useful to train a small model (probably of the SSM variety) to find optimal chunking boundaries.
In my case I need to ensure that all producers of data use exactly the same algorithm, as I need to look up build cache results based on Merkle tree hashes. That's why I'm intentionally focusing on algorithms that are not only easy to implement, but also easy to implement consistently. I think the MaxCDC implementation I shared strikes a good balance in that regard.
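To illustrate why bit-for-bit reproducibility matters: if the cache key is a Merkle root over per-chunk digests, a producer that cuts even one boundary differently derives a different key and misses the cache. The pairing rule and SHA-256 in this sketch are assumptions, not the actual scheme:

    package main

    import (
        "crypto/sha256"
        "fmt"
    )

    // merkleRoot folds per-chunk digests pairwise into a single root digest.
    func merkleRoot(leaves [][32]byte) [32]byte {
        for len(leaves) > 1 {
            var next [][32]byte
            for i := 0; i < len(leaves); i += 2 {
                if i+1 == len(leaves) {
                    next = append(next, leaves[i]) // odd leaf carried up unchanged
                    continue
                }
                pair := make([]byte, 0, 64)
                pair = append(pair, leaves[i][:]...)
                pair = append(pair, leaves[i+1][:]...)
                next = append(next, sha256.Sum256(pair))
            }
            leaves = next
        }
        return leaves[0]
    }

    func main() {
        chunks := [][]byte{[]byte("chunk a"), []byte("chunk b"), []byte("chunk c")}
        leaves := make([][32]byte, len(chunks))
        for i, c := range chunks {
            leaves[i] = sha256.Sum256(c) // per-chunk digests can be computed in parallel
        }
        fmt.Printf("cache key: %x\n", merkleRoot(leaves))
    }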
If we take that out of the equation and only measure the size of the additional chunks being transferred, it's a reduction of about 3.4%. So it's not an order of magnitude difference, but not bad for a relatively small change.
Did you compare it to Buzhash? I assume gearhash is faster given the simpler per-iteration structure. (Also, rand/v2's seeded generators might be better for gear init than mt19937.)
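For reference, the difference in the inner loops looks roughly like this (the tables and the 64-byte Buzhash window are made up, not taken from either implementation):

    package main

    import (
        "fmt"
        "math/bits"
    )

    const window = 64 // Buzhash needs a fixed window; Gear does not

    // Arbitrary deterministic tables; real implementations use fixed random ones.
    var gear, buz [256]uint64

    func init() {
        for i := range gear {
            gear[i] = uint64(i)*0x9E3779B97F4A7C15 + 1
            buz[i] = uint64(i)*0xBF58476D1CE4E5B9 + 1
        }
    }

    func main() {
        data := []byte("some input stream to roll a hash over, long enough to slide the window")

        // Gear: one shift and one table add per byte; old bytes simply age out
        // of the high bits, so there is no window to maintain.
        var g uint64
        for _, b := range data {
            g = (g << 1) + gear[b]
        }

        // Buzhash: rotate, XOR in the new byte, and XOR out the byte leaving
        // the window, which costs a second lookup plus window bookkeeping.
        var z uint64
        for i, b := range data {
            z = bits.RotateLeft64(z, 1) ^ buz[b]
            if i >= window {
                z ^= bits.RotateLeft64(buz[data[i-window]], window%64)
            }
        }

        fmt.Printf("gear=%016x buzhash=%016x\n", g, z)
    }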
Regarding the RNG used to seed the GEAR table: I don't think it actually makes that much of a difference. You only use it once to generate 2 KB of data (256 64-bit constants). My suspicion is that using some nothing-up-my-sleeve numbers (e.g., the first 2 KB of the binary expansion of π) would work as well.
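For what it's worth, pinning the table to a seeded rand/v2 generator is only a few lines; the PCG seeds below are placeholders, and swapping them for π digits or any other nothing-up-my-sleeve source only changes the loop body:

    package main

    import (
        "fmt"
        "math/rand/v2"
    )

    // gearTable derives the 256 fixed 64-bit constants from a seeded PCG so
    // that every producer computes exactly the same table. The seeds are
    // arbitrary placeholder values, not anything a real project uses.
    func gearTable() [256]uint64 {
        var t [256]uint64
        rng := rand.New(rand.NewPCG(0x6765617274616263, 0x0123456789abcdef))
        for i := range t {
            t[i] = rng.Uint64()
        }
        return t
    }

    func main() {
        t := gearTable()
        fmt.Printf("gear[0x00]=%016x gear[0xff]=%016x\n", t[0], t[255])
    }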