egorpv (u/egorpv) - Readit News

mbreese · a year ago

In bioinformatics we use a modified gzip format called bgzip that exploits this fact heavily. The entire file is made of concatenated gzip chunks. Each chunk then contains the size of the chunk (stored in the gzip header). This lets you do random access inside the compressed blocks more efficiently.

Sadly, the authors hard coded the expected headers so it’s not fully gzip compatible (you can’t add your own arbitrary headers). For example, I wanted to add a chunk hash and optional encryption by adding my own header elements. But as the original tooling all expects a fixed header, it can’t be done in the existing format.

But overall it is easily indexed and makes reading compressed data pretty easy.

So, there you go - a practical use for a gzip party trick!

egorpv · a year ago

In such case, imo, it would be more appropriate to use ZIP container format which supports multiple independent entries (files) and index table (including sizes) for them. Compression algorithm is essentially the same (Deflate) so it would not be bloated in any way. As for practical implementations of serializing complex "objects" into ZIP, numpy [0] can be mentioned as an example.

[0] https://numpy.org/doc/stable/reference/generated/numpy.savez...