It takes roughly 100us for light to travel 30km – Can you explain how the speed of light is relevant here?
Naively, and knowing little about CRAM, I would expect OpenZL to beat Zstd handily out of the box, but to need additional capabilities to match the performance of CRAM, since genomics hasn't been a focus yet. But it would be interesting to see how much of what we need to add is generic to all compression (but useful for genomics), vs. techniques that are specific only to genomics.
We're planning on setting up a blog on our website to highlight use cases of OpenZL. I'd love to make a post about this.
Happy to discuss further
Edit: Have you any specific advice for training a fasta compressor beyond that given in e.g. "Using OpenZL" (https://openzl.org/getting-started/using-openzl/)?
> Ultimately, Zstd is a byte-oriented compressor that doesn't understand the semantics of the data it compresses
The first stage of Zstd does LZ77 matching, which transforms the input into "sequences", a series of instructions each of which describes some literals and one match. The literals component of the instruction says "the next L bytes of the message are these L bytes". The match component says "the next M bytes of the input are the M bytes N bytes ago".
If you want to construct a match between two strings that differ by one character, rather than saying "the next M bytes are the M bytes N bytes ago, except for this one byte here, which is X instead", Zstd just breaks it up into two sequences: the first part of the match, then a single literal byte carrying the changed byte, and then the rest of the match, described as being at offset 0. The encoding rules for Zstd define offset 0 to mean "the previously used match offset" [0]. This isn't as powerful as a Levenshtein edit, but it's a reasonable approximation.
The big advantage of this approach is that it doesn't require much additional machinery on the encoder or decoder, and thus remains very fast. Whereas implementing a whole edit description state machine would (I think) slow down decompression and especially compression enormously.
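To make that concrete, here's a toy decoder for this kind of sequence representation. It's a simplified sketch, not the actual Zstd bitstream: the tuple layout and the exact repeat-offset rule are illustrative.

```python
# Toy sketch of LZ77 "sequences": each is (literals, offset, match_length),
# where offset 0 stands in for "the previously used match offset".

def execute_sequences(sequences):
    out = bytearray()
    prev_offset = None
    for literals, offset, match_len in sequences:
        out += literals                      # copy the raw literal bytes
        if match_len:
            if offset == 0:                  # repeat-offset: reuse last offset
                offset = prev_offset
            start = len(out) - offset
            for i in range(match_len):       # byte-by-byte: matches may overlap
                out.append(out[start + i])
            prev_offset = offset
    return bytes(out)

# "abcdefgh" is followed by a near-copy "abcXefgh" differing in one byte:
seqs = [
    (b"abcdefgh", 0, 0),   # the first copy, as raw literals
    (b"", 8, 3),           # match "abc" from 8 bytes back
    (b"X", 0, 4),          # changed literal, then "efgh" at the repeat offset
]
print(execute_sequences(seqs))   # b'abcdefghabcXefgh'
```

Note that the decoder is just literal copies and back-references, which is why it stays so fast.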
[0] https://datatracker.ietf.org/doc/html/rfc8878#name-repeat-of...
Took me a while to realize that Grace Blackwell refers to a person and not an Nvidia chip :)
I’ve worked with large genomic datasets on my own dime, and the default formats show their limits quickly. With FASTA, the first step for me is usually conversion: unzip headers from sequences, store them in Arrow-like tapes for CPU/GPU processing, and persist as Parquet when needed. It’s straightforward, but surprisingly underused in bioinformatics — most pipelines stick to plain text even when modern data tooling would make things much easier :(
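For what it's worth, the first conversion step fits in a few lines of plain Python. This is an illustrative sketch with made-up names; in practice the resulting columns would be handed to something like pyarrow.table(...) and pyarrow.parquet.write_table(...) for the Arrow/Parquet steps.

```python
# Sketch: split FASTA records into parallel "header" and "sequence"
# columns (the columnar layout Arrow-style tooling expects).

def fasta_to_columns(text):
    headers, sequences, chunks = [], [], []
    for line in text.splitlines():
        if line.startswith(">"):
            if headers:                      # close out the previous record
                sequences.append("".join(chunks))
            chunks = []
            headers.append(line[1:].strip())
        elif line:
            chunks.append(line.strip())      # drop the hard-wrapping newlines
    if headers:
        sequences.append("".join(chunks))
    return {"header": headers, "sequence": sequences}

cols = fasta_to_columns(">seq1 sample\nACGT\nACGT\n>seq2 sample\nTTGA\n")
print(cols["sequence"])   # ['ACGTACGT', 'TTGA']
```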
> Took me a while to realize that Grace Blackwell refers to a person and not an Nvidia chip :)
I even confused myself about this while writing :-)
    > title
    bases with optional newlines
    > title
    bases with optional newlines
    ...
The author is talking about removing the non-semantic optional newlines (hard wrapping), not all the newlines in the file. It makes a lot of sense that this would work: bacteria have many subsequences in common, but if you insert non-semantic newlines at effectively random offsets, then compression tools will not be able to exploit the repetition effectively.
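A quick way to see the effect, using Python's zlib as a stand-in for any LZ-style compressor (synthetic data; the exact sizes are illustrative):

```python
import random
import zlib

random.seed(0)
motif = "ACGTTGCAGGTACCGA" * 16     # a 256-char repeated subsequence
genome = motif * 200                # highly repetitive, like shared bacterial DNA

def wrap(s, width):
    """Hard-wrap a string at a fixed width, like FASTA line wrapping."""
    return "\n".join(s[i:i + width] for i in range(0, len(s), width))

# Same underlying data, but each repeat hard-wrapped at an effectively
# random width, so the newlines land at different offsets in each copy.
ragged = "\n".join(wrap(motif, random.randint(50, 70)) for _ in range(200))

unwrapped = zlib.compress(genome.encode(), 9)
wrapped = zlib.compress(ragged.encode(), 9)
print(len(unwrapped), len(wrapped))   # the ragged version compresses worse
```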
I had the same epiphany as you days after acquiring a CO2 monitor. Most people notice poor indoor air quality from proxies such as humidity and temperature. AC (without ventilation) eliminates these and tricks our senses very effectively, giving us cool and fresh feeling indoor spaces full of CO2 and devoid of oxygen.