Readit News
ks2048 · a month ago
This looks like a nice rundown of how to do this with Python's zstd module.

But I'm skeptical of using compressors directly for ML/AI/etc. (yes, compression and intelligence are closely related, but practical compressors and practical classifiers have different goals and different constraints).

Back in 2023, I wrote two blog posts [0,1] that refuted the results in the 2023 paper referenced here (bad implementation and bad data).

[0] https://kenschutte.com/gzip-knn-paper/

[1] https://kenschutte.com/gzip-knn-paper2/

duskwuff · a month ago
Concur. Zstandard is a good compressor, but it's not magical; comparing the compressed size of Zstd(A+B) to the combined size of Zstd(A) + Zstd(B) is effectively just a complicated way of measuring how many words and phrases the two documents have in common. That isn't entirely ineffective at judging whether they're about the same topic, but it's an unnecessarily complex and easily confused way of doing so.
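That comparison is easy to demonstrate with the normalized compression distance. A minimal sketch, using the stdlib zlib as a stand-in for Zstandard (an assumption: any LZ-family compressor shows the same effect, and the example strings are made up):

```python
import zlib

def ncd(a: bytes, b: bytes) -> float:
    """Normalized compression distance: lower = more shared substrings."""
    ca, cb = len(zlib.compress(a)), len(zlib.compress(b))
    cab = len(zlib.compress(a + b))
    return (cab - min(ca, cb)) / max(ca, cb)

tacos = b"taco burrito tortilla salsa guacamole cilantro lime"
padel = b"racket court serve volley smash lob match game set"
query = b"I ordered three tacos with extra guacamole"

# The query shares words with the taco text, so the joint stream compresses
# proportionally better and the distance comes out lower.
print(ncd(query, tacos), ncd(query, padel))
```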
andai · a month ago
If I'm reading this right, you're saying it's functionally equivalent to measuring the intersection of ngrams? That sounds very testable.
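It is testable; a sketch of one side of the comparison (hypothetical texts): rank document pairs by character n-gram overlap and check whether the compressor's ranking agrees.

```python
def char_ngrams(text: bytes, n: int = 4) -> set:
    """All overlapping n-byte substrings of the text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a: bytes, b: bytes) -> float:
    """Intersection-over-union of the two n-gram sets."""
    sa, sb = char_ngrams(a), char_ngrams(b)
    return len(sa & sb) / len(sa | sb)

tacos = b"taco burrito tortilla salsa guacamole cilantro lime"
padel = b"racket court serve volley smash lob match game set"
query = b"I ordered three tacos with extra guacamole"

# If the hypothesis holds, this ordering should match the compressor's.
print(jaccard(query, tacos), jaccard(query, padel))
```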
srean · a month ago
I do not know the inner details of Zstandard, but I would expect it to at least do suffix/prefix or word-fragment statistics, not just words and phrases.
D-Machine · a month ago
Yup. Data compression ≠ semantic compression.
shoo · a month ago
Good on you for attempting to reproduce the results & writing it up, and reporting the issue to the authors.

> It turns out that the classification method used in their code looked at the test label as part of the decision method and thus led to an unfair comparison to the baseline results

Lemaxoxo · a month ago
Author here. Thank you very much for the comment. I will take a look. This is a great case of Cunningham's law!
pornel · a month ago
The application of compressors for text statistics is fun, but it's a software equivalent of discovering that speakers and microphones are in principle the same device.

(KL divergence of letter frequencies is the same thing as ratio of lengths of their Huffman-compressed bitstreams, but you don't need to do all this bit-twiddling for real just to count the letters)
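The letter-counting version of that identity, as a sketch: the expected code length under a model q is the cross-entropy H(p) + KL(p‖q), so comparing lengths is comparing KL terms. The texts and the smoothing constant are assumptions for illustration:

```python
import math
from collections import Counter

def letter_dist(text: str) -> dict:
    """Empirical letter-frequency distribution of the text."""
    counts = Counter(text)
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}

def cross_entropy_bits(text: str, model: dict, eps: float = 1e-9) -> float:
    """Expected bits/letter when coding `text` with a code optimal for `model`."""
    p = letter_dist(text)
    return -sum(prob * math.log2(model.get(ch, eps)) for ch, prob in p.items())

a = "taco burrito tortilla salsa guacamole cilantro lime"
b = "racket court serve volley smash lob match game set"

# Coding a text with its own letter statistics is never worse than with
# another text's statistics; the gap is exactly the KL divergence.
print(cross_entropy_bits(a, letter_dist(a)), cross_entropy_bits(a, letter_dist(b)))
```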

The article views compression entirely through Python's limitations.

> gzip and LZW don’t support incremental compression

This may be true of Python's APIs, but it is not true of these algorithms in general.

They absolutely support incremental compression even in APIs of popular lower-level libraries.

Snapshotting/rewinding of the state isn't usually exposed (a custom gzip dictionary is close enough in practice, but a dedicated API would reuse its internal caches). Algorithmically it is possible, and quite frequently used by compressors themselves: Zopfli tries lots of what-if scenarios in a loop. Good LZW compression requires rewinding to a smaller symbol size and restarting compression from there after you notice the dictionary has stopped being helpful. The bitstream has a dedicated code for this, so it isn't just possible, but baked into the design.
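For what it's worth, even Python's stdlib exposes incremental compression at the zlib level; a minimal sketch:

```python
import zlib

comp = zlib.compressobj()
stream = b""
for chunk in [b"first chunk, ", b"second chunk, ", b"third chunk"]:
    stream += comp.compress(chunk)  # feed data as it arrives
stream += comp.flush()              # finish the stream

assert zlib.decompress(stream) == b"first chunk, second chunk, third chunk"
```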

notpushkin · a month ago
> The application of compressors for text statistics is fun, but it's a software equivalent of discovering that speakers and microphones are in principle the same device.

I think it makes sense to explore it from a practical standpoint, too. It’s in the Python stdlib, and it works reasonably well, so for some applications it might be good enough.

It’s also fairly easy to implement in other languages with zstd bindings, or even shell scripts:

  $ echo 'taco burrito tortilla salsa guacamole cilantro lime' > /tmp/tacos.txt
  $ zstd --train $(yes '/tmp/tacos.txt' | head -n 50) -o tacos.dict
  [...snip]

  $ echo 'racket court serve volley smash lob match game set' > /tmp/padel.txt
  $ zstd --train $(yes '/tmp/padel.txt' | head -n 50) -o padel.dict
  [...snip]

  $ echo 'I ordered three tacos with extra guacamole' | zstd -D tacos.dict | wc -c
        57
  $ echo 'I ordered three tacos with extra guacamole' | zstd -D padel.dict | wc -c
        60

notpushkin · a month ago
Or with the newsgroup20 dataset:

  curl http://qwone.com/~jason/20Newsgroups/20news-19997.tar.gz | tar -xzf -
  cd 20_newsgroups
  for f in *; do zstd --train "$f"/* -o "../$f.dict"; done
  cd ..
  for d in *.dict; do
    cat 20_newsgroups/misc.forsale/74150 | zstd -D "$d" | wc -c | tr -d '\n'; echo " $d";
  done | sort | head -n 3
Output:

     422 misc.forsale.dict
     462 rec.autos.dict
     463 comp.sys.mac.hardware.dict
Pretty neat IMO.

Lemaxoxo · a month ago
Author here. Thanks for your comment!

Compression algorithms may have supported incremental compression for a while. But as some have pointed out, the point of the post is that it is now practical and simple to do this from Python's standard library. You could indeed do this in Bash, but people don't do machine learning in Bash.

xxs · a month ago
gzip/deflate has had SYNC_FLUSH to concatenate messages and/or try something else. It has also always had an Adler-32 checksum for preset dictionaries.
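That flag is reachable from Python through zlib's flush modes; a sketch (default compression settings assumed):

```python
import zlib

comp = zlib.compressobj()
# Z_SYNC_FLUSH emits everything buffered so far, byte-aligned,
# without ending the stream -- you can keep appending afterwards.
part1 = comp.compress(b"hello ") + comp.flush(zlib.Z_SYNC_FLUSH)
part2 = comp.compress(b"world") + comp.flush(zlib.Z_FINISH)

d = zlib.decompressobj()
assert d.decompress(part1) == b"hello "   # decodable before the stream ends
assert d.decompress(part2) == b"world"
```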
staplung · a month ago
This has been possible with the zlib module since 1997 [EDIT: zlib is from '97; the zdict param wasn't added until 2012]. You even get similar byte counts to the example, and on my machine it's about 10x faster to use zlib.

  import zlib

  input_text = b"I ordered three tacos with extra guacamole"

  tacos = b"taco burrito tortilla salsa guacamole cilantro lime " * 50
  taco_comp = zlib.compressobj(zdict=tacos)
  print(len(taco_comp.compress(input_text) + taco_comp.flush()))
  # prints 41

  padel = b"racket court serve volley smash lob match game set " * 50
  padel_comp = zlib.compressobj(zdict=padel)
  print(len(padel_comp.compress(input_text) + padel_comp.flush()))
  # prints 54

notpushkin · a month ago
True. The post calls out that “you have to recompress the training data for each test document” with zlib (otherwise input_text would taint it), but you can actually call Compress.copy().

zdict was added in Python 3.3, though, so it’s 2012, not 1997. (It might have worked before, just not a part of the official API :-)
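A sketch of that trick (the example strings are hypothetical): feed the "training" text once, sync-flush so later byte counts are document-only, then copy() the state per document instead of recompressing.

```python
import zlib

training = b"taco burrito tortilla salsa guacamole cilantro lime " * 50

base = zlib.compressobj()
base.compress(training)
base.flush(zlib.Z_SYNC_FLUSH)  # drain the training bytes; state stays usable

def extra_bytes(doc: bytes) -> int:
    c = base.copy()  # snapshot -- no recompression of the training data
    return len(c.compress(doc) + c.flush())

in_domain = b"salsa guacamole cilantro lime taco"   # substrings of the training text
off_domain = b"racket court serve volley smash lob"  # shares almost nothing with it
print(extra_bytes(in_domain), extra_bytes(off_domain))
```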

staplung · a month ago
Ah, okay. Didn't realize that. I used either zlib or gzip long, long ago but never messed with the `zdict` param. Thanks for pointing that out.
physicsguy · a month ago
In my PhD, more than a decade ago, I ended up using PNG image file sizes to classify different output states from simulations of a system under different conditions. Because of the compression, homogeneous states led to much smaller file sizes than heterogeneous states. It was super, super reliable.
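The effect is easy to reproduce without PNG, since PNG's payload is deflate-compressed anyway; a toy stand-in with raw zlib (synthetic "fields" assumed):

```python
import random
import zlib

random.seed(0)

homogeneous = bytes([128]) * 10_000                                   # uniform field
heterogeneous = bytes(random.randrange(256) for _ in range(10_000))   # noisy field

# The uniform state compresses to a handful of bytes; the noisy one barely
# compresses at all, so file size alone separates the two regimes.
print(len(zlib.compress(homogeneous)), len(zlib.compress(heterogeneous)))
```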
chaps · a month ago
Ooh, totally. Many years ago I was doing some analysis of parking ticket data using gnuplot and had it output a chart PNG per street. Not great, but it worked well for the next step of that project: sorting the directory by file size. The most dynamic streets were the largest files by far.

Another way I've used compression: identifying cops who cover their body cameras while recording -- a low filesize-to-duration ratio reflects that not much activity is going on in the frame.

earthscienceman · a month ago
Have any more information on the cop camera footage?
chaps · a month ago
Sure -- it's something I figured out during the 2020 protests for some reporting work I was doing which led to this reporting: https://thetriibe.com/2020/12/hundreds-of-chicago-police-mad...

This reporting was made possible because it's surprisingly easy to export recording start/stop time, file size, duration, notes, cop badge and model name from the underlying system with a couple clicks (this is true for any agency that uses axon: https://my.axon.com/s/article/Justice-Exporting-search-resul...). I threw that info into postgres, made a materialized view with a column that gets the filesize:duration ratio and filtered for videos with a certain ratio. I never did anything with it besides that article I posted before.

Here's an observable about the BWC analysis that went into the reporting (disclaimer: the observable is mid-iteration that never received a followup. the analysis itself is separate from the reporting): https://observablehq.com/d/9f09764dbbdfc4b5

m-hodges · a month ago
Great overview. In 2023 I wrote about classifying political emails with Zstd.¹

¹ https://matthodges.com/posts/2023-10-01-BIDEN-binary-inferen...

Lemaxoxo · a month ago
That's very cool, thanks for sharing. Out of curiosity, did you ever get to run it on a Twitter/X stream of political tweets?
stephantul · a month ago
The speed comparison is weird.

The author sets the solver to saga, doesn’t standardize the features, and uses a very high max_iter.

Logistic Regression takes longer to converge when features are not standardized.

Also, the zstd classifier's time complexity scales linearly with the number of classes; logistic regression's doesn't. You have 20 (it's in the name of the dataset), so why use only 4?

It’s a cool exploration of zstd. But please give the baseline some love. Not everything has to be better than something to be interesting.

Lemaxoxo · a month ago
You are correct. To be fair I wasn't focused on comparing the runtimes of both methods. I just wanted to give a baseline and show that the batch approach is more accurate.
stephantul · a month ago
Yeah sorry, reading it back I was a bit too harsh haha. It was my pre-coffee comment. Nice post!
throwaway81523 · a month ago
Python's zlib does support incremental compression with the zdict parameter. gzip has something similar, but you have to do some hacky thing to get at it since the regular Python API doesn't expose the entry point. I did manage to use it from Python a while back, but my memory is hazy about how I got to it. The entry point may have been exposed in the module's code but undocumented in the Python manual.