Readit News
Lemaxoxo commented on Text classification with Python 3.14's ZSTD module   maxhalford.github.io/blog... · Posted by u/alexmolas
m-hodges · a month ago
Great overview. In 2023 I wrote about classifying political emails with Zstd.¹

¹ https://matthodges.com/posts/2023-10-01-BIDEN-binary-inferen...

Lemaxoxo · a month ago
That's very cool, thanks for sharing. Out of curiosity, did you ever get to run it on a Twitter/X stream of political tweets?
Lemaxoxo commented on Text classification with Python 3.14's ZSTD module   maxhalford.github.io/blog... · Posted by u/alexmolas
stephantul · a month ago
The speed comparison is weird.

The author sets the solver to saga, doesn’t standardize the features, and uses a very high max_iter.

Logistic Regression takes longer to converge when features are not standardized.

Also, the zstd classifier's time complexity scales linearly with the number of classes; logistic regression's doesn’t. You have 20 (it’s in the name of the dataset), so why only use 4?

It’s a cool exploration of zstd. But please give the baseline some love. Not everything has to be better than something to be interesting.
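
A minimal sketch of why the cost grows with the number of classes, assuming one compression pass per class at prediction time. It swaps in `zlib` with a preset dictionary (`zdict`) instead of the article's zstd code, and the `compressed_size` and `classify` helpers and the toy `class_dicts` are hypothetical names and data, not taken from the article:

```python
import zlib

def compressed_size(text: str, zdict: bytes) -> int:
    """Bytes needed to deflate `text` when the compressor is primed with `zdict`."""
    comp = zlib.compressobj(level=9, zdict=zdict)
    return len(comp.compress(text.encode()) + comp.flush())

def classify(text: str, class_dicts: dict[str, bytes]) -> str:
    # One compression pass per class, so prediction cost grows
    # linearly with the number of classes.
    return min(class_dicts, key=lambda label: compressed_size(text, class_dicts[label]))

# Toy per-class "training" text, used directly as a preset dictionary.
class_dicts = {
    "sports": b"the team won the match after scoring a late goal in the final",
    "cooking": b"simmer the sauce then add garlic butter and salt to the pan",
}

print(classify("add butter and garlic to the sauce", class_dicts))
```

With 20 classes the loop inside `classify` runs 20 compressions per document, which is the scaling concern raised above.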

Lemaxoxo · a month ago
You are correct. To be fair I wasn't focused on comparing the runtimes of both methods. I just wanted to give a baseline and show that the batch approach is more accurate.
Lemaxoxo commented on Text classification with Python 3.14's ZSTD module   maxhalford.github.io/blog... · Posted by u/alexmolas
ks2048 · a month ago
This looks like a nice rundown of how to do this with Python's zstd module.

But, I'm skeptical of using compressors directly for ML/AI/etc. (yes, compression and intelligence are very closely related, but practical compressors and practical classifiers have different goals and different practical constraints).

Back in 2023, I wrote two blog posts [0,1] that refuted the results in the 2023 paper referenced here (bad implementation and bad data).

[0] https://kenschutte.com/gzip-knn-paper/

[1] https://kenschutte.com/gzip-knn-paper2/

Lemaxoxo · a month ago
Author here. Thank you very much for the comment. I will take a look. This is a great case of Cunningham's law!
Lemaxoxo commented on Text classification with Python 3.14's ZSTD module   maxhalford.github.io/blog... · Posted by u/alexmolas
pornel · a month ago
The application of compressors for text statistics is fun, but it's a software equivalent of discovering that speakers and microphones are in principle the same device.

(KL divergence of letter frequencies is the same thing as ratio of lengths of their Huffman-compressed bitstreams, but you don't need to do all this bit-twiddling for real just to count the letters)
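
As a rough illustration of "just counting the letters": the sketch below estimates smoothed letter frequencies and computes the ideal code length (cross-entropy) and KL divergence directly. It is an idealized stand-in, since real Huffman code lengths are whole numbers of bits and only approximate `-log2(q)`. The alphabet, the add-one smoothing, and all function names are assumptions made for the example:

```python
import math
import string
from collections import Counter

ALPHABET = string.ascii_lowercase + " "  # assumed alphabet for the example

def letter_probs(text: str, alpha: float = 1.0) -> dict[str, float]:
    """Smoothed letter frequencies (add-alpha) so no probability is zero."""
    counts = Counter(c for c in text.lower() if c in ALPHABET)
    total = sum(counts.values()) + alpha * len(ALPHABET)
    return {c: (counts[c] + alpha) / total for c in ALPHABET}

def cross_entropy_bits(text: str, probs: dict[str, float]) -> float:
    """Ideal code length of `text`, in bits, under a code tuned to `probs`."""
    return sum(-math.log2(probs[c]) for c in text.lower() if c in probs)

def kl_divergence_bits(p: dict[str, float], q: dict[str, float]) -> float:
    """Expected extra bits per letter when coding p-distributed text with a code for q."""
    return sum(p[c] * math.log2(p[c] / q[c]) for c in p)

p = letter_probs("aaaa bbbb")
q = letter_probs("zzzz yyyy")
print(kl_divergence_bits(p, q))
print(cross_entropy_bits("aabb", p), cross_entropy_bits("aabb", q))
```

Text that resembles the `p` sample costs fewer ideal bits under `p` than under `q`, which is the same signal the compressed lengths carry.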

The article views compression entirely through Python's limitations.

> gzip and LZW don’t support incremental compression

This may be true of Python's APIs, but it is not true of these algorithms in general.

They absolutely support incremental compression even in APIs of popular lower-level libraries.

Snapshotting/rewinding of the state isn't exposed usually (custom gzip dictionary is close enough in practice, but a dedicated API would reuse its internal caches). Algorithmically it is possible, and quite frequently used by the compressors themselves: Zopfli tries lots of what-if scenarios in a loop. Good LZW compression requires rewinding to a smaller symbol size and restarting compression from there after you notice the dictionary stopped being helpful. The bitstream has a dedicated code for this, so this isn't just possible, but baked into the design.
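
For what it's worth, Python's `zlib` binding does offer `Compress.copy()`, which allows a rough version of the what-if pattern described above: prime a compressor once, then fork it for each trial so the primed state is never consumed. This is only a sketch under that assumption. The training text and the `extra_bytes` helper are hypothetical, and it says nothing about how lower-level libraries expose the same idea:

```python
import zlib

training = b"simmer the sauce then add garlic butter and salt to the pan"

# Feed the training text once. Z_SYNC_FLUSH emits the pending output but
# keeps the history window, so later input can still refer back to it.
base = zlib.compressobj(level=9)
base.compress(training)
base.flush(zlib.Z_SYNC_FLUSH)

def extra_bytes(primed, sample: bytes) -> int:
    """Bytes needed to append `sample`, measured on a throwaway copy."""
    trial = primed.copy()  # snapshot: `primed` itself is left untouched
    return len(trial.compress(sample) + trial.flush())

print(extra_bytes(base, b"garlic butter and salt"))
print(extra_bytes(base, b"quarterly revenue forecast"))
```

Because each trial runs on a copy, `base` can be reused for any number of samples without recompressing the training text.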

Lemaxoxo · a month ago
Author here. Thanks for your comment!

Compression algorithms may have been supporting incremental compression for a while. But as some have pointed out, the point of the post is that it is practical and simple to have this available in Python's standard library. You could indeed do this in Bash, but then people don't do machine learning in Bash.

Lemaxoxo commented on A 2.5x faster Postgres parser with Claude Code   multigres.com/blog/ai-par... · Posted by u/kiwicopple
guptamanan100 · a month ago
Good question! The main reason is that sqlglot is written in Python, so it wouldn't integrate natively with our Go codebase. We actually faced a similar decision with pg_query_go (https://github.com/pganalyze/pg_query_go) and passed on that too: anything that requires bridging another language means translating the AST back into Go, which adds a performance cost we wanted to avoid.
Lemaxoxo · a month ago
Ok that makes sense! On my side I can get away with using it through WASM. But your performance needs won't allow that.
Lemaxoxo commented on A 2.5x faster Postgres parser with Claude Code   multigres.com/blog/ai-par... · Posted by u/kiwicopple
guptamanan100 · a month ago
Hey HN, I'm the author. Happy to answer questions or discuss further.
Lemaxoxo · a month ago
I'm curious because I have a similar use case for a querying frontend. Did you consider using https://github.com/tobymao/sqlglot? If so, what was missing to justify writing your own parser?
Lemaxoxo commented on Text classification with Python 3.14's ZSTD module   maxhalford.github.io/blog... · Posted by u/Lemaxoxo
Lemaxoxo · a month ago
Hello HN. 5 years ago I posted an article about text classification via data compression. I got helpful and educational comments in response. Now that Python has shipped zstd in 3.14, I thought it would be time to revisit this approach. The throughput figures are much better. This means you can do baseline machine learning with Python's standard library!
Lemaxoxo commented on Do LLMs identify fonts?   maxhalford.github.io/blog... · Posted by u/alexmolas
she46BiOmUerPVj · 7 months ago
I would have never thought to not use "what the font"

https://www.myfonts.com/pages/whatthefont

Lemaxoxo · 7 months ago
OP here. I tried WhatTheFont a bit but didn't mention it in the article. I didn't get good results with it. Although it's probably a good idea to ask it for a guess, and feed that to the LLM too.

u/Lemaxoxo

Karma: 268 · Cake day: March 15, 2018
About
https://maxhalford.github.io/bio/