Lemaxoxo (u/Lemaxoxo)

Lemaxoxo commented on Text classification with Python 3.14's ZSTD module maxhalford.github.io/blog... · Posted by u/alexmolas

m-hodges · a month ago

Great overview. In 2023 I wrote about classifying political emails with Zstd.¹

¹ https://matthodges.com/posts/2023-10-01-BIDEN-binary-inferen...

Lemaxoxo · a month ago

That's very cool, thanks for sharing. Out of curiosity, did you ever get to run it on a Twitter/X stream of political tweets?

Lemaxoxo commented on Text classification with Python 3.14's ZSTD module maxhalford.github.io/blog... · Posted by u/alexmolas

stephantul · a month ago

The speed comparison is weird.

The author sets the solver to saga, doesn’t standardize the features, and uses a very high max_iter.

Logistic Regression takes longer to converge when features are not standardized.

Also, the zstd classifier's time complexity scales linearly with the number of classes; logistic regression's doesn’t. You have 20 (it’s in the name of the dataset), so why only use 4?

It’s a cool exploration of zstd. But please give the baseline some love. Not everything has to be better than something to be interesting.
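To make the linear-scaling point above concrete, here is a minimal sketch of a compression-based classifier. It is not the article's code: it assumes Python's `zlib` with a preset dictionary (`zdict`) as a stand-in for the 3.14 zstd module, and the names `compressed_size`, `classify`, and the toy `corpora` are hypothetical.

```python
import zlib


def compressed_size(data: bytes, class_corpus: bytes) -> int:
    # Prime a fresh DEFLATE compressor with the class corpus as a preset
    # dictionary, then measure how many bytes the new text costs.
    comp = zlib.compressobj(level=9, zdict=class_corpus)
    return len(comp.compress(data) + comp.flush())


def classify(text: str, corpora: dict[str, bytes]) -> str:
    # One compression pass per class, so prediction cost grows linearly
    # with the number of classes.
    data = text.encode()
    return min(corpora, key=lambda label: compressed_size(data, corpora[label]))


# Toy corpora, made up for illustration only.
corpora = {
    "space": b"the rocket launched into orbit and the shuttle docked with the station",
    "hockey": b"the goalie stopped the puck and the team scored in the third period",
}

print(classify("the shuttle docked with the station", corpora))
```

Each call to `classify` compresses the input once per label, which is why going from 4 to 20 classes would multiply the prediction time by roughly 5 in a scheme like this one.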

Lemaxoxo · a month ago

You are correct. To be fair I wasn't focused on comparing the runtimes of both methods. I just wanted to give a baseline and show that the batch approach is more accurate.

Lemaxoxo commented on Text classification with Python 3.14's ZSTD module maxhalford.github.io/blog... · Posted by u/alexmolas

ks2048 · a month ago

This looks like a nice rundown of how to do this with Python's zstd module.

But, I'm skeptical of using compressors directly for ML/AI/etc. (yes, compression and intelligence are very closely related, but practical compressors and practical classifiers have different goals and different practical constraints).

Back in 2023, I wrote two blog posts [0,1] that refuted the results in the 2023 paper referenced here (bad implementation and bad data).

[0] https://kenschutte.com/gzip-knn-paper/

[1] https://kenschutte.com/gzip-knn-paper2/

Lemaxoxo · a month ago

Author here. Thank you very much for the comment. I will take a look. This is a great case of Cunningham's law!

Lemaxoxo commented on Text classification with Python 3.14's ZSTD module maxhalford.github.io/blog... · Posted by u/alexmolas

pornel · a month ago

The application of compressors for text statistics is fun, but it's a software equivalent of discovering that speakers and microphones are in principle the same device.

(KL divergence of letter frequencies is the same thing as ratio of lengths of their Huffman-compressed bitstreams, but you don't need to do all this bit-twiddling for real just to count the letters)
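A rough way to check the relationship in the parenthetical above without any bit-twiddling: the sketch below counts letters and compares ideal (Shannon) code lengths of `-log2 q(c)` bits per letter. These are an assumption standing in for real Huffman codes, which round lengths to whole bits. The identity it checks is that the extra bits per letter paid for using the wrong code equal the KL divergence. All function names and the two sample strings are hypothetical.

```python
import math
import string
from collections import Counter

ALPHABET = string.ascii_lowercase


def empirical(text: str) -> dict[str, float]:
    # Raw letter frequencies of the text (only letters that occur).
    counts = Counter(c for c in text.lower() if c in ALPHABET)
    n = sum(counts.values())
    return {c: k / n for c, k in counts.items()}


def smoothed(text: str, alpha: float = 1.0) -> dict[str, float]:
    # Add-alpha smoothing so every letter gets a non-zero probability.
    counts = Counter(c for c in text.lower() if c in ALPHABET)
    total = sum(counts.values()) + alpha * len(ALPHABET)
    return {c: (counts[c] + alpha) / total for c in ALPHABET}


def kl_bits(p: dict[str, float], q: dict[str, float]) -> float:
    # KL(p || q) in bits.
    return sum(pc * math.log2(pc / q[c]) for c, pc in p.items())


def bits_per_letter(p: dict[str, float], code: dict[str, float]) -> float:
    # Average length when letters drawn from p are encoded with an
    # ideal code built for the distribution `code`.
    return sum(pc * -math.log2(code[c]) for c, pc in p.items())


p = empirical("the puck slid past the goalie")
q = smoothed("the rocket reached orbit around the moon")

# Extra bits per letter paid for using the code tuned to q instead of p.
extra = bits_per_letter(p, q) - bits_per_letter(p, p)
print(extra, kl_bits(p, q))
```

The two printed numbers agree up to floating-point error, so the compressed-length comparison can be read straight off the letter counts.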

The article views compression entirely through Python's limitations.

> gzip and LZW don’t support incremental compression

This may be true of Python's APIs, but it is not true of these algorithms in general.

They absolutely support incremental compression even in APIs of popular lower-level libraries.

Snapshotting/rewinding of the state isn't exposed usually (custom gzip dictionary is close enough in practice, but a dedicated API would reuse its internal caches). Algorithmically it is possible, and quite frequently used by the compressors themselves: Zopfli tries lots of what-if scenarios in a loop. Good LZW compression requires rewinding to a smaller symbol size and restarting compression from there after you notice the dictionary stopped being helpful. The bitstream has a dedicated code for this, so this isn't just possible, but baked into the design.
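As a small illustration of the snapshot-and-rewind idea above, here is a sketch using Python's `zlib`, where `compressobj()` accepts data chunk by chunk and `Compress.copy()` forks the compressor state. It is an assumption that this is a fair stand-in for the lower-level APIs being described; the helper `extra_bytes` and the sample chunks are hypothetical.

```python
import zlib

# Feed the compressor incrementally, one chunk at a time.
comp = zlib.compressobj(level=9)
stream = b""
for chunk in (b"first chunk of training text. ", b"second chunk of training text. "):
    stream += comp.compress(chunk)
# Emit everything buffered so far without ending the stream.
stream += comp.flush(zlib.Z_SYNC_FLUSH)


def extra_bytes(state, candidate: bytes) -> int:
    # Fork the state so the what-if run leaves the original untouched.
    fork = state.copy()
    return len(fork.compress(candidate) + fork.flush())


familiar = extra_bytes(comp, b"second chunk of training text. ")
unfamiliar = extra_bytes(comp, b"zqxj vkwy pfgh mnbd uoie lrts ")
print(familiar, unfamiliar)
```

Because each what-if runs on a copy, `comp` can keep receiving data afterwards, and text that resembles what was already seen costs fewer additional bytes than unrelated text.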

Lemaxoxo · a month ago

Author here. Thanks for your comment!

Compression algorithms may have been supporting incremental compression for a while. But as some have pointed out, the point of the post is that it is practical and simple to have this available in Python's standard library. You could indeed do this in Bash, but then people don't do machine learning in Bash.

Lemaxoxo commented on A 2.5x faster Postgres parser with Claude Code multigres.com/blog/ai-par... · Posted by u/kiwicopple

guptamanan100 · a month ago

Good question! The main reason is that sqlglot is written in Python, so it wouldn't integrate natively with our Go codebase. We actually faced a similar decision with pg_query_go (https://github.com/pganalyze/pg_query_go) and passed on that too: anything that requires bridging another language means translating the AST back into Go, which adds a performance cost we wanted to avoid.

Lemaxoxo · a month ago

Ok that makes sense! On my side I can get away with using it through WASM. But your performance needs won't allow that.

Lemaxoxo commented on A 2.5x faster Postgres parser with Claude Code multigres.com/blog/ai-par... · Posted by u/kiwicopple

guptamanan100 · a month ago

Hey HN, I'm the author. Happy to answer questions or discuss further.

Lemaxoxo · a month ago

I'm curious because I have a similar use case for a querying frontend. Did you consider using https://github.com/tobymao/sqlglot? If so, what was missing to justify writing your own parser?

Lemaxoxo commented on Text classification with Python 3.14's ZSTD module maxhalford.github.io/blog... · Posted by u/Lemaxoxo

Lemaxoxo · a month ago

Hello HN. 5 years ago I posted an article about text classification via data compression. I got helpful and educational comments in response. Now that Python has shipped zstd in 3.14, I thought it would be time to revisit this approach. The throughput figures are much better. This means you can do baseline machine learning with Python's standard library!

Lemaxoxo commented on Ask HN: Share your personal website · Posted by u/susam

Lemaxoxo · 2 months ago

https://maxhalford.github.io/

Lemaxoxo commented on Do LLMs identify fonts? maxhalford.github.io/blog... · Posted by u/alexmolas

she46BiOmUerPVj · 7 months ago

I would have never thought to not use "what the font"

https://www.myfonts.com/pages/whatthefont

Lemaxoxo · 7 months ago

OP here. I tried WhatTheFont a bit but didn't mention it in the article. I didn't get good results with it. Still, it's probably a good idea to ask it for a guess and feed that to the LLM too.