AbrahamParangi · 2 years ago
Probably disappointing to the authors but an excellent rebuttal.

This is the sort of mistake that's awfully easy to make in ML. The unfortunate thing about the field is that subtle methodological errors often cause subtle failures rather than catastrophic failures as we're used to in many other branches of engineering or science. You can easily make a system slightly worse or slightly better by contaminating your training set with bad data or accidentally leaking some information about your target and the ML system will take it in stride (but with slightly contaminated results).

This result makes sense to me because as much as I would like it to be true, applying existing compression algorithms to ML feels like too much of a "free lunch". If there was any special magic happening in compression algorithms we'd use compression algorithms as encoders instead of using transformers as compressors.

godelski · 2 years ago
> This is the sort of mistake that's awfully easy to make in ML.

It is important to remember this! Mistakes are common because they are easy to make. Science is a noisy process, but there is signal there and what we see here is exactly what peer review is about. I tend to argue that open publications are a better form of peer review than conferences/journals because of exactly this. Peer review is about your peers reviewing your work, less about whatever random and noisy standard a conference/journal puts forward. Remember that this was the way things happened for most of our history and that our modern notion of peer review is very recent (mid 70's). Older journals were more about accomplishing the mission that arxiv accomplishes today: disseminating works.

https://mitcommlab.mit.edu/broad/commkit/peer-review-a-histo...

[side note] Another reason I'd advocate for abolishing conferences/journals is that doing so would let us actively push for reproduction papers, failure papers, and many other important kinds of work, since we would no longer be held to the "novelty" criterion (almost everything is incremental). "Publishing" is about communicating your work to your peers and having them validate or invalidate your results.

[edit] I think conferences are good in that they bring people together, which encourages collaboration. That's great. But I should clarify that I'm specifically talking about using these platforms as a means to judge the validity of works. If a conference system wants to just invite works and the community, then I'm totally cool with that. I do also like journals in theory, given that there's a conversation happening between authors and reviewers, but I believe this could just as easily be accomplished through arxiv + github or OpenReview (preferred).

mindesc · 2 years ago
Those are used. Search for the minimum description length principle and entropy-based classifiers. The performance is poor, but it is definitely there and really easy to deploy. I have seen gzip used for plagiarism detection, since similar text tends to compress better; you can then use the compression ratios as weights in a spring model for visualisation. It also works with network communication metadata ...
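A rough sketch of that kind of pipeline, with invented toy documents: pairwise compression distances between texts, then a 2-D layout for visualisation (scikit-learn's MDS over the precomputed distance matrix stands in here for a spring model):

```python
import gzip
import numpy as np
from sklearn.manifold import MDS

docs = [
    "the cat sat on the mat and then the cat slept",
    "the cat slept on the mat while the dog sat",    # near-duplicate of the first
    "quarterly revenue grew despite rising costs",   # unrelated
]

def clen(s: str) -> int:
    return len(gzip.compress(s.encode("utf-8")))

# Pairwise normalized compression distance: small when two documents share structure.
n = len(docs)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        if i != j:
            cij = clen(docs[i] + " " + docs[j])
            dist[i, j] = (cij - min(clen(docs[i]), clen(docs[j]))) / max(clen(docs[i]), clen(docs[j]))
dist = (dist + dist.T) / 2  # symmetrize before embedding

coords = MDS(n_components=2, dissimilarity="precomputed", random_state=0).fit_transform(dist)
print(coords)  # the two similar documents should land close together
```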
iamflimflam1 · 2 years ago
It's true in many experiments. The desire to get the result you want can often overwhelm the need to validate what you are getting.

Especially true when the results confirm any pre-existing thinking you may have.

mananaysiempre · 2 years ago
One particular example that I remember from an introductory particle physics class is the History Plots section[1] of the biennial review of experimental data.

Knowing these quantities is important, but their particular values largely aren’t; nobody’s funding or career really depended on them being equal to one thing or another. Yet look at all the jumps, where the measurements after the initial very rough ones got stuck in the completely wrong place until the jump to the right value—when it happened—was of a completely implausible magnitude, like four, six, or ten sigma.

[1] https://pdg.lbl.gov/2023/reviews/rpp2022-rev-history-plots.p...

mattsan · 2 years ago
Yep, confirmation bias. Luckily helped with peer review!
jsight · 2 years ago
> The unfortunate thing about the field is that subtle methodological errors often cause subtle failures rather than catastrophic failures as we're used to in many other branches of engineering or science.

I've been doing a lot of studying in the ML field lately, and I'm seeing this a lot. It is just another thing that feels like the polar opposite of everything else that I've done as a software engineer.

Miss a semicolon? Instant error.

Miscalculate some grads on one out of three layers? Every now and then it might even work! But the results will be weird.

godelski · 2 years ago
How about this one: tune your hyper-parameters based on the results on your test data.

This is pervasive, even the norm, but it is a form of information leakage: you're passing information about the test dataset to the model. The solution is to use 3 partitions: train, validation, and test. Validation is for HP tuning (you can do cross-validation, btw) and test is a single shot.
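To make that concrete, here's a minimal sketch of a leak-free tuning loop in scikit-learn; the TF-IDF + logistic regression model and the hyper-parameter grid are placeholders picked for illustration. The only point is that the test split is scored exactly once, after every tuning decision has been made on the validation split:

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

data = fetch_20newsgroups(subset="train")
# Three partitions: 70% train, 15% validation, 15% test.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    data.data, data.target, test_size=0.3, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=0)

best_C, best_val = None, -1.0
for C in (0.1, 1.0, 10.0):  # hyper-parameter search uses the validation split only
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(C=C, max_iter=1000))
    model.fit(X_train, y_train)
    val_acc = model.score(X_val, y_val)
    if val_acc > best_val:
        best_C, best_val = C, val_acc

# Refit on train+val with the chosen hyper-parameter, then touch the test set once.
final = make_pipeline(TfidfVectorizer(), LogisticRegression(C=best_C, max_iter=1000))
final.fit(X_train + X_val, list(y_train) + list(y_val))
print("test accuracy (reported once):", final.score(X_test, y_test))
```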

TX81Z · 2 years ago
Academic research code is largely dogshit written as quickly as possible by amateurs, barely tested whatsoever, and the primary intended output of all such code is accumulating paper citations.

A world with half as many scientific papers and twice as much care would produce far more value but the whole enterprise is hopelessly gamified.

achileas · 2 years ago
Having worked in other sciences (neuroscience for me), I’m not sure what catastrophic obvious errors you’re used to seeing. The vast majority IME are like this, except with even longer feedback loops (on the order of several months).
AbrahamParangi · 2 years ago
Well, biology is probably a lot closer to ML in this way. My experience in chemistry or materials science is that 99% of the time, if you do anything wrong, it totally doesn't work at all.

This is fairly typical in software as well.

BSEdlMMldESB · 2 years ago
Now shift to a field where the subtle methodological errors don't come to light for 20 years.

Which field are you in now? Economics!? Hahaha

ks2048 · 2 years ago
Hi, that's my blog post. I'm pretty sure about what I wrote here, but may need the authors to chime in in case I am missing something. I just submitted an issue on GitHub: https://github.com/bazingagin/npc_gzip/issues/3
eyegor · 2 years ago
You may want to consider adding a note to the top. Seems like a lot of people are lazily skimming/reading the headline and seeing it as "gzip paper full of beans, gzip approach sucks", when really I see this as "gzip approach not better than dnn models but mostly competes and much cheaper to run". The paper is still solid.
marcinzm · 2 years ago
>gzip approach not better than dnn models but mostly competes and much cheaper to run

Does it? It looks to do worse than FastText in all benchmarks and kNN is not a cheap algorithm to run so it might actually be slower than FastText.

edit: It looks like FastText takes 5 seconds to train on the Yahoo Answers data set while the gzip approach took them 6 days. So definitely not faster.

tensor · 2 years ago
I honestly don't know why anyone would use this gzip approach in production. If you want to do text classification, really the two options you should consider are a best-in-class linear model like confidence-weighted linear classification by Crammer (https://www.cs.jhu.edu/~mdredze/publications/icml_variance.p...) or a much more expensive LLM.
light_hue_1 · 2 years ago
If this story is true, the paper is not solid.

Claims in the abstract and claim 3 in the paper, as well as much of the publicity around the paper, are just wrong.

It takes gzip from being great out of domain to being middling at best. It goes from something really interesting to a "meh" model. The main part that was intellectually interesting is how robust gzip is out of domain; if that's gone, there isn't much here.

If I were the reviewer for this paper, this would take it from an accept to a "submit to a workshop".

Also, kNN methods are slow: computing all pairwise distances is O(n^2).

ginbazinga · 2 years ago
Hi, I'm the first author of the paper, and I read your blog post. The reason I chose k=2 is that it's recommended to use k = n^{1/2}, and I wanted to pick a k that's compatible with the 5-shot setting. But you are right that this is a bit odd. As I mentioned in both the paper and on Twitter, different values of k will affect the result, and I reported the _max_ result we can get, so it does assume an ideal situation where the guess is always right. I also use this strategy for W2V and SentBERT.

But it doesn't make the result top-2 accuracy. As far as I know, top-2 accuracy means that if either of the top 2 predicted classes is correct, we score. However, as you mentioned, in kNN with k=2 there is a situation where the 2 nearest neighbours point to the same class - we are missing another class candidate if reporting top-2 accuracy.

I would love to add results for different strategies and different k values when I upload the newest version to arxiv, when I have time. The "decrement" strategy you mention in the blog is really nice, and I would love to add it to the repo if you want. I apologize for the short and late reply; I haven't checked the repo yet. I'm preparing for tomorrow's thesis defence and will reply & resolve the issue when I finish.
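For concreteness, here is a toy sketch of the two scoring strategies under discussion: the optimistic "max" count, which behaves like a top-2-style score whenever the two neighbours disagree, and the blog post's "decrement" fallback to the single nearest neighbour. The data, labels, and tie handling are invented for illustration and are not the paper's exact code:

```python
# `neighbours` holds the labels of the 2 nearest training points (closest first)
# for each test item; `truths` holds the true labels.
def score(neighbours, truths, strategy):
    correct = 0
    for (l1, l2), truth in zip(neighbours, truths):
        if l1 == l2:
            pred_ok = (l1 == truth)
        elif strategy == "max":        # a tie counts if either neighbour's label is right
            pred_ok = truth in (l1, l2)
        else:                          # "decrement": fall back to the single nearest neighbour
            pred_ok = (l1 == truth)
        correct += pred_ok
    return correct / len(truths)

neighbours = [("sports", "tech"), ("tech", "tech"), ("sports", "politics")]
truths = ["tech", "tech", "politics"]
print(score(neighbours, truths, "max"))        # 1.0   -- every disagreement counted as a hit
print(score(neighbours, truths, "decrement"))  # ~0.33 -- disagreements resolved to the nearest point
```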
lalaland1125 · 2 years ago
Just wanted to say, thanks for your work debugging this.

You have probably saved other researchers an unfathomable amount of time

syats · 2 years ago
Thanks for the replication, this is important.

One question, did you try to replicate the other result table (Table 3)?

If I understand correctly, top-2 accuracy would be 1 if you have only 2 classes, but it will differ from "normal" accuracy less and less as the number of classes increases (on average). So this shouldn't change the results for Table 3 thaaat much, as those datasets have a large number of classes (see Table 1).

In any case, a top-2 accuracy of 0.685 on the 20-newsgroups dataset is pretty neat for a method that doesn't even consider characters as characters[1], let alone tokens, n-grams, embeddings, and all the nice stuff that those of us working on NLP have been devoting years to.

[1] In my understanding of gzip, it considers only bit sequences, which are not necessarily aligned with words (aka. bytes).

ks2048 · 2 years ago
I haven't yet replicated Table 3 because most of those datasets are much larger and it will take a while to run (they said the YahooAnswers dataset took them 6 days).

Also, I have only tried the "gzip" row because that is all that is in the github repo they referenced.

Yeah, you're right, the more classes there are, probably the lower the effect this will have.

p1esk · 2 years ago
Did you try contacting the authors before you went public with your discovery?
_b · 2 years ago
We're adult enough to have discussions like this in public. They are healthy to have. People make mistakes. Kudos to the original authors for releasing the source code so people could inspect and replicate their results.
jarym · 2 years ago
It isn't a security issue and doesn't warrant responsible disclosure so why would op be expected to?
ks2048 · 2 years ago
I did not, but I see why that could be a better approach. I am mainly trying to be more open with the little side projects I do, so I want to start blogging about what I'm working on. Also, this paper was being widely discussed, so I thought this would be one more entry in that discussion.
cs702 · 2 years ago
Whoa, I just read your post and saw this:

> tldr: it is reporting a top-2 accuracy rather than a kNN(k=2) accuracy

If the accuracy figures shown for other models are top-1, that's a pretty significant mistake, hopefully an innocent one.

Thank you for doing this and sharing your findings!

---

Previous discussion on HN: https://news.ycombinator.com/item?id=36707193

DebtDeflation · 2 years ago
Also:

>k=2 is a bit of an odd choice for kNN classification

That's an understatement. Choosing an even number for any sort of voting algorithm doesn't make much sense, and choosing 2 specifically probably makes the least sense of all.

softwaredoug · 2 years ago
When we did search relevance experimentation at Shopify we made lots of mistakes. I can empathize with the authors. I’ve had a lot of my own public screw ups.

At the end of my time at Shopify I learned good science requires good software engineering. It’s really easy to make mistakes at so many places in the stack.

We spent a lot of time creating rigorous, heavily tested, high-quality software for our experiments so we could trust our numbers and reproduce each other's experiments. We tried to discourage one-off evaluation methods, and if we did create a new one, we added it to our suite and tested the metric to understand what it meant.

It seems obvious, but sadly it's not as common as I wish it were, in my experience with this kind of experimentation. Companies want velocity, and thinking deeply about statistics and building internal tools are not in the interest of most higher-ups.

thomasahle · 2 years ago
> At the end of my time at Shopify I learned good science requires good software engineering.

This is a positive side of industry research. First, you tend to have more software engineering expertise available, and second, you have more of an incentive not to exaggerate your claims: if you say it works, you'll be expected to put it into production.

chaxor · 2 years ago
I appreciate that this blog post was made.

This is like so, so many little projects that I do (even specifically showing problems in papers like this) that never see the light of day. I usually just make a small noise, and then it sits on my hard drive and that's it.

So thank you for putting this out there.

thomasahle · 2 years ago
I've started using Twitter as a "lower effort blogging" platform. After I spend a day on something like this, I'm usually too tired to actually write a blog post about it, which feels like a waste. But then writing a small Twitter thread is usually within my capabilities.
refulgentis · 2 years ago
Really happy to see this: kNN + a classification task + classification based on pure text similarity is a recipe for results stacked in the method's favour.

Schadenfreude responses to this paper overlook that the natural language stuff is crucially important for embeddings: sure, phrases that share words will classify well and gzip well, so gzip can be used as an ersatz classifier.

The miracle of BERT / embeddings is _not_ having to share words: for instance, "what is my safe passcode?" has a strong match with "my lockbox pin is 1234", but not "my jewelry is stored safely in the safe".

This is also an important thing to consider with LLMs: people are using embeddings intended for text similarity, when what you want is an SBERT model, one trained to match a question to a document that will answer it.

https://www.sbert.net/ for the full rabbit hole.

Previously: Should you use OpenAI's embeddings? Probably not, and here's why. https://iamnotarobot.substack.com/p/should-you-use-openais-e....

HN discussion: https://news.ycombinator.com/item?id=35377935

jabbany · 2 years ago
> The miracle of BERT / embeddings is _not_ having to share words

To be fair, the original task was specifically chosen so that something like kNN + compression has a chance of being good: i.e., out-of-domain + low-resource.

Under these conditions the training inputs could be too sparse for a highly parameterized model to learn good embeddings from.

In traditional in domain + big data classification settings there's no chance that non-parametric methods like compression would beat a learned representation.

usgroup · 2 years ago
It wasn't obvious to me why the authors chose kNN for the classifier. Since they produce a distance matrix, they could have used multi-dimensional scaling to convert the matrix to factors, and then used a tree algorithm such as xgboost, which would likely make use of more information than kNN and produce significantly better results. They could also have used a PAQ compression algorithm, which is much better than the LZ compressors -- all of which could have significantly improved the results and possibly delivered on their original conclusions.

What I liked about the subject paper is that the compression algorithm is abstracted away, and it led me to consider what else one could do with compression via the p(x) ~ K^(-|x|) relationship, where K is the alphabet size and |x| is the length of the string x, assuming optimal coding.

For example, it occurred to me that one could do traditional classification by packing the factors for each response into separate documents and then, proceeding as the paper does, determining a new sample's class by which document compresses it best: a sort of supervised classification using compression algorithms. The closer the compressor is to an optimal code for the dataset, the better it will work.
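A rough sketch of that scheme, with made-up toy data and gzip standing in for whatever compressor one would actually use: each class gets one concatenated "class document", and a new sample is assigned to the class whose document's compressed size grows the least when the sample is appended.

```python
import gzip

def clen(text: str) -> int:
    return len(gzip.compress(text.encode("utf-8")))

# One "class document" per label, built by concatenating training examples (toy data).
class_docs = {
    "weather": "rain is expected tomorrow. sunny skies over the coast. storm warning issued.",
    "finance": "stocks fell sharply today. the central bank raised interest rates again.",
}

def classify(sample: str) -> str:
    # Pick the class whose document's compressed size increases the least
    # when the sample is appended, i.e. the class that "compresses the sample best".
    return min(class_docs,
               key=lambda c: clen(class_docs[c] + " " + sample) - clen(class_docs[c]))

print(classify("heavy rain and a storm over the coast tomorrow"))          # expected: weather
print(classify("interest rates and stocks moved after the bank meeting"))  # expected: finance
```

With documents this short the deltas are noisy, but the idea scales: the bigger and more representative each class document is (up to the compressor's window size), the better this scheme can work.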

Schemes for sequence prediction are equally straightforward to implement.

I found it to be a pleasant surprise.

skrebbel · 2 years ago
Can anyone explain to me how a compression algorithm can beat an LLM at anything? Isn’t that like saying horses are better than graffiti?

I’m sure the answer is in there somewhere but I’m not well versed in AI and I simply can’t figure it out.

GuB-42 · 2 years ago
Generally, compression = model + entropy coding.

The model's job is to predict what comes next. The entropy coder's job is to encode the difference between the prediction and what actually comes next, so that the most likely outcome uses as few bits as possible. The more accurate the model is, the smaller the difference between reality and prediction, the fewer bits the entropy coder needs, and the better the compression.

Simple compression algorithms have simple models, like "if I see the same byte 10 times, the 11th is likely to be the same". But you can also use a LLM as your model, as completing text with the most likely word is what LLMs do.

Here they did the opposite. Instead of using a model for compression, by using a few tricks they used a compression algorithm as a model: the most likely outcome is the one the compression algorithm encodes in the fewest bits. And the original authors showed that, in some tasks, the simple model that can be extracted out of gzip beats much more complex LLMs.
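To make the "better model, fewer bits" link concrete, here is a small illustrative sketch (not how gzip itself works): the ideal entropy-coded length of a text is the sum of -log2 p(symbol) under the model, so even a crude order-0 character model fit to similar text needs fewer bits than a uniform model.

```python
import math
from collections import Counter

def ideal_bits(text: str, probs: dict) -> float:
    # Ideal entropy-coded length: sum of -log2 p(c) over the characters of the text.
    return sum(-math.log2(probs[c]) for c in text)

train = "the quick brown fox jumps over the lazy dog " * 50
alphabet = set(train) | set("abcdefghijklmnopqrstuvwxyz ")
counts = Counter(train)
# Order-0 character model with add-one smoothing so unseen characters get nonzero probability.
model = {c: (counts[c] + 1) / (len(train) + len(alphabet)) for c in alphabet}
uniform = {c: 1 / len(alphabet) for c in alphabet}

test = "the lazy brown dog jumps over the fox"
print("uniform model bits:", round(ideal_bits(test, uniform), 1))
print("fitted model bits: ", round(ideal_bits(test, model), 1))  # fewer bits: better model
```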

hoosieree · 2 years ago
I almost feel like compression and embeddings are duals of each other, but I can't quite articulate it.

Embeddings use fixed-size vectors to minimize the dot product between vectors of similar inputs. Compressors use a variable-length encoding to minimize the overall stored size.

contravariant · 2 years ago
Generally, compression algorithms try to give structured data a distribution more similar to random data.

If any byte sequence is a correct file (unlikely, mostly because compression algorithms try to be robust against corruption), then this is easy to reverse: you just generate a random sequence of bytes and then decompress it.

Basically you can turn a compression algorithm into a probability distribution by inserting random bytes wherever the decompression algorithm tries to read one, but sometimes not all bytes are allowed.

You can then reason about this probability distribution and see what it's properties are. Typically something with a probability of 'p' will require -log(p)/log(2) bits.

edwintorok · 2 years ago
Why gzip, and not a more complex compression algorithm like 'xz'? Or, if you want it to be fast, 'zstd' or 'lz4'? Gzip is an odd choice: it offers neither the highest compression ratio nor the fastest speed.
awegio · 2 years ago
A language model estimates the probability of a sequence of words P(w_1, ..., w_n) or equivalently P(word | context).

For compression, word sequences that have higher probability should be encoded with shorter codes, so there is a direct relationship. A well known method to construct such codes based on probabilities is Huffman coding.

This works whether you use a statistical language model using word frequencies or an LLM to estimate probabilities. The better your language model (lower perplexity) the shorter the compressed output will be.

Conversely, you can probably argue that a compression algorithm implicitly defines a language model by the code lengths, e.g., it assumes duplicate strings are more likely than random noise.
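As a concrete illustration of "higher probability, shorter code", here is a compact Huffman-code construction over invented word frequencies; frequent words end up with shorter codes:

```python
import heapq

def huffman_codes(freqs: dict) -> dict:
    # Heap entries: (total weight, tie-breaker, {symbol: code_so_far}).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]

# Toy word frequencies, made up for illustration.
freqs = {"the": 50, "cat": 20, "sat": 15, "on": 10, "mat": 5}
for word, code in sorted(huffman_codes(freqs).items(), key=lambda kv: len(kv[1])):
    print(f"{word:>4}: {code}")   # "the" gets the shortest code, "mat" one of the longest
```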

ozr · 2 years ago
The intuition about how the gzip method works goes like so:

If you compress `ABC`, it will be X bytes. If you then compress `ABCABC`, it will not take 2x bytes. The more similar the two strings that you concatenate, the fewer bytes it will take. `ABCABD` will take more than `ABCABC`, but less than `ABCXYZ`.
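A quick way to see this for yourself, using sentence-length strings (gzip's fixed header dominates for inputs as short as `ABC`), along with the normalized compression distance that this family of methods typically computes; the sentences are made up:

```python
import gzip

def clen(s: str) -> int:
    return len(gzip.compress(s.encode("utf-8")))

def ncd(a: str, b: str) -> float:
    # Normalized compression distance: small when a and b share a lot of structure.
    cab = clen(a + " " + b)
    return (cab - min(clen(a), clen(b))) / max(clen(a), clen(b))

a = "the quick brown fox jumps over the lazy dog"
b = "the quick brown fox leaps over the lazy cat"    # similar to a
c = "stock markets rallied after the rate decision"  # unrelated to a

print(clen(a), clen(a + " " + a))   # concatenating a with itself adds only a few bytes
print("NCD(a, similar)  :", round(ncd(a, b), 3))
print("NCD(a, unrelated):", round(ncd(a, c), 3))     # expected to be noticeably larger
```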

BERT is, by today's standards, a very small LLM, which we know has weaker performance than the billion-parameter-scale models most of us are interacting with today.

jabbany · 2 years ago
> very small LLM

Heh. So does that make it a MLM (medium)?

I've always found it funny that we've settled on a term for a class of models that has a size claim... Especially given how fast things are evolving...

optimalsolver · 2 years ago
Compression is equivalent to intelligence:

https://mattmahoney.net/dc/rationale.html

ks2048 · 2 years ago
It's a very limited task: take a document and classify it into one of (for example) 10 or so categories. Things like detecting certain words can do pretty well in some cases, and documents that compress well together tend to share common substrings.
IKantRead · 2 years ago
LLMs and essentially all neural networks can be viewed as learned compression algorithms, where the compressor's behavior is learned and subject to potential constraints beyond mere file reconstruction.

Highly recommend reading Ted Chiang's "ChatGPT Is a Blurry JPEG of the Web"[0] to get a better sense of this.

Keeping this fact in your mental model of neural networks can also go a long way toward demystifying them.

0. https://www.newyorker.com/tech/annals-of-technology/chatgpt-...

FeepingCreature · 2 years ago
(The human brain is also, in part, a blurry JPEG of the world.)
kachnuv_ocasek · 2 years ago
One way to interpret/understand language models is as quite involved compression algorithms.
stuartaxelowen · 2 years ago
Many other replies here are wrong: the primary reason is that the LLMs were used on completely out-of-distribution data (e.g., trained on English, evaluated on a completely different language that shares some characters). The points about compression's relatedness to understanding are valid, but they are not the primary reason the LLMs underperformed relative to naive compression.
refulgentis · 2 years ago
Other reply is great, more in-depth on details from me here: https://news.ycombinator.com/item?id=36758681

Plain english TL;DR:

- if you limit your task to binning snippets of text

- and the snippets are very well-defined (ex. code vs. Filipino text)

- the snippets' _bytes_ can be compared and score well, with no text understanding needed

- the size delta of a gzip stream after adding one more sentence acts as an ersatz way to compare sets of bytes to each other (e.g., you can imagine that a gzip stream containing 0xFFABCDEF, after having 0xFFABCDEF added to it, will have a size delta of close to 0)

fsmv · 2 years ago
Did you read the recent Douglas Hofstadter article or do you just always use the word ersatz?
ajtulloch · 2 years ago
https://www.inference.org.uk/itprnn/book.pdf is a classic text on this connection.
bjord · 2 years ago
If this is true, I'm looking forward to seeing how all the people who made grandiose statements about that paper now quietly scrub them.

LinkedIn and Twitter influencers, I'm looking at you in particular.

If it's not true, I guess I'll be the one looking stupid—I only skimmed the article.