AbrahamParangi · 2 years ago
Probably disappointing to the authors but an excellent rebuttal.

This is the sort of mistake that's awfully easy to make in ML. The unfortunate thing about the field is that subtle methodological errors often cause subtle failures rather than catastrophic failures as we're used to in many other branches of engineering or science. You can easily make a system slightly worse or slightly better by contaminating your training set with bad data or accidentally leaking some information about your target and the ML system will take it in stride (but with slightly contaminated results).

This result makes sense to me because as much as I would like it to be true, applying existing compression algorithms to ML feels like too much of a "free lunch". If there was any special magic happening in compression algorithms we'd use compression algorithms as encoders instead of using transformers as compressors.

godelski · 2 years ago
> This is the sort of mistake that's awfully easy to make in ML.

It is important to remember this! Mistakes are common because they are easy to make. Science is a noisy process, but there is signal there and what we see here is exactly what peer review is about. I tend to argue that open publications are a better form of peer review than conferences/journals because of exactly this. Peer review is about your peers reviewing your work, less about whatever random and noisy standard a conference/journal puts forward. Remember that this was the way things happened for most of our history and that our modern notion of peer review is very recent (mid 70's). Older journals were more about accomplishing the mission that arxiv accomplishes today: disseminating works.

https://mitcommlab.mit.edu/broad/commkit/peer-review-a-histo...

[side note] Another reason I'd advocate for abolishing conferences/journals is that doing so would let us actively push for reproduction papers, failure papers, and many other important kinds of work, since we would no longer be held to the "novelty" criterion (almost everything is incremental). "Publishing" is about communicating your work to your peers and having them validate or invalidate your results.

[edit] I think conferences are good in that they bring people together, which encourages collaboration. That's great. But I should clarify that I'm specifically talking about using these platforms as a means to judge the validity of works. If a conference system wants to just invite works and the community, then I'm totally cool with that. I do also like journals in theory, given that there's a conversation happening between authors and reviewers, but I believe this could just as easily be accomplished through arxiv + github or OpenReview (preferred).

mindesc · 2 years ago
Those are used. Search for the minimum description length principle and entropy-based classifiers. The performance is poor, but it is definitely there and really easy to deploy. I have seen gzip used for plagiarism detection, since similar text tends to compress better; you can then use the compression ratios as weights in a spring model for visualisation. It also works with network communication metadata ...
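A rough sketch of that kind of pipeline, with invented toy documents: pairwise compression distances between texts, then a 2-D layout for visualisation (scikit-learn's MDS over the precomputed distance matrix stands in here for a spring model):

```python
import gzip
import numpy as np
from sklearn.manifold import MDS

docs = [
    "the cat sat on the mat and then the cat slept",
    "the cat slept on the mat while the dog sat",    # near-duplicate of the first
    "quarterly revenue grew despite rising costs",   # unrelated
]

def clen(s: str) -> int:
    return len(gzip.compress(s.encode("utf-8")))

# Pairwise normalized compression distance: small when two documents share structure.
n = len(docs)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        if i != j:
            cij = clen(docs[i] + " " + docs[j])
            dist[i, j] = (cij - min(clen(docs[i]), clen(docs[j]))) / max(clen(docs[i]), clen(docs[j]))
dist = (dist + dist.T) / 2  # symmetrize before embedding

coords = MDS(n_components=2, dissimilarity="precomputed", random_state=0).fit_transform(dist)
print(coords)  # the two similar documents should land close together
```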
iamflimflam1 · 2 years ago
It's true in many experiments. The desire to get the result you want can often overwhelm the need to validate what you are getting.

Especially true when the results confirm any pre-existing thinking you may have.

mananaysiempre · 2 years ago
One particular example that I remember from an introductory particle physics class is the History Plots section[1] of the biennial review of experimental data.

Knowing these quantities is important, but their particular values largely aren’t; nobody’s funding or career really depended on them being equal to one thing or another. Yet look at all the jumps, where the measurements after the initial very rough ones got stuck in the completely wrong place until the jump to the right value—when it happened—was of a completely implausible magnitude, like four, six, or ten sigma.

[1] https://pdg.lbl.gov/2023/reviews/rpp2022-rev-history-plots.p...

mattsan · 2 years ago
Yep, confirmation bias. Luckily helped with peer review!
jsight · 2 years ago
> The unfortunate thing about the field is that subtle methodological errors often cause subtle failures rather than catastrophic failures as we're used to in many other branches of engineering or science.

I've been doing a lot of studying in the ML field lately, and I'm seeing this a lot. It is just another thing that feels like the polar opposite of everything else that I've done as a software engineer.

Miss a semicolon? Instant error.

Miscalculate some grads on one out of three layers? Every now and then it might even work! But the results will be weird.

godelski · 2 years ago
How about this one: tune your hyper-parameters based on the results on your test data.

This is pervasive, even the norm, but it is a form of information leakage: you're passing information about the test dataset to the model. The solution is to use 3 partitions: train, validation, and test. Validation is for HP tuning (you can do cross-validation, btw) and test is a single shot.
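To make that concrete, here's a minimal sketch of a leak-free tuning loop in scikit-learn; the TF-IDF + logistic regression model and the hyper-parameter grid are placeholders picked for illustration. The only point is that the test split is scored exactly once, after every tuning decision has been made on the validation split:

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

data = fetch_20newsgroups(subset="train")
# Three partitions: 70% train, 15% validation, 15% test.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    data.data, data.target, test_size=0.3, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=0)

best_C, best_val = None, -1.0
for C in (0.1, 1.0, 10.0):  # hyper-parameter search uses the validation split only
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(C=C, max_iter=1000))
    model.fit(X_train, y_train)
    val_acc = model.score(X_val, y_val)
    if val_acc > best_val:
        best_C, best_val = C, val_acc

# Refit on train+val with the chosen hyper-parameter, then touch the test set once.
final = make_pipeline(TfidfVectorizer(), LogisticRegression(C=best_C, max_iter=1000))
final.fit(X_train + X_val, list(y_train) + list(y_val))
print("test accuracy (reported once):", final.score(X_test, y_test))
```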

TX81Z · 2 years ago
Academic research code is largely dogshit written as quickly as possible by amateurs, barely tested whatsoever, and the primary intended output of all such code is accumulating paper citations.

A world with half as many scientific papers and twice as much care would produce far more value but the whole enterprise is hopelessly gamified.

achileas · 2 years ago
Having worked in other sciences (neuroscience for me), I’m not sure what catastrophic obvious errors you’re used to seeing. The vast majority IME are like this, except with even longer feedback loops (on the order of several months).
AbrahamParangi · 2 years ago
Well, biology is probably a lot closer to ML in this way. My experience in chemistry or materials science is that 99% of the time, if you do anything wrong, it totally doesn't work at all.

This is fairly typical in software as well.

BSEdlMMldESB · 2 years ago
Now shift to a field where the subtle methodological errors don't come to light for 20 years.

Which field are you in now? Economics!? Hahaha

ks2048 · 2 years ago
Hi, that's my blog post. I'm pretty sure about what I wrote here, but may need the authors to chime in in case I am missing something. I just submitted an issue on GitHub: https://github.com/bazingagin/npc_gzip/issues/3
eyegor · 2 years ago
You may want to consider adding a note to the top. Seems like a lot of people are lazily skimming/reading the headline and seeing it as "gzip paper full of beans, gzip approach sucks", when really I see this as "gzip approach not better than dnn models but mostly competes and much cheaper to run". The paper is still solid.
marcinzm · 2 years ago
>gzip approach not better than dnn models but mostly competes and much cheaper to run

Does it? It looks to do worse than FastText in all benchmarks and kNN is not a cheap algorithm to run so it might actually be slower than FastText.

edit: It looks like FastText takes 5 seconds to train on the Yahoo Answers data set while the gzip approach took them 6 days. So definitely not faster.

tensor · 2 years ago
I honestly don't know why anyone would use this gzip approach in production. If you want to do text classification, really the two options you should consider are a best-in-class linear model like confidence-weighted linear classification by Crammer (https://www.cs.jhu.edu/~mdredze/publications/icml_variance.p...) or a much more expensive LLM.
light_hue_1 · 2 years ago
If this story is true, the paper is not solid.

Claims in the abstract and claim 3 in the paper, as well as much of the publicity around the paper, are just wrong.

It takes gzip from being great out of domain to being middling at best. It goes from something really interesting to a "meh" model. The main part that was intellectually interesting is how robust gzip is out of domain; if that's gone, there isn't much here.

If I were the reviewer for this paper, this would take it from an accept to a "submit to a workshop".

Also, kNN methods are slow: computing all pairwise distances is O(n^2).

ginbazinga · 2 years ago
Hi, I'm the first author of the paper, and I read your blog post. The reason I chose k=2 is that it's recommended to use k = n^{1/2}, and I wanted to pick a k that's compatible with the 5-shot setting. But you are right that this is a bit odd. As I mentioned in both the paper and on Twitter, different values of k will affect the result, and I reported the _max_ result we can get, so it does assume an ideal situation where the guess is always right. I also use this strategy for W2V and SentBERT.

But it doesn't make the result top-2 accuracy. As far as I know, top-2 accuracy means that if either of the top 2 predicted classes is correct, we score. However, as you mentioned, in kNN with k=2 there is a situation where the 2 nearest neighbours point to the same class - we are missing another class candidate if reporting top-2 accuracy.

I would love to add results for different strategies and different k values when I upload the newest version to arxiv, when I have time. The "decrement" strategy you mention in the blog is really nice, and I would love to add it to the repo if you want. I apologize for the short and late reply; I haven't checked the repo yet. I'm preparing for tomorrow's thesis defence and will reply & resolve the issue when I finish.
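For concreteness, here is a toy sketch of the two scoring strategies under discussion: the optimistic "max" count, which behaves like a top-2-style score whenever the two neighbours disagree, and the blog post's "decrement" fallback to the single nearest neighbour. The data, labels, and tie handling are invented for illustration and are not the paper's exact code:

```python
# `neighbours` holds the labels of the 2 nearest training points (closest first)
# for each test item; `truths` holds the true labels.
def score(neighbours, truths, strategy):
    correct = 0
    for (l1, l2), truth in zip(neighbours, truths):
        if l1 == l2:
            pred_ok = (l1 == truth)
        elif strategy == "max":        # a tie counts if either neighbour's label is right
            pred_ok = truth in (l1, l2)
        else:                          # "decrement": fall back to the single nearest neighbour
            pred_ok = (l1 == truth)
        correct += pred_ok
    return correct / len(truths)

neighbours = [("sports", "tech"), ("tech", "tech"), ("sports", "politics")]
truths = ["tech", "tech", "politics"]
print(score(neighbours, truths, "max"))        # 1.0   -- every disagreement counted as a hit
print(score(neighbours, truths, "decrement"))  # ~0.33 -- disagreements resolved to the nearest point
```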
lalaland1125 · 2 years ago
Just wanted to say, thanks for your work debugging this.

You have probably saved other researchers an unfathomable amount of time

syats · 2 years ago
Thanks for the replication, this is important.

One question, did you try to replicate the other result table (Table 3)?

If I understand correctly, top-2 accuracy would be 1 if you have only 2 classes, but it will differ from "normal" accuracy less and less as the number of classes increases (on average). So this shouldn't change the results for Table 3 thaaat much, as those datasets have a large number of classes (see Table 1).

In any case, a top-2 accuracy of 0.685 on the 20-newsgroups dataset is pretty neat for a method that doesn't even consider characters as characters[1], let alone tokens, n-grams, embeddings, and all the nice stuff that those of us working on NLP have been devoting years to.

[1] In my understanding of gzip, it considers only bit sequences, which are not necessarily aligned with words (aka. bytes).

ks2048 · 2 years ago
I haven't yet replicated Table 3 because most of those datasets are much larger and it will take a while to run (they said the YahooAnswers dataset took them 6 days).

Also, I have only tried the "gzip" row because that is all that is in the github repo they referenced.

Yeah, you're right, the more classes there are, probably the lower the effect this will have.

p1esk · 2 years ago
Did you try contacting the authors before you went public with your discovery?
_b · 2 years ago
We're adult enough to have discussions like this in public. They are healthy to have. People make mistakes. Kudos to the original authors for releasing the source code so people could inspect and replicate their results.
jarym · 2 years ago
It isn't a security issue and doesn't warrant responsible disclosure so why would op be expected to?
ks2048 · 2 years ago
I did not, but I see why that could be a better approach. I am mainly trying to be more open with the little side projects I do, so I want to start blogging about what I'm working on. Also, this paper was being widely discussed, so I thought this would be one more entry in that discussion.
cs702 · 2 years ago
Whoa, I just read your post and saw this:

> tldr: it is reporting a top-2 accuracy rather than a kNN(k=2) accuracy

If the accuracy figures shown for other models are top-1, that's a pretty significant mistake, hopefully an innocent one.

Thank you for doing this and sharing your findings!

---

Previous discussion on HN: https://news.ycombinator.com/item?id=36707193

DebtDeflation · 2 years ago
Also:

>k=2 is a bit of an odd choice for kNN classification

That's an understatement. Choosing an even number for any sort of voting algorithm doesn't make much sense, and choosing 2 specifically probably makes the least sense of all.

softwaredoug · 2 years ago
When we did search relevance experimentation at Shopify we made lots of mistakes. I can empathize with the authors. I’ve had a lot of my own public screw ups.

At the end of my time at Shopify I learned good science requires good software engineering. It’s really easy to make mistakes at so many places in the stack.

We spent a lot of time creating rigorous, heavily tested, high-quality software for our experiments so we could trust our numbers and reproduce each other's experiments. We tried to discourage one-off evaluation methods, and if we did create a new one, we added it to our suite and tested the metric to understand what it meant.

It seems obvious, but sadly it's not as common as I wish it were, in my experience with this kind of experimentation. Companies want velocity, and thinking deeply about statistics and building internal tools are not in the interest of most higher-ups.

thomasahle · 2 years ago
> At the end of my time at Shopify I learned good science requires good software engineering.

This is a positive side of industry research. First, you tend to have more software engineering expertise available, and second, you have more of an incentive not to exaggerate your claims: if you say it works, you'll be expected to put it into production.

chaxor · 2 years ago
I appreciate that this blog post was made.

This is like so, so many little projects that I do (even specifically showing problems in papers like this) that never see the light of day. I usually just make a small noise, and then it sits on my hard drive and that's it.

So thank you for putting this out there.

thomasahle · 2 years ago
I've started using Twitter as a "lower effort blogging" platform. After I spend a day on something like this, I'm usually too tired to actually write a blog post about it, which feels like a waste. But then writing a small Twitter thread is usually within my capabilities.
refulgentis · 2 years ago
Really happy to see this: kNN + a classification task + classification based on pure text similarity is a recipe for results stacked in the method's favour.

Schadenfreude responses to this paper overlook that the natural language stuff is crucially important for embeddings: sure, phrases that share words will classify well and gzip well, so gzip can be used as an ersatz classifier.

The miracle of BERT / embeddings is _not_ having to share words: for instance, "what is my safe passcode?" has a strong match with "my lockbox pin is 1234", but not "my jewelry is stored safely in the safe".

This is also an important thing to consider with LLMs: people are using embeddings intended for text similarity, when what you want is an SBERT model, one trained to match a question to a document that will answer it.

https://www.sbert.net/ for the full rabbit hole.

Previously: Should you use OpenAI's embeddings? Probably not, and here's why. https://iamnotarobot.substack.com/p/should-you-use-openais-e....

HN discussion: https://news.ycombinator.com/item?id=35377935

jabbany · 2 years ago
> The miracle of BERT / embeddings is _not_ having to share words

To be fair, the original task was specifically chosen so that something like kNN + compression has a chance of being good: i.e., out-of-domain + low-resource.

Under these conditions the training inputs could be too sparse for a highly parameterized model to learn good embeddings from.

In traditional in domain + big data classification settings there's no chance that non-parametric methods like compression would beat a learned representation.

usgroup · 2 years ago
It wasn't obvious to me why the authors chose kNN for the classifier. Since they produce a distance matrix, they could have used multi-dimensional scaling to convert the matrix to factors, and then used a tree algorithm such as xgboost, which would likely make use of more information than kNN and produce significantly better results. They could also have used a PAQ compression algorithm, which is much better than the LZ compressors -- all of which could have significantly improved the results and possibly delivered on their original conclusions.

What I liked about the subject paper is that the compression algorithm is abstracted away, and it led me to consider what else one could do with compression via the p(x) ~ K^(-|x|) relationship, where K is the alphabet size and |x| is the length of the string x, assuming optimal coding.

For example, it occurred to me that one could do traditional classification by packing the factors for each response into separate documents and then, proceeding as the paper does, determining a new sample's class by which document compresses it best: a sort of supervised classification using compression algorithms. The closer the compressor is to an optimal code for the dataset, the better it will work.
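A rough sketch of that scheme, with made-up toy data and gzip standing in for whatever compressor one would actually use: each class gets one concatenated "class document", and a new sample is assigned to the class whose document's compressed size grows the least when the sample is appended.

```python
import gzip

def clen(text: str) -> int:
    return len(gzip.compress(text.encode("utf-8")))

# One "class document" per label, built by concatenating training examples (toy data).
class_docs = {
    "weather": "rain is expected tomorrow. sunny skies over the coast. storm warning issued.",
    "finance": "stocks fell sharply today. the central bank raised interest rates again.",
}

def classify(sample: str) -> str:
    # Pick the class whose document's compressed size increases the least
    # when the sample is appended, i.e. the class that "compresses the sample best".
    return min(class_docs,
               key=lambda c: clen(class_docs[c] + " " + sample) - clen(class_docs[c]))

print(classify("heavy rain and a storm over the coast tomorrow"))          # expected: weather
print(classify("interest rates and stocks moved after the bank meeting"))  # expected: finance
```

With documents this short the deltas are noisy, but the idea scales: the bigger and more representative each class document is (up to the compressor's window size), the better this scheme can work.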

Schemes for sequence prediction are equally straightforward to implement.

I found it to be a pleasant surprise.

skrebbel · 2 years ago
Can anyone explain to me how a compression algorithm can beat an LLM at anything? Isn’t that like saying horses are better than graffiti?

I’m sure the answer is in there somewhere but I’m not well versed in AI and I simply can’t figure it out.

GuB-42 · 2 years ago
Generally, compression = model + entropy coding.

The model's job is to predict what comes next. The entropy coder's job is to encode the difference between the prediction and what actually comes next, so that the most likely outcome uses as few bits as possible. The more accurate the model is, the smaller the difference between reality and prediction, the fewer bits the entropy coder needs, and the better the compression.

Simple compression algorithms have simple models, like "if I see the same byte 10 times, the 11th is likely to be the same". But you can also use a LLM as your model, as completing text with the most likely word is what LLMs do.

Here they did the opposite. Instead of using a model for compression, by using a few tricks they used a compression algorithm as a model: the most likely outcome is the one the compression algorithm encodes in the fewest bits. And the original authors showed that, in some tasks, the simple model that can be extracted out of gzip beats much more complex LLMs.
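To make the "better model, fewer bits" link concrete, here is a small illustrative sketch (not how gzip itself works): the ideal entropy-coded length of a text is the sum of -log2 p(symbol) under the model, so even a crude order-0 character model fit to similar text needs fewer bits than a uniform model.

```python
import math
from collections import Counter

def ideal_bits(text: str, probs: dict) -> float:
    # Ideal entropy-coded length: sum of -log2 p(c) over the characters of the text.
    return sum(-math.log2(probs[c]) for c in text)

train = "the quick brown fox jumps over the lazy dog " * 50
alphabet = set(train) | set("abcdefghijklmnopqrstuvwxyz ")
counts = Counter(train)
# Order-0 character model with add-one smoothing so unseen characters get nonzero probability.
model = {c: (counts[c] + 1) / (len(train) + len(alphabet)) for c in alphabet}
uniform = {c: 1 / len(alphabet) for c in alphabet}

test = "the lazy brown dog jumps over the fox"
print("uniform model bits:", round(ideal_bits(test, uniform), 1))
print("fitted model bits: ", round(ideal_bits(test, model), 1))  # fewer bits: better model
```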

hoosieree · 2 years ago
I almost feel like compression and embeddings are duals of each other, but I can't quite articulate it.

Embeddings use fixed-size vectors to minimize the dot product between vectors of similar inputs. Compressors use a variable-length encoding to minimize the overall stored size.

contravariant · 2 years ago
Generally, compression algorithms try to give structured data a distribution more similar to random data.

If any byte sequence is a correct file (unlikely, mostly because compression algorithms try to be robust against corruption), then this is easy to reverse: you just generate a random sequence of bytes and then decompress it.

Basically you can turn a compression algorithm into a probability distribution by inserting random bytes wherever the decompression algorithm tries to read one, but sometimes not all bytes are allowed.

You can then reason about this probability distribution and see what it's properties are. Typically something with a probability of 'p' will require -log(p)/log(2) bits.

edwintorok · 2 years ago
Why gzip, and not a more complex compression algorithm like 'xz'? Or, if you want it to be fast, 'zstd' or 'lz4'? Gzip is an odd choice: it offers neither the highest compression ratio nor the fastest speed.
awegio · 2 years ago
A language model estimates the probability of a sequence of words P(w_1, ..., w_n) or equivalently P(word | context).

For compression, word sequences that have higher probability should be encoded with shorter codes, so there is a direct relationship. A well known method to construct such codes based on probabilities is Huffman coding.

This works whether you use a statistical language model using word frequencies or an LLM to estimate probabilities. The better your language model (lower perplexity) the shorter the compressed output will be.

Conversely, you can probably argue that a compression algorithm implicitly defines a language model by the code lengths, e.g., it assumes duplicate strings are more likely than random noise.
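As a concrete illustration of "higher probability, shorter code", here is a compact Huffman-code construction over invented word frequencies; frequent words end up with shorter codes:

```python
import heapq

def huffman_codes(freqs: dict) -> dict:
    # Heap entries: (total weight, tie-breaker, {symbol: code_so_far}).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]

# Toy word frequencies, made up for illustration.
freqs = {"the": 50, "cat": 20, "sat": 15, "on": 10, "mat": 5}
for word, code in sorted(huffman_codes(freqs).items(), key=lambda kv: len(kv[1])):
    print(f"{word:>4}: {code}")   # "the" gets the shortest code, "mat" one of the longest
```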

ozr · 2 years ago
The intuition about how the gzip method works goes like so:

If you compress `ABC`, it will be X bytes. If you then compress `ABCABC`, it will not take 2x bytes. The more similar the two strings that you concatenate, the fewer bytes it will take. `ABCABD` will take more than `ABCABC`, but less than `ABCXYZ`.
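A quick way to see this for yourself, using sentence-length strings (gzip's fixed header dominates for inputs as short as `ABC`), along with the normalized compression distance that this family of methods typically computes; the sentences are made up:

```python
import gzip

def clen(s: str) -> int:
    return len(gzip.compress(s.encode("utf-8")))

def ncd(a: str, b: str) -> float:
    # Normalized compression distance: small when a and b share a lot of structure.
    cab = clen(a + " " + b)
    return (cab - min(clen(a), clen(b))) / max(clen(a), clen(b))

a = "the quick brown fox jumps over the lazy dog"
b = "the quick brown fox leaps over the lazy cat"    # similar to a
c = "stock markets rallied after the rate decision"  # unrelated to a

print(clen(a), clen(a + " " + a))   # concatenating a with itself adds only a few bytes
print("NCD(a, similar)  :", round(ncd(a, b), 3))
print("NCD(a, unrelated):", round(ncd(a, c), 3))     # expected to be noticeably larger
```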

BERT is, by today's standards, a very small LLM, which we know has weaker performance than the billion-parameter-scale models most of us are interacting with today.

jabbany · 2 years ago
> very small LLM

Heh. So does that make it a MLM (medium)?

I've always found it funny that we've settled on a term for a class of models that has a size claim... Especially given how fast things are evolving...

optimalsolver · 2 years ago
Compression is equivalent to intelligence:

https://mattmahoney.net/dc/rationale.html

ks2048 · 2 years ago
It's a very limited task: take a document and classify it into one of (for example) 10 or so categories. Things like detecting certain words can do pretty well in some cases, and documents that compress well together tend to share common substrings.
IKantRead · 2 years ago
LLMs and essentially all neural networks can be viewed as learned compression algorithms, where the compressor's behavior is learned and subject to potential constraints beyond mere file reconstruction.

Highly recommend reading Ted Chiang's "ChatGPT Is a Blurry JPEG of the Web"[0] to get a better sense of this.

Keeping this fact in your mental model of neural networks can also go a long way toward demystifying them.

0. https://www.newyorker.com/tech/annals-of-technology/chatgpt-...

FeepingCreature · 2 years ago
(The human brain is also, in part, a blurry JPEG of the world.)
kachnuv_ocasek · 2 years ago
One way to interpret/understand language models is as quite involved compression algorithms.
stuartaxelowen · 2 years ago
Many other replies here are wrong: the primary reason is that the LLMs were used on completely out-of-distribution data (e.g., trained on English, evaluated on a completely different language that shares some characters). The points about compression's relatedness to understanding are valid, but they are not the primary reason the LLMs underperformed relative to naive compression.
refulgentis · 2 years ago
Other reply is great, more in-depth on details from me here: https://news.ycombinator.com/item?id=36758681

Plain english TL;DR:

- if you limit your task to binning snippets of text

- and the snippets are very well-defined (ex. code vs. Filipino text)

- the snippets' _bytes_ can be compared and score well, with no text understanding needed

- the size delta of a gzip stream after adding one more sentence acts as an ersatz way to compare sets of bytes to each other (e.g., you can imagine that a gzip stream containing 0xFFABCDEF, after having 0xFFABCDEF added to it, will have a size delta of close to 0)

fsmv · 2 years ago
Did you read the recent Douglas Hofstadter article or do you just always use the word ersatz?
ajtulloch · 2 years ago
https://www.inference.org.uk/itprnn/book.pdf is a classic text on this connection.
bjord · 2 years ago
If this is true, I'm looking forward to seeing how all the people who made grandiose statements about that paper now quietly scrub them.

LinkedIn and Twitter influencers, I'm looking at you in particular.

If it's not true, I guess I'll be the one looking stupid—I only skimmed the article.