> To extract the data from the figure, we first downloaded the PDF from Hoffmann et al.’s arXiv submission and saved it in SVG format. We then parsed the SVG content to navigate and search the SVG structure. Within the SVG, we identified the group of points representing the scatter plot data and iterated over each point to extract its fill color and position (x and y coordinates) using the attributes of the corresponding SVG elements.
> To map the SVG coordinates to the model size and training FLOP values, we used the location of the labels or ticks on the respective axes. This allowed us to establish a correspondence between the SVG coordinates and the actual data values represented in the plot.
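For anyone curious, here is roughly what that extraction can look like in code. This is a minimal sketch, not the authors' pipeline: it assumes the scatter points are plain <circle> elements and that you've read two reference ticks per (log-scaled) axis off the plot by hand. Real plotting-library SVGs often hide the markers behind transforms or <use> elements, so treat this as the happy path.

```python
import math
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"

def log_interp(p, p0, v0, p1, v1):
    # Map an SVG coordinate p to a data value on a log-scaled axis,
    # given two reference ticks: coordinate p0 -> value v0, p1 -> value v1.
    t = (p - p0) / (p1 - p0)
    return 10 ** (math.log10(v0) + t * (math.log10(v1) - math.log10(v0)))

def extract_points(svg_path, x_ticks, y_ticks):
    # x_ticks / y_ticks are ((svg_coord, value), (svg_coord, value)) pairs
    # taken from the axis tick labels; filling these in is the manual step.
    root = ET.parse(svg_path).getroot()
    points = []
    for c in root.iter(SVG_NS + "circle"):
        cx, cy = float(c.get("cx")), float(c.get("cy"))
        flops = log_interp(cx, x_ticks[0][0], x_ticks[0][1], x_ticks[1][0], x_ticks[1][1])
        params = log_interp(cy, y_ticks[0][0], y_ticks[0][1], y_ticks[1][0], y_ticks[1][1])
        points.append((c.get("fill"), flops, params))
    return points
```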
They ... reconstructed the data ... from a plot ... using ruler and eyes? Why not just email the original authors for the raw data? I can't help but feel like @yuvaltheterrible debunking papers.
Funnily enough, I've done this for a paper I wrote as well. Emailing authors is kind of a crapshoot. It's normal to get no response if it's been several years since the paper came out. In this case, a pdf plot is essentially lossless, and it's much faster than waiting for authors to maybe respond.
And not only that, in many cases they will tell you (if they reply) "oh, we can't find the source of that plot anymore". Happened to me quite a few times (although in physics).
I'm pretty sure I'm not the only one who's written themselves a mini tool to extract data even from a bitmap plot based on the axes. It involves some manual steps (cropping mainly), but is very convenient for the cases where people don't even use vector graphics, but sometimes just screenshots of plots... Do I like it? Hell no! It's why I've put quite some effort into doing it better for my PhD thesis.
In fairness, they did not use a ruler or eyes. Based on the excerpts you quote, they extracted exact coordinates of the data from an SVG, which, if the SVG was created correctly, should at least give an unbiased dataset, maybe with less precision than the source.
It would be nice if you could post if the actual data matches your reconstruction—now that you have it in hand. Would help us not worry about the data provenance and focus on the result you found.
Seconded. When I first saw this, I thought it looked unintuitive and difficult to use, but when I tried it, it was very easy and I had the extracted data in a few minutes.
> Why not just email the original authors for the raw data?
Industry research labs, especially Google deepmind, are notoriously closed up about their “proprietary” data. I’ve hit this wall multiple times in my own work in AI.
Interesting! If the authors are right, it seems that the number of training tokens required per parameter (slowly) declines as models become larger (Figure 5).
That's good news. I think it deserves wider dissemination, so I'm upvoting your post.
Could it be that the independence of the available training points declines as the dataset size grows? At some point it becomes hard to add data that isn't essentially similar to something you've already added.
This is not good news, this means that we could end up with a dangerously superintelligent AI just by scaling up the number of parameters, without increasing the amount of training data.
No, but LLMs require orders of magnitude more language input than humans[1]. It's very reasonable to assume that architectural differences (size among them) are more likely the constraint on performance.
1. Specifically larger than the upper bound on lifetime language input for humans, even assuming 24/7 at max reading speed.
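A rough back-of-envelope (my numbers, just to show the gap is large even under absurdly generous assumptions):

```python
# Generous upper bound on lifetime language input for a human:
words_per_minute = 300                                 # fast sustained reading speed
human_words = words_per_minute * 60 * 24 * 365 * 80    # reading 24/7 for 80 years, ~1.3e10 words
human_tokens = human_words * 1.3                       # ~1.3 tokens/word rule of thumb, ~1.6e10 tokens

chinchilla_tokens = 1.4e12                             # Chinchilla's reported training tokens
print(chinchilla_tokens / human_tokens)                # ~85, i.e. roughly two orders of magnitude
```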
TL;DR—couldn’t exactly replicate their results, but broadly confirmed their findings. They agree that the optimal range is 5–40 tokens per parameter, and close to 20 for the “chinchilla” model from the original paper.
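If you want to see where those numbers come from, here's a quick sanity check using the common C ≈ 6·N·D FLOP approximation (my sketch, not anything from the paper's code):

```python
import math

# Forward check: Chinchilla at ~20 tokens/parameter.
N = 70e9                  # parameters
D = 20 * N                # ~1.4e12 tokens, matching the reported training set
C = 6 * N * D             # ~5.9e23 FLOPs

# Reverse: given a compute budget C and a tokens/parameter ratio r,
# C = 6 * r * N**2  =>  N = sqrt(C / (6 * r)), D = r * N.
def optimal_allocation(C, r=20):
    N = math.sqrt(C / (6 * r))
    return N, r * N

print(optimal_allocation(5.9e23))  # ~(7.0e10 params, 1.4e12 tokens)
```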
Very unusual choice to reconstruct the dataset by eyeballing the graph in the source paper (why not just ask for it…?) and it’s not really clear why the result is dressed up behind the salacious-seeming abstract.
"We have found three potential issues with Hoffmann et al.’s
estimates of the Chinchilla scaling law that rely on Approach
3:
1. Their estimated model fits the reconstructed data very
poorly. These conclusions hold even when accounting
for potential noise in data reconstruction and excluding
outlier models.
2. The confidence are implausibly tight given the number of data points. Obtaining confidence intervals that
tight would require many hundreds of thousands of observations, while they likely had only ∼400.
3. Their estimated model implies a scaling policy that is
inconsistent with their other approach"
Data point most people are probably looking for:
"We find a range consistent with the 20
tokens per parameter rule of thumb. Indeed, our point estimates imply that 25.6 tokens per parameters is optimal."
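On point 2 in the quote above, the intuition is just that standard errors shrink like 1/sqrt(n), so an interval k times narrower needs roughly k^2 times as many observations. Illustrative numbers only (the factor of 25 is mine, not the paper's):

```python
n_available = 400        # roughly the number of runs Hoffmann et al. had
k = 25                   # assumed: how much tighter the reported CI is than ~400 points support
n_required = n_available * k ** 2
print(n_required)        # 250,000, i.e. "many hundreds of thousands" of observations
```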
No, their claim is that there are diminishing returns for a fixed compute budget (in training) to scaling up data past that threshold vs. scaling up params.
This doesn't take inference into account either, obviously.
Calling this a "replication attempt" implied to me that they tried to replicate the Chinchilla Scaling paper and found that it did not replicate, which would be a very big deal!
Instead, they just redid the analysis based on a figure in the paper and found that the old model with slightly different parameters gave a better fit to the data. This is a valuable contribution, but a bit over-stated by the paper title, and the confrontational, "gotcha" tone of the paper is unwarranted.
A better framing would have been something like "Chinchilla Scaling: Reanalyzed".
one of their three approaches does not replicate and it's because of a software bug in the optimizer they used, i don't know what else we were supposed to say
also, we didn't reconstruct the data using a ruler, you can automate that entire process so that it's much more reliable than that
[1] https://apps.automeris.io/wpd/
Thank you for sharing this on HN!
> Very unusual choice to reconstruct the dataset by eyeballing the graph in the source paper (why not just ask for it…?) and it’s not really clear why the result is dressed up behind the salacious-seeming abstract.
we did ask for the data but got no response until we published on arxiv
what is supposed to be "salacious" about the abstract?
"We have found three potential issues with Hoffmann et al.’s estimates of the Chinchilla scaling law that rely on Approach 3: 1. Their estimated model fits the reconstructed data very poorly. These conclusions hold even when accounting for potential noise in data reconstruction and excluding outlier models. 2. The confidence are implausibly tight given the number of data points. Obtaining confidence intervals that tight would require many hundreds of thousands of observations, while they likely had only ∼400. 3. Their estimated model implies a scaling policy that is inconsistent with their other approach"
Data point most people are probably looking for: "We find a range consistent with the 20 tokens per parameter rule of thumb. Indeed, our point estimates imply that 25.6 tokens per parameters is optimal."