> To extract the data from the figure, we first downloaded the PDF from Hoffmann et al.’s arXiv submission and saved it in SVG format. We then parsed the SVG content to navigate and search the SVG structure. Within the SVG, we identified the group of points representing the scatter plot data and iterated over each point to extract its fill color and position (x and y coordinates) using the attributes of the corresponding SVG elements.
> To map the SVG coordinates to the model size and training FLOP values, we used the location of the labels or ticks on the respective axes. This allowed us to establish a correspondence between the SVG coordinates and the actual data values represented in the plot.
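For anyone curious, here is roughly what that extraction can look like in code. This is a minimal sketch, not the authors' pipeline: it assumes the scatter points are plain <circle> elements and that you've read two reference ticks per (log-scaled) axis off the plot by hand. Real plotting-library SVGs often hide the markers behind transforms or <use> elements, so treat this as the happy path.

```python
import math
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"

def log_interp(p, p0, v0, p1, v1):
    # Map an SVG coordinate p to a data value on a log-scaled axis,
    # given two reference ticks: coordinate p0 -> value v0, p1 -> value v1.
    t = (p - p0) / (p1 - p0)
    return 10 ** (math.log10(v0) + t * (math.log10(v1) - math.log10(v0)))

def extract_points(svg_path, x_ticks, y_ticks):
    # x_ticks / y_ticks are ((svg_coord, value), (svg_coord, value)) pairs
    # taken from the axis tick labels; filling these in is the manual step.
    root = ET.parse(svg_path).getroot()
    points = []
    for c in root.iter(SVG_NS + "circle"):
        cx, cy = float(c.get("cx")), float(c.get("cy"))
        flops = log_interp(cx, x_ticks[0][0], x_ticks[0][1], x_ticks[1][0], x_ticks[1][1])
        params = log_interp(cy, y_ticks[0][0], y_ticks[0][1], y_ticks[1][0], y_ticks[1][1])
        points.append((c.get("fill"), flops, params))
    return points
```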
They ... reconstructed the data ... from a plot ... using ruler and eyes? Why not just email the original authors for the raw data? I can't help but feel like @yuvaltheterrible debunking papers.
Funnily enough, I've done this for a paper I wrote as well. Emailing authors is kind of a crapshoot. It's normal to get no response if it's been several years since the paper came out. In this case, a pdf plot is essentially lossless, and it's much faster than waiting for authors to maybe respond.
And not only that, in many cases they will tell you (if they reply) "oh, we can't find the source of that plot anymore". Happened to me quite a few times (although in physics).
I'm pretty sure I'm not the only one who's written themselves a mini tool to extract data even from a bitmap plot based on the axes. It involves some manual steps (cropping mainly), but is very convenient for the cases where people don't even use vector graphics, but sometimes just screenshots of plots... Do I like it? Hell no! It's why I've put quite some effort into doing it better for my PhD thesis.
In fairness, they did not use a ruler or eyes. Based on the excerpts you quote, they extracted exact coordinates of the data from an SVG, which, if the SVG was created correctly, should at least give an unbiased dataset, maybe with less precision than the source.
It would be nice if you could post if the actual data matches your reconstruction—now that you have it in hand. Would help us not worry about the data provenance and focus on the result you found.
Seconded. When I first saw this, I thought it looked unintuitive and difficult to use, but when I tried it, it was very easy and I had the extracted data in a few minutes.
> Why not just email the original authors for the raw data?
Industry research labs, especially Google deepmind, are notoriously closed up about their “proprietary” data. I’ve hit this wall multiple times in my own work in AI.
Interesting! If the authors are right, it seems that the number of training tokens required per parameter (slowly) declines as models become larger (Figure 5).
That's good news. I think it deserves wider dissemination, so I'm upvoting your post.
Could it be that the independence of the available training points declines as the dataset size grows? At some point it becomes hard to add data that isn't essentially similar to something you've already added.
This is not good news, this means that we could end up with a dangerously superintelligent AI just by scaling up the number of parameters, without increasing the amount of training data.
No, but LLMs require orders of magnitude more language input than humans[1]. It's very reasonable to assume that architectural differences (size among them) are more likely the constraint on performance.
1. Specifically larger than the upper bound on lifetime language input for humans, even assuming 24/7 at max reading speed.
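A rough back-of-envelope (my numbers, just to show the gap is large even under absurdly generous assumptions):

```python
# Generous upper bound on lifetime language input for a human:
words_per_minute = 300                                 # fast sustained reading speed
human_words = words_per_minute * 60 * 24 * 365 * 80    # reading 24/7 for 80 years, ~1.3e10 words
human_tokens = human_words * 1.3                       # ~1.3 tokens/word rule of thumb, ~1.6e10 tokens

chinchilla_tokens = 1.4e12                             # Chinchilla's reported training tokens
print(chinchilla_tokens / human_tokens)                # ~85, i.e. roughly two orders of magnitude
```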
TL;DR—couldn’t exactly replicate their results, but broadly confirmed their findings. They agree that the optimal range is 5–40 tokens per parameter, and close to 20 for the “chinchilla” model from the original paper.
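If you want to see where those numbers come from, here's a quick sanity check using the common C ≈ 6·N·D FLOP approximation (my sketch, not anything from the paper's code):

```python
import math

# Forward check: Chinchilla at ~20 tokens/parameter.
N = 70e9                  # parameters
D = 20 * N                # ~1.4e12 tokens, matching the reported training set
C = 6 * N * D             # ~5.9e23 FLOPs

# Reverse: given a compute budget C and a tokens/parameter ratio r,
# C = 6 * r * N**2  =>  N = sqrt(C / (6 * r)), D = r * N.
def optimal_allocation(C, r=20):
    N = math.sqrt(C / (6 * r))
    return N, r * N

print(optimal_allocation(5.9e23))  # ~(7.0e10 params, 1.4e12 tokens)
```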
Very unusual choice to reconstruct the dataset by eyeballing the graph in the source paper (why not just ask for it…?) and it’s not really clear why the result is dressed up behind the salacious-seeming abstract.
"We have found three potential issues with Hoffmann et al.’s
estimates of the Chinchilla scaling law that rely on Approach
3:
1. Their estimated model fits the reconstructed data very
poorly. These conclusions hold even when accounting
for potential noise in data reconstruction and excluding
outlier models.
2. The confidence are implausibly tight given the number of data points. Obtaining confidence intervals that
tight would require many hundreds of thousands of observations, while they likely had only ∼400.
3. Their estimated model implies a scaling policy that is
inconsistent with their other approach"
Data point most people are probably looking for:
"We find a range consistent with the 20
tokens per parameter rule of thumb. Indeed, our point estimates imply that 25.6 tokens per parameters is optimal."
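On point 2 in the quote above, the intuition is just that standard errors shrink like 1/sqrt(n), so an interval k times narrower needs roughly k^2 times as many observations. Illustrative numbers only (the factor of 25 is mine, not the paper's):

```python
n_available = 400        # roughly the number of runs Hoffmann et al. had
k = 25                   # assumed: how much tighter the reported CI is than ~400 points support
n_required = n_available * k ** 2
print(n_required)        # 250,000, i.e. "many hundreds of thousands" of observations
```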
No, their claim is that there are diminishing returns for a fixed compute budget (in training) to scaling up data past that threshold vs. scaling up params.
This doesn't take inference into account either, obviously.
Calling this a "replication attempt" implied to me that they tried to replicate the Chinchilla Scaling paper and found that it did not replicate, which would be a very big deal!
Instead, they just redid the analysis based on a figure in the paper and found that the old model with slightly different parameters gave a better fit to the data. This is a valuable contribution, but a bit over-stated by the paper title, and the confrontational, "gotcha" tone of the paper is unwarranted.
A better framing would have been something like "Chinchilla Scaling: Reanalyzed".
one of their three approaches does not replicate and it's because of a software bug in the optimizer they used, i don't know what else we were supposed to say
also, we didn't reconstruct the data using a ruler, you can automate that entire process so that it's much more reliable than that
[1] https://apps.automeris.io/wpd/
Thank you for sharing this on HN!
> Very unusual choice to reconstruct the dataset by eyeballing the graph in the source paper (why not just ask for it…?) and it’s not really clear why the result is dressed up behind the salacious-seeming abstract.
we did ask for the data but got no response until we published on arxiv
what is supposed to be "salacious" about the abstract?
"We have found three potential issues with Hoffmann et al.’s estimates of the Chinchilla scaling law that rely on Approach 3: 1. Their estimated model fits the reconstructed data very poorly. These conclusions hold even when accounting for potential noise in data reconstruction and excluding outlier models. 2. The confidence are implausibly tight given the number of data points. Obtaining confidence intervals that tight would require many hundreds of thousands of observations, while they likely had only ∼400. 3. Their estimated model implies a scaling policy that is inconsistent with their other approach"
Data point most people are probably looking for: "We find a range consistent with the 20 tokens per parameter rule of thumb. Indeed, our point estimates imply that 25.6 tokens per parameters is optimal."