Readit News
magnio · 2 years ago
> To extract the data from the figure, we first downloaded the PDF from Hoffmann et al.’s arXiv submission and saved it in SVG format. We then parsed the SVG content to navigate and search the SVG structure. Within the SVG, we identified the group of points representing the scatter plot data and iterated over each point to extract its fill color and position (x and y coordinates) using the attributes of the corresponding SVG elements.

> To map the SVG coordinates to the model size and training FLOP values, we used the location of the labels or ticks on the respective axes. This allowed us to establish a correspondence between the SVG coordinates and the actual data values represented in the plot.
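The pipeline they describe could be sketched roughly like this (a hypothetical illustration using Python's stdlib `xml.etree.ElementTree`; the group id, element types, and tick positions are all assumptions, not their actual code):

```python
import math
import xml.etree.ElementTree as ET

# Tiny stand-in for the SVG produced by a PDF-to-SVG conversion.
SVG = """<svg xmlns="http://www.w3.org/2000/svg">
  <g id="scatter">
    <circle cx="100.0" cy="50.0" fill="#1f77b4"/>
    <circle cx="200.0" cy="30.0" fill="#ff7f0e"/>
  </g>
</svg>"""

NS = {"svg": "http://www.w3.org/2000/svg"}
root = ET.fromstring(SVG)

# Step 1: pull position and fill color off each scatter-plot element.
points = [
    (float(c.get("cx")), float(c.get("cy")), c.get("fill"))
    for c in root.findall(".//svg:g[@id='scatter']/svg:circle", NS)
]

# Step 2: calibrate an axis from two known tick locations (log-scaled
# here), then map SVG coordinates to data values by interpolation.
def calibrate(svg_a, val_a, svg_b, val_b):
    fa, fb = math.log10(val_a), math.log10(val_b)
    return lambda s: 10 ** (fa + (s - svg_a) * (fb - fa) / (svg_b - svg_a))

to_flops = calibrate(100.0, 1e18, 300.0, 1e22)  # ticks at x=100 and x=300
print(to_flops(points[1][0]))  # x=200 is midway on the log axis -> 1e+20
```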

They ... reconstructed the data ... from a plot ... using a ruler and their eyes? Why not just email the original authors for the raw data? I can't help but feel like I'm watching @yuvaltheterrible debunk papers.

mxwsn · 2 years ago
Funnily enough, I've done this for a paper I wrote as well. Emailing authors is kind of a crapshoot. It's normal to get no response if it's been several years since the paper came out. In this case, a pdf plot is essentially lossless, and it's much faster than waiting for authors to maybe respond.
V1ndaar · 2 years ago
And not only that, in many cases they will tell you (if they reply) "oh, we can't find the source of that plot anymore". Happened to me quite a few times (although in physics).

I'm pretty sure I'm not the only one who's written themselves a mini tool to extract data even from a bitmap plot based on the axes. It involves some manual steps (mainly cropping), but it's very convenient for the cases where people don't even use vector graphics, but sometimes just screenshots of plots... Do I like it? Hell no! That's why I've put quite some effort into doing it better for my PhD thesis.

acc_297 · 2 years ago
In fairness, they did not use a ruler or eyes. Based on the excerpts you quote, they extracted exact coordinates from an SVG, which, if the SVG was created correctly, should at least give an unbiased dataset, maybe with less precision than the source.
ege_erdil · 2 years ago
we did and gave them a two week grace period to respond, but they only responded to us after we published on arxiv

also, we didn't reconstruct the data using a ruler, you can automate that entire process so that it's much more reliable than that

saurabh20n · 2 years ago
Looks like you’re one of the authors.

It would be nice if you could post whether the actual data matches your reconstruction—now that you have it in hand. That would help us not worry about the data provenance and focus on the result you found.

levocardia · 2 years ago
I do that all the time using WebPlotDigitizer [1]. Works great.

[1] https://apps.automeris.io/wpd/

dynm · 2 years ago
Seconded. When I first saw this, I thought it looked unintuitive and difficult to use, but when I tried it, it was very easy and I had the extracted data in a few minutes.
Ajoo · 2 years ago
They claimed that they did ask several times in one of the replies.
polygamous_bat · 2 years ago
> Why not just emailed the original authors for the raw data?

Industry research labs, especially Google deepmind, are notoriously closed up about their “proprietary” data. I’ve hit this wall multiple times in my own work in AI.

sp332 · 2 years ago
https://twitter.com/borgeaud_s/status/1780988694163321250 says they're going to open the data from the paper. Not sure why they didn't do it before, but good news.
williamdclt · 2 years ago
I particularly like this second quote, I appreciate them taking the time to explain "what is a graph" in a scientific paper!
cs702 · 2 years ago
Interesting! If the authors are right, it seems that the number of training tokens required per parameter (slowly) declines as models become larger (Figure 5).

That's good news. I think it deserves wider dissemination, so I'm upvoting your post.

Thank you for sharing this on HN!

dzdt · 2 years ago
Could it be that the independence of available training points declines as the dataset size grows? At some point it becomes hard to add data that isn't essentially similar to something you've already added.
cs702 · 2 years ago
Yes, could be. Not sure how or even if anyone could prove it, though.
Kronopath · 2 years ago
This is not good news: it means that we could end up with a dangerously superintelligent AI just by scaling up the number of parameters, without increasing the amount of training data.
kelseyfrog · 2 years ago
No, but LLMs require orders of magnitude more language input than humans[1]. It's very reasonable to assume that architectural differences (size among them) are more likely the constraint on performance.

1. Specifically larger than the upper bound on lifetime language input for humans, even assuming 24/7 at max reading speed.
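Back-of-envelope for that upper bound (illustrative numbers, hedged):

```python
words_per_sec = 5                # generous nonstop reading speed (assumption)
seconds = 80 * 365 * 24 * 3600   # an 80-year lifetime, reading 24/7
lifetime_words = words_per_sec * seconds
print(f"{lifetime_words:.1e}")   # ~1.3e10 words, vs. trillions of tokens for LLMs
```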

exe34 · 2 years ago
Like a corporation then. We should ban them until we can figure out how to align them!
pfdietz · 2 years ago
It's only bad news if you don't want a dangerously superintelligent AI.
gwern · 2 years ago
The original Chinchilla authors have now identified the original bug, apparently: https://twitter.com/borgeaud_s/status/1780988694163321250
mirekrusin · 2 years ago
Lovely, they are also open sourcing data.
anonymousDan · 2 years ago
The scientific process at work!
cgearhart · 2 years ago
TL;DR—couldn’t exactly replicate their results, but broadly confirmed their findings. They agree that the optimal range is 5–40 tokens per parameter, and close to 20 for the “chinchilla” model from the original paper.

Very unusual choice to reconstruct the dataset by eyeballing the graph in the source paper (why not just ask for it…?) and it’s not really clear why the result is dressed up behind the salacious-seeming abstract.

ege_erdil · 2 years ago
we didn't eyeball the graph, there are more accurate ways of extracting the data from a pdf file than that

we did ask for the data but got no response until we published on arxiv

what is supposed to be "salacious" about the abstract?

newfocogi · 2 years ago
Key claims:

"We have found three potential issues with Hoffmann et al.’s estimates of the Chinchilla scaling law that rely on Approach 3: 1. Their estimated model fits the reconstructed data very poorly. These conclusions hold even when accounting for potential noise in data reconstruction and excluding outlier models. 2. The confidence intervals are implausibly tight given the number of data points. Obtaining confidence intervals that tight would require many hundreds of thousands of observations, while they likely had only ∼400. 3. Their estimated model implies a scaling policy that is inconsistent with their other approaches."
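For intuition on the second point: standard errors shrink like 1/√n, so tightening a confidence interval by a factor of k takes roughly k² times more observations (a generic statistics sketch, not the paper's calculation):

```python
n = 400          # roughly the number of data points they had
k = 30           # shrink the interval ~30x
# standard error ~ 1/sqrt(n), so a k-times tighter interval
# needs about k^2 times more data
print(n * k**2)  # 360000 -- "many hundreds of thousands"
```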

Data point most people are probably looking for: "We find a range consistent with the 20 tokens per parameter rule of thumb. Indeed, our point estimates imply that 25.6 tokens per parameter is optimal."
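Plugging those ratios into a quick sanity check (arithmetic only, not from the paper):

```python
params = 70e9                # e.g. a 70B-parameter model
for ratio in (20, 25.6):
    tokens = params * ratio
    print(f"{ratio} tok/param -> {tokens / 1e12:.2f}T tokens")
```

So the rule of thumb and the point estimate land at roughly 1.4T and 1.8T tokens for a 70B model.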

moffkalast · 2 years ago
Their rule of thumb would imply that a 70B model is saturated with 1.7T tokens, which is inconsistent with reality.
famouswaffles · 2 years ago
The Chinchilla laws were compute optimal scaling laws. They're not supposed to tell you what parameter-token combination will saturate a model.
eldenring · 2 years ago
No, their claim is that there are diminishing returns for a fixed compute budget (in training) to scaling up data past that threshold vs. scaling up params.

This doesn't take inference into account either, obviously.
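As a sketch of what "compute-optimal" means here, using the common C ≈ 6·N·D approximation and the ~20 tokens-per-parameter ratio (both are standard approximations, not exact figures from the paper):

```python
def chinchilla_optimal(C):
    """Split a training-compute budget C (FLOPs) into params N and tokens D,
    assuming C ~ 6*N*D and a compute-optimal ratio of D/N ~ 20."""
    N = (C / (6 * 20)) ** 0.5
    return N, 20 * N

N, D = chinchilla_optimal(5.88e23)  # roughly Chinchilla's training budget
print(f"N~{N:.2e} params, D~{D:.2e} tokens")  # ~7e10 params, ~1.4e12 tokens
```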

warbaker · 2 years ago
Calling this a "replication attempt" implied to me that they tried to replicate the Chinchilla Scaling paper and found that it did not replicate, which would be a very big deal!

Instead, they just redid the analysis based on a figure in the paper and found that the old model with slightly different parameters gave a better fit to the data. This is a valuable contribution, but a bit overstated by the paper title, and the confrontational, "gotcha" tone of the paper is unwarranted.

A better framing would have been something like "Chinchilla Scaling: Reanalyzed".

ege_erdil · 2 years ago
one of their three approaches does not replicate and it's because of a software bug in the optimizer they used, i don't know what else we were supposed to say