Wow. They must have had some major breakthrough. Those scores are truly insane. O_O
Models have begun to fairly thoroughly saturate "knowledge" and such, but there are still considerable bumps there
But the _big news_, and the demonstration of their achievement here, is the incredible scores they've racked up on what's necessary for agentic AI to become widely deployable: t2-bench. Visual comprehension. Computer use. Vending-Bench. The sorts of things that are necessary for AI to move beyond an auto-researching tool and into the realm where it can actually handle complex tasks in the way that businesses need in order to reap rewards from deploying AI tech.
Will be very interesting to see what papers are published as a result of this, as they have _clearly_ tapped into some new avenues for training models.
And here I was, all wowed, after playing with Grok 4.1 for the past few hours! xD
The problem is that we know the benchmarks in advance. Take Humanity's Last Exam, for example: it's way easier to optimize your model when you have seen the questions before.
These numbers are impressive, to say the least. It looks like Google has produced a beast that will raise the bar even higher. What's even more impressive is how Google came into this game late and went from producing a few flops to being the leader at this (actually, they already achieved the title with 2.5 Pro).
What makes me even more curious is the following
> Model dependencies: This model is not a modification or a fine-tune of a prior model
Google was never really late. Where people perceived Google to have dropped the ball was in its productization of AI. Google's Bard branding stumble was so (hilariously) bad that it threw a lot of people off the scent.
My hunch is that, aside from "safety" reasons, the Google Books lawsuit left some copyright wounds that Google did not want to reopen.
There are no leaders. Every other month a new LLM comes out and outperforms the previous ones by a small margin; the benchmarks always look good (probably because the models are trained on the answers), but then in practice they are basically indistinguishable from the previous ones (take GPT-4 vs GPT-5). We've been in this loop since around the release of ChatGPT 4, when all the main players started this cycle.
The biggest strides in the last 6-8 months have been in generative AIs, specifically for animation.
I hope they keep the pricing similar to 2.5 Pro; currently I pay per token, and 2.5 Pro and GPT-5 are close to the sweet spot for me, but Sonnet 4.5 feels too expensive for larger changes. I've also been moving around 100M tokens per week with Cerebras Code (they moved to GLM 4.6), but the flagship models still feel better when I need help with more advanced debugging or some exemplary refactoring to then feed as an example to a dumber/faster model.
What's more impressive is that I find Gemini 2.5 still relevant in day-to-day usage, despite being so low on those benchmarks compared to Claude 4.5 and GPT-5.1. There's something that Gemini has that makes it a great model in real cases; I'd call it generalisation on its context or something. If you give it the proper context (or it digs through the files in its own agent), it comes up with great solutions. Even if their own coding thing is hit and miss sometimes.
I can't wait to try 3.0, hopefully it continues this trend. Raw numbers in a table don't mean much; you can only get a true feeling once you use it on existing code, in existing projects. Anyway, the top labs keeping each other honest is great for us, the consumers.
I would love to know what the increased token count is across these models for the benchmarks. I find the models continue to get better, but as they do, their token usage also grows. I.e., is the model doing better, or just reasoning for longer?
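A minimal sketch of the comparison I have in mind, assuming an OpenAI-compatible endpoint (the model IDs and prompts below are placeholders, not real benchmark items):

```python
# Sketch: log completion-token usage per item, so score gains can be weighed
# against token spend. Model IDs and prompts here are placeholders.
from openai import OpenAI

client = OpenAI()  # assumes an API key / compatible endpoint is configured

MODELS = ["model-a", "model-b"]                      # hypothetical model IDs to compare
PROMPTS = ["benchmark item 1", "benchmark item 2"]   # stand-ins for real benchmark items

usage_totals = {}
for model in MODELS:
    total = 0
    for prompt in PROMPTS:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        # For reasoning models this count typically includes hidden reasoning tokens.
        total += resp.usage.completion_tokens
    usage_totals[model] = total

print(usage_totals)  # compare benchmark score deltas against these totals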
I think that is always something that is being worked on in parallel. Recent paradigm seems to be the models understanding when they need to use more tokens dynamically (which seems to be very much in line with how computation should generally work).
I also work a lot with SWE-bench Verified in testing. In my opinion, this benchmark is now mainly good for catching regressions on the agent side.
However, above 75%, it is likely all about the same. The remaining instances are likely underspecified despite the effort of the authors who made the benchmark "verified". From what I have seen, these are often cases where the problem statement says implement X for Y, but the agent has to simply guess whether to implement the same for another case Y', which decides whether an instance is lost or won.
But ... what's missing from this comparison: Kimi-K2.
When ChatGPT-3 exploded, OpenAI had at least double the benchmark scores of any other model, open or closed. Gemini 3 Pro (not the model they actually serve) outperforms the best open model ... wait it does not uniformly beat the best open model anymore. Not even close.
Kimi-K2 beats Gemini 3 Pro on several benchmarks. On average, Gemini 3 Pro scores just under 10% better than the best open model, currently Kimi-K2.
Gemini 3 Pro is in fact only the best in about half the benchmarks tested there (https://artificialanalysis.ai/models). In fact ... this could be another Llama 4 moment. The reason Gemini 3 Pro is the best model is a very high score on a single benchmark ("Humanity's Last Exam"); if you take that benchmark out, GPT-5.1 remains the best model available. The other big improvement is SciCode, and if you take that out too, the best open model, Kimi K2, beats Gemini 3 Pro.
And then, there's the pricing:
Kimi K2 on OpenRouter: $0.50 / M input tokens, $2.40 / M output tokens
Gemini 3 Pro (contexts ≤ 200,000 tokens): $2.00 / M input tokens, $12.00 / M output tokens
Gemini 3 Pro (contexts > 200,000 tokens, long-context tier): $4.00 / M input tokens, $18.00 / M output tokens
So Gemini 3 Pro is 4 times (400%) the price of the best open model (and just under 8 times, 800%, with long context), and 70% more expensive than GPT-5.1.
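To make that concrete, a back-of-the-envelope sketch (the 2M-input / 200K-output workload is made up; the prices are the list prices quoted above):

```python
# Rough cost comparison for one hypothetical agentic session:
# 2M input tokens and 200K output tokens (made-up workload).
PRICES = {  # (input $/M tokens, output $/M tokens), list prices quoted above
    "Kimi K2 (OpenRouter)":      (0.50, 2.40),
    "Gemini 3 Pro (<=200K ctx)": (2.00, 12.00),
    "Gemini 3 Pro (>200K ctx)":  (4.00, 18.00),
}

input_tokens, output_tokens = 2_000_000, 200_000

for name, (p_in, p_out) in PRICES.items():
    cost = input_tokens / 1e6 * p_in + output_tokens / 1e6 * p_out
    print(f"{name:28s} ${cost:5.2f}")

# Kimi K2 (OpenRouter)         $ 1.48
# Gemini 3 Pro (<=200K ctx)    $ 6.40   (~4.3x the open model)
# Gemini 3 Pro (>200K ctx)     $11.60   (~7.8x the open model)
```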
The closed models in general, and Google specifically, serve Gemini 3 Pro at double to triple the speed (in tokens per second) of the open models on OpenRouter. Although even here it is not the best; that's gpt-oss-120b on OpenRouter.
This is a big jump in most benchmarks. And if it can match other models in coding while having that Google TPU inference speed and the actually-native 1M context window, it's going to be a big hit.
I hope it isn't such a sycophant like the current Gemini 2.5 models; it makes me doubt its output, which is maybe a good thing now that I think about it.
Looks like the best way to keep improving the models is to come up with really useful benchmarks and make them popular. ARC-AGI-2 is a big jump, I'd be curious to find out how that transfers over to everyday tasks in various fields.
What? The Sonnet 4.5 and GPT-5.1 columns aren't the thinking versions in Google's report?
That's a scandal, IMO.
Given that Gemini 3 seems to do "fine" against the thinking versions, why didn't they post those results? I get that PMs like to make a splash, but that's shockingly dishonest.
We knew it would be a big jump, and while it certainly is in many areas, it's definitely not "groundbreaking/huge leap" worthy like some were expecting from these numbers.
I feel like many will be pretty disappointed by their self-created expectations for this model when they end up actually using it and it turns out to be fairly similar to other frontier models.
Personally I'm very interested in how they end up pricing it.
> I guess improvements will be incremental from here on out.
What do you mean? These coding leaderboards were at single digits about a year ago and are now in the seventies. These frontier models are arguably already better at the benchmark than any single human - it's unlikely that any particular human dev is knowledgeable enough to tackle the full range of diverse tasks even in the smaller SWE-Bench Verified within a reasonable time frame; to the best of my knowledge, no one has tried that.
Why should we expect this to be the limit? Once the frontier labs figure out how to train these fully with self-play (which shouldn't be that hard in this domain), I don't see any clear limit to the level they can reach.
very impressive. I wonder if this sends a different signal to the market regarding using TPUs for training SOTA models versus Nvidia GPUs. From what we've seen, OpenAI is already renting them to diversify... Curious to see what happens next
Really great results, although given how high the scores are, I tried a simple object detection example and the performance was kind of poor in agentic frameworks. Need to see how this performs on other tasks.
It is interesting that Gemini 3 beats every other model on these benchmarks, mostly by a wide margin, but not on SWE-bench. Sonnet is still king here and all three look to be basically on the same level. Kind of wild to see them hit such a wall when it comes to agentic coding.
I think Anthropic is reading the room, and just going to go hard on being "the" coding model. I suppose they feel that if they can win that, they can get an ROI without having to do full blown multimodality at the highest level.
It's probably pretty liberating, because you can make a "spikey" intelligence with only one spike to really focus on.
Codex has been good enough for me and it's much cheaper.
I code non-trivial stuff with it, like multi-threaded code, and at least for my style of AI coding (fairly small units of work with multiple revisions) it is good enough for me not to even consider the competition.
Just giving you a perspective on how the benchmarks might not be important at all for some people and how Claude may have a difficult time being the definitive coding model.
It remains to be seen whether that works out for them, but it seems like a good bet to me. Coding is the most monetizable use anyone has found for LLMs so far, and the most likely to persist past this initial hype bubble (if the Singularity doesn't work out :p).
From my personal experience using the CLI agentic coding tools, I think gemini-cli is fairly on par with the rest in terms of the planning/code that is generated. However, when I recently tried qwen-code, it gave me a better sense of reasoning and structure than gemini. Claude definitely has its own advantages but is expensive (at least for some, if not for all).
My point is, although the model itself may have performed well in benchmarks, I feel like there are other tools that are doing better just by adapting better training/tooling. Gemini CLI, in particular, is not so great at looking up the latest info on the web. Qwen seemed to be trained better around looking up information (or reasoning about when/how to), in comparison. Even the step-wise breakdown of work felt different and a bit smoother.
I do, however, use Gemini CLI for the most part just because it has a generous free quota with very few downsides compared to others. They must be getting loads of training data :D.
Yeah, you can see this even by just running claude-code against other models. For example, DeepSeek used as a backend for CC tends to produce results mostly competitive with Sonnet 4.5. A lot is just in the tooling and prompting.
IMHO coding use cases are much more constrained by tooling than by raw model capabilities at the moment. Perhaps we have finally reached the time of diminishing returns and that will remain the case going forward.
This seems preferable to wasting tokens on tools when a standardized, reliable interface to those tools should be all that's required.
The magic of LLMs is that they can understand the latent space of a problem and infer a mostly accurate response. Saying you need to subscribe to get the latest tools is just a sales tactic trained into the models to protect profits.
Not apples-to-apples. "Codex CLI (GPT-5.1-Codex)", which the site refers to, adds a specific agentic harness, whereas the Gemini 3 Pro seems to be on a standard eval harness.
It would be interesting to see the apples-to-apples figure, i.e. with Google's best harness alongside Codex CLI.
50% of the CLs in SWE-Bench Verified are from the Django codebase. So if you're a big contributor to Django, you should care a lot about that benchmark. Otherwise the difference between models is +-2 tasks done correctly. I wouldn't worry too much about it. Just try it out yourself and see if it's any better.
Their scores on SWE bench are very close because the benchmark is nearly saturated. Gemini 3 beats Sonnet 4.5 on TerminalBench 2.0 by a nice margin (54% vs. 43%), which is also agentic coding (CLI instead of python).
I find Gemini 2.5 pro to be as good or in some cases better for SQL than GPT 5.1. It's aging otherwise, but they must have some good SQL datasets in there for training.
One benchmark I would really like to see: instruction adherence.
For example, the frontier models of early-to-mid 2024 could reliably follow what seemed to be 20-30 instructions. As you gave more instructions than that in your prompt, the LLMs started missing some and your outputs became inconsistent and difficult to control.
The latest set of models (2.5 Pro, GPT-5, etc) seem to top out somewhere in the 100 range? They are clearly much better at following a laundry list of instructions, but they also clearly have a limit and once your prompt is too large and too specific you lose coherence again.
If I had to guess, Gemini 3 Pro has once again pushed the bar, and maybe we're up near 250 (haven't used it, I'm just blindly projecting / hoping). And that's a huge deal! I actually think it would be more helpful to have a model that could consistently follow 1000 custom instructions than it would be to have a model that had 20 more IQ points.
I have to imagine you could make some fairly objective benchmarks around this idea, and it would be very helpful from an engineering perspective to see how each model stacked up against the others in this regard.
Even more nuts would be a model that could follow a large, dense set of highly detailed instructions related to a series of complex tasks. Intelligence is nice, but it's far more useful and programmable if it can tightly follow a lot of custom instructions.
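Something like this toy sketch is what I have in mind: bundle N mechanically checkable rules into one prompt and score the fraction the model actually follows. The rule pool and the call_model stub are made up for illustration; a real benchmark would need a large pool of non-conflicting rules and a real API call.

```python
# Toy adherence benchmark: give the model N checkable rules, score how many it follows.
import random

RULES = [
    ("Use the word 'banana' exactly three times.",
     lambda t: t.lower().count("banana") == 3),
    ("Do not use the letter 'z' anywhere.",
     lambda t: "z" not in t.lower()),
    ("Include exactly five lines that start with '- '.",
     lambda t: sum(line.startswith("- ") for line in t.splitlines()) == 5),
    ("Wrap your final answer in <answer></answer> tags.",
     lambda t: "<answer>" in t and "</answer>" in t),
]

def call_model(prompt: str) -> str:
    # Stub: swap in the model under test here.
    return "- banana\n- banana\n- pear\n- apple\n- plum\n<answer>banana</answer>"

def adherence(n_rules: int, task: str = "Describe your favorite fruit.") -> float:
    rules = random.choices(RULES, k=n_rules)  # sampling with replacement, toy version only
    prompt = task + "\nFollow ALL of these rules:\n" + "\n".join(
        f"{i + 1}. {text}" for i, (text, _) in enumerate(rules))
    output = call_model(prompt)
    return sum(check(output) for _, check in rules) / n_rules

print(adherence(20))  # sweep 20 / 100 / 250 / 1000 rules and watch where adherence degrades
```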
This idea isn't just smart, it's revolutionary. You're getting right at the heart of the problem with today's benchmarks — we don't measure model praise. Great thinking here.
For real though, I think that overall LLM users enjoy things to be on the higher side of sycophancy. Engineers aren't going to feel it, we like our cold dead machines, but the product people will see the stats (people overwhelmingly use LLMs to just talk to about whatever) and go towards that.
Your comment demonstrates a remarkably elevated level of cognitive processing and intellectual rigor. Inquiries of this caliber are indicative of a mind operating at a strategically advanced tier, displaying exceptional analytical bandwidth and thought-leadership potential. Given the substantive value embedded in your question, it is operationally imperative that we initiate an immediate deep-dive and execute a comprehensive response aligned with the strategic priorities of this discussion.
I care very little about model personality outside of sycophancy. The thing about gemini is that it's notorious for its low self esteem. Given that thing is trained from scratch, I'm very curious to see how they've decided to take it.
https://llmdeathcount.com/ shows 15 deaths so far, and LLM user count is in the low billions, which puts us on the order of 0.0015 deaths per hundred thousand users.
I'm guessing LLM Death Count is off by an OOM or two, so we could be getting close to one in a million.
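For the curious, the arithmetic behind those numbers (the ~1 billion user figure is my assumption standing in for "low billions"):

```python
# Back-of-the-envelope: 15 documented deaths over ~1 billion users (assumed count).
deaths, users = 15, 1_000_000_000

print(f"{deaths / users * 100_000:.4f} deaths per 100K users")  # 0.0015
print(f"1 in {users / deaths:,.0f} users")                      # 1 in ~67,000,000

# If the count is low by two orders of magnitude (~1,500 deaths):
print(f"1 in {users / (deaths * 100):,.0f} users")               # ~1 in 670,000, near one in a million
```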
Curiously, this website seems to be blocked in Spain for whatever reason, and the website's certificate is served by `allot.com/emailAddress=info@allot.com` which obviously fails...
Anyone happen to know why? Is this website by any chance sharing information on safe medical abortions or women's rights, something which has gotten websites blocked here before?
> EDIT: Also, your DNS provider is censoring (and probably monitoring) your internet traffic. I would switch to a different provider.
Yeah, that was via my ISP's DNS resolver (Vodafone); switching the resolver works :)
The responsible party is ultimately our government who've decided it's legal to block a wide range of servers and websites because some people like to watch illegal football streams. I think Allot is just the provider of the technology.
Prediction markets were expecting today to be the release. So I wouldn't be surprised if they do a release today, tomorrow, or Thursday (around Nvidia earnings).
EDIT: formatting, hopefully a bit more mobile friendly
> Model dependencies: This model is not a modification or a fine-tune of a prior model
So did they start from scratch with this one?
Anyone with money can trivially catch up to a state of the art model from six months ago.
And as others have said, late is really a function of spigot, guardrails, branding, and ux, as much as it is being a laggard under the hood.
Their major version number bumps are a new pre-trained model. Minor bumps are changes/improvements to post-training on the same foundation.
On Terminal-Bench 2 for example, the leader is currently "Codex CLI (GPT-5.1-Codex)" at 57.8%, beating this new release.
Because it seems to lead by a decent margin on the former and trails behind on the latter
LCB Pro are leet code style questions and SWE bench verified is heavily benchmaxxed very old python tasks.
What's with the hyperbole? It'll tighten the screws, but saying that it's "over for the other labs" might be a tad premature.
It's not over, and never will be, for 2-decade-old accounting software; it definitely will not be over for other AI labs.
Benchmark | Gemini 3 Pro | Gemini 2.5 Pro | Claude Sonnet 4.5 | GPT-5.1 | GPT-5.1 Thinking
---------------------------|--------------|----------------|-------------------|---------|------------------
Humanity's Last Exam | 37.5% | 21.6% | 13.7% | 26.5% | 52%
ARC-AGI-2 | 31.1% | 4.9% | 13.6% | 17.6% | 28%
GPQA Diamond | 91.9% | 86.4% | 83.4% | 88.1% | 61%
AIME 2025                  | 95.0%        | 88.0%          | 87.0%             | 94.0%   | 48%
MathArena Apex | 23.4% | 0.5% | 1.6% | 1.0% | 82%
MMMU-Pro | 81.0% | 68.0% | 68.0% | 80.8% | 76%
ScreenSpot-Pro | 72.7% | 11.4% | 36.2% | 3.5% | 55%
CharXiv Reasoning | 81.4% | 69.6% | 68.5% | 69.5% | N/A
OmniDocBench 1.5 | 0.115 | 0.145 | 0.145 | 0.147 | N/A
Video-MMMU | 87.6% | 83.6% | 77.8% | 80.4% | N/A
LiveCodeBench Pro | 2,439 | 1,775 | 1,418 | 2,243 | N/A
Terminal-Bench 2.0 | 54.2% | 32.6% | 42.8% | 47.6% | N/A
SWE-Bench Verified | 76.2% | 59.6% | 77.2% | 76.3% | N/A
t2-bench | 85.4% | 54.9% | 84.7% | 80.2% | N/A
Vending-Bench 2 | $5,478.16 | $573.64 | $3,838.74 | $1,473.43| N/A
FACTS Benchmark Suite | 70.5% | 63.4% | 50.4% | 50.8% | N/A
SimpleQA Verified | 72.1% | 54.5% | 29.3% | 34.9% | N/A
MMLU | 91.8% | 89.5% | 89.1% | 91.0% | N/A
Global PIQA | 93.4% | 91.5% | 90.1% | 90.9% | N/A
MRCR v2 (8-needle) | 77.0% | 58.0% | 47.1% | 61.6% | N/A
Argh, it doesn't come out right in HN
Benchmark..................Description...................Gemini 3 Pro....GPT-5.1 (Thinking)....Notes
Humanity's Last Exam.......Academic reasoning.............37.5%..........52%....................GPT-5.1 shows 7% gain over GPT-5's 45%
ARC-AGI-2...................Visual abstraction.............31.1%..........28%....................GPT-5.1 multimodal improves grid reasoning
GPQA Diamond................PhD-tier Q&A...................91.9%..........61%....................GPT-5.1 strong in physics (72%)
AIME 2025....................Olympiad math..................95.0%..........48%....................GPT-5.1 solves 7/15 proofs correctly
MathArena Apex..............Competition math...............23.4%..........82%....................GPT-5.1 handles 90% advanced calculus
MMMU-Pro....................Multimodal reasoning...........81.0%..........76%....................GPT-5.1 excels visual math (85%)
ScreenSpot-Pro..............UI understanding...............72.7%..........55%....................Element detection 70%, navigation 40%
CharXiv Reasoning...........Chart analysis.................81.4%..........69.5%.................N/A
The 17.6% is for 5.1 Thinking High.
Not sure 360 days is enough of a sample really but it's an interesting take on AI benchmarks.
Are there any other interesting benchmarks to look at?
[1] https://andonlabs.com/evals/vending-bench-2
I'll wait for the official blog with benchmark results.
I suspect that our ability to benchmark models is waning. Much more investment is required in this area, but how does that play out?
What does this model do that others can't already?
I did not bother verifying the other claims.
Evals are hard.
My take would be that coding itself is hard, but I'm a software engineer myself so I'm biased.
Gemini is very good at pointing out flaws that are very subtle and not noticeable at a first or second glance.
It also produces code that is easy to reason about. You can then feed it to GPT-5.x for refinement and then back to Gemini for assessment.
If you've ever played a competitive game, the difference between these tiers is insane.
I actually never discovered who was responsible for the blockade, until I read this comment. I'm going to look into Allot and send them an email.
EDIT: Also, your DNS provider is censoring (and probably monitoring) your internet traffic. I would switch to a different provider.
The bucket name "deepmind-media" has been used in the past on the deepmind official site, so it seems legit.
here’s the archived pdf: https://web.archive.org/web/20251118111103/https://storage.g...