MrCheeze (u/MrCheeze)

MrCheeze commented on Gemini 3 Pro vs. 2.5 Pro in Pokemon Crystal blog.jcz.dev/gemini-3-pro... · Posted by u/alphabetting

jwrallie · 5 days ago

Having been through the game recently, I am not surprised Goldenrod Underground was a challenge. It is very confusing, and even though I solved it through trial and error, I still don't know what I did. Olivine Lighthouse is the real surprise, as it felt quite obvious to me.

MrCheeze · 4 days ago

This writeup on the underground puzzle is worth reading; it's a pretty baffling "puzzle" design. https://pokemow.com/Gen2/ShutterPuzzle/

That said, it's definitely Gem's fault that it struggled so long, considering it ignored the NPCs that give clues.

MrCheeze commented on Gemini 3 Pro vs. 2.5 Pro in Pokemon Crystal blog.jcz.dev/gemini-3-pro... · Posted by u/alphabetting

squimmy26 · 5 days ago

How certain can we be that these improvements aren't just a result of Gemini 3 Pro pre-training on endless internet writeups of where 2.5 has struggled (and almost certainly what a human would have done instead)?

In other words, how much of this improvement is true generalization vs memorization?

MrCheeze · 4 days ago

There were no such writeups; 99% of the discussion about difficulties in Crystal was in Twitch and Discord chats, which Google doesn't scrape. (It hadn't yet gotten the public attention that Claude's and Gemini's runs of Pokemon Red and Blue have gotten.)

That said, this writeup itself will probably be scraped and influence Gemini 4.

MrCheeze commented on Gemini 3 Pro vs. 2.5 Pro in Pokemon Crystal blog.jcz.dev/gemini-3-pro... · Posted by u/alphabetting

oceansky · 5 days ago

"Crucially, it tells the agent not to rely on its internal training data (which might be hallucinated or refer to a different version of the game) but to ground its knowledge in what it observes. "

Does this even have any effect?

MrCheeze · 4 days ago

It's hard to say for sure because Gemini 3 was only tested with this prompt. But for Gemini 2.5, which the prompt was originally written for, yes, this does cut down on bad assumptions (a specific example: the puzzle with Farfetch'd in Ilex Forest is completely different in the DS remake of the game, and models love to hallucinate elements from the remake's puzzle if you don't emphasize the need to distinguish hypotheses from things they actually observe).

MrCheeze commented on When Compiler Optimizations Hurt Performance nemanjatrifunovic.substac... · Posted by u/rbanffy

president_zippy · 2 months ago

If you wanna really see this at work on a whole other extreme, try compiling code for an N64 game. It's no surprise that optimizations for modern-day x86_64 and Arm64 processors with a lot of cache would not generalize well to a MIPS CPU with a cache that must be manipulated at the software layer and abysmal RDRAM latency, but the exact mechanics of it are interesting.

KazeEmmanuar did a great job analyzing exactly this so we don't have to!

https://www.youtube.com/watch?v=Ca1hHC2EctY

MrCheeze · 2 months ago

Exactly what I was going to post. Optimizations like loop unrolling slow down the N64 because keeping the code size small is the most important factor. I think even compilers of the time got this wrong, not just modern ones.
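To make the code-size point concrete, here is a toy sketch, not a model of the actual N64 hardware. It counts instruction-cache misses for a loop body that is re-executed many times, using a tiny direct-mapped cache. The cache size, line size, and instruction counts are all made-up assumptions chosen so the unrolled body no longer fits; `count_icache_misses` is a hypothetical helper, not anything from a real toolchain.

```python
def count_icache_misses(body_instrs, iterations, cache_lines=4, line_instrs=8):
    """Count misses for a loop body run repeatedly through a toy
    direct-mapped instruction cache (all sizes are illustrative)."""
    cache = [None] * cache_lines                   # which code line each slot holds
    body_lines = -(-body_instrs // line_instrs)    # ceiling division
    misses = 0
    for _ in range(iterations):
        for line in range(body_lines):
            slot = line % cache_lines              # direct-mapped placement
            if cache[slot] != line:                # not resident: fetch from memory
                cache[slot] = line
                misses += 1
    return misses


ELEMENTS = 1024

# Rolled loop: assume 8 instructions per element, one iteration per element.
rolled_misses = count_icache_misses(body_instrs=8, iterations=ELEMENTS)

# Unrolled 8x: assume 48 instructions per iteration, 8 elements per iteration.
# The body (6 cache lines) is larger than the toy cache (4 lines).
unrolled_misses = count_icache_misses(body_instrs=48, iterations=ELEMENTS // 8)

print(rolled_misses, unrolled_misses)
```

In this toy setup the rolled loop misses only once, while the unrolled loop keeps evicting its own code on every pass. Whether that cost outweighs the saved branch overhead on real hardware depends on the actual cache sizes and memory latency, which this sketch does not attempt to model.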

MrCheeze commented on Busy beaver hunters reach numbers that overwhelm ordinary math quantamagazine.org/busy-b... · Posted by u/defrost

wodenokoto · 4 months ago

Numberphile just did a video on subcubic graph numbers, which grow much, much faster than TREE numbers.

Do we know if they grow faster than busy beavers?

https://youtu.be/4-eXjTH6Mq4

MrCheeze · 4 months ago

The busy beaver function is interesting precisely because you _can't_ come up with any computable function that grows faster.
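For anyone who wants to see what is being measured, here is a minimal sketch of a Turing machine simulator running the known 2-state, 2-symbol busy beaver champion. The names `BB2` and `run` are just illustrative. Note the step budget: since the halting problem is undecidable, no general procedure can tell whether an arbitrary machine will ever halt, and that is exactly why the busy beaver function eventually outgrows every computable function.

```python
from collections import defaultdict

# Transition table: (state, symbol read) -> (symbol to write, head move, next state)
BB2 = {
    ("A", 0): (1, +1, "B"),
    ("A", 1): (1, -1, "B"),
    ("B", 0): (1, -1, "A"),
    ("B", 1): (1, +1, "HALT"),
}


def run(machine, max_steps=1000):
    """Run a machine on a blank tape; return (steps taken, ones left on tape)."""
    tape = defaultdict(int)          # blank tape: every cell starts as 0
    pos, state, steps = 0, "A", 0
    while state != "HALT":
        if steps >= max_steps:
            raise RuntimeError("step budget exhausted; machine may never halt")
        write, move, state = machine[(state, tape[pos])]
        tape[pos] = write
        pos += move
        steps += 1
    return steps, sum(tape.values())


print(run(BB2))  # (6, 4)
```

The 2-state case is small enough to check by hand: the machine halts after 6 steps with 4 ones on the tape. The difficulty in the general case is proving that every other machine of the same size either halts sooner or never halts at all.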

MrCheeze commented on Claude says “You're absolutely right!” about everything github.com/anthropics/cla... · Posted by u/pr337h4m

MrCheeze · 4 months ago

Claude almost universally reacts to everything with a positive exclamation as its first sentence, regardless of whether it's good or bad. If you don't believe me, just watch https://www.twitch.tv/claudeplayspokemon for about three minutes and you'll get the idea.

Alternatively, look at the system prompt, where Anthropic attempted to get it to stop doing this:

> Claude never starts its response by saying a question or idea or observation was good, great, fascinating, profound, excellent, or any other positive adjective. It skips the flattery and responds directly.

https://docs.anthropic.com/en/release-notes/system-prompts#a...

This problem seems highly specific to Claude. It's not exactly sycophancy so much as it is a strong bias towards this exact type of reaction to everything.

MrCheeze commented on Evolving OpenAI's Structure openai.com/index/evolving... · Posted by u/rohitpaulk

blagie · 8 months ago

No, this only happens if:

1) You're successful.

2) You mess up checks-and-balances at the beginning.

OpenAI did both.

Personally, I think at some point, the AGs ought to take over and push it back into a non-profit format. OAI undermines the concept of a non-profit.

MrCheeze · 8 months ago

With 2, the real problem is that approximately 0% of the OpenAI employees actually believed in the mission. Pretty much every single one of them signed the letter to the board demanding that if the company's existence ever comes into conflict with humanity's survival, the company's existence comes first.

MrCheeze commented on Speedrunners are vulnerability researchers, they just don't know it yet zetier.com/speedrunners-a... · Posted by u/chc4

tonetegeatinst · 10 months ago

I remember watching a video about this a while ago... it was a fresh perspective into a side of security research I hadn't considered.

MrCheeze · 10 months ago

Are you thinking of Bismuth's "Speedrunning as a gateway to scientific endeavours", perhaps?

https://www.youtube.com/watch?v=w8_1lQ2KH50

MrCheeze commented on Speedrunners are vulnerability researchers, they just don't know it yet zetier.com/speedrunners-a... · Posted by u/chc4

Boldened15 · 10 months ago

Since speedrunners who find glitches are obviously very technical, do they usually already have some sort of day job in tech? I imagine it might be easier and just as lucrative to work on some CRUD app 9-5 and devote the rest of their time to research/streaming, and may be preferable to overloading their brain with even more of the same kind of research.

MrCheeze · 10 months ago

As an n=1 data point, that was my exact situation for a while. Also, a lot of the people who put out high-effort stuff are college students, which works for the same reason.

More interestingly and more surprisingly, some of the people who work on exploiting games _don't_ do any sort of tech work and have no background in compsci - they're purely self-educated, for the sole purpose of breaking the one game they're interested in. This was the case for some of the biggest contributors to ACE in Zelda Ocarina of Time.

MrCheeze commented on Speedrunners are vulnerability researchers, they just don't know it yet zetier.com/speedrunners-a... · Posted by u/chc4

MrCheeze · 10 months ago

I've wondered myself why there's so little overlap between these two closely related interests of mine. Some of it seems to be the "But I don't want to cure cancer. I want to turn people into dinosaurs." effect, where some of the people working on exploiting games ONLY care about what can be done in their one game of interest - it doesn't always generalize to interest in using the same techniques against everything else.

Of course there's also the fact that exploiting 20-30 year old games is just vastly easier than exploiting modern software, due to the total lack of mitigations in them. And that's on top of the fact that with popular games, you're building on decades of reverse engineering work rather than (potentially) starting from scratch. And there's the arguably superior toolset (savestates etc.).

But I think a very big factor is the one this blogpost is trying to address - most people just don't know anything at all about the vuln research industry, which is not exactly searching for attention in the ways that speedruns broadcast to hundreds of thousands of viewers for charity are.