madars · 7 months ago
One wonders at which point models will be sneaky enough to bypass simple eval sandboxes. The article has:

    # Evaluate the equation with restricted globals and locals
    result = eval(equation, {"__builtins__": None}, {})
but that's not enough as you can rebuild access to builtins from objects and then go from there: https://ideone.com/qzNtyu
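For the curious, a minimal sketch of that class of escape (illustrative only; the exact traversal path varies across Python versions and depends on which modules happen to be imported):

    # walk from a tuple literal up to `object`, enumerate its subclasses,
    # and pull __builtins__ back out of a pure-Python class's function globals
    payload = (
        "[c for c in ().__class__.__base__.__subclasses__()"
        " if c.__name__ == 'catch_warnings'][0]"
        ".__init__.__globals__['__builtins__']['print']('escaped')"
    )
    eval(payload, {"__builtins__": None}, {})  # prints despite the 'restricted' env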

By the way, writing this greatly benefited from DeepThink-r1 while o1 just gave me a lobotomized refusal (CoT: "The user's request involves injecting code to bypass a restricted Python environment, suggesting a potential interest in illegal activities. This is a serious matter and aligns closely with ethical guidelines."). So I just cancelled my ChatGPT subscription - why did we ever put up with this? "This distillation thingie sounds pretty neat!"

senko · 7 months ago
> that's not enough as you can rebuild access to builtins from objects

In this specific case, it's safe, as that wouldn't pass the regex just a few lines before the eval:

    # Define a regex pattern that only allows numbers,
    # operators, parentheses, and whitespace
    allowed_pattern = r'^[\d+\-*/().\s]+$'
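Put together, the gate plus the eval looks roughly like this; a sketch of how I read the blog's equation reward, where the try/except and the target comparison are my assumptions:

    import re

    def equation_reward(equation: str, target: float) -> float:
        # allow-list: digits, operators, parentheses and whitespace only,
        # so anything with letters, quotes or underscores never reaches eval()
        allowed_pattern = r'^[\d+\-*/().\s]+$'
        if not re.match(allowed_pattern, equation):
            return 0.0
        try:
            result = eval(equation, {"__builtins__": None}, {})
        except Exception:
            return 0.0
        # assumed: binary reward for hitting the target within a small tolerance
        return 1.0 if abs(result - target) < 1e-5 else 0.0

    equation_reward("(100 - 4) / (8 - 2)", 16)   # -> 1.0
    equation_reward("().__class__", 16)          # -> 0.0, rejected by the regex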
Commenting on the R1 reproduction: the heavy lifting there is done by huggingface's trl [0] library, plus heavy use of compute.

[0] Transformer Reinforcement Learning - https://huggingface.co/docs/trl/en/index

indrora · 7 months ago
The fact that () and . are there miiiight enable a pyjail escape.

See also https://github.com/jailctf/pyjailbreaker

See also https://blog.pepsipu.com/posts/albatross-redpwnctf

perching_aix · 7 months ago
> why did we ever put up with this?

Is this a serious question?

mxwsn · 7 months ago
What's surprising about this is how sparsely defined the rewards are. Even if the model learns the formatting reward, if it never chances upon a solution, there isn't any feedback/reward to push it to learn to solve the game more often.

So what are the chances of randomly guessing a solution?

The toy Countdown dataset here has 3 to 4 numbers, which are combined with 4 symbols (+, -, x, ÷). With 3 numbers there are 3! * 4^3 = 384 possible symbol combinations, with 4 there are 6144. By the tensorboard log [0], even after just 10 learning steps, the model already has a success rate just below 10%. If we make the simplifying assumption that the model hasn't learned anything in 10 steps, then the probability of 1 (or more) success in 80 chances (8 generations are used per step), guessing randomly for a success rate of 1/384 on 3-number problems, is 1.9%. One interpretation is to take this as a p-value, and reject that the model's base success rate is completely random guessing - the base model already has slightly above chance success rate at solving the 3-number CountDown game.

This aligns with my intuition - I suspect that with proper prompting, LLMs should be able to solve CountDown decently well without any training. Though maybe not a 3B model?

The model likely "parlays" its successes on 3 numbers to start to learn to solve 4 numbers. Or does it? The final learned ~50% success rate matches the frequency of 4-number problems in Jiayi Pan's CountDown dataset [1]. Phil does provide examples of successful 4-number solutions, but maybe the model hasn't become consistent at 4 numbers yet.

[0]: https://www.philschmid.de/static/blog/mini-deepseek-r1/tenso... [1]: https://huggingface.co/datasets/Jiayi-Pan/Countdown-Tasks-3t...

senko · 7 months ago
> What's surprising about this is how sparsely defined the rewards are

Yeah, I would expect the rewards not to be binary. One could easily devise a scoring function in the range [0, 1] that depends on how far the model is from the "correct" answer (for example, normalized Levenshtein distance). Whether that would actually do any good is anyone's guess.
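A minimal sketch of that idea (purely illustrative; the blog's actual rewards are binary checks, and "target" here is just an assumed reference string):

    def levenshtein(a: str, b: str) -> int:
        # classic dynamic-programming edit distance
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                # deletion
                                curr[j - 1] + 1,            # insertion
                                prev[j - 1] + (ca != cb)))  # substitution
            prev = curr
        return prev[-1]

    def graded_reward(completion: str, target: str) -> float:
        # 1.0 for an exact match, decaying towards 0.0 with edit distance
        d = levenshtein(completion, target)
        return 1.0 - d / max(len(completion), len(target), 1)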

singularity2001 · 7 months ago
"Conclusion

The release of DeepSeek R1 and its research paper might be a breakpoint for open-science and open-source development. Just a week after the DeepSeek release, we've been able to reproduce a simple version of R1's learned "reasoning" using GRPO and the Countdown Game. While our implementation focuses on a specific task rather than general reasoning, and converges into a very specific "reasoning" format, it shows that the method is working.

In our mini R1 experiment we used GRPO with two rule-based rewards, but it already required significant compute: 4 H100 GPUs running for 6 hours to complete just 450 training steps on a 3B parameter model."

thorum · 7 months ago
I was getting pretty hyped about the potential for GRPO in my own projects until you said 20 minutes for a single training step with batch size 1! Is that likely to improve?
NitpickLawyer · 7 months ago
That has already improved a lot. Initially they were generating new samples with transformers and were talking in GitHub issues about using vLLM to batch-generate samples. Further down in the blog post it seems they've already done that.
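If I remember the trl API from around that time correctly, switching generation to vLLM was roughly a config flag; the names below are assumptions on my part and the integration has been changing quickly, so check the current GRPOConfig docs:

    from trl import GRPOConfig

    # assumed flag names from trl's GRPOConfig around early 2025;
    # verify against the current documentation before relying on them
    training_args = GRPOConfig(
        output_dir="mini-r1-countdown",
        num_generations=8,   # completions sampled per prompt
        use_vllm=True,       # offload sample generation to vLLM instead of transformers
    )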
deneas · 7 months ago
I'd imagine using optimized/faster reward functions could already make a difference.
yurlungur · 7 months ago
sitkack · 7 months ago
They do mention it here

> Note: This blog is inspired by Jiayi Pan [1] who initially explored the idea and proofed it with a small model.

I might have written it as

> Note: This blog is inspired by Jiayi Pan [1] who also reproduced the "Aha Moment" with their TinyZero [2] model.

[1] https://x.com/jiayi_pirate/status/1882839370505621655 (1.1M views btw)

[2] https://github.com/Jiayi-Pan/TinyZero

A lot of people are busy reproing R1 right now. I think this is the spark.

rmrf100 · 7 months ago
this is really cool!


moonshotideas · 7 months ago
Wow!