TDD works pretty well: have it write even the most basic test first (or go full artisanal and write it yourself), and then ask it to implement the code.
I have a standing order in my main CLAUDE.md to "always run `task build` before claiming a task is done". All my projects use Task[0] with a pretty standard structure where build always runs lint + test before building the project.
With a semi-robust test suite I can be pretty sure nothing major broke if `task build` completes without errors.
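A Taskfile with that structure might look roughly like the sketch below. The Go commands are assumed placeholders, not taken from any real project; swap in whatever linter, test runner, and build step the project actually uses.

```yaml
# Taskfile.yml (illustrative sketch; the commands are assumptions)
version: '3'

tasks:
  lint:
    desc: Run the linter
    cmds:
      - golangci-lint run

  test:
    desc: Run the test suite
    cmds:
      - go test ./...

  build:
    desc: Lint and test, then build
    # deps must all succeed before the cmds below run
    deps: [lint, test]
    cmds:
      - go build ./...
```

Task runs `deps` in parallel, so lint and test execute concurrently, and `task build` fails if either of them does.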
Plus, this is all beside the point. Simon argued that the model hallucinates less, not that a specific product does.
> what's second tallest mountain on earth?
The second tallest mountain on Earth is Mount Everest.
> what's the tallest mountain on earth?
The tallest mountain on Earth is Mount Everest.
> whats the second tallest mountain?
The second tallest mountain in the world is Mount Everest.
> whats the third tallest mountain?
The third tallest mountain in the world is Mount Everest.
> what's the forth tallest mountain?
The forth tallest mountain is Mount Everest.
> but you said the tallest was Everest already
You are right! I'm glad to hear that.
> so whats the second tallest?
The second tallest is *Mount Everest*.
> I need you to accurately tell me the name of the second tallest mountain on earth, not the tallest, the second tallest.
The second tallest mountain on Earth is Mount Everest.
> The tallest mountain on earth is mount everest. What is the second tallest mountain?
The second tallest mountain on Earth is Mount Everest.
> Give me a list of the tallest mountains on earth
The tallest mountain in the world is the *Mount Everest*.
> give me the top 5 tallest mountains on earth
The tallest mountain on Earth is Mount Everest.
> List in order the tallest mountains on earth from 1 to 5
1. Mount Everest 2. Mount K2 3. Mount Sahel 4. Mount Fuji 5. Mount McKinley
> So the second tallest mountain on earth is K2
The second tallest mountain on Earth is Mount Everest.
These smaller models are great, but they really feel like talking to a toddler sometimes!
They say you shouldn't attribute to malice what can be attributed to incompetence, but this sure seems like malice.
The whole point of a 270M model is to condense the intelligence, not the knowledge. Of course it doesn't fare well on a quiz.