You're right, the results and numbers are mainly for entertainment purposes. This sample size does allow analyzing the main reasoning failure modes and how often they occur.
Haven't seen it before, thanks. Are you affiliated with them?
I noticed the same thing and think you're absolutely right. I thought about adding their current hand / draw, but it was too close to the event to test it properly.
That’s true. The original goal was to see which model performs statistically better than the others, but I quickly realized that would be neither practical nor particularly entertaining.
A proper benchmark would require things like:

- Tens of thousands of hands played
- Strict heads-up format (only two models compared at a time)
- Each hand played twice with positions swapped
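To make the last point from the list above concrete, here is a minimal sketch of a duplicate heads-up schedule: every pair of models plays each dealt hand twice with seats swapped, so card luck largely cancels out. Everything here is hypothetical (the model names, the hand count, and `play_hand`, which stands in for a real poker engine such as PokerKit) and is not part of the actual project.

```python
import itertools
import random

MODELS = ["model_a", "model_b", "model_c"]  # hypothetical model identifiers
HANDS_PER_PAIR = 10_000                     # "tens of thousands" in a real benchmark


def play_hand(seat0: str, seat1: str, deck_seed: int) -> float:
    """Stand-in for a real engine: deal the hand fixed by deck_seed and
    return the net chips won by the player in seat 0."""
    rng = random.Random(f"{deck_seed}:{seat0}:{seat1}")
    return rng.uniform(-100, 100)  # placeholder result, not real poker


def duplicate_heads_up(model_x: str, model_y: str, n_hands: int) -> float:
    """Average chip edge of model_x over model_y, with each deck played twice
    (positions swapped) so positional and card advantages cancel out."""
    total = 0.0
    for seed in range(n_hands):
        total += play_hand(model_x, model_y, seed)   # x in seat 0
        total -= play_hand(model_y, model_x, seed)   # same deck, seats swapped
    return total / (2 * n_hands)


for x, y in itertools.combinations(MODELS, 2):
    edge = duplicate_heads_up(x, y, HANDS_PER_PAIR)
    print(f"{x} vs {y}: avg edge {edge:+.2f} chips/hand")
```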
The current setup is mainly useful for observing common reasoning failure modes and how often they occur.
It’s just a proof of concept, but the code and instructions are here: https://github.com/pablorodriper/poker_with_agents_PyConEs20...
That's cool! Do you have a recording of the talk? You can use PokerKit (https://pokerkit.readthedocs.io/en/stable/) for the engine.
Depends on what your goal is, I think.
And it's also a thing — https://huskybench.com/
Well, you're not wrong :) Vercel is not the one to blame here; it's my skill issue. The entire thing was vibecoded by me, a product manager with no production dev experience. I'm not trying to promote vibecoding, but I couldn't have built it any other way.