You're right, the results and numbers are mainly for entertainment purposes. This sample size does allow analyzing the main reasoning failure modes and how often they occur.
Haven't seen it before, thanks. Are you affiliated with them?
I noticed the same thing and think you're absolutely right. I thought about adding their current hand / draw, but it was too close to the event to test it properly.
That’s true. The original goal was to see which model performs statistically better than the others, but I quickly realized that would be neither practical nor particularly entertaining.
A proper benchmark would require things like:

- Tens of thousands of hands played
- Strict heads-up format (only two models compared at a time)
- Each hand played twice with positions swapped
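To make the last point from the list above concrete, here is a minimal sketch of a duplicate heads-up schedule: every pair of models plays each dealt hand twice with seats swapped, so card luck largely cancels out. Everything here is hypothetical (the model names, the hand count, and `play_hand`, which stands in for a real poker engine such as PokerKit) and is not part of the actual project.

```python
import itertools
import random

MODELS = ["model_a", "model_b", "model_c"]  # hypothetical model identifiers
HANDS_PER_PAIR = 10_000                     # "tens of thousands" in a real benchmark


def play_hand(seat0: str, seat1: str, deck_seed: int) -> float:
    """Stand-in for a real engine: deal the hand fixed by deck_seed and
    return the net chips won by the player in seat 0."""
    rng = random.Random(f"{deck_seed}:{seat0}:{seat1}")
    return rng.uniform(-100, 100)  # placeholder result, not real poker


def duplicate_heads_up(model_x: str, model_y: str, n_hands: int) -> float:
    """Average chip edge of model_x over model_y, with each deck played twice
    (positions swapped) so positional and card advantages cancel out."""
    total = 0.0
    for seed in range(n_hands):
        total += play_hand(model_x, model_y, seed)   # x in seat 0
        total -= play_hand(model_y, model_x, seed)   # same deck, seats swapped
    return total / (2 * n_hands)


for x, y in itertools.combinations(MODELS, 2):
    edge = duplicate_heads_up(x, y, HANDS_PER_PAIR)
    print(f"{x} vs {y}: avg edge {edge:+.2f} chips/hand")
```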
The current setup is mainly useful for observing common reasoning failure modes and how often they occur.
It’s just a proof of concept, but the code and instructions are here: https://github.com/pablorodriper/poker_with_agents_PyConEs20...
That's cool! Do you have a recording of the talk? You can use PokerKit (https://pokerkit.readthedocs.io/en/stable/) for the engine.
Depends on what your goal is, I think.
And it's also a thing — https://huskybench.com/
Well, you're not wrong :) Vercel is not the one to blame here; it's my skill issue. The entire thing was vibecoded by me, a product manager with no production dev experience. I'm not trying to promote vibecoding, but I couldn't have built it any other way.