Deleted Comment
I agree, it's more complex. But, I feel like the potential with a claude code wrapper is precisely in enabling workflows that are a pain to self-implement but nonetheless are incredibly powerful
IME no matter how well I prompt, a single claude/codex will never get a successful implementation of a significant feature single-shot. However, what does work is having 5 Claudes try it, reading the code and cherry picking the diff segments I like into one franken-spec I give to a final claude instance with essentially just "please implement something like this"
It's super manual nd annoying with git work-trees for me, but sounds like your setup could make it slick
Setting a fixed price is a simple way to help these two parties transact. But hypothetically, it may be more efficient - e.g. you will let more mutually-beneficial events happen - to ask both parties for what their number is for a given event, and having both transact when the numbers are far enough apart (cost is $10, value is $100).
The problem is, you can't directly ask the parties, because they don't want to reveal how high/low they're willing to go for no reason. So, you should essentially structure your questions into a pre-defined algorithm so that everyone is incentivized to reveal at least the ballpark of where their cost/value is. The study of how to structure those questions is a subset of mechanism design / information design, which is a branch of Econ related to game theory
Setting a fixed price is a simple way to help these two parties transact. But hypothetically, it may be more efficient - e.g. you will let more mutually-beneficial events happen - to ask both parties for what their number is for a given event, and having both transact when the numbers are far enough apart (cost is $10, value is $100).
The problem is, you can't directly ask the parties, because they don't want to reveal how high/low they're willing to go for no reason. So, you should essentially structure your questions into a pre-defined algorithm so that everyone is incentivized to reveal at least the ballpark of where their cost/value is. The study of how to structure those questions is a subset of mechanism design / information design, which is a branch of Econ related to game theory
It sounds like your first approach is to verify that events met an agreed-upon threshold.
Have you looked into Mechanism Design and getting customers to e.g. pay more for great outcomes, and a little for ok outcomes?
In its most general, RL is about learning a policy (state -> action mapping). Which often requires inferring value, etc.
But copying a strong reference policy ... is still learning a policy. Whether by SFT or not