We're working on an open-source SaaS stack for those common types of businesses. So far we've built a full Shopify alternative and connected it to print-on-demand suppliers for t-shirt brands.
We're trying to figure out how to create a benchmark that tests how well an agent can actually run a t-shirt brand like this. Since our software handles fulfillment, the agent would focus on marketing and driving sales.
Feels like the next evolution of VendBench is to manage actual businesses.
Does your software also handle this type of task?
The first time I tried ChatGPT, that was the thing that surprised me most: the way it understood my queries.
I think the spotlight is on the "generative" side of this technology, and we're not giving query understanding the credit it deserves. I'm also not sure we're fully taking advantage of this functionality.
This is the biggest problem I've encountered with evals for agents so far. Especially with agents that might do multiple turns of user input > perform task > more user input > perform another task > etc.
Creating evals for these flows has been difficult. I've found that mocking the conversation up to a certain point runs into the drift problem you highlighted as the system changes. I've also explored using an LLM to generate dynamic responses at points that require additional user input in E2E flows, which adds its own complexity and nondeterministic behavior. Both approaches are time-consuming and difficult to set up in their own ways.
I'm still thinking about good ways to mitigate this issue, will share.
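The second approach (an LLM role-playing the user in an E2E flow) can be sketched roughly like this. Everything here is hypothetical illustration, not any particular framework: `call_llm` stands in for a real model call and is stubbed with canned replies so the example runs; `toy_agent` is a placeholder for the agent under test.

```python
# Sketch of an LLM-simulated-user eval loop for multi-turn agents.
# Assumption: `call_llm` is a stand-in for a real LLM client that
# role-plays the end user; here it returns canned replies so the
# example is self-contained and deterministic.

from dataclasses import dataclass


@dataclass
class Turn:
    role: str  # "user" or "agent"
    text: str


def call_llm(persona: str, transcript: list[Turn]) -> str:
    """Hypothetical simulated user: given the persona and conversation
    so far, produce the next user message, or "DONE" to end the flow."""
    last = transcript[-1].text if transcript else ""
    if "size" in last.lower():
        return "Medium, please."
    return "DONE"


def run_eval(agent, persona: str, opening: str, max_turns: int = 6) -> list[Turn]:
    """Drive agent <-> simulated-user turns until the user is done
    or the turn budget runs out; return the transcript for scoring."""
    transcript = [Turn("user", opening)]
    for _ in range(max_turns):
        transcript.append(Turn("agent", agent(transcript)))
        user_msg = call_llm(persona, transcript)
        if user_msg == "DONE":
            break
        transcript.append(Turn("user", user_msg))
    return transcript


# Toy agent standing in for the system under test.
def toy_agent(transcript: list[Turn]) -> str:
    if any(t.role == "user" and "medium" in t.text.lower() for t in transcript):
        return "Order placed: 1 medium t-shirt."
    return "What size would you like?"


if __name__ == "__main__":
    log = run_eval(toy_agent, persona="customer buying one t-shirt",
                   opening="I want to buy a t-shirt.")
    print(log[-1].text)  # final agent turn, ready for an assertion or judge
```

With a real model behind `call_llm` you get the nondeterminism mentioned above; one mitigation is pinning temperature to 0 and asserting on outcomes (the order was placed) rather than exact wording.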
The idea is to keep updating this post with a few more approaches I've been using.
In other words, it's not thinking. The fact that it can simulate a conversation between thinking humans without thinking is remarkable. It should tell us something about the facility for language. But it's not understanding or thinking.