Readit News
mfalcon commented on Turning an old Amazon Kindle into a eInk development platform (2021)   blog.lidskialf.net/2021/0... · Posted by u/fanf2
mfalcon · 18 hours ago
Would be nice to use as an interface to interact with Claude Code.
mfalcon commented on Project Vend: Phase Two   anthropic.com/research/pr... · Posted by u/kubami
theturtletalks · 3 days ago
VendBench is really interesting, but vending machines are pretty specialized. Most businesses people actually run look more like online stores, restaurants, hotels, barbershops, or grocery shops.

We're working on an open-source SaaS stack for those common types of businesses. So far we've built a full Shopify alternative and connected it to print-on-demand suppliers for t-shirt brands.

We're trying to figure out how to create a benchmark that tests how well an agent can actually run a t-shirt brand like this. Since our software handles fulfillment, the agent would focus on marketing and driving sales.

Feels like the next evolution of VendBench is to manage actual businesses.

mfalcon · 3 days ago
Nice, I'll take a look. I was thinking about building a benchmark similar to the one you described, but first focusing on the negotiation between the store and the product suppliers.

Does your software also handle this type of task?

mfalcon commented on Speech and Language Processing (3rd ed. draft)   web.stanford.edu/~jurafsk... · Posted by u/atomicnature
mfalcon · 16 days ago
I was eagerly waiting for a chapter on semantic similarity as I was using Universal Sentence Encoder for paraphrase detection, then LLMs showed up before that chapter :).
mfalcon commented on The "confident idiot" problem: Why AI needs hard rules, not vibe checks   steerlabs.substack.com/p/... · Posted by u/steer_dev
mfalcon · 23 days ago
I worked on NLP, mostly NLU, for some years before LLMs. I tried the Universal Sentence Encoder alongside many ML "techniques" to understand user intentions and extract entities from text.

The first time I tried ChatGPT, the thing that surprised me most was the way it understood my queries.

I think that the spotlight is on the "generative" side of this technology and we're not giving query understanding the credit it deserves. I'm also not sure we're fully taking advantage of this functionality.

mfalcon commented on Evaluating Agents   aunhumano.com/index.php/2... · Posted by u/mfalcon
yuzhun · 4 months ago
I'm a beginner. My current agent is built in Java. I'm unsure whether to use Python to call the API for evaluation or to introduce some evaluation tools into the Java project, such as those related to OpenTelemetry.
mfalcon · 4 months ago
You can evaluate with your programming language of choice.
mfalcon commented on Evaluating Agents   aunhumano.com/index.php/2... · Posted by u/mfalcon
codazoda · 4 months ago
Would love to see some examples
mfalcon · 4 months ago
Good idea for a follow up post :)
mfalcon commented on Evaluating Agents   aunhumano.com/index.php/2... · Posted by u/mfalcon
localbuilder · 4 months ago
> There’s one issue with this, you’ll have to be careful to keep the “N – 1” interactions updated whenever you make some changes because you will be “simulating” something that will never happen again in your agent.

This is the biggest problem I've encountered with evals for agents so far. Especially with agents that might do multiple turns of user input > perform task > more user input > perform another task > etc.

Creating evals for these flows has been difficult because I've found that mocking the conversation up to a certain point runs into the drift problem you highlighted as the system changes. I've also explored using an LLM to create dynamic responses at points that require additional user input in E2E flows, which adds its own complexity and nondeterministic behavior. Both approaches are time-consuming and difficult to set up in their own ways.

mfalcon · 4 months ago
Yes, and these problems are more present in the first iterations, when you are still trying to get a good enough agent behaviour.

I'm still thinking about good ways to mitigate this issue, will share.

mfalcon commented on Evaluating Agents   aunhumano.com/index.php/2... · Posted by u/mfalcon
mfalcon · 4 months ago
Hey fellow HNers, OP here. I've been working on agents for a while, so I started sharing some things.

The idea is to keep updating this post with a few more approaches I've been using.

mfalcon commented on An LLM is a lossy encyclopedia   simonwillison.net/2025/Au... · Posted by u/tosh
Zigurd · 4 months ago
It's parsing. It's tokenizing. But it's a stretch to call it understanding. It creates a pattern that it can use to compose a response. Ensuring the response is factual is not fundamental to LLM algorithms.

In other words, it's not thinking. The fact that it can simulate a conversation between thinking humans without thinking is remarkable. It should tell us something about the facility for language. But it's not understanding or thinking.

mfalcon · 4 months ago
I know that "understanding" is a stretch, but I'm referring to the "Understanding" in NLU, which wasn't really understanding either.

u/mfalcon

Karma: 819 · Cake day: July 31, 2009
About
I'm Mariano, a Software Engineer specialized in Machine Learning from BsAs, Argentina.

If you want to contact me, my e-mail is mf2286@gmail.com
