The "spreadsheet" example video is kind of funny: guy talks about how it normally takes him 4 to 8 hours to put together complicated, data-heavy reports. Now he fires off an agent request, goes to walk his dog, and comes back to a downloadable spreadsheet of dense data, which he pulls up and says "I think it got 98% of the information correct... I just needed to copy / paste a few things. If it can do 90 - 95% of the time consuming work, that will save you a ton of time"
It feels like either finding that 2% that's off (or dealing with 2% error) will be the time consuming part in a lot of cases. I mean, this is nothing new with LLMs, but as these use cases encourage users to input more complex tasks, that are more integrated with our personal data (and at times money, as hinted at by all the "do task X and buy me Y" examples), "almost right" seems like it has the potential to cause a lot of headaches. Especially when the 2% error is subtle and buried in step 3 of 46 of some complex agentic flow.
> how it normally takes him 4 to 8 hours to put together complicated, data-heavy reports. Now he fires off an agent request, goes to walk his dog, and comes back to a downloadable spreadsheet of dense data, which he pulls up and says "I think it got 98% of the information correct...
This is where the AI hype bites people.
A great use of AI in this situation would be to automate the collection and checking of data. Search all of the data sources and aggregate links to them in an easy place. Use AI to search the data sources again and compare against the spreadsheet, flagging any numbers that appear to disagree.
Yet the AI hype train takes this all the way to the extreme conclusion of having AI do all the work for them. The quip about 98% correct should be a red flag for anyone familiar with spreadsheets, because it’s rarely simple to identify which 2% is actually correct or incorrect without reviewing everything.
This same problem extends to code. People who use AI as a force multiplier to do the thing for them and review each step as they go, while also disengaging and working manually when it’s more appropriate have much better results. The people who YOLO it with prompting cycles until the code passes tests and then submit a PR are causing problems almost as fast as they’re developing new features in non-trivial codebases.
“The fallacy in these versions of the same idea is perhaps the most pervasive of all fallacies in philosophy. So common is it that one questions whether it might not be called the philosophical fallacy. It consists in the supposition that whatever is found true under certain conditions may forthwith be asserted universally or without limits and conditions. Because a thirsty man gets satisfaction in drinking water, bliss consists in being drowned. Because the success of any particular struggle is measured by reaching a point of frictionless action, therefore there is such a thing as an all-inclusive end of effortless smooth activity endlessly maintained.
It is forgotten that success is success of a specific effort, and satisfaction the fulfillment of a specific demand, so that success and satisfaction become meaningless when severed from the wants and struggles whose consummations they arc, or when taken universally.”
The proper use of these systems is to treat them like an intern or new grad hire. You can give them the work that none of the mid-tier or senior people want to do, thereby speeding up the team. But you will have to review their work thoroughly because there is a good chance they have no idea what they are actually doing. If you give them mission-critical work that demands accuracy or just let them have free rein without keeping an eye on them, there is a good chance you are going to regret it.
”The people who YOLO it with prompting cycles until the code passes tests and then submit a PR are causing problems almost as fast as they’re developing new features in non-trivial codebases.”
This might as well be the new definition of “script kiddie”, and it’s the kids that are literally going to be the ones birthed into this lifestyle. The “craft” of programming may not be carried by these coming generations and possibly will need to be rediscovered at some point in the future. The Lost Art of Programming is a book that’s going to need to be written soon.
“The quip about 98% correct should be a red flag for anyone familiar with spreadsheets”
I disagree. Receiving a spreadsheet from a junior means I need to check it. If this gives me infinite additional juniors I’m good.
It’s this popular pattern of HN comments - expect AI to behave deterministically correct - while the whole world operates on stochastically correct all the time…
98% sure each commit doesn’t corrupt the database, regress a customer feature, open a security vulnerability. 50 commits later … (which is like, one day for an agentic workflow)
Or as I would like to put it, LLM outputs are essentially the Library of Babel. Yes, it contains all of the correct answers, but might as well be entirely useless.
> A great use of AI in this situation would be to automate the collection and checking of data. Search all of the data sources and aggregate links to them in an easy place. Use AI to search the data sources again and compare against the spreadsheet, flagging any numbers that appear to disagree.
Why would you need ai for that though? Pull your sources. Run a diff. Straight to the known truth without the chatgpt subscription. In fact by that point you don’t even need the diff if you pulled from the sources. Just drop into the spreadsheet at that point.
In reality most people will just scan for something that is obviously wrong, check that, and call the rest "good enough". Government data is probably going to get updated later anyhow. It's just a target for a company to aim for. For many companies the cost savings is much more than having a slightly larger margin of error on some projections. For other companies they will just have to accept the several hours of saved time rather than the full day.
Of course, Pareto principle is at work here. In an adjacent field, self-driving, they are working on the last "20%" for almost a decade now. It feels kind of odd that almost no one is talking about self-driving now, compared to how hot of a topic it used to be, with a lot of deep, moral, almost philosophical discussions.
> The first 90 percent of the code accounts for the first 90 percent of the development time. The remaining 10 percent of the code accounts for the other 90 percent of the development time.
It’s past the hype curve and into the trough of disillusionment. Over the next 5,10,15 years (who can say?) the tech will mature out of the trough into general adoption.
GenAI is the exciting new tech currently riding the initial hype spike. This will die down into the trough of disillusionment as well, probably sometime next year. Like self-driving, people will continue to innovate in the space and the tech will be developed towards general adoption.
We saw the same during crypto hype, though that could be construed as more of a snake oil type event.
The critics of the current AI buzz certainly have been drawing comparisons to self driving cars as LLMs inch along with their logarithmic curve of improvement that's been clear since the GPT-2 days.
Whenever someone tells me how these models are going to make white collar professions obsolete in five years, I remind them that the people making these predictions 1) said we'd have self driving cars "in a few years" back in 2015 and 2) the predictions about white collar professions started in 2022 so five years from when?
"I don't get all the interest about self-driving. That tech has been dead for years, and everyone is talking about that tech. That tech was never that big in therms of life... Thank you for your attention to this matter"
The act of trying to make that 2% appear like "minimal, dismissable" is almost a mass psychosis in the AI world at times it seems like.
A few comparisons:
>Pressing the button: $1
>Knowing which button to press: $9,999
Those 2% copy-paste changes are the $9.999 and might take as long to find as rest of the work.
I also find that validating data can be much faster than calculating data. It's like when you're in algebra class and you're told to "solve for X". Once you find the value for X you plug it into the equation to see if it fits, and it's 10x faster than solving for X originally.
Regardless of if AI generates the spreadsheet or if I generate the spreadsheet, I'm still going to do the same validation steps before I share it with anyone. I might have a 2% error rate on a first draft.
This is the exact same issue that I've had trying to use LLMs for anything that needs to be precise such as multi-step data pipelines. The code it produces will look correct and produce a result that seems correct. But when you do quality checks on the end data, you'll notice that things are not adding up.
So then you have to dig into all this overly verbose code to identify the 3-4 subtle flaws with how it transformed/joined the data. And these flaws take as much time to identify and correct as just writing the whole pipeline yourself.
I'll get into hot water with this, but I still think LLMs do not think like humans do - as in the code is not a result of a trying to recreate a correct thought process in a programming language, but some sort of statistically most likely string that matches the input requirements,
I used to have a non-technical manager like this - he'd watch out for the words I (and other engineers) said and in what context, and would repeat them back mostly in accurate word contexts. He sounded remarkably like he knew what he was talking about, but would occasionally make a baffling mistake - like mixing up CDN and CSS.
LLMs are like this, I often see Cursor with Claude making the same kind of strange mistake, only to catch itself in the act, and fix the code (but what happens when it doesn't)
I just wrote a post on my site where the LLM had trouble with 1) clicking a button, 2) taking a screenshot, 3) repeat. The non-deterministic nature of LLMs is both a feature and a bug. That said, read/correct can sometimes be a preferable workflow to create/debug, especially if you don't know where to start with creating.
I think it's basically equivalent to giving that prompt to a low paid contractor coder and hoping their solution works out. At least the turnaround time is faster?
But normally you would want a more hands on back and forth to ensure the requirements actually capture everything, validation and etc that the results are good, layers of reviews right
In my experience using small steps and a lot of automated tests work very well with CC. Don’t go for these huge prompts that have a complete feature in it.
Remember the title “attention is all you need”? Well you need to pay a lot of attention to CC during these small steps and have a solid mental model of what it is building.
My favorite part is people taking the 98% number to heart as if there's any basis to it whatsoever and isn't just a number they pulled out of their ass in this marketing material made by an AI company trying to sell you their AI product. In my experience it's more like a 70% for dead simple stuff, and dramatically lower for anything moderately complex.
And why 98%? Why not 99% right? Or 99.9% right? I know they can't outright say 100% because everyone knows that's a blatant lie, but we're okay with them bullshitting about the 98% number here?
Also there's no universe in which this guy gets to walk his dog while his little pet AI does his work for him, instead his boss is going to hound him into doing quadruple the work because he's now so "efficient" that he's finishing his spreadsheet in an hour instead of 8 or whatever. That, or he just gets fired and the underpaid (or maybe not even paid) intern shoots off the same prompt to the magic little AI and does the same shoddy work instead of him. The latter is definitely what the C-suite is aiming for with this tech anyway.
"It feels like either finding that 2% that's off (or dealing with 2% error) will be the time consuming part in a lot of cases."
This is the part you have wrong. People just won't do that. They'll save the 8 hours and just deal with 2% error in their work (which reduces as AI models get better). This doesn't work with something with a low error tolerance, but most people aren't building the next Golden Gate Bridge. They'll just fix any problems as they crop up.
Some of you will be screaming right now "THAT'S NOT WORTH IT", as if companies don't already do this to consumers constantly, like losing your luggage at the airport or getting your order wrong. Or just selling you something defective, all of that happens >2% of the time, because companies know customers will just deal-with-it.
It’s not worth it because of the compounding effect when it is a repeated process. 98% accuracy might be fine for a single iteration, but if you run your process 365 times (maybe once a day for a year) whatever your output is will be so wrong that it is unusable.
I have a friend who's vibe-coding apps. He has a lot of them, like 15 or more, but most are only 60–90% complete (almost every feature is only 60-90% complete), which means almost nothing works properly. Last time he showed me something, it was sending the Supabase API key in the frontend with write permissions, so I could edit anything on his site just by inspecting the network tab in developer tools.
The amount of technical debt and security issues building up over the coming years is going to be massive.
I think the question then is what's the human error rate... We know we're not perfect... So if you're 100% rested and only have to find the edge case bug, maybe you'll usually find it vs you're burned out getting it 98% of the way there and fail to see the 2% of the time bugs... Wording here is tricky to explain but I think what we'll find is this helps us get that much closer... Of course when you spend your time building out 98% of the thing you have sometimes a deeper understanding of it so finding the 2% edge case is easier/faster but only time will tell
The problem with this spreadsheet task is that you don't know whether you got only 2% wrong (just rounded some numbers) or way more (e.g. did it get confused and mistook a 2023 PDF with one from 1993?), and checking things yourself is still quite tedious unless there's good support for this in the tool.
At least with humans you have things like reputation (has this person been reliable) or if you did things yourself, you have some good idea of how diligent you've been.
> It feels like either finding that 2% that's off (or dealing with 2% error) will be the time consuming part in a lot of cases.
The last '2%' (and in some benchmarks 20%) could cost as much as $100B+ more to make it perfect consistently without error.
This requirement does not apply to generating art. But for agentic tasks, errors at worst being 20% or at best being 2% for an agent may be unacceptable for mistakes.
As you said, if the agent makes an error in either of the steps in an agentic flow or task, the entire result would be incorrect and you would need to check over the entire work again to spot it.
Most will just throw it away and start over; wasting more tokens, money and time.
Distinguishing whether a problem is 0.02 ^ n for error or 0.98 ^ n for accuracy is emerging as an important skill.
Might explain why some people grind up a billion tokens trying to make code work only to have it get worse while others pick apart the bits of truth and quickly fill in their blind spots. The skillsets separating wheat from chaff are things like honest appreciation for corroboration, differentiating subjective from objective problems, and recognizing truth-preserving relationships. If you can find the 0.02 ^ n sub-problems, you can grind them down with AI and they will rapidly converge, leaving the 0.98 ^ n problems to focus human touch on.
I think this is my favorite part of the LLM hype train: the butterfly effect of dependence on an undependable stochastic system propagates errors up the chain until the whole system is worthless.
"I think it got 98% of the information correct..." how do you know how much is correct without doing the whole thing properly yourself?
The two options are:
- Do the whole thing yourself to validate
- Skim 40% of it, 'seems right to me', accept the slop and send it off to the next sucker to plug into his agent.
I think the funny part is that humans are not exempt from similar mistakes, but a human making those mistakes again and again would get fired. Meanwhile an agent that you accept to get only 98% of things right is meeting expectations.
This depends on the type of work being done. Sometimes the cost of verification is much lower than the cost of doing the work, sometimes it's about the same, and sometimes it's much more. Here's some recent discussion [0]
> Meanwhile an agent that you accept to get only 98% of things right is meeting expectations.
Well yeah, because the agent is so much cheaper and faster than a human that you can eat the cost of the mistakes and everything that comes with them and still come out way ahead. No, of course that doesn't work in aircraft manufacturing or medicine or coding or many other scenarios that get tossed around on HN, but it does work in a lot of others.
Because it's a budget. Verifying them is _much_ cheaper than finding all the entries in a giant PDF in the first place.
> the butterfly effect of dependence on an undependable stochastic system
We're using stochastic systems for a long time. We know just fine how to deal with them.
> Meanwhile an agent that you accept to get only 98% of things right is meeting expectations.
There are very few tasks humans complete at a 98% success rate either. If you think "build spreadsheet from PDF" comes anywhere close to that, you've never done that task. We're barely able to recognize objects in their default orientation at a 98% success rate. (And in many cases, deep networks outperform humans at object recognition)
The task of engineering has always been to manage error rates and risk, not to achieve perfection. "butterfly effect" is a cheap rhetorical distraction, not a criticism.
> I think the funny part is that humans are not exempt from similar mistakes, but a human making those mistakes again and again would get fired. Meanwhile an agent that you accept to get only 98% of things right is meeting expectations.
My rule is that if you submit code/whatever and it has problems you are responsible for them no matter how you "wrote" it. Put another way "The LLM made a mistake" is not a valid excuse nor is "That's what the LLM spit out" a valid response to "why did you write this code this way?".
LLMs are tools, tools used by humans. The human kicking off an agent, or rather submitting the final work, is still on the hook for what they submit.
"a human making those mistakes again and again would get fired"
You must be really desperate for anti-AI arguments if this is the one you're going with. Employees make mistakes all day every day and they don't get fired. Companies don't give a shit as long as the cost of the mistakes is less than the cost of hiring someone new.
I wonder if you can establish some kind of confidence interval by passing data through a model x number of times. I guess it mostly depends on subjective/objective correctness as well as correctness within a certain context that you may not know if the model knows about or not.
Either way sounds like more corporate drudgery.
People say this, but in my experience it’s not true.
1) The cognitive burden is much lower when the AI can correctly do 90% of the work. Yes, the remaining 10% still takes effort, but your mind has more space for it.
2) For experts who have a clear mental model of the task requirements, it’s generally less effort to fix an almost-correct solution than to invent the entire thing from scratch. The “starting cost” in mental energy to go from a blank page/empty spreadsheet to something useful is significant. (I limit this to experts because I do think you have to have a strong mental framework you can immediately slot the AI output into, in order to be able to quickly spot errors.)
3) Even when the LLM gets it totally wrong, I’ve actually had experiences where a clearly flawed output was still a useful starting point, especially when I’m tired or busy. It nerd-snipes my brain from “I need another cup of coffee before I can even begin thinking about this” to “no you idiot, that’s not how it should be done at all, do this instead…”
>The cognitive burden is much lower when the AI can correctly do 90% of the work. Yes, the remaining 10% still takes effort, but your mind has more space for it.
I think their point is that 10%, 1%, whatever %, the type of problem is a huge headache. In something like a complicated spreadsheet it can quickly become hours of looking for needles in the haystack, a search that wouldn't be necessary if AI didn't get it almost right. In fact it's almost better if it just gets some big chunk wholesale wrong - at least you can quickly identify the issue and do that part yourself, which you would have had to in the first place anyway.
Getting something almost right, no matter how close, can often be worse than not doing it at all. Undoing/correcting mistakes can be more costly as well as labor intensive. "Measure twice cut once" and all that.
I think of how in video production (edits specifically) I can get you often 90% of the way there in about half the time it takes to get it 100%. Those last bits can be exponentially more time consuming (such as an intense color grade or audio repair). The thing is with a spreadsheet like that, you can't accept a B+ or A-. If something is broken, the whole thing is broken. It needs to work more or less 100%. Closing that gap can be a huge process.
I'll stop now as I can tell I'm running a bit in circles lol
> The cognitive burden is much lower when the AI can correctly do 90% of the work.
It's a high cognitive burden if you don't know which 10% of the work the AI failed to do / did incorrectly, though.
I think you're picturing a percentage indicating what scope of the work the AI covered, but the parent was thinking about the accuracy of the work it did cover. But maybe what you're saying is if you pick the right 90% subset, you'll get vastly better than 98% accuracy on that scope of work? Maybe we just need to improve our intuition for where LLMs are reliable and where they're not so reliable.
Though as others have pointed out, these are just made-up numbers we're tossing around. Getting 99% accuracy on 90% of the work is very different from getting 75% accuracy on 50% of the work. The real values vary so much by problem domain and user's level of prompting skill, but it will be really interesting as studies start to emerge that might give us a better idea of the typical values in at least some domains.
A lot of people here also make the assumption that the human user would make no errors.
What error rate this same person would find if reviewing spreadsheets made by other people seems like an inherently critical benchmark before we can even discuss whether this is a problem or an achievement.
More work, without a doubt - any productivity gain immediately becomes the new normal. But now with an additional "2%" error rate compounded on all the tasks you're expected to do in parallel.
> "I think it got 98% of the information correct... I just needed to copy / paste a few things. If it can do 90 - 95% of the time consuming work, that will save you a ton of time"
"Hello, yes, I would like to pollute my entire data store" is an insane a sales pitch. Start backing up your data lakes on physical media, there is going to be an outrageous market for low-background data in the future.
semi-related: How many people are going to get killed because of this?
How often will 98% correct data actually be worse? How often will it be better?
98% might well be disastrous, but I've seen enough awful quality human-produced data that without some benchmarks I'm not confident we know whether this would be better or worse.
This reminds me of the story where Barclays had to buy bad assets from the Lehman bankruptcy because they only hid the rows of assets they did not want, but the receiver saw all the rows due to a mistake somewhere. The kind of 2% fault rate in Excel that could tank a big bank.
By that definition, the ChatGPT app is now an AI agent. When you use ChatGPT nowadays, you can select different models and complement these models with tools like web search and image creation. It’s no longer a simple text-in / text-out interface. It looks like it is still that, but deep down, it is something new: it is agentic…
https://medium.com/thoughts-on-machine-learning/building-ai-...
To be fair, this is also the case with humans: humans make errors as well, and you still need to verify the results.
I once was managing a team of data scientists and my boss kept getting frustrated about some incorrectnesses she discovered, and it was really difficult to explain that this is just human error and it would take lots of resources to ensure 100% correctness.
The same with code.
It’s a cost / benefits balance that needs to be found.
AI just adds another opportunity into this equation.
>It feels like either finding that 2% that's off (or dealing with 2% error) will be the time consuming part in a lot of cases.
People act like this is some new thing but this exactly what supervising a more junior coworker is like. These models won't stay performing at Jr. levels for long. That is clear
yes... and arguably the last 5% is harder now because you didn't spend the time yourself to get to that point so you're not really 'up to speed' on what has been produced so far
Yes. Any success I have had with LLMs has been by micromanaging them. Lots of very simple instructions, look at the results, correct them if necessary, then next step.
Yes - and that is especially true for high-stakes processes in organizations. For example, accounting, HR benefits, taxation needs to be exactly right.
Honestly, though, there are far more use cases where 98% correct is equivalent to perfect than situations that require absolute correctness, both in business and for personal use.
I am looking forward to learning why this is entirely unlike working with humans, who in my experience commit very silly and unpredictable errors all the time (in addition to predictable ones), but additionally are often proud and anxious and happy to deliberately obfuscate their errors.
The security risks with this sound scary. Let's say you give it access to your email and calendar. Now it knows all of your deepest secrets. The linked article acknowledges that prompt injection is a risk for the agent:
> Prompt injections are attempts by third parties to manipulate its behavior through malicious instructions that ChatGPT agent may encounter on the web while completing a task. For example, a malicious prompt hidden in a webpage, such as in invisible elements or metadata, could trick the agent into taking unintended actions, like sharing private data from a connector with the attacker, or taking a harmful action on a site the user has logged into.
A malicious website could trick the agent into divulging your deepest secrets!
I am curious about one thing -- the article mentions the agent will ask for permission before doing consequential actions:
> Explicit user confirmation: ChatGPT is trained to explicitly ask for your permission before taking actions with real-world consequences, like making a purchase.
How does the agent know a task is consequential? Could it mistakenly make a purchase without first asking for permission? I assume it's AI all the way down, so I assume mistakes like this are possible.
There is almost guaranteed going to be an attack along the lines of prompt-injecting a calendar invite. Those things are millions of lines long already, with tones of auto-generated text that nobody reads. Embed your injection in the middle of boring text describing the meeting prerequisites and it's as good as written in a transparent font. Then enjoy exfiltrating your victim's entire calendar and who knows what else.
In the system I'm building the main agent doesn't have access to tools and must call scoped down subagents who have one or two tools at most and always in the same category (so no mixed fetch and calendar tools). They must also return structured data to the main agent.
I think that kind of isolation is necessary even though it's a bit more costly. However since the subagents have simple tasks I can use super cheap models.
And the way Google calendar works right now, it automatically shows invites on your calendar, even if they are spam. That does not bode well for prompt injection.
Many of us have been partitioning our “computing” life into public and private segments, for example for social media, job search, or blogging. Maybe it’s time for another segment somewhere in the middle?
Something like lower risk private data, which could contain things like redacted calendar entries, de-identified, anonymized, or obfuscated email, or even low-risk thoughts, journals, and research.
I am Worried; I barely use ChatGPT for anything that could come back to hurt me later, like medical or psychological questions. I hear that lots of folks are finding utility here but I’m reticent.
>I barely use ChatGPT for anything that could come back to hurt me later, like medical or psychological questions
I use ollama with local LLMs for anything that could be considered sensitive, the generation is slower but results are generally quite reasonable. I've had decent success with gemma3 for general queries.
"Agentic misalignment makes it possible for models to act similarly to an insider threat, behaving like a previously-trusted coworker or employee who suddenly begins to operate at odds with a company’s objectives."
Create a burner account for email/calendar, that solves most of those problems. Nobody will care if the AI leaks that you have a dentist appointment on Tuesday.
But isn't the whole supposed value-add here that it gets access to your real data? If you don't want it to get at your calendar, you could presumably just not grant it access in the first place – no need for a fake one. But if you want it to automatically "book me a haircut with the same person as last time in an afternoon time slot when I'm free later this month" then it needs access to your real calendar and if attacked it can leak or wreck your real calendar too. It's hard to see how you can ever have one without the other.
I agree with the scariness etc. Just one possibly comforting point.
I assume (hope?) they use more traditional classifiers for determining importance (in addition to the model's judgment). Those are much more reliable than LLMs & they're much cheaper to run so I assume they run many of them
Almost anyone can add something to people's calendars as well (of course people don't accept random invites but they can appear).
If this kind of agent becomes wide spread hackers would be silly not to send out phishing email invites that simply contain the prompts they want to inject.
The asking for permission thing is irrelevant. People are using this tool to get the friction in their life to near zero, I bet my job that everyone will just turn on auto accept and go for a walk with their dog.
I'm not so optimistic as someone that works on agents for businesses and creating tools for it. The leap from low 90s to 99% is classic last mile problem for LLM agents. The more generic and spread an agent is (can-do-it-all) the more likely it will fail and disappoint.
Can't help but feel many are optimizing happy paths in their demos and hiding the true reality. Doesn't mean there isn't a place for agents but rather how we view them and their potential impact needs to be separated from those that benefit from hype.
In general most of the previous AI "breakthrough" in the last decade were backed by proper scientific research and ideas:
- AlphaGo/AlphaZero (MCTS)
- OpenAI Five (PPO)
- GPT 1/2/3 (Transformers)
- Dall-e 1/2, Stable Diffusion (CLIP, Diffusion)
- ChatGPT (RLHF)
- SORA (Diffusion Transformers)
"Agents" is a marketing term and isn't backed by anything. There is little data available, so it's hard to have generally capable agents in the sense that LLMs are generally capable
The technology for reasoning models is the ability to do RL on verifiable tasks, with the some (as-of-yet unpublished, but well-known) search over reasoning chains, with a (presumably neural) reasoning fragment proposal machine, and a (presumably neural) scoring machine for those reasoning fragments.
The technology for agents is effectively the same, with some currently-in-R&D way to scale the training architecture for longer-horizon tasks. ChatGPT agent or o3/o4-mini are likely the first published models that take advantage of this research.
It's fairly obvious that this is the direction that all the AI labs are going if you go to SF house parties or listen to AI insiders like Dwarkesh Patel.
'Agents' are just a design pattern for applications that leverage recent proper scientific breakthroughs. We now have models that are increasingly capable of reading arbitrary text and outputting valid json/xml. It seems like if we're careful about what text we feed them and what json/xml we ask for, we can get them to string together meaningful workflows and operations.
Obviously, this is working better in some problem spaces than others; seems to mainly depend on how in-distribution the data domain is to the LLM's training set. Choices about context selection and the API surface exposed in function calls seem to have a large effect on how well these models can do useful work as well.
My personal framing of "Agents" is that they're more like software robots than they are an atomic unit of technology. Composed of many individual breakthroughs, but ultimately a feat of design and engineering to make them useful for a particular task.
But that's how progress works! To me it makes sense that llms first manage to do 80% of the task, then 90, then 95, then 98, then 99, then 99.5, and so on. The last part IS the hardest, and each iteration of LLMs will get a bit further.
Just because it didn't reach 100% just yet doesn't mean that LLMs as a whole are doomed. In fact, the fact that they are slowly approaching 100% shows promise that there IS a future for LLMs, and that they still have the potential to change things fundamentally, more so than they did already.
But they don’t do 80% of the task. They do 100% of the task, but 20% is wrong (and you don’t know which 20% without manually verifying all of it).
So it is really great for tasks where do the work is a lot harder than verifying it, and mostly useless for tasks where doing the work and verifying it are similarly difficult.
I would go so far as to say that the reason people feel LLMs have stagnated is precisely because they feel like they're only progressing a few percentage points between iteration - despite the fact that these points are the hardest.
> Can't help but feel many are optimizing happy paths in their demos and hiding the true reality.
Even with the best intentions, this feels similar to when a developer hands off code directly to the customer without any review, or QA, etc. We all know that what a developer considers "done" often differs significantly from what the customer expects.
>The more generic and spread an agent is (can-do-it-all) the more likely it will fail and disappoint.
To your point - the most impressive AI tool (not an LLM but bear with me) I have used to date, and I loathe giving Adobe any credit, is Adobe's Audio Enhance tool. It has brought back audio that prior to it I would throw out or, if the client was lucky, would charge thousands of dollars and spend weeks working on to repair to get it half as good as that thing spits out in minutes. Not only is it good at salvaging terrible audio, it can make mediocre zoom audio sound almost like it was recorded in a proper studio. It is truly magic to me.
Warning: don't feed it music lol it tries to make the sounds into words. That being said, you can get some wild effects when you do it!
Not even well-optimized. The demos in the related sit-down chat livestream video showed an every-baseball-park-trip planner report that drew a map with seemingly random lines that missed the east coast entirely, leapt into the Gulf of Mexico, and was generally complete nonsense. This was a pre-recorded demo being live-streamed with Sam Altman in the room, and that’s what they chose to show.
I mostly agree with this. The goal with AI companies is not to reach 99% or 100% human-level, it's >100% (do tasks better than an average human could, or eventually an expert).
But since you can't really do that with wedding planning or whatnot, the 100% ceiling means the AI can only compete on speed and cost. And the cost will be... whatever Nvidia feels like charging per chip.
Seen this happen many times with current agent implementations. With RL (and provided you have enough use case data) you can get to a high accuracy on many of these shortcomings. Most problems arise from the fact that prompting is not the most reliable mechanism and is brittle. Teaching a model on specific tasks help negate those issues, and overall results in a better automation outcome without devs having to make so much effort to go from 90% to 99%. Another way to do it is parallel generation and then identifying at runtime which one seems most correct (majority voting or llm as a judge).
I agree with you on the hype part. Unfortunately, that is the reality of current silicon valley. Hype gets you noticed, and gets you users. Hype propels companies forward, so that is about to stay.
I've been using OpenAI operator for some time - but more and more websites are blocking it, such as LinkedIn and Amazon. That's two key use-cases gone (applying to jobs and online shopping).
Operator is pretty low-key, but once Agent starts getting popular, more sites will block it. They'll need to allow a proxy configuration or something like that.
THIS is the main problem. I was listening the whole time for them to announce a way to run it locally or at least proxy through your local devices. Alas the Deepseek R1 distillation experience they went through (a bit like when Steve Jobs was fuming at Google for getting Android to market so quickly) made them wary of showing to many intermediate results, tricks etc. Even in the very beginning Operator v1 was unable to access many sites that blocked data-center IPs and while I went through the effort of patching in a hacky proxy-setup to be able to actually test real world performance they later locked it down even further without improving performance at all. Even when its working, its basically useless and its not working now and only getting worse. Either they make some kinda deal with eastdakota(which he is probably too savvy to agree to)or they can basically forget about doing web browsing directly from their servers.Considering, that all non web applications of "computer use" greatly benefit from local files and software (which you already have the license for!)the whole concept appears to be on the road to failure. Having their remote computer use agent perform most stuff via CLI is actually really funny when you remember that computer use advocates used to claim the whole point was NOT to rely on "outdated" pre-gui interfaces.
If people will actually pay for stuff (food, clothing, flights, whatever) through this agent or operator, I see no reason Amazon etc would continue to block them.
I was buying plenty of stuff through Amazon before they blocked Operator. Now I sometimes buy through other sites that allow it.
The most useful for me was: "here's a picture of a thing I need a new one of, find the best deal and order it for me. Check coupon websites to make sure any relevant discounts are applied."
To be honest, if Amazon continues to block "Agent Mode" and Walmart or another competitor allows it, I will be canceling Prime and moving to that competitor.
The AI isn't going notice the latest and greatest hot new deals that are slathered on every page. It's just going to put the thing you asked for in the shopping-cart.
Possibly in part because bots will not fall for the same tricks as humans (recommended items, as well as other things which amazon does to try and get the most money possible)
In typical SV style, this is just to throw it out there and let second order effects build up. At some point I expect OpenAI to simply form a partnership with LinkedIn and Amazon.
In fact, I suspect LinkedIn might even create a new tier that you'd have to use if you want to use LinkedIn via OpenAI.
I do data work in domains that are closely related to LinkedIn (sales and recruitment), and let me tell you, the chances that LinkedIn lets any data get out of the platform are very slim.
They have some of the strongest anti-bot measures in the world and they even prosecute companies that develop browser extensions for manual extraction. They would prevent people from writing LinkedIn info with pen and paper, if they could. Their APIs are super-rudimentary and they haven't innovated in ages. Their CRM integrations for their paid products (ex: Sales Nav) barely allow you to save info into the CRM and instead opt for iframe style widgets inside your CRM so that data remains within their moat.
Unless you show me how their incentives radically change (ex: they can make tons of money while not sacrificing any other strategic advantage), I will continue to place a strong bet on them being super defensive about data exfiltration.
Agents respecting robots.txt is clearly going to end soon. Users will be installing browser extensions or full browsers that run the actions on their local computer with the user's own cookie jar, IP address, etc.
I hope agents.txt becomes standard and websites actually start to build agent-specific interfaces (or just have API docs in their agent.txt). In my mind it's different from "robots" which is meant to apply rules to broad web-scraping tools.
I wonder how many people will think they are being clever by using the Playwright MCP or browser extensions to bypass robots.txt on the sites blocking the direct use of ChatGPT Agent and will end up with their primary Google/LinkedIn/whatever accounts blocked for robotic activity.
Expecting AI agents to respect robots.txt is like expecting browser extensions like uBlock Origins to respect "please-dont-adblock.txt".
Of course it's going to be ignored, because it's an unreasonable request, it's hard to detect, and the user agent works for the user, not the webmaster.
Assuming the agent is not requesting pages at an overly fast speed, of course. In that case, feel free to 429.
Q: but what about botnets-
I'm replying in the context of "Users will be installing browser extensions or full browsers that run the actions on their local computer with the user's own cookie jar, IP address, etc."
We have a similar tool that can get around any of this, we built a custom desktop that runs on residential proxies. You can also train the agents to get better at computer tasks https://www.agenttutor.com/
Finding, comparing, and ordering products -- I'd ask it to find 5 options on Amazon and create a structured table comparing key features I care about along with price. Then ask it to order one of them.
This solves a big issue for existing CLI agents, which is session persistence for users working from their own machines.
With claude code, you usually start it from your own local terminal. Then you have access to all the code bases and other context you need and can provide that to the AI.
But when you shut your laptop, or have network availability changes the show stops.
I've solved this somewhat on MacOS using the app Amphetamine which allows the machine to go about its business with the laptop fully closed. But there are a variety of problems with this, including heat and wasted battery when put away for travel.
Another option is to just spin up a cloud instance and pull the same repos to there and run claude from there. Then connect via tmux and let loose.
But there are (perhaps easy to overcome) ux issues with getting context up to that you just don't have if it is running locally.
The sandboxing maybe offers some sense of security--again something that can be possibly be handled by executing claude with a specially permissioned user role--which someone with John's use case in the video might want.
---
I think its interesting to see OpenAI trying to crack the Agent UX, possibly for a user type (non developer) that would appreciate its capabilities just as much but not need the ability to install any python package on the fly.
When using spec writter and sub-tasking tools like TaskMaster, Kiro, etc. I've experienced Claude Code to take 30-60+ minutes for a more complex feature
> Mid 2025: Stumbling Agents
The world sees its first glimpse of AI agents.
Advertisements for computer-using agents emphasize the term “personal assistant”: you can prompt them with tasks like “order me a burrito on DoorDash” or “open my budget spreadsheet and sum this month’s expenses.” They will check in with you as needed: for example, to ask you to confirm purchases. Though more advanced than previous iterations like Operator, they struggle to get widespread usage.
Especially when the author personally knows the engineers working on the features, and routinely goes to parties with them. And when you consider that Altman said last year that “2025 will be the agentic year”
It was common knowledge that big corps were working on agent-type products when that report was written. Hardly much of a prediction, let alone any sort of technical revolution.
The big crux of AI 2027 is the claims about exponential technological improvement. "Agents" are mostly a new frontend to the same technology openai has been selling for a while. Let's see if we're on track at the start of 2026
They aren't predicting any new capabilities here: all things they mentioned already existed in various demos. They are basically saying that the next iteration of Operator is unlikely to be groundbreaking, which is rather obvious. I.e. "sudden breakthrough is unlikely" is just common sense.
Calling it "The world sees its first glimpse of AI agents" is just bad writing, in my opinion. People have been making some basic agents for years, e.g. Auto-GPT & Baby-AGI were published in 2023: https://www.reddit.com/r/singularity/comments/12by8mj/i_used...
Yeah, those had much higher error rate, but what's the principal difference here?
Seems rather weird "it's an agent when OpenAI calls it an agent" appeal to authority.
It's very hard for me to imagine the current level of agents serving a useful purpose in my personal life. If I ask this to plan a date night with my wife this weekend, it needs to consult my calendar to pick the best night, pick a bar and restaurant we like (how would it know?), book a babysitter (can it learn who we use and text them on my behalf?), etc. This is a lot of stuff it has to get right, and it requires a lot of trust!
I'm excited that this capability is getting close, but I think the current level of performance mostly makes for a good demo and isn't quite something I'm ready to adopt into daily life. Also, OpenAI faces a huge uphill battle with all the integrations required to make stuff like this useful. Apple and Microsoft are in much better spots to make a truly useful agent, if they can figure out the tech.
Maybe this is the "bitter lesson of agentic decisions": hard things in your life are hard because they involve deeply personal values and complex interpersonal dynamics, not because they are difficult in an operational sense. Calling a restaurant to make a reservation is trivial. Deciding what restaurant to take your wife to for your wedding anniversary is the hard part (Does ChatGPT know that your first date was at a burger-and-shake place? Does it know your wife got food poisoning the last time she ate sushi?). Even a highly paid human concierge couldn't do it for you. The Navier–Stokes smoothness problem will be solved before "plan a birthday party for my daughter."
Well, people do have personal assistants and concierges, so it can be done? but I think they need a lot of time and personal attention from you to get that useful right. they need to remember everything you've mentioned offhand or take little corrections consistently.
It seems to me like you have to reset the context window on LLMs way more often than would be practical for that
I would even argue the hard parts of being human don't even need to be automated. Why are we all in a rush to automate everything, including what makes us human?
> hard things in your life are hard because they involve deeply personal values and complex interpersonal dynamics, not because they are difficult in an operational sense
I think what's interesting here is that it's a super cheap version of what many busy people already do -- hire a person to help do this. Why? Because the interface is easier and often less disruptive to our life. Instead of hopping from website to website, I'm just responding to a targeted imessage question from my human assistant "I think you should go with this <sitter,restaurant>, that work?" The next time I need to plan a date night, my assistant already knows what I like.
Replying "yes, book it" is way easier than clicking through a ton of UIs on disparate websites.
My opinion is that agents looking to "one-shot" tasks is the wrong UX. It's the async, single simple interface that is way easier to integrate into your life that's attractive IMO.
Yes! I’ve been thinking along similar lines: agents and LLMs are exposing the worst parts of the ergonomics of our current interfaces and tools (eg programming languages, frameworks).
I reckon there’s a lot to be said for fixing or tweaking the underlying UX of things, as opposed to brute forcing things with an expensive LLM.
> It's very hard for me to imagine the current level of agents serving a useful purpose in my personal life. If I ask this to plan a date night with my wife this weekend, it needs to consult my calendar to pick the best night, pick a bar and restaurant we like (how would it know?), book a babysitter (can it learn who we use and text them on my behalf?), etc. This is a lot of stuff it has to get right, and it requires a lot of trust!
This would be my ideal "vision" for agents, for personal use, and why I'm so disappointed in Apple's AI flop because this is basically what they promised at last year's WWDC. I even tried out a Pixel 9 pro for a while with Gemini and Google was no further ahead on this level of integration either.
But like you said, trust is definitely going to be a barrier to this level of agent behavior. LLMs still get too much wrong, and are too confident in their wrong answers. They are so frequently wrong to the point where even if it could, I wouldn't want it to take all of those actions autonomously out of fear for what it might actually say when it messages people, who it might add to the calendar invites, etc.
Agents are nothing more than the core chat model with a system prompt, and wrapper that parses responses and executes actions and puts the result into the prompt, and a system instruction that lets the model know what it can do.
Nothing is really that advanced yet with agents themselves - no real reasoning going on.
That being said, you can build your own agents fairly straightforward. The key is designing the wrapper and the system instructions. For example, you can have a guided chat on where it builds of the functionality of looking at your calendar, google location history, babysitter booking, and integrate all of that into automatic actions.
I am not sure I see most of this as a problem. For an agent you would want to write some longer instructions than just "book me an aniversery dinner with my wife".
You would want to write a couple paragraphs outlining what you were hoping to get (maybe the waterfront view was the important thing? Maybe the specific place?)
As for booking a babysitter - if you don't already have a specific person in mind (I don't have kids), then that is likely a separate search. If you do, then their availability is a limiting factor, in just the same way your calendar was and no one, not you, not an agent, not a secretary, can confirm the restaurant unless/until you hear back from them.
As an inspiration for the query, here is one I used with Chat GPT earlier:
>I live in <redacted>. I need a place to get a good quality haircut close to where I live. Its important that the place has opening hours outside my 8:00 to 16:00 mon-fri job and good reviews.
>
>I am not sensitive to the price. Go online and find places near my home. Find recent reviews and list the places, their names, a summary of the reviews and their opening hours.
>
>Thank you
It has to earn that trust and that takes time. But there are a lot of personal use cases like yours that I can imagine.
For example, I suddenly need to reserve a dinner for 8 tomorrow night. That's a pain for me to do, but if I could give it some basic parameters, I'm good with an agent doing this. Let them make the maybe 10-15 calls or queries needed to find a restaurant that fits my constraints and get a reservation.
I see restaurant reservations as an example of an AI agent-appropriate task fairly often, but I feel like it's something that's neither difficult (two or three clicks on OpenTable and I see dozens of options I can book in one more click), nor especially compelling to outsource (if I'm booking something for a group, choosing the place is kind of personal and social—I'm taking everything I know about everybody in the group into account, and I'd likely spend more time downloading that nuance to the agent than I would just scrolling past a few places I know wouldn't work).
Similar to what was shown in the video when I make a large purchase like a home or car I usually obsess for a couple of years and make a huge spreadsheet to evaluate my decisions. Having an agent get all the spreadsheet data would be a big win. I had some success recently trying that with manus.
>it needs to consult my calendar to pick the best night, pick a bar and restaurant we like (how would it know?), book a babysitter (can it learn who we use and text them on my behalf?), etc
This (and not model quality) is why I’m betting on Google.
Whilst we have seen other implementations of this (providing a VPS to an LLM), this does have a distinct edge others in the way it presents itself. The UI shown, with the text overlay, readable mouse and tailored UI components looks very visually appealing and lends itself well to keeping users informed on what is happening and why at every stage. I have to tip my head to OpenAIs UI team here, this is a really great implementation and I always get rather fascinated whenever I see LLMs being implemented in a visually informative and distinctive manner that goes beyond established metaphors.
Comparing it to the Claude+XFCE solutions we have seen by some providers, I see little in the way of a functional edge OpenAI has at the moment, but the presentation is so well thought out that I can see this being more pleasant to use purely due to that. Many times with the mentioned implementations, I struggled with readability. Not afraid to admit that I may borrow some of their ideas for a personal project.
It feels like either finding that 2% that's off (or dealing with 2% error) will be the time consuming part in a lot of cases. I mean, this is nothing new with LLMs, but as these use cases encourage users to input more complex tasks, that are more integrated with our personal data (and at times money, as hinted at by all the "do task X and buy me Y" examples), "almost right" seems like it has the potential to cause a lot of headaches. Especially when the 2% error is subtle and buried in step 3 of 46 of some complex agentic flow.
This is where the AI hype bites people.
A great use of AI in this situation would be to automate the collection and checking of data. Search all of the data sources and aggregate links to them in an easy place. Use AI to search the data sources again and compare against the spreadsheet, flagging any numbers that appear to disagree.
Yet the AI hype train takes this all the way to the extreme conclusion of having AI do all the work for them. The quip about 98% correct should be a red flag for anyone familiar with spreadsheets, because it’s rarely simple to identify which 2% is actually correct or incorrect without reviewing everything.
This same problem extends to code. People who use AI as a force multiplier to do the thing for them and review each step as they go, while also disengaging and working manually when it’s more appropriate have much better results. The people who YOLO it with prompting cycles until the code passes tests and then submit a PR are causing problems almost as fast as they’re developing new features in non-trivial codebases.
“The fallacy in these versions of the same idea is perhaps the most pervasive of all fallacies in philosophy. So common is it that one questions whether it might not be called the philosophical fallacy. It consists in the supposition that whatever is found true under certain conditions may forthwith be asserted universally or without limits and conditions. Because a thirsty man gets satisfaction in drinking water, bliss consists in being drowned. Because the success of any particular struggle is measured by reaching a point of frictionless action, therefore there is such a thing as an all-inclusive end of effortless smooth activity endlessly maintained.
It is forgotten that success is success of a specific effort, and satisfaction the fulfillment of a specific demand, so that success and satisfaction become meaningless when severed from the wants and struggles whose consummations they arc, or when taken universally.”
This might as well be the new definition of “script kiddie”, and it’s the kids that are literally going to be the ones birthed into this lifestyle. The “craft” of programming may not be carried by these coming generations and possibly will need to be rediscovered at some point in the future. The Lost Art of Programming is a book that’s going to need to be written soon.
I disagree. Receiving a spreadsheet from a junior means I need to check it. If this gives me infinite additional juniors I’m good.
It’s this popular pattern of HN comments - expect AI to behave deterministically correct - while the whole world operates on stochastically correct all the time…
Why would you need ai for that though? Pull your sources. Run a diff. Straight to the known truth without the chatgpt subscription. In fact by that point you don’t even need the diff if you pulled from the sources. Just drop into the spreadsheet at that point.
— Tom Cargill, Bell Labs
https://en.wikipedia.org/wiki/Ninety%E2%80%93ninety_rule
Probably because it's just here now? More people take Waymo than Lyft each day in SF.
GenAI is the exciting new tech currently riding the initial hype spike. This will die down into the trough of disillusionment as well, probably sometime next year. Like self-driving, people will continue to innovate in the space and the tech will be developed towards general adoption.
We saw the same during crypto hype, though that could be construed as more of a snake oil type event.
Whenever someone tells me how these models are going to make white collar professions obsolete in five years, I remind them that the people making these predictions 1) said we'd have self driving cars "in a few years" back in 2015 and 2) the predictions about white collar professions started in 2022 so five years from when?
A few comparisons:
>Pressing the button: $1 >Knowing which button to press: $9,999 Those 2% copy-paste changes are the $9.999 and might take as long to find as rest of the work.
Also: SCE to AUX.
Regardless of if AI generates the spreadsheet or if I generate the spreadsheet, I'm still going to do the same validation steps before I share it with anyone. I might have a 2% error rate on a first draft.
So then you have to dig into all this overly verbose code to identify the 3-4 subtle flaws with how it transformed/joined the data. And these flaws take as much time to identify and correct as just writing the whole pipeline yourself.
I used to have a non-technical manager like this - he'd watch out for the words I (and other engineers) said and in what context, and would repeat them back mostly in accurate word contexts. He sounded remarkably like he knew what he was talking about, but would occasionally make a baffling mistake - like mixing up CDN and CSS.
LLMs are like this, I often see Cursor with Claude making the same kind of strange mistake, only to catch itself in the act, and fix the code (but what happens when it doesn't)
But normally you would want a more hands on back and forth to ensure the requirements actually capture everything, validation and etc that the results are good, layers of reviews right
Remember the title “attention is all you need”? Well you need to pay a lot of attention to CC during these small steps and have a solid mental model of what it is building.
And why 98%? Why not 99% right? Or 99.9% right? I know they can't outright say 100% because everyone knows that's a blatant lie, but we're okay with them bullshitting about the 98% number here?
Also there's no universe in which this guy gets to walk his dog while his little pet AI does his work for him, instead his boss is going to hound him into doing quadruple the work because he's now so "efficient" that he's finishing his spreadsheet in an hour instead of 8 or whatever. That, or he just gets fired and the underpaid (or maybe not even paid) intern shoots off the same prompt to the magic little AI and does the same shoddy work instead of him. The latter is definitely what the C-suite is aiming for with this tech anyway.
This is the part you have wrong. People just won't do that. They'll save the 8 hours and just deal with 2% error in their work (which reduces as AI models get better). This doesn't work with something with a low error tolerance, but most people aren't building the next Golden Gate Bridge. They'll just fix any problems as they crop up.
Some of you will be screaming right now "THAT'S NOT WORTH IT", as if companies don't already do this to consumers constantly, like losing your luggage at the airport or getting your order wrong. Or just selling you something defective, all of that happens >2% of the time, because companies know customers will just deal-with-it.
At least with humans you have things like reputation (has this person been reliable) or if you did things yourself, you have some good idea of how diligent you've been.
The usual estimate you see is that about 2-5% of spreadsheets used for running a business contain errors.
The last '2%' (and in some benchmarks 20%) could cost as much as $100B+ more to make it perfect consistently without error.
This requirement does not apply to generating art. But for agentic tasks, errors at worst being 20% or at best being 2% for an agent may be unacceptable for mistakes.
As you said, if the agent makes an error in either of the steps in an agentic flow or task, the entire result would be incorrect and you would need to check over the entire work again to spot it.
Most will just throw it away and start over; wasting more tokens, money and time.
And no, it is not "AGI" either.
Deleted Comment
Might explain why some people grind up a billion tokens trying to make code work only to have it get worse while others pick apart the bits of truth and quickly fill in their blind spots. The skillsets separating wheat from chaff are things like honest appreciation for corroboration, differentiating subjective from objective problems, and recognizing truth-preserving relationships. If you can find the 0.02 ^ n sub-problems, you can grind them down with AI and they will rapidly converge, leaving the 0.98 ^ n problems to focus human touch on.
"I think it got 98% of the information correct..." how do you know how much is correct without doing the whole thing properly yourself?
The two options are:
- Do the whole thing yourself to validate
- Skim 40% of it, 'seems right to me', accept the slop and send it off to the next sucker to plug into his agent.
I think the funny part is that humans are not exempt from similar mistakes, but a human making those mistakes again and again would get fired. Meanwhile an agent that you accept to get only 98% of things right is meeting expectations.
[0] https://www.jasonwei.net/blog/asymmetry-of-verification-and-...
Well yeah, because the agent is so much cheaper and faster than a human that you can eat the cost of the mistakes and everything that comes with them and still come out way ahead. No, of course that doesn't work in aircraft manufacturing or medicine or coding or many other scenarios that get tossed around on HN, but it does work in a lot of others.
Because it's a budget. Verifying them is _much_ cheaper than finding all the entries in a giant PDF in the first place.
> the butterfly effect of dependence on an undependable stochastic system
We're using stochastic systems for a long time. We know just fine how to deal with them.
> Meanwhile an agent that you accept to get only 98% of things right is meeting expectations.
There are very few tasks humans complete at a 98% success rate either. If you think "build spreadsheet from PDF" comes anywhere close to that, you've never done that task. We're barely able to recognize objects in their default orientation at a 98% success rate. (And in many cases, deep networks outperform humans at object recognition)
The task of engineering has always been to manage error rates and risk, not to achieve perfection. "butterfly effect" is a cheap rhetorical distraction, not a criticism.
My rule is that if you submit code/whatever and it has problems you are responsible for them no matter how you "wrote" it. Put another way "The LLM made a mistake" is not a valid excuse nor is "That's what the LLM spit out" a valid response to "why did you write this code this way?".
LLMs are tools, tools used by humans. The human kicking off an agent, or rather submitting the final work, is still on the hook for what they submit.
You must be really desperate for anti-AI arguments if this is the one you're going with. Employees make mistakes all day every day and they don't get fired. Companies don't give a shit as long as the cost of the mistakes is less than the cost of hiring someone new.
Deleted Comment
At a certain point, relentlessly checking for whether the model has got everything is more effort in turn than…doing it.
Moreover, is it actually a 4-8 hour job? Or is the person not using the right tool, is the better tool a sql query?
Half these “wow ai” examples feel like “oh my plates are dirty, better just buy more”.
1) The cognitive burden is much lower when the AI can correctly do 90% of the work. Yes, the remaining 10% still takes effort, but your mind has more space for it.
2) For experts who have a clear mental model of the task requirements, it’s generally less effort to fix an almost-correct solution than to invent the entire thing from scratch. The “starting cost” in mental energy to go from a blank page/empty spreadsheet to something useful is significant. (I limit this to experts because I do think you have to have a strong mental framework you can immediately slot the AI output into, in order to be able to quickly spot errors.)
3) Even when the LLM gets it totally wrong, I’ve actually had experiences where a clearly flawed output was still a useful starting point, especially when I’m tired or busy. It nerd-snipes my brain from “I need another cup of coffee before I can even begin thinking about this” to “no you idiot, that’s not how it should be done at all, do this instead…”
I think their point is that 10%, 1%, whatever %, the type of problem is a huge headache. In something like a complicated spreadsheet it can quickly become hours of looking for needles in the haystack, a search that wouldn't be necessary if AI didn't get it almost right. In fact it's almost better if it just gets some big chunk wholesale wrong - at least you can quickly identify the issue and do that part yourself, which you would have had to in the first place anyway.
Getting something almost right, no matter how close, can often be worse than not doing it at all. Undoing/correcting mistakes can be more costly as well as labor intensive. "Measure twice cut once" and all that.
I think of how in video production (edits specifically) I can get you often 90% of the way there in about half the time it takes to get it 100%. Those last bits can be exponentially more time consuming (such as an intense color grade or audio repair). The thing is with a spreadsheet like that, you can't accept a B+ or A-. If something is broken, the whole thing is broken. It needs to work more or less 100%. Closing that gap can be a huge process.
I'll stop now as I can tell I'm running a bit in circles lol
It's a high cognitive burden if you don't know which 10% of the work the AI failed to do / did incorrectly, though.
I think you're picturing a percentage indicating what scope of the work the AI covered, but the parent was thinking about the accuracy of the work it did cover. But maybe what you're saying is if you pick the right 90% subset, you'll get vastly better than 98% accuracy on that scope of work? Maybe we just need to improve our intuition for where LLMs are reliable and where they're not so reliable.
Though as others have pointed out, these are just made-up numbers we're tossing around. Getting 99% accuracy on 90% of the work is very different from getting 75% accuracy on 50% of the work. The real values vary so much by problem domain and user's level of prompting skill, but it will be really interesting as studies start to emerge that might give us a better idea of the typical values in at least some domains.
What error rate this same person would find if reviewing spreadsheets made by other people seems like an inherently critical benchmark before we can even discuss whether this is a problem or an achievement.
"Hello, yes, I would like to pollute my entire data store" is an insane a sales pitch. Start backing up your data lakes on physical media, there is going to be an outrageous market for low-background data in the future.
semi-related: How many people are going to get killed because of this?
98% might well be disastrous, but I've seen enough awful quality human-produced data that without some benchmarks I'm not confident we know whether this would be better or worse.
https://www.computerworld.com/article/1561181/excel-error-le...
Deleted Comment
I once was managing a team of data scientists and my boss kept getting frustrated about some incorrectnesses she discovered, and it was really difficult to explain that this is just human error and it would take lots of resources to ensure 100% correctness.
The same with code.
It’s a cost / benefits balance that needs to be found.
AI just adds another opportunity into this equation.
People act like this is some new thing but this exactly what supervising a more junior coworker is like. These models won't stay performing at Jr. levels for long. That is clear
Also, do you really understand what the numbers in that spreadsheet mean if you have not been participating in pulling them together?
It just make people quite faster at what they’re already doing.
Deleted Comment
> Prompt injections are attempts by third parties to manipulate its behavior through malicious instructions that ChatGPT agent may encounter on the web while completing a task. For example, a malicious prompt hidden in a webpage, such as in invisible elements or metadata, could trick the agent into taking unintended actions, like sharing private data from a connector with the attacker, or taking a harmful action on a site the user has logged into.
A malicious website could trick the agent into divulging your deepest secrets!
I am curious about one thing -- the article mentions the agent will ask for permission before doing consequential actions:
> Explicit user confirmation: ChatGPT is trained to explicitly ask for your permission before taking actions with real-world consequences, like making a purchase.
How does the agent know a task is consequential? Could it mistakenly make a purchase without first asking for permission? I assume it's AI all the way down, so I assume mistakes like this are possible.
I think that kind of isolation is necessary even though it's a bit more costly. However since the subagents have simple tasks I can use super cheap models.
Something like lower risk private data, which could contain things like redacted calendar entries, de-identified, anonymized, or obfuscated email, or even low-risk thoughts, journals, and research.
I am Worried; I barely use ChatGPT for anything that could come back to hurt me later, like medical or psychological questions. I hear that lots of folks are finding utility here but I’m reticent.
I use ollama with local LLMs for anything that could be considered sensitive, the generation is slower but results are generally quite reasonable. I've had decent success with gemma3 for general queries.
https://www.anthropic.com/research/agentic-misalignment
"Agentic misalignment makes it possible for models to act similarly to an insider threat, behaving like a previously-trusted coworker or employee who suddenly begins to operate at odds with a company’s objectives."
I assume (hope?) they use more traditional classifiers for determining importance (in addition to the model's judgment). Those are much more reliable than LLMs & they're much cheaper to run so I assume they run many of them
If this kind of agent becomes wide spread hackers would be silly not to send out phishing email invites that simply contain the prompts they want to inject.
Can't help but feel many are optimizing happy paths in their demos and hiding the true reality. Doesn't mean there isn't a place for agents but rather how we view them and their potential impact needs to be separated from those that benefit from hype.
just my two cents
- AlphaGo/AlphaZero (MCTS)
- OpenAI Five (PPO)
- GPT 1/2/3 (Transformers)
- Dall-e 1/2, Stable Diffusion (CLIP, Diffusion)
- ChatGPT (RLHF)
- SORA (Diffusion Transformers)
"Agents" is a marketing term and isn't backed by anything. There is little data available, so it's hard to have generally capable agents in the sense that LLMs are generally capable
The technology for reasoning models is the ability to do RL on verifiable tasks, with the some (as-of-yet unpublished, but well-known) search over reasoning chains, with a (presumably neural) reasoning fragment proposal machine, and a (presumably neural) scoring machine for those reasoning fragments.
The technology for agents is effectively the same, with some currently-in-R&D way to scale the training architecture for longer-horizon tasks. ChatGPT agent or o3/o4-mini are likely the first published models that take advantage of this research.
It's fairly obvious that this is the direction that all the AI labs are going if you go to SF house parties or listen to AI insiders like Dwarkesh Patel.
Obviously, this is working better in some problem spaces than others; seems to mainly depend on how in-distribution the data domain is to the LLM's training set. Choices about context selection and the API surface exposed in function calls seem to have a large effect on how well these models can do useful work as well.
MDP, Q learning, TD, RL, PPO are basically all about agent.
What we have today is still very much the same field as it was.
Just because it didn't reach 100% just yet doesn't mean that LLMs as a whole are doomed. In fact, the fact that they are slowly approaching 100% shows promise that there IS a future for LLMs, and that they still have the potential to change things fundamentally, more so than they did already.
So it is really great for tasks where do the work is a lot harder than verifying it, and mostly useless for tasks where doing the work and verifying it are similarly difficult.
Even with the best intentions, this feels similar to when a developer hands off code directly to the customer without any review, or QA, etc. We all know that what a developer considers "done" often differs significantly from what the customer expects.
Yep. This is literally what every AI company does nowadays.
To your point - the most impressive AI tool (not an LLM but bear with me) I have used to date, and I loathe giving Adobe any credit, is Adobe's Audio Enhance tool. It has brought back audio that prior to it I would throw out or, if the client was lucky, would charge thousands of dollars and spend weeks working on to repair to get it half as good as that thing spits out in minutes. Not only is it good at salvaging terrible audio, it can make mediocre zoom audio sound almost like it was recorded in a proper studio. It is truly magic to me.
Warning: don't feed it music lol it tries to make the sounds into words. That being said, you can get some wild effects when you do it!
Deleted Comment
But since you can't really do that with wedding planning or whatnot, the 100% ceiling means the AI can only compete on speed and cost. And the cost will be... whatever Nvidia feels like charging per chip.
I agree with you on the hype part. Unfortunately, that is the reality of current silicon valley. Hype gets you noticed, and gets you users. Hype propels companies forward, so that is about to stay.
Operator is pretty low-key, but once Agent starts getting popular, more sites will block it. They'll need to allow a proxy configuration or something like that.
It'll let the AI platforms get around any other platform blocks by hijacking the consumer's browser.
And it makes total sense, but hopefully everyone else has done the game theory at least a step or two beyond that.
The most useful for me was: "here's a picture of a thing I need a new one of, find the best deal and order it for me. Check coupon websites to make sure any relevant discounts are applied."
To be honest, if Amazon continues to block "Agent Mode" and Walmart or another competitor allows it, I will be canceling Prime and moving to that competitor.
In fact, I suspect LinkedIn might even create a new tier that you'd have to use if you want to use LinkedIn via OpenAI.
They have some of the strongest anti-bot measures in the world and they even prosecute companies that develop browser extensions for manual extraction. They would prevent people from writing LinkedIn info with pen and paper, if they could. Their APIs are super-rudimentary and they haven't innovated in ages. Their CRM integrations for their paid products (ex: Sales Nav) barely allow you to save info into the CRM and instead opt for iframe style widgets inside your CRM so that data remains within their moat.
Unless you show me how their incentives radically change (ex: they can make tons of money while not sacrificing any other strategic advantage), I will continue to place a strong bet on them being super defensive about data exfiltration.
Expecting AI agents to respect robots.txt is like expecting browser extensions like uBlock Origins to respect "please-dont-adblock.txt".
Of course it's going to be ignored, because it's an unreasonable request, it's hard to detect, and the user agent works for the user, not the webmaster.
Assuming the agent is not requesting pages at an overly fast speed, of course. In that case, feel free to 429.
Q: but what about botnets-
I'm replying in the context of "Users will be installing browser extensions or full browsers that run the actions on their local computer with the user's own cookie jar, IP address, etc."
You could host a VNC webview to another desktop with a good IP
Deleted Comment
With claude code, you usually start it from your own local terminal. Then you have access to all the code bases and other context you need and can provide that to the AI.
But when you shut your laptop, or have network availability changes the show stops.
I've solved this somewhat on MacOS using the app Amphetamine which allows the machine to go about its business with the laptop fully closed. But there are a variety of problems with this, including heat and wasted battery when put away for travel.
Another option is to just spin up a cloud instance and pull the same repos to there and run claude from there. Then connect via tmux and let loose.
But there are (perhaps easy to overcome) ux issues with getting context up to that you just don't have if it is running locally.
The sandboxing maybe offers some sense of security--again something that can be possibly be handled by executing claude with a specially permissioned user role--which someone with John's use case in the video might want.
---
I think its interesting to see OpenAI trying to crack the Agent UX, possibly for a user type (non developer) that would appreciate its capabilities just as much but not need the ability to install any python package on the fly.
The latency used to really bother me, but if Claude does 99% of the typing. Its a good idea.
> Mid 2025: Stumbling Agents The world sees its first glimpse of AI agents.
Advertisements for computer-using agents emphasize the term “personal assistant”: you can prompt them with tasks like “order me a burrito on DoorDash” or “open my budget spreadsheet and sum this month’s expenses.” They will check in with you as needed: for example, to ask you to confirm purchases. Though more advanced than previous iterations like Operator, they struggle to get widespread usage.
Calling it "The world sees its first glimpse of AI agents" is just bad writing, in my opinion. People have been making some basic agents for years, e.g. Auto-GPT & Baby-AGI were published in 2023: https://www.reddit.com/r/singularity/comments/12by8mj/i_used...
Yeah, those had much higher error rate, but what's the principal difference here?
Seems rather weird "it's an agent when OpenAI calls it an agent" appeal to authority.
Deleted Comment
I'm excited that this capability is getting close, but I think the current level of performance mostly makes for a good demo and isn't quite something I'm ready to adopt into daily life. Also, OpenAI faces a huge uphill battle with all the integrations required to make stuff like this useful. Apple and Microsoft are in much better spots to make a truly useful agent, if they can figure out the tech.
It seems to me like you have to reset the context window on LLMs way more often than would be practical for that
Beautiful
Replying "yes, book it" is way easier than clicking through a ton of UIs on disparate websites.
My opinion is that agents looking to "one-shot" tasks is the wrong UX. It's the async, single simple interface that is way easier to integrate into your life that's attractive IMO.
I reckon there’s a lot to be said for fixing or tweaking the underlying UX of things, as opposed to brute forcing things with an expensive LLM.
This would be my ideal "vision" for agents, for personal use, and why I'm so disappointed in Apple's AI flop because this is basically what they promised at last year's WWDC. I even tried out a Pixel 9 pro for a while with Gemini and Google was no further ahead on this level of integration either.
But like you said, trust is definitely going to be a barrier to this level of agent behavior. LLMs still get too much wrong, and are too confident in their wrong answers. They are so frequently wrong to the point where even if it could, I wouldn't want it to take all of those actions autonomously out of fear for what it might actually say when it messages people, who it might add to the calendar invites, etc.
Nothing is really that advanced yet with agents themselves - no real reasoning going on.
That being said, you can build your own agents fairly straightforward. The key is designing the wrapper and the system instructions. For example, you can have a guided chat on where it builds of the functionality of looking at your calendar, google location history, babysitter booking, and integrate all of that into automatic actions.
You would want to write a couple paragraphs outlining what you were hoping to get (maybe the waterfront view was the important thing? Maybe the specific place?)
As for booking a babysitter - if you don't already have a specific person in mind (I don't have kids), then that is likely a separate search. If you do, then their availability is a limiting factor, in just the same way your calendar was and no one, not you, not an agent, not a secretary, can confirm the restaurant unless/until you hear back from them.
As an inspiration for the query, here is one I used with Chat GPT earlier:
>I live in <redacted>. I need a place to get a good quality haircut close to where I live. Its important that the place has opening hours outside my 8:00 to 16:00 mon-fri job and good reviews. > >I am not sensitive to the price. Go online and find places near my home. Find recent reviews and list the places, their names, a summary of the reviews and their opening hours. > >Thank you
One of my favorite use cases for these tools is travel where I can get recommendations for what to do and see without SEO content.
This workflow is nice because you can ask specific questions about a destination (e.g., historical significance, benchmark against other places).
ChatGPT struggles with: - my current location - the current time - the weather - booking attractions and excursions (payments, scheduling, etc.)
There is probably friction here but I think it would be really cool for an agent to serve as a personalized (or group) travel agent.
For example, I suddenly need to reserve a dinner for 8 tomorrow night. That's a pain for me to do, but if I could give it some basic parameters, I'm good with an agent doing this. Let them make the maybe 10-15 calls or queries needed to find a restaurant that fits my constraints and get a reservation.
The act of choosing a date spot is part of your human connection with the person, don’t automate it away!
Focus the automation on other things :)
This (and not model quality) is why I’m betting on Google.
Comparing it to the Claude+XFCE solutions we have seen by some providers, I see little in the way of a functional edge OpenAI has at the moment, but the presentation is so well thought out that I can see this being more pleasant to use purely due to that. Many times with the mentioned implementations, I struggled with readability. Not afraid to admit that I may borrow some of their ideas for a personal project.