The term AGI so obviously means something way smarter than what we have. We do have something impressive but it’s very limited.
Playing loud music, your neighbours can hear it => you’re the problem
Smoking and having the smoke pollute your neighbours' air => you're the problem
Plenty of times the fault lies with the apartment itself: if the reasonable noise of me living disrupts my neighbors, that's bad design. Different people work different shifts; I don't see why the morning person should have to hold off on a morning shower just because the plumbing wakes up their neighbor, nor why the night-shift worker should have to hold off on doing laundry just because that wakes the morning person up.
We can't blindly trust Waymo's PR releases or apples-to-oranges comparisons. That's why the bar is higher.
People somehow imagine an agent that can crush the competition with minimal human oversight. And then they somehow think that they'll be in charge, and not Sam Altman, a government, or possibly the model itself.
If the model's that good, nobody's going to sell it to you.
It is so named because we have literal Dark Factories in the real world, run by robotics instead of AI, producing cellphones without any need for humans.
Nonetheless, said literal Dark Factory, which actually exists in the real world, is still owned by the corporation that built it. The robots did not take over, and the government did not seize it.
All the great work you see on the internet that AI has supposedly done was achieved only by a human doing lots of trial and error and curating everything the agentic LLM did. And it's all cherry-picked successes.
The article explicitly states an 83% success rate. That's apparently good enough for them! Systems don't need to be perfect to be useful.
I test all of the code I produce via LLMs, usually in fairly tight cycles. I also review the unit test coverage manually, so that I have a decent sense that it really is testing things - the goal is less perfect unit tests and more just quickly catching regressions. If I have a lot of complex workflows that need testing, I'll have it write unit tests and spell out the specific edge cases I'm worried about, or set up cheat codes I can invoke to test those workflows out in the UI/CLI.
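As a rough sketch of what "spelling out the specific edge cases" looks like in practice - the `parse_port` function and its edge cases here are hypothetical, invented for illustration, not from any real project:

```python
import unittest


def parse_port(value: str) -> int:
    """Stand-in for a small piece of LLM-written code under test."""
    port = int(value.strip())
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return port


class TestParsePort(unittest.TestCase):
    # These are the cases I'd list out for the model explicitly,
    # so the generated tests cover more than the happy path.

    def test_plain_value(self):
        self.assertEqual(parse_port("80"), 80)

    def test_surrounding_whitespace(self):
        self.assertEqual(parse_port(" 443 "), 443)

    def test_boundaries(self):
        self.assertEqual(parse_port("1"), 1)
        self.assertEqual(parse_port("65535"), 65535)

    def test_rejects_out_of_range(self):
        for raw in ("0", "65536", "-1"):
            with self.assertRaises(ValueError):
                parse_port(raw)

    def test_rejects_garbage(self):
        # int() itself raises ValueError for these inputs
        for raw in ("http", ""):
            with self.assertRaises(ValueError):
                parse_port(raw)
```

Run with `python -m unittest` in the tight cycle described above; a failing case regressing is the signal I actually care about, not the coverage number.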
Trust comes from using them often - you get a feeling for what a model is good and bad at, and what LLMs in general are good and bad at. Most of them are a bit of a mess when it comes to UI design, for instance, but they can throw together a perfectly serviceable "About This" HTML page. Any long-form text they write (such as that About page) is probably trash, but that's super-easy to edit manually. You can often just edit down what they write: they're actually decent writers, just very verbose and unfocused.
I find it similar to management: you have to learn how each employee works. Unless you're in the Top 1%, you can't rely on every employee giving 110% and always producing perfect PRs. Bugs happen, and even NASA-strictness doesn't bring that down to zero.
And just like management, some models are going to be the wrong employee for you because they think your style guide is stupid and keep writing code how they think it should be written.
Is there some missing and frequently used 4th option, here? Or some other route that you'd expect? Presumably it needs to get packages via some method.
You just described a GitHub feature
The person you're replying to mentions a fairly large number of actions, here: "cloned the codebase, found the issue, wrote the fix, added tests. I asked it to code review its own fix. The AI debugged itself, then reviewed its own work, and then helped me submit the PR."
If GitHub really does have a feature I can turn on that just automatically fixes my code, I'd love to know about it.
Again, I'm kind of on a 'suck it, dear company' attitude. The reason they ban you must align with the terms of service and must be backed up with data that is kept for X amount of time.
Simply put, we've seen no shortage of individuals here on HN or on other sites like Twitter who need to use social media to resolve whatever occurred because said company randomly banned an account under false pretenses.
This really matters when we are talking about giants like Google, or any other service in a near-monopoly position.
(/sarcasm)