gchamonlive · 2 months ago
I think it's interesting to juxtapose traditional coding, neural network weights and prompts, because in many areas -- like the example of the self-driving module having its code replaced by neural networks tuned to a target dataset representing the domain -- this will be quite useful.

However, I think it's important to make it clear that, given the hardware constraints of many environments, the applicability of what's being called software 2.0 and 3.0 will be severely limited.

So instead of being replacements, these paradigms are more like extra tools in the tool belt. Code and prompts will live side by side, being used when convenient, but none a panacea.

karpathy · 2 months ago
I kind of say it in words (agreeing with you), but I agree the versioning is a bit of a confusing analogy, because versioning usually implies some kind of improvement, when I'm just trying to distinguish them as very different software categories.
miki123211 · 2 months ago
What do you think about structured outputs / JSON mode / constrained decoding / whatever you wish to call it?

To me, it's a criminally underused tool. While "raw" LLMs are cool, they're annoying to use as anything but chatbots, as their output is unpredictable and basically impossible to parse programmatically.

Structured outputs solve that problem neatly. In a way, they're "neural networks without the training". They can be used to solve similar problems as traditional neural networks, things like image classification or extracting information from messy text, but all they require is a Zod or Pydantic type definition and a prompt. No renting GPUs, labeling data, or tuning hyperparameters necessary.

They often also improve LLM performance significantly. Imagine you're trying to extract calories per 100g of product, but some products give you calories per serving and a serving size, calories per pound, etc. The naive way to do this is a prompt like "give me calories per 100g", but that forces the LLM to do arithmetic, and LLMs are bad at arithmetic. With structured outputs, you just give it the fifteen different formats that you expect to see as alternatives, and use some simple Python to turn them all into calories per 100g on the backend side.
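The normalization step described here can be sketched in plain Python. The variant names and fields below are hypothetical, standing in for whatever schema you would hand the model as its structured-output contract:

```python
from dataclasses import dataclass

# Hypothetical alternatives the LLM is constrained to emit, one per label format.
@dataclass
class Per100g:
    kcal_per_100g: float

@dataclass
class PerServing:
    kcal_per_serving: float
    serving_size_g: float

@dataclass
class PerPound:
    kcal_per_pound: float

GRAMS_PER_POUND = 453.592

def to_kcal_per_100g(x) -> float:
    """Normalize any of the structured-output variants to kcal per 100 g."""
    if isinstance(x, Per100g):
        return x.kcal_per_100g
    if isinstance(x, PerServing):
        return x.kcal_per_serving / x.serving_size_g * 100
    if isinstance(x, PerPound):
        return x.kcal_per_pound / GRAMS_PER_POUND * 100
    raise TypeError(f"unexpected variant: {x!r}")
```

The LLM only has to pick the variant that matches the label in front of it; the arithmetic it is bad at stays in deterministic code.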

BobbyJo · 2 months ago
The versioning makes sense to me. Software has a cycle where a new tool is created to solve a problem, and the problem winds up being meaty enough, and the tool effective enough, that the exploration of the problem space the tool unlocks is essentially a new category/skill/whatever.

computers -> assembly -> HLL -> web -> cloud -> AI

Nothing on that list has disappeared, but the work has changed enough to warrant a few major versions imo.

gchamonlive · 2 months ago
> versioning is a bit confusing analogy because it usually additionally implies some kind of improvement

Exactly what I felt. Semver-like naming analogies bring their own set of implicit meanings, like major versions necessarily superseding or replacing the previous version; that is, they don't account for coexistence beyond planning migration paths. This expectation however doesn't correspond with the rest of the talk, so I thought I might point it out. Thanks for taking the time to reply!

poorcedural · 2 months ago
Andrej, maybe Software 3.0 is not written in spoken language like code or prompts. Software 3.0 is recorded in behavior, a behavior that today's software lacks. That behavior is written and consumed by machine and annotated by human interaction. Skipping to 3.0 is premature, but Software 2.0 is a ramp.
swyx · 2 months ago
no no, it actually is a good analogy in 2 ways:

1) it is a breaking change from the prior version

2) it is an improvement in that, in its ideal/ultimate form, it is a full superset of capabilities of the previous version

gyomu · 2 months ago
It's not just the hardware constraints - it's also the training constraints, and the legibility constraints.

Training constraints: you need lots and lots of data to build complex neural network systems. There are plenty of situations where the data just isn't available to you (whether for legal reasons, technical reasons, or just because it doesn't exist).

Legibility constraints: it is extremely hard to precisely debug and fix those systems. Let's say you build a software system to fill out tax forms - one the "traditional" way, and one that's a neural network. Now your system exhibits a bug where line 58(b) sometimes gets improperly filled out for software engineers who are married, have children, and also declared a source of overseas income. In a traditionally implemented system, you can step through the code and pinpoint why those specific conditions lead to a bug. In a neural network system, not so much.

So totally agreed with you that those are extra tools in the toolbelt - but their applicability is much, much more constrained than that of traditional code.

In short, they excel at situations where we are trying to model an extremely complex system - one that is impossible to nail down as a list of formal requirements - and where we have lots of data available. Signal processing (like self-driving, OCR, etc.) and human-language problems are great examples: traditional programming approaches failed to yield the kind of results we wanted (i.e., beyond-human performance) in 70+ years of research, and the modern neural network approach finally got us there.

But if you can define the problem you're trying to solve as formal requirements, then those tools are probably ill-suited.

radicalbyte · 2 months ago
Weights are code being replaced by data; something I've been making heavy use of since the early 00s. After coding for 10 years you start to see the benefits of it and understand where you should use it.

LLMs give us another tool only this time it's far more accessible and powerful.

dcsan · 2 months ago
LLMs have already replaced some code directly for me, e.g. NLP stuff. Previously I might have written a bunch of code to do clustering; now I just ask the LLM to group things. Obviously this is a very basic feature native to LLMs, but there will be more first-class LLM-callable functions over time.
OJFord · 2 months ago
I'm not sure about the 1.0/2.0/3.0 classification, but it did lead me to think about LLMs as a programming paradigm: we've had imperative & declarative, procedural & functional languages, maybe we'll come to view deterministic vs. probabilistic (LLMs) similarly.

    def __main__:
        You are a calculator. Given an input expression, you compute the result and print it to stdout, exiting 0.
        Should you be unable to do this, you print an explanation to stderr and exit 1.
(and then, perhaps, a bunch of 'DO NOT express amusement when the result is 5318008', etc.)

llflw · 2 months ago
Why bother using human language to communicate with a computer? You interact with a computer using a programming language—code—which is more precise and effective. Specifically:

→ In 1.0, you communicate with computers using compiled code.

→ In 2.0, you communicate with compilers using high-level programming languages.

→ In 3.0, you interact with LLMs using prompts, which arguably should not be in natural human language.

Nonetheless, you should communicate with AGIs using human language, just as you would with other human beings.
standeven · 2 months ago
Why bother using higher-level programming languages to communicate with a computer? You interact with a computer using assembly - raw bit shifting and memory addresses - which is more precise and effective.
softfalcon · 2 months ago
If this is what it comes to, it would explain the many, many software malfunctions in Star Trek. If everything is an LLM/LRM (or whatever super advanced version they have in the 23rd century) then everything can evolve into weird emergent behaviours.

stares at every weird holo-deck episode


semiquaver · 2 months ago
LLMs are not inherently nondeterministic. Batching, temperature, and other things make them appear so when run by big providers, but a locally-run LLM at zero temperature will always produce the same output given the same input.
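The zero-temperature claim is easy to see in miniature: greedy decoding is a pure argmax over the model's logits, so the same input always yields the same token. A toy sketch (not a real model; `pick_token` and the logits are made up for illustration):

```python
import math
import random

def pick_token(logits: dict, temperature: float = 0.0, rng=None) -> str:
    """Greedy decoding when temperature == 0: a pure function of the logits."""
    if temperature == 0.0:
        return max(logits, key=logits.get)
    # temperature > 0 rescales logits and samples -- this is where
    # run-to-run variation enters (rng is a random.Random instance).
    rng = rng or random.Random()
    weights = [math.exp(v / temperature) for v in logits.values()]
    return rng.choices(list(logits), weights=weights)[0]

logits = {"yes": 2.1, "no": 1.9, "maybe": 0.3}
assert pick_token(logits) == pick_token(logits) == "yes"  # deterministic at T=0
```

In practice (as replies below note) GPU kernels and batching can still introduce nondeterminism beneath this level, but the sampling step itself is only random if you ask it to be.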
oytis · 2 months ago
That's an improvement; they are still "chaotic" though, in that small changes in input can produce unpredictably large changes in the output.
lmeyerov · 2 months ago
That assumes they were implemented with deterministic operators, which isn't the default assumption when using neural network libraries on GPUs. Think random seeds, cuBLAS optimizations - you can configure all these things, but I wouldn't assume it, especially in GPU-optimized OSS.
ai-christianson · 2 months ago
Why does this remind me of COBOL.
wiz21c · 2 months ago
'cos COBOL was designed to be human readable (writable ?).
dheera · 2 months ago

    def __main__:
        You run main(). If there are issues, you edit __file__ to try to fix the errors and re-run it. You are determined, persistent, and never give up.

beambot · 2 months ago
Output "1" if the program halts; "0" if it doesn't.
OJFord · 2 months ago
You know, the more I think about it, the more I like this model.

What we have today with ChatGPT and the like (and even IDE integrations and API use) is imperative right, it's like 'answer this question' or 'do this thing for me', it's a function invocation. Whereas the silly calculator program I presented above is (unintentionally) kind of a declarative probabilistic program - it's 'this is the behaviour I want, make it so' or 'I have these constraints and these unknowns, fill in the gaps'.

What if we had something like Prolog, but with the possibility of facts being kind of on-demand at runtime, powered by the LLM driving it?
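A toy version of that idea: a fact store that, on a miss, defers to a callback (here a stub standing in for the LLM) and memoizes the answer as a new fact. Everything below is hypothetical sketch code, not a real Prolog engine:

```python
class FactBase:
    def __init__(self, oracle):
        self.facts = {}       # known (predicate, subject) -> value
        self.oracle = oracle  # fallback: would be an LLM call in practice

    def assertz(self, predicate, subject, value):
        """Add a fact explicitly, Prolog-style."""
        self.facts[(predicate, subject)] = value

    def query(self, predicate, subject):
        key = (predicate, subject)
        if key not in self.facts:
            # Unknown fact: ask the "LLM" on demand, then cache the answer.
            self.facts[key] = self.oracle(predicate, subject)
        return self.facts[key]

# Stub oracle; a real one would prompt a model and parse its reply.
def fake_llm(predicate, subject):
    return {"capital_of": {"France": "Paris"}}[predicate][subject]

kb = FactBase(fake_llm)
kb.assertz("capital_of", "Japan", "Tokyo")
assert kb.query("capital_of", "Japan") == "Tokyo"   # stored fact
assert kb.query("capital_of", "France") == "Paris"  # resolved on demand
```

The interesting (and hard) part is everything this sketch skips: unification, backtracking, and deciding when an LLM-supplied "fact" is trustworthy enough to cache.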

crsn · 2 months ago
This (sort of) is already a paradigm: https://en.m.wikipedia.org/wiki/Probabilistic_programming
stabbles · 2 months ago
That's entirely orthogonal.

In probabilistic programming you (deterministically) define variables and formulas. It's just that the variables aren't instances of floats, but represent stochastic variables over floats.

This is similar to libraries for linear algebra where writing A * B * C does not immediately evaluate, but rather builds an expression tree that represents the computation; you need to say `eval(A * B * C)` to obtain the actual value, which gives the library room to compute it in the most efficient way.

It's more related to symbolic programming and lazy evaluation than (non-)determinism.
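The expression-tree pattern is a few lines of operator overloading in Python (a minimal sketch, with scalars standing in for matrices):

```python
class Expr:
    def __mul__(self, other):
        return Mul(self, other)  # build a node instead of computing

class Const(Expr):
    def __init__(self, value):
        self.value = value
    def eval(self):
        return self.value

class Mul(Expr):
    def __init__(self, left, right):
        self.left, self.right = left, right
    def eval(self):
        # A real linear-algebra library would inspect the whole tree here
        # and pick an efficient evaluation order before computing anything.
        return self.left.eval() * self.right.eval()

A, B, C = Const(2), Const(3), Const(4)
tree = A * B * C          # builds Mul(Mul(A, B), C); nothing computed yet
assert isinstance(tree, Mul)
assert tree.eval() == 24  # evaluation happens only on demand
```

This is exactly the lazy-evaluation point above: the program deterministically constructs a description of the computation, and evaluation is a separate step.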

no_wizard · 2 months ago
I wonder when companies will remove the personality from LLMs by default, especially for tools.
dingnuts · 2 months ago
that would require actually curating the training data and eliminating sources that contain casual conversation

too expensive since those are all licensed sources, much easier to train on Reddit data

iLoveOncall · 2 months ago
> maybe we'll come to view deterministic vs. probabilistic (LLMs) similarly

I can't believe someone would seriously write this and not realize how nonsensical it is.

"indeterministic programming", you seriously cannot come up with a bigger oxymoron.

diggan · 2 months ago
Why do people keep having this reaction to something we're already used to? When you're developing against an API, you're already doing the same thing, planning for what happens when the request hangs, or fails completely, or gives a different response, and so on. Same for basically any IO.

It's almost not even new, just that it generates text instead of JSON, or whatever. But we've already been doing "indeterministic programming" for a long time, where you cannot always assume a function 100% returns what it should all the time.


practal · 2 months ago
Great talk, thanks for putting it online so quickly. I liked the idea of making the generation / verification loop go brrr, and one way to do this is to make verification not just a human task, but a machine task, where possible.

Yes, I am talking about formal verification, of course!

That also goes nicely together with "keeping the AI on a tight leash". It seems to clash though with "English is the new programming language". So the question is, can you hide the formal stuff under the hood, just like you can hide a calculator tool for arithmetic? Use informal English on the surface, while some of it is interpreted as a formal expression, put to work, and then reflected back in English? I think that is possible, if you have a formal language and logic that is flexible enough, and close enough to informal English.

Yes, I am talking about abstraction logic [1], of course :-)

So the goal would be to have English (German, ...) as the ONLY programming language, invisibly backed underneath by abstraction logic.

[1] http://abstractionlogic.com

AdieuToLogic · 2 months ago
> So the question is, can you hide the formal stuff under the hood, just like you can hide a calculator tool for arithmetic? Use informal English on the surface, while some of it is interpreted as a formal expression, put to work, and then reflected back in English?

The problem with trying to make "English -> formal language -> (anything else)" work is that informality is, by definition, not a formal specification and therefore subject to ambiguity. The inverse is not nearly as difficult to support.

Much like how a property in an API initially defined as being optional cannot be made mandatory without potentially breaking clients, whereas making a mandatory property optional can be backward compatible. IOW, the cardinality of "0 .. 1" is a strict superset of "1".
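That compatibility asymmetry can be made concrete with a sketch (field names are hypothetical): a client that omits an optional property keeps working as long as the property stays optional, but breaks the moment the contract makes it mandatory.

```python
def parse_v1(payload: dict) -> dict:
    # v1 contract: "nickname" is optional (cardinality 0..1)
    return {"name": payload["name"], "nickname": payload.get("nickname")}

def parse_v2_strict(payload: dict) -> dict:
    # v2 tightens the contract: "nickname" is now mandatory (cardinality 1)
    return {"name": payload["name"], "nickname": payload["nickname"]}

old_payload = {"name": "Ada"}                     # valid under the v1 contract
assert parse_v1(old_payload)["nickname"] is None  # optional field: still works

try:
    parse_v2_strict(old_payload)  # the same payload breaks under v2
    broke = False
except KeyError:
    broke = True
assert broke
```

Going the other way (making a mandatory field optional) would leave every existing payload valid, which is the superset relationship the comment describes.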

practal · 2 months ago
> The problem with trying to make "English -> formal language -> (anything else)" work is that informality is, by definition, not a formal specification and therefore subject to ambiguity. The inverse is not nearly as difficult to support.

Both directions are difficult and important. How do you determine when going from formal to informal that you got the right informal statement? If you can judge that, then you can also judge if a formal statement properly represents an informal one, or if there is a problem somewhere. If you detect a discrepancy, tell the user that their English is ambiguous and that they should be more specific.

lelanthran · 2 months ago
> Use informal English on the surface, while some of it is interpreted as a formal expression, put to work, and then reflected back in English? I think that is possible, if you have a formal language and logic that is flexible enough, and close enough to informal English.

That sounds like a paradox.

Formal verification can prove that constraints are held. English cannot. Mapping between them necessarily requires disambiguation. How would you construct such a disambiguation algorithm, which must, by its nature, be deterministic?

practal · 2 months ago
Going from informal to formal can be done using autoformalization [1]. The real question is, how do you judge that the result is correct?

[1] Autoformalization with Large Language Models — https://papers.nips.cc/paper_files/paper/2022/hash/d0c6bc641...

andrepd · 2 months ago
Not gonna lie, after skimming the website and a couple preprints for 10 minutes my crank detector is off the charts. Your very vague comments adds to it.

But maybe I just don't understand.

practal · 2 months ago
Yes, you just don't understand :-)

I am working on making it simpler to understand, and particularly, simpler to use.

PS: People keep browsing the older papers although they are really outdated. I've updated http://abstractionlogic.com to point to the newest information instead.

redbell · 2 months ago
> "English is the new programming language."

For those who missed it, here's the viral tweet by Karpathy himself: https://x.com/karpathy/status/1617979122625712128

throwaway314155 · 2 months ago
Referenced in the video, of course. Not that everyone should watch a 40-minute video before commenting, but his reaction to the "meme" that vibe coding became (when his tweet was intended as more of a shower thought) is worth checking out.
singularity2001 · 2 months ago
lean 4/5 will be a rising star!
practal · 2 months ago
You would definitely think so, Lean is in a great position here!

I am betting though that type theory is not the right logic for this, and that Lean can be leapfrogged.

kordlessagain · 2 months ago
This thread perfectly captures what Karpathy was getting at. We're witnessing a fundamental shift where the interface to computing is changing from formal syntax to natural language. But you can see people struggling to let go of the formal foundations they've built their careers on.
uncircle · 2 months ago
> This thread perfectly captures what Karpathy was getting at. We're witnessing a fundamental shift where the interface to computing is changing from formal syntax to natural language.

Yes, telling a subordinate with natural language what you need is called being a product manager. Problem is, the subordinate has encyclopedic knowledge but it's also extremely dumb in many aspects.

I guess this is good for people who got into CS, hate the craft, and so prefer doing management, but in many cases you still need on your team someone with an IQ higher than room temperature to deliver a product. The only "fundamental" shift here is killing the entry-level coder at the big corp tasked with doing menial and boilerplate tasks, when instead you can hire a mechanical replacement from an AI company for a few hundred dollars a month.

norir · 2 months ago
Have you thought through the downsides of letting go of these formal foundations that have nothing to do with job preservation? This comes across as a rather cynical interpretation of the motivations of those who have concerns.
otabdeveloper4 · 2 months ago
> We're witnessing a fundamental shift where the interface to computing is changing from formal syntax to natural language.

People have said this every year since the 1950's.

No, it is not happening. LLMs won't help.

Writing code is easy; it's understanding the problem domain that's hard. LLMs won't help you understand the problem domain in a formal manner. (In fact they might make it even more difficult.)

megaman821 · 2 months ago
Yep, that's why I never write anything out using mathematical expressions. Natural language only, baby!
Eggpants · 2 months ago
No. Karpathy has long embraced the Silly-con Valley "fake it until you make it" mindset. One of his slides even had a frame of a Tesla self-driving video that was later revealed to be faked.

It’s in his financial self interest to over inflate LLM’s beyond their “cool math bar trick” level. They are a lossy text compression technique with stolen text sources.

All this “magic” is just function calls behind the scenes doing web/database/math/etc for the LLM.

Anyone who claims LLMs have a soul either truly doesn’t understand how they work (association rules++) or has hitched their financial wagon to this grift. It’s the crypto coin bruhs looking for their next score.

skydhash · 2 months ago
Not really. There's a problem to be solved, and the solution is always best expressed in formal notation, because we can then let computers do it and not worry about it.

We already have natural languages for human systems and the only way it works is because of shared metaphors and punishment and rewards. Everyone is incentivized to do a good job.

mkleczek · 2 months ago
This is why I call all this AI stuff BS.

Using a formal language is a feature, not a bug. It is a cornerstone of all human engineering and scientific activity and is the _reason_ why these disciplines are successful.

What you are describing (ie. ditching formal and using natural language) is moving humanity back towards magical thinking, shamanism and witchcraft.

neuronic · 2 months ago
It's called gatekeeping and the gatekeepers will be the ones left in the dust. This has been proven time and time again. Better learn to go with the flow - judging LLMs on linear improvements or even worse on today's performance is a fool's errand.

Even if improvements level off and start plateauing, things will still get better and for careful guided, educated use LLMs have already become a great accelerator in many ways. StackOverflow is basically dead now which in itself is a fundamental shift from just 3-4 years ago.

dang · 2 months ago
This was my favorite talk at AISUS because it was so full of concrete insights I hadn't heard before and (even better) practical points about what to build now, in the immediate future. (To mention just one example: the "autonomy slider".)

If it were up to me, which it very much is not, I would try to optimize the next AISUS for more of this. I felt like I was getting smarter as the talk went on.

kaycebasques · 2 months ago
On one hand, I think Karpathy is a gifted educator in a way that's not repeatable as a science. On the other, if the conference leaders next year told every presenter to watch this talk and emulate how Karpathy focuses on concrete insights and suggests what to build now, then the overall quality of presentations would probably trend higher.
hgl · 2 months ago
It’s fascinating to think about what true GUI for LLM could be like.

It immediately makes me think a LLM that can generate a customized GUI for the topic at hand where you can interact with in a non-linear way.

karpathy · 2 months ago
Fun demo of an early idea was posted by Oriol just yesterday :)

https://x.com/OriolVinyalsML/status/1935005985070084197

spamfilter247 · 2 months ago
My takeaway from the demo is less that "it's different each time", but more a "it can be different for different users and their styles of operating" - a poweruser can now see a different Settings UI than a basic user, and it can be generated realtime based on the persona context of the user.

Example use case (chosen specifically for tech): An IDE UI that starts basic, and exposes functionality over time as the human developer's skills grow.

superfrank · 2 months ago
On one hand, I'm incredibly impressed by the technology behind that demo. On the other hand, I can't think of many things that would piss me off more than a non-deterministic operating system.

I like my tools to be predictable. Google search trying to predict that I want the image or shopping tag based on my query already drives me crazy. If my entire operating system did that, I'm pretty sure I'd throw my computer out a window.

hackernewds · 2 months ago
it's impressive but it seems like a crappier UX? that none of the patterns can really be memorized
asterisk_ · 2 months ago
I feel like one quickly hits a similar partial-observability problem as with e.g. light sensors. How often do you wave around, annoyed, because the light turned off?

To get _truly_ self-driving UIs you need to read the mind of your users. It's some heavy-tailed distribution all the way down. Interesting research problem on its own.

We already have adaptive UIs (profiles in VSC, anyone? Vim, Emacs?); they're mostly under-utilized because they take time to set up, and most people are not better at designing their own workflow than the sane default.

aprilthird2021 · 2 months ago
This is crazy cool, even if not necessarily the best use case for this idea
throwaway314155 · 2 months ago
I would bet good money that many of the functions they chose not to drill down into (such as settings -> volume) do nothing at all or cause an error.

It's a front-end generator. It's fast. That's cool. But it's being pitched as a functioning OS generator, and I can't help but think it isn't, given the failure rates for those sorts of tasks. Further, the success rates for HTML generation probably _are_ good enough for a Holmes-esque (perhaps too harsh) rugpull (again, too harsh) demo.

A cool glimpse into what the future might look like in any case.

superconduct123 · 2 months ago
That looks both cool and infuriating
suddenlybananas · 2 months ago
Having different documents come up every time you go into the documents directory seems hellishly terrible.
cjcenizal · 2 months ago
My friend Eric Pelz started a company called Malleable to do this very thing: https://www.linkedin.com/posts/epelz_every-piece-of-software...
whatarethembits · 2 months ago
I'm curious where this ends up going.

Personally I think its a mistake; at least at "team" level. One of the most valuable things about a software or framework dictating how things are done is to give a group of people a common language to communicate with and enforce rules. This is why we generally prefer to use a well documented framework, rather than letting a "rockstar engineer" roll their own. Only they will understand its edge cases and ways of thinking, everyone else will pay a price to adapt to that, dragging everyone's productivity down.

Secondly, most people don't know what they want or how they want to work with a specific piece of software. It's simply not important enough, in the hierarchy of other things they care about, for them to form opinions about how a specific piece of software ought to work. What they want is the easiest and fastest way to get something done and move on. It takes insight, research and testing to figure out what that is in a specific domain. This is what "product people" are supposed to figure out; not farm it out to individual users.

jonny_eh · 2 months ago
An ever-shifting UI sounds unlearnable, and therefore unusable.
dang · 2 months ago
It wouldn't be unlearnable if it fits the way the user is already thinking.
OtherShrezzing · 2 months ago
A mixed ever-shifting UI can be excellent though. So you've got some tools which consistently interact with UI components, but the UI itself is altered frequently.

Take for example world-building video games like Cities Skylines / Sim City or procedural sandboxes like Minecraft. There are 20-30 consistent buttons (tools) in the game's UX, while the rest of the game is an unbounded ever-shifting UI.

9rx · 2 months ago
Tools like v0 are a primitive example of what the above is talking about. The UI maintains familiar conventions, but is laid out dynamically based on surrounding context. I'm sure there are still weird edge cases, but for the most part people have no trouble figuring out how to use the output of such tools already.
sotix · 2 months ago
Like Spotify ugh
dpkirchner · 2 months ago
Like a HyperCard application?
necrodome · 2 months ago
We (https://vibes.diy/) are betting on this
stoisesky · 2 months ago
This talk https://www.youtube.com/watch?v=MbWgRuM-7X8 explores the idea of generative / malleable personal user interfaces where LLMs can serve as the gateway to program how we want our UI to be rendered.
nbbaier · 2 months ago
I love this concept and would love to know where to look for people working on this type of thing!
semi-extrinsic · 2 months ago
Humans are shit at interacting with systems in a non-linear way. Just look at Jupyter notebooks and the absolute mess that arises when you execute code blocks in arbitrary order.
bicepjai · 2 months ago
What is the mess you are referring with regards to Jupyter notebooks ?
nilirl · 2 months ago
Where do these analogies break down?

1. Similar cost structure to electricity, but non-essential utility (currently)?

2. Like an operating system, but with non-determinism?

3. Like programming, but ...?

Where does the programming analogy break down?

PeterStuer · 2 months ago
Define non-essential.

The way I see dependency in office ("knowledge") work:

- pre-(computing) history. We are at the office, we work

- dawn of the pc: my computer is down, work halts

- dawn of the lan: the network is down, work halts

- dawn of the Internet: the Internet connection is down, work halts (<- we are basically all here)

- dawn of the LLM: ChatGPT is down, work halts (<- for many, we are here already)

nilirl · 2 months ago
I see your point. It's nearing essential.
rudedogg · 2 months ago
> programming

The programming analogy is convenient but off. The joke has always been “the computer only does exactly what you tell it to do!” regarding logic bugs. Prompts and LLMs most certainly do not work like that.

I loved the parallels with modern LLMs and time sharing he presented though.

diggan · 2 months ago
> Prompts and LLMs most certainly do not work like that.

It quite literally works like that. The computer is now OS + user-land + LLM runner + ML architecture + weights + system prompt + user prompt.

Taken together, and since you're adding in probabilities (by using ML/LLMs), you're quite literally getting "the computer only does exactly what you tell it to do!", it's just that we have added "but make slight variations to what tokens you select next" (temperature>0.0) sometimes, but it's still the same thing.

Just like when you tell the computer to create encrypted content by using some seed. You're getting exactly what you asked for.
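The seed analogy can be shown directly with the stdlib: fix the seed and the "random" sequence is exactly reproducible, run after run. A sketch (`sample_tokens` is a stand-in for a sampler, not a real LLM):

```python
import random

def sample_tokens(vocab, n, seed):
    """Seeded sampling: same seed + same input => same output, every run."""
    rng = random.Random(seed)
    return [rng.choice(vocab) for _ in range(n)]

vocab = ["the", "cat", "sat", "on", "mat"]
run1 = sample_tokens(vocab, 5, seed=42)
run2 = sample_tokens(vocab, 5, seed=42)
assert run1 == run2  # the "randomness" is a deterministic function of the seed
```

The apparent nondeterminism of hosted LLMs comes from the provider not fixing (or not exposing) these knobs, not from the math itself.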


politelemon · 2 months ago
only in English, and also non-deterministic.
malux85 · 2 months ago
Yeah, wherever possible I try to have the llm answer me in Python rather than English (especially when explaining new concepts)

English is soooooo ambiguous

mikewarot · 2 months ago
A few days ago, I was introduced to the idea that when you're vibe coding, you're consulting a "genie": much like in the fables, you almost never get what you asked for, but if your wishes are small, you might just get what you want.

The primagen reviewed this article[1] a few days ago, and (I think) that's where I heard about it. (Can't re-watch it now, it's members only) 8(

[1] https://medium.com/@drewwww/the-gambler-and-the-genie-08491d...

anythingworks · 2 months ago
That's a really good analogy! It feels like a wicked joke that LLMs behave in such a way that they're both intelligent and stupid at the same time.
fudged71 · 2 months ago
“You are an expert 10x software developer. Make me a billion dollar app.” Yeah this checks out
abdullin · 2 months ago
Tight feedback loops are the key in working productively with software. I see that in codebases up to 700k lines of code (legacy 30yo 4GL ERP systems).

The best part is that AI-driven systems are fine with running even tighter loops than what a sane human would tolerate.

Eg. running full linting, testing and E2E/simulation suite after any minor change. Or generating 4 versions of PR for the same task so that the human could just pick the best one.
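The "generate four PRs, pick the best" workflow is mechanically simple to drive. A hedged sketch (the candidates and checks are stubs standing in for an LLM and a real lint/test/E2E suite):

```python
def score(candidate, checks):
    """Run every check against a candidate patch; score = number passed."""
    return sum(1 for check in checks if check(candidate))

def best_of_n(candidates, checks):
    # Rank candidate patches by how much of the suite they pass;
    # a human still reviews the winner before merging.
    return max(candidates, key=lambda c: score(c, checks))

# Stub checks: pretend lint, unit tests, and E2E, each returning pass/fail.
checks = [
    lambda c: "fix" in c,         # pretend lint
    lambda c: len(c) < 20,        # pretend unit tests
    lambda c: c.endswith("bug"),  # pretend E2E
]
candidates = ["fix the bug", "refactor everything thoroughly", "rewrite in rust"]
assert best_of_n(candidates, checks) == "fix the bug"
```

The scoring function is where the tight loop lives: the machine can afford to re-run the whole suite for every candidate, which is exactly what a human would not tolerate doing by hand.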

bandoti · 2 months ago
Here’s a few problems I foresee:

1. People get lazy when presented with four choices they had no hand in creating, and they don’t look over the four and just click one, ignoring the others. Why? Because they have ten more of these on the go at once, diminishing their overall focus.

2. Automated tests, end-to-end sim., linting, etc—tools already exist and work at scale. They should be robust and THOROUGHLY reviewed by both AI and humans ideally.

3. AI is good for code reviews and “another set of eyes” but man it makes serious mistakes sometimes.

An anecdote for (1), when ChatGPT tries to A/B test me with two answers, it’s incredibly burdensome for me to read twice virtually the same thing with minimal differences.

Code reviewing four things that do almost the same thing is more of a burden than writing the same thing once myself.

abdullin · 2 months ago
A simple rule applies: "No matter what tool created the code, you are still responsible for what you merge into main".

As such, the task of verification still falls on the engineers.

Given that and proper processes, modern tooling works nicely with codebases ranging from 10k LOC (mixed embedded device code with Golang backends and Python DS/ML) to 700k LOC (legacy enterprise applications from the mainframe era).

eddd-ddde · 2 months ago
With lazy people the same applies for everything, code they do write, or code they review from peers. The issue is not the tooling, but the hands.
OvbiousError · 2 months ago
I don't think the human is the problem here, but the time it takes to run the full testing suite.
tlb · 2 months ago
Yes, and (some near-future) AI is also more patient and better at multitasking than a reasonable human. It can make a change, submit for full fuzzing, and if there's a problem it can continue with the saved context it had when making the change. It can work on 100s of such changes in parallel, while a human trying to do this would mix up the reasons for the change with all the other changes they'd done by the time the fuzzing result came back.

LLMs are worse at many things than human programmers, so you have to try to compensate by leveraging the things they're better at. Don't give up with "they're bad at such and such" until you've tried using their strengths.

abdullin · 2 months ago
Humans tend to lack inhumane patience.
diggan · 2 months ago
It is kind of a human problem too. That the full testing suite takes X hours to run is not fun either, but it makes the human problem larger.

Say you're Human A, working on a feature. Running the full testing suite takes 2 hours from start to finish. Every change you make to existing code needs to be confirmed to not break existing stuff with the full testing suite, so for some changes it takes 2 hours before you know with 100% certainty that nothing else broke. How quickly do you lose interest, and at what point do you give up and either improve the testing suite, or just skip that feature/implement it some other way?

Now say you're Robot A working on the same task. The robot doesn't care if each change takes 2 hours to appear on their screen, the context is exactly the same, and they're still "a helpful assistant" 48 hours later when they still try to get the feature put together without breaking anything.

If you're feeling brave, you start Robot B and C at the same time.

londons_explore · 2 months ago
The full test suite is probably tens of thousands of tests.

But AI will do a pretty decent job of telling you which tests are most likely to fail on a given PR. Just run those ones, then commit. Cuts your test time from hours down to seconds.

Then run the full test suite only periodically and automatically bisect to find out the cause of any regressions.

Dramatically cuts the compute costs of tests too, which in big codebase can easily become whole-engineers worth of costs.
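A toy sketch of that periodic-run-and-bisect step, assuming `git bisect run` is available; the five-commit repo and the one-liner "test suite" are stand-ins invented for illustration:

```shell
# Five commits; commit 3 introduces a "regression". `git bisect run`
# finds it automatically by running the test at each step. The one-liner
# test is a stand-in for a real (slow) suite run only periodically.
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.email ci@example.com && git config user.name ci
for i in 1 2 3 4 5; do
  echo "$i" > value.txt
  git add value.txt && git commit -q -m "commit $i"
done
good=$(git log --reverse --format=%H | head -1)   # commit 1 is known good
git bisect start HEAD "$good" > /dev/null
# "Suite": fails once value.txt >= 3, i.e. commit 3 broke it.
git bisect run sh -c 'test "$(cat value.txt)" -lt 3' > /dev/null 2>&1
first_bad=$(git show -s --format=%s refs/bisect/bad)
echo "$first_bad"                                  # prints: commit 3
git bisect reset > /dev/null
```

`git bisect run` treats exit code 0 as "good" and nonzero as "bad", so any fast, scriptable check (here the `test` one-liner) can drive the search without a human in the loop.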

Byamarro · 2 months ago
I work in web dev, so people sometimes hook code formatting into a git commit hook, or even run it on file save. The tests are problematic, though. If you work on a huge project, running them like that is a no-go. If you work on a medium one, the tests are long enough to block you, but short enough that you can't focus on anything else in the meantime.
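For reference, the format-on-commit hook can be sketched in a throwaway repo like this; the "formatter" is a stand-in `sed` pass (assuming GNU sed), where a real setup would call prettier, gofmt, ruff format, or similar:

```shell
# Demo repo with a pre-commit hook that formats staged files in place.
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.email dev@example.com && git config user.name dev
cat > .git/hooks/pre-commit <<'EOF'
#!/bin/sh
# Format every staged file, then re-stage it so the commit picks up the fix.
for f in $(git diff --cached --name-only --diff-filter=ACM); do
  sed -i 's/[[:space:]]*$//' "$f"   # stand-in formatter: strip trailing spaces
  git add "$f"
done
EOF
chmod +x .git/hooks/pre-commit

printf 'hello   \n' > file.txt          # note the trailing spaces
git add file.txt && git commit -q -m "demo"
git show HEAD:file.txt                  # prints: hello (whitespace stripped)
```

Because the hook re-stages the files it touched, the committed snapshot already contains the formatted version, which is what makes hooks workable for fast checks like formatting but painful for slow ones like tests.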
9rx · 2 months ago
Unless you are doing something crazy like letting the fuzzer run on every change (cache that shit), the full test suite taking a long time suggests that either your isolation points are way too large, or you are letting the LLM cross isolated boundaries and "full testing suite" here actually means "multiple full testing suites". The latter is an easy fix: don't let it. Force it to stay within a single isolation zone, just like you'd expect of a human. The former is a lot harder to fix, but I suppose ending up there is a strong indicator that you can't trust the human picking the best LLM result in the first place, and that maybe this whole thing isn't a good idea for the people in your organization.
yahoozoo · 2 months ago
The problem is that every time you run your full automation with linting and tests, you’re filling up the context window more and more. I don’t know how people using Claude do it with its <300k context window. I get the “your message will exceed the length of this chat” message so many times.
diggan · 2 months ago
I don't know exactly how Claude works, but the way I work around this with my own stuff is prompting it to never display full outputs, and instead temporarily redirect the output somewhere, then grep the log file for what it's looking for. So a test run producing 10K lines of output and one failure is easily found without polluting the context with 10K lines.
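The pattern can be sketched like this; the noisy test run is simulated with a loop (a real setup would redirect the actual test command into the log instead):

```shell
# Simulate a noisy test run: 10k passing lines plus one failure, written
# to a log file instead of the agent's context window.
log=$(mktemp)
for i in $(seq 1 10000); do echo "test_case_$i PASSED"; done > "$log"
echo "test_case_10001 FAILED: expected 200, got 500" >> "$log"

# The agent then greps only what it needs:
grep -c "FAILED" "$log"    # prints the failure count: 1
grep "FAILED" "$log"       # the one line worth putting in context
```

The context cost drops from 10K lines to the handful of lines that actually matter, and the full log stays on disk if deeper digging is needed.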
abdullin · 2 months ago
Claude's approach is currently a bit dated.

Cursor.sh agents or especially OpenAI Codex illustrate that a tool doesn't need to keep on stuffing context window with irrelevant information in order to make progress on a task.

And if really needed, engineers report that Gemini Pro 2.5 keeps working fine within a 200k-500k token context. Above that, it is better to reset the context.

the_mitsuhiko · 2 months ago
I started to use sub-agents for that. They don't pollute the context as much.
elif · 2 months ago
In my experience with Jules and (worse) Codex, juggling multiple pull requests at once is not advised.

Even if you tell the git-aware Jules to handle a merge conflict within the same context window the patch was generated in, it's like: sorry bro, I have no idea what's wrong, can you send me a diff with the conflict?

I find I have to be in the iteration loop at every stage, or else the agent rapidly forgets what it's doing or why. For instance, don't trust Jules to run your full test suite after every change without handholding and asking for specific run results every time.

It feels like to an LLM, gaslighting you with code that nominally addresses the core of what you just asked while completely breaking unrelated code or disregarding previously discussed parameters is an unmitigated success.

layer8 · 2 months ago
> Tight feedback loops are the key in working productively with software. […] even more tight loops than what a sane human would tolerate.

Why would a sane human be averse to things happening instantaneously?

latexr · 2 months ago
> Or generating 4 versions of PR for the same task so that the human could just pick the best one.

That sounds awful. A truly terrible and demotivating way to work and produce anything of real quality. Why are we doing this to ourselves and embracing it?

A few years ago, it would have been seen as a joke to say “the future of software development will be to have a million monkey interns banging on one million keyboards and submit a million PRs, then choose one”. Today, it’s lauded as a brilliant business and cost-saving idea.

We’re beyond doomed. The first major catastrophe caused by sloppy AI code can’t come soon enough. The sooner it happens, the better chance we have to self-correct.

chamomeal · 2 months ago
I say this all the time!

Does anybody really want to be an assembly line QA reviewer for an automated code factory? Sounds like shit.

Also I can’t really imagine that in the first place. At my current job, each task is like 95% understanding all the little bits, and then 5% writing the code. If you’re reviewing PRs from a bot all day, you’ll still need to understand all the bits before you accept it. So how much time is that really gonna save?

ponector · 2 months ago
>That sounds awful.

Not for the cloud provider. AWS bill to the moon!

osigurdson · 2 months ago
I'm not sure that AI code has to be sloppy. I've had some success with hand coding some examples and then asking codex to rigorously adhere to prior conventions. This can end up with very self consistent code.

Agree though on the "pick the best PR" workflow. This is pure model training work and you should be compensated for it.

bonoboTP · 2 months ago
If it's monkeylike quality and you need a million tries, it's shit. If you need four tries and one of those is top-tier professional programmer quality, then it's good.
diggan · 2 months ago
> A truly terrible and demotivating way to work and produce anything of real quality

You clearly have strong feelings about it, which is fine, but it would be much more interesting to know exactly why it would be terrible and demotivating, and why it cannot produce anything of quality. And what is "real quality", and does that mean "fake quality" exists?

> million monkey interns banging on one million keyboards and submit a million PRs

I'm not sure if you misunderstand LLMs, or the famous "monkeys writing Shakespeare" part, but that example is more about randomness and infinity than about probabilistic machines somewhat working towards a goal with some non-determinism.

> We’re beyond doomed

The good news is that we've been doomed for a long time, yet we persist. If you take a look at how the internet is basically held up by duct-tape at this point, I think you'd feel slightly more comfortable with how crap absolutely everything is. Like 1% of software is actually Good Software while the rest barely works on a good day.

koakuma-chan · 2 months ago
> That sounds awful. A truly terrible and demotivating way to work and produce anything of real quality

This is the right way to work with generative AI, and it already is an extremely common and established practice when working with image generation.

Deleted Comment