Readit News
Posted by u/chrisjj 2 years ago
I'm puzzled how anyone trusts ChatGPT for code
I asked ChatGPT to convert a simple 12-line Python program to Lisp.

The result is broken by mismatched brackets. Yes. In Lisp.

I ask for correction and receive the errant bracket moved, still mismatched.

I ask again and the bracket gets removed. Fine??

No. A loop break has been mistranslated to a return from function, so execution is nonsense.

I wonder how much user faith in ChatGPT is based on examples in which the errors are not apparent ... to a certain kind of user.

delichon · 2 years ago
I'm puzzled why anyone would incorporate code on trust. I certainly don't trust LLMs for code, but I use code from LLMs on most work days ... not because I trust it, but because I understand it and tested it. So, the same as code from anyone, including from myself six months ago. Trust doesn't enter into it. Clear and workable solutions do, and I've gotten an impressive number of those.
gordon_freeman · 2 years ago
Yes. LLMs are great at giving a template or code block for a specific use case, and then I use it after understanding, modifying (if needed), and testing it. I am not sure about ChatGPT, but Gemini Advanced even allows modifying code and running it within the answer itself, which is convenient for testing out minor changes.

Also, getting a template of hundreds of LOC saves time over writing it from scratch, even if I know exactly what I'd need to write.

chrisjj · 2 years ago
> LLMs are great at giving a template or code block for a specific use case, and then I use it after understanding,

I suspect the typical ChatGPT user is using it for code that's beyond his ability to write. That being the whole point of his using it.

It follows that such code is likely beyond his ability to understand. But worse, beyond his ability to understand he doesn't understand it.

repelsteeltje · 2 years ago
> Also, getting a template of hundreds of LOC saves time over writing it from scratch, even if I know exactly what I'd need to write.

I rarely know exactly what I need to write and writing it usually isn't the problem. Sans LLM, there have always been tools and techniques you can lean on, like syntax highlighting, auto completion, your compiler feedback, or unit tests.

I find that writing code consumes only a small percentage of my time. And while writing code is obviously where bugs originate, it's not where the big problems come from.

Making software good is the easy part; making the right software is where the trouble lies. It's why I spend most of my time not coding, but talking to people. Figuring out what needs to be done, how, when, and why.

GenerocUsername · 2 years ago
Agree. The snippets from ChatGPT are essentially open-source as in they are fully inspectable and can be freely modified before execution. There is zero trust required.
yevlasenko · 2 years ago
I would back this one up as well. The code I generate in ChatGPT is rather "dumb" (mostly React components, NestJS modules, etc.) and in this case I trust myself to test and use it.
fassssst · 2 years ago
Right, hopefully no one is sprinkling eval(prompt) into their codebase.
1f60c · 2 years ago
I had an idea about that the other day!

What if you could write something like:

  @implement_this
  def prime_sieve(n: int) -> list[int]:
    pass
And the decorator reads the function name and optional docstring, runs it through an LLM and replaces the function with one implementing the desired behavior (hopefully correctly). I remember there was something like this for StackOverflow.
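
A minimal sketch of what that decorator might look like (assuming a hypothetical ask_llm(prompt) helper standing in for whatever LLM client you'd actually call):

  import functools
  import inspect

  def ask_llm(prompt: str) -> str:
      # Hypothetical helper: send the prompt to your LLM of choice and
      # return the Python source it comes back with.
      raise NotImplementedError

  def implement_this(func):
      # Build a prompt from the stub's signature and docstring.
      prompt = "Implement this Python function:\n" + inspect.getsource(func)
      namespace = {}
      # Effectively eval(prompt): the generated source runs unreviewed.
      exec(ask_llm(prompt), namespace)
      return functools.wraps(func)(namespace[func.__name__])

Which is, of course, eval(prompt) with extra steps.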

_akhe · 2 years ago
Not gonna lie... llama.cpp... LlamaIndex... Ollama... kinda is eval(prompt) a lot of the time!

Of course AI data pipelines are a totally different conversation than code solutions.

chrisjj · 2 years ago
> I'm puzzled why anyone would incorporate code on trust.

Because almost every programmer these days has learned by a route that relies on incorporating code on trust. E.g. using someone else's compiler.

> I use code from LLMs on most work days ... not because I trust it, but because I understand it and tested it.

The snag there is that one can rarely verify test coverage is sufficient.

> the same as code from anyone, including from myself six months ago.

But it is not the same. ChatGPT delivers completely untested code. Not even syntax-checked in my case. Would you accept such code from any human?

> Trust doesn't enter into it. Clear and workable solutions do

Plenty of clear and workable solutions are discovered unsafe. I suspect that's much more likely when the code comes from a source such as ChatGPT that delivers code in a completely untested state.

Brian_K_White · 2 years ago
Syntactically incorrect code is not much different than an outline or pseudocode.

There are far too many stories of people successfully accomplishing tasks by getting an LLM to give them a starting point or outline, or even a complete solution that only needed a few fixups, hardly any different from debugging your own first draft, to say it doesn't or can't work. That ship has already sailed.

The fact that it doesn't produce finished correct code, or even appropriate outline for every single case, doesn't seem especially remarkable or damning to me. It is what it is, it's not what it's not.

(I haven't used any myself, but I'm not primarily a coder who is obligated to take advantage of every available tool, and I don't like how the LLM companies are training on open source code without either consent or attribution and then selling that. And I don't like how AI, and even mere speech-to-text voice interfaces before that, is being added to everything in general.)

zouhair · 2 years ago
I am no programmer and I love how ChatGPT can explain a bash script to me line by line; it makes reading those way easier and more enjoyable.
argc · 2 years ago
Yeah, to use it effectively you use it for inspiration and generation of template code. The code may need a lot of changes or may need relatively few. But don't copy it expecting it to work without validating and testing it.
2OEH8eoCRo0 · 2 years ago
So you've audited all code that you're running on your system?
burnte · 2 years ago
> I'm puzzled why anyone would incorporate code on trust.

Really? You can't understand why people do things based on trust? Do you trust no one?

simonw · 2 years ago
Trust but verify. Code review exists because even the best developers make mistakes.
Sohcahtoa82 · 2 years ago
If you're expecting to just copy/paste from ChatGPT without understanding what it generates, you're really doing it wrong.

I've only used it for basic snippets where I know what the code I need is, but it's faster to describe it than write it out myself (for example, "Using Python and boto3, take all the files in the current directory that end in .csv or .pdf and upload them to a bucket named "reports". Prepend the S3 keys with "YYYY-MM-DD/" using today's date").
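
The kind of snippet it comes back with for a prompt like that looks roughly like this (a sketch from memory, not its verbatim output):

  from datetime import date
  from pathlib import Path

  import boto3

  s3 = boto3.client("s3")
  prefix = date.today().isoformat() + "/"  # "YYYY-MM-DD/" prefix for the S3 keys

  for path in Path(".").iterdir():
      if path.is_file() and path.suffix in (".csv", ".pdf"):
          s3.upload_file(str(path), "reports", prefix + path.name)

Small enough to eyeball in a few seconds before running, which is the point.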

For simple things like that, it's been great.

paulmd · 2 years ago
It’s also constrained by how many examples it’s seen of a particular language or data format, and the general complexity of the format (even without AI you’ll see diff breaking on json files for much the same reason, unless you go out of your way to avoid it - brackets are semantic white noise and there’s nothing unique for it to lock onto between those bracket tokens). At a minimum it’s going to absolutely chew through tokens to maintain that much context.

Obscure languages and operating systems are also a weakness - for example I could not get llama3 to reply with FreeBSD commands (it kept giving Linux ones even after repeatedly telling it not to) - there just aren’t that many SO/Reddit/forum posts about BSD in comparison to Linux. And evidently that’s what they’ve trained it on. There isn’t even anything semantically odd about an rc.conf; it spits out correct statements, just for the wrong file in the wrong OS.

Everyone acknowledges things like multiplication and counting are hard for LLMs because they aren’t statistically encoded. But counting Lisp brackets is basically the same problem. And unlike YAML it’s not really self-healing or obviously correctable once it’s fucked up (as you noticed). I think the implication is there are clearly some syntaxes that are much less generatable simply because they require intense amounts of context.

(also, as far as counting… I’m sure far smarter people have looked at this, but maybe multiplication is something that could benefit from log-addition techniques, which might reduce the complexity quite a bit vs a naive “multiply column by column like a human fourth-grader” approach.)
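
A toy illustration of the identity I have in mind, a*b == exp(ln a + ln b), with nothing said about how you'd actually wire it into a model:

  import math

  # Multiplication rewritten as addition in log space: a * b == exp(ln(a) + ln(b))
  a, b = 6.0, 7.0
  product = math.exp(math.log(a) + math.log(b))
  print(product)  # ~42.0, up to floating-point error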

chrisjj · 2 years ago
> It’s also constrained by how many examples it’s seen of a particular language

ChatGPT: "I've encountered numerous examples of Lisp code during my training, covering different applications and techniques within the language. Whether it's simple examples or more complex implementations, I've seen quite a bit of Lisp code."

> the general complexity of the format

Good then that this example has a very simple format.

> even without AI you’ll see diff breaking on json files for much the same reason, unless you go out of your way to avoid it - brackets are semantic white noise

Not a problem in this case.

NewsaHackO · 2 years ago
Yep. OP had to ask it to fix mismatched brackets, and is a perfect example of the type of person who definitely should not use LLMs for code.
chrisjj · 2 years ago
Actually I didn't have to.

I chose to. So I could give it another chance to succeed.

But do tell. What type of person does not have to ask it to fix its mismatched brackets?

boredtofears · 2 years ago
I’m sure most people can learn to use LLMs correctly if they can code; it doesn’t matter what “type” of person you are.
peteradio · 2 years ago
I've seen self described "product owners" claim to build using LLM outputs without knowing how to code.
chrisjj · 2 years ago
> If you're expecting to just copy/paste from ChatGPT without understanding what it generates, you're really doing it wrong.

Glad I'm not doing that, then. I understand what it is generating as soon as I try to run it - and get a "missing bracket" error. I've understood that what it has generated is crap. And that only good luck made that easily detectable.

shinycode · 2 years ago
Our intern uses Copilot extensively and the code it produces is riddled with errors. What a pain to review; a lot of time wasted. This is really concerning for newer generations who are not professionals yet but trust it because the code looks better than what they are able to do. Worse, they outsource their brain and don’t sharpen their senses. How will they become professionals this way? As a help it’s okay; as a cheating tool that makes them faster …
zamadatix · 2 years ago
An intern using LLMs to generate bad untested code is no different than an intern using Stack Overflow to generate bad untested code: the problem isn't Stack Overflow or the LLM or whatever tool the intern is using, but rather the lack of quality control and code review the intern applies to their own code. Making them write everything from scratch isn't an instant cure either; you can write some really badly designed stuff right from scratch and push it immediately on up too. This doesn't make you mentally sharper along the way, it makes you confidently sloppy.

A better path is to ensure they are spending significantly more time reviewing, testing, and incrementally improving their code than they are typing/generating it. That's where you really sharpen your brain/senses... and also where you keep from driving everyone else nuts with your PRs. Be it LLM or senior dev, if you just say "one shot this functionality so I can push it to main" you're going to have a bad day.
chrisjj · 2 years ago
> An intern using LLMs to generate bad untested code is no different than an intern using Stack Overflow to generate bad untested code

Very different. An LLM lets him generate /far more/ bad untested code per unit of effort.

And SO encourages other humans to test it. Often an SO answer comes with evaluative comments and ratings that are all I need to reject it. If the LLM was "AI", it would at least provide the same, right?

> A better path is to ensure they are spending significantly more time reviewing, testing, and incrementally improving their code than they are typing/generating it

Then let's see the LLM-based workflow that achieves that...

shinycode · 2 years ago
The main difference with SO is that the code from SO cannot always be copy/pasted; the context is different, so you have to rethink it. With Copilot you have the sense that it’s contextualized and it’s the right fit. With pair programming I saw him accept the autocompleted code multiple times without reading it; he just said « wow it got it fast ». That for me is even worse than SO in that regard. For the rest I agree.
robryan · 2 years ago
At least with Stack Overflow you can be fairly confident the code works correctly in the context it was provided in, if it has a lot of votes.
JTyQZSnP3cQGa8B · 2 years ago
Why aren’t you banning the use of such tools across the company? I think I would do that, if only out of fear that it might leak private source code to random companies.
shinycode · 2 years ago
Because it’s not in my power to decide, but I’ve proposed it.
jgb1984 · 2 years ago
ChatGPT, when I was forced to use it, often got confused about which is the bigger number, 2 or 3. I don't care how many lines of bottom-of-the-barrel code it can throw up, I'm not using any of it. AI-generated spaghetti debt.

LLMs will turn out to be nothing more than another glittering gadget. Except this one will obliterate any coding skills of future generations. Not to mention the gigatons of AI-generated garbage text, images and music that it's filling the world with. What a waste.

sdenton4 · 2 years ago
User: which is the larger number, 3 or 2?

ChatGPT: The larger number is 3.

User: If I have one apple, pick another apple and then get an apple from a friend, and Alice has two apples, who has more apples?

ChatGPT: You have more apples. You started with one apple, picked another, and got one more from a friend, so you have three apples. Alice has two apples.

[edit]:

User: what does good will hunting say to the know-it-all at the bar?

ChatGPT: After Will Hunting outsmarts the know-it-all student in the bar, he spots the student's female friend outside a shop later on. To tease the student further and show his wit, Will taps on the window to get her attention and, holding up his phone number, famously asks, "Do you like apples?" After she nods, Will triumphantly says, "Well, I got her number. How do you like them apples?" This line became quite iconic and is one of the memorable quotes from the movie.

Jensson · 2 years ago
Since ChatGPT doesn't have a central place to do math, its ability to know whether 3 is bigger than 2 depends on the context, so your examples don't prove anything.

Every time I've tried GPT-4 etc. there are such problems. I don't use it for trivial things, since then what's the point, and for anything non-trivial it screws up too much.

It is useful for translations and brainstorming, but not much else. Translating from natural language to code is a part of that, but not letting it code itself.

cvhashim04 · 2 years ago
I think you're in for a rude awakening. The software engineering field will be an unrecognizable landscape 5-10 years from now and you'll be looked at like the guy who codes in TextEdit with no IDE if you aren't using some form of AI to generate code -- notice I said generate; coding from scratch will be a thing of the past also.
jgb1984 · 2 years ago
That's incredibly funny, since I've been developing software for 20+ years using Debian and vim (look mom, no IDE!) and I'm pretty sure that's what I'll be doing for the years to come. Imagine how valuable it will be when you're in the minority of humans who still understand how a Linux system operates, and how to write and read your own code! Amazing! I absolutely welcome cloud and AI, happy to let it destroy any and all wisdom of future generations :)
superfrank · 2 years ago
> ChatGPT, when I was forced to use it, often got confused about which is the bigger number, 2 or 3

I mean... yeah. ChatGPT doesn't know what numbers or logic are. It wasn't "confused". It just doesn't do that.

You're using it for something it's not intended to be used for and, unsurprisingly, it's bad at it. If I said I'm giving up on using hammers because they were terrible for screwing screws into the wall, you'd tell me I'm using the wrong tool. This is no different. ChatGPT has its uses, but it isn't magic. Just like any tool it has limitations.

groby_b · 2 years ago
You... might want to inform yourself a bit more before making proclamations about the future.

LLMs have fairly well understood failure modes. They have incredible strengths, too. And they differ very much by what you use. If the ChatGPT you were forced to use was GPT3.5 (and the behavior sounds very much like it was), you literally have no idea what modern LLMs can actually do.

chx · 2 years ago
Every time...

https://hachyderm.io/@inthehands/112006855076082650

> You might be surprised to learn that I actually think LLMs have the potential to be not only fun but genuinely useful. “Show me some bullshit that would be typical in this context” can be a genuinely helpful question to have answered, in code and in natural language — for brainstorming, for seeing common conventions in an unfamiliar context, for having something crappy to react to.

> Alas, that does not remotely resemble how people are pitching this technology.

chrisjj · 2 years ago
> I actually think LLMs have the potential to be not only fun but genuinely useful. “Show me some bullshit that would be typical in this context”

Great. Solved the world shortage of bullshit. Oh wait...

lolinder · 2 years ago
FWIW, I'd expect mismatched parentheses to be particularly likely to occur in Lisp. There are very few forms of writing where two tokens () account for 30-50% of the total token size of the text. It's unsurprising that with even a tiny bit of temperature the model gets overzealous and places one too early or too late.

Add to that the fact that matching parentheses is hard for humans to reliably do correctly—especially when typing code into a textbox on Stack Overflow—and now you have a model whose architecture struggles with knowing when to stop in highly repetitive sequences and that was trained on data that is disproportionately likely to have incorrect syntax for the language you're using.

Other languages don't have those features, and I've never found Copilot to give me a syntax error in, say, JavaScript.

That said, I also don't know any professional developer who trusts LLMs to code. Everyone I know who uses them uses them as a tool and takes responsibility for cleaning up their output.

Leherenn · 2 years ago
Copilot likes to swallow my } in C++ for some reason. E.g. if you generate a new function at the end of a namespace block, it will usually use the closing bracket of the namespace block as the terminator of the function, leaving you one short.
chrisjj · 2 years ago
> FWIW, I'd expect mismatched parentheses to be particularly likely to occur in Lisp.

FWIW, this example had one more ) than ( !

I'll post it as a comment to the root.

lolinder · 2 years ago
Yep, that's entirely unsurprising. Generative AI in general has a hard time knowing when to stop highly repetitive patterns—image generators struggle with getting the right number of fingers, and transformers will often get stuck in loops if things get too repetitive.

Lisp is just a uniquely hard language for a transformer to cope with—not that it doesn't make mistakes in other languages, but I've almost never seen a syntax mistake.

mindcrime · 2 years ago
I'm with @delichon... I don't "trust" LLM generated code, I review it and decide how appropriate (if at all) it is, possibly modify it, and include it if it makes sense. Trust isn't really part of the equation.
jaredsohn · 2 years ago
I also don't trust code from stackoverflow, colleagues, or even code I had written myself in the past.

It's helpful to always be skeptical of code and try to get an understanding of it.

Another way I think of it is I find it easier to revise than to write (both in English and code). So when I write something I just try to get my ideas out there and then skeptically reread it and recompose. LLMs let me just speed up that process.

mindcrime · 2 years ago
> or even code I had written myself in the past.

Aaawwww hellz no. I hate looking at my old code. I've had too many of those surreal experiences where something breaks, I start looking at how to fix it, and am suddenly stunned to realize "this could never have worked in the first place." There's one or two of those I'm still puzzling over, decades later. :-(

nemothekid · 2 years ago
I don't know if OP meant "trust" as in free from security errors or bugs. I often struggle with "how is this a useful tool?". Fixing bugs in someone else's code seems like more a time sink than just writing the code myself.

I assume everyone else must be working on projects/languages that have far more representation in the training set, but for me, especially working in Rust, even trying to write interfaces into C libraries, I would constantly find myself dealing with hallucinated parameters or function calls, incorrectly implemented functions or just broken code.

At the very least if I do it myself, I'll have some intuition about how the code works.

mindcrime · 2 years ago
> I assume everyone else must be working on projects/languages that have far more representation in the training set,

I can't speak for anyone else, but I've used LLM's mainly for generating Java code. The generated code hasn't necessarily ever been perfect, but IME it's been "close enough" that even counting having to make some manual changes, it's still faster than writing it myself from scratch.

That said, I'm typically only using it for relatively short snippets of code, where I "know" how to do the operation in question, but want to avoid having to stop and look up fiddly details that might not be cached "top of mind" because I haven't written that code in a while.

I'm sure there are use-cases where LLM's don't save time, but I'm equally sure that there are at least some where they do.

pprotas · 2 years ago
Sounds like GPT 3.5 to me, version 4 is much better at code.

Also, as a general rule you shouldn't trust its code output, but it's a good way to brainstorm or get started on coding something. It can give useful tips or replace stack overflow. I almost never ask it to generate code for me, but I ask it questions about existing code very often. That seems to work very well for me.

StewardMcOy · 2 years ago
Sorry to hijack your reply, but I've been having a lot of trouble with ChatGPT4 for code. I don't actually incorporate LLM-generated code into my work, but I often converse with ChatGPT and check what it's capable of every couple weeks so I can stay informed.

By far, the most common problem is that I will ask ChatGPT 4 to write a small program that does X, and it will create a skeleton. Lots of functions that implement at most some parameter validation and then contain comments like // Implement the rest of this function.

With enough prodding, I can sometimes get it to actually fill in the function implementations one by one. But oftentimes it will implement part of it and then put another one of those "implement the rest" comments in. For example, if I asked it to write a function to draw three circles, after prodding, it would write code to draw the first circle and then stop. Obviously, the answer here is for it to create a function that draws one circle and call it three times, but I can't fault it for taking my instructions very literally.

Even when I do get it to implement full functions, when I later ask it for the full code listing, it's back to printing out functions that have "implement this" comments in them.

Perhaps the fault is mine for asking it to do tasks that there aren't hundreds of implementations of on the internet.

simonw · 2 years ago
I've seen lots of people complain about this (the "ChatGPT is being lazy" bug, which OpenAI themselves said they would look at late last year) but I've not seen it myself. I wonder if it comes down to prompting style? What kind of prompts are you using that get these responses?
bearjaws · 2 years ago
GPT4 still sucks if you are using anything modern.

Its knowledge of SvelteKit still seems weighted to pre-1.0; it hallucinates libraries for shadcn or radix-ui.

You have to prefix your prompts with "Please use modern JS" otherwise it won't use .filter or .reduce sometimes.

I agree it's a good replacement for Stack Overflow, but it is often pitched as a replacement for coding.

I've found using aider to be the best compromise: let it build something small and revert if it doesn't work.

water9 · 2 years ago
So basically, you’re trying to use stuff that came later than the data set ChatGPT was trained on and complaining it doesn’t know things that are, from its perspective, happening in the future...
chrisjj · 2 years ago
> Sounds like GPT 3.5 to me

It was.

> version 4

Paywalled.

> is much better at code.

I'll post the question as comment to root, for you to try if you want.