Interesting study, but unfortunately already outdated. I don't think anybody uses ChatGPT 3.5 for coding anymore.
From my personal experience, Claude 3.5 or GPT-4o work best. They're more like coding assistants, not really capable of writing anything beyond very simple programs on their own. They make a lot of mistakes, and you need to know how to debug the code they produce.
Claude is my favorite but it just randomly added a division by 100,000 to a line of code for no discernible reason. According to Claude it was "an oversight on [Claude's] part".
I was using 3.5 until about a month ago. Now I'm using 4o. 4o is better, but it's not a huge difference, which surprised me, as I expected a big improvement. I haven't tried the regular "4" model yet; I've heard it used to be quite good but maybe got worse.
In any case, these models are good for simple stuff as far as I can tell, but can't do anything hard or off the beaten path. They can give me ideas for solving hard issues, but any code they generate is killed by hallucinations and a general lack of anything resembling contextual knowledge. 4o can be quite obstinate too: even if I explain that I'm trying to solve an unusual problem and need out-of-the-box "thinking", it will try to force me back to some mainstream solution that I can't use.
I also use Amazon's Q and that is good for simple/repetitive automation but often generates extremely wacky stuff otherwise.
They are all apparently better at Python than anything else. I don't use that so maybe that's an issue.
Maybe my reading comprehension is awful, but I don't see it mentioned anywhere in the article that the "ChatGPT" from the study is the worst at coding of the 3 models that people commonly use.
It seems relevant to mention that the "ChatGPT" in the article isn't the one most of us are using for coding.
that was cracking me up... they took a data set that no doubt has tons of discussion and documentation out there and used that as the basis for the research...
And to add to this, they used a method for testing coding ability that programmers think very little of, since it doesn't show how good someone is at programming, only at leetcoding.
These days, I'm using Claude 3.5 to create WordPress plug-ins, with some successes and some failures. What is sure, though, is that Claude is miles better than ChatGPT at creating WordPress plug-ins. The last 10 times I tried to create one with ChatGPT, I had a total WordPress failure.
Definitely a pet peeve of mine that “ChatGPT” entered the popular lexicon as the name for all OpenAI models, when it’s just a web interface for running LLMs, not a specific model, and the different models have vastly different capabilities.
I also find that ChatGPT works well for documentation. I don't need help coding either; like you, it's my core skill.
However, I am not a great writer, and not a native English speaker, and I find that ChatGPT is better than I am at this. It is, after all, a large language model: it is really good with words, which is what it is designed to do, more so than problem solving. I usually feed it my code and let it document it; if it gets something wrong, I correct it, add some context, and so on. Essentially, ChatGPT is my editor (the job, not the software).
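The feed-it-code-and-let-it-document workflow described above can even be scripted. A minimal sketch using OpenAI's Python SDK; the model name, prompt wording, and file name are illustrative assumptions, not anything the comment specifies:

```python
# Sketch: ask a chat model to write documentation for a source file.
# Assumes the `openai` package is installed and OPENAI_API_KEY is set.
# Prompt text and model name are illustrative, not from the comment.

def build_doc_prompt(source: str, context: str = "") -> list[dict]:
    """Build the chat messages asking the model to document `source`."""
    system = ("You are a technical writer. Document the following code: "
              "add a summary, parameter descriptions, and usage notes. "
              "Do not change the code's behavior.")
    user = source if not context else f"Context: {context}\n\n{source}"
    return [{"role": "system", "content": system},
            {"role": "user", "content": user}]

def document_code(source: str, context: str = "") -> str:
    """Send the code to a chat model and return its documentation draft."""
    from openai import OpenAI  # imported lazily so the helper above works offline
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o",  # any chat-capable model works here
        messages=build_doc_prompt(source, context),
    )
    return resp.choices[0].message.content

if __name__ == "__main__":
    # Hypothetical usage: document a local module, with a bit of extra context.
    with open("my_module.py") as f:
        print(document_code(f.read(), context="Part of a small CLI tool."))
```

The draft still needs the correct-and-add-context pass the comment describes; the model's output is a starting point, not a finished doc.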
Scoff all you want, but this 40-year hobby coder's projects have gotten geometrically better in design and functionality, on OSs and languages I've never touched before, whether it's VBA, SwiftUI or DEBUG.COM.
This means they are simply overfitting the training dataset, which also increases the likelihood of their producing output nearly identical to their training data and violating copyright.
I do not understand why this was flagged; this research gives important information on LLMs that contrasts with the marketing fluff we're seeing out of big tech.
We knew that already, but it is good to have an academic publication to link to.
I don't need help coding. That's my core skill!
But I do need a lot of help with how packages/libraries/languages work, and ChatGPT can usually give a decent answer in a minute that could have taken me hours of deeply frustrating searches.
I look at hard-to-understand code pretty often. Maybe ChatGPT can be helpful there too.
I won't blindly trust either partner, and at least ChatGPT isn't insulted when I check if it is right :)
https://github.com/microsoft/MS-DOS/