Benchmarking GPT-4 Turbo – A Cautionary Tale

Aider has had an Exercism benchmarking suite for quite some time.

Interestingly, my benchmark results of GPT 4 Turbo show an opposite result: the new gpt-4-1106-preview did significantly better on the first try than the March and June models.

https://aider.chat/docs/benchmarks-1106.html

Aider benchmarks against the 133 Exercism python exercises, not js exercises that mentat's benchmark uses. So this is not an apples-to-apples comparison, but there doesn't seem to be a strong reason to expect qualitatively different results.

I also notice that the instructions prompt that mentat uses seems to be inspired by the aider benchmark? Glad to see others adopting similar benchmarking approaches.

https://github.com/AbanteAI/mentat/blob/main/tests/benchmark...

https://github.com/paul-gauthier/aider/blob/main/benchmark/p...

Edit: Not sure if the mentat authors are in this thread? After looking around a bit, there seems to be a bunch of aider code in your repo. Some attribution would be appreciated. It might even be required under aider's Apache 2.0 license?

biobootloader · 2 years ago

Hey Paul, I'm a Mentat author.

> I also notice that the instructions prompt that mentat uses seems to be inspired by the aider benchmark? Glad to see others adopting similar benchmarking approaches.

We were inspired by you to use Exercism as a benchmark, thank you! We will add attribution for that. We switched our original instruction prompts for that benchmark to be similar to Aiders to allow for fair comparison.

> After looking around a bit, there seems to be a bunch of aider code in your repo. Some attribution would be appreciated.

We have an unused implementation of your output response format (https://github.com/AbanteAI/mentat/blob/main/mentat/parsers/...), but I don't know what else you are seeing? We implemented that to compare with our response formats and didn't find much difference in performance.

anotherpaulg · 2 years ago

I didn't spend much time looking, but your benchmark prompting inspired me to search your repo for "aider". The results were 3 PRs where aider was mentioned in the conversations [0].

The "code map" PR in particular mentions being "inspired by aider", links to aider and seems to include a bunch of code from aider's old ctags based "repo map" implementation. This isn't an insignificant component of an AI coding tool.

Aider is open source and I try and share my learnings as I'm building it. So it's great when other projects get inspiration from aider! But it is polite to provide attribution for such inspiration, especially if you crib from code with an attribution license.

[0] https://github.com/search?q=repo%3AAbanteAI%2Fmentat+aider&t...

derwiki · 2 years ago

I’ve been using the new model with Aider since it was released, and my anecdata agrees—the “edits applied successfully “ failure rate is much lower than classic gpt4.

Also THANK YOU for Aider! I talk it up to all my programmer friends; it really feels like a glimpse into the future of coding.

hanselot · 2 years ago

Isn't it a good thing that of the benchmarks they ran, the newer model has fewer of the answers memorized (aka, its parroting less)?

Wouldn't this actually be exactly proof that the model has improved over its predecessor by having to solve the problem itself rather than rely on memory?

What use is a model that memorizes the answers to all the benchmarks (see the 7b models on open llm leaderboard for more info on that).

throwaway675309 · 2 years ago

I feel like I see this A LOT these days. If you do a Show HN (for example) and your project is directly inspired by somebody else's who came before you, the least you can do is give nominal attribution.

What is it about software development in particular that makes people so seemingly ethically unfettered by blatant plagiarism?

naiv · 2 years ago

I am also noticing a massive improvement over the old model

ja3k · 2 years ago

Sorry about that. We updated the blog with attribution and put an attributing comment in our code base where we use your benchmarking prompts. We'll probably delete our implementation of your response format later today since we just had it for benchmarking.

strich · 2 years ago

Does aider work with c# at all?

anotherpaulg · 2 years ago

Yes!

Thanks for asking. I've been meaning to address these kinds of questions in the aider FAQ [0]. Here's the entry I just added:

Aider supports pretty much all the popular coding languages. This is partly because GPT-4 is fluent in most mainstream languages, and familiar with popular libraries, packages and frameworks.

In fact, coding with aider is sometimes the most magical when you're working in a language that you are less familiar with. GPT often knows the language better than you, and can generate all the boilerplate to get to the heart of your problem. GPT will often solve your problem in an elegant way using a library or package that you weren't even aware of.

Aider uses tree-sitter to do code analysis and help GPT navigate larger code bases by producing a repository map [1].

Aider can currently produce repository maps for most mainstream languages, listed below. But aider should work quite well for other languages, even without repo map support.

  - C
  - C#
  - C++
  - Emacs Lisp
  - Elixir
  - Elm
  - Go
  - Java
  - Javascript
  - OCaml
  - PHP
  - Python
  - QL
  - Ruby
  - Rust
  - Typescript

[0] https://aider.chat/docs/faq.html#what-code-languages-does-ai...

[1] https://aider.chat/docs/repomap.html

> We designed a test for this theory: we reran the benchmarks without showing the models the instructions to each exercise. Instead, we just told them that they were Exercism exercises, and gave them the exercise names and function stubs.

This summarizes all my skepticism agains the AI field. Pretty clear that they aren't solving the problem, they have them memorized.

DecayingOrganic · 2 years ago

Memorization often gets a bad rap as the underachiever's shortcut. However, it's a fundamental component of any learning process! Our ability to reason, solve problems, and innovate is all built upon a foundation of memorized information. In fact, it's precisely the reason humans have thrived for so long; we were able to memorize and pass down knowledge culturally long before the written word, not because we were 100 times smarter than our nearest cousins. Without memorization, be it in our brains or AI algorithms, there's no foundation to build upon for higher reasoning.

viraptor · 2 years ago

It's hard to decide for me without seeing the data. Even if you don't know the exact exercise, seeing the title and the function name/parameters is often enough for me to guess what the challenge is. I checked the public questions on exercism and almost all of those (that I spot checked) that contained the function name were extremely obvious. Knowing it's a programming challenge would also improve my guessing chances.

For example the function stubs I can find are "value_of_card(<card>)" in exercise "Black Jack", or "generate_seat_letters(<number>)" in exercise "Plane Tickets". I think I could guess those without seeing the rest of the question.

drcode · 2 years ago

You can call it whatever you want, all I know is I used to write programs in lines of code, then blocks of code at a time, spit out by LLMs

Using GPT-4 Turbo yesterday, I feel like I'm moving to pages of code at a time now.

Taking the ideas in my head and turning them into reality is so easy now

GaggiX · 2 years ago

So how can it solve novel problems? Internet does not have all combinations for every possible task with any random programming language, library or constraints. It can even solve problems with non-existing programming languages and libraries if you describe them, if that's just memorization then I don't know what it isn't.

caesil · 2 years ago

If that's your takeaway from this then you really missed the point. The implication here is that gpt-4 to gpt-4-turbo represents a leap away from memorization and toward better reasoning with a more complete world model.

famouswaffles · 2 years ago

"They memorized all the problems" is not what was found here and still a wrong overcorrection.

dartos · 2 years ago

“Gpt-4 has more problems memorized than gpt-4 turbo” was exactly what was found here.

That doesn’t mean it’s only able to solve problems in its training set (tho it’s much better at that obviously.)

boxed · 2 years ago

If you are shown only the title of a coding problem and the site name where it's from, and you manage to solve it you are showing that you either cheated or knew the answer.

blovescoffee · 2 years ago

Ok but you understand there's a body of literature that shows that LLMs don't "just" memorize

Deleted Comment

gloosx · 2 years ago

+100 to that. My biggest scepticism is people actually creating a new problem while thinking they are solving problem. Don't get me wrong, translating natural language ideas into code is fun and all, the truth it is also code, yet in ambiguous language format given to the machine.

When did natural language became better for expressing development ideas than code? I know – when you don't know how to code in the first place. Then you will have to bet on all of the ambiguities of the language, cultural and meta-physical which words carry in order to hack your thing together instead of expressing yourself directly and explicitly.

Finally what is beautiful about strict code format we are so used to - it is truly the fastest and shortest path to get your thing done, in case you possess the knowledge needed.

doug_durham · 2 years ago

Natural language isn't superior to computer languages. NL allows you to describe a software concept in a computer language and framework neutral way. The LLM generates the code. The real benefit is when you work across languages and frameworks. It is difficult to keep all of the details of all of the framework calls in your head all of the time.

ericrallen · 2 years ago

That sounds a lot like gatekeeping.

These tools will empower folks who aren’t developers to build stuff and maybe learn a bit more about how programming works.

They will enable folks who have ideas, but can’t express them, to actually be able to create what they are imagining.

That’s awesome.

Code isn’t beautiful (except for a few rare exceptions). Creating something with code is.

m3kw9 · 2 years ago

From a black box point of view and one angle, gpt is a web filter where it will try to find you the exact thing you are looking for but from memory. Vs google you have to distill all the info into what you need

nathanfig · 2 years ago

"memorize" implies they can only recite things verbatim and that's ignoring the massive leap in being able to synthesize disjoint "memories" in new ways.

ren_engineer · 2 years ago

even if it's not true AI or even an architecture with the potential to become AI, LLMs are already good enough to provide real world value. Obviously "super autocomplete" isn't as sexy as true AI, but still very useful

rldjbpin · 2 years ago

if the benchmark means replicating the experience of taking technical interviews by most people, then this is a spot on approach and serves the potential user right.

mavhc · 2 years ago

LLMs are lossy compression

stoicfungi · 2 years ago

All models are, including humain brain.