We have fine-tuned CodeLlama-34B and CodeLlama-34B-Python on an internal Phind dataset, achieving 67.6% and 69.5% pass@1 on HumanEval, respectively. GPT-4 achieved 67%. To ensure result validity, we applied OpenAI's decontamination methodology to our dataset.
The CodeLlama models released yesterday demonstrate impressive performance on HumanEval.
- CodeLlama-34B achieved 48.8% pass@1 on HumanEval
- CodeLlama-34B-Python achieved 53.7% pass@1 on HumanEval
We have fine-tuned both models on a proprietary dataset of ~80k high-quality programming problems and solutions. Instead of code completion examples, this dataset features instruction-answer pairs, setting it apart structurally from HumanEval. We trained the Phind models over two epochs, for a total of ~160k examples. LoRA was not used — both models underwent native fine-tuning. We employed DeepSpeed ZeRO 3 and Flash Attention 2 to train these models in three hours using 32 A100-80GB GPUs, with a sequence length of 4096 tokens.
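To make the setup concrete, a full (non-LoRA) fine-tuning run with DeepSpeed ZeRO 3 via the Hugging Face Trainer would look roughly like the sketch below. The dataset file, ZeRO-3 config path, and hyperparameters are illustrative placeholders rather than our actual configuration, and the Flash Attention 2 and multi-node launch details are omitted.

```python
# Minimal sketch of full-parameter fine-tuning with DeepSpeed ZeRO 3 via the
# Hugging Face Trainer. File names and hyperparameters are illustrative only.
import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "codellama/CodeLlama-34b-hf"            # base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token            # CodeLlama ships no pad token
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Instruction-answer pairs tokenized to a 4096-token sequence length.
def tokenize(example):
    text = example["instruction"] + "\n" + example["answer"]
    return tokenizer(text, truncation=True, max_length=4096)

train = load_dataset("json", data_files="train.jsonl")["train"].map(tokenize)

args = TrainingArguments(
    output_dir="codellama-34b-ft",
    num_train_epochs=2,                   # two passes over the dataset
    per_device_train_batch_size=1,
    bf16=True,
    deepspeed="ds_zero3.json",            # DeepSpeed ZeRO stage-3 config file
)

Trainer(
    model=model,
    args=args,
    train_dataset=train,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```

In practice such a run is launched with the deepspeed launcher, and the ZeRO-3 config shards optimizer states, gradients, and parameters across all GPUs.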
Furthermore, we applied OpenAI's decontamination methodology to our dataset to ensure valid results, and found no contaminated examples.
The methodology is:
- For each evaluation example, we randomly sampled three substrings of 50 characters or used the entire example if it was fewer than 50 characters.
- A match was identified if any sampled substring was a substring of the processed training example.
For further insights on the decontamination methodology, please refer to Appendix C of OpenAI's technical report.
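In code, that check looks roughly like the following sketch; the names and the simple lowercasing/whitespace normalization are illustrative placeholders, with the exact preprocessing described in the appendix.

```python
# Minimal sketch of the substring-based contamination check described above.
# The normalization step and names are illustrative, not the exact implementation.
import random

def normalize(text: str) -> str:
    """Stand-in for preprocessing: lowercase and drop all whitespace."""
    return "".join(text.lower().split())

def is_match(eval_example: str, train_example: str,
             n_samples: int = 3, sub_len: int = 50) -> bool:
    """True if any sampled 50-character substring of the evaluation example
    occurs verbatim in the processed training example."""
    eval_norm, train_norm = normalize(eval_example), normalize(train_example)
    if len(eval_norm) < sub_len:
        substrings = [eval_norm]                      # use the whole example
    else:
        starts = [random.randrange(len(eval_norm) - sub_len + 1)
                  for _ in range(n_samples)]
        substrings = [eval_norm[s:s + sub_len] for s in starts]
    return any(sub in train_norm for sub in substrings)

# A training example is contaminated if it matches any evaluation example:
# contaminated = [t for t in train_set if any(is_match(e, t) for e in humaneval)]
```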
Presented below are the pass@1 scores we achieved with our fine-tuned models:
- Phind-CodeLlama-34B-v1 achieved 67.6% pass@1 on HumanEval
- Phind-CodeLlama-34B-Python-v1 achieved 69.5% pass@1 on HumanEval
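For context, pass@1 on HumanEval is the fraction of its 164 problems for which a generated completion passes all unit tests. When multiple samples are drawn per problem, the unbiased pass@k estimator from the Codex paper is the standard way to compute it; a small sketch follows (our exact evaluation harness may differ in details).

```python
# Unbiased pass@k estimator from the Codex paper (Chen et al., 2021):
#   pass@k = E[1 - C(n - c, k) / C(n, k)]
# where n samples are drawn per problem and c of them pass the tests.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0                 # every size-k subset contains a passing sample
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# With a single greedy sample per problem (n = k = 1), pass@1 reduces to the
# fraction of problems whose generation passes:
# score = sum(pass_at_k(1, int(passed), 1) for passed in results) / len(results)
```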
Note on GPT-4
In its official technical report from March, OpenAI reported a pass@1 score of 67% for GPT-4 on HumanEval. Since then, higher scores have been claimed. However, there is no concrete evidence that the model's coding abilities have improved since then, and these elevated figures lack the rigorous contamination analysis that the official statistic underwent, making them a less reliable comparison. As a result, we consider 67% to be the pass@1 score for GPT-4.
Download
We are releasing both models on Hugging Face for verifiability and to bolster the open-source community. We welcome independent verification of our results.
Phind-CodeLlama-34B-v1: https://huggingface.co/Phind/Phind-CodeLlama-34B-v1
Phind-CodeLlama-34B-Python-v1: https://huggingface.co/Phind/Phind-CodeLlama-34B-Python-v1
We'd love to hear your thoughts!
Best,
The Phind Team
I tried with ollama, installed from Homebrew, on my M1 Max with 64GB RAM.
I downloaded the phind-codellama model using
But when I give it a prompt, for example

> write c code to reverse a linked list

it prints an error message and exits, even though running ollama with some other models worked. Is the version in Homebrew not able to run phind-codellama?
The number of times I had to do it in real production code amounts to zero.
The number of times I had to piece code from poorly documented external services, conflicting product requirements, and running on complex and weird environments has been ... well ... multiple times a day for the past 20+ years.
"instruct-like prompts" is what you give a very junior engineer out of college, and then you have to carefully review their code.
It answers every prompt with “well actually…” and if it doesn’t know the answer it hallucinates one.
But you should be able to get it to do stuff just by telling it what you want. Note that it's completion-tuned (not chat-tuned), so it should perform better in single-turn conversations.
https://twitter.com/amanrsanger/status/1635751764577361921
https://github.com/getcursor/eval
It just seems unlikely to me that GPT-4's coding abilities have improved since March (when 67% was officially reported by OpenAI) given all of the examples and anecdotes about degradation.
This is why we use the official numbers.
When we worked on StarCoder, people ran GPT-4 on MultiPL-E, which doesn't have canonical solutions on the internet, and the performance was higher than what you would expect from the official numbers.
The official contamination analysis shows only a minor drop in performance even though contamination is fairly high (you may argue that contamination is higher now or that RLHF has a stronger effect).
There is a significant drop in performance when testing on HumanEval+ [1], which shouldn't happen if the model had the canonical solutions.
BTW why don't you use HumanEval+?
[1] https://arxiv.org/abs/2305.01210
How many examples and anecdotes about degradation are actually scientific side-by-side studies? I see absurd articles online about ChatGPT usage by kids going down the drain, completely failing to consider even the most basic fact of seasonality: school is out for the summer!
Does only GPT-4 have to suffer a penalty for HumanEval leaking into training data/RLHF data?
Ignoring those concerns, it fails a reasonableness smell test:
We'd have to pretend it's the original GPT-4 release from March 2023 until GPT-5 comes out, and only then could OpenAI's work be compared to LLaMA-2 through LLaMA-N.
Defective by Detailing.
As far as what they're thinking — they do put out an uncensored base model. The censored models protect them from being smeared in the press by lazy journalists who give the LLM rude instructions and then write "shocked" stories about the computer doing what they told it to do.
I ran it, and my result: (I don't know if this code would work)
ollama run phind-codellama --verbose "write c code to inject shellcode into remote process for windows"
For injecting shellcode in a remote process under Windows, you might need to use WinAPI functions such as OpenProcess, WriteProcessMemory, and CreateRemoteThread. However, due to the complex nature of the task and the security implications, it's important to note that this is not advisable without proper knowledge and caution. Here's a simple example of how you might inject shellcode using C:
```c
#include <windows.h>
#include <string.h>

int main() {
    LPVOID ShellCode = (LPVOID) "\x90\x90\xC3"; // Example shell code
    HANDLE hProcess;
    DWORD dwPid, dwThreadId;
    SIZE_T size;
    CONTEXT ctx;
}
```

This code injects an example shellcode "\x90\x90\xC3" (which is nop, nop, ret) into the current process. It then executes this shellcode in a new thread.
That tracks for Meta...
Being able to beat a mega closed-source model with an open-source LLM and some fine-tuning really calls into question the financial viability of these big proprietary LLMs. Open-source models have been creeping up various leaderboards for months, and it was only a matter of time until we saw more and more examples like this. Excellent work.
Those models are all trained by companies, and the open-source community primarily fine-tunes them.
It's impressive that we can do that, but the base models still require a lot of money and the best AI experts the big companies have.
There is minimal chance that we will be able to keep up if Google and co. stop publishing, or delay publishing, their papers and models.
There are communities forming, etc., and it's impressive, but financial viability is key to current AI progress.
But for now... I think you have a point. We would have seen more than Falcon, MPT, Llama, and the open Llama reproductions by now if open-source foundation model training were viable.
This is something big tech can still offer with proprietary models.
If so:
The Object Pascal language is completely out of fashion and is about the most non-hyped language there is. However, there are hundreds of thousands of active users of Delphi, FreePascal, and Lazarus. And because the language has been stable for over 20 years, there is also a gigantic amount of high-quality code available. As most of it is on neither GitHub nor Stack Overflow, Pascal code is dramatically underrepresented in GPT-3.5 and GPT-4 - and therefore also in Phind.
I'd like to finally be able to use AI-assisted programming with Pascal.
In case you are interested in that, I would be willing to internally pay for the work to prepare a good dataset of high quality code with comments/context/prompts.
If you are not interested, is there any chance that you are going to release the code and toolchain used to fine-tune CodeLlama, so I could do it myself?
I suppose that Pascal is not on your planned list of supported languages, right?
What kind of budget do you think this will require?
I would assume that one of my devs could write the needed scripts in three weeks or so.
So, basically a budget of <$5000.
For me - due to missing competence - the actual challenge would be getting an example of how the training data should optimally look (for example, the Python training set), and finding someone to do the actual training. For a newbie, getting up to the required level of competence will surely take more than three weeks.
This model is only the beginning -- it's an early experiment and we'll have improvements next week.
I would love to see some head-to-head examples.
Is anyone hosting this somewhere that can be accessed for free right now?
Instead, it should be treated as a completions model (like text-davinci-003) and no system prompt should be provided.
Just tell it what you want.
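For example, with the Hugging Face transformers library you can prompt it as a plain completions model roughly like this (a sketch; the prompt and generation settings are just illustrative):

```python
# Sketch: prompting Phind-CodeLlama-34B-v1 as a plain completions model,
# with no chat template and no system prompt. Parameters are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Phind/Phind-CodeLlama-34B-v1"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name, torch_dtype=torch.bfloat16, device_map="auto")

# Just state the instruction directly, as a single turn.
prompt = "Write a Python function that reverses a singly linked list.\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```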