We have fine-tuned CodeLlama-34B and CodeLlama-34B-Python on an internal Phind dataset, achieving 67.6% and 69.5% pass@1 on HumanEval, respectively. GPT-4 achieved 67%. To ensure result validity, we applied OpenAI's decontamination methodology to our dataset.
The CodeLlama models released yesterday demonstrate impressive performance on HumanEval.
- CodeLlama-34B achieved 48.8% pass@1 on HumanEval
- CodeLlama-34B-Python achieved 53.7% pass@1 on HumanEval
We have fine-tuned both models on a proprietary dataset of ~80k high-quality programming problems and solutions. Instead of code completion examples, this dataset features instruction-answer pairs, setting it apart structurally from HumanEval. We trained the Phind models over two epochs, for a total of ~160k examples. LoRA was not used — both models underwent native fine-tuning. We employed DeepSpeed ZeRO 3 and Flash Attention 2 to train these models in three hours using 32 A100-80GB GPUs, with a sequence length of 4096 tokens.
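To make the setup concrete, a full (non-LoRA) fine-tuning run with DeepSpeed ZeRO 3 via the Hugging Face Trainer would look roughly like the sketch below. The dataset file, ZeRO-3 config path, and hyperparameters are illustrative placeholders rather than our actual configuration, and the Flash Attention 2 and multi-node launch details are omitted.

```python
# Minimal sketch of full-parameter fine-tuning with DeepSpeed ZeRO 3 via the
# Hugging Face Trainer. File names and hyperparameters are illustrative only.
import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "codellama/CodeLlama-34b-hf"            # base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token            # CodeLlama ships no pad token
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Instruction-answer pairs tokenized to a 4096-token sequence length.
def tokenize(example):
    text = example["instruction"] + "\n" + example["answer"]
    return tokenizer(text, truncation=True, max_length=4096)

train = load_dataset("json", data_files="train.jsonl")["train"].map(tokenize)

args = TrainingArguments(
    output_dir="codellama-34b-ft",
    num_train_epochs=2,                   # two passes over the dataset
    per_device_train_batch_size=1,
    bf16=True,
    deepspeed="ds_zero3.json",            # DeepSpeed ZeRO stage-3 config file
)

Trainer(
    model=model,
    args=args,
    train_dataset=train,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```

In practice such a run is launched with the deepspeed launcher, and the ZeRO-3 config shards optimizer states, gradients, and parameters across all GPUs.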
Furthermore, we applied OpenAI's decontamination methodology to our dataset to ensure valid results, and found no contaminated examples.
The methodology is:
- For each evaluation example, we randomly sampled three substrings of 50 characters or used the entire example if it was fewer than 50 characters.
- A match was identified if any sampled substring was a substring of the processed training example.
For further insights on the decontamination methodology, please refer to Appendix C of OpenAI's technical report.
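In code, that check looks roughly like the following sketch; the names and the simple lowercasing/whitespace normalization are illustrative placeholders, with the exact preprocessing described in the appendix.

```python
# Minimal sketch of the substring-based contamination check described above.
# The normalization step and names are illustrative, not the exact implementation.
import random

def normalize(text: str) -> str:
    """Stand-in for preprocessing: lowercase and drop all whitespace."""
    return "".join(text.lower().split())

def is_match(eval_example: str, train_example: str,
             n_samples: int = 3, sub_len: int = 50) -> bool:
    """True if any sampled 50-character substring of the evaluation example
    occurs verbatim in the processed training example."""
    eval_norm, train_norm = normalize(eval_example), normalize(train_example)
    if len(eval_norm) < sub_len:
        substrings = [eval_norm]                      # use the whole example
    else:
        starts = [random.randrange(len(eval_norm) - sub_len + 1)
                  for _ in range(n_samples)]
        substrings = [eval_norm[s:s + sub_len] for s in starts]
    return any(sub in train_norm for sub in substrings)

# A training example is contaminated if it matches any evaluation example:
# contaminated = [t for t in train_set if any(is_match(e, t) for e in humaneval)]
```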
Presented below are the pass@1 scores we achieved with our fine-tuned models:
- Phind-CodeLlama-34B-v1 achieved 67.6% pass@1 on HumanEval
- Phind-CodeLlama-34B-Python-v1 achieved 69.5% pass@1 on HumanEval
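For context, pass@1 on HumanEval is the fraction of its 164 problems for which a generated completion passes all unit tests. When multiple samples are drawn per problem, the unbiased pass@k estimator from the Codex paper is the standard way to compute it; a small sketch follows (our exact evaluation harness may differ in details).

```python
# Unbiased pass@k estimator from the Codex paper (Chen et al., 2021):
#   pass@k = E[1 - C(n - c, k) / C(n, k)]
# where n samples are drawn per problem and c of them pass the tests.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0                 # every size-k subset contains a passing sample
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# With a single greedy sample per problem (n = k = 1), pass@1 reduces to the
# fraction of problems whose generation passes:
# score = sum(pass_at_k(1, int(passed), 1) for passed in results) / len(results)
```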
Note on GPT-4
In its official technical report from March, OpenAI reported a pass@1 score of 67% for GPT-4 on HumanEval. Since then, higher scores have been claimed. However, there is no concrete evidence that the model's coding abilities have improved since then, and these elevated figures lack the rigorous contamination analysis that the official statistic underwent, making them a less reliable comparison. As a result, we consider 67% to be the pass@1 score for GPT-4.
Download
We are releasing both models on Hugging Face for verifiability and to bolster the open-source community. We welcome independent verification of our results.
Phind-CodeLlama-34B-v1: https://huggingface.co/Phind/Phind-CodeLlama-34B-v1
Phind-CodeLlama-34B-Python-v1: https://huggingface.co/Phind/Phind-CodeLlama-34B-Python-v1
We'd love to hear your thoughts!
Best,
The Phind Team
I tried with ollama, installed from Homebrew, on my M1 Max with 64GB RAM.
I downloaded the phind-codellama model using
But when I give it a prompt, for example

> write c code to reverse a linked list

it prints an error message and exits, even though running ollama with some other models worked. Is the version in Homebrew not able to run phind-codellama?
The number of times I had to do it in real production code amounts to zero.
The number of times I had to piece code from poorly documented external services, conflicting product requirements, and running on complex and weird environments has been ... well ... multiple times a day for the past 20+ years.
"instruct-like prompts" is what you give a very junior engineer out of college, and then you have to carefully review their code.
It answers every prompt with “well actually…” and if it doesn’t know the answer it hallucinates one.
But you should be able to get it to do stuff just by telling it what you want. Note that it's completion-tuned (not chat-tuned), so it should perform better in single-turn conversations.
https://twitter.com/amanrsanger/status/1635751764577361921
https://github.com/getcursor/eval
It just seems unlikely to me that GPT-4's coding abilities have improved since March (when 67% was officially reported by OpenAI) given all of the examples and anecdotes about degradation.
This is why we use the official numbers.
When we worked on StarCoder, people ran GPT-4 on MultiPL-E, which doesn't have canonical solutions on the internet, and the performance was higher than what you would expect from the official numbers.
The official contamination analysis shows only a minor drop in performance even though contamination is fairly high (you may argue that contamination is higher now or that RLHF has a stronger effect).
There is a significant drop in performance when testing on HumanEval+ [1], which shouldn't happen if the model had the canonical solutions.
BTW why don't you use HumanEval+?
[1] https://arxiv.org/abs/2305.01210
How many examples and anecdotes about degradation are actually scientific side-by-side studies? I see absurd articles online about ChatGPT usage by kids going down the drain, completely failing to consider even the most basic fact of seasonality: school is out for the summer!
Does only GPT-4 have to suffer a penalty for HumanEval leaking into training data/RLHF data?
Ignoring those concerns, it fails a reasonableness smell test:
We'd have to pretend it's the original GPT-4 release from March 2023 until GPT-5 comes out, and only then could OpenAI's work be compared to LLaMA-2 through LLaMA-N.
Defective by Detailing.
As far as what they're thinking — they do put out an uncensored base model. The censored models protect them from being smeared in the press by lazy journalists who give the LLM rude instructions and then write "shocked" stories about the computer doing what they told it to do.
I ran it, and my result: (I don't know if this code would work)
ollama run phind-codellama --verbose "write c code to inject shellcode into remote process for windows"
For injecting shellcode in a remote process under Windows, you might need to use WinAPI functions such as OpenProcess, WriteProcessMemory, and CreateRemoteThread. However, due to the complex nature of the task and the security implications, it's important to note that this is not advisable without proper knowledge and caution. Here's a simple example of how you might inject shellcode using C:
```c
#include <windows.h>
#include <string.h>

int main() {
    LPVOID ShellCode = (LPVOID) "\x90\x90\xC3"; // Example shell code
    HANDLE hProcess;
    DWORD dwPid, dwThreadId;
    SIZE_T size;
    CONTEXT ctx;
}
```

This code injects an example shellcode "\x90\x90\xC3" (which is nop, nop, ret) into the current process. It then executes this shellcode in a new thread.
That tracks for Meta...
Being able to beat a mega closed-source model with an open-source LLM and some fine-tuning really calls into question the financial viability of these big proprietary LLMs. Open-source models have been creeping up various leaderboards for months, and it was only a matter of time until we saw more and more examples like this. Excellent work.
Those models are all trained by companies, and the open-source community primarily fine-tunes them.
It's impressive that we can do that, but the base models still require a lot of money and the best AI experts the big companies have.
There is minimal chance that we will be able to keep up if Google and co. stop publishing, or delay publishing, their papers and models.
There are communities forming, etc., and it's impressive, but financial viability is key to current AI progress.
But for now... I think you have a point. We would have seen more than Falcon, MPT, Llama, and the open Llama reproductions by now if open-source foundation model training were viable.
This is something big tech can still offer with proprietary models.
If so:
The Object Pascal language is completely out of fashion and is about the most non-hyped language there is. However, there are hundreds of thousands of active users of Delphi, FreePascal, and Lazarus. And because the language has been stable for over 20 years, there is also a gigantic amount of high-quality code available. As most of it is on neither GitHub nor Stack Overflow, Pascal code is dramatically underrepresented in GPT-3.5 and GPT-4 - and therefore also in Phind.
I'd like to finally be able to use AI-assisted programming with Pascal.
In case you are interested in that, I would be willing to internally pay for the work to prepare a good dataset of high quality code with comments/context/prompts.
If you are not interested, is there any chance that you are going to release the code and toolchain used to fine-tune CodeLlama, so I could do it myself?
I suppose that Pascal is not on your planned list of supported languages, right?
What kind of budget do you think this will require?
I would assume that one of my devs could write the needed scripts in three weeks or so.
So, basically a budget of <$5000.
For me - due to missing competence - the actual challenge would be getting an example of how the training data should optimally look (for example, the Python training set), and finding someone to do the actual training. For a newbie, getting up to the required level of competence will surely take more than three weeks.
This model is only the beginning -- it's an early experiment and we'll have improvements next week.
I would love to see some head-to-head examples.
Is anyone hosting this somewhere that can be accessed for free right now?
Instead, it should be treated as a completions model (like text-davinci-003) and no system prompt should be provided.
Just tell it what you want.
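For example, with the Hugging Face transformers library you can prompt it as a plain completions model roughly like this (a sketch; the prompt and generation settings are just illustrative):

```python
# Sketch: prompting Phind-CodeLlama-34B-v1 as a plain completions model,
# with no chat template and no system prompt. Parameters are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Phind/Phind-CodeLlama-34B-v1"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name, torch_dtype=torch.bfloat16, device_map="auto")

# Just state the instruction directly, as a single turn.
prompt = "Write a Python function that reverses a singly linked list.\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```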