chrishare · 2 years ago
Credit where credit is due, Meta has had a fantastic commitment towards open source ML. You love to see it.
joshspankit · 2 years ago
Yes but: if the commitment is driven by internal researchers and coders standing firm about making their work open source (a rumour I’ve heard a couple times), most of the credit goes to them.
mvkel · 2 years ago
Wasn't LLaMA originally a leak that they were then forced to spin into an open source contribution?

Not to diminish the value of the contribution, but "commitment" is an interesting word choice.

hnfong · 2 years ago
It wasn't a leak in the typical sense. They sent the weights to pretty much everyone who asked nicely.

When you send something interesting to thousands of people without vetting their credentials, you'd expect the stuff to get "leaked" out eventually (and sooner rather than later).

I'd say it's more appropriate to say the weights were "pirated" than "leaked".

That said, you're probably correct that the community that quickly formed around the "pirated" weights might have influenced Zuckerberg's decision to make Llama 2 more freely accessible.

Aissen · 2 years ago
Redistributable, free-to-use weights do not make a model open source (even if it's really nice, given how few people have access to that kind of training power).
satvikpendem · 2 years ago
Meta has source available licenses, not open source ones.
simonw · 2 years ago
Here's the model on Hugging Face: https://huggingface.co/codellama/CodeLlama-70b-hf
israrkhan · 2 years ago
I hope someone will soon post a quantized version that I can run on my macbook pro.
annjose · 2 years ago
Ollama has released the quantized version.

https://ollama.ai/library/codellama:70b

https://x.com/ollama/status/1752034686615048367?s=20

Just run `ollama run codellama:70b` - pretty fast on a MacBook.

LVB · 2 years ago
I'm not very plugged into how to use these models, but I do love and pay for both ChatGPT and GitHub Copilot. How does one take a model like this (or a smaller version) and leverage it in VS Code? There's a dizzying array of GPT wrapper extensions for VS Code, many of which either seem like kind of junk (10 d/ls, no updates in a year), or just lead to another paid plan, at which point I might as well just keep my GH Copilot. Curious what others are doing here for Copilot-esque code completion without Copilot.
ado__dev · 2 years ago
You can try it with Sourcegraph Cody. https://sourcegraph.com/cody

And instructions on how to change the provider to use Ollama w/ whatever model you want:

- Install and run Ollama. Put ollama in your $PATH, e.g. ln -s ./ollama /usr/local/bin/ollama

- Download Code Llama 70b: ollama pull codellama:70b

- Update Cody's VS Code settings to use the unstable-ollama autocomplete provider.

- Confirm Cody uses Ollama by looking at the Cody output channel or the autocomplete trace view (in the command palette).

- Update the Cody settings to use "codellama:70b" as the Ollama model

https://github.com/sourcegraph/cody/pull/2635

just_testing · 2 years ago
Hey, thanks for the tip.

One issue, though: I took a look at the Cody website and it looks like you can't get unlimited completions even when self-hosting an LLM.

I understand you guys have a business model and need to make money out of it. I'm just asking because I work as a teacher and I have students who can't pay an extra subscription and/or students who want to hack into stuff.

petercooper · 2 years ago
https://continue.dev/ is a good place to start.
speedgoose · 2 years ago
Continue doesn’t support tab completion like Copilot yet.

A pull/merge request is being worked on: https://github.com/continuedev/continue/pull/758

jondwillis · 2 years ago
Bonus points for being able to use local models!
israrkhan · 2 years ago
This looks really good.
sestinj · 2 years ago
beat me to the punch : )
sestinj · 2 years ago
I’ve been working on continue.dev, which is completely free to use with your own Ollama instance / TogetherAI key, or for a while with ours.

I was testing with CodeLlama-70b this morning and it's clearly a step up from other open-source models.

dan_can_code · 2 years ago
How do you test a 70B model locally? I've tried to query, but the response is super slow.
SparkyMcUnicorn · 2 years ago
There are some projects that let you run a self-hosted Copilot server, then you set a proxy for the official Copilot extension.

https://github.com/fauxpilot/fauxpilot

https://github.com/danielgross/localpilot

water-data-dude · 2 years ago
When I was setting up a local LLM to play with, I stood up my own OpenAI-API-compatible server using llama-cpp-python. I installed the Copilot extension and set OverrideProxyUrl in the advanced settings to point to my local server, but Copilot obstinately refused to let me do anything until I'd signed in to GitHub to prove that I had a subscription.

I don’t _believe_ that either of these lets you bypass that restriction (although I’d love to be proven wrong), so if you don’t want to sign up for a subscription you’ll need to use something like Continue.

israrkhan · 2 years ago
I tried fauxpilot with my own llama.cpp instance, but it didn't work out of the box. I filed a GitHub issue but got no traction, and eventually gave up. This was around 5 months ago; things might have improved by now.
apapapa · 2 years ago
Free Bard is better than free ChatGPT... Not sure about paid versions
ignoramous · 2 years ago
Bard's censorship is annoying. One thing I've found (free) Bard to be better than the rest at is summarizing book chapters, manuals, and docs. It is also surprisingly good at translation (X to English), as it often adds context to what it's translating.

With careful prompt engineering, you can get a lot out of free Bard, except when it's censored.

dxxvi · 2 years ago
I'm learning Rust. It seems to me that Bard is better than the GPT-4 I use for free at work.
raxxorraxor · 2 years ago
I use the plugin Twinny in conjunction with ollama to host the models. Easy setup and quite powerful assistance. You need a decent rig though, since you don't want any latency for features like autocomplete.

But even if you don't have a faster rig, you can still leverage it for slower tasks to generate docs or tests.

Twinny should really be more popular; I didn't find a more powerful no-bullshit plugin for VS Code.

marinhero · 2 years ago
You can download it and run it with [this](https://github.com/oobabooga/text-generation-webui). There's an API mode that you could leverage from your VS Code extension.
cmgriffing · 2 years ago
I've been using Cody by Sourcegraph and liking it so far.

https://sourcegraph.com/cody

turnsout · 2 years ago
Given how good some of the smaller code models are (such as Deepseek Coder at 6.7B), I'll be curious to see what this 70B model is capable of!
ignoramous · 2 years ago
AlphaCodium is the newest kid on the block that's SoTA pass@5 on coding tasks (authors claim at least 2x better than GPT4): https://github.com/Codium-ai/AlphaCodium

As for small models, Microsoft has been making noise with the unreleased WaveCoder-Ultra-6.7b (https://arxiv.org/abs/2312.14187).

passion__desire · 2 years ago
AlphaCodium author says he should have used DSPy

https://twitter.com/talrid23/status/1751663363216580857

Deleted Comment

hackerlight · 2 years ago
Is this better than GPT4's Grimoire?
eurekin · 2 years ago
Are weights available?
jasonjmcghee · 2 years ago
My personal experience is that Deepseek far exceeds code llama of the same size, but it was released quite a while ago.
turnsout · 2 years ago
Agreed—I hope Meta studied Deepseek's approach. The idea of a Deepseek Coder at 70B would be exciting.
CrypticShift · 2 years ago
Phind [1] uses the larger 34B Model. Still, I'm also curious what they are gonna do with this one.

[1] https://news.ycombinator.com/item?id=38088538

martingoodson · 2 years ago
Baptiste Roziere gave a great talk about Code Llama at our meetup recently: https://m.youtube.com/watch?v=_mhMi-7ONWQ

I highly recommend watching it.

pandominium · 2 years ago
Everyone mentions using a 4090 and a smaller model, but I rarely see an analysis that factors in energy consumption.

I think Copilot is already heavily subsidized by Microsoft.

Say you use Copilot around 30% of your daily work hours. How many kWh does an open-source 7B or 13B model then use in a month on one 4090?

EDIT:

I think for a 13B at 30% use per day it comes to around $30/mo on the energy bill.

So an even smaller but still capable model could probably beat the Copilot monthly subscription.
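The rough $30/month figure above can be sanity-checked with simple arithmetic. This is a minimal sketch; the average GPU draw, duty cycle, and electricity price below are assumptions for illustration, not measured values:

```python
# Monthly energy-cost sketch for local inference on one RTX 4090.
# Assumed numbers: average draw under load (the 4090's TDP is 450 W, but
# inference is bursty), 30% of an 8-hour workday, and a $/kWh price.

GPU_WATTS = 350           # assumed average draw while generating
HOURS_PER_DAY = 8 * 0.30  # 30% of an 8-hour workday
PRICE_PER_KWH = 0.30      # assumed electricity price in $/kWh

kwh_per_month = GPU_WATTS / 1000 * HOURS_PER_DAY * 30
cost_per_month = kwh_per_month * PRICE_PER_KWH

print(f"{kwh_per_month:.1f} kWh/month, ~${cost_per_month:.2f}/month")
```

Under these assumptions the result lands well under $30/month; getting to $30 requires something closer to continuous full-TDP draw or a higher tariff, which supports the point that a smaller model can undercut the subscription.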

Retric · 2 years ago
Subscription models are generally subsidized by people barely using them. So I wouldn’t be surprised if the average is closer to 10%.
MacsHeadroom · 2 years ago
Using a model 30% of the day is only maybe 100 instances of use, with each lasting for about 6 seconds.

So really you're looking at using the GPU for around 10 minutes a day.

Monthly cost is pennies.

Lapha · 2 years ago
Running models locally using GPU inference shouldn't be too bad as the biggest impact in terms of performance is ram/vram bandwidth rather than compute. Some rough power figures for a dual AMD GPU setup (24gb vram total) on a 5950x (base power usage of around 100w) using llama.cpp (i.e., a ChatGPT style interface, not Copilot):

46b Mixtral q4 (26.5 gb required) with around 75% in vram: 15 tokens/s - 300w at the wall, nvtop reporting GPU power usage of 70w/30w, 0.37kWh

46b Mixtral q2 (16.1 gb required) with 100% in vram: 30 tokens/s - 350w, nvtop 150w/50w, 0.21kWh.

Same test with 0% in vram: 7 tokens/s - 250w, 0.65kWh

7b Mistral q8 (7.2gb required) with 100% in vram: 45 tokens/s - 300w, nvtop 170w, 0.12kWh

The kWh figures are an estimate for generating 64k tokens (around 35 minutes at 30 tokens/s), it's not an ideal estimate as it only assumes generation and ignores the overhead of prompt processing or having longer contexts in general.
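The per-64k-token estimates above follow from one formula: energy is wall power times generation time. A small sketch reproducing them (the wattage and tokens/s inputs are the parent comment's own measurements; outputs land within a few percent of its figures, since the originals also include some prompt-processing overhead):

```python
# Energy to generate N tokens = wall power (W) * time (s), converted to kWh.

def kwh_for_tokens(watts, tokens_per_s, n_tokens=64_000):
    seconds = n_tokens / tokens_per_s
    return watts * seconds / 3600 / 1000  # W*s -> kWh

print(kwh_for_tokens(300, 15))  # Mixtral q4, 75% in vram  -> ~0.36 kWh
print(kwh_for_tokens(350, 30))  # Mixtral q2, 100% in vram -> ~0.21 kWh
print(kwh_for_tokens(250, 7))   # 0% in vram               -> ~0.63 kWh
print(kwh_for_tokens(300, 45))  # Mistral 7b q8            -> ~0.12 kWh
```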

The power usage essentially mirrors token generation speed, which shouldn't be too surprising. The more of the model you can load into fast vram, the faster tokens will generate and the less power you'll use for the same amount of tokens generated.

Also note that I'm using mid and low tier AMD cards, with the mid tier card being used for the 7b test. If you have an Nvidia card with fast memory bandwidth (i.e., a 3090/4090), or an Apple ARM Ultra, you're going to see in the region of 60 tokens/s for the 7b model. With a mid range Nvidia card (any of the 4070s), or an Apple ARM Max, you can probably expect similar performance on 7b models (45 t/s or so).

Apple ARM probably wins purely on total power usage, but you're also going to be paying an arm and a leg for a 64gb model, which is the minimum you'd want to run medium/large sized models with reasonable quants (46b Mixtral at q6/8, or 70b at q6). With the rate models are advancing, you may be able to get away with 32gb (Mixtral at q4/6, 34b at q6, 70b at q3).

I'm not sure how many tokens a Copilot style interface is going to churn through, but it's probably in the same ballpark. A reasonable figure for either interface at the high end is probably a kWh a day, and even in expensive regions like Europe it's probably no more than $15/mo.

The actual cost comparison then becomes a little complicated: spending $1500 on 2 3090s for 48gb of fast vram isn't going to make sense for most people, and similarly, making do with whatever cards you can get your hands on (so long as they have a reasonable amount of vram) probably isn't going to pay off in the long run.

It also depends on the size of the model you want to use and what amount of quantisation you're willing to put up with. Current 34b models or Mixtral at reasonable quants (q4 at least) should be comparable to ChatGPT 3.5; future local models may get better (either in generation speed or in how smart they are), but ChatGPT 5 may blow everything we have now out of the water. It seems far too early to make purchasing decisions based on what may happen, but most people should be able to run 7b/13b and maybe up to 34/46b models with what they have and not break the bank when it comes time to pay the power bill.

fennecfoxy · 2 years ago
No need for hard math: compare using a Copilot-style LLM (short bursts of 100% GPU every so often) vs gaming on your 4090 (running at 100% for hours).
theLiminator · 2 years ago
Curious what's the current SOTA local copilot model? Are there any extensions in vscode that give you a similar experience? I'd love something more powerful than copilot for local use (I have a 4090, so I should be able to run a decent number of models).
sfsylvester · 2 years ago
This is a completely fair but open-ended question. Not to be a typical HN user, but when you say SOTA local, the real question is which benchmarks you care about for evaluation: size, operability, complexity, explainability, etc.

Working out what copilot models perform best has been a deep exercise for myself and has really made me evaluate my own coding style on what I find important and things I look out for when investigating models and evaluating interview candidates.

I think three benchmarks & leaderboards most go to are:

https://huggingface.co/spaces/bigcode/bigcode-models-leaderb... - which is the most understood, broad language capability leaderboad that relies on well understood evaluations and benchmarks.

https://huggingface.co/spaces/mike-ravkine/can-ai-code-resul... - Also comprehensive, but primarily assesses Python and JavaScript.

https://evalplus.github.io/leaderboard.html - which I think is a better take on comparing models you intend to run locally as you can evaluate performance, operability and size in one visualisation.

Best of luck and I would love to know which models & benchmarks you choose and why.

vwkd · 2 years ago
> when investigating models and evaluating interview candidates

Wow, just realized, in the future employers will mostly interview LLMs instead of people.

hackerlight · 2 years ago
> https://huggingface.co/spaces/mike-ravkine/can-ai-code-resul... - Also comprehensive, but primarily assesses Python and JavaScript.

I wonder why they didn't use DeepSeek under the "senior" interview test. I am curious to see how it stacks up there.

theLiminator · 2 years ago
I'm honestly more interested in anecdotes and I'm just seeking anything that can be a drop-in copilot replacement (that's subjectively better). Perhaps one major thing I'd look for is improved understanding of the code in my own workspace.

I honestly don't know what benchmarks to look at or even what questions to be asking.

Eisenstein · 2 years ago
When this 70b model gets quantized you should be able to run it fine on your 4090. Check out 'TheBloke' on Hugging Face, and use llama.cpp to run the GGUF files.
coder543 · 2 years ago
I think your take is a bit optimistic. I like quantization as much as the next person, but even the 2-bit model won’t fit entirely on a 4090: https://huggingface.co/TheBloke/Llama-2-70B-GGUF

I would be uncomfortable recommending less than 4-bit quantization on a non-MoE model, which is ~40GB on a 70B model.
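The sizing argument is easy to check back-of-envelope: approximate weight size is parameter count times bits per weight. This sketch uses that naive formula only; real GGUF files run larger (a 2-bit Q2_K of a 70B model is around 29 GB, since some tensors stay at higher precision, and the KV cache needs room on top), which is why even "2-bit" doesn't actually fit on a 4090:

```python
# Naive quantized weight size: params * bits / 8, ignoring higher-precision
# tensors and KV cache, so real files are larger than these figures.

def weights_gb(params_b, bits):
    return params_b * 1e9 * bits / 8 / 1e9  # GB

for bits in (2, 4, 8):
    size = weights_gb(70, bits)
    fits = "fits" if size <= 24 else "does not fit"
    print(f"70B at {bits}-bit: ~{size:.1f} GB -> {fits} in a 4090's 24 GB")
```

The 4-bit row comes out at 35 GB, consistent with the ~40 GB figure above once real-world overhead is added.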

siilats · 2 years ago
We made a Jetbrains plugin called CodeGPT to run this locally https://plugins.jetbrains.com/plugin/21056-codegpt
bredren · 2 years ago
Are seamless conversations still handled, using the truncation method described in #68?

I was curious whether some kind of summary or compression of old exchanges, flagged as such, might let the app remember things that had been discussed but had fallen outside the token limit.

It could then request key details lost during summarization to bring them back into the new context.

I had thought ChatGPT was doing something like this, but I haven't read about it.