mythz · a year ago
Qwen2.5 Coder 32B is great for an OSS model, but in my testing (ollama) Sonnet 3.5 yields noticeably better results, by a larger margin than the provided benchmarks suggest.

Best thing about it is that it's an OSS model that can be hosted by anyone, resulting in an open, competitive market that brings hosting costs down, currently sitting at $0.18 per million input tokens and $0.18 per million output tokens [1], making it 50x cheaper than Sonnet 3.5 and ~17x cheaper than Haiku 3.5.

[1] https://openrouter.ai/qwen/qwen-2.5-coder-32b-instruct

Vetch · a year ago
The Claude 3.5 Sonnets set a bar too high for other models to clear. No other model comes close, with the occasional exception of o1-preview. But o1-preview is always a gamble: your rolls are limited, and it will either give the best answer possible from an LLM or come back from a wild goose chase, having talked itself into a tangled mess of confusion.

I'd personally rank the Qwen2.5 32B model only a little behind GPT-4o at worst, and preferable to Gemini 1.5 Pro 002 (at code only; Gemini is surprisingly bad at code considering its top-class STEM reasoning).

This makes Qwen2.5-Coder-32B astounding, all things considered. It's really quite capable and is finally an accessible model that's useful for real work. I tested it on some linear algebra, discussed pros and cons of a belief-propagation-based approach to SAT solving, had it implement a fast, simple approximate nearest neighbor search based on the near-orthogonality of random vectors in high dimensions (in OCaml; not perfect, but close enough to be useful and easily correctable), had it simulate execution of a very simple recursive program (also OCaml), and had it write a basic post-processing shader for Unity. It did really well on each of those tasks.
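
For context, the ANN task boils down to random-hyperplane hashing; roughly something like this sketch (in Python here rather than OCaml, and all names are just illustrative):

    # sketch: random hyperplanes are ~orthogonal in high dimensions, so bucketing
    # points by the sign pattern of a few random projections groups nearby vectors.
    import numpy as np
    from collections import defaultdict

    rng = np.random.default_rng(0)
    dim, n_planes = 256, 16
    planes = rng.standard_normal((n_planes, dim))   # random hyperplane normals

    def signature(v):
        # 16-bit hash: one sign bit per random hyperplane
        bits = (planes @ v) > 0
        return int(np.packbits(bits).view(np.uint16)[0])

    # index a set of points by signature
    points = rng.standard_normal((10_000, dim))
    buckets = defaultdict(list)
    for i, p in enumerate(points):
        buckets[signature(p)].append(i)

    def query(q):
        # candidates share the query's bucket; rank by actual dot product
        # (a real ANN would probe several tables / neighbouring buckets too)
        cand = buckets.get(signature(q), [])
        return max(cand, key=lambda i: points[i] @ q, default=None)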

dragonsh · a year ago
I hadn't really tried Claude 3.5 before this; I tested it, then o1-preview on GitHub Models, and recently Qwen2.5 32B, with a prompt to generate a litestar [0] app to manage WYSIWYG content using grapesjs [1] and use pelican [2] to generate a static site. It generated very bad code and invented many libraries in the imports that don't exist; Claude was one of the worst code generators.

I then tried a sieve of Atkin to generate all primes up to N, followed by a Miller-Rabin test on each generated prime, both using all available CPU cores. Claude completely failed and never produced correct code without one error or another, especially around multiprocessing; o1-preview got it right on the first attempt, and Qwen2.5 32B got it right by the third round of error fixes. In general Claude is correct for very simple code, but it completely fails when using something new, while o1-preview performs much better. Try generating a manim community edition visualization with Claude: it produces something that doesn't work correctly or has errors, whereas o1-preview does a much better job.

In most of my tests o1-preview performed way better than Claude, and Qwen was not bad either.
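
For reference, this is roughly the shape of a working answer to the prime prompt (a quick Python sketch; a plain Eratosthenes sieve stands in for Atkin to keep it short):

    # sketch: sieve primes up to N, then re-check each with Miller-Rabin
    # across all CPU cores via a multiprocessing pool.
    from multiprocessing import Pool
    import os, random

    def sieve(n):
        is_prime = bytearray([1]) * (n + 1)
        is_prime[0:2] = b"\x00\x00"
        for p in range(2, int(n ** 0.5) + 1):
            if is_prime[p]:
                is_prime[p * p:n + 1:p] = bytearray(len(range(p * p, n + 1, p)))
        return [i for i, v in enumerate(is_prime) if v]

    def miller_rabin(n, rounds=20):
        if n < 2:
            return False
        if n in (2, 3):
            return True
        if n % 2 == 0:
            return False
        d, r = n - 1, 0
        while d % 2 == 0:
            d //= 2
            r += 1
        for _ in range(rounds):
            a = random.randrange(2, n - 1)
            x = pow(a, d, n)
            if x in (1, n - 1):
                continue
            for _ in range(r - 1):
                x = pow(x, 2, n)
                if x == n - 1:
                    break
            else:
                return False
        return True

    if __name__ == "__main__":
        primes = sieve(100_000)
        with Pool(os.cpu_count()) as pool:   # one worker per CPU core
            results = pool.map(miller_rabin, primes)
        print(all(results))                  # expect True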

[0] https://github.com/litestar-org/litestar

[1] https://grapesjs.com/

[2] https://getpelican.com/

mistercheph · a year ago
Isn't it a bit unreasonable to compare something free that you can run today on totally standard consumer hardware with *the* state of the art LLM that is probably >= 1T parameters?
mjwweb44 · a year ago
Yes
faangguyindia · a year ago
Too bad the Zed editor doesn't have code completion via custom LLMs.

So far I am using Zed and can't switch to something else until it gets the update that supports FIM (fill-in-the-middle) via custom LLMs.

aargh_aargh · a year ago
This question keeps popping up but I don't get it. Everyone and their dog has an OpenAI-compatible API. Why not just serve a local LLM and point api.openai.com at 127.0.0.1 in your hosts file?
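
For tools that let you change the endpoint it's even simpler than the hosts hack; e.g. the official Python client against Ollama's OpenAI-compatible API (a sketch):

    from openai import OpenAI

    # Ollama (like most local servers) exposes an OpenAI-compatible endpoint;
    # the api_key is required by the client but ignored by the local server.
    client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

    resp = client.chat.completions.create(
        model="qwen2.5-coder:32b",
        messages=[{"role": "user", "content": "Write a binary search in Python."}],
    )
    print(resp.choices[0].message.content)
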
satvikpendem · a year ago
You could...switch editors? Why not do that until Zed gets that support?
Tiberium · a year ago
Yeah, benchmarks are one thing but when you actually interact with the model it becomes clear very fast how "intelligent" the model actually is, by doing or noting small things that other models won't. 3.5 Sonnet v1 was great, v2 is already incredible.
LeoPanthera · a year ago
...but you can't run it locally. Not unless you're sitting on some monster metal. It's tiresome when people compare enormous cloud models to tiny little things. They're completely different.
mythz · a year ago
> ...but you can't run it locally. Not unless you're sitting on some monster metal.

I'm getting a very usable ~18 tok/s running it on 2x NVIDIA A4000 (32GB VRAM).

Both GPUs cost less than USD $1,400 on eBay.

qwen2.5-coder:32b is 19GB on ollama [1]
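
If you want to check throughput yourself, ollama's generate endpoint returns the raw token counts and timings (quick sketch):

    import requests

    # ask the local ollama server for a completion and compute decode speed
    r = requests.post("http://localhost:11434/api/generate", json={
        "model": "qwen2.5-coder:32b",
        "prompt": "Write a quicksort in Python.",
        "stream": False,
    }).json()

    # eval_count = generated tokens, eval_duration = decode time in nanoseconds
    print(round(r["eval_count"] / (r["eval_duration"] / 1e9), 1), "tok/s")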

[1] https://ollama.com/library/qwen2.5-coder:32b

guerrilla · a year ago
Yeah, over 65GB VRAM... that'd be expensive but not impossible. I think three RTX 4090's could do it, with their 24GB each.
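
That 65GB is basically just the unquantized weights; back-of-envelope, assuming a ~32.5B parameter count and ignoring KV cache and runtime overhead:

    # very rough: weights only
    params = 32.5e9                                # approx. parameter count of Qwen2.5-32B
    print(f"fp16:  {params * 2 / 1e9:.0f} GB")     # ~65 GB -> the multi-4090 scenario
    print(f"4-bit: {params * 0.5 / 1e9:.0f} GB")   # ~16 GB -> why the ollama quant fits on far less
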
tucnak · a year ago
The issue with some recent models is that they're basically overfitting on public evals, and it's not clear who's the biggest offender—OpenAI, or the Chinese? And regardless, "Mandelbrot in plaintext" is a bad choice for evaluation's sake. The public datasets are full of stuff like that. You really want to be testing stuff that isn't overfit to death, beginning with tasks that notoriously don't generalise all too well, all the while being most indicative of capability: like, translating a program that is unlikely to have been included in the training set verbatim, from a lesser-known language—to a well-known language, and back.

I'd be shocked if this model held up in the comprehensive private evals.

simonw · a year ago
That's why I threw in "same size as your terminal window" for the Mandelbrot demo - I thought that was just enough of a tweak to avoid exact regurgitation of some previously trained program.
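
Roughly the kind of program that prompt is fishing for (a sketch for illustration, not the model's actual output):

    # ASCII Mandelbrot scaled to the current terminal size
    import shutil

    cols, rows = shutil.get_terminal_size()
    rows -= 1                      # leave a line for the shell prompt
    chars = " .:-=+*#%@"
    max_iter = 60

    for y in range(rows):
        line = []
        for x in range(cols):
            c = complex(-2.5 + 3.5 * x / cols, -1.25 + 2.5 * y / rows)
            z, i = 0j, 0
            while abs(z) <= 2 and i < max_iter:
                z = z * z + c
                i += 1
            line.append(chars[min(i * len(chars) // max_iter, len(chars) - 1)])
        print("".join(line))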

I have not performed comprehensive evals of my own here - clearly - but I did just enough to confirm that the buzz I was seeing around this model appeared to hold up. That's enough for me to want to write about it.

tucnak · a year ago
Hey, Simon! Have you ever considered hosting private evals? I think, with the weight of the community behind you, you could easily accumulate a bunch of really high-quality, "curated" data, if you will. That is to say, people would happily send it to you. More people should self-host stuff like https://github.com/lm-sys/FastChat without revealing their dataset, I think, and we would probably trust it more than the public stuff, considering they already trust _you_ to some extent! So far the private eval scene is just a handful of guys on Twitter reporting their findings in an unsystematic manner, but a real grassroots approach backed by a respectable influencer would go a long way toward changing that.

Food for thought.

JimDabell · a year ago
> The issue with some recent models is that they're basically overfitting on public evals… You really want to be testing stuff that isn't overfit to death… I'd be shocked if this model held up in the comprehensive private evals.

From the announcement:

> we selected the latest 4 months of LiveCodeBench (2024.07 - 2024.11) questions as the evaluation, which are the latest published questions that could not have leaked into the training set, reflecting the model’s OOD capabilities.

https://qwenlm.github.io/blog/qwen2.5-coder-family/

tucnak · a year ago
They say a lot of things, like that their base models weren't instruction-tuned; however, people have found that it's practically impossible to come up with an instruction the base model won't follow, and the output indicates exactly that. The labs absolutely love incorporating public evals in their training; of course, they're not going to admit it.
isoprophlex · a year ago
All the big guys are hiring domain experts - serious brains, PhD level in some cases - to build bespoke training and test data for their models.

As long as Jensen Huang keeps shitting out Nvidia cards, progress is just a function of cash to burn on paying humans to dump their knowledge into training data... and hoping this silly transformer architecture keeps holding up.

tucnak · a year ago
> All the big guys are hiring domain experts - serious brains, PhD level in some cases

I don't know where this myth originated, and perhaps it was true at some point, but you just have to consider that all the recent major advances in datasets have had to do with _unsupervised_ reward models, synthetic generated datasets, and new, more advanced alignment methods. The big labs _are_ hiring serious PhD-level researchers, and most of these are physicists and Bayesians of every kind and breed, not "domain experts." However, perception matters a lot these days; some labs - I won't point fingers, but OpenAI is probably the biggest offender - simply cannot control themselves. The fact of the matter is they LOVE including the public evals in their finetuning, as it makes them appear stronger in the "benchmarks."

Wheatman · a year ago
Interestingly enough, the new "Orion" model from OpenAI doesn't outperform GPT-4 on programming tasks, and sometimes even underperforms it.

There is an interesting discussion about it here: https://news.ycombinator.com/item?id=42104964

kwlranb · a year ago
The PhD folks will steal from Stack Overflow and LeetCode solutions. Just another laundering buffer.

Hardly any PhD has the patience, or the skill for that matter, to code robust solutions from scratch. Just look at PhD code in the wild.

f1shy · a year ago
> like, translating a program that is unlikely to have been included in the training set verbatim, from a lesser-known language—to a well-known language, and back.

That's exactly what I would want to see, or "make a little interpreter for a basic subset of C, or Scheme, or <X>".

tucnak · a year ago
So far, non-English inputs have been the most telling: I deal with Ukrainian datasets mostly, and what we see is OpenAI models, the Chinese models of course, and Llama to an admittedly lesser extent, all degrading disproportionately compared to the other models. You know which model degrades the least, comparatively? Gemma 27B. The arena numbers would suggest it's not so strong, but they'd actually managed to make something useful for the whole world (I can only judge re: Ukrainian, of course, but I suspect it's probably equally good in other languages, too). However, nothing can currently compete with Sonnet 3.5 in reasoning. I predict a surge in the private eval scene when people inevitably grow wary of leaderboard propaganda.

More people should host https://github.com/lm-sys/FastChat

manamorphic · a year ago
I've heard conflicting things about it. Some claim it was trained to do well on benchmarks and is lacking in real-world scenarios. Can somebody confirm or deny?
dodslaser · a year ago
What else should you train for? If the benchmark doesn't represent real-world scenarios, isn't that a problem with the benchmark rather than the model?
isoprophlex · a year ago
If your benchmark covers all possible programming tasks then you don't need an LLM, you need search over your benchmark.

Hypothetically, let's say the benchmark contains "test divisibility of this integer by n" for all n of the form 3x+1. An extremely overfit LLM won't be able to code divisibility for n not of the form 3x+1, and your benchmark will never tell you.

LeoPanthera · a year ago
This is called Goodhart's law, after Charles Goodhart, who said: "Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes."

But in modern usage it is often rephrased as: "When a measure becomes a target, it ceases to be a good measure."

https://en.wikipedia.org/wiki/Goodhart%27s_law

exitb · a year ago
Overfitting is a concern.
anonzzzies · a year ago
It's small... and for that size it does very well. I've been using it a few days, and it's quite good given its size and the fact that you can run it locally. So I'm not sure what you say is true; for us it works really well.
csomar · a year ago
I tried the Qwen2.5 32B a couple weeks ago. It was amazing for a model that can run on my laptop but far from Claude/GPT-4o. I am downloading the coder tuned version now.
tyler33 · a year ago
I tried Qwen and it is surprisingly good; maybe not as good as Claude, but it could replace it.
fareesh · a year ago
I like the idea of offline LLMs, but in practice there's no way I'm wasting battery life on running a language model.

On a desktop too, I wonder if it's worth the additional stress and heat on my GPU as opposed to one somewhere in a datacenter which will cost me a few dollars per month, or a few cents per hour if I spin up the infra myself on demand.

Super useful for confidential / secret work though

InsideOutSanta · a year ago
In my experience, a lot of companies still have rules against using tools like Copilot due to security and copyright concerns, even though many software engineers just ignore them.

This could be a way to satisfy both sides, although it only solves the issue of sending internal data to companies like OpenAI; it doesn't solve the "we might accidentally end up with somebody else's copyrighted code in our code base" issue.

dizhn · a year ago
What provider/service do you use for this?

kundi · a year ago
It seems fine-tuned for benchmarks more than for actual tasks.
joseferben · a year ago
@simonw what is the token/s like on your 64gb m2 mbp?
simonw · a year ago
With MLX:

    Prompt: 49 tokens, 95.691 tokens-per-sec
    Generation: 723 tokens, 10.016 tokens-per-sec
    Peak memory: 32.685 GB

joseferben · a year ago
so quite usable, thanks!
nenadst · a year ago
Try it with something more "obscure", e.g. creating an OPC UA server from scratch.

It fails spectacularly - the code won't even compile, and many things are missing to get a working solution.

simonw · a year ago
Can you share a transcript?