These two phases have quite different performance characteristics - prefill can fully saturate the GPU. For long contexts, it can be nigh impossible to do it all in a single pass, which is why frameworks like vLLM use a technique called "chunked prefill".
The decode phase, by contrast, does little compute per token and tends not to saturate the GPU.
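In case it helps, here's a rough sketch of the idea - not vLLM's actual code, just the Hugging Face transformers interface with an arbitrary chunk size: feed the prompt in slices and grow the KV cache each time.

```python
# Toy chunked prefill: run the prompt through the model in slices,
# extending the KV cache each step instead of one giant forward pass.
import torch

def chunked_prefill(model, input_ids, chunk_size=512):
    past, logits = None, None
    for start in range(0, input_ids.shape[1], chunk_size):
        chunk = input_ids[:, start:start + chunk_size]
        with torch.no_grad():
            out = model(chunk, past_key_values=past, use_cache=True)
        past, logits = out.past_key_values, out.logits
    # The cache now covers the whole prompt; logits[:, -1] predicts
    # the first generated token, same as a single-pass prefill.
    return past, logits[:, -1]
```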
If you are serving these models, you really want large batch sizes during inference, which only really comes with scale - for a smaller app, you won't want to make users wait that long for a batch to fill.
So a long context only has to be processed _once_ per request, which makes prefill basically a scheduling problem.
But the number of decode passes scales linearly with the output length. If output length were unlimited, some requests could be _always_ present in an inference batch, reducing throughput for everyone.
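Toy illustration of the slot-hogging effect (a made-up scheduler, not any real framework's code):

```python
# Toy continuous-batching loop: each step decodes one token for every
# active request; finished requests free their slot for waiting ones.
from collections import deque

def serve(requests, batch_slots=4):
    waiting = deque(requests)   # (request_id, tokens_to_generate)
    active = {}
    step = 0
    while waiting or active:
        while waiting and len(active) < batch_slots:
            rid, n = waiting.popleft()
            active[rid] = n
        # one "decode pass" for the whole batch
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                print(f"step {step}: request {rid} done")
                del active[rid]
        step += 1

# One request with a huge output budget pins a slot for the whole run,
# which is why servers cap max output tokens.
serve([("a", 3), ("b", 3), ("hog", 50), ("c", 3), ("d", 3)])
```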
Decode speed is generally memory-bandwidth bound, while prefill is typically arithmetic bound. That's the reason for mixed batches (both decode and prefill in one batch) - it lets you saturate both memory bandwidth and arithmetic.
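Back-of-the-envelope version of that claim (assumed numbers: fp16 weights at 2 bytes/param, ~2 FLOPs per param per token):

```python
# Rough arithmetic intensity of a weight-matrix multiply: each parameter
# is read once per pass but does 2 FLOPs per token in flight.
def flops_per_byte(tokens_in_flight):
    flops = 2 * tokens_in_flight   # per parameter
    bytes_read = 2                 # per parameter (fp16), read once per pass
    return flops / bytes_read

print(flops_per_byte(1))     # decode, batch 1: 1 FLOP/byte  -> bandwidth bound
print(flops_per_byte(4096))  # prefill of a 4k prompt: 4096 FLOPs/byte -> compute bound
# A modern GPU needs on the order of 100+ FLOPs/byte to be compute bound.
```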
Chunked prefill is about minimizing latency for the decode entries in the same batch. It's not needed if you have only one request - in that case it's fastest to just prefill in one chunk.
I'm pretty sure the sibling comment is right about the different length limits - it's down to training, and the model starts talking nonsense if you let it run too long.
It is also a training issue: the model has to be trained to produce longer outputs, which has a quadratic train-time cost and requires suitable long-response training data.
70B+ models typically run great on my MacBook's 96GB of (V)RAM. I want a Mac Studio to run e.g. llama-405B, but I can't justify the marginal model quality ROI for like $7k or whatever. (But I waaant iiit!)
It would be nice to have comparisons to Claude 3.5 Sonnet for the coder model; only comparing to open-source models isn't super helpful, because I want to compare against the model I'm currently using for development work.
Here is a comparison of the prompt "I want to create a basic Flight simulator in Bevy and Rust. Help me figure out the core properties I need for take off, in air flight and landing" between Claude Sonnet 3.5 and Qwen2.5-14B-Instruct-Q4_K_M.gguf:
I'm impressed by the scope of this drop.
The raw intelligence of open models seems to be falling behind closed ones. But I think that's because frontier models from OpenAI and Anthropic are not just raw models - they probably include stuff like CoT, best-of-N sampling, or control vectors.
> we are inspired by the recent advancements in reinforcement learning (e.g., o1)
It will be interesting to see what the future brings as models incorporate chain-of-thought approaches, and whether o1 gets outperformed by open-source models.
The first phase is referred to as "prefill", where the input is processed to create the KV cache.
After that, the "decode" phase runs auto-regressively: each decode pass yields one new token.
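For anyone who wants to see the two phases concretely, here's a minimal sketch with the transformers library (the model choice is just an example, and it uses greedy decoding for simplicity):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

input_ids = tok("The capital of France is", return_tensors="pt").input_ids

# Prefill: one pass over the whole prompt builds the KV cache.
with torch.no_grad():
    out = model(input_ids, use_cache=True)
past = out.past_key_values
next_token = out.logits[:, -1].argmax(dim=-1, keepdim=True)

# Decode: auto-regressive, one token per forward pass, reusing the cache.
generated = [next_token]
with torch.no_grad():
    for _ in range(16):
        out = model(next_token, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_token = out.logits[:, -1].argmax(dim=-1, keepdim=True)
        generated.append(next_token)

print(tok.decode(torch.cat(generated, dim=-1)[0]))
```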
This post on [Inference Memory Requirements](https://huggingface.co/blog/llama31#inference-memory-require...) is quite good.
70B is just a littttle rough trying to run without offloading some layers to the CPU.
Not sure if 128GB VRAM is enough for running 405B (maybe at 3-bit quant?), but it seems to offer great value for running 70B models at 8-bit.
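Weights-only math (ignoring KV cache and activations) suggests 3-bit 405B doesn't quite fit:

```python
def weights_gb(params_billion, bits):
    # GB of raw weights: params * bits per param / 8 bits per byte
    return params_billion * bits / 8

print(weights_gb(405, 3))  # ~152 GB -> too big for 128 GB even at 3-bit
print(weights_gb(405, 2))  # ~101 GB -> ~2-bit could squeeze in, quality aside
print(weights_gb(70, 8))   #   70 GB -> 8-bit 70B fits with headroom for KV cache
```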
How many tokens/second is that approx?
For reference, Qwen 2.5 32B on CPU (5950X) with GPU offloading (to an RTX 3090 Ti) gets about 8.5 tokens/s, while 14B (fully on GPU) gets about 64 tokens/s.
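Those numbers roughly track a crude bandwidth model - decode has to stream the weights once per token, so tokens/s is bounded by bandwidth divided by model size (ballpark specs and sizes below, not measurements):

```python
def decode_ceiling(bandwidth_gb_s, weights_gb):
    # Upper bound on decode tokens/s if every token streams all weights once.
    return bandwidth_gb_s / weights_gb

print(decode_ceiling(1008, 9))  # RTX 3090 Ti VRAM, ~9 GB 14B Q4 -> ~112 t/s ceiling (64 observed)
print(decode_ceiling(50, 20))   # dual-channel DDR4, ~20 GB 32B Q4 -> ~2.5 t/s CPU-only
```

Partial GPU offload lands in between the two ceilings, which is consistent with the 8.5 tokens/s figure.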
[1]: https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5/Qwen...
https://gist.github.com/victorb/7749e76f7c27674f3ae36d791e20...
AFAIK, there aren't any (micro)benchmark comparisons out yet.
Remarkable that it is at all comparable to Sonnet 3.5
Ctrl F - Code Reasoning:
I remember when GPT-3 was trained on 300B tokens.