Note the massive context length (130k tokens). That makes sense, since it would be kinda pointless to generate a long CoT without enough context to contain both it and the reply.
EDIT: Here we are. My first prompt produced a CoT so long that the model catastrophically forgot the task (though I don't believe I was anywhere near 130k -- using Ollama with the fp16 model). I asked one of my test questions, a coding question totally unrelated to what the output says:
<QwQ output>
But the problem is in this question. Wait perhaps I'm getting ahead of myself.
Wait the user hasn't actually provided a specific task yet. Let me check again.
The initial instruction says:
"Please act as an AI agent that can perform tasks... When responding, first output a YAML data structure with your proposed action, then wait for feedback before proceeding."
But perhaps this is part of a system prompt? Wait the user input here seems to be just "You will be given a problem. Please reason step by step..." followed by a possible task?
</QwQ>
Note: Ollama "/show info" shows that the context size set is correct.
> Note: Ollama "/show info" shows that the context size set is correct.
That's not what Ollama's `/show info` is telling you. It actually just means that the model is capable of processing the context size displayed.
Ollama's behavior around context length is very misleading. There is a default context length limit parameter unrelated to the model's capacity, and I believe that default is a mere 2,048 tokens. Worse, when the prompt exceeds it, there is no error -- Ollama just silently truncates it!
If you want to use the model's full context window, you'll have to execute `/set parameter num_ctx 131072` in Ollama chat mode, or if using the API or an app that uses the API, set the `num_ctx` parameter in your API request.
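For example, a minimal sketch in Python against Ollama's HTTP chat endpoint (the "qwq" model tag and the default localhost port are assumptions; adjust to your setup):

import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "qwq",  # assumed model tag
        "messages": [{"role": "user", "content": "Why is the sky blue?"}],
        "stream": False,
        # Without this, Ollama falls back to its small default context window
        # and silently truncates anything longer.
        "options": {"num_ctx": 131072},
    },
)
print(resp.json()["message"]["content"])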
OK, this explains why QwQ works great on their chat. BTW, I have seen this multiple times: Ollama inference, for one reason or another, even without quantization, somehow had issues with the actual model performance. In one instance the same model at the same quantization level was great when run with MLX, while I got terrible results with Ollama. The point here is not Ollama itself, but that there is no testing at all for these models.
I believe that models should be released with test vectors at t=0, providing the expected output for a given prompt at full precision and at different quantization levels. And also, for specific prompts, the full output logits for a few tokens, so that it's possible to compute the error due to quantization or inference bugs.
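A rough sketch of what checking such a test vector could look like (the file names are made up; the point is just that per-token KL divergence against vendor-supplied reference logits would quantify quantization or inference drift):

import numpy as np

def kl_per_token(ref_logits, test_logits):
    # KL(p_ref || p_test) per position, computed from raw logits.
    def softmax(x):
        x = x - x.max(axis=-1, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=-1, keepdims=True)
    p, q = softmax(ref_logits), softmax(test_logits)
    return (p * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum(axis=-1)

# Hypothetical files: reference logits shipped with the model vs. the logits
# your own inference stack produces for the same prompt at t=0.
ref = np.load("reference_logits_fp16.npy")
mine = np.load("my_backend_logits.npy")
print("mean KL per token:", kl_per_token(ref, mine).mean())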
Ollama defaults to a context of 2048 regardless of model unless you override it with /set parameter num_ctx [your context length]. This is because long contexts make inference slower. In my experiments, QwQ tends to overthink and question itself a lot and generate massive chains of thought for even simple questions, so I'd recommend setting num_ctx to at least 32768.
In my experiments on a couple of mechanical engineering problems, it did fairly well on the final answers, correctly solving problems that even DeepSeek R1 (full size) and GPT-4o got wrong in my tests. However, the chain of thought was absurdly long, convoluted, circular, and all over the place. This also made it very slow, maybe 30x slower than comparably sized non-thinking models.
I used a num_ctx of 32768, top_k of 30, temperature of 0.6, and top_p of 0.95. These parameters (other than context length) were recommended by the developers on Hugging Face.
How do you make it so you don't have to do the parameter change on every load? Is there a better way, or is it kind of like setting num_ctx in that "you're just supposed to know"?
My understanding is that top_k and top_p are two different methods of filtering tokens during inference. top_k=30 considers only the 30 most likely tokens when selecting the next token to generate, and top_p=0.95 considers the smallest set of tokens whose cumulative probability reaches 0.95. You should only need to select one.
Edit: Looks like both work together. "Works together with top-k. A higher value (e.g., 0.95) will lead to more diverse text, while a lower value (e.g., 0.5) will generate more focused and conservative text. (Default: 0.9)"
Not quite sure how this is implemented - maybe one is preferred over the other when there are enough interesting tokens!
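For what it's worth, here's a toy sketch of one common way the two compose (illustrative only; real samplers differ in the exact order and normalization): apply the top-k cut first, then keep the smallest prefix of those tokens whose cumulative probability reaches top_p, and sample from what's left.

import numpy as np

def sample_next_token(logits, top_k=30, top_p=0.95, temperature=0.6, seed=0):
    rng = np.random.default_rng(seed)
    probs = np.exp((logits - logits.max()) / temperature)
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]       # tokens, most likely first
    keep = order[:top_k]                  # top-k cut
    cum = np.cumsum(probs[keep])
    keep = keep[: int(np.searchsorted(cum, top_p)) + 1]  # top-p (nucleus) cut
    p = probs[keep] / probs[keep].sum()   # renormalize and sample
    return rng.choice(keep, p=p)

print(sample_next_token(np.random.default_rng(1).normal(size=1000)))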
Yeah, it did the same in my case too. It did all the work in the <think> tokens but did not spit out the actual answer. I was not even close to 100K tokens.
If you did not change the context length, it is almost certainly still the ~2k default. In "/show info" there is a field "context length" which describes the model in general, while "num_ctx" under "parameters" is the context length for the specific chat.
I use modelfiles. I only use Ollama because it integrates easily with other stuff (e.g. Zed), and this way I can directly pick a model with a preset context size.
Nothing fancy here, just:
FROM qwq
PARAMETER num_ctx 100000
You save this somewhere as a text file, you run
ollama create qwq-100k -f path/to/that/modelfile
and you now have "qwq-100k" in your list of models.
Presently, vLLM only supports static YARN, which means the scaling factor remains constant regardless of input length, potentially impacting performance on shorter texts. We advise adding the rope_scaling configuration only when processing long contexts is required.
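In practice, "adding the rope_scaling configuration" means patching the checkpoint's config.json before serving. A hedged sketch (the factor-4 / 32k values are what I believe the Qwen/QwQ docs recommend, and the local path is hypothetical; double-check the model card):

import json, pathlib

cfg_path = pathlib.Path("QwQ-32B/config.json")  # hypothetical local checkpoint path
cfg = json.loads(cfg_path.read_text())
cfg["rope_scaling"] = {
    "type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
}
cfg_path.write_text(json.dumps(cfg, indent=2))
# With static YaRN this scaling is always on, which is why the advice above is
# to add it only when you actually need long contexts.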
that's interesting... i've been noticing similar issues with long context windows & forgetting. are you seeing that the model drifts more towards the beginning of the context or is it seemingly random?
i've also been experimenting with different chunking strategies to see if that helps maintain coherence over larger contexts. it's a tricky problem.
Neither lost-in-the-middle nor long-context performance has seen much improvement in the past year. It's not easy to generate long training examples that also stay meaningful, and all existing models still become significantly dumber after 20-30k tokens, particularly on hard tasks.
Reasoning models probably need some optimization constraint put on the length of the CoT, and also some priority constraint (only reason about things that need it).
If I had to guess, more tariffs and sanctions that increase the competing nation's self-reliance and harm domestic consumers. Perhaps my peabrain just can't comprehend the wisdom of policymakers on the sanctions front, but it just seems like all it does is empower the target long-term.
The tariffs are for the US to build its own domestic capabilities, but this will ultimately shift the rest of the world's trade away from the US and toward each other. It's a trade-off – no pun intended – between local jobs/national security and downgrading their own economy/geopolitical standing/currency. Anyone who's been making financial bets on business as usual for globalization is going to see a bit of a speed bump over the next few years, but in the long term it's the US taking an L to undo decades of undermining their own people's prospects by offshoring their entire manufacturing capability. Their trump card - still no pun intended - is their military capability, which the world will have to wean itself off first.
China’s strategy is to prevent any one bloc from achieving dominance and cutting off the others, while being the sole locus for the killer combination of industrial capacity + advanced research.
You're acting like these startups are controlled by the Chinese government. In reality, they're just like any other American startup. They make decisions on how to make the most money - not what the Chinese government wants.
>BTW I am Indian and we are not even in the race as country
Why are you surprised?
India was on a per capita basis poorer than sub-Saharan Africa until 2004.
The only reason India is no longer poorer than Africa is because the West (the IMF and World Bank) forced India to do structural reforms in 1991 that stopped the downward trajectory of the Indian economy since its 1947 independence.
India had the world's largest GDP at some point in its history. Why did India lose its status?
India is absolutely embarrassing. Could have been an extremely important 3rd party that obviates the moronic US vs China, us or them, fReEdOm vs communism narrative with all the talent it has.
A mathematician once told me that this might be because math teaches you to have different representations of the same thing; you then have to manipulate those abstractions and wander through their hierarchy until you find an objective answer.
They baited me into putting in a query and then asking me to sign up to submit it. Even have a "Stay Logged Out" button that I thought would bypass it, but no.
I get running these models is not cheap, but they just lost a potential customer / user.
super impressive. we won't need that many GPUs in the future if we can have the performance of DeepSeek R1 with even less parameters. NVIDIA is in trouble. We are moving towards a world of very cheap compute: https://medium.com/thoughts-on-machine-learning/a-future-of-...
Have you heard of Jevons paradox? It says that whenever new tech makes something more efficient, the tech just gets scaled up to make the product quality higher. Same here. DeepSeek has some algorithmic improvements that reduce the resources needed for the same output quality. But increasing resources (which are available) will increase the quality. There will always be a need for more compute. Nvidia is not in trouble. They have a monopoly on high-performing AI chips, for which demand will rise by at least a factor of 1000 in the upcoming years (my personal opinion).
Surprisingly, those open models might be a saviour for Apple and a gift for Qualcomm too. They can finetune them to their liking, catch up to the competition, and also sell more of their devices in the future. Long-term, even better models for Vision will have a problem competing with the latency of smaller models that are good enough but have very low latency. This will be important in robotics -- the reason Figure AI dumped OpenAI and started using their own AI models based on open source (the founder mentioned this recently in an interview).
It says "wait" (as in "wait, no, I should do X") so much while reasoning it's almost comical. I also ran into the "catastrophic forgetting" issue that others have reported - it sometimes loses the plot after producing a lot of reasoning tokens.
Overall though quite impressive if you're not in a hurry.
I read somewhere, which I can't find now, that for the reasoning models they trained heavily on saying "wait" so they can keep reasoning and not return early.
I do not understand why you would force a "wait" when the model wants to output </think>.
Why not just decrease the </think> probability? If the model really wants to finish, maybe it could overpower the bias in cases where it's a really simple question, and it would definitely allow the model to express its next thought more freely.
I have a suspicion it does use budget forcing. The word "alternatively" also frequently shows up, and it happens at points where it seems a </think> tag could logically have been placed.
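The "just decrease the </think> probability" idea is easy to prototype with a logits processor. A sketch using Hugging Face transformers (the penalty value is arbitrary, and it assumes </think> maps to a single token in QwQ's vocabulary, which is worth verifying):

from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          LogitsProcessor, LogitsProcessorList)

class EndThinkPenalty(LogitsProcessor):
    """Subtract a fixed penalty from the </think> logit at every step."""
    def __init__(self, end_think_id, penalty=5.0):
        self.end_think_id = end_think_id
        self.penalty = penalty

    def __call__(self, input_ids, scores):
        # The model can still close the thought if it is very confident,
        # unlike hard-appending "wait" to force more reasoning.
        scores[:, self.end_think_id] -= self.penalty
        return scores

tok = AutoTokenizer.from_pretrained("Qwen/QwQ-32B")
model = AutoModelForCausalLM.from_pretrained("Qwen/QwQ-32B", torch_dtype="auto")
end_think_id = tok.convert_tokens_to_ids("</think>")
inputs = tok("How many r's are in strawberry?", return_tensors="pt")
out = model.generate(
    **inputs,
    max_new_tokens=2048,
    logits_processor=LogitsProcessorList([EndThinkPenalty(end_think_id)]),
)
print(tok.decode(out[0], skip_special_tokens=False))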
39 GB if you use an fp8-quantized model. [1] Remember that your OS might be using some of that itself.
As far as I recall, Ollama/llama.cpp recently added a feature to page-in parameters - so you'll be able to go arbitrarily large soon enough (at a performance cost). Obviously more in RAM = more speed = more better.
The quantized model fits in about 20 GB, so 32 would probably be sufficient unless you want to use the full context length (long inputs and/or lots of reasoning). 48 should be plenty.
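The back-of-the-envelope arithmetic behind those figures (weights only; the KV cache for long contexts and runtime overhead add several more GB, so treat it as a rough lower bound):

PARAMS = 32e9  # QwQ-32B
for name, bytes_per_param in [("fp16", 2.0), ("fp8", 1.0), ("4-bit", 0.5)]:
    print(f"{name:>5}: ~{PARAMS * bytes_per_param / 1e9:.0f} GB of weights")
# fp16 ~64 GB, fp8 ~32 GB, 4-bit ~16 GB, which lines up with the ~39 GB and
# ~20 GB figures above once cache and overhead are added.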
I wonder if having a big mixture of experts isn't all that valuable for the type of tasks in math and coding benchmarks. Like my intuition is that you need all the extra experts because models store fuzzy knowledge in their feed-forward layers, and having a lot of feed-forward weights lets you store a longer tail of knowledge. Math and coding benchmarks do sometimes require highly specialized knowledge, but if we believe the story that the experts specialize to their own domains, it might be that you only really need a few of them if all you're doing is math and coding. So you can get away with a non-mixture model that's basically just your math-and-coding experts glued together (which comes out to about 32B parameters in R1's case).
MoE is likely temporary, a local optimum that currently resembles the bitter-lesson path. With time we'll likely distill what's important, shrink it, and keep it always active. There may be some dynamic retrieval of knowledge (but not intelligence) in the future, but it probably won't be anything close to MoE.
I think it will be more akin to o1-mini/o3-mini instead of r1. It is a very focused reasoning model good at math and code, but probably would not be better than r1 at things like general world knowledge or others.
Many humans would do that
These things are pretty interesting as they develop. What will the US do to retain its power?
BTW I am Indian and we are not even in the race as a country. :(
https://sc.mp/sr30f
20x smaller than DeepSeek! How small can these go? What kind of hardware can run this?
It's only logical.
They're pretty up to date with the latest models. $20 a month.
- 4bit: https://huggingface.co/mlx-community/QwQ-32B-4bit
- 6bit: https://huggingface.co/mlx-community/QwQ-32B-6bit
[1]: https://token-calculator.net/llm-memory-calculator
Device 1 [NVIDIA GeForce RTX 4090] MEM[||||||||||||||||||20.170Gi/23.988Gi]
Device 2 [NVIDIA GeForce RTX 4090] MEM[||||||||||||||||||19.945Gi/23.988Gi]
Though based on the responses here, it needs sizable context to work, so we may be limited to 4 bit (I'm on an M3 Max w/ 48gb as well).
I don't think we should believe anything like that.
Does each expert within R1 have 37B parameters? If so, is QwQ only truly competing against one expert in this particular benchmark?
Generally, I don't think I follow how MoE "selects" an expert during training or usage.
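As I understand it, R1's 37B figure is the number of parameters activated per token (the routed experts plus the always-on shared layers), not the size of any single expert, so QwQ isn't literally competing against one 37B expert. For the selection mechanism, here's a toy sketch of top-k routing (illustrative only; the expert count, sizes, and lack of load balancing are all simplifications):

import numpy as np

def moe_forward(x, gate_w, experts, k=2):
    """Route one token's hidden state x to its top-k experts."""
    logits = x @ gate_w                       # one gating score per expert
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                      # softmax over experts
    top = np.argsort(probs)[-k:]              # indices of the k best experts
    weights = probs[top] / probs[top].sum()   # renormalize their gate weights
    # Only the selected experts run, so only a fraction of the total
    # parameter count is "active" for this token.
    return sum(w * experts[i](x) for w, i in zip(weights, top))

# Toy usage: 8 tiny experts, pick 2 per token.
rng = np.random.default_rng(0)
d = 16
experts = [lambda x, W=rng.normal(size=(d, d)): x @ W for _ in range(8)]
gate_w = rng.normal(size=(d, 8))
print(moe_forward(rng.normal(size=d), gate_w, experts).shape)  # -> (16,)

During training the gate is learned jointly with the experts, usually with an auxiliary load-balancing term so routing doesn't collapse onto a few experts.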