In the case of gpt-oss 120B that would mean sqrt(5*120) ≈ 24B.
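For anyone who wants to play with it, here's the rule of thumb as a one-liner. Note this is a community heuristic (an MoE model is said to perform roughly like a dense model of sqrt(total × active) parameters), not an exact law, and the ~5B active-parameter figure is taken from the arithmetic above:

```python
import math

# Heuristic sketch: a mixture-of-experts model is sometimes said to behave
# like a dense model with sqrt(total * active) parameters. The ~5B active
# figure for gpt-oss 120B comes from the comment above, not a spec sheet.
total_b, active_b = 120, 5
print(f"~{math.sqrt(total_b * active_b):.0f}B dense-equivalent")  # ~24B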
This is very important both for cache locality and for minimizing garbage collector pressure.
gpt-oss 120B - 37 tok/sec (with CPU offloading, doesn't fit in the GPU entirely)
Qwen3 32B - 65 tok/sec
Qwen3 30B-A3B - 150 tok/sec
(all at 4-bit)
Only if 120B fits entirely in the GPU. Otherwise, for me, with a consumer GPU that has only 32 GB of VRAM, gpt-oss 120B is actually almost twice as slow as Qwen3 32B (37 tok/sec vs. 65 tok/sec).
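A back-of-the-envelope check of why 120B doesn't fit. This is a rough sketch of weight storage only; it ignores KV cache, activations, and runtime overhead, which make real usage higher:

```python
# Rough weight-storage estimate at 4-bit quantization (0.5 bytes/param).
def weights_gb(params_billion: float, bits: int = 4) -> float:
    return params_billion * bits / 8  # 1e9 params * (bits/8) bytes / 1e9

for name, params in [("gpt-oss 120B", 120), ("Qwen3 32B", 32), ("Qwen3 30B-A3B", 30)]:
    print(f"{name}: ~{weights_gb(params):.0f} GB")
# gpt-oss 120B lands around 60 GB, roughly double a 32 GB card,
# hence the CPU offloading and the drop to 37 tok/sec.
```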
rus. Volk ~ ukr. Vovk
rus. Aleksey ~ ukr. Oleksiy
Different suffixes: rus. Tarasov ~ ukr. Tarasenko, Tarasiuk
Ukrainian surnames often have no suffixes at all (more often than in Russia): rus. Melnikov vs. ukr. Melnyk
rus. Kovalev vs. ukr. Koval
Although it doesn't say anything about a person's nationality for certain, because of migrations. Yet they seem to be, judging by many other tests (character corrections or manipulations in texts, for example).
> The fact that reasoning models can count letters, even though they can't see individual letters
To a mind, every idea is a representation. But we want the processor to work reliably on those representations.
> If we don't allow a [mind] to base its reasoning on the training data it's seen, what should it base it on
On its reasoning and judgement over what it was told. You do not just repeat what you heard, or you state that it's what you heard (and provide sources).
> uses randomness
That is, in a way, a problem, a non-final fix: satisficing (Herb Simon) from random germs instead of constructing through a full optimality plan.
In the way I used the expression «chancey guesses», though, I meant that guessing by chance when the right answer falls in a limited set ("how many letters in 'but'") is a weaker corroboration than when the right answer falls in a richer set ("how many letters in this sentence").
An LLM is a derivative of collective human knowledge, which is intrinsically unreliable itself. Most human concepts are ill-defined, fuzzy, very contextual. Human reasoning itself is flawed.
I'm not sure why people expect 100% reliability from a language model that is based on human representations which themselves cannot realistically be 100% reliable and perfectly well-defined.
If we want better reliability, we need a combination of tools: a "human mind model", which is intrinsically unreliable, plus a set of programmatic tools (say, like a human would use a calculator or a program to verify their results). I don't know if we can make something which works with human concepts and is 100% reliable in principle. Can a "lesser" mind create a "greater" mind, one free of human limitations? I think it's an open question.
There are 3 three-letter words in the given text:
"Can" (first instance)
"all"
"Can" (second instance)
Sounds about right. Then I took this answer and asked it to count the average number of words inside parentheses. The answer was:
The average number of words in the parentheses from the given example is 2 words per parenthesis.
Here's the breakdown:
1. "Can" (first instance): Parentheses contain "first instance" (2 words).
2. "all": No parentheses, so excluded.
3. "Can" (second instance): Parentheses contain "second instance" (2 words).
Total words in parentheses: 2 + 2 = 4
Number of parentheses: 2
Average: 4 ÷ 2 = 2.
Correct again. Everything on the first try.

Yep.
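For what it's worth, this is exactly the kind of count that's easy to verify programmatically. A sketch with a stand-in sentence, since the original text that was analyzed isn't quoted in the thread:

```python
import re

# Stand-in text assumed for illustration; the thread doesn't quote the
# original sentence the model was asked about.
text = "Can (first instance) we all agree? Can (second instance) we?"

words = re.findall(r"[A-Za-z]+", text)
print([w for w in words if len(w) == 3])   # ['Can', 'all', 'Can'] -> 3

groups = re.findall(r"\(([^)]*)\)", text)  # contents of each (...)
counts = [len(g.split()) for g in groups]
print(sum(counts) / len(counts))           # (2 + 2) / 2 = 2.0
```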
> gave me the correct answer
Try real-world tests that cannot be covered by training data or chancey guesses.
The fact that reasoning models can count letters, even though they can't see individual letters, is actually pretty cool.
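To see why they can't see individual letters: common words usually arrive as single tokens. A quick check, assuming the tiktoken library with the cl100k_base vocabulary as a representative example:

```python
import tiktoken  # pip install tiktoken

# A frequent word like "but" encodes to a single token id, so the model
# receives one opaque integer, not the letters b-u-t; any letter count
# has to come from learned associations or step-by-step reasoning.
enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("but")
print(len(tokens), tokens)  # 1 token (the id depends on the vocabulary)
```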
> Try real-world tests that cannot be covered by training data
If we don't allow a model to base its reasoning on the training data it's seen, what should it base it on? Clairvoyance? :)
> chancey guesses
The default sampling in most LLMs uses randomness to feel less robotic and repetitive, so it’s no surprise it makes “chancey guesses.” That’s literally what the system is programmed to do by default.
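For the curious, default decoding looks roughly like this. A minimal sketch with made-up logits; real stacks add top-k/top-p and other tweaks on top:

```python
import math
import random

# Toy next-token distribution; the logits are invented for illustration.
logits = {"cat": 2.0, "dog": 1.5, "car": 0.2}
temperature = 0.8  # < 1 sharpens the distribution, > 1 flattens it

scaled = {tok: l / temperature for tok, l in logits.items()}
z = sum(math.exp(v) for v in scaled.values())
probs = {tok: math.exp(v) / z for tok, v in scaled.items()}

# Sampling instead of taking the argmax is where the built-in
# "chanciness" enters: identical prompts can yield different tokens.
print(random.choices(list(probs), weights=list(probs.values()))[0])
```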
Except I create a new file for each new day, to have peace of mind (as opposed to having a million-line-long file). Instead of Ctrl+F, I use grep. The format is Markdown.
My typical TODO file has 3 sections: TODO, Pending, and DONE. If something is done for the day, I move it to the DONE section. When I create a new file for a new day, I copy over everything except the DONE section. The Pending section is for something I can't act on immediately (say, waiting for a coworker's response). I look there less often.
Every morning I also re-prioritize the items in the TODO section.
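The daily roll-over is simple enough to script. A sketch, with the file naming and heading levels assumed (the comment only says the format is Markdown):

```python
import datetime
import pathlib
import re

# Assumed layout: one file per day named todo-YYYY-MM-DD.md with
# top-level "# TODO", "# Pending", "# DONE" sections. Copies everything
# except the DONE section into today's file.
today = datetime.date.today()
old = pathlib.Path(f"todo-{today - datetime.timedelta(days=1)}.md")
new = pathlib.Path(f"todo-{today}.md")

sections = re.split(r"(?m)^(?=# )", old.read_text())  # split at headings
new.write_text("".join(s for s in sections if not s.startswith("# DONE")))
```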
The only problem is that if I'm away from the work computer, I have to add items in a separate app on the phone (Notepad Free) and then manually copy them to the PC.
This system is something I naturally came to over those 7 years via trial and error, something that works well for me. I had other formats that didn't catch on.
Just my two cents.