vessenes · 5 months ago
I tried Kimi on a few coding problems that Claude was spinning on. It's good. It's huge, way too big to be a "local" model (I think you need something like 16 H200s to run it), but it has a slightly different vibe than some of the other models. I liked it. It would definitely be useful in ensemble use cases at the very least.
summarity · 5 months ago
Reasonable speeds are possible with 4-bit quants on two 512 GB Mac Studios (MLX TB4 ring; see https://x.com/awnihannun/status/1943723599971443134) or even a single-socket Epyc system with >1 TB of RAM (about the same real-world memory throughput as the M Ultra). So $20k-ish to play with it.

For real-world speeds though yeah, you'd need serious hardware. This is more of a "deploy your own stamp" model, less a "local" model.
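If you want a number for the Epyc route, here's a back-of-envelope decode estimate. All three inputs are my assumptions (roughly 32B active params per token for this MoE, FP8 at 1 byte/param, ~400 GB/s sustained from a 12-channel DDR5 socket), so treat it as an upper bound, not a promise:

    # Decode is memory-bandwidth-bound: every generated token has to read
    # the active expert weights once. All three numbers are assumptions.
    active_params = 32e9      # assumed active parameters per token (MoE)
    bytes_per_param = 1.0     # FP8
    bandwidth = 400e9         # assumed sustained bandwidth, bytes/s

    print(f"~{bandwidth / (active_params * bytes_per_param):.0f} tokens/s upper bound")
    # ~12 tokens/s best case; real-world decode will be lower.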

wongarsu · 5 months ago
Reasonable speeds are possible if you pay someone else to run it. Right now both NovitaAI and Parasail are running it, both available through OpenRouter and both promising not to store any data. I'm sure the other big model hosters will follow if there's demand.

I may not be able to reasonably run it myself, but at least I can choose who I trust to run it and can have inference pricing determined by a competitive market. According to their benchmarks the model is about in a class with Claude 4 Sonnet, yet it already costs less than one third of Sonnet's inference pricing.
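For anyone who wants to try that route, here's a minimal sketch of calling it through OpenRouter's OpenAI-compatible endpoint. The model slug is my assumption; check openrouter.ai for the current identifier and each provider's data policy:

    # Hedged sketch: the "moonshotai/kimi-k2" slug is an assumption.
    from openai import OpenAI

    client = OpenAI(
        base_url="https://openrouter.ai/api/v1",
        api_key="sk-or-...",  # your OpenRouter key
    )
    response = client.chat.completions.create(
        model="moonshotai/kimi-k2",  # assumed model slug
        messages=[{"role": "user", "content": "Write a binary search in Python."}],
    )
    print(response.choices[0].message.content)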

refulgentis · 5 months ago
I write a local LLM client, but sometimes I hate that local models have enough knobs that people can argue they're reasonable in any scenario - in yesterday's post re: Kimi K2, multiple people spoke up that you can "just" stream the active expert weights out of 64 GB of RAM, use the lowest GGUF quant, and then you get something that rounds to 1 token/s, and that is reasonable for use.

Good on you for not exaggerating.

I am very curious what exactly they see in that; 2-3 people hopped in to handwave that you just have it do agent stuff overnight and it's well worth it. I can't even begin to imagine unless you have a metric **-ton of easily solved problems that aren't coding. Even a 90% success rate gets you into "useless" territory quickly when one step depends on the other and you're running it autonomously for hours.
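The arithmetic there is worth making concrete. A quick sketch in plain Python; the per-step success rates are illustrative, not measured:

    # If each step of an overnight agent run succeeds independently with
    # probability p, the whole chain of n dependent steps succeeds with
    # probability p**n.
    for p in (0.90, 0.99):
        for n in (10, 20, 50):
            print(f"p={p:.2f}, n={n:2d}: chain success = {p**n:.1%}")
    # p=0.90: 10 steps -> ~35%, 20 steps -> ~12%, 50 steps -> ~0.5%
    # p=0.99: 10 steps -> ~90%, 20 steps -> ~82%, 50 steps -> ~61%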

gpm · 5 months ago
> or even a single socket Epyc system with >1TB of RAM

How many tokens/second would this likely achieve?

spaceman_2020 · 5 months ago
This is fairly affordable if you're a business, honestly.
tuananh · 5 months ago
Looks very usable for local use.
handzhiev · 5 months ago
I tried it a couple of times in comparison to Claude. Kimi wrote much simpler and more readable code than Claude's over-engineered solutions, though it missed a few subtle edge cases that Claude took care of.
airstrike · 5 months ago
Claude what? Sonnet? 3.7? 3.5? Opus? 4?
nathan_compton · 5 months ago
The first question I gave it was a fairly simple recreational math problem that I asked it to code up for me, and it got it outrageously wrong. In fairness, and to my surprise, OpenAI's model also failed at this task, although with some prompting it sort of got it.
moffkalast · 5 months ago
Still pretty good; someone with enough resources could distil it down to a more manageable size for the rest of us.

Xmd5a · 5 months ago
I asked it to give me its opinion on an email I'm writing. 95% of its content is quotes from famous authors; the 5% I wrote is really just minimal glue in between.

All the models I tested, which include Sonnet 4, DeepSeek R1, 4o, and Gemini 2.5, understand that this isn't a normal email and that what I'm asking for is literary/philosophical criticism, not remarks about conventions, formatting, or how to convey my message in a more impactful way.

    Quick fix:
    - Replace half the commas with periods.
    - Let one metaphor bloom and then stop.
    - Read it aloud; wherever you inhale, insert a line break.
    
    Before:
    >And if the collective imagination is so seized by it, it's because
    we are already—not in some other world—but in this very life, in a
    state of socio-, photo-, and videosynthesis.
    
    After:
    >The collective imagination is seized.
    >We are not in another world.
    >We are in this life, in a state of videosynthesis.

Yes, this quote is by Baudrillard. None of the other models fixated on the fact that it’s an email (I only used the word once in the prompt). My gut feeling is that this reflects not so much a lack of intelligence as a difference in model personality. Here's what it replied when I shared Gemini's analysis. The point was to have it understand that:

    Gemini, second sentence: It is less a simple email than an intellectual
    and personal exploration still in the making.
It deflected the point as soon as it understood it:

    **Gemini is polite and pedagogical; I’m going to be blunt.**
    
    1. **This isn’t an email, it’s a trip.**
       ...
    
    2. **The content is good, the format is not.**
       ...
    
    3. **Gemini is right about one thing:**
       ...
    
       * cut 80% of the material,
       * ask **one clear question**
       * put the rest in a separate Google Doc or PDF for posterity.
    
    4. **Brutal summary:**
       Keep your draft for yourself — it’s valuable. But for the email, write a new one in **three sentences**:
    
       * thank them,
       * recall the context in one line,
       * ask the question.
    
    You’ve already got Gemini’s answer to stroke your ego — now give me the short version you’re actually going to send.

The solution to sycophancy is not disparagement (misplaced criticism). The classical true/false positive/negative dilemma is at play here. I guess the bot got caught in the crossfire of 1) its no-bullshit attitude (it can only be an attitude), 2) a preference for delivering blunt criticism over insincere flattery, and 3) being a helpful assistant. Remove point 3) and it could have replied: "I'm not engaging in this nonsense." Preserve it and it will politely suggest that you condense your bullshit text, because shorter explanations are better than long-winded rants (it's probably in the prompt).

simonw · 5 months ago
cosmojg · 5 months ago
For what it's worth, I think Kimi's modified MIT license still meets the OSI definition of "open source." For example, the explicitly OSI-approved "Attribute Assurance License"[1] contains similar wording:

> each time the resulting executable program or a program dependent thereon is launched, a prominent display (e.g., splash screen or banner text) of the Author’s attribution information

[1] https://opensource.org/license/attribution-php

pabs3 · 5 months ago
It probably doesn't, because the attribution requirement discriminates against certain groups (large commercial organisations).
simonw · 5 months ago
Huh, I hadn't seen that one before!
ebiester · 5 months ago
At this point, they have to be training on it. At what point will you start using something else?
simonw · 5 months ago
Once I get a picture that genuinely looks like a pelican riding a bicycle!
qmmmur · 5 months ago
I'm glad we are looking to build nuclear reactors so we can do more of this...
sergiotapia · 5 months ago
Me too - we must energymaxx. I want a nuclear reactor in my backyard powering everything. I want AC units in every room and in my open-door garage while I work out.
1vuio0pswjnm7 · 5 months ago
"I'm glad we are looking to build nuclear reactors so we can do more of this..."

Does this actually mean "they", not "we"?

neoromantique · 5 months ago
I honestly don't see an issue with that.

Except that instead of this, we're spinning up old coal plants, because apparently nuclear bad.

csomar · 5 months ago
Much better than that of Grok 4.
jug · 5 months ago
That's perhaps the best one I've seen yet! For an open weight model, this performance is of course particularly remarkable and impactful.
_alex_ · 5 months ago
wow!
ozgune · 5 months ago
This is a very impressive general-purpose LLM, in the class of GPT-4o and the DeepSeek-V3 family. It's also open source.

I think it hasn’t received much attention because the frontier shifted to reasoning and multi-modal AI models. In accuracy benchmarks, all the top models are reasoning ones:

https://artificialanalysis.ai/

If someone took Kimi K2 and trained a reasoning model on top of it, I'd be curious how that model would perform.

GaggiX · 5 months ago
>If someone took Kimi k2 and trained a reasoning model with it

I imagine that's what they're doing at MoonshotAI right now.

Alifatisk · 5 months ago
Why haven't Kimi's current and older models been benchmarked and added to Artificial Analysis yet?

exegeist · 5 months ago
Technical strengths aside, I’ve been impressed with how non-robotic Kimi K2 is. Its personality is closer to Anthropic’s best: pleasant, sharp, and eloquent. A small victory over botslop prose.
orbital-decay · 5 months ago
I have a different experience in chatting/creative writing. It tends to overuse certain speech patterns without repeating them verbatim, and it is strikingly close to the original R1's writing without being "chaotic" like R1 (unexpected and overly dramatic sci-fi and horror story turns, "somewhere, X happens" at the end, etc.).

Interestingly enough, EQ-Bench/Creative Writing Bench doesn't spot this despite clearly having it in their samples. This makes me trust it even less.

simonw · 5 months ago
Big release - the model weights at https://huggingface.co/moonshotai/Kimi-K2-Instruct are 958.52 GB.
c4pt0r · 5 months ago
Paired with programming tools like Claude Code, it could be a low-cost/open-source replacement for Sonnet
scottyeager · 5 months ago
Here's a neat-looking project that allows for using other models with Claude Code: https://github.com/musistudio/claude-code-router

I found that while looking for reports of the best agents to use with K2. The usual suspects like Cline and forks, Aider, and Zed should be interesting to test with K2 as well.
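Worth noting: these pairings mostly just need an OpenAI- or Anthropic-compatible endpoint. As a hedged sketch, here's what talking to Kimi K2 directly with the anthropic SDK might look like; the base_url and model id are my assumptions, so check Moonshot's docs before relying on them:

    # Assumed: Moonshot exposes an Anthropic-compatible endpoint and a
    # "kimi-k2" model id. Both are guesses here; verify before use.
    import anthropic

    client = anthropic.Anthropic(
        base_url="https://api.moonshot.ai/anthropic",  # assumed endpoint
        api_key="sk-...",  # Moonshot API key
    )
    message = client.messages.create(
        model="kimi-k2",  # assumed model id
        max_tokens=1024,
        messages=[{"role": "user", "content": "Refactor this function to be iterative."}],
    )
    print(message.content[0].text)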

martin_ · 5 months ago
How do you run a 1T-param model at low cost?
kkzz99 · 5 months ago
According to the benchmarks it's closer to Opus, but I'd venture primarily for English and Chinese.
wiradikusuma · 5 months ago
I've only started using Claude, Gemini, etc. in the last few months (I guess it comes with age; I'm no longer interested in trying the latest "tech"). I assume those are "non-agentic" models.

From reading articles online, "agentic" means like you have a "virtual" Virtual Assistant with "hands" that can google, open apps, etc, on their own.

Why not use existing "non-agentic" model and "orchestrate" them using LangChain, MCP etc? Why create a new breed of model?

I'm sorry if my questions sound silly. Following AI world is like following JavaScript world.

dcre · 5 months ago
Reasonable question, simple answer: "New breed of model" is overstating it — all these models for years have been fine-tuned using reinforcement learning on a variety of tasks, it's just that the set of tasks (and maybe the amount of RL) has changed over time to include more tool use tasks, and this has made them much, much better at the latter. The explosion of tools like Claude Code this year is driven by the models just being more effective at it. The orchestration external to the model you mention is what people did before this year and it did not work as well.
simonw · 5 months ago
"Agentic" and "agent" can mean pretty much anything, there are a ton of different definitions out there.

When an LLM says it's "agentic" it usually means that it's been optimized for tool use. Pretty much all the big models (and most of the small ones) are designed for tool use these days, it's an incredibly valuable feature for a model to offer.

I don't think this new model is any more "agentic" than o3, o4-mini, Gemini 2.5 or Claude 4. All of those models are trained for tools, all of them are very competent at running tool calls in a loop to try to achieve a goal they have been given.

ozten · 5 months ago
It is not a silly question. The various flavors of LLM have issues with reliability. In software we expect five 9s; LLMs aren't even at one 9. Early on it was the reliability of JSON output. Then instruction following. Then tool use. Now it's "computer use" and orchestration.

Creating models for this specific problem domain will have a better chance at reliability, which is not a solved problem.

Jules is the Gemini coder that links to GitHub. Half the time it doesn't create a pull request; it forgets and assumes I'll do some testing or something. It's wild.

apitman · 5 months ago
I'm new too. Found this article helpful: https://crawshaw.io/blog/programming-with-agents
selfhoster11 · 5 months ago
> I'm sorry if my questions sound silly. Following AI world is like following JavaScript world.

You are more right than you could possibly imagine.

TL;DR: "agentic" just means "can call tools it's been given access to, autonomously, and then access the output" combined with an infinite loop in which the model runs over and over (compared to a one-off interaction like you'd see in ChatGPT). MCP is essentially one of the methods to expose the tools to the model.

Is this something the models could do for a long while with a wrapper? Yup. "Agentic" is the current term for it, that's all. There's some hype around "agentic AI" that's unwarranted, but part of the reason for the hype is that models have become better at tool calling and using data in their context since the early days.
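To make that concrete, here is roughly what the loop looks like against an OpenAI-compatible chat API. The tool (a toy file reader) and the model name are placeholders, not anything specific to Kimi:

    import json
    from openai import OpenAI

    client = OpenAI()
    tools = [{
        "type": "function",
        "function": {
            "name": "read_file",
            "description": "Read a text file and return its contents",
            "parameters": {
                "type": "object",
                "properties": {"path": {"type": "string"}},
                "required": ["path"],
            },
        },
    }]

    messages = [{"role": "user", "content": "Summarize notes.txt"}]
    while True:  # the loop is what makes it "agentic"
        reply = client.chat.completions.create(
            model="gpt-4o",  # placeholder model
            messages=messages,
            tools=tools,
        ).choices[0].message
        messages.append(reply)
        if not reply.tool_calls:     # no more tool use: model is done
            print(reply.content)
            break
        for call in reply.tool_calls:
            args = json.loads(call.function.arguments)
            result = open(args["path"]).read()  # actually run the tool
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": result,
            })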

fzysingularity · 5 months ago
If I had to guess, the OpenAI open-source model got delayed because Kimi K2 stole their thunder and beat their numbers.
irthomasthomas · 5 months ago
Someone at OpenAI did say it was too big to host at home, so you could be right. They're probably benchmaxxing right now, searching for a few evals they can beat.
johnb231 · 5 months ago
These are all "too big to host at home". I don't think that is the issue here.

https://github.com/MoonshotAI/Kimi-K2/blob/main/docs/deploy_...

"The smallest deployment unit for Kimi-K2 FP8 weights with 128k seqlen on mainstream H200 or H20 platform is a cluster with 16 GPUs with either Tensor Parallel (TP) or "data parallel + expert parallel" (DP+EP)."

16 GPUs costing ~$30k each. No one is running a ~$500k server at home.
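The 16-GPU floor follows from the weight size alone. Ballpark arithmetic (KV-cache needs vary a lot with batch size and attention layout, so treat this as a sanity check only):

    weights_gb = 958.52   # FP8 checkpoint size from the HF repo above
    h200_gb = 141         # HBM3e per H200
    gpus = 16

    per_gpu = weights_gb / gpus
    print(f"{per_gpu:.0f} GB of weights per GPU")            # ~60 GB
    print(f"{h200_gb - per_gpu:.0f} GB left for KV cache")   # ~81 GB for 128k ctx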

cubefox · 5 months ago
According to the benchmarks, Kimi K2 beats GPT-4.1 in many ways. So to "compete", OpenAI would have to release the GPT-4.1 weights, or a similar model. Which, I guess, they likely won't do.
emacdona · 5 months ago
To me, K2 is a mountain and SOTA is “summits on the air”. I saw that headline and thought “holy crap” :-)
esafak · 5 months ago