Readit News
guywhocodes commented on Lossless LLM 3x Throughput Increase by LMCache   github.com/LMCache/LMCach... · Posted by u/lihanc111
kcorbitt · 2 months ago
Looks cool! With vLLM v1, prefix caching is enabled by default and seems quite performant. Is the advantage of LMCache that you can offload to CPU and disk as well? How much are throughput/latency affected if you need to pull a large KV cache from disk/CPU instead of GPU RAM?

Also, how realistic would it be to share the KV cache across vLLM nodes within a data center? It would be really nice to be able to freely distribute requests to a pool of vLLM workers without worrying about prefix-aware routing, but maybe that isn't the right approach because moving the KV cache around would be too slow?

guywhocodes · 2 months ago
This is exactly what llm-d is
guywhocodes commented on MCP in LM Studio   lmstudio.ai/blog/lmstudio... · Posted by u/yags
sixhobbits · 2 months ago
MCP terminology is already super confusing, but this seems to just introduce "MCP Host" randomly in a way that makes no sense to me at all.

> "MCP Host": applications (like LM Studio or Claude Desktop) that can connect to MCP servers, and make their resources available to models.

I think everyone else is calling this an "MCP Client", so I'm not sure why they would want to call themselves a host - makes it sound like they are hosting MCP servers (definitely something that people are doing, even though often the server is run on the same machine as the client), when in fact they are just a client? Or am I confused?

guywhocodes · 2 months ago
MCP Host is terminology from the spec. It's the software that makes LLM calls, builds prompts, interprets tool-call requests and performs them, etc.
guywhocodes commented on Resurrecting flip phone typing as a Linux driver   github.com/FoxMoss/libt9... · Posted by u/foxmoss
akdev1l · 2 months ago
Why don't we have T9 support on TV remotes???

I gotta aim and peck some bullshit or open up some app with a QR code instead. Give me T9

guywhocodes · 2 months ago
A few years ago I started building this. The idea was to send the whole word, and when cycling through candidates, send the same number of backspaces before sending the next one.

I guess I got busy with other things

guywhocodes commented on GPT 4.5 level for 1% of the price   twitter.com/Baidu_Inc/sta... · Posted by u/decide1000
hjgjhyuhy · 6 months ago
[flagged]
guywhocodes · 6 months ago
Europe 2.0
guywhocodes commented on The AMD Radeon Instinct MI300A's Giant Memory Subsystem   chipsandcheese.com/p/insi... · Posted by u/pella
behnamoh · 8 months ago
AMD is done; no one uses their GPUs for AI because AMD was too dumb to understand the value of software lock-in the way Nvidia did with CUDA.
guywhocodes · 8 months ago
More like the value of drivers that don't require one in-house team per customer to "fix" driver crashes in the customers' particular workloads.
guywhocodes commented on Cognitive load is what matters   minds.md/zakirullin/cogni... · Posted by u/zdw
pkkkzip · 8 months ago
I remember I had this argument with a CTO before about cognitive load. I was concerned about the sheer amount of code behind React/Redux for what could've been a simple, plain server-rendered page with some jQuery sprinkled in.

Her answer was "if Facebook (before Meta) is doing it then so should we."

I said we aren't Facebook. But all the engineers sided with her.

Said startup failed after burning $XX million on a product nobody bought.

guywhocodes · 8 months ago
A tale as old as the SPA
guywhocodes commented on Cognitive load is what matters   minds.md/zakirullin/cogni... · Posted by u/zdw
BiteCode_dev · 8 months ago
Something I noticed is that some vim / keyboard-only envs are paying a huge cognitive-load price by holding various states in their mind and having to expend effort every time they switch context.

Sometimes there is the added burden of an exotic Linux distro or a Dvorak layout on a specially shaped keyboard.

Now, some devs are capable of handling this. But not all are: I've seen many claim they are more productive with it, yet when compared to others, they were less productive.

They were slow and tired easily. They had a higher burnout rate. They had too much to pay upfront for their day-to-day coding tasks, but couldn't see that their idealized view of their situation didn't match reality.

My message here is: if you are in such an env, be very honest with yourself. Are you good enough that you are among the few who actually benefit from it?

guywhocodes · 8 months ago
Is this bait?
guywhocodes commented on Bringing K/V context quantisation to Ollama   smcleod.net/2024/12/bring... · Posted by u/mchiang
wokwokwok · 9 months ago
Nice.

That said... I mean...

> The journey to integrate K/V context cache quantisation into Ollama took around 5 months.

??

They incorrectly tagged #7926, which is a 2-line change, instead of #6279, where it was implemented. That made me dig a bit deeper, and reading the actual change it seems:

The commit [1] is:

    > params := C.llama_context_default_params()
    > ...
    > params.type_k = kvCacheTypeFromStr(strings.ToLower(kvCacheType)) <--- adds this
    > params.type_v = kvCacheTypeFromStr(strings.ToLower(kvCacheType)) <--- adds this
Which has been part of llama.cpp since Dec 7, 2023 [2].

So... mmmm... while this is great, somehow I'm left feeling kind of vaguely put-off by the comms around what is really 'we finally support some config flag from llama.cpp that's been there for really quite a long time'.

> It took 5 months, but we got there in the end.

... I guess... yay? The challenges don't seem like they were technical, but I guess, good job getting it across the line in the end?

[1] - https://github.com/ollama/ollama/commit/1bdab9fdb19f8a8c73ed...

[2] - since https://github.com/ggerganov/llama.cpp/commit/bcc0eb4591bec5...

guywhocodes · 9 months ago
This is par for the course for Ollama; look at the log_probs issues/PRs and you get an idea of how well Ollama is run.

Ollama is IMO a model downloader for llama.cpp so you can do roleplay with ease.

guywhocodes commented on Large Enough   mistral.ai/news/mistral-l... · Posted by u/davidbarker
qeternity · a year ago
It is. Strawberry is one token in many tokenizers. The model doesn't have a concept that there are letters there.
guywhocodes · a year ago
This is pretty much equivalent to the statement "multicharacter tokens are a dead end for understanding text". Which I agree with.
guywhocodes commented on Hacked Nvidia 4090 GPU driver to enable P2P   github.com/tinygrad/open-... · Posted by u/nikitml
throwaway8481 · a year ago
I feel like I should say something about Discord not being a suitable replacement for a forum or bug tracker.
guywhocodes · a year ago
We are talking about a literal monologue while poking at a driver for a few hours; this wasn't a huge project.
