Readit News
guywhocodes commented on Lossless LLM 3x Throughput Increase by LMCache   github.com/LMCache/LMCach... · Posted by u/lihanc111
kcorbitt · 2 months ago
Looks cool! With vLLM v1, prefix caching is enabled by default and seems quite performant. Is the advantage of LMCache that you can offload to CPU and disk as well? How much are throughput/latency affected if you need to pull a large KV cache from disk/CPU instead of GPU RAM?

Also, how realistic would it be to share the KV cache across vLLM nodes within a data center? It would be really nice to be able to freely distribute requests to a pool of vLLM workers without worrying about prefix-aware routing, but maybe that isn't the right approach because moving the KV cache around would be too slow?

guywhocodes · 2 months ago
This is exactly what llm-d is
guywhocodes commented on MCP in LM Studio   lmstudio.ai/blog/lmstudio... · Posted by u/yags
sixhobbits · 2 months ago
MCP terminology is already super confusing, but this seems to just introduce "MCP Host" randomly in a way that makes no sense to me at all.

> "MCP Host": applications (like LM Studio or Claude Desktop) that can connect to MCP servers, and make their resources available to models.

I think everyone else is calling this an "MCP Client", so I'm not sure why they would want to call themselves a host - makes it sound like they are hosting MCP servers (definitely something that people are doing, even though often the server is run on the same machine as the client), when in fact they are just a client? Or am I confused?

guywhocodes · 2 months ago
MCP Host is terminology from the spec. It's the software that makes LLM calls, builds prompts, interprets tool-call requests and performs them, etc.
guywhocodes commented on Resurrecting flip phone typing as a Linux driver   github.com/FoxMoss/libt9... · Posted by u/foxmoss
akdev1l · 2 months ago
Why don't we have T9 support on TV remotes???

I gotta aim and peck some bullshit or open up some app with a QR code instead. Give me T9

guywhocodes · 2 months ago
A few years ago I started building this. The idea was to send the whole word, and when cycling through candidates, send the same number of backspaces before sending the next one.

I guess I got busy with other things

guywhocodes commented on GPT 4.5 level for 1% of the price   twitter.com/Baidu_Inc/sta... · Posted by u/decide1000
hjgjhyuhy · 6 months ago
[flagged]
guywhocodes · 6 months ago
Europe 2.0
guywhocodes commented on The AMD Radeon Instinct MI300A's Giant Memory Subsystem   chipsandcheese.com/p/insi... · Posted by u/pella
behnamoh · 8 months ago
AMD is done; no one uses their GPUs for AI because AMD was too dumb to understand the value of software lock-in the way Nvidia did with CUDA.
guywhocodes · 8 months ago
More like the value of drivers that don't require one in-house team per customer to "fix" driver crashes in the customers' particular workloads.
guywhocodes commented on Cognitive load is what matters   minds.md/zakirullin/cogni... · Posted by u/zdw
pkkkzip · 8 months ago
I remember I had this argument with a CTO before about cognitive load. I was concerned about the sheer amount of code behind React/Redux for what could've been a simple, plain server-rendered page with some jQuery sprinkled in.

Her answer was "if Facebook (before Meta) is doing it then so should we."

I said we aren't Facebook. But all the engineers sided with her.

Said startup failed after burning $XX million on a product nobody bought.

guywhocodes · 8 months ago
A tale as old as the SPA
guywhocodes commented on Cognitive load is what matters   minds.md/zakirullin/cogni... · Posted by u/zdw
BiteCode_dev · 8 months ago
Something I noticed is that some vim / keyboard-only envs are paying a huge cognitive-load price by holding various states in their mind and having to expend effort every time they switch context.

Sometimes there is the added burden of an exotic Linux distro or a Dvorak layout on a specially shaped keyboard.

Now, some devs are capable of handling this. But not all are: I've seen many claim they are more productive with it, yet when compared to others, they were less productive.

They were slow and tired easily. They had a higher burnout rate. They had too much to pay upfront for their day-to-day coding tasks, but couldn't see that their idealized view of their situation didn't match reality.

My message here is: if you are in such an env, be very honest with yourself. Are you good enough that you are among the few who actually benefit from it?

guywhocodes · 8 months ago
Is this bait?
guywhocodes commented on Bringing K/V context quantisation to Ollama   smcleod.net/2024/12/bring... · Posted by u/mchiang
wokwokwok · 9 months ago
Nice.

That said... I mean...

> The journey to integrate K/V context cache quantisation into Ollama took around 5 months.

??

They incorrectly tagged #7926, which is a 2-line change, instead of #6279, where it was implemented. That made me dig a bit deeper, and reading the actual change it seems:

The commit [1] is:

    > params := C.llama_context_default_params()
    > ...
    > params.type_k = kvCacheTypeFromStr(strings.ToLower(kvCacheType)) <--- adds this
    > params.type_v = kvCacheTypeFromStr(strings.ToLower(kvCacheType)) <--- adds this
Which has been part of llama.cpp since Dec 7, 2023 [2].

So... mmmm... while this is great, somehow I'm left feeling kind of vaguely put-off by the comms around what is really 'we finally support some config flag from llama.cpp that's been there for really quite a long time'.

> It took 5 months, but we got there in the end.

... I guess... yay? The challenges don't seem like they were technical, but I guess, good job getting it across the line in the end?

[1] - https://github.com/ollama/ollama/commit/1bdab9fdb19f8a8c73ed...

[2] - since https://github.com/ggerganov/llama.cpp/commit/bcc0eb4591bec5...

guywhocodes · 9 months ago
This is par for the course for Ollama; look at the log_probs issues/PRs and you get an idea of how well Ollama is run.

Ollama is IMO a model downloader for llama.cpp so you can do roleplay with ease.

guywhocodes commented on Large Enough   mistral.ai/news/mistral-l... · Posted by u/davidbarker
qeternity · a year ago
It is. Strawberry is one token in many tokenizers. The model doesn't have a concept that there are letters there.
guywhocodes · a year ago
This is pretty much equivalent to the statement "multicharacter tokens are a dead end for understanding text". Which I agree with.
guywhocodes commented on Hacked Nvidia 4090 GPU driver to enable P2P   github.com/tinygrad/open-... · Posted by u/nikitml
throwaway8481 · a year ago
I feel like I should say something about Discord not being a suitable replacement for a forum or bug tracker.
guywhocodes · a year ago
We are talking about a literal monologue while poking at a driver for a few hours; this wasn't a huge project.
