I think this comment explains it https://github.com/ggerganov/llama.cpp/discussions/4130#disc...
As far as I understand (mcharytoniuk can confirm), llama.cpp lets you chunk the context window of an LLM into independent blocks, so that multiple requests can be processed in a single inference pass. And because LLMs generate token by token, you also don't have to wait for all sequences to finish before returning output: as soon as one sequence finishes, its "slot" in the context window can be reused for another request.
Yes, exactly. You can split the available context into "slots" (chunks) so the server can handle multiple requests concurrently. The number of slots is configurable.
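For anyone curious what that looks like at the API level, here is a rough, hypothetical sketch using llama.cpp's C batch API (simplified, not the actual server code, and field names can shift between versions): tokens from independent requests are packed into one batch, each tagged with its own sequence id, and decoded in a single forward pass.

```c
#include "llama.h"

// Rough sketch of continuous batching with llama.cpp's C API (hypothetical,
// simplified; not the actual server implementation). Two independent requests
// ("slots") share one llama_context: their next tokens are packed into a
// single batch, tagged with different sequence ids, and decoded together.
static void decode_two_slots(struct llama_context * ctx,
                             llama_token tok_a, llama_pos pos_a,
                             llama_token tok_b, llama_pos pos_b) {
    // room for 2 tokens, no embeddings, at most 1 sequence id per token
    struct llama_batch batch = llama_batch_init(2, 0, 1);

    // slot 0: next token of request A, assigned to sequence id 0
    batch.token   [0] = tok_a;
    batch.pos     [0] = pos_a;
    batch.n_seq_id[0] = 1;
    batch.seq_id  [0][0] = 0;
    batch.logits  [0] = 1;  // we want logits to sample A's next token

    // slot 1: next token of request B, assigned to sequence id 1
    batch.token   [1] = tok_b;
    batch.pos     [1] = pos_b;
    batch.n_seq_id[1] = 1;
    batch.seq_id  [1][0] = 1;
    batch.logits  [1] = 1;

    batch.n_tokens = 2;

    // one forward pass serves both requests; each sequence keeps its own
    // cells in the KV cache
    llama_decode(ctx, batch);

    // when a request finishes, its KV cells can be cleared and the slot
    // reused for a new request, e.g.:
    // llama_kv_cache_seq_rm(ctx, /*seq_id=*/0, /*p0=*/-1, /*p1=*/-1);

    llama_batch_free(batch);
}
```

(In llama.cpp's server, the number of slots is what the `--parallel`/`-np` flag controls, with the total context split among them, if I remember correctly.)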
What does stateful mean here? I always wonder how loading per-user state is done. It seems one can call `llama_state_set_data`; does this load balancer create a central store for such states? What is the overhead of transferring state?
Currently, it is a single instance in memory, so it doesn't transfer state. HA is on the roadmap; only then will it need some kind of distributed state store.
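To give a sense of what transferring state would involve if/when that happens: the whole context state (KV cache included) can be serialized and restored with llama.cpp's C API, roughly like this (a minimal sketch; exact signatures vary between llama.cpp versions):

```c
#include <stdint.h>
#include <stdlib.h>
#include "llama.h"

// Minimal sketch of saving/restoring a context's state with llama.cpp's C API
// (exact signatures may differ between versions). The buffer includes the KV
// cache, so for long contexts it can be very large -- that is the main
// overhead of moving state between instances.
static size_t snapshot_state(struct llama_context * ctx, uint8_t ** out_buf) {
    const size_t state_size = llama_state_get_size(ctx);

    uint8_t * buf = malloc(state_size);
    if (buf == NULL) {
        return 0;
    }

    // serialize the current state (RNG, logits, embeddings, KV cache) into buf
    const size_t written = llama_state_get_data(ctx, buf, state_size);

    *out_buf = buf;
    return written;
}

static void restore_state(struct llama_context * ctx,
                          const uint8_t * buf, size_t size) {
    // load a previously captured state into (another) context
    llama_state_set_data(ctx, buf, size);
}
```

There are also per-sequence variants (`llama_state_seq_get_data` and friends) for moving a single slot's state rather than the whole context, if I recall correctly.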
Local state is reported to the load balancer by agents installed alongside each llama.cpp instance. That means instances can be added and removed dynamically; no central configuration is needed.
Does it do queuing? I didn't see it in the README. I haven't seen (though that says nothing at all) an open source solution that queues requests when all slots are busy and lets me show people their position in the queue, like the closed-source ones do.