I'm running this on my iPhone 13 Pro Max as part of the TestFlight beta, and it's interesting. I don't believe I've ever run anything that's pushed my phone this hard, and you can feel the heat. The text output performance is pretty inconsistent: it was very fast at first but slowed down considerably after a few answers. In terms of quality, it's prone to hallucinations, which is not unexpected given the size and highly compressed nature of the model.
"The marvel is not that the bear dances well, but that the bear dances at all."
> I don't believe I've ever run anything that's pushed my phone this hard, and you can feel the heat. The text output performance is pretty inconsistent: it was very fast at first but slowed down considerably after a few answers.
Those two events are causally related. The OS has to throttle down the CPU or else it will overheat and malfunction.
It is one of the reasons why heavy number crunching is often performed in the cloud instead.
In my experience, heavy number crunching is better suited to dedicated machines than to virtualized "cloud" cores: more consistent performance, no noisy neighbors, and cheaper in the long term.
Thanks for sharing! It's definitely a bit painstaking to get a real-world LLM running at all on an iPhone due to memory constraints. It's also quite compute-intensive, as it has 7B parameters, but we are glad that it's generating text at a reasonable speed!
The model we are using is a quantized Vicuna-7B, which I believe is one of the best open-source models. Hallucination is a problem for all LLMs, but I believe research on the model side will gradually alleviate this problem :-)
WizardLM-7b would be a fantastic model to try out as well. Though that might be out of date tomorrow (or already!) given how many models are being released at the moment :)
This is our latest project on making LLMs accessible to everyone. With this project, users no longer need to spend a fortune on huge VRAM, top-of-the-line GPUs, or powerful workstations to run LLMs at an acceptable speed. A consumer-grade GPU from years ago should suffice, or even a phone with enough memory.
Our approach leverages TVM Unity, a machine learning compiler that supports compiling GPT/Llama models to a diverse set of targets, including Metal, Vulkan, CUDA, ROCm, and more. In particular, we've found Vulkan great because it's readily supported by a wide range of GPUs, including AMD's and Intel's.
Not sure if you're interested in support questions, but I ran the simple start thing you guys put up (Linux, RX 570) -- and it runs quickly but spits out absolute gibberish?
This field is progressing at a crazy pace now. Not long ago, renting CUDA GPUs was the only option. Now this... AMD could easily chip away at part of the market if they released a >24 GB VRAM GPU now.
I hope they do, and I hope it forces Nvidia to release their own 48GB+ consumer card. 80GB is on my long term wish list as it would allow running a 65B model 8bit quantized. I don’t see local models exceeding ChatGPT performance until we get to a point where folks can run 65B parameter models.
Unless you're doing training, is there much point in 8-bit for this model size? My understanding is that the larger the model, the less affected it is by quantization; for 65b, 4-bit gives you ~2% perplexity penalty over 8-bit.
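To make the size tradeoff concrete, here is a back-of-the-envelope sketch of the weight storage a model needs at different quantization levels. These are pure arithmetic estimates, not measured numbers, and they ignore quantization metadata (scales/zero-points) and runtime memory like activations and the KV cache:

```python
def weights_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB: params * bits / 8, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30

# A 65B-parameter model at common precisions:
for bits in (16, 8, 4):
    print(f"65B @ {bits}-bit: {weights_gib(65e9, bits):.1f} GiB")
```

This is why a hypothetical 80 GB consumer card would comfortably fit a 65B model at 8-bit (~60 GiB of weights), while 4-bit brings it near the reach of today's high-end cards.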
Yeah, agreed. It's sad to me that we're two months after LLaMA and "nobody" seems to be making any fine-tuning advances on models with more than 7B or 13B parameters.
I have 64 GB of RAM (plain system RAM, not GPU). I'd like to see proofs of concept that the bigger models can be fine-tuned and give far better results, or to know whether we're going in completely the wrong direction with this.
Question from a noob: how well would those run on a computer with an AMD APU (for example, a Ryzen 9 7940HS) with 128 GB of RAM, setting aside 64 GB for the iGPU?
Another noob here. If I had to guess, it's because current models are mostly memory bound. The AI training GPUs (A100, H100, etc.) are not the best TFLOP performers, but they have the most VRAM. It seems researchers found a sweet spot for neural network architectures that perform well on similar configurations, i.e. near real time (reading speed, for LLMs).
Once you bring those models to the CPU, they may become compute bound again. llama.cpp illustrates that somewhat: for bigger models you tend to wait a long time for the answer. I suspect the story would be similar with iGPUs.
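A rough model of why single-stream decoding is memory bound: each generated token has to stream essentially all the weights through memory once, so memory bandwidth puts a ceiling on tokens per second regardless of compute. A sketch with illustrative bandwidth figures (assumed round numbers, not benchmarks of any specific device):

```python
def tokens_per_sec_upper_bound(model_bytes: float, mem_bw_bytes_per_s: float) -> float:
    """Upper bound on autoregressive decode speed at batch size 1,
    assuming every token requires reading all weights once and
    ignoring compute and cache effects entirely."""
    return mem_bw_bytes_per_s / model_bytes

model_bytes = 3.9e9  # ~7B parameters at 4-bit, rough figure
for name, bw in [("laptop-class DDR, ~50 GB/s", 50e9),
                 ("discrete GPU, ~500 GB/s", 500e9)]:
    print(f"{name}: <= {tokens_per_sec_upper_bound(model_bytes, bw):.0f} tok/s")
```

The order-of-magnitude gap between system RAM and GPU memory bandwidth is the main reason the same model feels so much slower on CPU or iGPU setups.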
Tokenization is a possible issue here: the longer the context, the longer the initial processing with LLaMA, I think. Possibly the tokenizer is not optimized.
That's very unlikely; tokenization is really simple and usually quite fast (it scales with input size). Unexpected slowness with a larger context window more likely points to a missing or unoptimized KV cache.
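To illustrate what the KV cache buys: without it, generating each new token recomputes attention keys and values for the entire prefix, so cost grows with context length on every step; with it, past keys/values are stored and only the new token's projections are computed. A minimal sketch with hypothetical shapes (single attention head, no batching, NumPy for clarity):

```python
import numpy as np

class KVCache:
    """Append-only cache of past attention keys/values for one head."""
    def __init__(self, d_head: int):
        self.k = np.empty((0, d_head))
        self.v = np.empty((0, d_head))

    def append(self, k_new: np.ndarray, v_new: np.ndarray) -> None:
        # Store the new token's key/value instead of recomputing
        # the whole prefix on the next step.
        self.k = np.vstack([self.k, k_new])
        self.v = np.vstack([self.v, v_new])

    def attend(self, q: np.ndarray) -> np.ndarray:
        # Scaled dot-product attention of the newest query over
        # all cached positions.
        scores = self.k @ q / np.sqrt(self.k.shape[1])
        w = np.exp(scores - scores.max())
        w /= w.sum()
        return w @ self.v

cache = KVCache(d_head=4)
rng = np.random.default_rng(0)
for _ in range(3):  # three decode steps, each O(1) new projections
    cache.append(rng.normal(size=(1, 4)), rng.normal(size=(1, 4)))
out = cache.attend(rng.normal(size=4))
```

A real implementation also preallocates the cache up to the context limit, which is exactly where a "larger context window gets slow" symptom can come from if that part is missing or naive.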
Does this support int4 tensor core operations on the Nvidia Turing & Ampere architectures? From what I've researched this would be a huge untapped speedup and memory saver for inference, but it's mostly undocumented, unsupported by pytorch etc. It'd basically be like llama.cpp but with a 10x or more GPU speedup, if someone was willing to dig in and write the CUDA logic for it. Given how fast llama.cpp already is on the CPU, this would be impressive to see.
I've been tempted to try it myself, but the thought of faster LLaMA / Alpaca / Vicuna 7B when I already have cheap gpt-3.5-turbo access (a better model in most ways) was never compelling enough to justify wading into weird, semi-documented hardware.
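Independent of the tensor-core speedup question, the memory-saving half of int4 is simple: two 4-bit weights pack into one byte, halving storage versus int8. A hypothetical sketch of pack/unpack for unsigned 4-bit values (real quantization schemes also carry per-group scales and zero-points, which are omitted here):

```python
def pack_int4(vals: list[int]) -> bytes:
    """Pack pairs of values in [0, 15] into single bytes, low nibble first."""
    assert len(vals) % 2 == 0 and all(0 <= v <= 15 for v in vals)
    return bytes(vals[i] | (vals[i + 1] << 4) for i in range(0, len(vals), 2))

def unpack_int4(packed: bytes) -> list[int]:
    """Recover the original 4-bit values from the packed bytes."""
    out = []
    for b in packed:
        out.append(b & 0x0F)  # low nibble
        out.append(b >> 4)    # high nibble
    return out

ws = [3, 12, 0, 15]
assert unpack_int4(pack_int4(ws)) == ws  # round-trips at half the bytes
```

The hard part the parent comment describes is not this packing but writing CUDA kernels that feed such packed weights to the int4 tensor-core MMA paths efficiently.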
TVM Unity has a CUDA backend, and TensorCore MMA instructions are supported, so it wouldn't be hard to turn this option on.
It's on our roadmap, but we haven't enabled it by default yet, mainly because we wanted to demonstrate it running on all GPUs, including old models that don't have TensorCores at all.
The 7B model takes ~6 GB of VRAM, so yes, it is already 4-bit quantized.
There are already efforts underway with GPTQ libraries, but I have found they incur a substantial performance penalty, with the benefit of consuming much less VRAM.
EDIT: I had a look at the repo; it appears the Vicuna model is using 3-bit quantization.
Nobody wants to run Google on their PCs, so why should LLMs be different? I'd expect GPT models to be updated regularly fairly soon, and in much the same way that I wouldn't want to host a personal, outdated web index plus search engine, LLMs seem a perfect fit for server-side services given their requirements. Barely anyone even hosts their own blogs or mail. What's the excitement about getting it almost running on a phone?
> What's the excitement about getting it almost running on a phone?
You get to decide what is appropriate or not.
It works offline.
It can be used by applications without having to rely on an external service.
This can be important for a number of applications (I am thinking about open source games and the modding community right now, but it is just an example)
I think it is important to keep that data on-device. I would prefer ChatGPT to run on my PC instead of on a giant company's servers. I have very sensitive conversations with it, and there is no guarantee that OpenAI keeps this data secret. Corporations would also want to keep their data in-house; it could be a huge problem if a developer leaked private documents onto the internet.
More to the point, I find it absolutely bizarre that one couldn't come up with quite a few reasons to have this be more private, whether personal or business.
Your personal or business stuff in other people's hands is generally not optimal or preferable, especially when more private options exist.
Because LLMs are expensive to host. It's more a case of providers not wanting to run these on their own servers: it's cheapest if the model ends up running on the client's hardware instead of your own. Not all use cases of LLMs need a super powerful model that is always up to date.
I think the privacy aspect is also interesting; think about journal or second-brain type applications. These could benefit a lot from language models, but you don't want that kind of information sent to a cloud provider unencrypted.
1) For people who want to make money using AI but can't afford to pay for an LLM or servers, they can push that cost to the end user.
2) for people who want to generate porn or spam (probably, also to make money)
The privacy thing is complete nonsense. If you want a private server, rent your own private server. If you’re worried AWS is spying on you, you’re paranoid.
This is about money, making money and being cheap, not about good will.
So, you’re right; from a consumer perspective it’s pretty meaningless.
What useful things are people able to do with the smaller, less accurate models? I have a hard time understanding why I would build on top of this rather than just the OpenAI API, since the performance is so much better.
This is not a direct answer to your question, but performance is better in terms of the _quality_ of completions but not in terms of price, latency, or uptime.
"The marvel is not that the bear dances well, but that the bear dances at all."
Those two events are causally related. The OS has to throttle down the CPU or else it will overheat and malfunction.
It is one of the reasons why heavy number crunching is often performed on the cloud instead.
The model we are using is a quantized Vicuna-7b, which I believe is one of the best open-sourced models. Hallucination is a problem to all LLMs, but I believe research on model side would gradually alleviate this problem :-)
https://apps.apple.com/app/id6444050820
BTW, an interesting data point from Reddit: it also works on a Steam Deck: https://www.reddit.com/r/LocalLLaMA/comments/132igcy/comment....
But for some reason it dramatically slows down after a few messages
Edit:
Oh no, this one also gives lectures instead of answering questions.
https://i.imgur.com/eiuGzK4.jpg
I'm afraid that in the near future, the only organic content on the internet will be the type of content that LLMs refuse to generate.