antirez · 2 years ago
Use llama.cpp for quantized model inference. It is simpler (no Docker or Python required), faster (works well on CPUs), and supports many models.

Also, there are better models than the one suggested: Mistral at 7B parameters; Yi if you want to go larger and happen to have 32 GB of memory. Mixtral MoE is the best, but it requires too much memory right now for most users.
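
For anyone new to it, here is a minimal sketch of quantized CPU inference through the llama-cpp-python bindings (the model file name and prompt are placeholders):

    # Minimal sketch using the llama-cpp-python bindings; the GGUF file name
    # below is a placeholder for whatever quantized model you downloaded.
    from llama_cpp import Llama

    # Load a 4-bit quantized Mistral 7B with a modest context window.
    llm = Llama(model_path="./mistral-7b-instruct.Q4_K_M.gguf", n_ctx=2048)

    # Run a single completion on the CPU.
    out = llm("Q: Name the planets in the solar system. A:", max_tokens=64, stop=["Q:"])
    print(out["choices"][0]["text"])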

j-bos · 2 years ago
I'm curious, what do you use these small LLMs for? Can you give some examples of (not too) personal use cases from the past month?
SOLAR_FIELDS · 2 years ago
My understanding (I haven’t used a fine-tuned one) is that you can use a model you fine-tune yourself for narrow automation tasks, kind of like a superpowered script. From my Llama 2 7B experiments I have not gotten great results out of the non-fine-tuned versions of the model for coding tasks. I haven’t tried Code Llama yet.
physicsgraph · 2 years ago
Thanks for the suggestion. I'm new to running LLMs, so I'll take a look [0]. My ~10-year-old MacBook Air has 4 GB of RAM, so I'm primarily interested in smaller LLMs.

[0] https://github.com/ggerganov/llama.cpp

akx · 2 years ago
You don't necessarily need to fit the whole model in memory – llama.cpp supports mmapping the model directly from disk in some cases. Naturally, inference speed will be affected.
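
If you use the llama-cpp-python bindings, that behaviour is exposed as constructor flags (a sketch; the model path is a placeholder):

    # Sketch of llama.cpp's mmap behaviour via the llama-cpp-python bindings.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./model.Q4_K_M.gguf",  # placeholder path to a quantized GGUF file
        use_mmap=True,    # map weights from disk instead of reading them all into RAM
        use_mlock=False,  # don't pin pages, so the OS can evict them under memory pressure
    )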
TotalCrackpot · 2 years ago
Btw, shouldn't it in theory be possible to run Mixtral MoE by loading each expert sequentially, storing its outputs, and then running the rest of the algorithm, so that it's easier to run on machines that cannot fit the whole model in memory?
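
Something like this toy sketch of a single MoE layer (not Mixtral's actual code; the shapes and the on-disk .npy experts are made up for illustration):

    # Toy sketch: load each selected expert's weights from disk one at a time,
    # accumulate its gated output, and free it before loading the next expert.
    import numpy as np

    def moe_layer_offloaded(x, router_weights, expert_paths, top_k=2):
        scores = x @ router_weights                  # router logits, one per expert
        chosen = np.argsort(scores)[-top_k:]         # pick the top-k experts
        gates = np.exp(scores[chosen])
        gates /= gates.sum()                         # softmax over the chosen experts

        out = np.zeros_like(x)
        for gate, idx in zip(gates, chosen):
            w = np.load(expert_paths[idx])           # load a single expert from disk
            out += gate * (x @ w)                    # apply it, weighted by its gate
            del w                                    # release it before the next load
        return out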
wfhpw · 2 years ago
Yes, but loading weights into memory takes time.
PeterStuer · 2 years ago
Python is only used in the toolchain; the inference engine is entirely C/C++.
upon_drumhead · 2 years ago
I’m a tad confused

> TinyChatEngine provides an off-line open-source large language model (LLM) that has been reduced in size.

But then they download the models from Hugging Face. I don’t understand how these are smaller. Or do they modify them locally?

lrem · 2 years ago
https://github.com/mit-han-lab/TinyChatEngine

Turns out the original source is actually somewhat informative, including telling you how much hardware you need. This blog post looks like the typical note you leave for yourself to annotate a bit of your shell history.

pmontra · 2 years ago
I wish that all these repos were clearer about the hardware requirements. Seeing that it runs on an 8 GB Raspberry Pi, probably with abysmal performance, I'd say that it will run on my 32 GB Intel laptop on the CPU. Will it run on its Nvidia card? I remember that the rule of thumb was one GB of GPU RAM per billion parameters, so I'd say that it won't run. However, this uses 4-bit quantization, so it could have lower requirements.
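
For what it's worth, a back-of-the-envelope sketch of the weight memory alone (it ignores activations, the KV cache and runtime overhead):

    # Approximate weight memory for a 7B-parameter model at different precisions.
    params = 7e9
    for bits in (16, 8, 4):
        gib = params * bits / 8 / 2**30
        print(f"{bits}-bit: ~{gib:.1f} GiB")
    # 16-bit: ~13.0 GiB, 8-bit: ~6.5 GiB, 4-bit: ~3.3 GiB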

Of course the main problem is that I don't know enough about the subject to reason on it on my own.

physicsgraph · 2 years ago
Your assessment is exactly correct -- the blog post is my note-to-self about getting the repo to work. My "added value" in the post is a Dockerfile for ease of installation.
PeterStuer · 2 years ago
They have postprocessed the models specifically for size and latency. They published several papers on this.

Their optimized models are not downloaded from HF but from Dropbox. I have no idea why.

rodnim · 2 years ago
"Small large" ..... so, medium? :)
the_sleaze9 · 2 years ago
No - LLMs can't talk to the dead, they're just fancy autocompletes
aravindgp · 2 years ago
I have used them and I can say they're pretty decent overall. I personally plan to use TinyEngine on IoT devices; it is aimed at even smaller IoT microcontroller devices.
jcjmcclean · 2 years ago
May I ask what your use case is? I've found LLMs are pretty good at parsing unstructured data into JSON, with minimal hallucinations.
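
Roughly this pattern, sketched here with the llama-cpp-python bindings (the prompt, schema and model path are all made up for illustration):

    # Sketch: ask the model to emit only JSON, then parse it.
    import json
    from llama_cpp import Llama

    llm = Llama(model_path="./mistral-7b-instruct.Q4_K_M.gguf", n_ctx=2048)

    text = "Invoice from Acme Corp dated 2023-11-05, total due $1,250.00"
    prompt = (
        "Extract the vendor, date and total from the text below. "
        'Respond with only a JSON object like {"vendor": "", "date": "", "total": ""}.\n\n'
        f"Text: {text}\nJSON:"
    )

    raw = llm(prompt, max_tokens=128, stop=["\n\n"])["choices"][0]["text"]
    record = json.loads(raw)  # raises if the model strays from pure JSON
    print(record)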
collyw · 2 years ago
Is there a tutorial on how to do something like that? It sounds damn useful.
hm-nah · 2 years ago
I’m also curious.
collyw · 2 years ago
Where is a good place to understand the high-level topics in AI, like an offline language model compared to a presumably online model?
dkjaudyeqooe · 2 years ago
I tried this and installation was easy on macOS 10.14.6 (once I updated Clang correctly).

Performance on my relatively old i5-8600 CPU (6 cores at 3.10 GHz) with 32 GB of memory is about 150-250 ms per token on the default model, which is perfectly usable.