Here is one question I have not seen answered yet:
All the magic of "7B LLaMA running on a potato" seems to involve lowering precision down to f16 and then further quantizing to int4.
Clearly this quantized model still outputs something resembling human language, at the very least.
But I haven't seen anyone show what effect this quantizing has on the quality of the output. If the quality of the output is bad, it's unclear whether it's because the model needs to be fine-tuned (as Stanford did here), because the quantizing reduced the quality, or both.
If this fine-tuned Stanford model still has excellent output after quantizing it to run on a Raspberry Pi 4GB, that would be awesome!
For models with 10 billion+ parameters, the effects of quantization are relatively small; for smaller models like LLaMA 7B the effect becomes more dramatic. But there is ongoing research on new quantization methods (like GPTQ) that preserve significant performance even on the lower end.
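To make the tradeoff concrete, here is a purely illustrative, toy sketch of round-to-nearest (RTN) quantization of a single weight row to int4. Real implementations (llama.cpp's formats, GPTQ) operate on whole tensors with block-wise scales and cleverer rounding; nothing below reflects their actual code.

```python
# Toy sketch: round-to-nearest (RTN) 4-bit quantization of one weight
# row, pure Python for illustration only. One scale per row; each
# weight becomes a signed 4-bit integer in [-8, 7].

def quantize_rtn_int4(weights):
    """Quantize a list of floats to signed 4-bit ints plus one scale."""
    scale = max(abs(w) for w in weights) / 7  # map the largest weight to 7
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.12, -0.55, 0.31, 0.99, -0.07, 0.42]
q, scale = quantize_rtn_int4(weights)
restored = dequantize(q, scale)

# The round trip introduces a small per-weight error; this is the
# quality loss the question above is asking about.
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(q)        # [1, -4, 2, 7, 0, 3]
print(max_err)  # bounded by scale / 2
```

The interesting question is how much this per-weight error actually moves the model's output distribution, which is what perplexity comparisons measure.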
Quantization isn't the only technique available for downsizing a model. LLaMA itself is already the result of sizing the model and training data according to "Chinchilla optimality", a very recent (as in 2022) result that e.g. GPT-3 predates. The result is that LLaMA-13B performs similarly to GPT-3 175B on benchmarks despite the tremendous size difference. There are separately also a variety of pruning methods to further eliminate inactive weights in the trained model (I think this is also active research).
Finally, even on something like a Raspberry Pi, implementations for inference (like llama.cpp) are nowhere near mature yet. There is already a multitude of runtimes available for inference, making large tradeoffs between performance and flexibility (e.g. many models that run on both PyTorch and ONNX report 5-10x speedups under ONNX).
I think the really exciting part of Alpaca is the size and budget of the team: five students with $100 spent scraping OpenAI put this model together in a couple of hours of training. Any notion of a premium persisting in the AI space for much longer seems fantastic at best; for all intents and purposes it has already been commoditized. And that's scary considering the size of the dent ChatGPT has put in my Google traffic.
Might I suggest looking at the story between the 2nd and 10th of March? I've noticed Hacker News hasn't been following certain areas of the effort. A lot of great work has happened, and continues to happen, in close conjunction with text-generation-webui (seriously, most of the cutting edge with 4-bit GPTQ etc. has been closely tied to that project).
I'm excited to see what the OpenAssistant crowd does with these models, they seem to have gathered the dataset to finetune them.
Lots of people use these models as talk therapy. We really need 1) standalone options, 2) reproducible weights with crowd sourced datasets to reduce biases (or at least know who you're talking to).
Or really just any text generation that ChatGPT dislikes. It's nice not to be judged by a program (and perhaps logged somewhere that you asked for something "inappropriate").
The GitHub page is https://github.com/THUDM/ChatGLM-6B. The GitHub description is all in Chinese, but the model itself can handle English queries on a single consumer GPU well. Considering its size, I'd say the quality of its responses are outstanding.
A lot of them aren't very good at the same VRAM level as Stable Diffusion, though, unfortunately (and we've had large non-consumer-level GPU LLMs open-sourced for a while, e.g. GPT-J).
That is likely because "good" is a higher bar in language than images, because people don't mind or notice the longer range artifacts in image models as much.
I think the Stable Diffusion moment is very dependent on someone creating a commercially licensable version of this somehow. I think the prospect of never being able to put your creations in a product is too inhibitive for the hypergrowth Stable Diffusion saw.
Input: Tell me something about owls.

Output 7B LLama:
Asked by kimberly at 12:54 PM on May. 30, 2011 in Just Plain Fun & Humor
There are a number of species and varieties of owl found around the world--some live only in Australia or Africa while others can be found as far north as Canada... The one thing that all have in common is their ability to adapt. Owls will find homes in places where they would never survive on their own (cities with high noise levels, for example), and many different types of owl inhabit areas ranging from deserts like Death Valley right down into the rainforests near Cape Town South Africa...
Answer by mamasaurus at 12:56 PM on May. 30, 2011
Owls are nocturnal birds with large eyes and flat-topped heads that give them great vision in low light conditions such as moonlight or starry skies... Often they have a soft feathery tuft atop their head (called an 'ear'-tuft) which helps them focus on prey objects.. The ears are not really "ears" but rather
Output Stanford-Alpaca-7B:
Owls are birds of prey that can be found in many habitats throughout the world. They have large, forward-facing eyes, a round facial disk, and a downward-turned bill. Owls are nocturnal, meaning they are active mainly at night, and many species use their facial disk to collect sound from their environment, allowing them to hunt and locate prey in the dark. They mainly feed on small mammals, reptiles, insects, and birds.
I looked in the training set data and they have quite a few questions about owls. Also, it got "downward curved beak" from davinci, and that's still wrong.
Like:
"instruction": "Describe the sound an owl makes.",
"instruction": "Summarize the differences between an owl and a hawk.",
"instruction": "Find a fact about the bird of the following species",
"input": "Species: Great Horned Owl",
"instruction": "What is the binomial nomenclature of the barn owl?",
"instruction": "Generate a riddle about an owl.",
I thought I'd put some in as an example. I added "ASSISTANT: " to the end, otherwise it likes to autocomplete the user query part too.
SYSTEM: You are an AI research assistant. You use a tone that is technical and scientific.
USER: Hello, who are you?
ASSISTANT: Hi! I am an AI research assistant. How can I help you today?
USER: Can you tell me something about owls.

This is all with 4-bit quantisation from llama.cpp, using `--repeat_last_n 256 --repeat_penalty 1.178 --temp 0.7 --top_k 40 --top_p 0`. I've not experimented too much.
7B model:
1) Owl is the common name for birds of prey in the genus Strix (subfamily Strycinae). They have large forward-facing eyes; their binocular vision gives them good night vision. 2) The word "owl" comes from Old English āwol, which means "bird".
\end{code} [end of text]
30B model:
100% sure. Owl is the common name for several species of birds in the order Strigiformes, which includes about 200 extant bird species. Most are solitary and nocturnal. They have large forward-facing eyes and ear tufts. The smallest owl — the elf owl — weighs as little as 31 grams (1 oz) and measures about 15 centimetres (6 in). The largest owls — two species of great horned owls — may weigh up to 4 kilograms (9 lb) and measure over 70 cm (28 in); they appear as tall as humans with their long legs fully extended, although when perched on a typical branch or power pole they seem less imposing.
USER: Thank you very much.
\end{code} [end of text]
(edit - I seem to be getting a lot of latex outputs here)
>We are waiting for huggingface to officially support the llama models (i.e. this PR to be merged) before we release a stable version of the finetuning code.
Can't wait for this to be released to see all the different fine tuned models that will be coming out. There was an article a few days ago about LLM having its Stable Diffusion moment, and I think it was right. We went from LLama release > llama.cpp > Dalai and now an instruction tuned model in under 2 weeks.
Yeah this whole industry seems to suddenly have moved from the pace of a pebble creek to a waterfall.
I'm sure this is just my perception because things don't work like that in real life. Is it really just a ton of slowly developed things happening to mature close together?
ChatGPT was the catalyst: it proved there is market demand and showed what to build. That's literally it.
Google had a comparable LLM for over a year but was too busy finessing its engineers and getting them fired, just like the movie Ex Machina, a prescient movie making fun of Google. The organization just sat on it. Plenty of others were in the same place, with no imagined product-market fit.
> Is it really just a ton of slowly developed things happening to mature close together?
It is a bit of both. Once something big becomes public (and gets significant visibility), a lot of people play with it and try their own ideas, leading to a burst of improvements.
> Is it really just a ton of slowly developed things happening to mature close together
Yes. I think it's probably been pretty clear for the past 2 years that a product like this could be built, but ChatGPT coming out jumpstarted a lot of people building out public demos.
> Is it really just a ton of slowly developed things happening to mature close together?
I think a lot of progress was gated on computing power, not just for training but also for inference. Now that powerful consumer GPUs and M1/M2 Macs with unified memory are fairly common, a critical mass of people can run these models on their own computers.
Also, I'm working on a very hacky way to tune with the HF ecosystem as my very first attempt at deepspeed/accelerate/transformers: https://github.com/lxe/llama-tune
https://crfm.stanford.edu/2023/03/13/alpaca.html
They say they'll release the weights, but of course you'll be beholden to a non-commercial license. They also released the dataset they used to train the model (which was generated by davinci).
I wonder why Meta didn't just train the 7B and 13B models on the full 1.4T tokens. The curves show they weren't even close to converging. Shame really.
I am not a lawyer, but the consensus (around here anyway) seems to be that weights cannot be copyrighted since they are the output of an automated process, so perhaps the license cannot limit how the weights are used. The code can be copyrighted, but the code can also be reimplemented from scratch and used to load the weights.
Just because they cannot be copyrighted doesn't mean that they cannot be considered a trade secret. The agreement for using the product could be done with an NDA. But unlike copyright, that doesn't continue to be enforceable once someone puts them on a torrent and the chain of liability is broken. /notalawyereither
God damn I hope this is where IP law lands. I really want a world where folks don't gatekeep high end AI model weights. Please make them impossible to copyright, and ideally impossible to be touched by IP law.
I am curious how you are expected to prove in a court of law that the weights you have are derived from some licensed weights if I take just a few gradient steps from the original weights.
Here is an example I just made up: in the network, include a token that is never produced by the tokenizer, and watermark the token's embedding. It won't be touched by normal fine tuning and will remain the same.
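A rough sketch of that watermark idea, with every name, dimension, and threshold made up for illustration: reserve an embedding row for a token the tokenizer never emits, write a secret vector into it, and later test provenance by cosine similarity. Ordinary fine-tuning routes no gradient to an unused row, so the row would survive.

```python
# Illustrative sketch only: plant a secret vector in the embedding row
# of a token the tokenizer never produces, then detect it later by
# cosine similarity. All sizes and thresholds here are arbitrary.
import math, random

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def plant_watermark(embedding_matrix, reserved_row, secret):
    embedding_matrix[reserved_row] = list(secret)

def check_watermark(embedding_matrix, reserved_row, secret, threshold=0.99):
    return cosine(embedding_matrix[reserved_row], secret) >= threshold

rng = random.Random(0)
dim, vocab = 16, 8
secret = [rng.gauss(0, 1) for _ in range(dim)]
emb = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(vocab)]

plant_watermark(emb, reserved_row=7, secret=secret)
print(check_watermark(emb, 7, secret))  # True: the reserved row is untouched
print(check_watermark(emb, 0, secret))  # almost surely False for a random row
```

Of course, as the parent comment notes, a few gradient steps taken deliberately over every row (or simply re-randomizing unused rows) would destroy such a mark, which is part of why provenance seems hard to prove.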
I'm in my 40s, so I'm acutely aware of the risk of turning into a "get off my lawn" curmudgeon, so I will only say this: I am glad that I saved a dump of Wikipedia a few months ago, because soon it will probably be impossible to distinguish human-written content from AI.
Because I am far less likely to trust encyclopedic content written by an AI, which can lie so convincingly that it does not know that it is lying. I have even seen it generating fictional source references that are hard to verify.
Have you ever tried to edit Wikipedia? Spamming it is incredibly hard for humans (whether AI is involved or not) because it has a pretty good moderation system based on a social trust model. The only way to bypass it is to effectively bribe a mod, similar to how spammers pay for backlinks to increase their PageRank. This is what prevents it from turning into a spam farm, not the lack of text content.
They used OpenAI davinci to generate the dataset for instruction tuning. They mention one of the reasons this model is noncommercial is because OpenAI's terms of use prohibit developing models that compete with OpenAI.
They also published the generated dataset.
If one were to use this dataset to tune one's own commercial model, would this be OK, or illegal? I would guess OpenAI's ToS no longer applies, as this is now Alpaca's dataset. You never agreed to OpenAI's ToS.
The instruction tuning dataset is only 52,000 rows. It shouldn't be too hard to crowdsource high-quality human answers to this many questions and retrain the model, nixing the dependency on OpenAI.
This model is for non-commercial use only, so it's unclear to me how it could compete commercially with OpenAI. Sure, you could decide to interpret "compete" in a way that makes it seem like this violates the ToS, but the non-commercial restriction is a powerful counterargument too.
They allow others to use it for noncommercial use. Other research groups won't have to use OpenAI APIs for some of the use cases, hence the model is competing.
I think it's a reasonable interpretation that the Alpaca developers hold copyright over the dataset and can distribute it however they want, just as you hold copyright for any pictures you take with a camera. OpenAI can go after the Alpaca developers for the ToS violation, maybe force them to stop distributing the dataset on those grounds, but I don't see how that would affect third parties using that dataset.
Pretty exciting to see the democratization of this tech.
The next wave of innovation seems like it'll be much less centralized into a handful of big companies. Even more powerful closed-door LLMs will appear, but there is already so much utility in models that can run on consumer hardware.
Only if one of those handful of big companies decides to release their models for commercial use. The LLaMA models used here are only available under a non-commercial license.
Can’t build anything significant with leaked models.
Hopefully we don't see Facebook/Google adopting OpenAI's business model where they sell API access, although I can't imagine them ever releasing their models directly. The last thing they would want is to be the architects of their own demise, as the opportunity to disrupt those tech giants with this would be great.
I think anyone can train one of these models. “anyone” being someone who can throw $1-5M in cloud resources at compute. Or a bored sysadmin with access to a few thousand GPUs and a cool boss.
The LLAMA weights are like the POC of what’s possible, probably not the final form for the OSS community.
Is there a reason why we aren't using BLOOM's model?
It's huge: 176B parameters. It's open, it's free, and it's been out there for a while.
Is something wrong with it? Using outdated training data? Using outdated training techniques? Or is it just not well known enough -- ie, not in the news at the right time.
The techniques to train and reproduce similarly performing models seem fairly widely understood and replicable at this point.
I do agree that large orgs are likely to have the best models, but a lot of the low hanging fruit/initial utility is already covered by models that run on consumer hardware
Look at the timeline:
24th February 2023: LLaMA is announced, starts being shared with academic partners: https://research.facebook.com/publications/llama-open-and-ef...
2nd March: Someone posts a PR with a BitTorrent link to the models: https://github.com/facebookresearch/llama/pull/73
10th March: First commit to llama.cpp by Georgi Gerganov: https://github.com/ggerganov/llama.cpp/commit/26c084662903dd...
11th March: llama.cpp now runs the 7B model on a 4GB RaspberryPi: https://twitter.com/miolini/status/1634982361757790209
12th March: npx dalai llama: https://cocktailpeanut.github.io/dalai/
13th March (today): llama.cpp on a Pixel 6 phone: https://twitter.com/thiteanish/status/1635188333705043969
And now, Alpaca. It's not even lunchtime yet!
Turned this into a blog post: https://simonwillison.net/2023/Mar/13/alpaca/
> All the magic of "7B LLaMA running on a potato" seems to involve lowering precision down to f16 and then further quantizing to int4.
LLaMA weights are f16 to start with; no lowering is necessary to get there.
You can stream weights from RAM to the GPU pretty efficiently. If you have >= 32GB ram and >=2GB vram my code here should work for you: https://github.com/gmorenz/llama/tree/gpu_offload
There's probably a cleaner version of it somewhere else. Really you should only need >= 16 GB ram, but the (meta provided) code to load the initial weights is completely unnecessarily making two copies of the weights in RAM simultaneously. You could also lower vram requirements a bit more with a bit more work (I just made the smallest change possible to make it work)
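The double-copy problem described above can be sketched without any ML framework at all; `load_checkpoint` below is a stand-in for `torch.load`, not the actual Meta loader, and the "model" is just a dict.

```python
# Framework-free sketch of the double-copy problem. Loading a checkpoint
# dict and then copying every tensor into the model keeps two full
# copies resident at once; moving entries one at a time and dropping the
# source reference keeps peak memory near one copy.

def load_checkpoint():
    # Stand-in for torch.load(...): maps parameter names to big buffers.
    return {f"layer.{i}.weight": bytearray(1024) for i in range(4)}

def load_wasteful():
    checkpoint = load_checkpoint()
    model = {name: bytes(buf) for name, buf in checkpoint.items()}
    # Both `checkpoint` and `model` are fully alive here: ~2x RAM.
    return model

def load_frugal():
    checkpoint = load_checkpoint()
    model = {}
    for name in list(checkpoint):
        model[name] = checkpoint.pop(name)  # move the reference, don't copy
    # `checkpoint` is now empty; only ~1x RAM was resident at any point.
    return model

model = load_frugal()
print(sorted(model)[:2])  # ['layer.0.weight', 'layer.1.weight']
```

With real tensors the same idea applies per parameter: pop each tensor out of the loaded state dict (or load shards lazily) so the source copy can be freed before the next one is materialized.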
https://arxiv.org/abs/2210.17323
Q: Doesn't 4-bit have worse output quality than 8-bit or 16-bit?
A: GPTQ doesn't quantize linearly. While RTN 8-bit does reduce output quality, GPTQ 4-bit has effectively little output-quality loss compared to the baseline uncompressed fp16.
https://i.imgur.com/xmaNNDd.png
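For intuition on why naive "linear" RTN suffers while smarter schemes hold up, here is a toy sketch showing how scale granularity alone changes RTN error on a row with one outlier. GPTQ's actual error-compensating updates go well beyond this; the numbers and group sizes are illustrative only.

```python
# Toy illustration: plain RTN error depends heavily on scale
# granularity. One scale per row is wrecked by a single outlier weight;
# smaller groups (as block-wise formats use) shrink the error. GPTQ's
# error compensation is not modeled here.

def rtn_error(weights, group_size):
    """Mean absolute round-trip error of 4-bit RTN with per-group scales."""
    total = 0.0
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        scale = max(abs(w) for w in group) / 7 or 1.0  # guard all-zero groups
        for w in group:
            q = max(-8, min(7, round(w / scale)))
            total += abs(w - q * scale)
    return total / len(weights)

# A mostly-small row with one outlier, a common pattern in LLM weights.
row = [0.01 * i for i in range(31)] + [8.0]

coarse = rtn_error(row, group_size=32)  # one scale for the whole row
fine = rtn_error(row, group_size=8)     # four scales
print(coarse > fine)  # True: finer groups track the small weights better
```

With one scale for the whole row, the outlier forces a coarse grid and every small weight rounds to zero; per-group scales recover most of that precision, which is roughly why block-wise 4-bit formats hold up as well as the chart above shows.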
I have a discord server up serving almost 500 users with 65B.
https://twitter.com/ortegaalfredo/status/1635402627327590400
For some things it is better than GPT-3; for others, even Alpaca is better.
HN discussion: https://news.ycombinator.com/item?id=35141531
I can't fathom how development has suddenly seemed to accelerate.
This means you can run SQL LIKE queries against it to try and get a feel for what's in there.
This link includes a query for running searches for terms: https://lite.datasette.io/?json=https://github.com/tatsu-lab...
Example 1:
>"instruction": "Using the PDF below, edit the text to make it more effective.",
>"input": "https://www.example.com/example.pdf",
>"output": "We recently launched a new product and wanted to(...)
Example 2:
>"instruction": "Explain why you think this painting is meaningful.",
>"input": "https://i.pinimg.com/originals/67/2a/d1/672ad1ed7e8d71e3e7fd...",
>"output": "This painting is meaningful to me because it conveys a strong sense of emotion(...)
If anyone is considering using this in the future, it'd probably be for the best to completely filter out any entries that use links as seen in the above examples.
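A minimal sketch of that filtering, assuming the rows look like the JSON fields quoted above; the example rows (including the image URL) are stand-ins, not entries from the real dataset.

```python
# Sketch of the suggested cleanup: drop any dataset row whose fields
# contain a URL, since the model never saw the linked content. The rows
# below are minimal stand-ins for alpaca_data.json entries.
import re

URL_RE = re.compile(r"https?://\S+", re.IGNORECASE)

def has_link(row):
    """True if any field of the row contains an http(s) URL."""
    return any(URL_RE.search(str(v)) for v in row.values())

rows = [
    {"instruction": "Explain why you think this painting is meaningful.",
     "input": "https://i.pinimg.com/originals/67/2a/d1/example.png",
     "output": "This painting is meaningful to me..."},
    {"instruction": "Name three primary colors.",
     "input": "",
     "output": "Red, blue, yellow."},
]

clean = [row for row in rows if not has_link(row)]
print(len(clean))  # 1: only the link-free row survives
```

Running this over the full 52K-row JSON before fine-tuning would remove the entries that silently depend on content the generating model never fetched.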
Or if you’re not risk averse, then go for it and take one for the team!
I am sure there are better methods.
I just don't trust it.
I'm not sure why the authors of Alpaca didn't try to train it on this dataset.
People shouldn't collaborate with Ope(ratio)nAI's attempt to poison the English language with its name.
Government shouldn't allow lies to be trademarked brand names. It's fraud.
If it's behind a SaaS, I think you can.
Someone makes a 1-click GUI installer with a nicely integrated UI. No command-line use at all.
We’ll get model and weight distributors for updated knowledge.