On my system, using `-ngl 22` (running 22 layers on the GPU) cuts wall clock time by ~60%.
My system:
GPU: NVidia RTX 2070S (8GB VRAM)
CPU: AMD Ryzen 5 3600 (16GB RAM)
Here's the performance difference I see:
CPU only (./main -t 12)
llama_print_timings: load time = 15459.43 ms
llama_print_timings: sample time = 23.64 ms / 38 runs ( 0.62 ms per token)
llama_print_timings: prompt eval time = 9338.10 ms / 356 tokens ( 26.23 ms per token)
llama_print_timings: eval time = 31700.73 ms / 37 runs ( 856.78 ms per token)
llama_print_timings: total time = 47192.68 ms
GPU (./main -t 12 -ngl 22)
llama_print_timings: load time = 10285.15 ms
llama_print_timings: sample time = 21.60 ms / 35 runs ( 0.62 ms per token)
llama_print_timings: prompt eval time = 3889.65 ms / 356 tokens ( 10.93 ms per token)
llama_print_timings: eval time = 8126.90 ms / 34 runs ( 239.03 ms per token)
llama_print_timings: total time = 18441.22 ms
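The speedup those two runs imply can be checked with a few lines of Python (all numbers copied from the llama_print_timings output above):

```python
# Timings from the two runs above, in milliseconds.
cpu = {"eval_per_tok": 856.78, "prompt_per_tok": 26.23, "total": 47192.68}
gpu = {"eval_per_tok": 239.03, "prompt_per_tok": 10.93, "total": 18441.22}

# Wall-clock reduction from offloading 22 layers with -ngl 22.
total_saving = 1 - gpu["total"] / cpu["total"]
eval_speedup = cpu["eval_per_tok"] / gpu["eval_per_tok"]

print(f"total time reduced by {total_saving:.0%}")      # ~61%, i.e. the "~60%" claimed above
print(f"eval is {eval_speedup:.1f}x faster per token")  # ~3.6x
```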
I am testing it on an AWS instance and the speedup is not as consistent as I hoped; it varies between runs.
Intel Xeon Platinum 8259CL CPU @ 2.50GHz
128 GB RAM
Tesla T4
./main -t 12 -m models/gpt4-alpaca-lora-30B-4bit-GGML/gpt4-alpaca-lora-30b.ggml.q5_0.bin
llama_print_timings: load time = 3725.08 ms
llama_print_timings: sample time = 612.06 ms / 536 runs ( 1.14 ms per token)
llama_print_timings: prompt eval time = 13876.81 ms / 259 tokens ( 53.58 ms per token)
llama_print_timings: eval time = 221647.40 ms / 534 runs ( 415.07 ms per token)
llama_print_timings: total time = 239423.46 ms
./main -t 12 -m models/gpt4-alpaca-lora-30B-4bit-GGML/gpt4-alpaca-lora-30b.ggml.q5_0.bin -ngl 30
llama_print_timings: load time = 7638.95 ms
llama_print_timings: sample time = 280.81 ms / 294 runs ( 0.96 ms per token)
llama_print_timings: prompt eval time = 2197.82 ms / 2 tokens ( 1098.91 ms per token)
llama_print_timings: eval time = 112790.25 ms / 293 runs ( 384.95 ms per token)
llama_print_timings: total time = 120788.82 ms
- the model I used was gpt4-x-vicuna-13B.ggml.q5_1.bin
- I used 'time' to measure the wall clock time of each command.
- My prompt was:
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
Write a long blog post with 5 sections, about the pros and cons of emphasising procedural fluency over conceptual understanding, in high school math education.
### Response:
You will likely see a bit of a performance gain dropping your threads to 6. I'm on a 3700x and get a regression when using 16 threads instead of the real 8 cores.
Ha! You're right. Honestly, I couldn't recall how many cores I had, and I figured it was faster to run `cat /proc/cpuinfo |grep processor| wc -l` than to google the specs.
It seems like using the right number of cores saves ~25%, and then using the GPU for 26 layers saves another ~43%.
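Note the two savings compound multiplicatively rather than adding up; a quick sketch of the combined effect:

```python
# ~25% saved by the right thread count, then ~43% more on top from -ngl 26.
after_threads = 1.0 * (1 - 0.25)            # 0.75 of the original wall time
after_gpu     = after_threads * (1 - 0.43)  # 0.4275 of the original

print(f"combined: ~{1 - after_gpu:.0%} of the original time saved")  # ~57%
```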
I'm experimenting with these settings. On my machine (with a 6GB NVidia GEForce, 32 GB of RAM and Intel i5-10400 CPU @ 2.90GHz) I get worse performance offloading to the GPU.
276.23 ms per token without GPU, using only -t 12
296.04 ms per token with GPU using -t 12 -ngl 22
How do you determine the right fit for -t and -ngl? I definitely compiled using make LLAMA_CUBLAS=1
I'm using models/ggml-vic7b-uncensored-q5_1.bin, does the model matter significantly?
0. Have a PC with an NVidia GPU, running Ubuntu, with the NVidia drivers and CUDA Toolkit already set up.
1. Download the weights for the model you want to use, e.g. gpt4-x-vicuna-13B.ggml.q5_1.bin
2. Clone the llama.cpp repo, and use 'make LLAMA_CUBLAS=1' to compile it with support for CUBLAS (BLAS on GPU).
3. Run the resulting 'main' executable, with the -ngl option set to 18, so that it tries to load 18 layers of the model into the GPU's VRAM, instead of the system's RAM.
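Put together, steps 1-3 look roughly like this (the model path and the -ngl/-t values are examples from this thread, not universal settings):

```shell
# 2. Build llama.cpp with cuBLAS support.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make LLAMA_CUBLAS=1

# 3. Run, offloading 18 layers to the GPU (-ngl 18).
#    -t should match your physical core count; -m points at the weights from step 1.
./main -m models/gpt4-x-vicuna-13B.ggml.q5_1.bin -t 6 -ngl 18 -p "Hello"
```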
This is cool, but are people actually getting stuff done with these models? I'm enthusiastic about their potential too, but after playing with one for a day I'm at a loss for what to use it for.
I can run the Wizard 30B ggml model in CPU mode using a Ryzen 5700 and 16GB of system RAM, not GPU VRAM. I’m using oobabooga as the front end.
It’s slow, but if I ask it to write a Haiku, it’s slow on the order of “go brew some coffee and come back in 10 minutes”, and it does it very well. Running it overnight on something like “summarize an analysis of topic X”, it does a reasonable job.
It can produce answers to questions only slightly less well than ChatGPT (3.5). The Wizard 13B model runs much faster, maybe 2-3 tokens per second.
It is free, private, and runs on a midrange laptop.
A little more than a month ago that wasn’t possible, at least not with my level of knowledge of the tooling involved. Now it requires little more than running an executable and minor troubleshooting of Python dependencies (on another machine it “just worked”).
So: Don’t think of these posts as “doing it just because you can and it’s fun to tinker”
Vast strides are being made pretty much daily in both quality and efficiency, raising their utility while lowering the cost of usage, doing both to a very significant degree.
> It’s slow, but if I ask it to write a Haiku, it’s slow on the order of “go brew some coffee and come back in 10 minutes”, and it does it very well. Running it overnight on something like “summarize an analysis of topic X”, it does a reasonable job.
I'm sorry but that's unusably slow, even GPT-4 can take a retry or a prompt to fix certain type of issues. My experience is the open options require a lot more attempts/manual prompt tuning.
I can't think of a single workload where that is usable. That said once consumer GPUs are involved it does become usable
I don't know if anybody is following this thread anymore, but I find it interesting how similarly your timelines match what it was like to experiment with POV-Ray (a ray-tracing renderer) back in the early 1990s. Your difference in problem scope was like whether you had "a couple spheres on a checkerboard plane" or something more like "a chess set". Things seemed to change rapidly due to Moore's Law and the changes in brute force computing power available to normal people.
Computers got much more powerful over the next 30 years, and while ray tracing and related techniques now appear in more tool sets and games, they didn't fundamentally change the world of image generation or consumption. Most people still interact roughly as before, just with more detail in the eye candy.
Are we seeing these large language models today at a tipping point towards unfathomable societal impact, or as something like ray tracing in the 1990s? Will more compute power send us spiraling towards some large-model singularity, or just add more pixels until we are bored of seemingly endless checkerboard planes covered in spheres and cones... I don't know the answer, but it seems like we're seeing camps divided by this question of faith.
My experience is that even the "toy" 7B models can do pretty amazing things, but they're very sensitive to prompting in the right way to force the model down the right pathways. My assumption is that the larger models have more redundancy and therefore can be prompted less precisely while still going down the right pathway.
For example, the WizardLM model (5_1 quantized, as 4_x quantization makes the 7B models perform terribly):
Prompt: I went to the market and bought 10 apples. I gave 2 to my neighbor and 2 to my sister. Then I bought 5 more and ate 1. How many apples do I have?
Response:
Let me check that for you! You had 9 apples left after giving away 4, so you still have 9 - 4 = 5 apples left.
Prompt: I went to the market and bought 10 apples. I gave 2 to my neighbor and 2 to my sister. Then I bought 5 more and ate 1. How many apples do I have? Write out any calculations in full. List the steps used to come to the final result.
Response:
Sure, let's work through this problem together! Here are the steps we can use:
Step 1: Start with the initial number of apples you bought (10).
Step 2: Subtract the amount given to your neighbor and sister from the total (2 + 2 = 4). This gives us a new total of 6 apples.
Step 3: Add the additional 5 apples you bought later on to the remaining 6 apples, giving us a final total of 11 apples.
Step 4: Finally, subtract the apple you ate from the total (1), leaving us with 10 apples in total.
So the answer is 10 apples.
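For reference, the arithmetic being asked of the model; the step-by-step response above lands on the right answer, while the first response did not:

```python
apples = 10        # bought at the market
apples -= 2 + 2    # gave 2 to the neighbor, 2 to the sister -> 6
apples += 5        # bought 5 more -> 11
apples -= 1        # ate 1 -> 10
print(apples)      # 10
```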
I've always wondered if we could get a "good enough" general model, then add plugins for shortcomings.
E.g., can this model solve math problems? ~50% of the time. Can this model identify math problems? ~99.9% of the time. OK, then divert math questions to the math plugin and mathematical-problem-language-parser, which has ~99.9% accuracy on supported math problems.
I know it's a stopgap, but it seems like it'd make these models super useful on certain tasks quicker.
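A toy sketch of that routing idea. The regex "classifier" and eval-based "math plugin" are deliberately crude stand-ins for the ~99.9%-accurate components described above, purely to illustrate the dispatch:

```python
import re

def looks_like_math(prompt: str) -> bool:
    # Stand-in classifier: flag prompts containing an arithmetic expression.
    return bool(re.search(r"\d+\s*[-+*/]\s*\d+", prompt))

def math_plugin(prompt: str) -> str:
    # Stand-in solver: extract and evaluate the first arithmetic expression.
    expr = re.search(r"\d+(?:\s*[-+*/]\s*\d+)+", prompt).group()
    return str(eval(expr))  # fine for this toy; never eval untrusted input in practice

def answer(prompt: str) -> str:
    # Route math prompts to the plugin, everything else to the general model.
    if looks_like_math(prompt):
        return math_plugin(prompt)
    return "(route to the general LLM)"

print(answer("What is 12 * 7 - 5?"))   # 79
print(answer("Write me a haiku"))      # (route to the general LLM)
```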
I agree with you.
That's why all the small models are showing benchmarks putting them close to GPT-3.5 or even GPT-4: only because they use specific test tasks!
In a way it just shows the amazing performance that will come from small future models
Is it comparable to GPT-4? No, it's not remotely close. But it's getting closer every week, and it very recently crossed the threshold of 'it can do stuff I would never send to a cloud service' (namely, reading all of my messages and pulling out action items).
How do you deal with the 'too many tokens' problem? For 13B, the max tokens I've gotten it to accept is 508; is there a way to increase that number?
I tested on a serious use case and quality was subpar. For real use cases I had to either host the most powerful model you can get (e.g. LLaMA-65B or so) on a cloud machine, which again costs too much (you'll be paying like 500-1000 USD per month), or just go straight for GPT-3.5 on OpenAI. The latter economically makes most sense.
Maybe others' experiences are different, but I find smaller models to work just as well for "reductive" tasks.
Dolly sucks for generating long-form content (not very creative) but if I need a summary or classification, it's quicker and easier to spin up dolly-3b than vicuna-13b.
I suspect OpenAI is routing prompts to select models based on similar logic.
I haven't tried the fine-tuned variants yet, but when I played around with it shortly after the leak, it tended to quickly derail into nonsense when you let it complete sentences or paragraphs, especially when using other languages than English. When I tried to get it to produce Python code, most of the time it wasn't even syntactically correct.
Depends on what "getting stuff done" means. I find 13B models (running on my M1 Mac) useful for playing AI Dungeon-like games -- where you describe the environment and your character and you can play an RPG.
Why do AMD or Intel not release a medium-performance GPU with a minimum of 128GB of memory at a good consumer price? These models require lots of memory to run a single pass of an operation; throughput could be a bit slower.
An Nvidia 1080 with 256GB of memory would run all these models fast, right?
Or am I forgetting something here?
Because they can't do that for a "good consumer price".
If you want more than ~48GB, you're looking at HBM which is extremely expensive (HBM chips are very expensive, packaging+interposer is extremely expensive, designing and producing a new GPU is expensive).
Normal GPUs are limited by both their bus width (wider bus = more pins = harder to design, more expensive to produce, and higher power consumption) and by GDDR6(X), which currently maxes out at 2GB/chip. So on a 384-bit bus (4090/7900 XTX; don't expect anyone to make a 512-bit bus anymore) you get 12x2GB (GDDR6 uses 32 pins per package), which gives you 24GB. You can double the memory capacity to 48GB, but that requires putting chips on the back of the board, which leads to a bunch of cooling issues (and GDDR6 is expensive).
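Spelling out the capacity arithmetic from that paragraph (2GB per chip and 32 data pins per GDDR6 package, as stated above):

```python
bus_width_bits = 384  # 4090 / 7900 XTX class
pins_per_chip  = 32   # GDDR6 x32 package
gb_per_chip    = 2    # current max GDDR6 density

chips     = bus_width_bits // pins_per_chip  # 12 chips fill the bus
capacity  = chips * gb_per_chip              # 24 GB
clamshell = capacity * 2                     # 48 GB with two chips per channel

print(chips, capacity, clamshell)  # 12 24 48
```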
Of course, even if they did all that they're selling expensive GPUs to a small niche market and cannibalizing sales of their own high end products (and even if AMD somehow managed to magic up a 128GB gpu for $700 people still wouldn't buy it because so much of the ML software is CUDA only).
GDDR achieves higher speeds than normal DDR mainly by specifying much tighter tolerances on the electrical interface, and by using a wider interface to the memory chips. This means that using commodity GDDR (which is the only fast DRAM that will be reasonably cheap), you have fairly strict limits on the maximum amount of RAM you can use with the same GPUs that are manufactured for consumer use. (Typically at most a 4x difference between the lowest-end reasonable configuration and the highest-end one: 2x from higher-density modules and 2x from a clamshell memory configuration, although often there is only one type of module for a new memory-interface generation.)
If the product requires either a new memory or GPU die configuration, its cost will be very high.
The only type of memory that can support very different VRAM sizes for an efficiently utilized bus of the same size is HBM, and so far that is limited to the very high end.
Anandtech has an article on the GDDR6X variant[1] that NVIDIA has in their 3000-cards, where they use a more complex encoding to transmit two bits per clock edge.
I hadn't realized just how insane the bandwidth on the higher-end cards is, the 3090 being just shy of 1 TB/s. Yes, one terabyte per second...
For comparison a couple of DDR5 sticks[2] will just get you north of 70GB/s...
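The 3090 number falls straight out of bus width times per-pin data rate; assuming the RTX 3090's 19.5Gbps GDDR6X on its 384-bit bus:

```python
bus_bits     = 384
gbps_per_pin = 19.5  # GDDR6X data rate on the RTX 3090

bandwidth_gbs = bus_bits * gbps_per_pin / 8  # bits/s -> bytes/s
print(f"{bandwidth_gbs:.0f} GB/s")           # 936 GB/s, just shy of 1 TB/s
```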
Do you happen to know where Apple's integrated approach falls on this spectrum?
I was actually wondering about this the other day. A fully maxed out Mac Studio is about $6K, and it comes with a "64-core GPU" and "128GB integrated memory" (whatever any of that means). Would that be enough to run a decent Llama?
Well yeah I guess binned cards come into play, cheaper binned cards have a narrower bus. It seems there are quite a few models that aren't too heavy on compute but require a tonne of vram.
It would be nice for Nvidia to release a chip targeted for medium compute/high memory, the lower binning of which should revolve around their max 384b bus on the 4090. But then, it would be hard to financially justify it on their end I suppose.
Probably because if they take that exact same GPU+VRAM configuration and slap it on a rackmount-optimized board, they can charge AI companies 5-10x the price for it.
They don't even offer that much VRAM on cards aimed at those price-insensitive customers; Nvidia's current lineup maxes out at 48GB for GDDR-based models or 80GB for HBM-based models. Even if money is no object, there are still practical engineering limits on how much memory they can put on a card without sacrificing bandwidth.
This is where the new third player, Intel, can hopefully break the duopoly (if it can tear itself away from identical behaviour in the consumer/server CPU market). I'd love to see a 32 or 64GB card from Intel. Their software stack on Linux is competent enough (unlike the dumpster fire that is AMD's ROCm).
I don't think there was a market for it before LLMs. Still might not be (especially if they don't want to cannibalize data center products). Also, they might have hardware constraints. I wouldn't be that surprised if we see some high ram consumer GPUs in the future, though.
It won't work out unless it becomes common to run LLMs locally. Kind of a chicken-and-egg problem so I hope they try it!
> I don't think there was a market for it before LLMs.
At $work CGI assets sometimes grow pretty big and throwing more VRAM at the problem would be easier than optimizing the scenes in the middle of the workflow. They can be optimized, but that often makes it less ergonomic to work with them.
Perhaps asset-streaming (nanite&co) will make this less of an issue, but that's also fairly new.
Do LLM implementations already stream the weights layer by layer or in whichever order they're doing the evaluation or is PCIe bandwidth too limited for that?
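A back-of-the-envelope on why streaming all the weights per token is painful, assuming rough figures of ~32GB/s for a PCIe 4.0 x16 link and a ~20GB quantized model file (both numbers are ballpark assumptions):

```python
model_gb = 20  # e.g. a 30B model at ~5 bits/weight, roughly
pcie_gbs = 32  # PCIe 4.0 x16, approximate peak throughput

# If every token required re-sending all weights over the bus,
# the transfer alone bounds generation speed:
seconds_per_token = model_gb / pcie_gbs
print(f"lower bound: {seconds_per_token:.3f} s/token")  # 0.625 s/token before any compute
```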
The margins on VRAM are pretty bad for them since they don't manufacture it themselves. And every memory module they add needs additional power delivery and memory controller muscle to drive, so adding that memory is going to raise the cost of the card significantly. Most games and consumer workloads won't use all that extra memory.
Keep in mind video cards don't use the same kind of RAM as consumer CPUs do, they typically use GDDR or HBM.
> Why does AMD or Intel not release a medium performant GPU with minimum 128gb of memory for a good consumer price.
They do. Well, not “medium performant”, but for VRAM-bound tasks they’d still be an improvement over CPUs if you could use them — iGPUs use main memory.
What they don’t have is support for them for popular GPGPU frameworks (though there was a third party CUDA-for-Intel-iGPUs a while ago.)
That's exactly what the next generation of 'accelerators' will be like.
Whether it will be co-located with a GPU for consumer hardware remains to be seen.
The thing to determine is how essential running LLMs locally is for consumers.
BigTech is pushing hard to make their clouds the only place to run LLMs unfortunately, so unless there is a killer app that is just better locally (like games were for GPUs), this might not change.
One question I have is: can they use cheaper kinds of RAM and still be perfectly usable for large ML models? They could put 4GB of GDDR and 128GB of cheap RAM maybe? I do realize as others are saying, this would be a new kind of card so they will need time to develop it. But would this work?
From skimming, it looks like this approach requires CUDA and thus is Nvidia only.
Anyone have a recommended guide for AMD / Intel GPUs? I gather the 4 bit quantization is the special sauce for CUDA, but I’d guess there’d be something comparable for not-CUDA?
4-bit quantization is to reduce the amount of VRAM required to run the model. You can run it 100% on CPU if you don't have CUDA. I'm not aware of any AMD equivalent yet.
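To see why 4-bit quantization matters for VRAM, compare rough weight sizes for a 7B model at different precisions (pure parameter storage; real GGML quant formats add some per-block overhead):

```python
params = 7e9  # a "7B" model

for name, bits in [("fp16", 16), ("q8_0", 8), ("q5_1", 5), ("q4_0", 4)]:
    gb = params * bits / 8 / 1e9  # bits per weight -> gigabytes
    print(f"{name:>5}: ~{gb:.1f} GB")
# fp16: ~14.0 GB, q8_0: ~7.0 GB, q5_1: ~4.4 GB, q4_0: ~3.5 GB
```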
What do you use to host these models (like Vicuna, Dolly etc) on your own server and expose them using HTTP REST API?
Is there a Heroku-like for LLM models?
I am looking for an open-source model to do text summarization. OpenAI is too expensive for my use case because I need to pass a lot of tokens.
Weights and Biases is good for building/training models and Lambda Labs is a cloud provider for AI workloads. Lambda will only get you up to running the model though. You would still need to overlay some job management on top of that. I've heard Run.AI is good on that front but I haven't tried.
This update is pretty exciting; I’m going to try running a large model (65B) with a 3090. I have run a ton of local LLMs, but the hardest part is finding out the prompt structure. I wish there were some sort of centralized database that explained it.
I tried llama-65b on a system with an RTX 4090 + 64GB of DDR5 system RAM. I can push up to 45 layers (out of 80) to the GPU, and the overall performance is ~800ms / token, which is "good enough" for real-time chat.
Supposedly the new update with GPU offloading will bring that up to 10 tokens per second! 1 token per second is painfully slow, that’s about 30s for a sentence.
Timings for comparison with `--temp 0`:
276.23 ms per token without GPU, using only -t 12
296.04 ms per token with GPU using -t 12 -ngl 22
How do you determine the right fit for -t and -ngl? I definitely compiled using make LLAMA_CUBLAS=1
I'm using models/ggml-vic7b-uncensored-q5_1.bin, does the model matter significantly?
t: the number of physical cores on your CPU
ngl: just try increasing by one or two, until you start seeing inference crash due to 'out of memory' errors
> i5-10400
You have 6 cores, so try `-t 6`
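A rough way to pick a starting -ngl before the trial and error; every number here is an assumption you'd replace with your own model size, layer count, and free VRAM:

```python
model_gb  = 8.1  # size of the model file on disk, e.g. a 13B q5_1
n_layers  = 40   # a 13B LLaMA-family model has 40 transformer layers
free_vram = 5.0  # GB left after the desktop and CUDA context overhead

# Each offloaded layer costs roughly model_size / layer_count of VRAM.
gb_per_layer = model_gb / n_layers
ngl = int(free_vram / gb_per_layer)
print(f"start around -ngl {ngl}, then nudge up until you hit OOM")
```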
Imagine I am hearing about this for the first time: what did you do?
1. Download the weights for the model you want to use, e.g. gpt4-x-vicuna-13B.ggml.q5_1.bin
2. Clone the llama.cpp repo, and use 'make LLAMA_CUBLAS=1' to compile it with support for CUBLAS (BLAS on GPU).
3. Run the resulting 'main' executable, with the -ngl option set to 18, so that it tries to load 18 layers of the model into the GPU's VRAM, instead of the system's RAM.
These local projects are great because maybe eventually they will have an equivalent model that can be run on cheap parts.
It's also good for math lessons.
These local models aren't as good as Bard or GPT-4.
[1]: https://www.anandtech.com/show/15978/micron-spills-on-gddr6x...
[2]: https://www.anandtech.com/show/17269/ddr5-demystified-feat-s...
Was there a consumer market for them until recently?
Therapy & relationship bots, like the movie 'Her'. It's ugly, but it's coming.
https://developer.nvidia.com/blog/gpudirect-storage/
https://github.com/ROCm-Developer-Tools/HIPIFY/blob/master/R...
https://github.com/hughperkins/coriander
I have zero experience with these, though.
It’s not based on llama.cpp but on Hugging Face Transformers, and it can also run on CPU.
It works well, can be distributed, and very conveniently provides the same REST API as OpenAI GPT.
You can run open-source models, but the software itself is closed-source and free for non-commercial use.
A good place to dig for prompt structures may be the 'text-generation-webui' commit log. For example https://github.com/oobabooga/text-generation-webui/commit/33...
https://www.reddit.com/r/LocalLLaMA/comments/13fnyah/you_guy...
https://chat.lmsys.org/?arena (Click 'leaderboard')
https://huggingface.co/gotzmann/LLaMA-GGML-v2/tree/main