operator-name · 2 years ago
This looks quite cool! It's basically a tech demo for TensorRT-LLM, a framework that amongst other things optimises inference time for LLMs on Nvidia cards. Their base repo supports quite a few models.

Previously there was TensorRT for Stable Diffusion[1], which provided pretty drastic performance improvements[2] at the cost of customisation. I don't forsee this being as big of a problem with LLMs as they are used "as is" and augmented with RAG or prompting techniques.

[1]: https://github.com/NVIDIA/Stable-Diffusion-WebUI-TensorRT [2]: https://reddit.com/r/StableDiffusion/comments/17bj6ol/hows_y...

operator-name · 2 years ago
Having installed this, it's an incredibly thin wrapper around the following GitHub repos:

https://github.com/NVIDIA/trt-llm-rag-windows

https://github.com/NVIDIA/TensorRT-LLM

It essentially drops both projects into %LocalAppData%, along with a miniconda environment with the correct dependencies installed. Also, for some reason it downloaded both LLaMA 13B (24.5GB) and Mistral 7B (13.6GB) but only installed Mistral?

Mistral 7B runs about as accurately as I remember, but responses are faster than I can read. This seems to come at the cost of context and variance/temperature - although it's a chat interface, the implementation doesn't seem to take previous questions or answers into account. Asking the same question also gives the same answer.

The RAG (LlamaIndex) is okay, but a little suspect. The installation comes with a default dataset folder containing text files of NVIDIA marketing materials. When I tried asking questions about the files, it often cited the wrong file even when it gave the right answer.
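
If you want to poke at that attribution issue yourself, here's a rough sketch of the LlamaIndex flow the demo wraps - the folder path and top-k are made up, the demo actually routes generation through TensorRT-LLM rather than LlamaIndex's default LLM, and on newer releases the imports live under llama_index.core:

    # Sketch only: build an index over the demo's dataset folder and check
    # which source files each answer is attributed to.
    from llama_index import SimpleDirectoryReader, VectorStoreIndex

    documents = SimpleDirectoryReader("./dataset").load_data()  # assumed path
    index = VectorStoreIndex.from_documents(documents)
    query_engine = index.as_query_engine(similarity_top_k=3)

    response = query_engine.query("What is DLSS?")
    print(response)
    # Print the retrieved chunks' source files and similarity scores to see
    # whether the cited file actually matches the answer.
    for node in response.source_nodes:
        print(node.node.metadata.get("file_name"), node.score)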

kkielhofner · 2 years ago
The wrapping of TensorRT-LLM alone is significant.

I’ve been working with it for a while and it’s… Rough.

That said, it is extremely fast. With TensorRT-LLM and Triton Inference Server on conservative performance settings I get roughly 175 tokens/s on an RTX 4090 with Mistral-Instruct 7B. Following the commits, PRs, etc., I expect this to increase significantly in the future.

I'm actually working on a project to better package Triton and TensorRT-LLM and make it "name a model and press enter" level usable, with support for embeddings models, Whisper, etc.
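
For anyone wanting to reproduce numbers like that, a crude timing script against Triton's HTTP generate endpoint is enough for a ballpark - the URL, model name, and request fields below follow the stock tensorrtllm_backend "ensemble" example and may differ for your deployment, and the whitespace token count is only an approximation:

    # Rough tokens/s estimate against a Triton + TensorRT-LLM deployment.
    import time
    import requests

    URL = "http://localhost:8000/v2/models/ensemble/generate"  # assumed defaults
    payload = {
        "text_input": "Explain what TensorRT-LLM does in a few sentences.",
        "max_tokens": 256,
        "bad_words": "",
        "stop_words": "",
    }

    start = time.perf_counter()
    resp = requests.post(URL, json=payload, timeout=120)
    elapsed = time.perf_counter() - start
    resp.raise_for_status()

    text = resp.json()["text_output"]
    approx_tokens = len(text.split())  # use the model's tokenizer for real numbers
    print(f"~{approx_tokens / elapsed:.0f} tokens/s over {elapsed:.2f}s")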

FirmwareBurner · 2 years ago
>LLaMA 13B (24.5GB) and Mistral 7B (13.6GB)

But the HW requirements state 8GB of VRAM. How do those models fit in that?

randerson · 2 years ago
The Creative Labs sound cards of the early '90s came with Dr. Sbaitso, an app demoing their text-to-speech engine by pretending to be an AI psychologist. Someone needs to remake that!
crtified · 2 years ago
A while back I engaged in the utterly banal 5-minute pursuit of having Dr. Sbaitso speak with ChatGPT.

It did not go well. The generation gap was perhaps even starker than it is between real people.

https://forums.overclockers.com.au/threads/chatgpt-vs-dr-sba...

Deleted Comment

pests · 2 years ago
I found ChatGPT's early response, where it claims it's not Dr. Sbaitso, a weird one.
lysp · 2 years ago
It's actually been ported to the web - along with a lot of other DOS-based games.

https://classicreload.com/dr-sbaitso.html

swozey · 2 years ago
This is hilarious.

Ok, tell me your problem $name.

"I'm sad."

Do you enjoy being sad?

"No"

Are you sure?

"Yes"

That should solve your problem. Lets move on to discuss about some other things.

Also, wow, that Creative app/sound/etc. brings back memories.

davedunkin · 2 years ago
Levels is making something like that. https://twitter.com/levelsio/status/1756396158652432695

I would really want it to have that Dr. Sbaitso voice, though, telling me how to be fitter, happier, more productive.

sedatk · 2 years ago
Somebody actually integrated ChatGPT with Dr. Sbaitso: https://bert.org/2023/01/06/chatgpt-in-dr-sbaitso/
dcist · 2 years ago
Yes, I remember Dr. Sbaitso very, very well. I spent many hours with it as a kid and thought it was tons of fun. To be frank, Dr. Sbaitso is why I was underwhelmed when chatbots were hyped in the early 2010s. I couldn't understand why anyone would be excited about 90s tech.
Mistletoe · 2 years ago
Chatting with ALICE is what tempered my ChatGPT hype. It was neat and seemed like magic, but I think it was in the '90s when I tried it. I'm sure for new people it feels like an unprecedented event to talk to a computer and have it seem sentient.

https://www.pandorabots.com/pandora/talk?botid=b8d616e35e36e...

As with other bogus things like tarot or horoscopes, it's amazing what you can discover when you talk about something, it asks you questions, and what you want or desire eventually floats to the surface. And now people are even more lonely...

>Human: do you like video games

>A.L.I.C.E: Not really, but I like to play the Turing Game.

jhbadger · 2 years ago
Really just 1960s tech. Dr. Sbaitso was essentially a version of ELIZA, with the only new part being the speech synthesis.
McAtNite · 2 years ago
I'm struggling to understand the point of this. It appears to be a more simplified way of getting a local LLM running on your machine, but I expect less technically inclined users will default to using the AI built into Windows, while more technical users will leverage llama.cpp to run whatever models they're interested in.

Who is the target audience for this solution?

operator-name · 2 years ago
This is a tech demo for TensorRT-LLM, which is meant to greatly improve inference time for compatible models.
brucethemoose2 · 2 years ago
> the more technical users will leverage llama.cpp to run whatever models they are interested in.

Llama.cpp is much slower, and does not have built-in RAG.

TRT-LLM is a finicky deployment-grade framework, and TBH having it packaged into a one-click install with LlamaIndex is very cool. The RAG in particular is beyond what most local LLM UIs do out of the box.

dkarras · 2 years ago
>It appears to be a more simplified way of getting a local LLM running on your machine

No, it answers questions from the documents you provide. Off-the-shelf local LLMs don't do this by default - you need a RAG stack on top, or to fine-tune with your own content.
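
The core of such a stack is small - embed your document chunks, retrieve the ones closest to the question, and paste them into the prompt before handing it to whatever local model you run. A minimal sketch (the embedding model is an arbitrary choice and the final generate() hand-off is a placeholder, not what Chat with RTX actually uses):

    # Bare-bones RAG: embed chunks, pick the most similar ones to the question,
    # and prepend them to the prompt for a local LLM.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    embedder = SentenceTransformer("all-MiniLM-L6-v2")  # any small embedding model

    chunks = [
        "TensorRT-LLM compiles LLMs into optimized TensorRT engines.",
        "Chat with RTX indexes a local folder of documents for retrieval.",
        "DLSS uses neural networks to upscale rendered frames.",
    ]
    chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)

    def retrieve(question, k=2):
        q = embedder.encode([question], normalize_embeddings=True)[0]
        scores = chunk_vecs @ q  # cosine similarity (vectors are normalized)
        return [chunks[i] for i in np.argsort(scores)[::-1][:k]]

    question = "What does Chat with RTX do with my files?"
    context = "\n".join(retrieve(question))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}\nAnswer:"
    # generate(prompt)  # hand off to llama.cpp, TensorRT-LLM, etc. (placeholder)
    print(prompt)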

westurner · 2 years ago
From "Artificial intelligence is ineffective and potentially harmful for fact checking" (2023) https://news.ycombinator.com/item?id=37226233 : pdfgpt, knowledge_gpt, elasticsearch :

> Are LLM tools better or worse than e.g. meilisearch or elasticsearch for searching with snippets over a set of document resources?

> How does search compare to generating things with citations?

pdfGPT: https://github.com/bhaskatripathi/pdfGPT :

> PDF GPT allows you to chat with the contents of your PDF file by using GPT capabilities.

GH "pdfgpt" topic: https://github.com/topics/pdfgpt

knowledge_gpt: https://github.com/mmz-001/knowledge_gpt

From https://news.ycombinator.com/item?id=39112014 : paperai

neuml/paperai: https://github.com/neuml/paperai :

> Semantic search and workflows for medical/scientific papers

RAG: https://news.ycombinator.com/item?id=38370452

Google Desktop (2004-2011): https://en.wikipedia.org/wiki/Google_Desktop :

> Google Desktop was a computer program with desktop search capabilities, created by Google for Linux, Apple Mac OS X, and Microsoft Windows systems. It allowed text searches of a user's email messages, computer files, music, photos, chats, Web pages viewed, and the ability to display "Google Gadgets" on the user's desktop in a Sidebar

GNOME/tracker-miners: https://gitlab.gnome.org/GNOME/tracker-miners

src/miners/fs: https://gitlab.gnome.org/GNOME/tracker-miners/-/tree/master/...

SPARQL + SQLite: https://gitlab.gnome.org/GNOME/tracker-miners/-/blob/master/...

https://news.ycombinator.com/item?id=38355385 : LocalAI, braintrust-proxy; promptfoo, chainforge, mixtral

fortran77 · 2 years ago
It seems really clear to me! I downloaded it, pointed it to my documents folder, and started running it. It's nothing like the "AI built into Windows" and it's much easier than dealing with rolling my own.
SirMaster · 2 years ago
This lets you run Mistral or Llama 2, so whoever has an RTX card and wants to run either of those models?

And perhaps they will add more models in the future?

pquki4 · 2 years ago
I don't think your comment answers the question? Basically, those who bother to know the underlying model's name can already run that model without this tool from Nvidia?
McAtNite · 2 years ago
I suppose I'm just struggling to see the value add. Ollama already makes it dead simple to get a local LLM running, and this appears to be a more limited, vendor-locked equivalent.

From my point of view the only person who would be likely to use this would be the small slice of people who are willing to purchase an expensive GPU, know enough about LLMs to not want to use CoPilot, but don’t know enough about them to know of the already existing solutions.

papichulo2023 · 2 years ago
Does Windows use the PC's GPU, or just the CPU, or the cloud?
robotnikman · 2 years ago
If they are talking about the Bing AI, it's just using whatever OpenAI has in the cloud
joenot443 · 2 years ago
The immediate value prop here is the ability to load up documents to train your model on the fly. 6mos ago I was looking for a tool to do exactly this and ended up deciding to wait. Amazing how fast this wave of innovation is happening.
seydor · 2 years ago
Windows users who haven't bought an Nvidia card yet
tuananh · 2 years ago
this is exactly what i want: a personal assistant.

a personal assistant to monitor everything i do on my machine, ingest it and answer question when i need.

it's not there yet (still need to manually input urls, etc...) but it's very much feasible.

mistermann · 2 years ago
I'd like something that monitors my history on all browsers (mobile and desktop, and dedicated client apps like Substack, Reddit, etc.) and then ingests the articles (and comments, and other links to some depth level maybe) and then allows me to ask questions... that would be amazing.
tuananh · 2 years ago
yes, i want that too. not sure if anyone is building something like this?
majestic5762 · 2 years ago
rewind.ai
majestic5762 · 2 years ago
mykin.ai is building this with privacy in mind. It runs small models on-device, while running larger ones in confidential VMs in the cloud.

Deleted Comment

Xeyz0r · 2 years ago
But it sounds kinda creepy, don't you think?
gmueckl · 2 years ago
You'd be the one controlling the off-switch and the physical storage devices for the data. I'd think that this fact takes most of the potential creep out. What am I not seeing here?
chollida1 · 2 years ago
> But it sounds kinda creepy don't you think?

is the bash history command creepy?

Is your browser's history creepy?

spullara · 2 years ago
it is all local so, no?
tuananh · 2 years ago
if it's 100% local then fine.
yuck39 · 2 years ago
Interesting. Since you are running it locally, do they still have to put up all the legal guardrails that we see from ChatGPT and the like?
dist-epoch · 2 years ago
Yes, because otherwise there would be news articles like "NVIDIA installs racist/sexist/... LLM on users' computers"
phone8675309 · 2 years ago
Gaming company

Gaming LLM

Checks out

mchinen · 2 years ago
Given that you can pick Llama or Mistral in the NVIDIA interface, I'm curious whether this is built around ollama or reimplements something similar. The file and URL retrieval is a nice addition in any case.
navjack27 · 2 years ago
30 and 40 series only? My 2080 Ti scoffs at the artificial limitation
andy_xor_andrew · 2 years ago
so they branded this "Chat with RTX", using the RTX branding. Which, originally, meant "ray tracing". And the full title of your 2080 Ti is the "RTX 2080 Ti".

So, reviewing this...

- they are associating AI with RTX (ray tracing) now (??)

- your RTX card cannot chat with RTX (???)

wat

a13o · 2 years ago
The marketing whiff on ray tracing happened long ago. DLSS is the killer app on RTX cards, another 'AI'-enabled workload.
startupsfail · 2 years ago
No support for bf16 in a card that was released more than 5 years ago, I guess? Support starts with Ampere?

Although you'd realistically need 5-6 bit quantization to get anything large/usable enough running on a 12GB card. And I think it's just CUDA then, so you should be able to use a 2080 Ti.
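
Quick back-of-the-envelope math on the weights alone (ignoring KV cache and activations) shows why:

    # Approximate VRAM for model weights only: params * bits / 8 bytes.
    def weight_gib(params_billion, bits):
        return params_billion * 1e9 * bits / 8 / 2**30

    for params in (7, 13):
        for bits in (16, 8, 5, 4):
            print(f"{params}B @ {bits}-bit: {weight_gib(params, bits):.1f} GiB")
    # 13B is ~24 GiB at 16-bit but ~6-8 GiB at 4-5 bits, which is what makes
    # a 12GB (or 8GB) card workable at all.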

nottorp · 2 years ago
That was my first question, does it display pretty ray traced images instead of answers?
Havoc · 2 years ago
RTX is a brand more than ray tracing.

The restriction is largely an arbitrary generational limit.

0x457 · 2 years ago
> I pull my PC with Intel 8086 out of closet

> I try to run windows 10 on it

> It doesn't work

> pff, Intel cpu cannot run OS meant for intel CPUs

wat

Jokes aside, Nvidia has been using the RTX branding for products that use Tensor Cores for a long time now. The limitation is due to 1st-gen tensor cores not supporting the precisions required.

operator-name · 2 years ago
Yeah, it seems a bit odd because the TensorRT-LLM repo lists Turing as a supported architecture.

https://github.com/NVIDIA/TensorRT-LLM?tab=readme-ov-file#pr...

speckx · 2 years ago
I, too, was hoping that my 2080 Ti from 2019 would suffice. =(
phone8675309 · 2 years ago
Don't worry, they'll be happy to charge you $750 for an entry level card next generation that can run this.
tekla · 2 years ago
Yes peasants, Nvidia requires you to buy the latest and greatest expensive luxury gear, and you will BEG for it.
nickthegreek · 2 years ago
A 4060 8gb is $300.
redder23 · 2 years ago
> and all you need is an RTX 30- or 40-series GPU with at least 8GB of VRAM

Smells like an artificial restriction to me. I have a 2080 Ti with 8GB of VRAM that is still perfectly fine for gaming. I play at 3440x1440 and modern games need DLSS/FSR on quality for a nice 60+ to 90 FPS. That is perfectly enough for me, and I have not had a game, even UE5 games, where I really thought I NEED a new one. I bet that card is totally capable of running that chatbot.

They do the same with frame generation, where they even require a 40-series card. That is ridiculous to me, as those cards are so fast that you do not even need frame generation. The slower cards are the ones that would benefit from it most, so they just lock it down artificially to boost their sales.

magicalhippo · 2 years ago
> 2080 Ti with 8GB of VRAM

Sure you don't mean 11GB[1]? Or did they make other variants? FWIW I have a 2080 Ti with 11GB, been considering upgrading but thinking I'll wait til 5xxx.

[1]: https://www.techpowerup.com/gpu-specs/geforce-rtx-2080-ti.c3...

redder23 · 2 years ago
Yes, of course - I have 11GB of VRAM.

My next card will be an AMD one. I like that they are open sourcing most of their stuff, and I think they play better with Linux Wine/Proton. FSR 3 also isn't artificially restricted to certain cards and even runs on the competition's hardware. I read today about an open source API that takes CUDA calls and runs them on AMD or anywhere. I am sure there will be some cool open source projects that do all kinds of things, if I ever even need them.