This looks quite cool! It's basically a tech demo for TensorRT-LLM, a framework that, amongst other things, optimises inference time for LLMs on Nvidia cards. Their base repo supports quite a few models.
https://github.com/NVIDIA/trt-llm-rag-windows
https://github.com/NVIDIA/TensorRT-LLM
Previously there was TensorRT for Stable Diffusion[1], which provided pretty drastic performance improvements[2] at the cost of customisation. I don't foresee this being as big of a problem with LLMs, as they are used "as is" and augmented with RAG or prompting techniques.
It's quite a thin wrapper that puts both projects into %LocalAppData%, along with a Miniconda environment with the correct dependencies installed. Also, for some reason it downloaded both LLaMA 13B (24.5GB) and Mistral 7B (13.6GB) but only installed Mistral?
Mistral 7B runs about as accurately as I remember, but responses are faster than I can read. This seems to come at the cost of context and variance/temperature: although it's a chat interface, the implementation doesn't seem to take previous questions or answers into account, and asking the same question gives the same answer.
The RAG (LlamaIndex) is okay, but a little suspect. The installation comes with a default folder dataset containing text files of Nvidia marketing materials. When I asked questions about the files, it often cited the wrong file even when it gave the right answer.
[1]: https://github.com/NVIDIA/Stable-Diffusion-WebUI-TensorRT
[2]: https://reddit.com/r/StableDiffusion/comments/17bj6ol/hows_y...
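(For anyone curious what that RAG layer roughly looks like under the hood: below is a minimal LlamaIndex sketch over a folder of text files that also prints which files each answer was retrieved from. It assumes a recent llama-index release with an LLM and embedding model already configured via Settings; the folder path and question are made up for illustration.)

    # Minimal folder-based RAG sketch with LlamaIndex (assumes a recent
    # llama-index and that Settings.llm / Settings.embed_model already point
    # at whatever local or hosted models you actually use).
    from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

    docs = SimpleDirectoryReader("./dataset").load_data()   # e.g. the demo's folder of .txt files
    index = VectorStoreIndex.from_documents(docs)           # chunk, embed and index the documents
    engine = index.as_query_engine(similarity_top_k=3)      # retrieve the 3 closest chunks per question

    response = engine.query("What does this dataset say about DLSS?")
    print(response)
    # Check which files the answer was actually retrieved from:
    for node in response.source_nodes:
        print(node.metadata.get("file_name"), node.score)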
The wrapping of TensorRT-LLM alone is significant.
I’ve been working with it for a while and it’s… rough.
That said, it is extremely fast. With TensorRT-LLM and Triton Inference Server on conservative performance settings I get roughly 175 tokens/s on an RTX 4090 with Mistral-Instruct 7B. Following the commits, PRs, etc., I expect this to increase significantly in the future.
I’m actually working on a project to better package Triton and TensorRT-LLM and make it “name a model and press enter” level usable, with support for embeddings models, Whisper, etc.
But the HW requirements state 8GB of VRAM. How do those models fit in that?
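(Back-of-envelope answer, not from Nvidia's docs: the checkpoints it downloads are fp16, but the TensorRT-LLM engines these demos build are, as far as I can tell, weight-quantized to int4/int8, which is how a 7B model ends up fitting an 8GB card. Illustrative arithmetic only:)

    # Back-of-envelope VRAM needed just for the weights (illustrative numbers;
    # a real engine also needs KV cache, activations and runtime overhead).
    def weight_gib(params_billion: float, bits_per_weight: float) -> float:
        return params_billion * 1e9 * bits_per_weight / 8 / 2**30

    for name, params, bits in [
        ("Mistral 7B, fp16 ", 7.3, 16),   # ~13.6 GiB -- the size of the download
        ("Mistral 7B, int4 ", 7.3, 4),    # ~3.4 GiB  -- fits an 8GB card with room for KV cache
        ("LLaMA 13B, fp16  ", 13.0, 16),  # ~24.2 GiB
        ("LLaMA 13B, 5-bit ", 13.0, 5),   # ~7.6 GiB  -- roughly what a 12GB card can host
    ]:
        print(f"{name}: ~{weight_gib(params, bits):.1f} GiB of weights")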
The Creative Labs sound cards of the early 90s came with Dr. Sbaitso, an app demoing their text-to-speech engine by pretending to be an AI psychologist. Someone needs to remake that!
Yes, I remember Dr. Sbaitso very, very well. I spent many hours with it as a kid and thought it was tons of fun. To be frank, Dr. Sbaitso is why I was underwhelmed when chatbots were hyped in the early 2010s. I couldn't understand why anyone would be excited about 90s tech.
It did not go well. The generation gap was perhaps even starker than it is between real people.
https://forums.overclockers.com.au/threads/chatgpt-vs-dr-sba...
https://classicreload.com/dr-sbaitso.html
Ok, tell me your problem $name.
"I'm sad."
Do you enjoy being sad?
"No"
Are you sure?
"Yes"
That should solve your problem. Lets move on to discuss about some other things.
Also wow that creative app/sound/etc brings back memories.
I would really want it to have that Dr. Sbaitso voice, though, telling me how to be fitter, happier, more productive.
Chatting with ALICE is what has tempered my ChatGPT hype. It was neat and seemed like magic, but I think it was in the 90s when I tried it. I'm sure for people new to this it feels like an unprecedented event to talk to a computer and have it seem sentient.
https://www.pandorabots.com/pandora/talk?botid=b8d616e35e36e...
As with other bogus things like tarot or horoscopes, it's amazing what you can discover when you talk about something, it asks you questions, and what you want or desire eventually floats to the surface. And now people are even more lonely...
>Human: do you like video games
>A.L.I.C.E: Not really, but I like to play the Turing Game.
I’m struggling to understand the point of this. It appears to be a more simplified way of getting a local LLM running on your machine, but I expect less technically inclined users would default to using the AI built into Windows while the more technical users will leverage llama.cpp to run whatever models they are interested in.
Who is the target audience for this solution?
> the more technical users will leverage llama.cpp to run whatever models they are interested in.
Llama.cpp is much slower, and does not have built-in RAG.
TRT-LLM is a finicky, deployment-grade framework, and TBH having it packaged into a one-click install with LlamaIndex is very cool. The RAG in particular is beyond what most local LLM UIs do out of the box.
>It appears to be a more simplified way of getting a local LLM running on your machine
No, it answers questions from the documents you provide. Off-the-shelf local LLMs don't do this by default. You need a RAG stack on top of them, or to fine-tune with your own content.
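(To make "a RAG stack on top of it" concrete, here is a deliberately minimal sketch of the idea: embed document chunks, retrieve the ones closest to the question, and stuff them into the prompt of whatever local model you run. The embed() and generate() functions are placeholders for your own embedding model and local LLM, not part of any particular library.)

    # Minimal "RAG on top of a plain local LLM" sketch. embed() and generate()
    # are placeholders: wire them up to whatever embedding model and local LLM
    # (llama.cpp, TensorRT-LLM, an HTTP server, ...) you actually run.
    from math import sqrt

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

    def answer(question, chunks, embed, generate, top_k=3):
        q_vec = embed(question)
        # Rank the document chunks by similarity to the question...
        ranked = sorted(chunks, key=lambda c: cosine(embed(c), q_vec), reverse=True)
        context = "\n\n".join(ranked[:top_k])
        # ...then let the model answer from the retrieved context only.
        prompt = (
            "Answer the question using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
        )
        return generate(prompt)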
From "Artificial intelligence is ineffective and potentially harmful for fact checking" (2023) https://news.ycombinator.com/item?id=37226233 : pdfgpt, knowledge_gpt, elasticsearch :
> Are LLM tools better or worse than e.g. meilisearch or elasticsearch for searching with snippets over a set of document resources?
> How does search compare to generating things with citations?
pdfGPT: https://github.com/bhaskatripathi/pdfGPT :
> PDF GPT allows you to chat with the contents of your PDF file by using GPT capabilities.
GH "pdfgpt" topic: https://github.com/topics/pdfgpt
knowledge_gpt: https://github.com/mmz-001/knowledge_gpt
From https://news.ycombinator.com/item?id=39112014 : paperai
neuml/paperai: https://github.com/neuml/paperai :
> Semantic search and workflows for medical/scientific papers
RAG: https://news.ycombinator.com/item?id=38370452
Google Desktop (2004-2011): https://en.wikipedia.org/wiki/Google_Desktop :
> Google Desktop was a computer program with desktop search capabilities, created by Google for Linux, Apple Mac OS X, and Microsoft Windows systems. It allowed text searches of a user's email messages, computer files, music, photos, chats, Web pages viewed, and the ability to display "Google Gadgets" on the user's desktop in a Sidebar
GNOME/tracker-miners: https://gitlab.gnome.org/GNOME/tracker-miners
src/miners/fs: https://gitlab.gnome.org/GNOME/tracker-miners/-/tree/master/...
SPARQL + SQLite: https://gitlab.gnome.org/GNOME/tracker-miners/-/blob/master/...
https://news.ycombinator.com/item?id=38355385 : LocalAI, braintrust-proxy; promptfoo, chainforge, mixtral
It seems really clear to me! I downloaded it, pointed it at my documents folder, and started running it. It's nothing like the "AI built into Windows", and it's much easier than rolling my own.
I don't think your comment answers the question? Basically, those who bother to know the underlying model's name can already run their model without this tool from Nvidia?
I suppose I’m just struggling to see the value add. Ollama already makes it dead simple to get a local LLM running, and this appears to be a more limited, vendor-locked equivalent.
From my point of view, the only people likely to use this are the small slice who are willing to purchase an expensive GPU, know enough about LLMs to not want to use Copilot, but don't know enough about them to know of the already existing solutions.
And perhaps they will add more models in the future?
The immediate value prop here is the ability to load up documents for the model to use on the fly. Six months ago I was looking for a tool to do exactly this and ended up deciding to wait. Amazing how fast this wave of innovation is happening.
I'd like something that monitors my history on all browsers (mobile and desktop, plus dedicated client apps like Substack, Reddit, etc.), ingests the articles (and comments, and maybe other links to some depth), and then lets me ask questions about it... that would be amazing.
A personal assistant to monitor everything I do on my machine, ingest it, and answer questions when I need them. It's not there yet (you still need to manually input URLs, etc.), but it's very much feasible.
You'd be the one controlling the off-switch and the physical storage devices for the data. I'd think that this fact takes most of the potential creep out. What am I not seeing here?
Is the bash history command creepy?
Is your browser's history creepy?
Given that you can pick llama or mistral in the NVIDIA interface, I'm curious if this is built around ollama or reimplementing something similar. The file and URL retrieval is a nice addition in any case.
Gaming LLM
Checks out
so they branded this "Chat with RTX", using the RTX branding. Which, originally, meant "ray tracing". And the full title of your 2080 Ti is the "RTX 2080 Ti".
So, reviewing this...
- they are associating AI with RTX (ray tracing) now (??)
- your RTX card cannot chat with RTX (???)
wat
No support for bf16 in a card that was released more than 5 years ago, I guess? Support starts with Ampere?
Although you’d realistically need 5-6 bit quantization to get anything large/usable enough running on a 12GB card. And I think it's just CUDA then, so you should be able to use a 2080 Ti.
It is largely an arbitrary generational limit.
> I try to run windows 10 on it
> It doesn't work
> pff, Intel cpu cannot run OS meant for intel CPUs
wat
Jokes aside, Nvidia has been using the RTX branding for products that use Tensor Cores for a long time now. The limitation here is due to 1st-gen Tensor Cores not supporting the precisions required.
https://github.com/NVIDIA/TensorRT-LLM?tab=readme-ov-file#pr...
https://pcpartpicker.com/products/video-card/#c=552&sort=pri...
Cheapest 8GB 30xx is $220
https://pcpartpicker.com/products/video-card/#sort=price&c=5...
> and all you need is an RTX 30- or 40-series GPU with at least 8GB of VRAM
Smells like artificial restriction to me. I have a 2080 Ti with 8GB of VRAM that is still perfectly fine for gaming. I play at 3440x1440, and modern games need DLSS/FSR on quality for a nice 60-90+ FPS. That is perfectly enough for me, and I have not had a game, even UE5 games, where I thought I really NEED a new card. I bet that card is totally capable of running that chatbot.
They do the same with frame generation. There they even require a 40-series card. That is ridiculous to me, as those cards are so fast that you do not even need frame generation. The slower cards are the ones that would benefit from it the most, so they just lock it down artificially to boost their sales.
Sure you don't mean 11GB[1]? Or did they make other variants? FWIW I have a 2080 Ti with 11GB, been considering upgrading but thinking I'll wait til 5xxx.
[1]: https://www.techpowerup.com/gpu-specs/geforce-rtx-2080-ti.c3...
My next card will be an AMD one. I like that they are open-sourcing most of their stuff, and I think they play better with Linux Wine/Proton. FSR 3 also doesn't artificially restrict cards and even runs on the competition. I read today about an open source API that takes CUDA calls and runs them on AMD or anywhere else. I am sure there will be some cool open source projects that do all kinds of things if I ever even need them.