I'm curious where the bottleneck is. Is it a throughput constraint from too much environment-sensor data, or is it processing that data?
Depending on the model being used, we may get a single set of joint angle deltas or a series of them. To complete a task, the robot needs to capture images from the cameras along with the current joint angles and send them to the model with the task text, which returns the joint angle changes to apply. Once the joint angles are updated, we check whether the task is complete (this can also come from the model). We run this loop until the task is complete.
Combine this with the motion planning that has to happen to make sure the joint angles we get are safe and don't result in collisions with the surroundings, and the whole thing ends up slow.
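Roughly, the loop looks something like the sketch below. Every name here (`model`, `planner`, `arm`, `cameras`) is a placeholder for whatever stack you're using, not a specific library:

```python
import numpy as np

def run_task(task_text, cameras, arm, model, planner, max_steps=200):
    """Illustrative perceive -> query model -> plan-check -> act loop."""
    for _ in range(max_steps):
        images = [cam.capture() for cam in cameras]   # RGB frames from each camera
        joints = np.asarray(arm.get_joint_angles())   # current joint state

        # The model returns one or more sets of joint angle deltas plus a done flag.
        deltas, done = model.predict(images=images, joints=joints, task=task_text)

        for delta in np.atleast_2d(deltas):
            target = joints + delta
            # Motion planning / collision check before applying anything.
            if not planner.is_safe(joints, target):
                break
            arm.move_to(target)
            joints = target

        if done:
            return True
    return False
```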
I have been working with LLMs and VLMs to automate browser-based workflows, among other things, for the last couple of years. Given how good the vision models have gotten lately, the perception problem is solved to a level where it opens up a lot of possibilities. Manipulation is not generally solved yet, but there is a lot of activity in the field and there are promising approaches (OpenVLA, π0). Given these, I'm trying to build an affordable robot that can help with household chores using language and vision models. The idea is to ship capable enough hardware that can do a few things really well with currently available models, and keep upgrading the AI stack as manipulation models get better over time.
If you want a playground to test this model locally or want to quickly build some applications with it, you can try LLMStack (https://github.com/trypromptly/LLMStack). I wrote last week about how to configure and use Ollama with LLMStack at https://docs.trypromptly.com/guides/using-llama3-with-ollama.
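If you just want to sanity-check the model locally before wiring it into LLMStack, something like this against Ollama's default REST endpoint works (assumes you've already run `ollama pull llama3` and Ollama is serving on its default port):

```python
import requests

# Quick check that the local model responds; this bypasses LLMStack entirely.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3", "prompt": "Say hello in one sentence.", "stream": False},
    timeout=120,
)
print(resp.json()["response"])
```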
Disclaimer: I'm the maintainer of LLMStack
Basically, I'd like to be able to take PDFs of, say, D&D books, extract that data (this step is, at least, something I can already do), and load it into an LLM to be able to ask questions like:
* What does the feat "Sentinel" do?
* Who is Elminster?
* Which God(s) do Elves worship in Faerûn?
* Where can I find the spell "Crusader's Mantle"?
And so on. Given this data is all under copyright, I'd probably have to stick to using a local LLM to avoid problems. And, while I wouldn't expect it to have good answers to all (or possibly any!) of those questions, I'd nevertheless love to be able to give it a try.
I'm just not sure where to start. I think I'd want to fine-tune an existing model since this is all natural language content, but I get a bit lost after that. Do I need to pre-process the content to add extra information that I can't extract automatically? E.g., page numbers are simple to add in, but would I need to mark out things like chapter/section headings, or in-character vs. out-of-character text? Do I need to add all the content in as a series of questions and answers, like "What information is on page 52 of the Player's Handbook? => <text of page>"?
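To make that last question concrete, the page-level Q&A format is usually just a JSONL file of prompt/response pairs, roughly like the sketch below. The field names depend on whichever fine-tuning toolkit you end up using, and the `pages` structure is only an assumption about what the PDF extraction step produces:

```python
import json

# Assumed input: pages already extracted from the PDFs, e.g.
# [{"book": "Player's Handbook", "page": 52, "text": "..."}, ...]
def build_finetune_dataset(pages, out_path="dnd_finetune.jsonl"):
    """Write one prompt/response JSON object per line (a common
    instruction-tuning format)."""
    with open(out_path, "w", encoding="utf-8") as f:
        for page in pages:
            record = {
                "instruction": f"What information is on page {page['page']} "
                               f"of the {page['book']}?",
                "response": page["text"],
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```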
For now it still uses OpenAI for embedding generation by default; we are updating that in the next couple of releases so you can use a local model for embedding generation before writing to a vector DB.
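The general pattern is just: embed locally, then write to the vector DB. As a rough illustration (using sentence-transformers and Chroma as stand-ins, not LLMStack's actual code):

```python
import chromadb
from sentence_transformers import SentenceTransformer

# Stand-in libraries for the pattern: embed locally, then store in a vector DB.
model = SentenceTransformer("all-MiniLM-L6-v2")   # small local embedding model
client = chromadb.Client()                        # in-memory vector store
collection = client.create_collection("docs")

docs = ["Elminster is a wizard of Faerûn.", "The Sentinel feat lets you ..."]
embeddings = model.encode(docs).tolist()

collection.add(
    ids=[str(i) for i in range(len(docs))],
    embeddings=embeddings,
    documents=docs,
)
```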
Disclosure: I'm the maintainer of the LLMStack project