The problem here is where these actions come from. A generic LLM cannot generate correct actions in many (if not most) real-life cases. So it will have to learn, and LLMs aren't good at learning. For example: "I'm tired, play my favorite." The right action depends on _who_ is speaking and on what's going on right now; there may be someone sleeping, or watching TV. I'm afraid an acceptable solution is much more complicated.
I have investigated the use of agents for real support-agent work, and the failure rate made it unacceptable for my use case. This was even after giving the agent very explicit and finely tuned context.
I suspect that if engineering of LLM solutions used more unseen test data, it would become apparent that LLMs really do not have sufficiently reliable "cognitive" ability to do any practical agent-type work.
Do we have to expect _that_ level of understanding from the agent, though? If my wife said that to me, I might have a good chance of queuing up the song she has in mind, but anyone else? No chance. I don't expect tools like this to understand cryptic requests and always come to the right answer. I'm happy if I can request a song, or an action, or anything else the same way I might ask another human who doesn't know me intimately.
Why would we want this at all if it doesn't know you that well? Current voice assistants without AI can already handle songs and actions like that. Seems like it's largely solved.
I'm genuinely not seeing a problem there that the Planner part of the paper couldn't cover. "Who said that" and "what's going on right now" are just API calls. Besides which, if one person says "play my favourite" while another person is watching TV, that's not the LLM's job to unpack.
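As a rough sketch of that division of labour (all endpoints here are hypothetical), the Planner's context can be assembled from a couple of ordinary API calls before the model ever sees the request:

```
# Hypothetical home-automation endpoints; the point is only that speaker
# identity and house state are plain API lookups, not LLM problems.
import requests

def build_planner_context(home_api="http://home.local/api"):
    speaker = requests.get(f"{home_api}/last_speaker").json()  # e.g. {"name": "Bob"}
    presence = requests.get(f"{home_api}/presence").json()     # who's home, asleep, watching TV
    return (
        f"Speaker: {speaker['name']}\n"
        f"House state: {presence}\n"
        'Request: "I\'m tired, play my favorite"\n'
        "Choose ONE action and the device to run it on."
    )
```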
The point is that the ability to call APIs gives them the ability to learn so that the actions that are eventually taken are correct in context. It's like a more generic version of https://code-as-policies.github.io/.
Hopefully it can be solved by the target API: the target API knows who is calling it, and the service has the user's information. Or the request will be translated into "Play the most played playlist", and that action will be good enough.
I agree with you in general, though: the more useful AI is, the more data it will need to see. I strongly believe companies like Microsoft, Google, or Apple will deliver the best experience because they own the operating systems. It is going to be very hard for a third party to build a general AI assistant.
> So, it will have to learn, and LLMs aren't good at learning
LLMs are bad at human-like learning, but their zero-shot performance + semantic search more than make up for it.
If you give an LLM access to your Spotify account via an API, it has access to your playlists and to details about each song like `BPM`, `vocality`, even `energy`:
https://developer.spotify.com/documentation/web-api/referenc...
https://developer.spotify.com/documentation/web-api/referenc...
An LLM with no prior explanation of either endpoint can figure out that it should look at your favorites playlist and find which songs in your favorites list are most suitable for a tired person.
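To make that concrete, here's a minimal sketch using the spotipy client (the field names come from Spotify's audio-features endpoint, which calls them `tempo`, `energy`, and `speechiness` rather than `BPM`/`vocality`):

```
# Sketch: gather a user's saved tracks plus Spotify audio features and hand
# them to an LLM as JSON; the ranking logic itself is left to the model.
import json
import spotipy
from spotipy.oauth2 import SpotifyOAuth

sp = spotipy.Spotify(auth_manager=SpotifyOAuth(scope="user-library-read"))
saved = sp.current_user_saved_tracks(limit=20)["items"]
features = sp.audio_features([t["track"]["id"] for t in saved])

candidates = [
    {"name": t["track"]["name"], "energy": f["energy"], "tempo": f["tempo"]}
    for t, f in zip(saved, features)
]
# The prompt then becomes: "Given these tracks: <json>, which one suits
# someone who says 'I'm tired'?" with no hand-written ranking rules.
prompt_data = json.dumps(candidates)
```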
-
But it can go even further and identify its own sorting criteria for different situations with chain of thought:
Bedroom at night: https://chat.openai.com/share/6b1787ef-cd84-4834-b582-5024f8... Kitchen at 5pm: https://chat.openai.com/share/7ddaa047-0855-48c1-bcea-308083...
Rather than blindly selecting the most relaxing songs, it understands nuance like:
> Room State: "lights on" and "garage door open" can imply either returning home from work or engaging in some evening activity. The environment is probably not yet set for relaxation completely.
And it genuinely comes up with an intelligently adapted strategy based on the situation.
-
And say it gets your favorite wrong, and you correct it: an LLM with no specialized training can classify your follow-up as a correction vs. an unrelated command. It can even use chain-of-thought to posit why it may have been wrong.
You can then store all messages it classified as corrections and fetch them using semantic similarity.
That addresses both the customization and determinism issues: you don't need to rely on zero-shot performance getting it right every time; the model can use the same chain of thought to translate past corrections into future guidance without further training.
For example, if your last correction was from classical music to hard metal when you got back from work, it's able to understand that you prefer higher-energy songs, while still understanding that this doesn't mean it should play hard metal every time in perpetuity.
Kitchen w/ memory: https://chat.openai.com/share/43635427-55d5-4394-b282-46acae... Bedroom w/ memory: https://chat.openai.com/share/8c146dd5-2233-4aba-8f6a-b97b7a...
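A minimal sketch of that correction memory (the embedding model and in-memory store are arbitrary choices here):

```
# Store messages classified as corrections; retrieve the most similar past
# corrections for the current situation and prepend them to the prompt.
import numpy as np
from openai import OpenAI

client = OpenAI()
memory = []  # list of (situation, correction, embedding) tuples

def embed(text):
    resp = client.embeddings.create(model="text-embedding-ada-002", input=[text])
    return np.array(resp.data[0].embedding)

def remember_correction(situation, correction):
    memory.append((situation, correction, embed(situation)))

def relevant_corrections(situation, k=3):
    q = embed(situation)
    def cosine(e):
        return float(np.dot(q, e) / (np.linalg.norm(q) * np.linalg.norm(e)))
    ranked = sorted(memory, key=lambda m: cosine(m[2]), reverse=True)
    return [(s, c) for s, c, _ in ranked[:k]]
```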
I experimented heavily with things like this when GPT came out; part of me wants to go back to it since I've seen shockingly few projects do what I assumed everyone would do.
LLMs + well thought out memory access can do some incredible things as general assistants right now, but that seemed so obvious I moved on from the idea almost immediately.
In retrospect, there's an interesting irony at play: LLMs make simple products very attractive. But if you embed them in more thoroughly engineered solutions, you can do some incredible things that are far above what they otherwise seem capable of.
Yet a large number of the people most experienced in creating thoroughly engineered solutions view LLMs very cynically because of the simple (and shallow) solutions that are being churned out.
Eventually LLMs may advance far enough that they bridge the implementation gap themselves, but I think there's a lot of opportunity left on the table because of that catch-22.
> Yet a large number of the people most experienced in creating thoroughly engineered solutions view LLMs very cynically because of the simple (and shallow) solutions that are being churned out.
Maybe, just maybe, because even the simple solutions are invariably an incomplete, brittle, complicated, unpredictable mess that you can't build anything complex with?
As eloquently demonstrated by your "simple" solutions.
> Gorilla enables LLMs to use tools by invoking APIs. Given a natural language query, Gorilla comes up with the semantically- and syntactically- correct API to invoke. With Gorilla, we are the first to demonstrate how to use LLMs to invoke 1,600+ (and growing) API calls accurately while reducing hallucination. We also release APIBench, the largest collection of APIs, curated and easy to be trained on! Join us, as we try to expand the largest API store and teach LLMs how to write them!
eval/: https://github.com/ShishirPatil/gorilla/tree/main/eval
- "Gorilla: Large Language Model connected with massive APIs" (2023-05) https://news.ycombinator.com/item?id=36073241
- "Gorilla: Large Language Model Connected with APIs" (2023-06) https://news.ycombinator.com/item?id=36333290
- "Gorilla-CLI: LLMs for CLI including K8s/AWS/GCP/Azure/sed and 1500 APIs (github.com/gorilla-llm)" (2023-06) https://news.ycombinator.com/item?id=36524078
It seems, after 1-2 years, that the true power of LLMs is in DevOps. I got pretty excited when I tried GPT-3 (a completion model), but as time went by and OpenAI shifted to chat models, we lost control over the LLM part and found new meaning in taking whatever model OpenAI made available as a black box and "chaining" it to other tools we already had, like databases, APIs, function calls/tools, etc. I'd say DevOps is exactly where open source is seriously behind; there are decent open-source models, but it costs so much to self-host them, despite the full power and control we have over them (via text-generation-webui and the like).
OpenAI is playing the DevOps game (starting maybe with the introduction of ChatML). The open-source community is playing the LLM and benchmarks game. Ironically, the two are converging: OpenAI's models (not the API) are getting dumber thanks to censorship and RLHF, to the point that open-source models are even better than some OpenAI models in some respects. On the other hand, open-source models are getting better tooling and DevOps thanks to oobabooga, llama.cpp, etc.
I'm seriously waiting for competitors to break nVidia's monopoly in this space. Maybe Apple?
One could get two used 3090s and set up a decent PC at a lower price.
I think the M2 Max is currently the best bang for the buck for running inference on open-source models. But the use case is so niche that Apple probably won't actively start supporting open-source models. In the long run I hope some smaller company gets its shit together and starts competing with NVIDIA.
GPU support in ML frameworks, however, is really not impressive. I have a MacBook with an M1 Max and 64 GB of RAM; I can load a 7B model for fine-tuning (Hugging Face Trainer, PyTorch, MPS), but the speed is just too slow: it only reaches 50% of the speed of an i5-12500 CPU in my tests.
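For reference, a minimal sketch of that setup (the model name is illustrative):

```
# Sketch of loading a 7B model on Apple's MPS backend for fine-tuning.
import torch
from transformers import AutoModelForCausalLM

device = "mps" if torch.backends.mps.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16
).to(device)
# Hugging Face Trainer then runs on the MPS backend; ops without MPS
# kernels fall back to the CPU, which is likely where the slowdown comes from.
```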
> I'm seriously waiting for competitors to change nVidia's monopoly in this space. Maybe Apple?
I would have thought AMD is the obvious contender. They are #2 in GPUs, they have formidable programming talent (based on their advances with Ryzen vs. Intel), and they have targeted AI as their goal. Am I missing something?
AMD have repeatedly dropped the ball when it comes to software support for compute and AI. Their hardware is quite capable, but very few people can actually make it work, which means most of the existing models have poor AMD support.
This is getting better with ROCm and such, but that's Linux-only and only works for a subset of tasks.
Both Intel and Apple have better "out of the box" support for ML and the ability to invest more in making these things work (e.g. Apple have implemented Stable Diffusion against Core ML themselves).
That is quite the caveat.
ChatGPT + Noteable is already powerful enough to get some work done via API calls (after installing and importing the libraries, writing Python code, managing secrets for authentication, etc.).
There is surely scope to streamline this much further.
I am very intently watching this space.
It would also be cool to have such a plugin for Google Colab.
I hope someone comes up with a new way to interact with LLMs other than a chat UI. It would make writing code even faster.
papers:
1. ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs https://arxiv.org/abs/2307.16789
2. Gorilla: Large Language Model Connected with Massive APIs https://arxiv.org/abs/2305.15334
Gorilla is compared against in this project; ToolLLM seems to postdate it.
> I suspect that if engineering of LLM solutions used more unseen test data, it would become apparent that LLMs really do not have sufficiently reliable "cognitive" ability to do any practical agent-type work.
```
You are a home assistant. Here is information about what's going on in the house:
It's 4 PM. Bob likes Chopin's Fantaisie-Impromptu. Alice likes Mozart's Rondo in D. Bob is in the house. Alice will be back from the office at 5 PM.
You get a prompt: I'm tired, play my favorite
```
For the above input, any LLM will play Chopin.
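A sketch of that scenario as an actual chat-completion call (model choice arbitrary):

```
# The house context goes in the system message; "who is speaking" is just
# another line of context the planner fills in.
from openai import OpenAI

client = OpenAI()
system = (
    "You are a home assistant. It's 4 PM. Bob likes Chopin's Fantaisie-Impromptu. "
    "Alice likes Mozart's Rondo in D. Bob is in the house. Alice will be back "
    "from the office at 5 PM. Answer with the single track to play."
)
resp = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "system", "content": system},
              {"role": "user", "content": "I'm tired, play my favorite"}],
)
print(resp.choices[0].message.content)  # expected: Chopin's Fantaisie-Impromptu
```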