We were working on this webpage to collect the entire three part article in one place (the third part isn't published yet). We didn't expect anyone to notice the site! Either way, part 3 should be out in a week or so.
> Note that in recent times, some doubt has been cast on if this technique is as powerful as believed. Additionally, there’s significant debate as to exactly what is going on during inference when Chain-of-Thought is being used...
I love this new era of computing we're in where rumors, second-guessing and something akin to voodoo have entered into working with LLMs.
That's the thing, it's a novel form of computing that's increasingly moving away from computer science. It deserves to be treated as a discipline of its own, with lots of words of caution and danger stickers slapped over it.
It’s text (word) manipulation based on probabilistic rules derived from analyzing human-produced text. And everyone knows language is imperfect. That’s why we introduced logic and formalism, so that we can reliably transmit knowledge.
That’s why LLMs are good at translating and spellchecking. We’ve all been describing the same world, and almost all texts respect grammar, so those are the first patterns to surface. But you can extract the same rules in other ways and create a program that does the job without the wasted computing power.
If we describe computing as solving problems, then it’s not computing, because if your solution was not part of the training data, you won’t solve anything. If we describe computing as symbol manipulation, then it’s not doing a good job, because the rules change with every model and they are probabilistic. There’s no way to get a reliable answer. It’s divination without the divine (no hint from an omniscient entity).
Does anyone have a convenient solution for multi-step workflows? For example, I'm filling out the basics of an NPC character sheet for my game prep. I'm using a particular rule system and want to give the enemy certain tactics, stats, and types of weapons. Right now I have a 'god prompt' trying to walk the LLM through creating the basic character sheet, but the responses get squeezed down into what one or two prompt responses can hold.
If I could do Node-RED or a function chain for prompts and outputs, that would be sweet.
For me, a very simple "break tasks down into a queue and store them in a DB" solution has helped tremendously with most requests.
Instead of trying to do everything in a single chat or chain, add steps that ask the LLM to break down the next tasks, with context, and store those in SQLite or something. Then start new chats/chains for each of those tasks.
Then just loop them back into the LLM.
I find that long chats or chains just confuse most models and we start seeing gibberish.
Right now I'm favoring something like:
"We're going to do task {task}. The current situation and context is {context}.
Break down what individual steps we need to perform to achieve {goal} and output these steps with their necessary context as {standard_task_json}. If the output is already enough to satisfy {goal}, just output the result as text."
I find that leaving everything to the LLM in a sequence is not as effective as using the LLM to break things down and having a DB and code logic to support the development of more complex outcomes.
One option for doing this is to incrementally build up the "document" using isolated prompts for each section. I say document because I am not exactly sure what the character sheet looks like, but I am assuming it can be constructed one section at a time. You create a prompt to create the first section. Then, you create a second prompt that gives the agent your existing document and prompts it to create the next section. You continue until all the sections are finished. In some cases this works better than doing a single conversation.
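A rough sketch of that incremental loop, with `call_llm` standing in for whatever model API you actually use:

```python
def build_document(sections: list[str], call_llm) -> str:
    """Build a character sheet (or any document) one section at a time,
    feeding each prompt the document produced so far."""
    document = ""
    for section in sections:
        prompt = (
            f"Here is the document so far:\n{document}\n\n"
            f"Write the next section: {section}. Output only that section."
        )
        # Each section gets its own isolated prompt; only the accumulated
        # text carries state forward, not the chat history.
        document += "\n\n" + call_llm(prompt)
    return document.strip()
```

The section names and prompt wording are illustrative; the point is that each call sees the growing document rather than a growing conversation.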
You can do multi-shot workflows pretty easily. I like to have the model produce markdown, then add code blocks (```json/yaml```) to extract the interim results. You can lay out multiple "phases" in your prompt, have it perform each one in turn, and have each one reference prior phases. Then at the end you just pull out the code blocks for each phase and you have your structured result.
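Extracting those interim results is a one-regex job. A sketch (the triple backtick is spelled `chr(96) * 3` only so the snippet can itself live inside a fenced example):

```python
import re

# Matches fenced blocks tagged json or yaml and captures their contents.
TICKS = chr(96) * 3  # ```
FENCE = re.compile(TICKS + r"(?:json|yaml)\s*\n(.*?)" + TICKS, re.DOTALL)


def extract_phases(markdown: str) -> list[str]:
    """Return the raw contents of each tagged code block, in document order,
    one per "phase" of the multi-shot prompt."""
    return [m.strip() for m in FENCE.findall(markdown)]
```

Each extracted string can then be fed to `json.loads` or a YAML parser for validation, which also gives you a cheap check that the model actually completed every phase.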
RAG does not prevent hallucinations, nor does it guarantee that the quality of your output depends solely on the quality of your input. Using LLMs for legal use cases, for example, has shown them to be poor for anything other than initial research, with at best 65% accuracy:
> So would strongly disagree that LLMs have become “good enough” for real-world applications, based on what was promised.
I can't speak for "what was promised" by anyone, but LLMs have been good enough to live in production as a core feature in my product since early last year, and have only gotten better.
You may be interested in "Deterministic Quoting" [1]. This doesn't completely "solve" hallucinations, but I would argue that it gets us to "good enough" in several applications.
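The core idea, very roughly (this is a sketch of the principle, not the implementation from [1]): only spans that can be verified verbatim against the source document are ever rendered as quotes.

```python
def verify_quotes(source: str, quoted_spans: list[str]) -> list[str]:
    """Keep only the spans that appear verbatim in the source text.
    Anything the model "quoted" that isn't actually there is dropped,
    so the UI never displays a hallucinated citation as a real quote."""
    return [span for span in quoted_spans if span in source]
```

The model is free to hallucinate in its own prose, but everything shown in quotation marks has been mechanically checked against the source.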
Looks like the same content that was posted on oreilly.com a couple days ago, just on a separate site. That has some existing discussion: https://news.ycombinator.com/item?id=40508390.
As we move LLM-enabled products into production, we definitely see a lot of what is being discussed here resonate. We also see the areas below as ones that need to be expanded upon for developers building in this space to take products to production.
I would love to see this article also expand to touch on things like:
- data management - (tooling, frameworks, open vs closed data management, labelling & annotations)
- inference as a pipeline - frameworks for breaking down model inference into smaller tasks & combining outputs (do DAGs have a role to play here?)
- prompts - areas like caching, management, versioning, evaluations
- model observability - tokens, costs, latency, drift?
- evals for multimodality - how do we tackle evals here which in turn can go into loops e.g. quality of audio, speech or visual outputs
I'm not saying the content of the article is wrong, but what apps are the people/companies writing articles like this actually building? I'm seriously unable to imagine any useful app. I only use GPT via the API (as a better Google for documentation, and its output is never usable without heavy editing). This week I tried to use "AI" in Notion: I needed to generate 84 checkboxes, one for each day starting from a specific date. I got 10 checkboxes and a line along the lines of "the rest should go here..." Completely useless.
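For what it's worth, that particular task is trivially deterministic, which is rather the point; a few lines of ordinary code do it exactly (the markdown checkbox format is an assumption):

```python
from datetime import date, timedelta


def daily_checkboxes(start: date, days: int = 84) -> str:
    """Markdown to-do list with one unchecked box per day from `start`."""
    return "\n".join(
        f"- [ ] {start + timedelta(days=i):%Y-%m-%d}" for i in range(days)
    )
```
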
I've built many production applications using a lot of these techniques and others - it's made money either by increasing sales or decreasing operational costs.
I think you're going about it backwards. You don't take a tool, and then try to figure out what to do with it. You take a problem, and then figure out which tool you can use to solve it.
But it seems to me that's exactly what they're doing: "We have LLMs, what do we do with them?" Anyway, I'm seriously just looking for an example of an app that is built with the stuff described in the article.
Personally, I've only used an LLM for one "serious" application: I used GPT-3.5 Turbo for transforming unstructured text into JSON. It was basically just an ad-hoc Node.js script that called the API (the prompt was a few examples of input-output pairs) and then ran some checks (these checks usually failed only because GPT also corrected misspellings). It would have taken me weeks to do manually, but with GPT's help it was a few hours (writing the script, plus I had made a lot of misspellings, so the script stopped a lot). But I cannot imagine anything more complex.
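The shape of that script, sketched in Python rather than Node.js. The field schema, the example pair, and the `call_llm` helper are all made up for illustration:

```python
import json

# Hypothetical few-shot examples: (unstructured input, expected JSON output).
FEW_SHOT = [
    (
        "Jane Doe, born 1990, lives in Oslo.",
        '{"name": "Jane Doe", "born": 1990, "city": "Oslo"}',
    ),
]


def extract_json(text: str, call_llm) -> dict:
    """Few-shot prompt: show input/output pairs, then the new input.
    Validate the reply so silent model edits (e.g. corrected spellings)
    fail loudly instead of slipping into the data."""
    prompt = "".join(f"Input: {i}\nOutput: {o}\n\n" for i, o in FEW_SHOT)
    prompt += f"Input: {text}\nOutput:"
    reply = json.loads(call_llm(prompt))
    for key in ("name", "born", "city"):
        if key not in reply:
            raise ValueError(f"missing field: {key}")
    return reply
```

The validation step is what makes this usable unattended: a malformed or incomplete reply stops the batch rather than contaminating the output.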
Are you sure? The article says "cite this as Yan et al. (May 2024)" and published-time in the metadata is 2024-05-12.
Weird: I just refreshed the page and it now redirects to a different domain (than the originally-submitted URL) and has a date of June 8, 2023. It still cites articles and blog posts from 2024, though.
Part 1: https://www.oreilly.com/radar/what-we-learned-from-a-year-of... Part 2: https://www.oreilly.com/radar/what-we-learned-from-a-year-of...
Imagine if physics literature was filled with stuff about psychology and how that would drive physicists nuts. That's how I feel right now ;)
Also, mentioning what to "forget" or no longer focus on seems to remove some noise from the responses when they are large.
“You are in charge of game prep and must work with an LLM over many prompts to…”
https://dho.stanford.edu/wp-content/uploads/Legal_RAG_Halluc...
Disclosure: author of [1]
[1] https://mattyyeung.github.io/deterministic-quoting
It's the "yes, we hallucinate, but don't worry, because we provide the sources for users to check" approach.
Even though everyone knows that users will never check unless the hallucination is egregious.
It's such a disingenuous way of handling this.
Here's a more dramatic example: https://www.grey-wing.com/
This company provides deeply integrated LLM-powered software for operating freight ships.
There are a lot of people who are doing this and achieving very good results.
Sorry, but the fact that it's not working for you doesn't mean it doesn't work.