We were working on this webpage to collect the entire three part article in one place (the third part isn't published yet). We didn't expect anyone to notice the site! Either way, part 3 should be out in a week or so.
> Note that in recent times, some doubt has been cast on if this technique is as powerful as believed. Additionally, there’s significant debate as to exactly what is going on during inference when Chain-of-Thought is being used...
I love this new era of computing we're in where rumors, second-guessing and something akin to voodoo have entered into working with LLMs.
That's the thing, it's a novel form of computing that's increasingly moving away from computer science. It deserves to be treated as a discipline of its own, with lots of words of caution and danger stickers slapped over it.
It’s text (word) manipulation based on probabilistic rules derived from analyzing human-produced text. And everyone knows language is imperfect. That’s why we introduced logic and formalism, so that we can reliably transmit knowledge.
That’s why LLMs are good at translating and spellchecking. We’ve all been describing the same world, and almost all texts respect grammar, so those are the first patterns to surface. But you can extract the same rules in other ways and create a program that does the job without the wasted computing power.
If we describe computing as solving problems, then it’s not computing, because if your solution was not part of the training data, you won’t solve anything. If we describe computing as symbol manipulation, then it’s not doing a good job, because the rules change with every model and they are probabilistic. There’s no way to get a reliable answer. It’s divination without the divine (no hint from an omniscient entity).
Does anyone have a convenient solution for multi-step workflows? For example, I'm filling out the basics of an NPC character sheet for my game prep. I'm using a particular rule system and want to give the enemy certain tactics, stats, and types of weapons. Right now I have a 'god prompt' trying to walk the LLM through creating the basic character sheet, but the responses get squeezed down into what one or two prompt responses can hold.
If I could do Node-RED or a function chain for prompts and outputs, that would be sweet.
For me, a very simple "break tasks down into a queue and store them in a DB" solution has helped tremendously with most requests.
Instead of trying to do everything in a single chat or chain, add steps that ask the LLM to break down the next tasks, with context, and store those in SQLite or something. Then start new chats/chains for each of those tasks.
Then just loop them back into the LLM.
I find that long chats or chains just confuse most models and we start seeing gibberish.
Right now I'm favoring something like:
"We're going to do task {task}. The current situation and context is {context}.
Break down what individual steps we need to perform to achieve {goal} and output these steps with their necessary context as {standard_task_json}. If the output is already enough to satisfy {goal}, just output the result as text."
I find that leaving everything to the LLM in a sequence is not as effective as using the LLM to break things down and having a DB and code logic to support the development of more complex outcomes.
One option for doing this is to incrementally build up the "document" using isolated prompts for each section. I say document because I am not exactly sure what the character sheet looks like, but I am assuming it can be constructed one section at a time. You create a prompt to create the first section. Then, you create a second prompt that gives the agent your existing document and prompts it to create the next section. You continue until all the sections are finished. In some cases this works better than doing a single conversation.
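A rough sketch of that incremental loop, with `call_llm` standing in for whatever model API you actually use:

```python
def build_document(sections: list[str], call_llm) -> str:
    """Build a character sheet (or any document) one section at a time,
    feeding each prompt the document produced so far."""
    document = ""
    for section in sections:
        prompt = (
            f"Here is the document so far:\n{document}\n\n"
            f"Write the next section: {section}. Output only that section."
        )
        # Each section gets its own isolated prompt; only the accumulated
        # text carries state forward, not the chat history.
        document += "\n\n" + call_llm(prompt)
    return document.strip()
```

The section names and prompt wording are illustrative; the point is that each call sees the growing document rather than a growing conversation.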
You can do multi-shot workflows pretty easily. I like to have the model produce markdown, then add code blocks (```json/yaml```) to extract the interim results. You can lay out multiple "phases" in your prompt, have it perform each one in turn, and have each one reference prior phases. Then at the end you just pull out the code blocks for each phase and you have your structured result.
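Extracting those interim results is a one-regex job. A sketch (the triple backtick is spelled `chr(96) * 3` only so the snippet can itself live inside a fenced example):

```python
import re

# Matches fenced blocks tagged json or yaml and captures their contents.
TICKS = chr(96) * 3  # ```
FENCE = re.compile(TICKS + r"(?:json|yaml)\s*\n(.*?)" + TICKS, re.DOTALL)


def extract_phases(markdown: str) -> list[str]:
    """Return the raw contents of each tagged code block, in document order,
    one per "phase" of the multi-shot prompt."""
    return [m.strip() for m in FENCE.findall(markdown)]
```

Each extracted string can then be fed to `json.loads` or a YAML parser for validation, which also gives you a cheap check that the model actually completed every phase.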
RAG does not prevent hallucinations, nor does it guarantee that the quality of your output depends solely on the quality of your input. Using LLMs for legal use cases, for example, has shown them to be poor for anything other than initial research, with at best 65% accuracy:
> So would strongly disagree that LLMs have become “good enough” for real-world applications, based on what was promised.
I can't speak for "what was promised" by anyone, but LLMs have been good enough to live in production as a core feature in my product since early last year, and have only gotten better.
You may be interested in "Deterministic Quoting" [1]. This doesn't completely "solve" hallucinations, but I would argue that it gets us to "good enough" in several applications.
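The core idea, very roughly (this is a sketch of the principle, not the implementation from [1]): only spans that can be verified verbatim against the source document are ever rendered as quotes.

```python
def verify_quotes(source: str, quoted_spans: list[str]) -> list[str]:
    """Keep only the spans that appear verbatim in the source text.
    Anything the model "quoted" that isn't actually there is dropped,
    so the UI never displays a hallucinated citation as a real quote."""
    return [span for span in quoted_spans if span in source]
```

The model is free to hallucinate in its own prose, but everything shown in quotation marks has been mechanically checked against the source.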
Looks like the same content that was posted on oreilly.com a couple days ago, just on a separate site. That has some existing discussion: https://news.ycombinator.com/item?id=40508390.
As we move LLM-enabled products into production, we definitely see a lot of what is being discussed here resonate. We also see the areas below as ones that need to be expanded upon for developers building in this space to take products to production.
I would love to see this article also expand to touch on things like:
- data management - (tooling, frameworks, open vs closed data management, labelling & annotations)
- inference as a pipeline - frameworks for breaking down model inference into smaller tasks & combining outputs (do DAGs have a role to play here?)
- prompts - areas like caching, management, versioning, evaluations
- model observability - tokens, costs, latency, drift?
- evals for multimodality - how do we tackle evals here which in turn can go into loops e.g. quality of audio, speech or visual outputs
I'm not saying the content of the article is wrong, but what apps are the people/companies writing articles like this actually building? I'm seriously unable to imagine any useful app. I only use GPT via the API (as a better Google for documentation, and its output is never usable without heavy editing). This week I tried to use "AI" in Notion: I needed to generate 84 checkboxes, one for each day starting from a specific date. I got 10 checkboxes and a line along the lines of "the rest should go here..." Completely useless.
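For what it's worth, that particular task is trivially deterministic, which is rather the point; a few lines of ordinary code do it exactly (the markdown checkbox format is an assumption):

```python
from datetime import date, timedelta


def daily_checkboxes(start: date, days: int = 84) -> str:
    """Markdown to-do list with one unchecked box per day from `start`."""
    return "\n".join(
        f"- [ ] {start + timedelta(days=i):%Y-%m-%d}" for i in range(days)
    )
```
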
I've built many production applications using a lot of these techniques and others - it's made money either by increasing sales or decreasing operational costs.
I think you're going about it backwards. You don't take a tool, and then try to figure out what to do with it. You take a problem, and then figure out which tool you can use to solve it.
But it seems to me that's exactly what they're doing: "We have LLMs, what do we do with them?" Anyway, I'm seriously just looking for an example of an app that is built with the stuff described in the article.
Personally, I've only used an LLM for one "serious" application: I used GPT-3.5 Turbo for transforming unstructured text into JSON. It was basically just an ad-hoc Node.js script that called the API (the prompt was a few examples of input-output pairs) and then ran some checks (these checks usually failed only because GPT also corrected misspellings). It would have taken me weeks to do manually, but with GPT's help it was a few hours (writing the script, plus I had made a lot of misspellings, so the script stopped a lot). But I cannot imagine anything more complex.
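The shape of that script, sketched in Python rather than Node.js. The field schema, the example pair, and the `call_llm` helper are all made up for illustration:

```python
import json

# Hypothetical few-shot examples: (unstructured input, expected JSON output).
FEW_SHOT = [
    (
        "Jane Doe, born 1990, lives in Oslo.",
        '{"name": "Jane Doe", "born": 1990, "city": "Oslo"}',
    ),
]


def extract_json(text: str, call_llm) -> dict:
    """Few-shot prompt: show input/output pairs, then the new input.
    Validate the reply so silent model edits (e.g. corrected spellings)
    fail loudly instead of slipping into the data."""
    prompt = "".join(f"Input: {i}\nOutput: {o}\n\n" for i, o in FEW_SHOT)
    prompt += f"Input: {text}\nOutput:"
    reply = json.loads(call_llm(prompt))
    for key in ("name", "born", "city"):
        if key not in reply:
            raise ValueError(f"missing field: {key}")
    return reply
```

The validation step is what makes this usable unattended: a malformed or incomplete reply stops the batch rather than contaminating the output.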
Are you sure? The article says "cite this as Yan et al. (May 2024)" and published-time in the metadata is 2024-05-12.
Weird: I just refreshed the page and it now redirects to a different domain (than the originally-submitted URL) and has a date of June 8, 2023. It still cites articles and blog posts from 2024, though.
Part 1: https://www.oreilly.com/radar/what-we-learned-from-a-year-of... Part 2: https://www.oreilly.com/radar/what-we-learned-from-a-year-of...
Imagine if physics literature was filled with stuff about psychology and how that would drive physicists nuts. That's how I feel right now ;)
Also, mentioning what to "forget" or no longer focus on seems to remove some noise from the responses when they are large.
“You are in charge of game prep and must work with an LLM over many prompts to…”
https://dho.stanford.edu/wp-content/uploads/Legal_RAG_Halluc...
Disclosure: author of [1]
[1] https://mattyyeung.github.io/deterministic-quoting
It's the "yes, we hallucinate, but don't worry, because we provide the sources for users to check" approach.
Even though everyone knows that users will never check unless the hallucination is egregious.
It's such a disingenuous way of handling this.
Here's a more dramatic example: https://www.grey-wing.com/
This company provides deeply integrated LLM-powered software for operating freight ships.
There are a lot of people who are doing this and achieving very good results.
Sorry, but the fact that it's not working for you doesn't mean it doesn't work.