Ask HN: How do you add guard rails in LLM response without breaking streaming?

If it's the problem I think it is, the solution is to run two concurrent prompts.

First prompt validates the input. Second prompt starts the actual content generation.

Combine both streams with SSE on the front end and don't render the content stream result until the validation stream returns "OK". In the SSE, encode the chunks of each stream with a stream ID. You can also handle it on the server side by cancelling execution once the first stream ends.

Generally, the experience is good because the validation prompt is shorter and faster to last (and only) token.

The SSE stream ends up like this:

    data: ing|tomatoes
    
    data: ing|basil
    
    data: ste|3. Chop the

I have a writeup (and repo) of the general technique of multi-streaming: https://chrlschn.dev/blog/2024/05/need-for-speed-llms-beyond... (animated gif at the bottom).

lolinder · 10 months ago

This doesn't solve the critical problem, which is that you usually can't tell if something is okay until you have context that you don't yet have. This is why even SOTA models will backtrack when you hit the filter—they only realize you're treading into banned territory after a bunch of text has already been generated, including text that already breaks the rules.

This is hard to fix because if you don't wait until you have enough context, you've given your censor a hair trigger.

> Combine both streams with SSE on the front end and don't render the content stream result until the validation stream returns "OK".

Just a note that this particular implementation has the additional problem of not actually applying your validation stream at the API level, which means your service can and will be abused worse than it would be if you combined the streams server-side. You should never rely on client-side validation for security or legal compliance.

CharlieDigital · 10 months ago

That's why I qualified it "general technique" and explicitly mentioned the option of server abort.

For most consumer use cases, it probably doesn't matter if a few tokens leak before the about, especially if they're not rendered.

Tune it to your needs :)

vunderba · 10 months ago

The OP is talking about constraining the response not the input. Granted, in many cases, the input may give some kind of indicator that the large language model may be more prone to generating output that could violate the given constraints but this is not guaranteed by any measure.

As far as I know, there's no way of validating a streamed response until those tokens have already been streamed unfortunately. You could try buffering the stream in larger chunks before displaying them on screen in the hopes that you might be able to catch it earlier, but that's not going to be a great user experience either.

bhawks · 10 months ago

One of the things I love about gen ai is that all it's problems are solved with using more gen ai.

olalonde · 10 months ago

From what I can tell, ChatGPT appears to be doing "optimistic" streaming... It will start streaming the response to the user but may eventually hide the response if it trips some censorship filter. The user can theoretically capture the response from the network since the censorship is essentially done client-side but I guess they consider that good enough.

joshhart · 10 months ago

Hi, I run the model serving team at Databricks. Usually you run regex filters, LLAMA Guard, etc on chunks at a time so you are still streaming but it's in batches of tokens rather than single tokens at a time. Hope that helps!

You could of course use us and get that out of the box if you have access to Databricks.

lordswork · 10 months ago

But ultimately, it's an unsolved problem in the field. Every single LLM has been jailbroken.

accrual · 10 months ago

Has o1 been jailbroken? My understanding is o1 is unique in that one model creates the initial output (chain of thought) then another model prepares the first response for viewing. Seems like that would be a fairly good way to prevent jailbreaks, but I haven't investigated myself.

brrrrrm · 10 months ago

fake it.

add some latency to the first token and then "stream" at the rate you received tokens even though the entire thing (or some sizable chunk) has been generated. that'll give you the buffer you need to seem fast while also staying safe.

tweezy · 10 months ago

I've tried a few things that seem to work. The first works pretty much perfectly, but adds quite a bit of latency to the final response. The second isn't perfect, but it's like 95% there

1 - the first option is to break this in to three prompts. The first prompt is either write a brief version, an outline of the full response, or even the full response. The second prompt is a validator, so you pass the output of the first to a prompt that says "does this follow the instructions. Return True | False." If True, send it to a third that says "Now rewrite this to answer the user's question." If False, send it back to the first with instructions to improve the response. This whole process can mean it takes 30 seconds or longer before the streaming of the final answer starts.

There are plenty of variations on the above process, so obviously feel free to experiment.

2 - The second option is to have instructions in your main prompt that says "Start each response with an internal dialogue wrapped in <thinking> </thinking> tags. Inside those tags first describe all of the rules you need to follow, then plan out exactly how you will respond to the user while following those rules."

Then on your frontend have the UI watch for those tags and hide everything between them from the user. This method isn't perfect, but it works extremely well in my experience. And if you're using a model like gpt-4o or claude 3.5 sonnet, it makes it really hard to make a mistake. This is the approach we're currently going with.

throwaway888abc · 10 months ago

Not sure about the exact nature of your project, but for something similar I’ve worked on, I had success using a combination of custom stop words and streaming data with a bit of custom logic layered on top. By fine-tuning the stop words specific to the domain and applying filters in real-time as the data streams in, I was able to improve the response to users taste. Depending on your use case, adding logic to dynamically adjust stop words or contextually weight them might also help you.

anshumankmr · 10 months ago

Google has some safety feature in Vertex AI to block certain keywords, but that does break the streaming. If it finds something offending, it replaces with a static response. That is one that I have felt "works", but it is a bit wonky from UX side.