Recently we came up with a fast way to generate text that matches a regex (https://blog.normalcomputing.ai/posts/2023-07-27-regex-guide...). The basic idea is simple: regular expressions have an equivalent Deterministic Finite Automaton (DFA) representation. We can transform this DFA into a generative model: in each state we get a list of symbols which correspond to completions that partially match the regular expression. We mask the other symbols in the logits returned by a large language model, sample a new symbol and move to the next state. The subtlety is that language models work with tokens, not symbols, so we derive a new FSM whose alphabet is the model's vocabulary. We can do this in only one pass over the vocabulary.
Generating the token masks thus only requires a dictionary lookup at each state. Our method blows other libraries like Microsoft's guidance out of the water.
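In pseudo-Python, the sampling loop looks roughly like this (a minimal sketch, not the library's actual API; it assumes a HuggingFace-style model/tokenizer, and `state_to_token_ids` / `next_state` are placeholders for the precomputed index):

```python
# Minimal sketch of constrained sampling with a precomputed token mask per FSM state.
# `state_to_token_ids` maps state -> set of allowed token ids; `next_state` is a dict
# keyed by (state, token_id). Both stand in for the index described above.
import torch

def constrained_generate(model, tokenizer, prompt, state_to_token_ids, next_state,
                         start_state=0, max_tokens=128):
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    state = start_state
    for _ in range(max_tokens):
        logits = model(input_ids).logits[0, -1]          # next-token logits
        allowed = state_to_token_ids[state]              # dictionary lookup per step
        mask = torch.full_like(logits, float("-inf"))
        mask[list(allowed)] = 0.0                        # keep only tokens that extend a match
        probs = torch.softmax(logits + mask, dim=-1)
        token_id = torch.multinomial(probs, 1)           # sample from the masked distribution
        input_ids = torch.cat([input_ids, token_id.view(1, 1)], dim=1)
        state = next_state[state, token_id.item()]       # advance the token-level FSM
        if token_id.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(input_ids[0])
```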
From there it was only a small leap to be able to generate text that follows a JSON schema (https://json-schema.org/), or is parseable into a Pydantic model (https://docs.pydantic.dev/latest/usage/models/). The method works with union types, optional types, nested schemas, arrays, everything. It is guaranteed that the output is parseable.
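For example, the usual Pydantic constructs are all fair game (plain Pydantic below, nothing specific to our library; the model names are made up):

```python
# Illustrative schema with optionals, unions, nested models and arrays.
from typing import List, Optional, Union
from pydantic import BaseModel

class Weapon(BaseModel):
    name: str
    damage: int

class Character(BaseModel):
    name: str
    age: Optional[int] = None
    strength: Union[int, float]
    weapons: List[Weapon] = []

# The JSON schema that generation is constrained against
print(Character.model_json_schema())   # Character.schema() on Pydantic v1
```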
I think it's cool, and I've spent a lot of time watching even tiny models output valid JSON over the weekend. Hope you will too.
I look forward to feedback, bug reports, feature requests and discussions!
Edit: Link to our pre-print explaining the method and how this can be extended to generate text that follows a Context-Free Grammar https://arxiv.org/abs/2307.09702
I am curious, however, for those who have played around with such libraries that wrap base LLMs with output structure: do base models like Llama 2 actually work well? My experience says "hell no!": you need a fair bit of instruction-tuning for specific use cases to actually get things to work.
And even then, it seems counter-intuitive to me: given an instruction-tuned model, post-hoc masking of the state space during generation amounts to changing the generation distribution, which seems potentially detrimental to the instruction-tuning?
About your second point, the goal is that the model can only generate JSON (for example), which can 100% be done by constraining which output tokens can and cannot be used.
Human evaluation is the gold standard, and the Llama 2 paper gave significant evidence that Llama 2 70b chat is on par with, if not better than, ChatGPT on that metric, so I tend to stick to it unless there is good reason not to.
Sure. My concern was not specific to llama-2, and was only using it as a placeholder example of a decent pre-trained base model. Replace it with your favorite base model, which you want to use for guided generation. My question is more fundamental - how does post-hoc guided generation interfere with the potential benefits of instruction-tuning?
> About your second point, the goal is that the model can only generate JSON (for example), which can 100% be done by constraining which output token can and cannot be used.
Mechanistically, yes. I am not arguing that. The whole point is to generate JSON that is "useful".
I'm using the MLC version (since that works with a GPU on my M2 Mac) via my https://github.com/simonw/llm-mlc plugin.
In our paper titled "Guiding Language Models of Code with Global Context using Monitors" (https://arxiv.org/abs/2306.10763), we propose Monitor Guided Decoding, which interfaces LLMs to static analysis, and guides the model to generate type-consistent code. Without any kind of fine-tuning, we show that using static analysis to guide token level generation at specific points leads to significantly improved quality of generated code, both in terms of compilability and match with ground truth. Even very small models (1.1B) are able to generate more compilable code than much larger models (175B) while also improving on match with ground truth.
(Just thinking out loud next)
If you allow me to be a little imprecise, guided-generation is prompting "just-in-time" unlike the other kind of prompting where you provide all reference tokens "ahead-of-time". Now there's work [1] out there that shows that smaller models rely much more on prompting than larger models do, i.e. smaller models are more faithful to the tokens in the prompt than the larger models which just do whatever they were going to do anyways.
Your results seem very much in line with this kind of a qualitative result --- you show that CodeGen-350M outperforms CodeGen-6B, and CodeGen-6B outperforms text-davinci-003 using MGD. Smaller models perhaps respond more strongly to certain kinds of prompting strategies than larger models do.
[1]: https://arxiv.org/pdf/2307.13702.pdf
Isn't that what we did with test driven development?
The primary difference was our generator functions were human instead of LLM. Why not cut out the middle-human?
I was rather concerned about a broader fundamental question - how does post-hoc guided generation interfere with the potential benefits of instruction-tuning?
The instruction tuning part is "trivial"...it's the dealing with edge cases part that gets me.
With classic code, edge cases are, well, insignificant edge cases. With an LLM you never know what will make it go off on a tangent, and the parsing code needs to deal with that chaos.
Or put differently the % of cases that are edge cases seems to have gone up dramatically
But it's still probabilistic, and nine times out of ten isn't good enough.
Occasionally it will hallucinate responses like this:
{"key1": "value1", "key2": "value2" for i in range(n)}
Re-prompting with the parsing error message is usually enough to get it on the second try.
But escaping double-quotes and newline characters is less reliable. Even after giving it multiple examples, it correctly escapes only about half the time.
Re-prompting for escaping errors still yields a ~50% success rate.
Here's their prompt for that: https://github.com/microsoft/TypeChat/blob/c45460f4030938da3...
I think the approach using grammars (seen here, but also in things like https://github.com/ggerganov/llama.cpp/pull/1773 ) is a much more elegant solution.
The chief error is not providing escape hatches. LLMs look for a right answer. If you are feeding it some texts and asking it to return structured data about the texts, but then one of the texts is blank, it will be difficult to determine a right answer, so you get hallucinations. The solution is an escape hatch where one of the arguments is a `textIsMissing` boolean or something.
As long as you've accounted for these failure modes, it works flawlessly.
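Concretely, the escape hatch can just be an extra field in the schema (a hypothetical Pydantic version of the idea; the field names here are made up):

```python
# Hypothetical sketch of the "escape hatch" pattern: give the model a legal way to say
# "I couldn't extract anything" instead of forcing it to invent values for blank input.
from typing import List, Optional
from pydantic import BaseModel

class ExtractedFacts(BaseModel):
    textIsMissing: bool = False        # the escape hatch: set when the input is blank/unusable
    title: Optional[str] = None
    key_points: List[str] = []
```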
I haven't gotten a chance to earnestly use the smaller Llama models yet in more than small prototypes (although I'm building a 4090-based system to learn more about finetuning them), but the little amount of experimenting I've done with them makes me think they need a decent amount of help with generating consistently-valid JSON matching some schema out of the box. This is a pretty neat tool to use for them, since it doesn't require finetuning runs, it just masks logits.
store that code in a json object
code: { "php_code": "<?php echo 'Hello, World!'; ?>" }
1. It consumes fewer tokens, no need to add too many examples into the prompt.
2. It suffers less from the forgetting issue.
Another minor advantage is that you can control precisely where your desired output begins.
But overall, those are nice perks, not too substantial IMO.
If this works, how do you select the optimal value? Maybe you can train a model that excels at the task of querying GPT-4 for valid JSON.
right now you can inject prompts that the LLM takes into consideration before the output
I wonder if you can make it have a "post" generation function that says like "keep re-trying in a loop (aka hallucinating with randomness) until the output message passes XYZ format/checks/scoring"
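You can bolt that on outside the model today; a hand-rolled sketch (not a feature of any particular library; `generate` and `passes_checks` are placeholders):

```python
# Post-generation retry loop: keep sampling until the output passes whatever
# format/checks/scoring you care about. `generate` is any LLM call (ideally with
# temperature > 0 so retries differ); `passes_checks` is your validator.
import json

def generate_until_valid(generate, passes_checks, prompt, max_attempts=5):
    for _ in range(max_attempts):
        output = generate(prompt)
        if passes_checks(output):
            return output
    raise ValueError("no valid output after %d attempts" % max_attempts)

def is_valid_json(text):
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False
```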
But you can do both. For my current use case of extracting information from articles, I have a json schema + one/two example articles along with their correct answers. This increases token costs but 3.5 is so cheap that it doesn't matter and for 4 you can use batching to decrease token cost per article.
As a brief example, suppose the only possible LLM outputs were "hello world", "food", "hello", and "good day" (and that they're all equally probable with no prompting). Suppose your grammar requires a space in the output somewhere and has no other constraints. If you sampled LLM outputs till something passed the grammar you'd receive "hello world" and "good day" with equal probability. If you apply the website's technique you'll receive "hello world" twice as frequently as "good day".
The core problem is that an answer prefix might have been extremely unlikely to yield a valid response, but the technique (probably -- assuming it succeeds -- my example assumed retries would eventually succeed) constructs a valid response from it regardless. Assuming enough independence in the right places everything is fine and dandy still, but correlated errors compound quickly in autoregressive models.
As a brief JSON-specific question, is an LLM more or less likely to make factual errors (hallucinations, truncated strings, missing main characters, ...) when it produces a response failing to adhere to a schema? If factual error rate relates nontrivially to schema error rate then this path is more perilous than it seems. Given the outsized impact certain words or schmooshed together word-phrases seem to have on LLM output, I'd be surprised if details like schema adherence didn't bleed into other characteristics of the output.
I am trying to think of an example where "answer prefix might have been extremely unlikely to yield a valid response, but the technique ( ... ) constructs a valid response from it regardless" might really cause a problem, but no luck so far. Does anyone have any idea? This could potentially be an interesting research question.
> let's say we had a grammar that had a key "healthy" with values "very_unhealthy" or "moderately_healthy." For broccoli, the LLM might intend to say "very_healthy" and choose "very" but then be pigeonholed into saying "very_unhealthy" because it's the only valid completion according to the grammar.
That said, you can use beam search to more or less solve this problem by evaluating the joint probability of all tokens in each branch of the grammar and picking the one with the highest probability (you might need some more nuance for free-form strings where the LLM can do whatever it wants and be "valid").
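A rough sketch of that joint-probability scoring for the enum case (HF-style scoring with a placeholder model; the broccoli example is borrowed from the parent comment):

```python
# Instead of committing token-by-token, score each full branch of the grammar
# (here, the allowed enum values) by its joint log-probability and pick the best.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")

prefix = '{"food": "broccoli", "healthy": "'
branches = ["very_unhealthy", "moderately_healthy"]

def joint_logprob(prefix, continuation):
    prefix_ids = tokenizer(prefix, return_tensors="pt").input_ids
    cont_ids = tokenizer(continuation, add_special_tokens=False, return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, cont_ids], dim=1)
    with torch.no_grad():
        logprobs = torch.log_softmax(model(input_ids).logits, dim=-1)
    # logits at position i predict the token at position i + 1
    total = 0.0
    for pos in range(prefix_ids.shape[1] - 1, input_ids.shape[1] - 1):
        total += logprobs[0, pos, input_ids[0, pos + 1]].item()
    return total

best = max(branches, key=lambda b: joint_logprob(prefix, b))
print(best)   # the whole branch with the highest joint probability
```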
> I am trying to think of an example where "answer prefix might have been extremely unlikely to yield a valid response, but the technique ( ... ) constructs a valid response from it regardless" might really cause a problem, but no luck so far. Does anyone have any idea? This could potentially be an interesting research question.
At one point in the past ChatGPT (at a model probability layer, not just because of the context window issue) was prone to truncating long JSON responses, and if that happened in a long string field then you'd see the observed behavior. An example application:
(-) You're asking the LLM to turn some written podcast description into something machine-readable. You chunk the input, feed each chunk into the model (somehow; ignore the details; they're not important), and turn paragraphs into {speaker_name: str, timestamp: str, content: str} blobs.
(1) The LLM is prone to turning long paragraphs into `{"content": "the beginning of the content...` patterns, using ellipses to indicate that there's more to that JSON object.
(2) If you actually retry till the LLM succeeds, it's leaps and bounds more likely to end that string with a quotation mark if the string has all the original input. I.e., output like `{"content": "the beginning of the content..."}` is comparatively rare.
(3) The article's technique, however, always morphs those truncated JSON blobs into valid JSON. Since the ellipsis is _valid_ at that point (it's just part of a string), instead of the vast majority of inputs failing, the vast majority succeed and contain an incorrect ellipsis sub-string.
In general, the LLM does autoregressive completions. Imagine two prefixes P1 and P2, each of which can be completed by classes of data so that P1{G1} adheres to the grammar, P1{F1} fails to adhere to the grammar, P2{G2} succeeds, and P2{F2} fails. With retry-till-passing-grammar the weighted probabilities are:
P1{G1}: Chance[P1] Chance[G1 | P1]
P2{G2}: Chance[P2] Chance[G2 | P2]
Whereas the weighted probabilities produced by the technique are:
P1{G1}: Chance[P1]
P2{G2}: Chance[P2]
In both cases you'd need to divide by the total probability, but the convolution by conditionals is both important and notably absent. For very simple schemas like {sentiment: "positive"|"negative"|"neutral"} the results might potentially be similar, but nothing in the idea of a greedy token filter forces that constraint.
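Putting numbers on the earlier hello/good/food example (a quick back-of-the-envelope check under the same assumptions, word-level "tokens", grammar = "must contain a space"):

```python
# Chance[P] for each prefix, and Chance[G | P] for a grammar-valid completion.
prefix_prob = {"hello": 0.5, "good": 0.25, "food": 0.25}
valid_cont_prob = {"hello": 0.5, "good": 1.0}   # P(" world" | "hello"), P(" day" | "good")

# Retry-till-passing-grammar: weight by Chance[P] * Chance[G | P], then renormalize.
rejection = {p: prefix_prob[p] * valid_cont_prob[p] for p in valid_cont_prob}
z = sum(rejection.values())
print({p: w / z for p, w in rejection.items()})     # hello: 0.5, good: 0.5

# Greedy masking: the conditional disappears; all of a prefix's mass ends up "valid".
print(prefix_prob["hello"] / prefix_prob["good"])   # 2.0 -> "hello world" twice as frequent
```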
https://news.ycombinator.com/item?id=36819906
https://github.com/ggerganov/llama.cpp/pull/1773
Our method is much more efficient. llama.cpp loops over the entire vocabulary (~50k tokens) at each step to generate the mask. We generate an index at initialization, and building the masks at each step only requires a dictionary lookup (trading memory for speed). Sampling is just as fast as standard sampling.
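For the curious, the index construction is conceptually something like this (a naive illustration, not the exact construction from the paper: for each token we walk the character-level DFA from every live state, so it is a single pass over the vocabulary):

```python
# Naive illustration of building the index: one pass over the vocabulary; for each
# token string, walk the character-level DFA from every live state and record where
# it lands. The result feeds the per-step dictionary lookup during sampling.
from collections import defaultdict

def build_index(dfa_transitions, live_states, vocab):
    """dfa_transitions: dict[(state, char)] -> state; vocab: dict[token_id] -> token string."""
    state_to_token_ids = defaultdict(set)   # state -> allowed token ids (the mask source)
    next_state = {}                         # (state, token_id) -> state after the token
    for token_id, token_str in vocab.items():        # single pass over the vocabulary
        for state in live_states:
            current = state
            for char in token_str:
                current = dfa_transitions.get((current, char))
                if current is None:                   # dead end: token not allowed from `state`
                    break
            else:
                state_to_token_ids[state].add(token_id)
                next_state[(state, token_id)] = current
    return state_to_token_ids, next_state
```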
I do wonder how much you win here by masking the tokens? You still need to iterate along the output vector to apply the mask. Masking on the accelerator still requires filtering on the CPU side? Compared to running the language model, the cost of iterating over the edges in the grammar seems small.
I was under the impression LLM tech is currently in a breakneck arms race and that things are dramatically changing every few months. It could simply just be a consequence of limited developer resources. It would be "astounding" if decade-old tech were missing such a fundamental feature, but for AI tech in arms-race mode it seems reasonable that they are still missing QoL features.
https://github.com/1rgs/jsonformer
or
https://github.com/newhouseb/clownfish
or
https://github.com/mkuchnik/relm
or
https://github.com/ggerganov/llama.cpp/pull/1773
or
https://github.com/Shopify/torch-grammar
Overall there are a ton of these logit-based guidance systems; the reason they don't get much traction is that the SOTA models are behind REST APIs that don't enable this fine-grained approach.
Those models perform so much better that people generally settle for just re-requesting until they get the correct format (and with GPT-4 that ends up being a fairly rare occurrence in my experience)
After each token generated by the LLM you update the logit bias “mask” to only allow the next token to be a valid json token?
Very slick!
Not sure how this can really guarantee 100%
It's not great but after some timeout you can just set the mask to only include closing brackets.
edit: They describe this more carefully in the paper.
Edit: It is! https://brandonwillard.github.io/