Recently we came up with a fast way to generate text that matches a regex (https://blog.normalcomputing.ai/posts/2023-07-27-regex-guide...). The basic idea is simple: regular expressions have an equivalent Deterministic Finite Automaton (DFA) representation. We can transform this DFA into a generative model: in each state we get a list of symbols which correspond to completions that partially match the regular expression. We mask the other symbols in the logits returned by a large language model, sample a new symbol and move to the next state. The subtlety is that language models work with tokens, not symbols, so we derive a new FSM whose alphabet is the model's vocabulary. We can do this in only one pass over the vocabulary.
Generating the token masks thus only requires a dictionary lookup at each state. Our method blows other libraries like Microsoft's guidance out of the water.
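In pseudo-Python, the sampling loop looks roughly like this (a minimal sketch, not the library's actual API; it assumes a HuggingFace-style model/tokenizer, and `state_to_token_ids` / `next_state` are placeholders for the precomputed index):

```python
# Minimal sketch of constrained sampling with a precomputed token mask per FSM state.
# `state_to_token_ids` maps state -> set of allowed token ids; `next_state` is a dict
# keyed by (state, token_id). Both stand in for the index described above.
import torch

def constrained_generate(model, tokenizer, prompt, state_to_token_ids, next_state,
                         start_state=0, max_tokens=128):
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    state = start_state
    for _ in range(max_tokens):
        logits = model(input_ids).logits[0, -1]          # next-token logits
        allowed = state_to_token_ids[state]              # dictionary lookup per step
        mask = torch.full_like(logits, float("-inf"))
        mask[list(allowed)] = 0.0                        # keep only tokens that extend a match
        probs = torch.softmax(logits + mask, dim=-1)
        token_id = torch.multinomial(probs, 1)           # sample from the masked distribution
        input_ids = torch.cat([input_ids, token_id.view(1, 1)], dim=1)
        state = next_state[state, token_id.item()]       # advance the token-level FSM
        if token_id.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(input_ids[0])
```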
From there it was only a small leap to be able to generate text that follows a JSON schema (https://json-schema.org/), or is parseable into a Pydantic model (https://docs.pydantic.dev/latest/usage/models/). The method works with union types, optional types, nested schemas, arrays, everything. It is guaranteed that the output is parseable.
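For example, the usual Pydantic constructs are all fair game (plain Pydantic below, nothing specific to our library; the model names are made up):

```python
# Illustrative schema with optionals, unions, nested models and arrays.
from typing import List, Optional, Union
from pydantic import BaseModel

class Weapon(BaseModel):
    name: str
    damage: int

class Character(BaseModel):
    name: str
    age: Optional[int] = None
    strength: Union[int, float]
    weapons: List[Weapon] = []

# The JSON schema that generation is constrained against
print(Character.model_json_schema())   # Character.schema() on Pydantic v1
```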
I think it's cool, and I've spent a lot of time watching even tiny models output valid JSON over the weekend. Hope you will too.
I look forward to feedback, bug reports, feature requests and discussions!
Edit: Link to our pre-print explaining the method and how this can be extended to generate text that follows a Context-Free Grammar https://arxiv.org/abs/2307.09702
I am curious, however, for those who have played around with such libraries that wrap base LLMs with output structure: do base models like Llama 2 actually work well? My experience says "hell no!": you need a fair bit of instruction-tuning for specific use cases to actually get things to work.
And even then, it seems counter-intuitive to me: given an instruction-tuned model, post-hoc masking of the state space during generation amounts to changing the generation distribution, which seems potentially detrimental to the instruction-tuning?
About your second point, the goal is that the model can only generate JSON (for example), which can 100% be done by constraining which output tokens can and cannot be used.
Human evaluation is the gold standard, and the Llama 2 paper gave significant evidence that Llama 2 70b chat is on par with, if not better than, ChatGPT on that metric, so I tend to stick to it unless there is good reason not to.
Sure. My concern was not specific to llama-2, and was only using it as a placeholder example of a decent pre-trained base model. Replace it with your favorite base model, which you want to use for guided generation. My question is more fundamental - how does post-hoc guided generation interfere with the potential benefits of instruction-tuning?
> About your second point, the goal is that the model can only generate JSON (for example), which can 100% be done by constraining which output token can and cannot be used.
Mechanistically, yes. I am not arguing that. The whole point is to generate JSON that is "useful".
I'm using the MLC version (since that works with a GPU on my M2 Mac) via my https://github.com/simonw/llm-mlc plugin.
In our paper titled "Guiding Language Models of Code with Global Context using Monitors" (https://arxiv.org/abs/2306.10763), we propose Monitor Guided Decoding, which interfaces LLMs to static analysis, and guides the model to generate type-consistent code. Without any kind of fine-tuning, we show that using static analysis to guide token level generation at specific points leads to significantly improved quality of generated code, both in terms of compilability and match with ground truth. Even very small models (1.1B) are able to generate more compilable code than much larger models (175B) while also improving on match with ground truth.
(Just thinking out loud next)
If you allow me to be a little imprecise, guided-generation is prompting "just-in-time" unlike the other kind of prompting where you provide all reference tokens "ahead-of-time". Now there's work [1] out there that shows that smaller models rely much more on prompting than larger models do, i.e. smaller models are more faithful to the tokens in the prompt than the larger models which just do whatever they were going to do anyways.
Your results seem very much in line with this kind of a qualitative result --- you show that CodeGen-350M outperforms CodeGen-6B, and CodeGen-6B outperforms text-davinci-003 using MGD. Smaller models perhaps respond more strongly to certain kinds of prompting strategies than larger models do.
[1]: https://arxiv.org/pdf/2307.13702.pdf
Isn't that what we did with test driven development?
The primary difference was our generator functions were human instead of LLM. Why not cut out the middle-human?
I was rather concerned about a broader fundamental question - how does post-hoc guided generation interfere with the potential benefits of instruction-tuning?
The instruction tuning part is "trivial"...it's the dealing with edge cases part that gets me.
With classic code, edge cases are, well, insignificant edge cases. With an LLM you never know what will make it go off on a tangent, and the parsing code needs to deal with that chaos.
Or put differently the % of cases that are edge cases seems to have gone up dramatically
But it's still probabilistic, and nine times out of ten isn't good enough.
Occasionally it will hallucinate responses like this:
{"key1": "value1", "key2": "value2" for i in range(n)}
Re-prompting with the parsing error message is usually enough to get it on the second try.
But escaping double-quotes and newline characters is less reliable. Even after giving it multiple examples, it correctly escapes only about half the time.
Re-prompting for escaping errors still yields a ~50% success rate.
Here's their prompt for that: https://github.com/microsoft/TypeChat/blob/c45460f4030938da3...
I think the approach using grammars (seen here, but also in things like https://github.com/ggerganov/llama.cpp/pull/1773 ) is a much more elegant solution.
The chief error is not providing escape hatches. LLMs look for a right answer. If you are feeding it some texts and asking it to return structured data about the texts, but then one of the texts is blank, it will be difficult to determine a right answer, so you get hallucinations. The solution is an escape hatch where one of the arguments is a `textIsMissing` boolean or something.
As long as you've accounted for these failure modes, it works flawlessly.
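Concretely, the escape hatch can just be an extra field in the schema (a hypothetical Pydantic version of the idea; the field names here are made up):

```python
# Hypothetical sketch of the "escape hatch" pattern: give the model a legal way to say
# "I couldn't extract anything" instead of forcing it to invent values for blank input.
from typing import List, Optional
from pydantic import BaseModel

class ExtractedFacts(BaseModel):
    textIsMissing: bool = False        # the escape hatch: set when the input is blank/unusable
    title: Optional[str] = None
    key_points: List[str] = []
```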
I haven't gotten a chance to earnestly use the smaller Llama models yet in more than small prototypes (although I'm building a 4090-based system to learn more about finetuning them), but the little amount of experimenting I've done with them makes me think they need a decent amount of help with generating consistently-valid JSON matching some schema out of the box. This is a pretty neat tool to use for them, since it doesn't require finetuning runs, it just masks logits.
store that code in a json object
code: { "php_code": "<?php echo 'Hello, World!'; ?>" }
1. It consumes fewer tokens, no need to add too many examples into the prompt.
2. It suffers less from the forgetting issue.
Another minor advantage is that you can control precisely where your desired output begins.
But overall, those are nice perks, not too substantial IMO.
If this works, how do you select the optimal value? Maybe you can train a model that excels at the task of querying GPT-4 for valid JSON.
right now you can inject prompts that the LLM takes into consideration before the output
I wonder if you can make it have a "post" generation function that says like "keep re-trying in a loop (aka hallucinating with randomness) until the output message passes XYZ format/checks/scoring"
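You can bolt that on outside the model today; a hand-rolled sketch (not a feature of any particular library; `generate` and `passes_checks` are placeholders):

```python
# Post-generation retry loop: keep sampling until the output passes whatever
# format/checks/scoring you care about. `generate` is any LLM call (ideally with
# temperature > 0 so retries differ); `passes_checks` is your validator.
import json

def generate_until_valid(generate, passes_checks, prompt, max_attempts=5):
    for _ in range(max_attempts):
        output = generate(prompt)
        if passes_checks(output):
            return output
    raise ValueError("no valid output after %d attempts" % max_attempts)

def is_valid_json(text):
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False
```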
But you can do both. For my current use case of extracting information from articles, I have a json schema + one/two example articles along with their correct answers. This increases token costs but 3.5 is so cheap that it doesn't matter and for 4 you can use batching to decrease token cost per article.
As a brief example, suppose the only possible LLM outputs were "hello world", "food", "hello", and "good day" (and that they're all equally probable with no prompting). Suppose your grammar requires a space in the output somewhere and has no other constraints. If you sampled LLM outputs till something passed the grammar you'd receive "hello world" and "good day" with equal probability. If you apply the website's technique you'll receive "hello world" twice as frequently as "good day".
The core problem is that an answer prefix might have been extremely unlikely to yield a valid response, but the technique (probably -- assuming it succeeds -- my example assumed retries would eventually succeed) constructs a valid response from it regardless. Assuming enough independence in the right places everything is fine and dandy still, but correlated errors compound quickly in autoregressive models.
As a brief JSON-specific question, is an LLM more or less likely to make factual errors (hallucinations, truncated strings, missing main characters, ...) when it produces a response failing to adhere to a schema? If factual error rate relates nontrivially to schema error rate then this path is more perilous than it seems. Given the outsized impact certain words or schmooshed together word-phrases seem to have on LLM output, I'd be surprised if details like schema adherence didn't bleed into other characteristics of the output.
I am trying to think of an example where "answer prefix might have been extremely unlikely to yield a valid response, but the technique ( ... ) constructs a valid response from it regardless" might really cause a problem, but no luck so far. Does anyone have any idea? This could potentially be an interesting research question.
> let's say we had a grammar that had a key "healthy" with values "very_unhealthy" or "moderately_healthy." For broccoli, the LLM might intend to say "very_healthy" and choose "very" but then be pigeonholed into saying "very_unhealthy" because it's the only valid completion according to the grammar.
That said, you can use beam search to more or less solve this problem by evaluating the joint probability of all tokens in each branch of the grammar and picking the one with the highest probability (you might need some more nuance for free-form strings where the LLM can do whatever it wants and be "valid").
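A rough sketch of that joint-probability scoring for the enum case (HF-style scoring with a placeholder model; the broccoli example is borrowed from the parent comment):

```python
# Instead of committing token-by-token, score each full branch of the grammar
# (here, the allowed enum values) by its joint log-probability and pick the best.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")

prefix = '{"food": "broccoli", "healthy": "'
branches = ["very_unhealthy", "moderately_healthy"]

def joint_logprob(prefix, continuation):
    prefix_ids = tokenizer(prefix, return_tensors="pt").input_ids
    cont_ids = tokenizer(continuation, add_special_tokens=False, return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, cont_ids], dim=1)
    with torch.no_grad():
        logprobs = torch.log_softmax(model(input_ids).logits, dim=-1)
    # logits at position i predict the token at position i + 1
    total = 0.0
    for pos in range(prefix_ids.shape[1] - 1, input_ids.shape[1] - 1):
        total += logprobs[0, pos, input_ids[0, pos + 1]].item()
    return total

best = max(branches, key=lambda b: joint_logprob(prefix, b))
print(best)   # the whole branch with the highest joint probability
```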
> I am trying to think of an example where "answer prefix might have been extremely unlikely to yield a valid response, but the technique ( ... ) constructs a valid response from it regardless" might really cause a problem, but no luck so far. Does anyone have any idea? This could potentially be an interesting research question.
At one point in the past ChatGPT (at a model probability layer, not just because of the context window issue) was prone to truncating long JSON responses, and if that happened in a long string field then you'd see the observed behavior. An example application:
(-) You're asking the LLM to turn some written podcast description into something machine-readable. You chunk the input, feed each chunk into the model (somehow; ignore the details; they're not important), and turn paragraphs into {speaker_name: str, timestamp: str, content: str} blobs.
(1) The LLM is prone to turning long paragraphs into `{"content": "the beginning of the content...` patterns, using ellipses to indicate that there's more to that JSON object.
(2) If you actually retry till the LLM succeeds, it's leaps and bounds more likely to end that string with a quotation mark if the string has all the original input. I.e., output like `{"content": "the beginning of the content..."}` is comparatively rare.
(3) The article's technique, however, always morphs those truncated JSON blobs into valid JSON. Since the ellipsis is _valid_ at that point (it's just part of a string), instead of the vast majority of inputs failing, the vast majority succeed and contain an incorrect ellipsis sub-string.
In general, the LLM does autoregressive completions. Imagine two prefixes P1 and P2, each of which can be completed by classes of data so that P1{G1} adheres to the grammar, P1{F1} fails to adhere to the grammar, P2{G2} succeeds, and P2{F2} fails. With retry-till-passing-grammar the weighted probabilities are:
P1{G1}: Chance[P1] Chance[G1 | P1]
P2{G2}: Chance[P2] Chance[G2 | P2]
Whereas the weighted probabilities produced by the technique are:
P1{G1}: Chance[P1]
P2{G2}: Chance[P2]
In both cases you'd need to divide by the total probability, but the convolution by conditionals is both important and notably absent. For very simple schemas like {sentiment: "positive"|"negative"|"neutral"} the results might potentially be similar, but nothing in the idea of a greedy token filter forces that constraint.
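Putting numbers on the earlier hello/good/food example (a quick back-of-the-envelope check under the same assumptions, word-level "tokens", grammar = "must contain a space"):

```python
# Chance[P] for each prefix, and Chance[G | P] for a grammar-valid completion.
prefix_prob = {"hello": 0.5, "good": 0.25, "food": 0.25}
valid_cont_prob = {"hello": 0.5, "good": 1.0}   # P(" world" | "hello"), P(" day" | "good")

# Retry-till-passing-grammar: weight by Chance[P] * Chance[G | P], then renormalize.
rejection = {p: prefix_prob[p] * valid_cont_prob[p] for p in valid_cont_prob}
z = sum(rejection.values())
print({p: w / z for p, w in rejection.items()})     # hello: 0.5, good: 0.5

# Greedy masking: the conditional disappears; all of a prefix's mass ends up "valid".
print(prefix_prob["hello"] / prefix_prob["good"])   # 2.0 -> "hello world" twice as frequent
```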
https://news.ycombinator.com/item?id=36819906
https://github.com/ggerganov/llama.cpp/pull/1773
Our method is much more efficient. llama.cpp loops over the entire vocabulary (~50k tokens) at each step to generate the mask. We generate an index at initialization, and building the masks at each step only requires a dictionary lookup (trading memory for speed). Sampling is just as fast as standard sampling.
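For the curious, the index construction is conceptually something like this (a naive illustration, not the exact construction from the paper: for each token we walk the character-level DFA from every live state, so it is a single pass over the vocabulary):

```python
# Naive illustration of building the index: one pass over the vocabulary; for each
# token string, walk the character-level DFA from every live state and record where
# it lands. The result feeds the per-step dictionary lookup during sampling.
from collections import defaultdict

def build_index(dfa_transitions, live_states, vocab):
    """dfa_transitions: dict[(state, char)] -> state; vocab: dict[token_id] -> token string."""
    state_to_token_ids = defaultdict(set)   # state -> allowed token ids (the mask source)
    next_state = {}                         # (state, token_id) -> state after the token
    for token_id, token_str in vocab.items():        # single pass over the vocabulary
        for state in live_states:
            current = state
            for char in token_str:
                current = dfa_transitions.get((current, char))
                if current is None:                   # dead end: token not allowed from `state`
                    break
            else:
                state_to_token_ids[state].add(token_id)
                next_state[(state, token_id)] = current
    return state_to_token_ids, next_state
```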
I do wonder how much you win here by masking the tokens? You still need to iterate along the output vector to apply the mask. Masking on the accelerator still requires filtering on the CPU side? Compared to running the language model, the cost of iterating over the edges in the grammar seems small.
I was under the impression LLM tech is currently in a breakneck arms race and that things are dramatically changing every few months. It could simply just be a consequence of limited developer resources. It would be "astounding" if decade-old tech were missing such a fundamental feature, but for AI tech in arms-race mode it seems reasonable that they are still missing QoL features.
https://github.com/1rgs/jsonformer
or
https://github.com/newhouseb/clownfish
or
https://github.com/mkuchnik/relm
or
https://github.com/ggerganov/llama.cpp/pull/1773
or
https://github.com/Shopify/torch-grammar
Overall there are a ton of these logit-based guidance systems; the reason they don't get much traction is that the SOTA models are behind REST APIs that don't enable this fine-grained approach.
Those models perform so much better that people generally settle for just re-requesting until they get the correct format (and with GPT-4 that ends up being a fairly rare occurrence in my experience)
After each token generated by the LLM you update the logit bias “mask” to only allow the next token to be a valid json token?
Very slick!
Not sure how this can really guarantee 100%
It's not great but after some timeout you can just set the mask to only include closing brackets.
edit: They describe this more carefully in the paper.
Edit: It is! https://brandonwillard.github.io/