kcorbitt · 2 years ago
I've thought about building this for a while, glad it's out there!

Not only does this guarantee your output is JSON, it lowers your generation cost and latency by filling in many of the repetitive schema tokens without passing them through the LLM.

For the very common case of "extracting multiple structured fields from a piece of unstructured text," I believe there's an even stronger optimization possible that would further decrease costs, latency and potentially even improve accuracy.

Assuming the fields you want to extract are independent (and they often are), you don't need to generate them all in one go autoregressively. Eg. instead of running the following pseudo-prompt:

    "Input: 'It's sunny and cold today'
     Output schema: {"sunny": boolean, "temperature": string}"
You could instead run the following two:

    "Input: 'It's sunny and cold today'
     Output schema: {"sunny": boolean}"

    "Input: 'It's sunny and cold today'
     Output schema: {"temperature": string}"
We don't do that today because when done naively it's very inefficient -- you'd be tokenizing, passing to the GPU, and computing the KV cache of the shared part of the prompt twice. But a library with the right abstraction could run those two queries in a batch, in parallel, and reuse the same tokenization and KV cache for both of them. It would actually be more efficient than generating both fields in one go, since once you factor out the shared prefixes, both the generated text and its context are shorter!
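
Here's a rough sketch of what that could look like with a HuggingFace causal LM (gpt2 and the prompts are just stand-ins, and the per-field cache copy is a simplification -- a real library would batch the fields and share the cache tensors directly):

    # Sketch only: reuse the shared-prefix KV cache across per-field generations.
    import copy
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in model
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    model.eval()

    shared_prefix = "Input: 'It's sunny and cold today'\nOutput schema: "
    field_prompts = ['{"sunny": ', '{"temperature": "']

    # Tokenize and run the shared prefix once, keeping its KV cache.
    prefix_ids = tokenizer(shared_prefix, return_tensors="pt").input_ids
    with torch.no_grad():
        prefix_out = model(prefix_ids, use_cache=True)

    for field in field_prompts:
        # Copy so each field starts from the identical prefix cache
        # instead of re-encoding and re-processing the prefix.
        past = copy.deepcopy(prefix_out.past_key_values)
        input_ids = tokenizer(field, return_tensors="pt",
                              add_special_tokens=False).input_ids
        pieces = []
        with torch.no_grad():
            for _ in range(8):  # greedily generate a handful of value tokens
                out = model(input_ids, past_key_values=past, use_cache=True)
                past = out.past_key_values
                next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
                pieces.append(next_id)
                input_ids = next_id
        print(field + tokenizer.decode(torch.cat(pieces, dim=-1)[0]))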

I mentioned above that this could also improve accuracy. Of course it doesn't do that by default (except that by excluding all the irrelevant fields it makes self-attention's job easier). But what it does do is give you an independent prompt for each field you're interested in. And so for particularly tricky fields you're trying to extract, you have the flexibility to eg. add several examples to make the generation N-shot.

travisjungroth · 2 years ago
Maybe this will make CUE popular. It’s similar to JSON, but the ideas of schema and values are put together through unification, or you could say narrowing constraints. CUE would handle taking all of those values individually, then combining them into something concrete, incomplete, or erroring out.
verdverm · 2 years ago
I've already been experimenting. CUE is different enough, but close enough to confuse GPT. I haven't tried fine tuning yet. I should go back and try few-shot with GPT4 now that I have access.

What I've been doing instead is having LLMs generate JSON, putting what jsonformer does in the first prompt for few-shot learning, and then combining it with CUE after, since you can easily intermix data files with CUE.

My latest experiment: https://twitter.com/verdverm/status/1652504163635347456

Creating the prompt for this was pretty interesting and illuminating. While it works for the full text there, you can also do it in parts, only outputting the new parts of the JSON that is merged with the CUE.

bckr · 2 years ago
Can you briefly describe how you got to the point of having this kind of intuition about language models?
kcorbitt · 2 years ago
Kind of a meta-answer, but my personal learning style is "think of something cool to build, then figure out what I need to know to build it." It just so happens that a lot of the interesting/cool stuff going on right now builds on top of LLMs, so my projects have naturally gravitated that way. But I've never sat down to take an AI course or read the "Attention Is All You Need" paper or anything. I absorb much more when I learn something because I need it for something I'm working on.
kolinko · 2 years ago
Not op, but I can share my approach - I went line by line through Recmo's Cria: https://github.com/recmo/cria - which is an implementation of Llama in Numpy - so very low level. It took me I think 3-4 days x 10 hours, plus 1-2 days of reading about Transformers - but from that you can see how models generate text and come away with a deep understanding of what's going on.
tysam_and · 2 years ago
I can't speak for OP, but something that I think helps is to think of the generation process as a jumping-off point: you can control where it starts, but not much of what is generated afterwards.

Adding a scheme like this confines the LLM's potential off-roading to a much smaller zone. Additionally, it breaks up the chain of dependencies between the two example outputs, because we no longer need to depend upon past inputs to correctly output the schema.

Since the JSON's structural information is no longer required to be driven by the LLM (it still has to understand it to generate things with a modicum of sense, IIRC), we can look at our dependency graph for outputs. This changes because now the fields really and truly are independent (assuming they are truly informationally independent).

So now some kind of conjoined information requirement of ( autoregressive output ) <- (( field A ) <- ( field B )) becomes ( autoregressive output ) <- (( field A ) && ( field B )), which can then be factored into separate calls instead of a sequential one, yielding a batched call of (( autoregressive output A ) <- ( field A ) && ( autoregressive output B ) <- ( field B )).

From there it is just implementation. I likely would not have thought about the OP's way of handling things for a good while, though maybe I would have stumbled into it had I enough reason to think about structured/templated kinds of generation, which I believe I do now! <3 :) It really breaks a lot of assumptions that are easy to quietly make, and I honestly had not thought appropriately about the consequences of reframing things in this way.

As for "how" to think about this, if I were to give my take, it would be always just turning whatever problem in front of you is into a puzzle where you simplify it further each time. Optimizing for less computation, time, code, or even just what all of those are a kind of proxy for: less information to sufficiently solve a problem. We can see that this problem is reduced in complexity appropriately because we remove a redundancy that does not need to be there at all.

One way to look at this is in the relationships between parts of an idea. If you're able to understand, even vaguely, the concepts behind some other concept and how they interact, and maybe even have a 'standard toolkit' of relating to them, you can start finding/transferring/applying other skills to these parts of a concept. I don't think there's a guaranteed-efficient way to maybe reduce a concept down to its parts, or down to a more efficient representation without already, well, knowing that representation. It's an NP-hard problem to me personally, and is the reason why research and other academic pursuits can take a while. It is a good skill to learn I suppose and I certainly enjoy trying to use it, personally.

To tie this back to your question about language models -- yes, some things have to do with the language model, but oftentimes it's actually just the raw mathematical components underlying the model. If you look for those (please please please please please!!!!), you don't necessarily _have_ to concern yourself with the implementation details (beyond runtime limits, etc.); as long as the math still applies, you should be able to reason really quite well about what else is happening or could happen with a model type like these.

In particular, LLMs being an autoregressive model where each output depends upon its inputs lets us set up a dependency graph. Then, based upon some prior assumptions, we can make substitutions/changes that allow us to fragment the dependency graph and move it around as we wish. This is not just applicable to LLMs, however; dependency graphs are useful in a wide number of areas.

So one other thing that we're not talking about here is that we're optimizing for an objective we want (clean JSON) by explicitly...well, injecting that objective instead of living on just hopes and dreams, y'aknow. This is a pretty straightforward way of solving the problem by putting the answer in the question, though poor input content still can be a problem.

Stated a different way, we're collapsing the entropy of what the network can introduce. The output should be JSON, but remember [!!!!!!!]: neural networks are noisy estimators, so JSON errors are mathematically guaranteed (even if rare), which means any pipeline that depends on the output, like code, can and will fail, and is brittle to all sorts of other complicated parsing errors. To catch/detect/enumerate/correct those errors, we need all of the information required to implement the JSON structure itself. So basically we'd be using the same exact information, just enforcing it in a horrendously inefficient manner -- which is how people have been doing it until the present, which is okay, as we humans are certainly not NP-optimal machines IMO. The point is that any variance beyond some extremely tiny limit is a problem here, and that's not what LLMs are made to do. So at some point it's guaranteed to break, and at high volumes it's basically guaranteed to break in a way that's either unusable or requires so much effort to fix that you might as well have embedded a JSON prior into your generation process, because it would have required the same amount of information as external validation would, albeit with less effort (!!!!). Collapsing that entropy is perfectly fine in our case if we're exclusively generating JSON, since it gives us exactly what we want. Most methods like this thankfully have a low level of invasiveness to the model as well, freeing us up to use either the same or a similar model for multiple tasks.

This can create a bit of an ideological illusion, as we technically are destroying information by collapsing the distributions of sentences/strings of tokens/etc. that we are generating, and it can lend itself to an "oh, we can add whatever functionality we want!" kind of belief about this kind of modeling. What we're adding and taking away matters. It's also part of how/why training these models on next-token prediction over large text corpora is so powerful: we can trim them down to some smaller subproblem much, much more easily than we can expand them to cover a larger one. Which is pretty darn cool!

I know this sorta flew around a lot of places and touched on a lot of things, probably not as cogently as I'd want if I had more time to review and revise it. Hope it was/is helpful for you, and feel free to let me know if you have any questions. It's a very cool topic on the whole to me, tbh, and there are a number of interesting conversations that can branch off from this one. Honestly, this whole general area is where I see the real value in LLM development and research. It's practical and it's helpful! :D :) <3 :)

My source for this is a number of years of experience across a wide variety of ML models, though I'm sure I made an embarrassing blunder or two in this post. ;P

execveat · 2 years ago
You'd need to put the input first for this approach to work, but in my testing models work better if you lead with a question.
kcorbitt · 2 years ago
Hmm. I admit that I haven't thought about this deeply, but I'm not sure that's true? It seems to me that you could extend the KV cache either backwards or forwards equally easily.
tysam_and · 2 years ago
I could be reading this wrong, but my assumption is/has been that the prompt goes up to the end of the JSON field name, and the LLM is only filling in the actual value, not the key. I could be wrong on this one, however.
newhouseb · 2 years ago
Oh nice! I built a similar system a few weeks ago: https://github.com/newhouseb/clownfish

I think the main differentiating factor here is that this is better if you have a simpler JSON schema without enums or oneOf constraints. If you do have these constraints, i.e. let's say you wanted an array of different types representing items on a menu -- { kind: pizza, toppings: [pepperoni] } or { kind: ice_cream, flavor: vanilla | strawberry } -- then you would need something more sophisticated like clownfish, which can ask the LLM to pick specific properties (and can do some backtracking so you can do proper beam search).
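
A sketch of that pick-a-branch idea (not clownfish's actual code -- gpt2, the prompt, and the flavor options are placeholders): score each allowed continuation under the model and keep the most likely one.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in model
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    def continuation_logprob(prefix: str, continuation: str) -> float:
        """Sum of the log-probs the model assigns to `continuation` given `prefix`."""
        prefix_ids = tokenizer(prefix, return_tensors="pt").input_ids
        cont_ids = tokenizer(continuation, add_special_tokens=False,
                             return_tensors="pt").input_ids
        input_ids = torch.cat([prefix_ids, cont_ids], dim=-1)
        with torch.no_grad():
            logprobs = torch.log_softmax(model(input_ids).logits, dim=-1)
        total = 0.0
        offset = prefix_ids.shape[-1]
        for i, token_id in enumerate(cont_ids[0]):
            # logits at position p predict the token at position p + 1
            total += logprobs[0, offset + i - 1, token_id].item()
        return total

    prefix = 'Order: "one scoop of vanilla"\n{"kind": "ice_cream", "flavor": "'
    options = ["vanilla", "strawberry"]
    best = max(options, key=lambda opt: continuation_logprob(prefix, opt))
    print(best)  # the enum branch the model finds most plausible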

For completeness, another common approach can be found here: https://github.com/ShreyaR/guardrails which essentially boils down to "provide the schema in the prompt and ask the LLM to correct things if it fails to get the schema right the first time."
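
That correct-on-failure loop is roughly the pattern below (a sketch, not guardrails' actual API; `complete` is a hypothetical function wrapping whatever LLM call you use, and schema validation would slot in next to json.loads):

    import json

    def generate_json(prompt: str, complete, max_attempts: int = 3) -> dict:
        attempt_prompt = prompt
        for _ in range(max_attempts):
            raw = complete(attempt_prompt)
            try:
                return json.loads(raw)  # parsed fine, we're done
            except json.JSONDecodeError as err:
                # Feed the parser's complaint back and ask for a corrected answer.
                attempt_prompt = (
                    f"{prompt}\n\nYour previous reply was not valid JSON ({err}).\n"
                    f"Previous reply:\n{raw}\n\nPlease respond with corrected JSON only."
                )
        raise ValueError("model never produced valid JSON")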

sudb · 2 years ago
Another approach very similar to guardrails, which manages to avoid XML and which I've been using with some success, is langchain's OutputFixingParser: https://python.langchain.com/en/latest/modules/prompts/outpu...
gordian-not · 2 years ago
I stumbled upon your repository a week ago and I have to say, great work and great ideas!

Another thing I thought about is integrating formatting for fields using a similar system. ISO-8601 dates come immediately to mind, but number and currency formatting are other examples.

Probabilistic enums are another thing I can think of that might be useful for fallback values. I'm pretty sure there's a lot of work that can be done in this area, also for other kinds of parsers.

A related and highly recommended resource is https://github.com/mkuchnik/relm and https://arxiv.org/abs/2211.15458. It is a similar system used to validate LLMs using regexes, though built for completely different use cases. I imagine integrating regex checks on the output fields could also have a lot of use cases.

newhouseb · 2 years ago
Thank you! ReLM is a great find! I like that it drives the generation itself so that it can explore different beams more intentionally. And to do the JSON Parsing well against enums/unions/oneOf, you really have to support backtracking in a way that works basically the same as it does for regex so I'm looking forward to digging into their implementation.
killthebuddha · 2 years ago
One thing that I really like about the approach you took with clownfish is that it doesn't constrain or modify the structure of the prompt.

One of the primary difficulties with writing LLM applications is that prompts are basically not composable, and any LLM library that modifies your prompt is going to be a nightmare to work with.

killthebuddha · 2 years ago
Follow-up thought I just had: It seems that prompt structure standards are going to have to emerge if any of these tools have a shot at interoperability. I don't have hard data, but IME if a prompt is structured

MEMORY EXAMPLE INSTRUCTION [COMPLETION]

it will basically not work to wrap it in a prompt that's structured

INSTRUCTION MEMORY EXAMPLE [COMPLETION]

gamegoblin · 2 years ago
I hate that gpt-3.5-turbo is so cheap that using systems like guardrails is a sane thing to do. I can almost always prompt davinci-003 without guardrails in a way that gets my exact schema 1-shot, whereas guardrails + 3.5-turbo will often consume 2-4x more tokens -- but 3.5-turbo's pricing still makes it significantly cheaper.
brigadier132 · 2 years ago
The problem people are having is hitting the rate limits for chat gpt.
joshuanapoli · 2 years ago
Thank you for the really clear and complete description of "ControLogit"s and your approach in clownfish!
wjessup · 2 years ago
Thank you for your README, I'm sharing it with my team.
sundarurfriend · 2 years ago
> Bulletproof JSON generation: Jsonformer ensures that the generated JSON is always syntactically correct and conforms to the specified schema.

This is an important definition to take note of: "bulletproof" doesn't mean that you'll get good or correct data. It only means that it'll be valid JSON and in a particular schema that you specify (because the LLM isn't building the JSON in the first place, the library is).
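
A stripped-down sketch of that division of labor (not jsonformer's actual code; gpt2, the prompt, and the single boolean field are placeholder assumptions): the caller writes the structural tokens itself and only asks the model to choose the value, so the output is syntactically valid by construction.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in model
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    def fill_boolean(prompt: str, partial_json: str) -> str:
        """Choose 'true' or 'false' by comparing the model's next-token scores."""
        input_ids = tokenizer(prompt + partial_json, return_tensors="pt").input_ids
        with torch.no_grad():
            next_token_logits = model(input_ids).logits[0, -1]
        true_id = tokenizer.encode(" true")[0]
        false_id = tokenizer.encode(" false")[0]
        return "true" if next_token_logits[true_id] >= next_token_logits[false_id] else "false"

    prompt = "Input: 'It's sunny and cold today'\nOutput: "
    partial = '{"sunny": '          # the library emits the structural tokens itself
    value = fill_boolean(prompt, partial)
    print(partial + value + "}")    # valid JSON regardless of what the model prefers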

It's an interesting idea. But it's not clear if they've validated the heuristics they use, to see how well it performs in terms of accuracy against, say, some kind of BeautifulSoup-like attempt to make sense of the JSON-ish that the LLM produces and correct that to be valid JSON, or any other approach to the problem.

dragonwriter · 2 years ago
I wonder if LLMs are at the point where reprompting the LLM with an error message very similar to what a user-friendly JSON processing tool would show a human would usually be a good way to fix errors.
newhouseb · 2 years ago
Sometimes, but it very much depends on the context (no pun intended). If it's a pure syntax issue, OpenAI models will almost certainly make the right correction. If it's more abstract, like the LLM has hallucinated a property that is invalid as part of some larger schema, you can quickly descend into the LLM gaslighting you, saying that it has fixed things when it hasn't.
execveat · 2 years ago
Yeah, but that could require multiple queries, which isn't very efficient. Training a model just to fix JSON would be better.
Der_Einzige · 2 years ago
Love to see further work on constrained decoding like this and other systems introduced in the comments!

See my work and the paper about it. I've got a lot of y'all beat on this (constrained decoding, not the templating and structuring) by about a year:

https://github.com/hellisotherpeople/constrained-text-genera...

andrewcamel · 2 years ago
Seen a lot of things trying to do this by pressure testing the outputs, but all feel like anti-patterns. This is the first that seems like the "right" way to do it. Better to manage how the model is generating vs creating one more potentially faulty "glue" layer.
tysam_and · 2 years ago
Mathematically it requires less information to impose a certain prior on data in the process of generation than it does to read the data, do error detection and correction according to a prior, and then return it, if I understand correctly.

Something always felt incredibly icky to me about any kind of ad-hoc 'fixer' scripts that were part of a pipeline that was fully controlled by a user.

lt · 2 years ago
Can you elaborate about what you mean by pressure testing? Haven't heard this term yet.
andrewcamel · 2 years ago
Maybe not the right term... Just that a lot of other libs act like guardrails, i.e. let the model generate what it does (in free-form text / GPT output), then try to parse out what you want, erroring if the output doesn't conform to the expected format. As opposed to basically only allowing the model to generate into the already-built JSON form fields. It's understandable why this guardrails/parsing approach is so popular, though... you can't do what this library is doing with the OpenAI API. You need to be able to manipulate the token generation; otherwise you're forced to take the full text output and try to parse it.
motoboi · 2 years ago
I found it rather strange that the new Andrew Ng course about prompting, which features an OpenAI employee, says nothing about templated output.

To me this is a killer feature of GPT: being able to turn a document into JSON or any other template.

This kind of prompt is just amazing for GPT (try it with a blog post, document, or any other thing): "Analyze this document and transform it into the following format:

<title>

<summary (text conciseness: 5/10)>

<content bullet points (text conciseness 3/10)>

<content_item 1>

<content_item 2>

<content_item N>"

You can also ask for the same template as JSON, and GPT will gladly transform a PDF into JSON.

ryan-allen · 2 years ago
Templated output means you could build flexible user interfaces to interact in novel ways beyond a mere text input. What I find absolutely incredible right now is that any novel idea I have about "it would be nice if you could do X" is only taking a few days to reach any mainstream tech news source. I used to think the same thing about RubyGem ideas in the late 2000s, and within 6 months a useful package would come out. I put it down to many people consuming the same information at the same time coming up with the same ideas. It's happening much faster this time. 12 months from now who knows what's going to happen!
tough · 2 years ago
I knew a similar one called GPTyped, just posted it on HN https://news.ycombinator.com/item?id=35793056#35793057
benob · 2 years ago
How about going one step further and constraining transformer output with a context-free grammar? That way you could generate more conformant code such as Python or C.
Der_Einzige · 2 years ago
This may be possible using constrained beam search, which Hugging Face has quietly supported for a long time.
gamegoblin · 2 years ago
You wouldn't even need beam search if you restrict it to deterministic context-free grammars, which would satisfy > 95% of these "generate some JSON schema" use cases. For DCFGs you can just zero out the probability of any token that is invalid in the context; no lookahead or search needed. It wouldn't work for truly context-free things like most programming languages, though.
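
A toy version of that zeroing-out idea using a HuggingFace logits processor (gpt2 and the digits-and-commas "grammar" are placeholder assumptions standing in for a real DCFG state machine; scanning the whole vocabulary each step is slow but keeps the sketch short):

    import torch
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              LogitsProcessor, LogitsProcessorList)

    tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in model
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    def is_valid_next(prefix: str, candidate: str) -> bool:
        # Placeholder check; a real implementation would consult the grammar
        # state implied by `prefix` instead of this toy digits-and-commas rule.
        return candidate != "" and all(c in "0123456789," for c in candidate)

    class GrammarMask(LogitsProcessor):
        def __call__(self, input_ids, scores):
            prefix = tokenizer.decode(input_ids[0])
            for token_id in range(scores.shape[-1]):
                if not is_valid_next(prefix, tokenizer.decode([token_id])):
                    scores[0, token_id] = float("-inf")  # forbid grammar-breaking tokens
            return scores

    inputs = tokenizer("A comma-separated list of numbers: ", return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=8, do_sample=False,
                         logits_processor=LogitsProcessorList([GrammarMask()]))
    print(tokenizer.decode(out[0]))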