Structured Outputs in the API

By using JSON mode, GPT-4{o} has been able to do this reliably for months (100k+ calls).

We use GPT-4o to build dynamic UI+code[0], and almost all of our calls are using JSON mode. Previously it mostly worked, but we had to do some massaging on our end (backtick removal, etc.).

With that said, this will be great for GPT-4o-mini, as it often struggles/forgets to format things as we ask.

Note: we haven't had the same success rate with function calling compared to pure JSON mode, as the function calling seems to add a level of indirection that can reduce the quality of the LLMs output YMMV.

Anyhow, excited for this!

[0]https://magicloops.dev

geepytee · 2 years ago

This model appears to be full of surprises.

The 50% drop in price for inputs and 33% for outputs vs. the previous 4o model is huge.

It also appears to be topping various benchmarks, ZeroEval's Leaderboard on hugging face [0] actually shows that it beats even Claude 3.5 Sonnet on CRUX [1] which is a code reasoning benchmark.

Shameless plug, I'm the co-founder of Double.bot (YC W23). After seeing the leaderboard above we actually added it to our copilot for anyone to try for free [2]. We try to add all new models the same day they are released

[0]https://huggingface.co/spaces/allenai/ZeroEval

[1]https://crux-eval.github.io/

[2]https://double.bot/

usaar333 · 2 years ago

> ZeroEval's Leaderboard on hugging face [0] actually shows that it beats even Claude 3.5 Sonnet on CRUX [1] which is a code reasoning benchmark.

The previous version of 4o also beat 3.5 Sonnet on Crux.

qwertox · 2 years ago

What a cool product! I was about to recommend you to submit it as a "Show HN", but it turns out that it already got submitted one year ago.

Would you mind sharing a bit on how things have evolved?

jumploops · 2 years ago

Thanks and great question :)

When we first launched, the tool was very manual; you had to generate each step via the UI. We then added a "Loop Creator agent" that now builds Loops for you without intervention. Over the past few months we've mostly been fixing feature gaps and improving the Loop Creator.

Based on recent user feedback, we've put a few things in motion:

- Form generator (for manual loops)

- Chrome extension (for local automations)

- In-house Google Sheets integration

- Custom outputs (charts, tables, etc.)

- Custom Blocks (shareable with other users)

With these improvements, you'll be able to create "single page apps" like this one I made for my wife's annual mango tasting party[0].

In addition to those features, we're also launching a new section for Loop templates + educational content/how-tos, in an effort to help people get started.

To be super candid, the Loop Creator has been a pain. We started at an 8% success rate and we're only just now at 25%. Theoretically we should be able to hit 80%+ based on existing loop requests, but we're running into limits with the current state of LLMs.

[0]https://mangota.ngo

__jl__ · 2 years ago

Thanks for sharing your experience.

We also get pretty reliable JSON output (on a smaller scale though) even without JSON mode. We usually don't use JSON mode because we often include a chain of thought part in <brainstorming> and then ask for JSON in <json> tags. With some prompt engineering, we get over 98% valid JSON in complex prompts (with long context and modestly complex JSON format). We catch the rest with json5.loads, which is only used as a fallback if json.loads fails.

4o-mini has been less reliable for us particularly with large context. The new structured output might make it possible to use mini in more situations.

JimDabell · 2 years ago

The linked article includes a section on this, under “Separating a final answer from supporting reasoning or additional commentary”. They suggest defining a JSON schema with a reasoning field and an answer field.

borsch · 2 years ago

my recommendation? use chain of thought, then feed that into a second prompt asking for json

Deleted Comment

turnsout · 2 years ago

Had the same experience with function calling—we get much better results simply asking for JSON. With simple schemas (basically dictionaries), gpt-4 and 4o are basically bulletproof.

tomcam · 2 years ago

Very interesting. Did you build magicloops using this tech?

jumploops · 2 years ago

We first built Magic Loops with GPT-4, about a year ago, well before JSON mode was a thing.

We had to a do a bunch of extra prompting to make it work, as GPT would often include backticks or broken JSON (most commonly extra commas). At the time, YAML was a much better approach.

Thankfully we've been able to remove most of these hacks, but we still use a best effort JSON parser[0] to help stream partial UI back to the client.

[0]https://www.npmjs.com/package/best-effort-json-parser

diego_sandoval · 2 years ago

Can I use Magic Loops to generate Magic Loops for me?

jumploops · 2 years ago

Technically yes, but it would require reverse-engineering some of our APIs.

Practically speaking, we have quite a few use-cases where users call Loops from other Loops, so we're investigating a first-class API to generate Loops in one go.

Similar to regular software engineering, what you put in is what you get out, so we've been hesitant to launch this with the current state of LLMs/the Loop Creator as it will fail more often than not.

It's so wild that the bar for AI performance is both absurdly high and absurdly low at the same time. To specify an output format (language or grammar) for solving a computational problem is one of the oldest exercises around. On the one hand, it's breathtakingly mundane that the model can now do the most basic of tasks: conform to an output specification. It's weird reading the kind of self-congratulating blogpost about this, like OpenAI has just discovered flint knives. On the other hand, a computer system can process natural language with extremely ambiguous, open-ended problems, compute solutions to said problems, even correct its own mistakes--and then it can format the output correctly. And then on yet another hand, it only took about 10^25 floating point operations (yeah, just ten million trillion trillion, right!?) to get this outcome.

thruway516 · 2 years ago

I dont understand your complaint at all. If you develop a new revolutionary technology called an automobile, developing steering, brakes, starter, mufflers for it is a pretty big deal even if reins, clamps, mufflers and keys are mundane and have existed for decades. Structured outputs are a pretty big step in making this magic actually usable by developers as opposed to generating impressive cat pictures or whatever has captured the public imagination.

Bjartr · 2 years ago

I don't think it was an complaint, just a observation.

jappgar · 2 years ago

Structured outputs are hard... but they claimed to have solved this a year ago.

They were lying, of course, and meanwhile charged output tokens for malformed JSON.

throwawaymaths · 2 years ago

> On the one hand, it's breathtakingly mundane that the model can now do the most basic of tasks: conform to an output specification.

I highly doubt it's the model that does this... It's very likely code injected into the token picker. You could put this into any model all the way down to gpt-2.

crowcroft · 2 years ago

I wonder if you get 90% of the way with prompt engineering, and then the last 10% is just brute force, validate output, if it fails, rerun the prompt.

My assumption is if that's all this is they would have done it a long time ago though.

ramraj07 · 2 years ago

This is like saying “we shouldn’t be celebrating a computer that can talk, my parrot can do that!”

Deleted Comment

raincole · 2 years ago

I don't know, it doesn't sound wild at all to me. Human languages are very imprecise, vague and error-tolerant, which is the opposite of an output format like JSON. So the a model can't do these two things well at the same time is quite an intuitive conclusion.

The wild part is that a model trained with so much human language text can still outputs mostly compilable code.

scarmig · 2 years ago

I have struggled writing valid YAML before (my tokenizer doesn't handle whitespace very well). And it probably takes me a quadrillion operations on the reals to get a minimal YAML file (I think your 10^25 fp ops is an overestimate--I think it's more like 10^18-10^19).

It's kind of like an inverse Moravec's paradox.

theturtle32 · 2 years ago

Relatable!!

codingwagie · 2 years ago

I think it will take a long time for the world at large to realize and then operationalize the potential of this "mundane" technology. It is revolutionary, and also sitting in plain sight. Such a huge technological shift that was considered decades out only a few years ago

ben_w · 2 years ago

Although I am an optimist* about what this can do, I am very much aware — from personal experience — how easy it is to see more than is really there.

The realisation of the tech might be fantastic new things… or it might be that people like me are Clever Hans-ing the models.

* that may be the wrong word; "strong capabilities" is what I think is present, those can be used for ill effects which is pessimistic.

tommica · 2 years ago

For some reason it reminds me of my civilization runs - rush to certain high level tech and then after that discovery writing :D

m3kw9 · 2 years ago

It’s doing more, it is allowing user to input using natural language and the output is the json format the API that is defined

berkes · 2 years ago

I am so often surprised by "The AI Communities" software. Often unpleasantly surprised, often just eye-rolling.

When we first started using the OpenAI API's, the first thing I reached for was some way "to be certain that the response is properly formatted". There wasn't one. A common solution was (is?) "just run the query again, untill you can parse the JSON". Really? After decades of software engineering, we still go for the "have you tried turning it off and on again" on all levels.

Then I reached for common, popular tools: everyone's using them, they ought to be good, right? But many of these tools, from langchain's to dify to promptflow are a mess (Edit: to alter the tone: I'm honestly impressed by the breadth and depth of these tools. I'm just suprised about the stability - lack thereof, of them). Nearly all of them suffer from always-outdated-documentation. Several will break right after installing it, due to today's ecosystem updates that haven't been incorporated entirely yet. Understandably: they operate in an ecosystem that changes by the hour. But after decades of software engineering, I want stuff that's stable, predictable, documented. If that means I'm running LLM models from a year ago: fine. At least it'll work. Sure, this constant-state-of-brokeness is fine for a PoC, a demo, or some early stage startup. But terrible for something that I want to ensure to still work in 12 months, or 4 years even without a weekly afternoon of upgrade-all-dependencies-and-hope-it-works-the-update-my-business-logic-code-to-match-the-now-deprecated-apis.

KoolKat23 · 2 years ago

Well the simple answer is don't use it then.

In the same way people revert to older stable releases. You're welcome to revert to writing boilerplate code yourself.

The reason people are excited and use it, is because they show promise, it already offers significant benefits even if it isn't "stable".

srcreigh · 2 years ago

If I wanted to be a silly pedant, I’d say that Turing machines are language specifications and thus it’s theoretically impossible for an LLM or any program to validate output formats in general.

Deleted Comment

jes5199 · 2 years ago

in _general_ sure, but if you restricted each token to conform to a Kleene-star grammar you should be able to guarantee that you get something that parses according to a context-free grammar

There is another big change in gpt-4o-2024-08-06: It supports 16k output tokens compared to 4k before. I think it was only available in beta before. So gpt-4o-2024-08-06 actually brings three changes. Pretty significant for API users

1. Reliable structured outputs 2. Reduced costs by 50% for input, 33% for output 3. Up to 16k output tokens compared to 4k

https://platform.openai.com/docs/models/gpt-4o

santiagobasulto · 2 years ago

I’ve noticed that lately GPT has gotten more and more verbose. I’m wondering if it’s a subtle way to “raise prices”, as the average response is going to incur I more tokens, which makes any API conversation to keep growing in tokens of course (each IN message concatenates the previous OUT messages).

tedsanders · 2 years ago

GPT has indeed been getting more verbose, but revenue has zero bearing on that decision. There's always a tradeoff here, and we do our imperfect best to pick a default that makes the most people happy.

I suspect the reason why most big LLMs have ended up in a pretty verbose spot is that it's easier for users to scroll & skim than to ask follow-up questions (which requires formulation + typing + waiting for a response).

With regard to this new gpt-4o model: you'll find it actually bucks the recent trend and is less verbose than its predecessor.

sophiabits · 2 years ago

I’ve especially noticed this with gpt-4o-mini [1], and it’s a big problem. My particular use case involves keeping a running summary of a conversation between a user and the LLM, and 4o-mini has a really bad tendency of inventing details in order to hit the desired summary word limit. I didn’t see this with 4o or earlier models

Fwiw my subjective experience has been that non-technical stakeholders tend to be more impressed with / agreeable to longer AI outputs, regardless of underlying quality. I have lost count of the number of times I’ve been asked to make outputs longer. Maybe this is just OpenAI responding to what users want?

[1] https://sophiabits.com/blog/new-llms-arent-always-better#exa...

throwaway48540 · 2 years ago

It's a subtle way to make it smarter. Making it write out the "thinking process" and decisions has always helped with reliability and quality.

sashank_1509 · 2 years ago

they also spend more to generate more tokens. The more obvious reason is it seems like people rate responses better the longer they are. Lmsys demonstrated that GPT tops the leaderboard because it tends to give much longer and more detailed answers, and it seems like OpenAI is optimizing or trying to maximize lmsys.

Culonavirus · 2 years ago

That's actually pretty impressive... if they didn't dumb it down that is, which only time will tell.

bilater · 2 years ago

I have not been able to get it to output anywhere close to the max though (even setting max tokens high). Are there any hacks to use to coax the model to produce longer outputs?

gamegoblin · 2 years ago

I'm glad they gave up on their "fine-tuning is all you need" approach to structured output. It's possible fine-tuning will work in the long term, but in the short-term, people are trying to build things, and fine-tuning wasn't cutting it.

Surprised it took them so long — llama.cpp got this feature 1.5 years ago (actually an even more general version of it that allows the user to provide any context free grammar, not just JSON schema)

BoorishBears · 2 years ago

I was surprised it took so long until I reached this line:

> The model can fail to follow the schema if the model chooses to refuse an unsafe request. If it chooses to refuse, the return message will have the refusal boolean set to true to indicate this.

I'm not sure how they implemented that, maybe they've figured out a way to give the grammar a token or set of tokens that are always valid mid generation and indicate the model would rather not continue generating.

Right now JSON generation is one of the most reliable ways to get around refusals, and they managed not to introduce that weakness into their model

tcdent · 2 years ago

GPT is still a language model, so at some point it's still just tokens.

Is this just a schema validation layer on their end to avoid the round trip (and cost) of repeating the call?

Language models like GPT output a large vector of probabilities for the next token. Then a sampler decides which of those tokens to pick.

The simplest algorithm for getting good quality output is to just always pick the highest probability token.

If you want more creativity, maybe you pick randomly among the top 5 highest probability tokens or something. There are a lot of methods.

All that grammar-constrained decoding does is zero out the probability of any token that would violate the grammar.

Der_Einzige · 2 years ago

For many things, fine-tuning as we know of it will NEVER fully solve it, there's no hope. Even fine-tuning a model to not use the letter "e" to an overwhelming degree doesn't entirely prevent it, only reduces its chances to increasingly small amounts. Shamesless self plug, and from before the ChatGPT era too! https://paperswithcode.com/paper/most-language-models-can-be...

chhabraamit · 2 years ago

How does llama.cpp’s grammar adherence work?

Does it keep validating the predicted tokens and backtrack when it’s not valid?

It's essentially an Earley Parser[0]. It maintains a set of all possible currently valid parses, and zeroes out the probability of any token that isn't valid in at least 1 of the current potential parse trees.

There are contrived grammars you can give it that will make it use exponential memory, but in practice most real-world grammars aren't like this.

[0] https://en.wikipedia.org/wiki/Earley_parser

roseway4 · 2 years ago

We extensively use vLLM's support for Outlines Structured Output with small language models (llama3 8B, for example) in Zep[0][1]. OpenAI's Structured Output is a great improvement on JSON mode, but it is rather primitive compared to vLLM and Outlines.

# Very Limited Field Typing

OpenAI offers a very limited set of types[2] (String, Number, Boolean, Object, Array, Enum, anyOf) without the ability to define patterns and max/min lengths. Outlines supports defining arbitrary RegEx patterns, making extracting currencies, phone numbers, zip codes, comma-separated lists, and more a trivial exercise.

# High Schema Setup Cost / Latency

vLLM and Outlines offer near-zero cost schema setup: RegEx finite state machine construction is extremely cheap on the first inference call. While OpenAI's context-free grammar generation has a significant latency penalty of "under ten seconds to a minute". This may not impact "warmed-up" inference but could present issues if schemas are more dynamic in nature.

Right now, this feels like a good first step, focusing on ensuring the right fields are present in schema-ed output. However, it doesn't yet offer the functionality to ensure the format of field contents beyond a primitive set of types. It will be interesting to watch where OpenAI takes this.

[0] https://help.getzep.com/structured-data-extraction

[1] https://help.getzep.com/dialog-classification

[2] https://platform.openai.com/docs/guides/structured-outputs/s...

titzer · 2 years ago

nichochar · 2 years ago

I'm a little confused why you have to specify "strict: true" to get this behavior. It is obviously always desired, I would be surprised for people to ever specify "strict: false". That API design leaves to be desired.

I also learned about constrainted decoding[1], that they give a brief explanation about. This is a really clever technique! It will increase reliability as well as reduce latency (less tokens to pick from) once the initial artifacts are loaded.

[1] https://www.aidancooper.co.uk/constrained-decoding/

athyuttamre · 2 years ago

Hi, I work on the OpenAI API — structured outputs schemas have limitations (e.g. all fields must be required, no additional properties allowed): https://platform.openai.com/docs/guides/structured-outputs/s....

If your schema is not supported, but you still want to use the model to generate output, you would use `strict: false`. Unfortunately we cannot make `strict: true` the default because it would break existing users. We hope to make it the default in a future API version.

You should also mention that before you had done custom alignment accounting for this feature, that it was an excellent alignment breaker (therefor a big no-no to release too early)

For example, if I ask an LLM to generate social security numbers, it will give the whole "I'm sorry Hal, I can't do that". If I ban all tokens except numbers and hyphens, prior to your "refusal = True" approach, it was guaranteed that even "aligned" models would generate what appeared to be social security numbers.

thrance · 2 years ago

Hi, if you're allowed to answer, are there any future plans to support custom CFGs through the API ? Like llama.cpp does with it's GBNF format.

dgellow · 2 years ago

Could you develop a bit re: the API? What do you dislike other than the “strict: true”?

pton_xd · 2 years ago

Isn't "we hardcoded JSON into the latest model" kind of the opposite direction, strategically, from "we're on the way to AGI and I need 7 trillion to get there?"

isoprophlex · 2 years ago

You are witnessing the final stages in the evolution of OpenAI from a messianic hype machine to Yet Another Product Company.

Hence all the people leaving, too.

gardenhedge · 2 years ago

I am ootl, employees are leaving openai?

chamomeal · 2 years ago

I mean it’s not groundbreaking, but it makes it much easier to make simple AI tools that aren’t chat-based. It definitely has me interested.

GPT-4 has been so mind-blowingly cool, but most of the interesting applications I can think of involve 10 steps of “ok now make sure GPT has actually formatted the question as a list of strings… ok now make sure GPT hasn’t responded with a refusal to answer the question…”

Idk what the deal is with their weird hype persona thing, but I’m stoked about this release

nsonha · 2 years ago

AGI is useless if you can't figure out how to employ it as part of a system, instead of just chit chat

KaiMagnus · 2 years ago

Yeah, definitely a way to end up with a Siri like mess if you do this long enough. The use case is there and it’s going to be very useful, but the magic is wearing off.

cvhc · 2 years ago

I wonder why the top level has to be an object instead of an array... I have some pretty normal use cases where I expect the model to extract a list of objects from the text.

``` openai.BadRequestError: Error code: 400 - {'error': {'message': 'Invalid schema for response_format \'PolicyStatements\': schema must be a JSON Schema of \'type: "object"\', got \'type: "array"\'.', 'type': 'invalid_request_error', 'param': 'response_format', 'code': None}} ```

I know I can always put the array into a single-key object but it's just so annoying I also have to modify the prompts accordingly to accomodate this.

moritzwarhier · 2 years ago

It's a relatively common convention for JSON APIs.

Possible reasons:

- Extensibility without breaking changes

- Forcing an object simplifies parsing of API responses, ideally the key should describe the contents, like additional metadata. It also simplifies validation, if considered separate from parsing

- Forcing the root of the API response to be an object makes sure that there is a single entry point into consuming it. There is no way to place non-descript heterogenous data items next to each other

- Imagine that you want to declare types (often generated from JSON schemas) for your API responses. That means you should refrain from placing different types, or a single too broad type in an array. Arrays should be used in a similar way to stricter languages, and not contain unexpected types. A top-level array invites dumping unspecified data to the client that is expensive and hard to process

- The blurry line between arrays and objects in JS does not cleanly map to other languages, not even very dynamic ones like PHP or Python. I'm aware that JSON and JS object literals are not the same. But even the JSON subset of JS (apart from number types, where it's not a subset AFAIK) already creates interesting edge cases for serialization and deserialization

manquer · 2 years ago

I can't say for OpenAI, but in general I have seen and used this design pattern to keep consistency of root object output and remove a lot of unnecessary validations and branching flows

Otherwise you will to handle the scenarios in code everywhere if you don't know if the root is object or array. If the root has a key that confirms to a known schema then validation becomes easier to write for that scenario,

Similar reasons to why so many APIs wrap with a key like 'data', 'value' or 'error' all responses or in RESTful HTTP endpoints collection say GET /v1/my-object endpoints do no mix with resource URIs GET /v1/my-object/1 the former is always an array the latter is always an object.

simonw · 2 years ago

I've regretted designing APIs that return an array rather than an object in the past.

It's all about the extensibility. If you return an object you can add extra keys, for things like "an error occurred, here are the details", or "this is truncated, here's how to paginate it", or a logs key for extra debug messages, or information about the currently authenticated user.

None of those are possible if the root is an array.

heliophobicdude · 2 years ago

Back in the old days, top level arrays were a security risk because the array constructor in JS could be redefined and do bad-guy stuff. I cannot think of any json parsing clients that are vulnerable to this.

tomComb · 2 years ago

Well, this wouldn’t be a very satisfying explanation, but these JSON objects are often represented as Python dictionaries and those can’t have top level arrays.

Too · 2 years ago

Try json.loads("[1,2,3]"), you'll get a list back.

The reasons others already posted about extensibility are more correct.