userbinator · 7 years ago
I suppose you could say that parsing any text-based protocol in general "Is a Minefield". They look so simple and "readable", which is why they're appealing initially, but parsing text always involves lots of corner cases, and I've always thought it a huge waste of resources to use text-based protocols for data that isn't actually meant for human consumption the vast majority of the time.

Consider something as simple as parsing an integer in a text-based format: there may be whitespace to skip, an optional sign character, and then a loop to accumulate digits and convert them (itself a subtraction, multiply, and add), and there's still the question of all the invalid cases and what they should do. In contrast, in a binary format, all that's required is to read the data, and the most complex thing that might be required is endianness conversion. Length-prefixed binary formats are almost trivial to parse, on par with reading a field from a structure.
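To make the contrast concrete, here's a sketch in JavaScript (my own illustration, not anyone's production code): the text path with all its little steps, versus a single fixed-width binary read.

```javascript
// Sketch: the text path (skip spaces, optional sign, digit loop) vs. the
// binary path (one fixed-width read). Error handling is deliberately minimal.
function parseIntText(s) {
  let i = 0, sign = 1, n = 0;
  while (i < s.length && s[i] === ' ') i++;          // skip leading whitespace
  if (s[i] === '-') { sign = -1; i++; }
  else if (s[i] === '+') i++;
  if (i >= s.length || s[i] < '0' || s[i] > '9') throw new Error('invalid int');
  for (; i < s.length && s[i] >= '0' && s[i] <= '9'; i++) {
    n = n * 10 + (s.charCodeAt(i) - 48);             // multiply-and-add per digit
  }
  return sign * n;  // trailing garbage and overflow still unhandled here
}

// Binary: a 32-bit little-endian integer is a single bounds-checked read.
const view = new DataView(new Uint8Array([0x2a, 0x00, 0x00, 0x00]).buffer);
const fromBinary = view.getInt32(0, /* littleEndian = */ true);  // 42
```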

robocat · 7 years ago
Binary formats have their own serious problems.

> Length-prefixed binary formats are almost trivial to parse

They definitely are not, as demonstrated by the fact that binary lengths are the root cause of a huge number of security flaws. JSON mostly avoids that.

> the most complex thing which might be required is endianness conversion

That is a gross simplification. When you look at the details of binary representations, things get complex, and you end up with corner cases.

Let's look at floating-point numbers: with a binary format you can transmit NaN, Infinity, -Infinity, and -0. You can also create two NaN numbers that do not have the same binary representation. You have to choose single or double precision (maybe a benefit, but not always). Etc.
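For instance, here's a sketch using JS typed arrays: two values that are both NaN but differ bit-for-bit, a distinction a byte-exact binary channel preserves and the text "NaN" cannot.

```javascript
// Two NaNs with different bit patterns: both read back as NaN, but a
// byte-for-byte binary round-trip keeps them distinct, while textual "NaN"
// collapses them into one.
const v = new DataView(new ArrayBuffer(8));
v.setFloat64(0, NaN);                    // the engine's default quiet NaN
const defaultBits = v.getBigUint64(0);
v.setBigUint64(0, defaultBits ^ 1n);     // flip the lowest payload bit
const otherNaN = v.getFloat64(0);        // still NaN...
const otherBits = v.getBigUint64(0);     // ...but a different bit pattern
```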

Similarly, in JSON, integers or arrays of integers are nothing special. It is mostly a benefit not to have to specify UInt8Array.

JSON is one of many competitors within an ecology of programming, including binary formats, and yet JSON currently dominates large parts of that ecology. So far a binary format mutation hasn't beaten JSON, which is telling since binary had the early advantage (well: binary definitely wins in parts of the ecology, just as JSON wins in other parts).

magila · 7 years ago
> They definitely are not, as displayed by the fact that binary lengths are the root cause of a huge number of security flaws. JSON mostly avoids that.

I assume you're mainly referring to buffer overflows, which are a problem with text-based formats too. See, for example, the series of overflow vulnerabilities in IIS's HTTP parser which led to some of the most disruptive worms in history, like Code Red. Really, this is more of a problem with memory-unsafe languages than serialization formats.

> Let's look at floating point numbers: with a binary format you can transmit NaN, Infinity, -infinity, and -0

Depending on the use case being able to encode these values may be a requirement, in which case binary is no worse than text.

> You can also create two NaN numbers that do not have the same binary representation.

This is specific to IEEE 754, not all binary representations have this issue. Text based formats also have far more pervasive problems with lacking a canonical representation so it's hard to count this as a point against binary.

> JSON is one of many competitors within an ecology of programming, including binary formats, and yet JSON currently dominates large parts of that ecology.

This is just an appeal to popularity fallacy.

recursive · 7 years ago
> Similarly in JSON integers or arrays of integers are nothing special.

JSON is perfectly capable of representing integers which cannot be represented in IEEE-754 double precision floating point. That seems at least a little special to me.
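Concretely (a sketch): the JSON text can carry 2^53 + 1 exactly, but a parser that maps every number onto a double cannot.

```javascript
// The JSON text "9007199254740993" (2^53 + 1) is a perfectly valid integer,
// but JavaScript's double-based JSON.parse rounds it down to 2^53.
const lossy = JSON.parse('9007199254740993');   // → 9007199254740992
// A BigInt-aware path keeps it exact:
const exact = BigInt('9007199254740993');       // → 9007199254740993n
```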

kentonv · 7 years ago
> with a binary format you can transmit NaN, Infinity, -infinity, and -0. You can also create two NaN numbers that do not have the same binary representation. You have to choose single or double precision (maybe a benefit, not always). Etc.

Boy you are really stretching to make this sound complicated. It's not. You transmit 4 bytes or 8 bytes. Serialization is a memcpy().

You don't have to think about NaNs and Infinities because they Just Work -- unlike with textual formats where you need to have special representations for them and you have to worry about whether you are possibly losing data by dropping those NaN bits. If you want to drop the NaN bits in a binary format, it's another one-liner to do so.
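A sketch of that "Just Work" path in JS, with DataView standing in for memcpy(): Infinity and -0 survive a binary round-trip untouched, while JSON.stringify quietly turns Infinity into null.

```javascript
// Binary float round-trip: 8 bytes out, 8 bytes in. Infinity and -0 need
// no special casing, unlike text, where each needs its own spelling.
function encodeF64(x) {
  const v = new DataView(new ArrayBuffer(8));
  v.setFloat64(0, x, true);
  return new Uint8Array(v.buffer);
}
function decodeF64(bytes) {
  return new DataView(bytes.buffer, bytes.byteOffset).getFloat64(0, true);
}
// decodeF64(encodeF64(Infinity)) is Infinity; decodeF64(encodeF64(-0)) is -0.
// By contrast, JSON.stringify(Infinity) produces "null".
```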

It's funny that you choose to pick on floating-point numbers here, because converting floating-point numbers to decimal text and back is insanely complicated. One of the best-known implementations of converting FP to text is dtoa(), based on the paper (yes, a whole paper) called "How to Print Floating-Point Numbers Accurately". Here's the code:

http://www.netlib.org/fp/dtoa.c

Go take a look. I'll wait.

dtoa() is not even the state of the art anymore. Just in the last few years there have been significant advances, e.g. Grisu2, Grisu3, and Dragon4...

Again, in binary formats, all that is replaced by a memcpy() of 4 or 8 bytes.

(A previous rant of mine on this subject: https://news.ycombinator.com/item?id=17277560 )

> > Length-prefixed binary formats are almost trivial to parse

> They definitely are not, as displayed by the fact that binary lengths are the root cause of a huge number of security flaws. JSON mostly avoids that.

Injection (forgetting to escape embedded text) is the root cause of a huge number of security flaws for text formats. Length-prefixed formats do not suffer from this.

What "huge number of security flaws" are you referring to that affect length-delimited values? Buffer overflows? Those aren't caused by values being length-delimited, they are caused by people deserializing variable-length values into fixed-length buffers without a check. That mistake can be made just as easily with a text format as with a binary format. In fact I've seen it much more often with text.

> JSON currently dominates large parts of that ecology.

JSON wins for one simple reason: it's easy for human developers to think about, because they can see what it looks like. This is very comforting. It's wasteful and full of pitfalls ("oops, my int64 silently lost a bunch of bits because JSON numbers are all floating point"), but comforting. Even I find it comforting.

Ironically, writing a full JSON parser from scratch is much more complicated than writing a full Protobuf parser. But developers are more comfortable with the parser being a black box than with the data format itself being a black box. ¯\_(ツ)_/¯

(Disclosure: I am the author of Protobuf v2 and Cap'n Proto. In addition to binary native formats, both have text-based alternate formats for which I wrote parsers and serializers, and I've also written a few JSON parsers in my time...)

dwaite · 7 years ago
> You can also create two NaN numbers that do not have the same binary representation.

Way worse than that. The interpretation of those different values of NaN is software specific. You can also have signalling NaN values - where the recipient can now have their number handling code trap in completely unexpected scenarios.

juliusmusseau · 7 years ago
Consider only this: "1.001"

I'll use JavaScript numeric literals here as my translation medium (ironic!):

Norway locale parses it to: 1001

USA locale parses it to: 1.001

France locale parses it to: NaN

https://docs.oracle.com/cd/E19455-01/806-0169/overview-9/ind...
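The linked docs are about C locale handling, but the same ambiguity shows up in JS locale-sensitive formatting (a sketch; German shares Norwegian's thousands-separator convention, and the exact output depends on the ICU data available to the runtime):

```javascript
// German (like Norwegian) uses "." as the thousands separator, so 1001
// renders as the same five characters a US reader sees as 1.001.
const german = new Intl.NumberFormat('de-DE').format(1001);  // typically "1.001"

// JavaScript's own numeric parsing, by contrast, is locale-independent:
const n = parseFloat('1.001');  // always 1.001, in Oslo and Paris alike
```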

mort96 · 7 years ago
No. In a programming context, any Norwegian or French programmer would expect that to evaluate to 1.001, not 1001 or NaN.
skywhopper · 7 years ago
This is a little too generous about the benefits of binary formats vs text formats. Ultimately, any data exchange between disparate systems is going to be a challenging task, no matter what format you choose. Both sides have to implement it in a compatible way. And ultimately, every format is a binary format. Encoding machine-level data structures directly on the wire sounds good, but it quickly gets complicated when you have to deal with multiple architectures and languages. And you don't have the benefit of the gradually accreted de-facto conventions like using UTF-8 encoding for text-based formats to fall back on, much less the ability for humans to troubleshoot by being able to read the wire protocol.

With sufficient discipline and rigor, and a good suite of tests, developed over years of practical experience, you can evolve a good general binary wire protocol, but by then it will turn out to be so complicated and heavyweight to use, that some upstart will come up with a NEW FANTASTIC format that doesn't have any of the particular annoyances of your rigorous protocol, and developers will flock to this innovative and efficient new format because it will help them get stuff done much faster, and most of them won't run into the edge cases the new format doesn't cover for years, and then some of them will write articles like this one and comments like yours and we can repeat the cycle every 10-20 years, just like we've been doing.

IloveHN84 · 7 years ago
Wait, wait. XML with XSD schemas is an easy problem. You can't fail with an XSD schema in place.
pnx · 7 years ago
>Consider something as simple as parsing an integer in a text-based format; there may be whitespace to skip, an optional sign character, and then a loop to accumulate digits and convert them (itself a subtraction, multiply, and add), and there's still the questions of all the invalid cases and what they should do.

    ^ *-?[0-9]+ *$
You're welcome. Anything that passes that regex is a valid number. Now using that as a basis of a lexer means that you can store any int in whatever precision you feel like.
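A strict-integer guard in that spirit might look like this (my sketch, not the commenter's code; BigInt sidesteps the precision question entirely):

```javascript
// Validate with a regex (optional surrounding spaces, optional leading
// minus, one or more digits), then convert with arbitrary precision.
const INT_RE = /^ *-?[0-9]+ *$/;

function strictInt(s) {
  if (!INT_RE.test(s)) throw new Error(`not an integer: ${JSON.stringify(s)}`);
  return BigInt(s.trim());   // exact at any number of digits
}
```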

It's unfortunate that the majority of programmers these days are so computer illiterate that they can't write a parser for matching parens and call you an elitist for pointing out this is something anyone with a year of programming should be able to do in their sleep.

Deleted Comment

userbinator · 7 years ago
Just matching that regex alone is going to take a pretty large number of instructions (at least dozens). That's not "simple" by any measure.
juliusmusseau · 7 years ago
Because of this article (which I encountered a year ago) I would say Parsing JSON is no longer a minefield.

I had to write my own JSON parser/formatter a year ago (to support Java 1.2 - don't ask) and this article and its supporting github repo (https://github.com/nst/JSONTestSuite) was an unexpected gift from the heavens.

AgentOrange1234 · 7 years ago
Wait. How is this no longer a minefield just because there is a test suite that identifies some tricky cases?

Doesn’t the test suite’s matrix demonstrate that there are tons of cases that aren’t handled consistently across these parsers?

juliusmusseau · 7 years ago
Good point. I am presuming the test suite is comprehensive. Does it cover 100% of all JSON mines? Probably not. But it surfaced about 30 bugs in my own implementation - things I would have never dreamed of.

So it certainly helped me. And just based on how thorough and insane the test suite is, I think I'm in good hands. Not perfect hands - but definitely a million times better than anything I would have come up with on my own.

The test suite made my parser blow up many times, and for each blow up I got to make a conscious decision in my bugfix: how do I want to handle this?

(I decided to let the 10,000-depth nested {{{{{{{{{{{{{{{{{{"key": "value"}}}}}}}}}}}}}}}}}}} guy blow up even though it is legal. Yes, I'm too lazy to implement my own stack.) :-)

Dylan16807 · 7 years ago
It's a very clear list of mistakes to avoid and areas where you can choose to be lenient in parsing.

If you're emitting JSON you can skim the list and avoid all of them.

Either way the minefield proper is no longer your problem.

Multicomp · 7 years ago
This might be throwing a lit match into a gasoline refinery, but why not opt for XML in some circumstances?

Between its strong schema and wsdl support for internet standards like soap web services, XML covers a lot of ground that Json encoding doesn't necessarily have without add-ons.

I say this knowing this is an unfashionable opinion and XML has its own weaknesses, but in the spirit of using web standards and LoC approved "archivable formats", IMO there is still a place for XML in many serialization strategies around the computing landscape.

Json is perfect for serializing between client and server operations or in progressive web apps running in JavaScript. It is quite serviceable in other places as well such as microservice REST APIs, but in other areas of the landscape like middleware, database record excerpts, desktop settings, data transfer files, Json is not much better or sometimes even slightly worse than XML.

AtlasBarfed · 7 years ago
XML cannot be parsed into nested maps/dictionaries/lists/arrays without guidance from a type or a restricted xml structure.

JSON can do that. It also maps pretty seamlessly to types/classes in most languages without annotations, attributes, or other serialization guides.

It also has explicit indicators for lists vs subdocuments vs values for keys, which xml does not. XML tags can repeat, can have subtags, and then there are tag attributes. A JSON document can also be a list, while XML documents must be a tree with a root document.
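The list-vs-scalar ambiguity is easy to see (a sketch; the XML shapes appear only in the comments for comparison):

```javascript
// In JSON the container type is explicit in the syntax itself:
const list   = JSON.parse('{"items": ["a"]}');   // items is unambiguously an array
const scalar = JSON.parse('{"items": "a"}');     // items is unambiguously a string

// A schema-less mapping of the XML <items><item>a</item></items> can't tell
// "a list that happens to have one element" from "a single nested value"
// without out-of-band guidance.
```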

XML may be acceptable for documents. But seeing as how XHTML was a complete dud, I doubt it is useful even for that.

And we didn't even need to get into the needless complexity of validation, namespaces, and other junk.

falcolas · 7 years ago
> And we didn't even need to get into the needless complexity of validation, namespaces, and other junk.

So, that’s why we’re adding all of this “junk” back into JSON? Transformers, XPath for JSON, validation, schemas, namespaces (JSON-LD, JSON prefixes): it’s all there.

History repeating itself (and here’s the important part) because this complexity is needed. Not every application will need every complication, but every complication is needed by some application.

zvrba · 7 years ago
> XML cannot be parsed into nested maps/dictionaries/lists/arrays without guidance from a type or a restricted xml structure.

? Using XML without a schema is slightly worse than JSON because the content of each node is just "text". XML with schema is far more powerful, also because of a richer type-system. JSON dictionaries are most of the time used to encode structs, but for that you have `complexType` and `sequence` in the XML schema.

I've been using XML with strongly-typed schemas for serialization for the last couple of years and couldn't be happier. I have ~100 classes in the schema, yet I've needed a true dictionary like 2 or 3 times.

> And we didn't even need to get into the needless complexity of validation, namespaces, and other junk.

Validation is junk? Isn't it valuable to know that 1) if your schema requires a certain element, and 2) if the document has passed validation, then navigating to that element and parsing it according to its schema type won't throw a run-time exception?

Namespaces are junk? They serve the same purpose as in programming languages. How else would you put two elements of the same name but of different semantics (coming from different sources) into the same document? You can fake this in JSON "by convention", but in XML it's standardized.

crispyambulance · 7 years ago
XML is a perfectly serviceable data exchange format. The parsers and serializers work great when used properly. It's nice to have schema.

But I think people just got sick of XML because it was abused so badly with "web services", SOAP, wsdl and all those horrible technologies from the early naughts. Over-complicated balls of mud that made people miserable.

tootie · 7 years ago
When I was doing XML/Java stuff 10 years ago, you take your XSD and generate domain classes as a build step. It was more complicated but it was also 100% reliable because the tools were all rock solid. Written by the guy who made Jenkins.
hombre_fatal · 7 years ago
Not the best context to suggest XML superiority: https://cheatsheetseries.owasp.org/cheatsheets/XML_Security_...

If parsing JSON is bad, XML is a clusterfuck.

Nicksil · 7 years ago
> Not the best context to suggest XML superiority

Where was it insinuated that XML is superior? It was a very reasonable response.

specialist · 7 years ago
Syntax aside, I think the original mistake is IDLs, schemas, and other attempts at formalism.

WSDL, SOAP, and all their precursors were attempted in spite of Postel's Law.

Repeating myself:

Back when I was doing electronic medical records, my two-person team ran circles around our (much larger) partners by abandoning the schema tool stack. We were able to detect, debug, correct interchange problems and deploy fixes in near realtime. Whereas our partners would take days.

Just "screen scrape" inbound messages, use templates to generate outbound messages.

I'd dummy up working payloads using tools like SoapUI. Convert those known-good "reference" payloads into templates. (At the time, I preferred Velocity.) Version everything. To troubleshoot, rerun the reference messages, diff the captured results. Massage until working.

Our partners, and everyone I've told since, just couldn't grok this approach. No, no, no, we need schemas, code generators, etc.

There's a separate HN post about Square using DSLs to implement OpenAPI endpoints. That's maybe 1/4th of the way to our own home made solution.

Zarel · 7 years ago
I personally like XML a lot for rich text (I like HTML better than TeX) and layout (like in JSX for React), and it's not horrible if you want a readable representation for a tree, but I can't imagine using it for any other purpose.

JSON is exactly designed for object serialization. XML can be used for that purpose but it's awkward and requires a lot of unnecessary decisions (what becomes a tag? what becomes an attribute? how do you represent null separately from the empty string?) which just have an easy answer in JSON. And I can't think of any advantage XML has to make up for that flaw. Sure, XML can have schemas, but so can JSON.

I will agree that JSON is horrible for config files for humans to edit, but XML is quite possibly even worse at that. I don't really like YAML, either. TOML isn't bad, but I actually rather like JSON5 for config files - it's very readable for everyone who can read JSON, and fixes all the design decisions making it hard for humans to read and edit.

taftster · 7 years ago
One of the biggest advantages for XML are attributes and namespaces. I miss these in JSON.

As AtlasBarfed mentioned, JSON has a native map and list structure in its syntax, which is sorely missed in XML. You have to rely on an XML Schema to know that some tag is expected to represent a map or list.

JSON with attributes and namespaces would be my ideal world.

beatgammit · 7 years ago
Why do you want those? Attributes and namespaces just make in memory representation complicated. They're quite useful for markup, but I don't really know why you'd want them in a data format.

Use JSON or a binary protocol for data, XML for markup.

legulere · 7 years ago
If JSON with its relative simplicity is already too complex and leading to a mine field, then XML is even worse by far.
twblalock · 7 years ago
XML manages to be difficult and complex for both computers and people to read. That's why it fell out of favor.
dwaite · 7 years ago
To be fair, there were a lot of very good ideas for a 2.x XML that solved a lot of the complexity. The problem was that none of the tools would be upgraded to support it.

You'd basically have to create a new independent format to have proper compatibility once you introduce breaking changes.

Deleted Comment

carapace · 7 years ago
Not to mention XSLT.
inopinatus · 7 years ago
Once you've parsed the first minefield, another crop emerges: interpreting the result. Even the range of values seen in the wild for a supposedly simple boolean attribute is just mind-boggling. Setting aside all the noise from jokers trying it on with fuzzing engines, we'll see all of these presented to various APIs:

    true
    false
    null
    0 | 1
    "true" | "false"    (with assorted variation by
    "yes"  | "no"        case and initial character)
    "" | "0" | "1"
    "\u2713"            (hi DHH)
    -1                  (with complements)
    "[object Object]"
    { "value": true }   (and friends)
                        (attribute not present)
    "敵牴"
That last looks like a doozy, but old lags will guess what's going on right away. It's the octets of the 8-bit string "true", misinterpreted as UCS-2 (16-bit wide character) code points and then spat out as UTF-8. Google translates it, quite appropriately, as "Enemy".
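The effect is easy to reproduce in Node (a sketch; depending on which byte order the mangling assumed, you get these two CJK characters in one order or the other):

```javascript
// Take the four octets of "true" and misread them as two 16-bit
// little-endian code units: 0x7274 and 0x6575, both CJK characters.
const octets = Buffer.from('true', 'latin1');   // [0x74, 0x72, 0x75, 0x65]
const mangled = octets.toString('utf16le');     // "\u7274\u6575"
```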

Oddly though, according to my records, never seen a "NULL".
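A defensive normalizer for a zoo like that is ugly but sometimes unavoidable (a sketch; which spellings you accept is a policy decision, not something any spec blesses):

```javascript
// Map boolean representations seen in the wild onto true/false/undefined.
function looseBool(v) {
  if (typeof v === 'boolean') return v;
  if (v === null || v === undefined) return undefined;   // attribute absent
  if (typeof v === 'number') return v !== 0;             // 0 | 1, -1, ...
  if (typeof v === 'object') return looseBool(v.value);  // { "value": true }
  const s = String(v).trim().toLowerCase();
  if (['true', 't', 'yes', 'y', '1', '\u2713'].includes(s)) return true;
  if (['false', 'f', 'no', 'n', '0', ''].includes(s)) return false;
  return undefined;  // unrecognized ("[object Object]"...): caller decides
}
```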

umvi · 7 years ago
I'm fine with a parser that doesn't get all of the corner cases as long as it fails gracefully.

Really, the only time it would matter is if you are parsing user-provided JSON and said user was trying to exploit your parser somehow.

But 99% of the time, I'm not parsing user-provided JSON, so I don't ever encounter these corner cases and parsing/serialization works great.

juliusmusseau · 7 years ago
What about the 2^63 corner-case?

Consider this JSON: {"key": 9223372036854775807}. With most parsers it never fails.

But... some JSON parsers (including JS's eval) parse it to 9223372036854776000 and continue on their merry way.

The problem isn't user-provided JSON here. The problem is user-provided data (or computer-provided data) that's inside the JSON.

rachelbythebay's take (http://rachelbythebay.com/w/2019/07/21/reliability/):

> On the other hand, if you only need 53 bits of your 64-bit numbers, and enjoy blowing CPU on ridiculously inefficient marshaling and unmarshaling steps, hey, it's your funeral.

jacobolus · 7 years ago
> some JSON parsers (include JS.eval) parse it to 9223372036854776000 and continue on their merry way

This is correct behavior though...? Every number in JSON is implicitly a double-precision float. JSON doesn’t distinguish other number types.

If you want that big a string of digits in JSON, put it in a string.

Edit: let me make a more precise statement since several people seem to have a problem with the one above:

Every number that you send to a typical JavaScript JSON parser is implicitly a double-precision float, and it is correct behavior for a JavaScript JSON parser to treat a long string of digits as a double-precision float, even if that results in lost precision.

The JSON specification itself punts on the precise semantic meaning of numbers, leaving it up to producers and consumers of the JSON to coordinate their number interpretation.
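The string workaround in practice (a sketch; the field name is illustrative):

```javascript
// Carry 64-bit values as strings and convert at the edges with BigInt.
const viaString = BigInt(JSON.parse('{"id": "9223372036854775807"}').id);
// viaString === 9223372036854775807n, exact.

// The bare number, by contrast, gets rounded to the nearest double:
const viaNumber = JSON.parse('{"id": 9223372036854775807}').id;
// viaNumber === 9223372036854776000
```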

shawnz · 7 years ago
In what situation would that create a problem that isn't noticed immediately during testing?
majewsky · 7 years ago
> 99% of the time, I'm not parsing user-provided JSON

I take it you've never implemented a service with a REST API.

Deleted Comment

umvi · 7 years ago
I have, but I work
nullwasamistake · 7 years ago
JSON sucks. Maybe half our REST bugs are directly related to JSON parsing.

Is that a long or an int? Boolean or the string "true"? Does my library include undefined properties in the JSON? How should I encode and decode this binary blob?

We tried using OpenApi specs on the server and generators to build the clients. In general, the generators are buggy as hell. We eventually gave up as about 1/4 of the endpoints generated directly from our server code didn't work. One look at a spec doc will tell you the complexity is just too high.

We are moving to gRPC. It just works, and takes all the fiddling out of HTTP. It saves us from dev slap fights over stupid cruft like whether an endpoint should be PUT or POST. And saves us a massive amount of time making all those decisions.

hu3 · 7 years ago
Off-topic, but I'd want to work at a place where half the REST bugs are from JSON parsing.
craigds · 7 years ago
Yeah I don't believe I've ever seen a json parsing problem in 11 years of software development.
nullwasamistake · 7 years ago
Just get a boring webapp job in CRUD world :)
chairmanwow · 7 years ago
I have had the absolute joy of working with gRPC services recently. Static schemas and built in streaming mechanics are fantastic. It definitely removes a lot of my gripes with REST endpoints by design.
truth_seeker · 7 years ago
Just recently, V8, the JS engine, rewrote its JSON parsing code to achieve up to 2.7x faster parsing while also making it more memory-efficient.

Ref link: https://v8.dev/blog/v8-release-76

majewsky · 7 years ago
Ah, so that's the source of that Chrome bug that we saw last week. Customers on Chrome for Windows (only that, not Chrome for Linux or macOS) were complaining that the search on our statically-generated documentation site was not working. The search is implemented by a JavaScript file that downloads a JSON containing a search index, and it turns out that this search index had too much nesting for Chrome on Windows's JSON parser. This would reliably produce a stack overflow:

  JSON.parse(Array(3000).join('[')+Array(3000).join(']'))
We were about to report a bug when we noticed that the problem was fixed in Chrome 76, and the users in question were still on Chrome 75.

zazagura · 7 years ago
Pretty weird for a JSON parser to be platform dependent.

Deleted Comment

iamleppert · 7 years ago
Check out the simd JSON project if you’re interested in a super fast JSON parser:

https://github.com/lemire/simdjson

I’ve been using to process and maintain giant JSON structures and it’s faster than any other parser I’ve tried. I was able to replace my previous batch job with this as it gives real-time performance.

calcifer · 7 years ago
This seems to have nothing to do with the article though?
iamleppert · 7 years ago
It’s a JSON parser?
kthejoker2 · 7 years ago
How does it do on the article's test suite?
glangdale · 7 years ago
[ Original designer of much of simdjson here ]

We haven't used that particular suite, but almost everything in that suite is something we've thought about. In many cases we do the right thing by not innovating, i.e. by not randomly allowing stuff that isn't in the spec.

I see exactly one thing we didn't think about. Our construction of a parse tree is pretty basic and we don't build an associative structure even when building up an object; thus we would not register an error when confronted with the malformed input listed under "2.4 Objects Duplicated Keys", but would happily build a parse tree with duplicated keys (which will be built up strictly as a linear structure, not an associative one).

There seems to be leeway on this point as to what an implementation should do. It certainly doesn't fit our usage model very well to build an associative structure right there on the spot - some of our users wouldn't want that much complexity/overhead.
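For comparison, here's what a typical dynamic-language parser does with that malformed input (a sketch):

```javascript
// JavaScript's JSON.parse accepts duplicated keys and silently keeps the
// last occurrence; the first value vanishes without any error.
const dup = JSON.parse('{"a": 1, "a": 2}');   // → { a: 2 }
```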

iamleppert · 7 years ago
I haven’t tested it but it parses all my JSON just fine