I suppose you could say that parsing any text-based protocol in general "Is a Minefield". Text-based protocols look so simple and "readable", which is why they're appealing initially, but parsing text always involves lots of corner cases. I've always thought it a huge waste of resources to use text-based protocols for data that, the vast majority of the time, isn't actually meant for human consumption.
Consider something as simple as parsing an integer in a text-based format: there may be whitespace to skip, an optional sign character, and then a loop to accumulate digits and convert them (itself a subtraction, multiply, and add), and there are still the questions of all the invalid cases and what they should do. In contrast, in a binary format, all that's required is to read the data, and the most complex thing which might be required is endianness conversion. Length-prefixed binary formats are almost trivial to parse, on par with reading a field from a structure.
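The contrast being drawn here can be sketched in a few lines of Python (illustrative helper names of my own, not any particular library's API):

```python
import struct

def parse_int_text(s: str) -> int:
    """Hand-rolled text integer parse: skip whitespace, optional sign, digit loop."""
    i = 0
    while i < len(s) and s[i].isspace():    # skip leading whitespace
        i += 1
    sign = 1
    if i < len(s) and s[i] in "+-":         # optional sign character
        sign = -1 if s[i] == "-" else 1
        i += 1
    if i >= len(s) or not s[i].isdigit():   # one of the invalid cases to decide on
        raise ValueError("no digits")
    value = 0
    while i < len(s) and s[i].isdigit():    # accumulate: subtract, multiply, add
        value = value * 10 + (ord(s[i]) - ord("0"))
        i += 1
    return sign * value

def parse_int_binary(buf: bytes) -> int:
    """Fixed-width binary parse: one read plus (at most) an endianness swap."""
    return struct.unpack("<i", buf[:4])[0]  # little-endian 32-bit signed
```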
> Length-prefixed binary formats are almost trivial to parse
They definitely are not, as displayed by the fact that binary lengths are the root cause of a huge number of security flaws. JSON mostly avoids that.
> the most complex thing which might be required is endianness conversion
That is a gross simplification. When you look at the details of binary representations, things get complex, and you end up with corner cases.
Let's look at floating point numbers: with a binary format you can transmit NaN, Infinity, -infinity, and -0. You can also create two NaN numbers that do not have the same binary representation. You have to choose single or double precision (maybe a benefit, not always). Etc.
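To make that concrete, here is a small Python sketch (helper names are my own): a raw double round-trips NaN, Infinity, and -0 bit-for-bit, while strict JSON has no literal for them at all.

```python
import json
import math
import struct

def roundtrip_binary(x: float) -> float:
    """Binary round trip: 8 bytes, bit pattern preserved (incl. NaN, -0, inf)."""
    return struct.unpack("<d", struct.pack("<d", x))[0]

def json_can_encode(x: float) -> bool:
    """Strict (RFC 8259) JSON has no token for NaN or the infinities."""
    try:
        json.dumps(x, allow_nan=False)
        return True
    except ValueError:
        return False
```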
Similarly in JSON integers or arrays of integers are nothing special. It is mostly a benefit not to have to specify UInt8Array.
JSON is one of many competitors within an ecology of programming, including binary formats, and yet JSON currently dominates large parts of that ecology. So far a binary format mutation hasn't beaten JSON, which is telling since binary had the early advantage (well: binary definitely wins in parts of the ecology, just as JSON wins in other parts).
> They definitely are not, as displayed by the fact that binary lengths are the root cause of a huge number of security flaws. JSON mostly avoids that.
I assume you're mainly referring to buffer overflows, which are a problem with text-based formats too. See for example the series of overflow vulnerabilities in IIS's HTTP parser which led to some of the most disruptive worms in history, like Code Red. Really this is more of a problem with memory-unsafe languages than serialization formats.
> Let's look at floating point numbers: with a binary format you can transmit NaN, Infinity, -infinity, and -0
Depending on the use case being able to encode these values may be a requirement, in which case binary is no worse than text.
> You can also create two NaN numbers that do not have the same binary representation.
This is specific to IEEE 754; not all binary representations have this issue. Text-based formats also have far more pervasive problems with lacking a canonical representation, so it's hard to count this as a point against binary.
> Similarly in JSON integers or arrays of integers are nothing special.
JSON is perfectly capable of representing integers which cannot be represented in IEEE-754 double precision floating point. That seems at least a little special to me.
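A quick Python check of this (Python's json module happens to decode integer literals as exact arbitrary-precision ints):

```python
import json

big = 2**60 + 1   # 1152921504606846977 needs 61 bits of mantissa; a double has 53

assert json.loads(str(big)) == big   # JSON text carries it exactly...
assert int(float(big)) == big - 1    # ...a double silently rounds it away
```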
> > Length-prefixed binary formats are almost trivial to parse
> They definitely are not, as displayed by the fact that binary lengths are the root cause of a huge number of security flaws. JSON mostly avoids that.
Injection (forgetting to escape embedded text) is the root cause of a huge number of security flaws for text formats. Length-prefixed formats do not suffer from this.
What "huge number of security flaws" are you referring to that affect length-delimited values? Buffer overflows? Those aren't caused by values being length-delimited, they are caused by people deserializing variable-length values into fixed-length buffers without a check. That mistake can be made just as easily with a text format as with a binary format. In fact I've seen it much more often with text.
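For illustration, a bounds-checked length-prefixed read in Python (a sketch, with hypothetical helper names) - the overflow lives in omitting the check, not in the format:

```python
import struct

def read_length_prefixed(buf: bytes, offset: int = 0) -> tuple[bytes, int]:
    """Read one length-prefixed field, returning (value, next_offset).

    The declared length is validated against the buffer before use; skipping
    this check is the actual bug behind length-related overflows.
    """
    if offset + 4 > len(buf):
        raise ValueError("truncated length prefix")
    (n,) = struct.unpack_from("<I", buf, offset)
    if offset + 4 + n > len(buf):       # the check careless code omits
        raise ValueError("declared length exceeds buffer")
    start = offset + 4
    return buf[start:start + n], start + n
```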
> JSON currently dominates large parts of that ecology.
JSON wins for one simple reason: it's easy for human developers to think about, because they can see what it looks like. This is very comforting. It's wasteful and full of pitfalls ("oops, my int64 silently lost a bunch of bits because JSON numbers are all floating point"), but comforting. Even I find it comforting.
Ironically, writing a full JSON parser from scratch is much more complicated than writing a full Protobuf parser. But developers are more comfortable with the parser being a black box than with the data format itself being a black box. ¯\_(ツ)_/¯
(Disclosure: I am the author of Protobuf v2 and Cap'n Proto. In addition to binary native formats, both have text-based alternate formats for which I wrote parsers and serializers, and I've also written a few JSON parsers in my time...)
> You can also create two NaN numbers that do not have the same binary representation.
Way worse than that. The interpretation of those different values of NaN is software specific. You can also have signalling NaN values - where the recipient can now have their number handling code trap in completely unexpected scenarios.
This is a little too generous about the benefits of binary formats vs text formats. Ultimately, any data exchange between disparate systems is going to be a challenging task, no matter what format you choose. Both sides have to implement it in a compatible way. And ultimately, every format is a binary format. Encoding machine-level data structures directly on the wire sounds good, but it quickly gets complicated when you have to deal with multiple architectures and languages. And you don't have the benefit of gradually accreted de-facto conventions (like UTF-8 encoding for text-based formats) to fall back on, much less the ability for humans to troubleshoot by reading the wire protocol.
With sufficient discipline and rigor, and a good suite of tests, developed over years of practical experience, you can evolve a good general binary wire protocol, but by then it will turn out to be so complicated and heavyweight to use, that some upstart will come up with a NEW FANTASTIC format that doesn't have any of the particular annoyances of your rigorous protocol, and developers will flock to this innovative and efficient new format because it will help them get stuff done much faster, and most of them won't run into the edge cases the new format doesn't cover for years, and then some of them will write articles like this one and comments like yours and we can repeat the cycle every 10-20 years, just like we've been doing.
>Consider something as simple as parsing an integer in a text-based format: there may be whitespace to skip, an optional sign character, and then a loop to accumulate digits and convert them (itself a subtraction, multiply, and add), and there are still the questions of all the invalid cases and what they should do.
^ *-?[0-9][0-9]* *$
You're welcome. Anything that passes that regex is a valid integer. Now using that as the basis of a lexer means that you can store any int in whatever precision you feel like.
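A sketch of that lexer rule in Python (hypothetical names; Python's int() gives arbitrary precision, so nothing is silently rounded):

```python
import re

# Anchored: optional spaces, optional sign, one or more digits, optional spaces.
INT_RE = re.compile(r"^ *(-?[0-9]+) *$")

def lex_int(token: str) -> int:
    """Match the integer rule, then convert at whatever precision we like."""
    m = INT_RE.match(token)
    if m is None:
        raise ValueError(f"not an integer token: {token!r}")
    return int(m.group(1))   # arbitrary precision, no rounding
```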
It's unfortunate that the majority of programmers these days are so computer illiterate that they can't write a parser for matching parens and call you an elitist for pointing out this is something anyone with a year of programming should be able to do in their sleep.
Because of this article (which I encountered a year ago) I would say Parsing JSON is no longer a minefield.
I had to write my own JSON parser/formatter a year ago (to support Java 1.2 - don't ask) and this article and its supporting github repo (https://github.com/nst/JSONTestSuite) was an unexpected gift from the heavens.
Good point. I am presuming the test suite is comprehensive. Does it cover 100% of all JSON mines? Probably not. But it surfaced about 30 bugs in my own implementation - things I would have never dreamed of.
So it certainly helped me. And just based on how thorough and insane the test suite is, I think I'm in good hands. Not perfect hands - but definitely a million times better than anything I would have come up with on my own.
The test suite made my parser blow up many times, and for each blow up I got to make a conscious decision in my bugfix: how do I want to handle this?
(I decided to let the 10,000-depth nested {{{{{{{{{{{{{{{{{{"key":"value"}}}}}}}}}}}}}}}}}} guy blow up even though it is legal. Yes, I'm too lazy to implement my own stack.) :-)
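One way to make that a deliberate decision rather than a crash is a cheap depth pre-pass before the recursive parser ever runs; a Python sketch (hypothetical helper, limit chosen arbitrarily):

```python
def max_nesting(doc: str, limit: int = 512) -> int:
    """Reject pathologically nested input before a recursive parser sees it.

    String contents are skipped so brackets inside strings don't count.
    """
    depth = peak = 0
    in_str = False
    i = 0
    while i < len(doc):
        c = doc[i]
        if in_str:
            if c == "\\":
                i += 1             # skip the escaped character
            elif c == '"':
                in_str = False
        elif c == '"':
            in_str = True
        elif c in "[{":
            depth += 1
            peak = max(peak, depth)
            if depth > limit:
                raise ValueError("nesting too deep")
        elif c in "]}":
            depth -= 1
        i += 1
    return peak
```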
This might be throwing a lit match into a gasoline refinery, but why not opt for XML in some circumstances?
Between its strong schema support and WSDL support for internet standards like SOAP web services, XML covers a lot of ground that JSON encoding doesn't necessarily cover without add-ons.
I say this knowing this is an unfashionable opinion and XML has its own weaknesses, but in the spirit of using web standards and LoC approved "archivable formats", IMO there is still a place for XML in many serialization strategies around the computing landscape.
JSON is perfect for serializing between client and server or in progressive web apps running in JavaScript. It is quite serviceable in other places as well, such as microservice REST APIs, but in other areas of the landscape - middleware, database record excerpts, desktop settings, data transfer files - JSON is not much better, and sometimes even slightly worse, than XML.
XML cannot be parsed into nested maps/dictionaries/lists/arrays without guidance from a type or a restricted xml structure.
JSON can do that. It also maps pretty seamlessly to types/classes in most languages without annotations, attributes, or other serialization guides.
It also has explicit indicators for lists vs subdocuments vs values for keys, which XML does not. XML tags can repeat, can have subtags, and then there are tag attributes. A JSON document can also be a list, while an XML document must be a tree with a single root element.
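The ambiguity is easy to demonstrate with Python's standard libraries:

```python
import json
import xml.etree.ElementTree as ET

# With one <item>, XML alone cannot say whether <items> is a scalar wrapper or
# a list that happens to have one element; only convention or a schema decides.
one = ET.fromstring("<items><item>1</item></items>")
two = ET.fromstring("<items><item>1</item><item>2</item></items>")
assert [e.text for e in one.findall("item")] == ["1"]
assert [e.text for e in two.findall("item")] == ["1", "2"]

# JSON marks the list in the syntax itself, with no schema needed:
assert json.loads("[1]") == [1]
assert json.loads("1") == 1
```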
XML may be acceptable for documents. But seeing as how XHTML was a complete dud, I doubt it is useful even for that.
And we didn't even need to get into the needless complexity of validation, namespaces, and other junk.
> And we didn't even need to get into the needless complexity of validation, namespaces, and other junk.
So, that’s why we’re adding all of this “junk” back into JSON? Transformers, XPath for JSON, validation, schemas, namespaces (JSON-LD, JSON prefixes) it’s all there.
History repeating itself (and here’s the important part) because this complexity is needed. Not every application will need every complication, but every complication is needed by some application.
> XML cannot be parsed into nested maps/dictionaries/lists/arrays without guidance from a type or a restricted xml structure.
? Using XML without a schema is slightly worse than JSON because the content of each node is just "text". XML with schema is far more powerful, also because of a richer type-system. JSON dictionaries are most of the time used to encode structs, but for that you have `complexType` and `sequence` in the XML schema.
I've been using XML with strongly-typed schemas for serialization for the last couple of years and couldn't be happier. I have ~100 classes in the schema, yet I've needed a true dictionary like 2 or 3 times.
> And we didn't even need to get into the needless complexity of validation, namespaces, and other junk.
Validation is junk? Isn't it valuable to know that 1) if your schema requires a certain element, and 2) if the document has passed validation, then navigating to that element and parsing it according to its schema type won't throw a run-time exception?
Namespaces are junk? They serve the same purpose as in programming languages. How else would you put two elements of the same name but of different semantics (coming from different sources) into the same document? You can fake this in JSON "by convention", but in XML it's standardized.
XML is a perfectly serviceable data exchange format. The parsers and serializers work great when used properly. It's nice to have schema.
But I think people just got sick of XML because it was abused so badly with "web services", SOAP, wsdl and all those horrible technologies from the early naughts. Over-complicated balls of mud that made people miserable.
When I was doing XML/Java stuff 10 years ago, you'd take your XSD and generate domain classes as a build step. It was more complicated, but it was also 100% reliable because the tools were all rock solid. Written by the guy who made Jenkins.
Syntax aside, I think the original mistake is IDLs, schemas, and other attempts at formalism.
WSDL, SOAP, and all their precursors were attempted in spite of Postel's Law.
Repeating myself:
Back when I was doing electronic medical records, my two-person team ran circles around our (much larger) partners by abandoning the schema tool stack. We were able to detect, debug, correct interchange problems and deploy fixes in near realtime. Whereas our partners would take days.
Just "screen scrape" inbound messages, and use templates to generate outbound messages.
I'd dummy up working payloads using tools like SoapUI. Convert those known good "reference" payloads into templates. (At the time, I preferred Velocity.) Version everything. To troubleshoot, rerun the reference messages, diff the captured results. Massage until working.
Our partners, and everyone I've told since, just couldn't grok this approach. No, no, no, we need schemas, code generators, etc.
There's a separate HN post about Square using DSLs to implement OpenAPI endpoints. That's maybe 1/4th of the way to our own home made solution.
I personally like XML a lot for rich text (I like HTML better than TeX) and layout (like in JSX for React), and it's not horrible if you want a readable representation for a tree, but I can't imagine using it for any other purpose.
JSON is exactly designed for object serialization. XML can be used for that purpose but it's awkward and requires a lot of unnecessary decisions (what becomes a tag? what becomes an attribute? how do you represent null separately from the empty string?) which just have an easy answer in JSON. And I can't think of any advantage XML has to make up for that flaw. Sure, XML can have schemas, but so can JSON.
I will agree that JSON is horrible for config files for humans to edit, but XML is quite possibly even worse at that. I don't really like YAML, either. TOML isn't bad, but I actually rather like JSON5 for config files - it's very readable for everyone who can read JSON, and fixes all the design decisions making it hard for humans to read and edit.
One of the biggest advantages for XML are attributes and namespaces. I miss these in JSON.
As AtlasBarfed mentioned, JSON has a native map and list structure in its syntax, which is sorely missed in XML. You have to rely on an XML Schema to know that some tag is expected to represent a map or list.
JSON with attributes and namespaces would be my ideal world.
Why do you want those? Attributes and namespaces just make in memory representation complicated. They're quite useful for markup, but I don't really know why you'd want them in a data format.
Use JSON or a binary protocol for data, XML for markup.
To be fair, there were a lot of very good ideas for a 2.x XML that solved a lot of the complexity. The problem was that none of the tools would be upgraded to support it.
You'd basically have to create a new independent format to have proper compatibility once you introduce breaking changes.
Once you've parsed the first minefield, another crop emerges: interpreting the result. Even the range of values seen in the wild for a supposedly simple boolean attribute is just mind-boggling. Setting aside all the noise from jokers trying it on with fuzzing engines, we'll see all of these presented to various APIs:
That last looks like a doozy, but old lags will guess what's going on right away. It's the octets of the 8-bit string "true", misinterpreted as UCS-2 (16-bit wide character) code points and then spat out as UTF8. Google translates it, quite appropriately, as "Enemy".
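The reconstruction is easy to verify in Python (assuming the little-endian pairing; one of the resulting code points, U+6575, is the character for "enemy"):

```python
# Take the ASCII octets of "true", pair them up as little-endian UTF-16 code
# units, then write those code points back out as UTF-8 - the mojibake the API
# actually receives.
octets = "true".encode("ascii")        # 74 72 75 65
as_ucs2 = octets.decode("utf-16-le")   # U+7274, U+6575
assert as_ucs2 == "\u7274\u6575"
mojibake = as_ucs2.encode("utf-8")     # two CJK characters, 3 bytes each
```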
Oddly though, according to my records, I've never seen a "NULL".
On the other hand, if you only need 53 bits of your 64-bit numbers, and enjoy blowing CPU on ridiculously inefficient marshaling and unmarshaling steps, hey, it's your funeral.
> some JSON parsers (include JS.eval) parse it to 9223372036854776000 and continue on their merry way
This is correct behavior though...? Every number in JSON is implicitly a double-precision float. JSON doesn’t distinguish other number types.
If you want that big a string of digits in JSON, put it in a string.
Edit: let me make a more precise statement since several people seem to have a problem with the one above:
Every number that you send to a typical JavaScript JSON parser is implicitly a double-precision float, and it is correct behavior for a JavaScript JSON parser to treat a long string of digits as a double-precision float, even if that results in lost precision.
The JSON specification itself punts on the precise semantic meaning of numbers, leaving it up to producers and consumers of the JSON to coordinate their number interpretation.
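Python's json module makes the two interpretations easy to compare, since its parse_int hook lets you opt into the JavaScript-style double coercion:

```python
import json

digits = "9223372036854775807"   # 2**63 - 1: exact as JSON text

# Default Python behavior: an exact arbitrary-precision int.
assert json.loads(digits) == 9223372036854775807

# JavaScript-style behavior, simulated with the parse_int hook: the digits
# are forced through a double and round up to 2**63.
assert json.loads(digits, parse_int=float) == 2**63
```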
JSON sucks. Maybe half our REST bugs are directly related to JSON parsing.
Is that a long or an int? Boolean or the string "true"? Does my library include undefined properties in the JSON? How should I encode and decode this binary blob?
We tried using OpenAPI specs on the server and generators to build the clients. In general, the generators are buggy as hell. We eventually gave up, as about 1/4 of the endpoints generated directly from our server code didn't work. One look at a spec doc will tell you the complexity is just too high.
We are moving to gRPC. It just works, and takes all the fiddling out of HTTP. It saves us from dev slap fights over stupid cruft like whether an endpoint should be PUT or POST. And saves us a massive amount of time making all those decisions.
I have had the absolute joy of working with gRPC services recently. Static schemas and built in streaming mechanics are fantastic. It definitely removes a lot of my gripes with REST endpoints by design.
Ah, so that's the source of that Chrome bug that we saw last week. Customers on Chrome for Windows (only that, not Chrome for Linux or macOS) were complaining that the search on our statically-generated documentation site was not working. The search is implemented by a JavaScript file that downloads a JSON containing a search index, and it turns out that this search index had too much nesting for Chrome on Windows's JSON parser. This would reliably produce a stack overflow:
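The same failure mode is easy to reproduce outside the browser; for example, CPython's recursive JSON decoder gives up on deeply nested input (a sketch - the Chrome internals aren't shown here):

```python
import json

# Recursive-descent parsers map document nesting onto the call stack, so a
# deep enough document exhausts it; CPython surfaces this as RecursionError.
deep = "[" * 100000 + "]" * 100000
try:
    json.loads(deep)
    print("parsed")              # don't count on any particular depth limit
except RecursionError:
    print("stack limit hit")
```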
I’ve been using to process and maintain giant JSON structures and it’s faster than any other parser I’ve tried. I was able to replace my previous batch job with this as it gives real-time performance.
We haven't used that particular suite, but almost everything in that suite is something we've thought about. In many cases we do the right thing by not innovating and randomly allowing stuff that isn't in the spec.
I see exactly one thing we didn't think about, as our construction of a parse tree is pretty basic and we don't build an associative structure even when building up an object - thus we would not register an error when confronted with the malformed input listed under "2.4 Objects Duplicated Keys", but happily build a parse tree with duplicated keys (which will be built up strictly as a linear structure, not an associative one).
There seems to be leeway on this point as to what an implementation should do. It certainly doesn't fit our usage model very well to build an associative structure right there on the spot - some of our users wouldn't want that much complexity/overhead.
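For comparison, Python's json module takes the last-one-wins approach by default, and its object_pairs_hook is one way to surface duplicates as an error instead:

```python
import json

# Default behavior: the last duplicate silently wins.
assert json.loads('{"a": 1, "a": 2}') == {"a": 2}

def reject_duplicates(pairs):
    """object_pairs_hook that raises instead of overwriting."""
    d = {}
    for key, value in pairs:
        if key in d:
            raise ValueError(f"duplicate key: {key!r}")
        d[key] = value
    return d

def strict_loads(s):
    return json.loads(s, object_pairs_hook=reject_duplicates)

assert strict_loads('{"a": 1, "b": 2}') == {"a": 1, "b": 2}
```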
> JSON is one of many competitors within an ecology of programming, including binary formats, and yet JSON currently dominates large parts of that ecology.
This is just an appeal to popularity fallacy.
Boy you are really stretching to make this sound complicated. It's not. You transmit 4 bytes or 8 bytes. Serialization is a memcpy().
You don't have to think about NaNs and Infinities because they Just Work -- unlike with textual formats where you need to have special representations for them and you have to worry about whether you are possibly losing data by dropping those NaN bits. If you want to drop the NaN bits in a binary format, it's another one-liner to do so.
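A Python sketch of that one-liner, with struct playing the role of the memcpy (nan_bits is an arbitrary quiet-NaN payload made up for the example; bit-exactness assumes a typical IEEE-754 platform):

```python
import math
import struct

def dump_f64(x: float) -> bytes:
    """'memcpy' serialization: the 8 bytes of the double, nothing else."""
    return struct.pack("<d", x)

def load_f64(b: bytes) -> float:
    return struct.unpack("<d", b)[0]

# NaN payload bits survive the round trip untouched:
nan_bits = 0x7FF8DEADBEEF0000                            # a quiet NaN with payload
x = struct.unpack("<d", struct.pack("<Q", nan_bits))[0]
assert math.isnan(x)
assert struct.unpack("<Q", dump_f64(x))[0] == nan_bits
```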
It's funny that you choose to pick on floating-point numbers here, because converting floating-points to decimal text and back is insanely complicated. One of the best-known implementations of converting FP to text is dtoa(), based on the paper (yes, a whole paper) called "How to Print Floating-Point Numbers Accurately". Here's the code:
http://www.netlib.org/fp/dtoa.c
Go take a look. I'll wait.
dtoa() is not even the state of the art anymore. Just in the last few years there have been significant advances, e.g. Grisu2, Grisu3, and Dragon4...
Again, in binary formats, all that is replaced by a memcpy() of 4 or 8 bytes.
(A previous rant of mine on this subject: https://news.ycombinator.com/item?id=17277560 )
I'll use JavaScript numeric literals here as my translation medium (ironic!):
Norway locale parses it to: 1001
USA locale parses it to: 1.001
France locale parses it to: NaN
https://docs.oracle.com/cd/E19455-01/806-0169/overview-9/ind...
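A toy model of what's going on (assumed separator rules per locale, not the real C locale machinery):

```python
import math

def parse_number(token: str, decimal_sep: str, group_sep: str) -> float:
    """Locale-style parse: strip grouping, honor the decimal separator,
    and treat any character the locale doesn't allow as invalid."""
    if any(c not in "0123456789+-" + decimal_sep + group_sep for c in token):
        return float("nan")
    return float(token.replace(group_sep, "").replace(decimal_sep, "."))

assert parse_number("1.001", ".", ",") == 1.001     # en_US-style: '.' is the decimal point
assert parse_number("1.001", ",", ".") == 1001.0    # nb_NO-style: '.' groups digits
assert math.isnan(parse_number("1.001", ",", " "))  # fr_FR-style: '.' is simply invalid
```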
Doesn’t the test suite’s matrix demonstrate that there are tons of cases that aren’t handled consistently across these parsers?
If you're emitting JSON you can skim the list and avoid all of them.
Either way the minefield proper is no longer your problem.
Between its strong schema and wsdl support for internet standards like soap web services, XML covers a lot of ground that Json encoding doesn't necessarily have without add-ons.
I say this knowing this is an unfashionable opinion and XML has its own weaknesses, but in the spirit of using web standards and LoC approved "archivable formats", IMO there is still a place for XML in many serialization strategies around the computing landscape.
Json is perfect for serializing between client and server operations or in progressive web apps running in JavaScript. It is quite serviceable in other places as well such as microservice REST APIs, but in other areas of the landscape like middleware, database record excerpts, desktop settings, data transfer files, Json is not much better or sometimes even slightly worse than XML.
JSON can do that. It also maps pretty seamlessly to types/classes in most languages without annotations, attributes, or other serialization guides.
It also has explicit indicators for lists vs subdocuments vs values for keys, which xml does not. XML tags can repeat, can have subtags, and then there are tag attributes. A JSON document can also be a list, while XML documents must be a tree with a root document.
XML may be acceptable for documents. But seeing as how XHTML was a complete dud, I doubt it is useful even for that.
And we didn't even need to get into the needless complexity of validation, namespaces, and other junk.
So, that’s why we’re adding all of this “junk” back into JSON? Transformers, XPath for JSON, validation, schemas, namespaces (JSON-LD, JSON prefixes) it’s all there.
History repeating itself (and here’s the important part) because this complexity is needed. Not every application will need every complication, but every complication is needed by some application.
? Using XML without a schema is slightly worse than JSON because the content of each node is just "text". XML with schema is far more powerful, also because of a richer type-system. JSON dictionaries are most of the time used to encode structs, but for that you have `complexType` and `sequence` in the XML schema.
I've been using XML with strongly-typed schemas for serialization for the last couple of years and couldn't be happier. I have ~100 classes in the schema, yet I've needed a true dictionary like 2 or 3 times.
> And we didn't even need to get into the needless complexity of validation, namespaces, and other junk.
Validation is junk? Isn't it valuable to know that 1) if your schema requires a certain element, and 2) if the document has passed validation, then navigating to that element and parsing it according to its schema type won't throw a run-time exception?
Namespaces are junk? They serve the same purpose as in programming languages. How else would you put two elements of the same name but of different semantics (coming from different sources) into the same document? You can fake this in JSON "by convention", but in XML it's standardized.
But I think people just got sick of XML because it was abused so badly with "web services", SOAP, WSDL and all those horrible technologies from the early aughts. Over-complicated balls of mud that made people miserable.
If parsing JSON is bad, XML is a clusterfuck.
Where was it insinuated that XML is superior? It was a very reasonable response.
WSDL, SOAP, and all their precursors were attempted in spite of Postel's Law.
Repeating myself:
Back when I was doing electronic medical records, my two-person team ran circles around our (much larger) partners by abandoning the schema tool stack. We were able to detect, debug, correct interchange problems and deploy fixes in near realtime. Whereas our partners would take days.
Just "screen scrape" inbound messages, use templates to generate outbound messages.
I'd dummy up working payloads using tools like SoapUI. Convert those known good "reference" payloads into templates. (At the time, I preferred Velocity.) Version everything. To troubleshoot, rerun the reference messages, diff the captured results. Massage until working.
Our partners, and everyone I've told since, just couldn't grok this approach. No, no, no, we need schemas, code generators, etc.
There's a separate HN post about Square using DSLs to implement OpenAPI endpoints. That's maybe 1/4th of the way to our own home made solution.
JSON is exactly designed for object serialization. XML can be used for that purpose but it's awkward and requires a lot of unnecessary decisions (what becomes a tag? what becomes an attribute? how do you represent null separately from the empty string?) which just have an easy answer in JSON. And I can't think of any advantage XML has to make up for that flaw. Sure, XML can have schemas, but so can JSON.
I will agree that JSON is horrible for config files for humans to edit, but XML is quite possibly even worse at that. I don't really like YAML, either. TOML isn't bad, but I actually rather like JSON5 for config files - it's very readable for everyone who can read JSON, and fixes all the design decisions making it hard for humans to read and edit.
As AtlasBarfed mentioned, JSON has a native map and list structure in its syntax, which is sorely missed in XML. You have to rely on an XML Schema to know that some tag is expected to represent a map or list.
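A small sketch of that structural difference (hypothetical `<item>` elements): JSON's syntax alone tells the parser "this is a list", while XML's repeated tags only look like one, and everything inside them is text until a schema says otherwise.

```python
import json
import xml.etree.ElementTree as ET

# In JSON, list-ness is visible in the syntax itself:
doc = json.loads('{"items": [1, 2, 3]}')
assert isinstance(doc["items"], list)
assert doc["items"] == [1, 2, 3]          # real integers, too

# In XML, repeated tags must be *interpreted* as a list,
# and the contents come back as strings, not numbers:
root = ET.fromstring("<doc><item>1</item><item>2</item><item>3</item></doc>")
items = [el.text for el in root.findall("item")]
assert items == ["1", "2", "3"]
```

Without a schema, nothing in the XML distinguishes "a list of items" from "three unrelated child elements that happen to share a name".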
JSON with attributes and namespaces would be my ideal world.
Use JSON or a binary protocol for data, XML for markup.
You'd basically have to create a new independent format to have proper compatibility once you introduce breaking changes.
Oddly though, according to my records, I've never seen a "NULL".
Really, the only time it would matter is if you are parsing user-provided JSON and said user was trying to exploit your parser somehow.
But 99% of the time, I'm not parsing user-provided JSON, so I don't ever encounter these corner cases and parsing/serialization works great.
Consider this JSON: {"key": 9223372036854775807}. With most parsers it never fails.
But... some JSON parsers (including JavaScript's eval) parse it to 9223372036854775807 and continue on their merry way.
The problem isn't user-provided JSON here. The problem is user-provided data (or computer-provided data) that's inside the JSON.
rachelbythebay's take (http://rachelbythebay.com/w/2019/07/21/reliability/):
On the other hand, if you only need 53 bits of your 64 bit numbers, and enjoy blowing CPU on ridiculously inefficient marshaling and unmarshaling steps, hey, it's your funeral.
This is correct behavior though...? Every number in JSON is implicitly a double-precision float. JSON doesn’t distinguish other number types.
If you want that big a string of digits in JSON, put it in a string.
Edit: let me make a more precise statement since several people seem to have a problem with the one above:
Every number that you send to a typical JavaScript JSON parser is implicitly a double-precision float, and it is correct behavior for a JavaScript JSON parser to treat a long string of digits as a double-precision float, even if that results in lost precision.
The JSON specification itself punts on the precise semantic meaning of numbers, leaving it up to producers and consumers of the JSON to coordinate their number interpretation.
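The precision loss above can be reproduced in a few lines of Python. Python's own `json` module keeps integers exact, but passing `parse_int=float` simulates a JavaScript-style parser where every number is an IEEE-754 double (53 bits of mantissa):

```python
import json

big = '{"key": 9223372036854775807}'

# Python's json module keeps integers exact:
assert json.loads(big)["key"] == 9223372036854775807

# Simulating a JS-style parser, where every number is a double,
# silently rounds the value to the nearest representable double:
as_double = json.loads(big, parse_int=float)["key"]
assert as_double != 9223372036854775807
assert int(as_double) == 9223372036854775808  # rounded to 2**63
```

This is why APIs that need full 64-bit integers (Twitter's snowflake IDs being the classic case) ship them as JSON strings.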
I take it you've never implemented a service with a REST API.
Is that a long or an int? Boolean or the string "true"? Does my library include undefined properties in the JSON? How should I encode and decode this binary blob?
We tried using OpenAPI specs on the server and generators to build the clients. In general, the generators are buggy as hell. We eventually gave up, as about 1/4 of the endpoints generated directly from our server code didn't work. One look at a spec doc will tell you the complexity is just too high.
We are moving to gRPC. It just works, and takes all the fiddling out of HTTP. It saves us from dev slap fights over stupid cruft like whether an endpoint should be PUT or POST. And saves us a massive amount of time making all those decisions.
Ref link: https://v8.dev/blog/v8-release-76
https://github.com/lemire/simdjson
I’ve been using it to process and maintain giant JSON structures, and it’s faster than any other parser I’ve tried. I was able to replace my previous batch job with it, as it gives real-time performance.
We haven't used that particular suite, but almost everything in that suite is something we've thought about. In many cases we do the right thing by not innovating and randomly allowing stuff that isn't in the spec.
I see exactly one thing we didn't think about, as our construction of a parse tree is pretty basic and we don't build an associative structure even when building up an object - thus we would not register an error when confronted with the malformed input listed under "2.4 Objects Duplicated Keys", but happily build a parse tree with duplicated keys (which will be built up strictly as a linear structure, not an associative one).
There seems to be leeway on this point as to what an implementation should do. It certainly doesn't fit our usage model very well to build an associative structure right there on the spot - some of our users wouldn't want that much complexity/overhead.
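The duplicated-keys behavior varies by parser, and callers who care can often hook the object-building step themselves. A sketch using Python's `json` module, whose `object_pairs_hook` sees every key/value pair before the dict swallows duplicates:

```python
import json

dup = '{"a": 1, "a": 2}'

# Default behavior: the last value silently wins.
assert json.loads(dup) == {"a": 2}

# A strict hook can detect duplicates and reject the document:
def strict_pairs(pairs):
    keys = [k for k, _ in pairs]
    if len(keys) != len(set(keys)):
        raise ValueError("duplicate key in JSON object")
    return dict(pairs)

try:
    json.loads(dup, object_pairs_hook=strict_pairs)
except ValueError as e:
    print(e)  # duplicate key in JSON object
```

A parser that builds a linear parse tree, as described above, is effectively handing the caller the raw pairs and leaving this policy decision to them.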