XML is notoriously expensive to properly parse in many languages. Basically, the entire world centers around 3 open source implementations (libxml2, expat and Xerces), if you want to get anywhere close to actual compliance. Even with them, you might hit challenges (libxml2 was largely unmaintained recently, yet it is the basis for many bindings in other languages).
The main property of SGML-derived languages is that they make "list" a first class object, and nesting second class (by requiring "end" tags), and have two axes for adding metadata: one being the tag name, another being attributes.
So while it is a suitable DSL for many things (it is also seeing new life in web components definition), we are mostly talking about an XML-lookalike language, not XML proper. If you go XML proper, you need to throw "cheap" out the window.
Another comment to make here is that you can have an imperative-looking DSL that is interpreted as a declarative one: nothing really stops you from saying that a plain assignment-style notation means exactly the same thing as the XML-alike DSL you've got.
One declarative language I know of that looks like an imperative language but really uses "equations" is METAFONT. See eg. https://en.wikipedia.org/wiki/Metafont#Example (the example might not demonstrate it well, but you can reorder all the equations and it should produce exactly the same result).
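To make that concrete, here is a minimal sketch in TypeScript (all names and the deliberately tiny +/- expression grammar are invented for illustration) of reading imperative-looking assignments as an unordered set of equations:

    // Each line looks like an imperative assignment, but the whole set is
    // treated as unordered equations: the order of lines does not matter.
    const equations = `
      totalOwed = totalTax - totalPayments
      totalTax = tentativeTax + otherTaxes
      totalPayments = estimatedTaxesPaid + refundableCredits
    `;

    type Env = Record<string, number>;

    // Very small evaluator: split each equation into a target and the names it
    // adds/subtracts, then resolve names on demand (memoized), whatever the order.
    function evaluate(source: string, inputs: Env): Env {
      const defs = new Map<string, string>();
      for (const line of source.split("\n")) {
        const [lhs, rhs] = line.split("=").map((s) => s.trim());
        if (lhs && rhs) defs.set(lhs, rhs);
      }
      const cache: Env = { ...inputs };
      const resolve = (name: string): number => {
        if (name in cache) return cache[name];
        const rhs = defs.get(name);
        if (rhs === undefined) throw new Error(`unknown name: ${name}`);
        // Only + and - of plain names, to keep the sketch tiny.
        let total = 0;
        let sign = 1;
        for (const token of rhs.split(/\s+/)) {
          if (token === "+") sign = 1;
          else if (token === "-") sign = -1;
          else total += sign * resolve(token);
        }
        cache[name] = total;
        return total;
      };
      for (const name of defs.keys()) resolve(name);
      return cache;
    }

    // Usage: supply the leaf inputs, ask for any derived value.
    const env = evaluate(equations, {
      tentativeTax: 5000, otherTaxes: 200,
      estimatedTaxesPaid: 4000, refundableCredits: 800,
    });
    console.log(env.totalOwed); // 400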
I keep seeing people make the same mistake as XML made over and over; without learning from it. I will clarify the problem thusly:
> The more capabilities you add to an interchange format, the harder that format is to parse.
There is a reason why JSON is so popular: it supports so little that it is legitimately easy to import. Whereas XML supports attributes, namespaces, CDATA, DTDs, QNames, xml:base, xml:lang, XInclude, etc etc. They gave it everything, including the kitchen sink.
There was a thread here the other day about using Sqlite as an interchange format to REDUCE complexity. Look, I love Sqlite, as an application specific data-store. But much like XML it has a ton of capabilities, which is good for a data-store, but awful for an interchange format with multiple producers/consumers with their own ideas.
CSV may be under-specified, but it remains popular largely due to its simplicity to produce/consume. Unfortunately, we're seeing people slowly ruin JSON by adding e.g. comments to the format, with others then using those "comments" to hold data (e.g. type information), which must be parsed. Which is a bad version of an XML attribute.
I think JSON has the opposite problem: it is too simple, and the lack of comments in particular is bad for many common usages of the format today.
I know some implementations of JSON support comments and other things, but that is not true JSON, in the same way that most simple XML implementations are not true XML. That's why I say "opposite problem": XML is too complex, and most practical uses of XML use incomplete implementations, while many practical uses of JSON use extended implementations.
By the way, this is not a problem for what JSON was designed for: a text interchange format, with JS being the language of choice, but it has gone beyond its design: configuration files, data stores, etc...
I've been working on an XML parser of my own recently and, to be honest, as long as you're fine with a non-validating parser (which are still compliant), it's really not that bad. You have to parse DTDs, but you don't need to actually _do_ anything with them. Namespaces are annoying but they're not in the main spec. CDATA sections aren't all that useful, but they're easy to parse. As far as I'm aware, parsers don't actually need to handle xml:lang/xml:space/etc themselves - they're for use by applications using the parser. Really the only thing that's been particularly frustrating for me is entity expansion.
If you want to support the wider XML ecosystem, with all the complex auxiliary standards, then yes, it's a lot of work, but the language itself isn't that awful to parse. It's a little messy, but I appreciate it at least being well-specified, which JSON is absolutely not.
CSTML is my attempt to fix all these issues with XML and revive the idea of HTML as a specific subset of a general data language.
As you mention, one of the major learnings from the success of JSON was to keep the syntax stupid-simple -- easy to parse, easy to handle. Namespaces were probably the feature to get the most rework.
In theory it could also revive the ability we had with XHTML/XSLT to describe a document in a minimal, fully-semantic DSL, only generating the HTML tag structure as needed for presentation.
The problem is that engineers of data formats have ignored the concept of layers. With network protocols, you make one layer (Ethernet), you add another layer (IP), then another (TCP), then another (HTTP). Each one fits inside the last, but is independent, and you can deal with them separately or together. Each one has a specialty and is used for certain things. The benefits are 1) you don't need "a kitchen sink", 2) you can replace layers as needed for your use-case, 3) you can ship them together or individually.
I don't think anyone designs formats this way, and I doubt any popular formats are designed for this. I'm not that familiar with enterprise/big-data formats so maybe one of them is?
For example: CSV is great, but obviously limited, and not specified all that well. A replacement table data format could be binary (it's 2026, let's stop "escaping quotes", and make room for binary data). Each row can have header metadata to define which columns are contained, so you can skip empty columns. Each cell can be any data format you want (specifically so you can layer!). The header at the beginning of the data format could (optionally) include an index of all the rows, or it could come at the end of the file. And this whole table data format could be wrapped by another format. Due to this design, you can embed it in other formats, you can choose how to define cells (pick a cell-data-format of your choosing to fit your data/type/etc, replace it later without replacing the whole table), you can view it out-of-order, you can stream it, and you can use an index.
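Purely as an illustration of the layering idea (these type and field names are invented here, not an existing format), the container could look something like:

    // A cell is opaque to the table layer: it names its own format and carries bytes.
    // That is the "layering": the table does not interpret cell contents.
    interface Cell {
      format: string;      // e.g. "utf8", "json", "png", or another table
      bytes: Uint8Array;   // raw payload, no escaping needed in a binary container
    }

    // Each row declares which columns it actually contains, so empty columns cost nothing.
    interface Row {
      columns: string[];   // per-row header metadata
      cells: Cell[];       // same length and order as `columns`
    }

    interface Table {
      rows: Row[];
      // Optional byte offsets of each row, stored at the head or the tail of the file,
      // so a reader can stream rows in order or jump straight to one.
      rowIndex?: number[];
    }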
Constant erosion of data formats into the shittiest DSLs in existence is annoying. "Oh, hey, instead of writing Python, how about you write in
* YAML, with magical keywords that turn data into conditions/commands
* template language for the YAML in places when that isn't enough
* ....Python, because you need to eventually write stuff that ingests the above either way
.... ansible is great isn't it?"
... and for some reason others decide "YES THIS IS AWESOME" and we now have a bunch of declarative YAML+template garbage.
> There was a thread here the other day about using Sqlite as an interchange format to REDUCE complexity. Look, I love Sqlite, as an application specific data-store. But much like XML it has a ton of capabilities, which is good for a data-store, but awful for an interchange format with multiple producers/consumers with their own ideas.
It's just a bunch of records put in tables with pretty simple data types. And it's trivial to convert into other formats while being compact and queryable on its own. So as far as formats go, you could do a whole lot worse.
> Whereas XML supports attributes, namespaces, CDATA, DTDs, QNames, xml:base, xml:lang, XInclude, etc etc. They gave it everything, including the kitchen sink.
But you don't have to use all those things. Configure your parser without namespace support, DTD support, etc. I'd much rather have a tool with tons of capabilities that can be selectively disabled rather than a "simple" one that requires _me_ to bolt on said extra capabilities.
I consider CSV to be a signal of an unserious organization. The kind of place that uses thousand-line Excel files with VBA macros instead of just buying a real CRM already. The kind of place that thinks junior developers are cheaper than senior developers. The kind of place where the managers browbeat you into working overtime by arguing from a single personal perspective that "this is just how business is done, son."
People will blithely parrot, "it's a poor workman who blames his tools." But I think the saying, as I've always heard it used to suggest that someone who is complaining is just bad at their job, is a backwards sentiment. Experts in their respective fields do not complain about their tools, but not because they are internalizing failure as their own fault. They don't complain because they insist on only using the best tools and thus have nothing to complain about.
> XML supports attributes, namespaces, CDATA, DTDs, QNames, xml:base, xml:lang, XInclude, etc etc. They gave it everything, including the kitchen sink.
Ah, the old "throw a bag of nouns at the reader and hope he's intimidated" rhetorical flourish. These things are either non-issues (like QName), things a parser does for you, or optional standards adjacent to XML but not essential to it, e.g. XInclude.
Author here. I agree with all this, and I think it's important to note that nothing precludes you from doing a declarative specification that looks like imperative math notation, but it's also somewhat beside the point. Yes, you could make your own custom language, but then you have created the problem that the article is about: You need to port your parser to every single different place you want to use it.
That's to say nothing of all the syntax decisions you have to make now. If you want to do infix math notation, you're going to be making a lot of choices about operator precedence. The article is using a lot of simple functions to explain the domain, but we also have switch statements: how are those going to be expressed? Ditto functions that don't have a common math notation, like stepwise multiply. All of these can be solved, but they also make your parser much more complicated and create a situation where you are likely to only have one implementation of it.
If you try to solve that by standardizing on prefix notation and parentheses, well, now you have s-expressions (an option also discussed in the post).
That's what "cheap" means in this context: There's a library in every environment that can immediately parse it and mature tooling to query the document. Adding new ideas to your XML DSL does not at all increase the complexity of your parsing. That's really helpful on a small team! I agonized over the word "cheap" in the title and considered using something more obviously positive like "cost-effective" but I still think "cheap" is the right one. You're making a cost-cutting choice with the syntax, and that has expressiveness tradeoffs like OP notes, but it's a decision that is absolutely correct in many domains, especially one where you want people to be able to widely (and cheaply) build on the thing you're specifying.
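As a rough illustration of that kind of cheapness, here is a sketch assuming a browser or Deno environment where the standard DOMParser is available; the element names are made up for the example, not taken from the article's DSL:

    // No custom parser: the platform's XML machinery reads the DSL directly.
    const source = `
      <dependencies>
        <fact id="totalTax" />
        <fact id="totalPayments" />
      </dependencies>`;

    const doc = new DOMParser().parseFromString(source, "application/xml");

    // Standard DOM queries keep working on any new element we invent later.
    const ids = Array.from(doc.getElementsByTagName("fact"))
      .map((el) => el.getAttribute("id"));

    console.log(ids); // ["totalTax", "totalPayments"]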
But there are already multiple existing configuration languages that are far more legible and robust than custom languages implemented on top of XML. Take Nickel.
This:
    let
      totalOwed = totalTax - totalPayments,
      totalTax = tentativeTaxNetNonRefundableCredits + totalOtherTaxes,
      totalPayments = totalEstimatedTaxesPaid +
        totalTaxesPaidOnSocialSecurityIncome +
        totalRefundableCredits,
    in
      totalPayments
is easy to read, unlike XML. It's written in a small configuration language that's easy to learn. It's pure and declarative. It handles complex configurations well. It provides tools to quickly pinpoint configuration errors. It can be integrated into existing software and workflows. Compared to bespoke languages built on top of XML, it's an improvement in every way conceivable.
There are also a variety of other languages to choose from. Using a bespoke XML-based language will inflict needless suffering upon people.
You are right that your other examples (like s-expressions) are actually better than going with a fully custom language.
But as you note elsewhere, you were benefiting from the schema (DTD or XSD) being done elsewhere, which provided at least some validation: in my experience, building this layer (either in code or with a new DTD/XSD) without a proper XML schema is the hardest part in doing XML well.
By ignoring this cost, it appeared much cheaper than it really is.
I also think including proper XML parsing libraries (which are sometimes huge) is not always feasible either (think embedded devices, or even if you need to package it with your mobile app, the size will be relatively big).
Your proto-math XML dialect of:
instead of: still has higher level syntax. What does: mean? Is it a syntax error? Or does it subtract imaginary numbers? What about exponential notation? You will have a parser anyway, whether you like it or not. Given that, perhaps "5-3" is the simpler notation after all, even though it requires a specialized (albeit trivial) parser to be carried along with it.
https://resources.jetbrains.com/storage/products/mps/docs/MP...
> XML is notoriously expensive to properly parse in many languages.
I'm glad this is the top comment. I have extensive experience in enterprise-y Java and XML and XML is anything but cheap. In fact, doing anything non-trivial with XML was regularly a memory and CPU bottleneck.
That's if you parse the XML into a DOM and work on that. If you use SAX parsing, it's much better regarding the memory footprint.
But of course, working with SAX parsing is yet another, very different, bag of snakes.
I still wish that JSON parsing had the same support for stream processing as XML (I know that there are existing solutions for that, but it's much less common than in the XML world).
In the context of the article, "cheap" means "easy to set up" not "computationally efficient." The article is making the argument that there are situations in which you benefit from sacrificing the latter in favor of the former. You're right that it's annoyingly slow to parse though and that does cause issues I'd like to fix.
If you want a parser that actually checks the XML spec and various edge cases, then parsing goes from human-readable config to O(n^2) string handling. The funny part is how often people silently accept partial or broken XML in prod because revisiting schema validation years later is a nightmare. If you want cheap parsing, you end up writing a regex or DOM walker and hoping for the best, which raises the question of why not just use JSON or invent a different DSL to start.
Much of XML’s complexity derives from either the desire to be round-trip compatible with any number of existing character and data encodings or the desire to be largely forward-compatible with SGML.
A parser that only had to support a specified “profile” of XML (say, UTF-8 only, no user-defined entities or DTD support generally) could be much simpler and more efficient while still capturing 99% of the value of the language expressed by this post.
That's beside the point of this post. You're welcome to enforce such a profile on your documents, but the point of this post is the ease of throwing the whole ecosystem of out-of-the-box XML tools at it, tools which don't assume any such profile.
(Now ITOT they may have implicit or explicit profiles of their own, e.g. where safe parsing, validation, and XSLT support are concerned, but they have a large overlap.)
I shipped 20MB of XML with a product back in 2014; we loaded it at startup, validated it against the XSD, and the performance for this use case was fine. It was big because we did something kinda like what TFA suggests: I designed a declarative XML "DSL" and then wrote a bunch of "code" in it. We had lots of performance problems in that project, but the XML DSL wasn't the cause of any of them; that part was fine. I think "expensive" can mean a lot of different things. It was cheap in terms of development time and the loading/validation time, even on 20MB of XML, was not a problem. Visual Studio ships a tool that generates C# classes from the XSDs which was handy. I just wrote the XSDs and the framework provided the parsing, validation, node classes, and tree construction. This is as "XML proper" as I think it's possible to get.
I don't believe that .NET's XML serializer uses any of the open source projects mentioned in your post, so maybe we just have especially good XML support in .NET. I think Java has its own XML serializer, too. I bet most XML generated and consumed in the world is not one of those three open source C/C++ libraries. I think Java alone might be responsible for more than half of it.
Your first counterpoint seems unnecessarily picky.
> So while it is a suitable DSL for many things (it is also seeing new life in web components definition), we are mostly only talking about XML-lookalike language, and not XML proper. If you go XML proper, you need to throw "cheap" out the window.
But the TWE did not embrace all that stuff. It’s not required for its purpose. And to call it “xml lookalike” on that basis seems odd. It’s objectively XML. It doesn’t use every xml feature, but it’s still XML.
It’s as if you’re saying, a school bus isn’t a bus, it’s just a bus-lookalike. Buses can have cup holders and school buses lack cup holders. Therefore a school bus is not really a bus. I don’t see the validity or the relevance.
Ignoring that part of schema definition and subsequent validation is exactly why it seems "cheap" on the surface.
So, TWE is not using an XML lookalike language, but someone has done the expensive part before the author joined in.
Unless you are compiling really large systems of DSL specification, speed of parsing is not the operation you want to be optimizing. XML for this use case, even if you DOM it, is plenty fast.
What are more concerning are the issues that result in unbounded parses – but there are several ways to control for this.
This mindset is why we have computers now that are three+ orders of magnitude faster than a C64 yet have worse latency.
FWIW, this is also one of the reasons MathML has never become the "input" language for mathematics, and the layout-focused (La)TeX remains the de-facto standard.
Ergonomics of input are important because they increase chances of it being correct, and you can usually still keep it strict and semantic enough (eg. LaTeX is less layout-focused than Plain TeX)
But there, as with any DSL, you are trading off ease of expression with ease of processing (e.g. interoperability). Every embedded DSL, XML included, chooses some amount of ease of processing.
You don't even need to specify a DSL to make that code declarative. It can be real code that's manipulating expression objects instead of numbers (though not in JavaScript, where there's no operator overloading), with the graph of expression objects being the result.
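A tiny sketch of that approach in TypeScript (illustrative only; with no operator overloading, plain functions stand in for the operators):

    // Instead of computing numbers, the "calculation" builds a graph of expression nodes.
    type Expr =
      | { kind: "input"; name: string }
      | { kind: "add" | "sub"; left: Expr; right: Expr };

    const input = (name: string): Expr => ({ kind: "input", name });
    const add = (left: Expr, right: Expr): Expr => ({ kind: "add", left, right });
    const sub = (left: Expr, right: Expr): Expr => ({ kind: "sub", left, right });

    // Looks like ordinary code, but running it only produces the declarative graph.
    const totalTax = add(input("tentativeTax"), input("otherTaxes"));
    const totalPayments = add(input("estimatedTaxesPaid"), input("refundableCredits"));
    const totalOwed = sub(totalTax, totalPayments);

    // The graph can then be inspected, serialized, or evaluated by any backend.
    console.log(JSON.stringify(totalOwed, null, 2));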
Cheap here is semantically different from cheap in the article. Here it means "how hard it hits the CPU" and in the article is "how hard it is to specify and widely support your DSL".
You also posted a piece of code that the author himself acknowledged is not bad and omitted the one pathological example where implementation details leak when translating to JavaScript.
It just seems like you didn't approach reading the article willing to understand what the author was trying to say, as if you already decided the author is wrong before reading.
Nope, "not cheap" in my comment means expensive to implement: defining the XML schema (which here has been done by someone else), and then using that schema properly, is what makes the use of XML expensive (it is a lot of things to learn for more than one engineer on the team).
While this can give a notation for the domain, you'd still need an engine to process it. Prolog+CLP(FD) perhaps meets it well (not too familiar with the tax domain) and one could perhaps paraphrase Greenspun's tenth rule for this combo too.
> The main property of SGML-derived languages is that they make "list" a first class object, and nesting second class (by requiring "end" tags) ...
I think you're missing the forest for the trees ;)
The major point of SGML in this context is that elements have content models defined by regular expressions, just like any other grammar productions eg. BNF.
Yes let's not even get started on implementations who do <something value="value"></something>
> The main property of SGML-derived languages is that they make "list" a first class object, and nesting second class (by requiring "end" tags),
As opposed to JSON, which famously lacks lists? What does "second class" even mean here? How is having an end-indicator somehow a demotion?
> talking about XML-lookalike language, and not XML proper. If you go XML proper, you need to throw "cheap" out the window.
libxml2 and expat are plenty fast. You can get ~120MB/s out of them and that's nowhere near the limit. Something like pugixml or VTD can do faster once you've detected you're not working with some kind of exotic document with DTD entities.
Or... you could just use a programming language that looks good and has great support for embedded domain-specific languages (eDSL), like Haskell, OCaml or Scala.
Or, y'know, use the language you have (JavaScript) properly, eg. add a `sum` abstraction instead of `.reduce((acc, val) => { return acc+val }, 0)`.
In particular, the problem of "all the calculations are blocked for a single user input" is solved by eg. applicatives or arrows (these are fairly trivial abstract algebraic concepts, but foreign to most programmers), which have syntactic support in the abovementioned languages.
(Of course, avoid the temptation to overcomplicate it with too abstract functional programming concepts.)
If you write an XML DSL:
1. You have to solve the problem of "what parts can I parallelize and evaluate independently" anyway. Except in this case, that problem has been solved a long time ago by functional programming / abstract algebra / category-theoretic concepts.
2. It looks ugly (IMHO).
3. You are inventing an entirely new vocabulary unreadable to fellow programmers.
4. You will very likely run into Greenspun's tenth rule if the domain is non-trivial.
> you could just use a programming language ... like Haskell, OCaml or Scala.
Then you run into the problem of finding developers who are competent in these languages. I'm probably not the smartest guy but I've been a competent programmer for nearly 30 years. Haskell is something that seriously kicked my ass the few times I tried to get into it.
Since Raku supports both OO and functional coding styles, and has built-in Grammars, it is very nice for DSLs.
"Looks good" might be something not everyone agrees on for Lisp, but once you've seen S-expressions, XML looks terrible. Disgustingly verbose and heavyweight.
Yes, "just", mind the context. Are you trying to imply that learning/using an advanced programming language is somehow more complicated than infinite XML slop engineering, which as I said ideally requires knowledge of the same concepts anyway?
Basically, a node is an object with one entry, whose key is the type and whose value is an array. It's a rather S-expressiony approach. if you really don't like using arrays for all the contents, you could always use more normal values at the leaves:
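For example, a sketch of both variants, with made-up node names:

    // Every node is a one-entry object whose key is the type and whose value
    // is an array of children...
    const withArrays = {
      "return": [
        { "filingStatus": ["single"] },
        { "totalOwed": ["400"] },
      ],
    };

    // ...or, if arrays everywhere feel heavy, ordinary values at the leaves.
    const withLeafValues = {
      "return": [
        { "filingStatus": "single" },
        { "totalOwed": 400 },
      ],
    };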
It has the nice property that you're always guaranteed to see the type before any of the contents, even if object keys get reordered, so you can do streaming decoding without having to buffer arbitrary amounts of JSON. Probably not important when parsing a tax code, but can be useful for big datasets.
Agreed. Any language that wants to use the fact graph is going to have to “interpret” the chosen DSL anyways, and JSON is more ubiquitous and far simpler to parse than XML. Also way cheaper in the sense that the article uses it (how many langs can you parse and walk an XML document in off the top of your head? what about JSON?)
To see why JSON is simpler, imagine what the sum total of all code needed to parse and interpret the fact graph without any dependencies would look like.
With XML you’re carrying complex state in hash maps and comparing strings everywhere to match open/close tags. Even more complexity depending on how the DSL uses attributes, child nodes, text content.
With JSON you just need to match open/close [] {} and a few literals. Then you can skim the declarative part right off the top of the resulting AST.
It’s easy to ignore all this complexity since XML libs hide it away, and sure it will get the job done. But like others pointed out, decisions like these pile up and result in latency getting worse despite computers getting exponentially faster.
What I don't like are all the freaking quotes. I look at json and just see noise. Like if you took a screenshot and did a 2d FFT, json would have tons of high frequency content relative to a lot of other formats. I'd sooner go with clojure's EDN.
I was wrong. There is seemingly more high frequency content in the xml. See [1] -- the right side is the xml.
[1] https://orbitalchicken.com/fft_formats.jpg
Using jq etc will go a long way for any routine work.
Aesthetically, I consider such JSON structures degenerate. It's akin to building an ECMAScript app where every class and structure is only allowed to have one member.
If you want tagged data, why not just pick a representation that does that?
Because (imo) the goal should be to minimize overall complexity.
Pulling in XML and all of its additional complexity just to get a (debatably) cleaner way to express tagged unions doesn’t seem like a great tradeoff.
I also don’t buy the degenerate argument. XML is arguably worse here since you have to decide between attributes, child nodes, and text content for every piece of data.
While a great article, I actually found this linked post [0] to be even better, in which the author lays out how so much modern tooling for web dev exists simply because XML lost the browser war.
EDIT: obviously, JSON tooling sprang up because JSON became the lingua franca. I meant that it became necessary to address the shortcomings of JSON, which XML had solved.
0: https://marcosmagueta.com/blog/the-lost-art-of-xml/
I'm not sure what the author means by "(XML) was abandoned because JavaScript won. The browser won."
The browser supported XML as much as Javascript. Remember that the "X" in "AJAX" acronym stands for XML, as well as "XMLHttpRequest" which was originally intended to be used for fetching data on the fly in XML. It was later repurposed to grab JSON data.
Javascript was not a reason XML was abandoned. It was just that the developer community did not like XML at all (after trying to use it for a while).
As for whether the dev community was "right", it's hard to comment because the article you linked is heavy on the ranting but light on the contextual details. For example it admits that simpler formats like JSON might be appropriate where "small data transfers between cooperating services and scenarios where schema validation would be overkill". So are they talking about people storing "documents" and "files" in JSON form? I guess it happens, but is it really as common to use JSON as opposed to other formats like YAML (which is definitely not caused by Javascript in the browser winning)?
Personally I think XML was abandoned because of inherent bad design (and maybe over-engineering). A simpler format with schema checking is probably more ideal IMHO.
XMLHttpRequest got its name due to Microsoft internal politics [0]:
> Meanwhile the IE project was just weeks away from beta 2 which was their last beta before the release. This was the good-old-days when critical features were crammed in just days before a release, but this was still cutting it close. I realized that the MSXML library shipped with IE and I had some good contacts over in the XML team who would probably help out- I got in touch with Jean Paoli who was running that team at the time and we pretty quickly struck a deal to ship the thing as part of the MSXML library. Which is the real explanation of where the name XMLHTTP comes from- the thing is mostly about HTTP and doesn't have any specific tie to XML other than that was the easiest excuse for shipping it so I needed to cram XML into the name (plus- XML was the hot technology at the time and it seemed like some good marketing for the component).
[0] https://web.archive.org/web/20090130092236/http://www.alexho...
Most people never actually used XML within Ajax; usually it was either an HTML fragment or JSON.
I read both, but I feel like they both miss what it was like to work with APIs back in the bad old XML days.
Yes, XML is more descriptive. It's also much harder for programmers to work with. Every client or server speaking an XML-based protocol had to have their own encoder/decoder that could map XML strings into in-memory data structures (dicts, objects, arrays, etc) that made sense in that language. These were often large and non-trivial to maintain. There were magic libraries in languages like Java and C# that let you map XML to objects using a million annotations, but they only supported a subset of XML and if your XML didn't fit that shoe you'd get 95% of the way and then realize that there was no way you'd get the last 5% in, and had to rewrite the whole thing with some awful streaming XML parser like SAX.
JSON, while not perfect, maps neatly onto data structures that nearly every language has: arrays, objects and dictionaries. That is why it got popular, and no other reason. Definitely not "fashion" or something as silly as that. Hundreds of thousands of developers had simply gotten extremely tired of spending 20% of their working lives producing and then parsing XML streams. It was terrible.
And don't even get me started on the endless meetings of people trying to design their XML schemas. Should this here thing be an attribute or a child element? Will we allow mixing different child elements in a list or will we add a level of indirection so the parser can be simpler? Everybody had a different idea about what was the most elegant and none of it mattered. JSON did for API design what Prettier did for the tabs vs spaces debate.
Since you explicitly mentioned fashion, I assume you read this:
> There is a distinction that the industry refuses to acknowledge: developer convenience and correctness are different concerns. They are not opposed, necessarily, but they are not the same thing.
…
The rationalization is remarkable. "JSON is simpler", they say, while maintaining thousands of lines of validation code. "JSON is more readable", they claim, while debugging subtle bugs caused by typos in key names that a schema would have caught immediately. "JSON is lightweight", they insist, while transmitting megabytes of redundant field names that binary XML would have compressed away. This is not engineering. This is fashion masquerading as technical judgment.
I feel the same way about RDBMS. Every single time I have found a data integrity issue - which is nearly daily - the fix that is chosen is yet another validation check. When I propose actually creating a proper relational schema, or leaning on guarantees an RDBMS can provide (such as making columns that shouldn’t be NULL non-NULLable, or using foreign key constraints), I’m told that it would “break the developer mental model.”
Apparently, the desired mental model is “make it as simple as possible, but then slowly add layer upon layer of complex logic to handle all of the bugs.”
The 'much harder for programmers to work with' was that the official way of doing a lot of programming related to XML was to do it in... XML. E.g. transformations were done with XSLT, query processing with XQuery. There were even XML databases that you had to query with XML (typically XQuery).
All these XML DSLs were so dreadful to write and maintain for humans that most people despised them. I worked in a department where semantic web and all this stuff was fairly popular and I still remember one colleague, after another annoying XML programming session, saying fuck this, I'll rip out all the XSLT and XQuery and will just write a Python script (without the swearing, but that was certainly his sentiment). First it felt a bit like an offense for ditching the 'correct' way, but in the end everyone sympathized.
As someone who has lived through the whole XML mania: good riddance (mostly).
> And don't even get me started on the endless meetings of people trying to design their XML schemas.
I have found that this attracts a certain type of people who like to travel to meetings and talk about schemas and ontologies for days. I had to sit through some presentations, and I had no idea what they presented had to do with anything; they were so detached from reality that they built a little world on their own. Sui generis.
It’s the usual case of “I can’t be bothered to learn the complicated thing, give me something simple.” Two years later, “Oh wait, I need more features, this problem is more complicated than I thought”.
As a devil’s advocate, it is extremely difficult to produce something that’s simple to understand, flexible, and not inherently prone to bugs.
I am not a dev; I’m ops that happens to know how to code. As such, I tend to write scripts more than large programs. I’ve been burned enough by bash and Python to know how to tame them (mostly, rigid insistence on linters and tests), but as one of my scripts blossomed into a 15K LOC monstrosity, I could see in real time how various decisions I made earlier became liabilities. Some of these were because I thought I wouldn’t need it, others were because I later learned I might need flexibility, but didn’t have the fundamental knowledge to do it correctly.
For example, I initially was only using boolean return types. “It’s simpler,” I thought - either a function works, or it doesn’t, and it’s up to the caller to decide what to do with that. Soon, of course, I needed to have some kind of state and data manipulation, and I wound up with a hideous mix of side effects and callbacks.
Another: since I was doing a lot of boto3 calls in this script, some of which could kick off lengthy operations, it needed to gracefully handle timeouts, non-fatal exceptions, and mutations that AWS was doing (e.g. Blue/Green on a DB causes an endpoint name swap), while persisting state in a way that was crash-proof while also being able to resume a lengthy series of operations with dependencies, only some of which were idempotent.
I didn’t know enough of design patterns to do all of this elegantly, I just knew when what I had was broken, so I hacked around it endlessly until it worked. It did work (I even had tests), but it was confusing, ugly, and fragile.
The biggest technical learning I took away from that project was how incredibly useful true ADTs are, and how languages that have them can prevent entire classes of bugs from ever happening. I still love Python, but man, is it easy to introduce bugs.
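For what it's worth, here is a minimal sketch of the kind of tagged-union return type that avoids the boolean-only trap (TypeScript here purely for illustration; the function and names are invented):

    // A tagged union forces every caller to handle both cases explicitly.
    type Result<T> =
      | { ok: true; value: T }
      | { ok: false; error: string };

    function findEndpoint(name: string): Result<string> {
      if (name === "primary") return { ok: true, value: "db.example.internal" };
      return { ok: false, error: `no endpoint named ${name}` };
    }

    const result = findEndpoint("primary");
    if (result.ok) {
      console.log(result.value);   // the compiler knows `value` exists only here
    } else {
      console.log(result.error);   // and `error` only here
    }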
S-expressions are a cheap DSL too. I use them in the desktop browser runtime, powered by wasm, that I'm developing, as the "HTML"^1 and the CSS^2. In fact it works so well that I also reused it to do the styling for HTML exports in my markup language designed to fight documentation drift^3.
1. https://gitlab.com/canvasui/canvasui-engine/-/blame/main/exa...
2. https://gitlab.com/canvasui/canvasui-engine/-/blob/main/exam...
3. https://gitlab.com/sablelang/libcuidoc
S-expressions are great. They are trivial to implement parsers for. For a while I used S expression parsing and evaluation as a technical coding screen interview question because it is feasible to implement a functional (pun intended) programming language using S-expressions in the space of an interview.
While not the point of the interview, the best part for me was seeing a candidate’s face light up when they realized they implemented a working programming language.
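To give a feel for why that fits in an interview, here is a throwaway sketch (not the actual interview question) of a tiny S-expression reader and evaluator:

    // A tiny S-expression reader and evaluator, enough for (+ 1 (* 2 3)).
    type SExpr = number | string | SExpr[];

    function tokenize(src: string): string[] {
      return src.replace(/\(/g, " ( ").replace(/\)/g, " ) ").trim().split(/\s+/);
    }

    function parse(tokens: string[]): SExpr {
      const token = tokens.shift();
      if (token === undefined) throw new Error("unexpected end of input");
      if (token === "(") {
        const list: SExpr[] = [];
        while (tokens[0] !== ")") list.push(parse(tokens));
        tokens.shift(); // drop ")"
        return list;
      }
      const n = Number(token);
      return Number.isNaN(n) ? token : n;
    }

    const ops: Record<string, (args: number[]) => number> = {
      "+": (args) => args.reduce((a, b) => a + b, 0),
      "*": (args) => args.reduce((a, b) => a * b, 1),
    };

    function evaluate(expr: SExpr): number {
      if (typeof expr === "number") return expr;
      if (typeof expr === "string") throw new Error(`unbound symbol: ${expr}`);
      const [op, ...args] = expr;
      return ops[op as string](args.map(evaluate));
    }

    console.log(evaluate(parse(tokenize("(+ 1 (* 2 3))")))); // 7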
XML is beloved by tax authorities. The Polish tax authorities really love their e-documents and online filing. Except their XML documents are completely human-unreadable, since the schemas are based on field numbers in paper forms. Even in the brand new National e-Invoicing System, designed from scratch, with no paper forms, most fields have names like ‹P_19N›1‹/P_19N›. You read the XML schema to find out it is a "Marker of lack of delivery of goods or provision of services exempt from tax under Article 43 paragraph 1 of the [VAT] Act, Article 113 paragraphs 1 and 9 of the Act or regulations issued under Article 82 paragraph 3 of the Act or under other provisions" (Google Translated, because of course everything is in Polish). So my invoice is saying "yes [1], I am not [N] exempt from tax under $allThatNonsense [P_19]".
In unrelated news, the main author of the VAT Act is offering tax consulting services, as Registered Tax Advisor #00001.
It's not a DSL. It's a generic lexer and parser. It takes the text and gives you an abstract syntax tree. The actual DSL is your spec, and the syntax you apply.
It's one of many equivalent such parser tools, a particularly verbose one. As such it's best for stuff not written by hand, but it's ok for generated text.
It has some advantages mostly stemming from its ubiquity, so it has a big tool kit. It has a lot of (somewhat redundant) features, making it complex compared to other options, but sometimes one of those features really fits your use case.
It was also about how easy it was to generate great XML.
Because it is complicated and everyone doesn't really agree on how to properly represent an idea or concept, you have to deal with varying output between producers.
I personally love well formed XML, but the std dev is huge.
Things like JSON have a much tighter std dev.
The best XML I've seen is generated by hashdeep/md5deep. That's how XML should be.
Financial institutions are basically run on XML, but we do a tonne of work with them and my god their "XML" makes you pray and weep for a swift end.
Maybe rather: how easy it was to generate rotten XML. I feel you there.
If you tried to represent the data (exactly) from any of the examples in the post, I think you’d find that you’d experience many of the same problems.
Personally, I think the problem with XML has always been the tooling. Slow parsers, incomplete validators
The XML community, though, embraced the problem of different outputs between different producers, and assumed you'd want to enable interoperability in a Web-sized community where strict patterns to XML were infeasible. Hence all the work on namespaces, validation, transformation, search, and the Semantic Web, so that you could still get stuff done even when communities couldn't agree on their output.