I maintain Heimdal[0]'s ASN.1 compiler[1], though I didn't create it. It's a pleasure. It, and the IETF, have taught me a few things:
- there's nothing really wrong with ASN.1 as a syntax except maybe it's ugly
- there's nothing wrong at all with ASN.1's semantics
- there's a TON wrong with the BER family of encoding rules (BER, DER, and CER), and with every tag-length-value scheme
- you can create ASN.1 encoding rules for anything you like, which really means "use ASN.1 as the schema language for whatever encoding I prefer"
- indeed, there's XER (XML encoding rules), JER (JSON encoding rules), GSER (generic string encoding rules) -- all text-based -- and a bunch of binary encodings with at least two that are not tag-length-value (and so resemble NDR and XDR), like PER and OER
- people love to hate ASN.1, mainly because BER/DER/CER deserve the hatred, and for less legitimate reasons too, so they go off and invent new wheels that often have the same problems -- oh well!
In the asn1 readme, and in some comments in these threads, you mention the perils of the tag-length-value scheme, but you never seem to explain what's wrong with it?
At least in file formats, it seems to me they would be instrumental in having an extensible and flexible format, where you can skip unknown or uninteresting chunks (say, PNG chunks, or IFF-based formats like OBJ, etc.).
Do you feel that the same doesn't apply to serialisation formats? How are the non-TLV binaries encoded then? Just implied offsets according to the schema? Can you then evolve the schema at all, or do you feel that both producer and consumer should always have access to the full schema, and flexibility here is a non-feature?
Sorry about the wall of questions, but I'm just so confused.
> In the asn1 readme, and in some comments in these threads, you mention the perils of the tag-length-value scheme, but you never seem to explain what's wrong with it?
Not OP, but one of the challenges is that definite-length encodings like DER have to be encoded in a non-intuitive way. A value must be encoded before its length can be written (because the length isn't known until the value has been encoded), and values can be nested. Therefore you have to encode a message essentially backwards when using definite-length encodings. This can potentially require a great deal of memory and can increase latency because streaming the data is hard.
Indefinite lengths (BER has this option, CER requires it) can help avoid this problem, but then you lose the benefit of skipping elements (which you allude to in your next paragraph).
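To make that back-to-front point concrete, here's a minimal sketch of a definite-length TLV encoder (Python, toy helper names, nothing like a real DER library): every inner value has to be fully encoded before its parent's length octets can be written, so nested structures get built from the inside out.

    def encode_length(n: int) -> bytes:
        # DER definite length: short form below 128, long form otherwise.
        if n < 0x80:
            return bytes([n])
        body = n.to_bytes((n.bit_length() + 7) // 8, "big")
        return bytes([0x80 | len(body)]) + body

    def tlv(tag: int, content: bytes) -> bytes:
        # The length can only be written once the content is fully encoded.
        return bytes([tag]) + encode_length(len(content)) + content

    # SEQUENCE { INTEGER 5, OCTET STRING "hi" } -- innermost values first,
    # outer SEQUENCE header last, i.e. the encoder works back to front.
    inner = tlv(0x02, bytes([0x05])) + tlv(0x04, b"hi")
    message = tlv(0x30, inner)
    print(message.hex(" "))  # 30 07 02 01 05 04 02 68 69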
> Do you feel that the same doesn't apply to serialisation formats? How are the non-TLV binaries encoded then? Just implied offsets according to the schema? Can you then evolve the schema at all, or do you feel that both producer and consumer should always have access to the full schema, and flexibility here is a non-feature?
You've hit the tradeoffs pretty well in the question, I think. The nice thing about TLV is that you can decode without a schema and potentially work with the contents: it's a relatively simple format to decode and validate even if it's not necessarily great for the encoder.
ASN.1 supports schema-informed packed encodings that place greater demands on both the encoder and decoder. The main advantage is that they greatly reduce message overhead, but it requires a lot of bit-twiddling for presence/absence, default values, and, in unaligned variants, everything else, too. It's impossible, generally, to decode everything without the schema. PER has rules that disambiguate the values (e.g., they have to be ordered in a particular way, so you know what's coming next), and this mitigates some of the problems of TLV-style encodings.
The tradeoffs are worth it when your pipes are small. 3GPP and LTE messages are largely encoded in PER. The people playing in that world usually have plenty of money to spend on commercial solutions and have bandwidth to roll their own, too. That's a bit different than smaller shops who are looking for convenient automated serialization formats.
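To give a feel for what "schema-informed packed encoding" means in practice, here's a toy sketch in the spirit of unaligned PER (invented schema, not a conforming encoder): the schema says a SEQUENCE has two OPTIONAL fields and integers constrained to 0..7, so the encoder emits a two-bit presence preamble and then three bits per integer -- no tags, no lengths, and nothing a decoder can make sense of without the schema.

    class BitWriter:
        def __init__(self):
            self.bits = []

        def put(self, value: int, width: int):
            # Append `value` as `width` bits, most significant bit first.
            for i in reversed(range(width)):
                self.bits.append((value >> i) & 1)

        def to_bytes(self) -> bytes:
            bits = self.bits + [0] * (-len(self.bits) % 8)  # pad the final octet
            return bytes(int("".join(map(str, bits[i:i + 8])), 2)
                         for i in range(0, len(bits), 8))

    # Toy schema: SEQUENCE { a INTEGER (0..7), b INTEGER (0..7) OPTIONAL,
    #                        c BOOLEAN OPTIONAL }
    def encode(a, b=None, c=None) -> bytes:
        w = BitWriter()
        w.put(int(b is not None), 1)  # presence bits for the two OPTIONALs
        w.put(int(c is not None), 1)
        w.put(a, 3)                   # 0..7 fits in 3 bits: no tag, no length
        if b is not None:
            w.put(b, 3)
        if c is not None:
            w.put(int(c), 1)
        return w.to_bytes()

    print(encode(5).hex())  # '28': presence bits 00, then 101, padded to one octet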
I see lots of questions about TLV scheme problems. I should have listed them last night, indeed.
First, some generic problems with TLV encodings:
- they necessarily result in unnecessarily redundant encodings -- this is wasteful, bloat
- that redundancy is of zero help to a compiler
- that redundancy is a psychological crutch to any programmer writing hand-coded codecs, but this often has led to serious bugs
- tag allocation has to be managed, and here again you really want a compiler to do it for you -- ASN.1 eventually added AUTOMATIC tags, but the damage of not having had those was done
Next some problems specific to DER-like definite-length TLV encoding rules:
- streaming encoding is infeasible -- you have to know the definite lengths before you start encoding, so you lose
- you either have to compute the length of the encoding of any value before you begin encoding it, or you have to encode "back to front" (and then possibly realloc as needed) or both
There's more, but I'm not too familiar with the issues around CER-like indefinite-length encodings.
Bottom-line: TLV is an unnecessary crutch. Compilers simply don't need it. For proof by existence, consider that Sun's rpcgen(1) existed in 1986, a mere two years after ASN.1's 1984 standard, and rpcgen(1) uses XDR syntax and encoding -- XDR is NOT a TLV encoding at all. But ASN.1 tooling (proprietary and open source) took much longer to catch up with XDR and IDL/NDR and other things. It's almost like TLV encodings made it harder to get to compilation because they were a crutch for hand-coding codecs. But even XDR is easy to hand-write codecs for!
BTW, XDR and NDR were basically the first flatbuffers-like encodings. Lustre RPC has an even more flatbuffers-like encoding, but it's hand-coded. There's just nothing new in this space, and there hasn't really been anything new in this space in many years.
> At least in file formats, it seems to me they would be instrumental in having an extensible and flexible format, where you can skip unknown or uninteresting chunks (say, PNG chunks, or IFF-based formats like OBJ, etc.).
TLV is NOT necessary for this sort of extensibility. You naturally end up with something like TLV when using non-TLV encodings with support for extensibility, though it's often more like LTV. Let's say you have a struct you want to make extensible in some non-TLV encoding you're designing... What would you do? Well, knowing ASN.1's PER/OER and knowing how we've dealt with this in XDR I would do this: add an octet string field to the end of every extensible struct! What would that octet string contain? The encoding of the extensions. What if you want to support different kinds of extensions in a mix-and-match way? Well, that's easy too: add a discriminated union or "typed hole" to the end of every extensible struct, with every choice taken having a Length prepended to it so you can skip it.
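Here's what that trailing-octet-string idea looks like in a sketch (Python, invented field names, no particular encoding rules): the base fields are positional, and everything a v1 decoder doesn't understand rides in one length-prefixed blob at the end, so old code can skip what newer code added.

    import struct

    def encode_v2(x: int, y: int, extensions: bytes) -> bytes:
        # Base fields are fixed-position (no tags); future stuff lives in one
        # length-prefixed octet string at the end of the struct.
        return struct.pack(">II", x, y) + struct.pack(">I", len(extensions)) + extensions

    def decode_v1(buf: bytes):
        # A v1 decoder only knows x and y; it reads the extension length and
        # skips the blob without having to understand it.
        x, y, ext_len = struct.unpack_from(">III", buf, 0)
        return x, y  # bytes 12 .. 12+ext_len are ignored, not rejected

    msg = encode_v2(1, 2, b"\xde\xad\xbe\xef")  # v2 producer
    print(decode_v1(msg))                       # (1, 2) -- v1 consumer still works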
Extensibility is something that has been beat to death in the ASN.1 space, and it has all of these options:
- extensibility markers in SEQUENCE / SET types (i.e., "struct" types)
- extensibility markers in CHOICE types (i.e., discriminated union types)
- extensibility markers in INTEGER and BIT STRING constraints (i.e., enum types)
- rules for handling known and unknown extensions in each ER (encoding rules)
- typed holes.
A typed hole is just a glorified discriminated union with an "external" sort of discriminant and specification of the union arms' types. Basically, a typed hole is just a struct with two fields: a) a type identifier of some sort (an integer, a string, an OID, a relative OID, whatever), b) an octet string containing an encoding of the value of a type identified by (a).
ASN.1 has syntax and semantics for expressing what type IDs go with what types, and so you can actually have compilers that recursively and automatically decode/encode through typed holes.
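A typed hole really is that small. A hedged sketch (invented type identifiers and registry, not any particular standard's hole) of both the two-field struct and the kind of recursive decoding a compiler can do when it knows which type goes with which identifier:

    import json

    # Hypothetical registry mapping type identifiers to decoders -- this is the
    # role an ASN.1 information object set plays for a typed hole.
    DECODERS = {
        "example:text": lambda blob: blob.decode("utf-8"),
        "example:json": lambda blob: json.loads(blob),
    }

    def decode_hole(type_id: str, blob: bytes):
        # A typed hole is just (a) a type identifier and (b) an octet string
        # containing the encoding of a value of the identified type.
        decoder = DECODERS.get(type_id)
        if decoder is None:
            return ("unknown", type_id, blob)  # carry it opaquely, don't fail
        return decoder(blob)

    print(decode_hole("example:json", b'{"n": 1}'))     # {'n': 1}
    print(decode_hole("example:mystery", b"\x00\x01"))  # unknown, kept as raw bytes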
> Do you feel that the same doesn't apply to serialisation formats? How are the non-TLV binaries encoded then? Just implied offsets according to the schema? Can you then evolve the schema at all, or do you feel that both producer and consumer should always have access to the full schema, and flexibility here is a non-feature?
I address this above. This is all addressed in ASN.1 (and also XML because of XMLNS). Many very smart people who came before you and me saw to it that ASN.1 addressed all these issues definitively long ago.
Maybe you can answer a question I've had about ASN.1. Long time ago, Marshall Rose had harsh things to say about the ASN.1 macro facility like "buried semantics"[1]. Do you know what he meant?
[1]: https://www-sop.inria.fr/rodeo/mavros/intro-mav.html (search for "Rose")
> Maybe you can answer a question I've had about ASN.1. Long time ago, Marshall Rose had harsh things to say about the ASN.1 macro facility like "buried semantics"[1]. Do you know what he meant?
My guess is that his complaint is that MACRO semantics are not well defined and are challenging to parse with conventional compilers. I've always wondered if they were inspired in some part by LISP, since you could in principle translate them fairly readily. ROSE and SNMP are still relatively commonly-used specifications that embed macro definitions, and most of the work I've seen done with them involves actually hard-coding the output (rather than actually parsing the MACROs).
> - there's a TON wrong with the BER family of encoding rules (BER, DER, and CER), and with every tag-length-value scheme
I would like to hear more about what's wrong with tag-length-value schemes. And can these be corrected, or would you advocate for alternatives? Which alternatives?
Can the veterans of the 90s SSL Wars explain the issues with ASN1/DER/BER? Looking it up today, it seems like a pretty smart and extensive serialization system, and I have to wonder why new systems like Google Protobufs chose to reinvent the wheel.
Conversely, how have modern systems avoided the pitfalls (if any) of ASN1/DER/BER?
I know of at least one problem with ASN.1. The string encodings other than UTF-8 are terrible. Most of the string encodings are very limited and weird subsets of ASCII that nobody actually uses anymore. ASN.1 itself doesn't define the encodings and just refers to other standards.
The problem with this is probably most notable with the T.61 encoding, which changed over the years; since ASN.1 references other standards, nobody is quite sure exactly what you have to support to make T.61 actually work right.
Within X.509 certificates, though, nobody bothers to actually implement T.61; they just use the T.61 flag for ISO-8859-1.
Basically ASN.1 wasn't well defined and it only works well when people agreed to only use certain features or to interpret things in a particular way when ambiguous.
It's also notoriously difficult to parse well. It's very easy to have bugs in your parser, even if you're implementing a subset of it that's needed for X.509. Especially if you're doing so in a non-memory safe language.
I can't speak for why Google invented Protobufs, but I can't imagine anyone sane picking up ASN.1 for anything modern and deciding that this is what they want to use.
For the string encoding thing, however, it does have UTF-8 and you should not use anything else to express arbitrary human text anyway.
PKIX actually leverages the weird encoding restriction to our benefit. It defines two kinds of names which things might have on the Internet (you can and should stop trying to name things which are actually on the Internet some other way), DnsNames and IpAddresses. IpAddresses, since they're either 32-bit or 128-bit arbitrary bit values, are just represented as either 32-bit or 128-bit arbitrary bit values. So you cannot express the erroneous IPv4 address 100.200.300.400 as an IpAddress, which means you can't trip up somebody's parser with that nonsense address. DnsNames use a deliberately sub-ASCII encoding from ASN.1 which can express all the legal DNS names (all A-labels and the ASCII dot . are permissible) but can't express lots of other goofy things including most Unicode. So a certificate issuer, even if they're completely incompetent, cannot write a valid DnsName that expresses some garbage IDN as Unicode. Hopefully they read the documentation and find out they need to use A-labels (Punycode) but if not they're prevented from emitting some ambiguous gibberish.
Even in forums where you'd once have expected pushback, "Just use UTF-8" is becoming more widespread. Microsoft for example, once upon a time you'd get at least some token resistance, today they're likely to agree "Just use UTF-8". So ASN.1 ends up no worse off for a half a dozen bad ways to write text you shouldn't use, compared to say XML, HTML, and so on.
A couple of years ago I ran into the same confusion of the "TeletexString"/"T61String" data type in ASN.1. After going down the rabbit hole of what is T.61 and trying to map it to Unicode, I reread the ASN.1 (X.690) spec and realized that the authors never actually referenced T.61. Ever since the first edition of ASN.1 in 1988, those strings have not used T.61. They use a character set that is easily mapped to Unicode - https://www.itscj-ipsj.jp/ir/102.pdf, a subset of US ASCII.
Not to say the rest of the spec is notably better. If fully implemented, it requires supporting escape codes in strings to change character sets. I've never seen valid escape codes in real world data, but it probably exists.
As the original article shows, ASN.1 has lots of other challenges and complexity. Trying to write a code generator that supports all the complexity is no trivial task and the only open source one I've seen only generates C code. Protobuf has the advantage of having modern language support (including multiple type safe and memory safe languages).
> Basically ASN.1 wasn't well defined and it only works well when people agreed to only use certain features or to interpret things in a particular way when ambiguous.
ASN.1 has always been as-well- or better-defined than its competition. The ITU-T specs for it are a thing of beauty not often equaled outside the ITU-T.
That said, for a long time the ASN.1 specs were non-free, and that hurt a lot. Also, the BER family of encoding rules stunted development of open source tooling for ASN.1.
ASN.1 really demands code generation. Unfortunately lots of nonconforming stuff has to be dealt with. The concept of encoding rules and the module tagging scheme make for a pretty big number of possible representations.
The language semantics of ASN.1 don't really map to anything well, particularly around default fields and structures that can vary.
Newer systems don't have encoding rules and pick a semantics that matches a target language much more closely.
Nope, nyet, bzzt. Proofs by counter-example:
- OpenLDAP has a printf/scanf-like approach to BER encoding
- Heimdal has an ASN.1 compiler that generates code, yes, but also alternatively generates bytecode that gets interpreted at run-time.
> The language semantics of ASN.1 don't really map to anything well, particularly around default fields and structures that can vary.
You are ill-informed. Proof by counter-example:
- there are ASN.1 encoding rules that produce natural XML (XER) and JSON (JER)
- "default fields" are supported (the relevant keyword is `DEFAULT`, naturally)
- "structures that can vary" -- if you mean unions, it's got that (the relevant keyword is `CHOICE`), and if you mean "extensions", it's got extensibility markers (that effectively are alike a CHOICE of an octet string of unknown stuff, or else the extensions known at module compile time.
On this specific point: isn't this also the case for other high-performance serialisers? Google ProtoBufs, Apache Thrift, any protocol through Rust's SerDes...
There is NO problem with ASN.1 itself except a bit of ugliness. There are SERIOUS problems with DER/BER/CER and with all tag-length-value schemes -- this includes protobufs!
ASN.1 is just syntax and semantics. There are encoding rules that produce textual representations (GSER), XML (XER), JSON (JER), there's XDR-style encoding rules (PER and OER, but with 1-octet units instead of 4-octet units, plus efficient representation of optional fields).
In fact, you can make ASN.1 encoding rules that are based on NDR and XDR and which work for all of IDL and XDR and that subset of ASN.1 that is covered by the semantics of IDL and XDR, and you can extend that to cover all of ASN.1 if you want.
I should know these things, as I maintain an ASN.1 compiler and I intend to eventually teach it to do XDR and NDR.
Really, there's nothing about data schemas that you can express in JSON, CBOR, IDL, XDR, S-expressions, or any schema language you want, that you can't express in ASN.1, or, if there is, it's got to be a pretty niche feature and easily added to ASN.1 anyways. Even functions (RPCs) can be expressed in ASN.1 with some conventions, and routinely are, because it's really just a request/response protocol.
But every year someone invents a new thing because of how stupid, tired, and old ASN.1 is (or, rather, how they perceive it to be). Or because of how complex ASN.1 is and how there's a paucity of tools, so then they reinvent the wheel (often badly) -- a wheel for which there is instantly a paucity of tools.
Personally, I think that people just like to reinvent things. I don't want to sound shitty (or have kentonv show up again to scold me for it) but I get the feeling that, a lot of the time, it's just that simple.
To me that is a specious argument. It's like asking why Python was invented when Cobol could suffice.
The dozens of ASN.1 specs are absolutely hideous and entrenched in obsolete telecom jargon. If the sole goal of Protobuf was to avoid having Google engineers be required to refer to the dozens of ASN.1 specs when disagreements or confusions arose, then it would have been 100% worth it for just that reason.
ASN.1 was too broad. There is immense value in a more constrained specification that does not include so many hazardous serialization types and antiquated string formats.
Now, should Protobufs or Thrift simply have been constrained versions of ASN.1? I think there is a view of software engineering where this would have been an ideal outcome, but almost universally when we see too-big standards, they are declared "dangerous" and avoided like the plague before they are downscoped.
ASN.1 in 1984 was not too broad. It was too simple, and it was too targeted to tag-length-value encoding rules (which are stupid -- TLV is a crutch that is only maybe useful when you lack a compiler, which early on was the case).
ASN.1 today is as broad as it needed to evolve to be because its users needed it.
ASN.1 is extremely complicated and hard to implement correctly. All ASN.1 implementations I've seen are either specialized (know how to work only with a very specific message), or slow, buggy and expose equally complicated APIs. Modern systems like protobufs tend to use much simpler encodings & specs which are easier to understand and implement correctly.
I spent a few years during the late 90s/early 2000s in an industry running on ASN.1, coming from the web. I was initially surprised by how enamoured most of my coworkers were with ASN.1 and its tools, but it grew on me too: the pleasure of interacting only with a protocol specification regardless of the implementation language or intricacies of the remote party, the guarantee that there could be no invalid messages received or emitted, and the automatic generation of tests and tools eventually balanced out the inconvenience of not being able to readily read data on the wire (this was before every human-readable protocol got encrypted) and of not being able to start coding upfront.
It was like going from runtime type checking to static type checking: initially inconvenient, but paying dividends after a short while.
So why did this tech disappear if it was ultimately better than the later alternatives (textual protocols, schema-less serializers, and eventually protobuf, which reinstated some form of efficient encoding and type checking)?
As it uncannily frequently occurs with technological evolution, the reason is probably not to be found within its technical issues (which basically all boil down to: designed by committee).
ASN.1 was just a bit too inconvenient, the free tools to generate code were just not quite good and robust enough, and the approach of designing your types and protocols and putting your code-production tool-chain in place before being able to ship anything was at odds with the mood of the day, which was to let the cheap junior dev fire up his code editor during the coffee break of the first design planning meeting and build the first half-baked prototype, which would already have been sold to the customer by the time he hits :wq. To move fast and break things, ASN.1 got in the way.
So did formal specifications in general, code-analysis tools, even basic type checking -- all of them thrown out the window during the same period for their extra weight, extra time-to-market and extra cost of hiring. Text protocols outcompeting saner alternatives because they are initially simpler (SIP vs H.323, anyone?), schema-less data formats predominating almost entirely because you can start hacking quicker, etc., are all attributable to that cultural rather than technical trend, I believe.
Now it seems the industry is slowly recovering from these excesses. Maybe because of the damage that has been done, but more likely because of the end of cheap hardware progress, encryption everywhere, and massive data volumes (that's what made Google come up with better protocols than HTTP and better formats than human-readable text, after all).
I owned the Microsoft ASN1 library for a while around 2005. It was a maintenance nightmare and I spent a lot of time fixing static analysis derived issues.
That said, I always found the standard quite interesting, with different encodings based on the degree of prior shared info or format. My assumption is that not-invented-here is part of why it's not used.
I used the Netscape/Mozilla NSS library quite a bit, and one problem I found with it, is that all of the DER encoding/decoding was written by hand. They should have generated all that boilerplate from the ASN.1 modules written in the specs (later, RFC 2459, but at the time, a hodge-podge of scattered specs).
Hand-coding works okay when the data is what you expect. But when you throw malformed certificates at it, you have to catch all the edge cases. Having generated code would have enabled many more edge cases to be covered.
Those libraries were originally written in the early/mid 90s. Don’t recall much in the way of code generation tools that would take those specs and generate the code at the time.
Spent a bunch of time working with and adding to those libraries.
No veteran of the 90s SSL wars, but once upon a time I was tasked with fixing security bugs in a custom protocol backend server which used ASN.1 for purposes for which one would probably use protobuf nowadays.
The quality of existing open source libraries to parse ASN.1 leaves a lot to be desired.
I have worked for a time with credit card terminal applications.
We used BER-TLV throughout the system extensively, where it was needed as well as where it wasn't.
I have implemented complete parsers/serializers, data structures using TLV, transactional database where data was stored as TLV documents. EMV is built on top of BER-TLV, SSL used it, as well as ISO-8583 messages transmitted data encoded with BER-TLV. Communication with the PIN Pad was built on it. We kept configuration as BER-TLV documents.
I have been able to parse hex representation in my head.
I really liked the standard. It is nice, flexible and very efficient. Easy to parse, can be parsed reliably and safely in statically allocated memory.
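For anyone who hasn't looked at BER-TLV, the parsing loop really is small. A minimal sketch (flat TLVs only: multi-byte tags and definite lengths, no recursion into constructed values, so nowhere near a full EMV or BER parser):

    def iter_tlv(buf: bytes):
        # Yield (tag, value) pairs from a flat BER-TLV byte string.
        i = 0
        while i < len(buf):
            # Tag: one byte, or more if the low five bits of the first byte are 11111.
            tag_start = i
            i += 1
            if buf[tag_start] & 0x1F == 0x1F:
                while buf[i] & 0x80:  # continuation bytes have the high bit set
                    i += 1
                i += 1                # final tag byte (high bit clear)
            tag = buf[tag_start:i]
            # Length: short form (< 0x80), or long form 0x8N followed by N octets.
            length = buf[i]
            i += 1
            if length & 0x80:
                n = length & 0x7F
                length = int.from_bytes(buf[i:i + n], "big")
                i += n
            yield tag, buf[i:i + length]
            i += length

    # EMV-style sample, hand-built: tag 9F02 (amount) then tag 5A (PAN), dummy values.
    sample = bytes.fromhex("9f0206000000001000" "5a0412345678")
    for tag, value in iter_tlv(sample):
        print(tag.hex(), value.hex())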
To those who think this is ancient history and it should be dropped -- do you think that might just be because you don't actually know it or maybe you just think it is old and so it must be bad?
Where EMV uses tags more like classes than types, I’m not really sure it actually counts as “abstract” syntax notation any more?
Because all tags are these custom things, some don’t strictly parse out to unique type codes too. So a non-EMV parser will have a few tags that map to the same integer code and cause some fun bugs.
That project was when I really understood deep-down why JSON won in the end!
Why are we even talking about ASN.1/DER/BER? We should, like the ancient Egyptian priests who opposed Akhenaten, chisel its name from every public edifice, referring to it not as "ASN.1, the platform-independent abstract type system," but as "the great heresy, which shall not be named."
> You might have heard of similar such abstract syntax notations used for interface definitions such as Google Protocol Buffers, or Facebook’s Apache Thrift, but those languages have not been managed by a standardization organization, so the owning corporations could (in theory) make breaking changes or change the license or even remove the language definitions overnight.
Is this really the main difference between ASN.1 and Google protobufs, that one is managed by a private corporation and the other by a standardization organization? Can they otherwise be used "interchangably" in designing interfaces, a la two different programming languages (with different syntax of course)?
ASN.1 struggles because the word "ASN.1" can name a lot of different implementations with different nuances, and a "complete" ASN.1 implementation is a massive and hazardous undertaking which has left many with a sour taste. Meanwhile, ProtoBufs and Thrift work off of more constrained and well-versioned interfaces.
Honestly, ASN.1 with semantic versioning at the protocol level would probably have been as robust and useful as Protobufs. If ASN.1 had been forked into "ASN.1 3.0 without 10 hazardous and awful 1980s text encodings," it could even be fairly palatable today. Whether the overly expansive nature of ASN.1 is a product of the committee / standards organization design or the timeframe in which it originated is certainly an interesting philosophical question.
> Meanwhile, ProtoBufs and Thrift work off of more constrained and well-versioned interfaces.
Not so. Protocol buffers is just a TLV encoding, which is bad (see elsewhere in this thread) -- it's just a cut-down ASN.1 and variation on BER, so what.
ASN.1 can "well-version" everything just as well as anything else.
In terms of tooling, there’s excellent tooling for ASN.1 for C and C++ and maybe some other languages. There’s excellent tooling for protobufs for a handful of languages too, but they’re different sets, so in practice what languages you want to use would likely come into play.
How excellent the ASN.1 tooling is depends on which subset of ASN.1 you're using. Some of the tooling supports one iteration of ASN.1 or the other -- so much so that the IETF had to write a document on how to deal with this, since some of the standards use the older ASN.1 and some use the newer ASN.1:
https://tools.ietf.org/id/draft-ietf-pkix-asn1-translation-0...
Interoperability with ASN.1 is very fragile at best.
> In terms of tooling, there’s excellent tooling for ASN.1 for C and C++ and maybe some other languages. There’s excellent tooling for protobufs for a handful of languages too, but they’re different sets, so in practice what languages you want to use would likely come into play.
In my experience, tooling is actually very good for most commonly-used languages, including C/C++, C#, Java, Python, and maybe even Go. And, of course, erlang. The real challenge is, I think, that you cannot find good free tooling, and the barrier to entry for Joe Developer is fairly high (in the thousands of dollars).
> Is this really the main difference between ASN.1 and Google protobufs, that one is managed by a private corporation and the other by a standardization organization? Can they otherwise be used "interchangably" in designing interfaces, a la two different programming languages (with different syntax of course)?
No, the two are not interoperable and probably won't be made that way. Protobuf has undergone changes that challenge its backwards-compatibility (e.g., with item presence). ASN.1 supports multiple encoding rules, and while it's possible that someone could map ASN.1 syntax to protobuf encodings, it would only support a subset of ASN.1 because protobuf doesn't support length or value constraints (among other ASN.1 features).
ASN.1 does have a little-used standard called Encoding Control Notation[0] that in principle supports the construction of novel encodings. But I have never seen a compiler, commercial or otherwise, that supports it. It requires a certain expressiveness in your parser that's hard to do right, although I've wondered if LISP or Racket could take it on.
[0]: https://www.itu.int/rec/T-REC-X.692-202102-I
Protocol buffers is a tag-length-value encoding. It's got all the problems that DER and CER have. It's what happens when people decide to reinvent a wheel they don't understand.
What's so great about ASN.1 and its encoding rules is that anyone writing type-length-value serialization for networking purposes, for example[1], is basically independently reinventing ASN.1 because it's so fundamentally optimal.
It truly will make you wonder why Protobufs and others exist.
[1]: https://github.com/Planimeter/grid-sdk/blob/master/engine/sh...
> What's so great about ASN.1 and its encoding rules is that anyone writing type-length-value serialization for networking purposes, for example[1], is basically independently reinventing ASN.1 because it's so fundamentally optimal.
The challenge arises if you have very large values: by nature, TLVs require that the V be encoded before you can plug in the L. If you use definite-length encodings (as required by DER), you may end up having to hold and encode a pretty large piece of data in memory. You can work around this, of course, but it can be a challenge.
Tags in ASN.1 as noted in another comment can also be pretty complicated: there are four tagging classes, and tags can be applied implicitly, explicitly, or automatically depending on the specification. This can make life a bit uncomfortable at times.
On the balance, I can understand why people find ASN.1 such a pain, especially if you're not inclined to fork over money to have someone else deal with the encodings. For medium- to large-sized companies, though, it's probably not a bad deal: get a support contract from one of the commercial vendors, get training, and save yourself six man-months on writing pretty bullet-proof serialization code without the headache of worrying about standards incompatibilities. If you happen to work in telecommunications or security, you're going to deal with ASN.1 at some point anyway, so having something that can talk to multiple parts of your stack can be helpful, too.
That there's four tag classes is not really a complexity. That there's IMPLICIT and EXPLICIT tagging is.
Using IMPLICIT tagging yields encodings that dumpasn1(1)-like tools can't really give you much insight into.
Using EXPLICIT tagging yields bloat.
The answer is to use non-TLV encodings where possible and to use tools that can refer to the schema ("modules") to decode and pretty-print arbitrary things. dumpasn1(1) is just too simple.
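The EXPLICIT-tagging bloat is easy to see at the byte level. Assuming a field defined as `[0] INTEGER` with value 5 (standard DER, bytes written out by hand): IMPLICIT replaces the INTEGER's own tag, while EXPLICIT wraps the complete INTEGER TLV in an outer constructed TLV, costing an extra tag byte and length byte for every explicitly tagged field.

    # [0] IMPLICIT: context-specific tag 0 replaces INTEGER's universal tag.
    implicit = bytes([0x80, 0x01, 0x05])

    # [0] EXPLICIT: constructed context tag 0 wraps the whole INTEGER TLV (02 01 05).
    explicit = bytes([0xA0, 0x03, 0x02, 0x01, 0x05])

    print(implicit.hex(" "), "->", len(implicit), "bytes")  # 80 01 05 -> 3 bytes
    print(explicit.hex(" "), "->", len(explicit), "bytes")  # a0 03 02 01 05 -> 5 bytes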
Back when I was in school in 2004, I had a teacher who had worked on the ASN.1 spec.
In 2004, XML was all the rage. People would create "XML startups", and Microsoft did SOAP and some other guys XHTML, and XML schemas, semantic web and so on.
I remember that teacher being so upset that XML got big and ASN.1 disappeared. It was very awkward. Poor guy...
a) ASN.1 got XML Encoding Rules (XER), so you can use XML w/ ASN.1 as the schema language, which really, mostly is about supporting existing ASN.1-based protocols but with XML because well, you know, XML was all the rage,
and
b), FastInfoSet happened, which is an ASN.1 PER-based "compression" of XML because well, you know, XML is too verbose and unwieldy.
I [bleep] you not, that happened.
Evidence that there's nothing wrong with ASN.1 the syntax (and that's all it is, syntax and semantics, with a side of pluggable encoding rules where you can make them all up the way you want). Everything that's wrong with ASN.1 is either that which is wrong with BER/DER/CER (plenty), or that which is wrong with people's perception of ASN.1 (also plenty).
> My guess is that his complaint is that MACRO semantics are not well defined and are challenging to parse with conventional compilers.
You don't need ASN.1 MACROs for anything in Internet protocols, and you can do without them more generally anyway.
> The problem with this is probably most notable with the T.61 encoding, which changed over the years; since ASN.1 references other standards, nobody is quite sure exactly what you have to support to make T.61 actually work right.
There are a bunch of gory details around this mess in this (now quite old) write-up here: https://www.cs.auckland.ac.nz/~pgut001/pubs/x509guide.txt
Since that write up I believe UTF-8 is pretty much the expectation for character encoding for X.509.
I documented some of the quirks around 6 years ago when I took an existing X.509 parser and improved it for use in certificate trust management in Subversion: http://svn.apache.org/viewvc/subversion/trunk/subversion/lib...
Part of my curiosity stems from Apple using it as part of their bootable file-format: https://www.theiphonewiki.com/wiki/IMG4_File_Format
But as you say, I have to assume they're using it in a very constrained way.
> The string encodings other than UTF-8 are terrible.
Well, yes, because ASN.1 predates Unicode.
https://news.ycombinator.com/item?id=20725550
The 90s were rough on text encoding, but it seems pretty settled now.
There’s an “XER” if you want a human-readable XML encoding, too.
BER/DER/CER is binary S-expressions.
ASN.1 versioning in particular is a work of art.
You can write more about these problems and it would have higher visibility.
I wonder if your teacher eventually understood why XML was preferred over ASN1. Seems to me like it was easier to pick up, and harder to mess up.