Readit News
jimmytucson · 2 years ago
> author decided it was a good idea to prepend the message with the message length encoded as a varint.

> WHY? Oh, why?!

Uh oh. Is this my HN moment?

This is exactly how I implemented it at my company. We had to write many protobuf messages to one file in bulk (in parallel). I did a fair amount of research before designing this and didn’t find any standard for separating protobuf messages (in fact, I found that there explicitly isn’t one, in that protobuf doesn’t care). So I thought rather than using some “special” control character, like a null byte, which would inevitably turn out to be not-so-special and collide with somebody else’s (like Schema Registry’s “magic byte”), I’d use something meaningful: the length in bytes of the record that follows.

As for why I chose varint instead of just picking an integer size, well for one I got nerd-sniped by varint encoding and thought it would be cool to try and implement it in Scala. Secondly, I thought if I chose a fixed size integer, no matter what size I pick, my users will always surprise me and exceed it at least once, and when that happens, kaboom! I wanted to future proof this without wasting 64 goddamn bytes in front of each message, and also I got nerd-sniped, OK?!?
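For reference, the varint scheme being nerd-sniped over here is small enough to sketch. This is a hypothetical Python version of protobuf's LEB128-style encoding, not the Scala implementation mentioned above:

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative int, 7 payload bits per byte; the high bit
    of each byte says whether another byte follows."""
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            out.append(b | 0x80)  # continuation bit set: more bytes follow
        else:
            out.append(b)
            return bytes(out)

def decode_varint(data: bytes, pos: int = 0) -> tuple[int, int]:
    """Decode a varint starting at pos; return (value, next position)."""
    result = shift = 0
    while True:
        b = data[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not (b & 0x80):  # continuation bit clear: this was the last byte
            return result, pos
        shift += 7
```

Small lengths cost a single byte, and the encoding never runs out of room, which is exactly the future-proofing argument made above.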

Someone on my team recently shared one of these files outside the company and so I really hope she’s not talking about me but that’s a crazy coincidence if not!

paulddraper · 2 years ago
Congrats :) you perfectly re-created the actual Protobuf stream format. [1]

https://protobuf.dev/reference/java/api-docs/com/google/prot...

NavinF · 2 years ago
> didn’t find any standard for separating protobuf messages

The fact that protobufs are not self-delimiting is an endless source of frustration, but I know of 2 standards for doing this:

- SerializeDelimited* is part of the protobuf library: https://github.com/protocolbuffers/protobuf/blob/main/src/go...

- Riegeli is "a file format for storing a sequence of string records, typically serialized protocol buffers. It supports dense compression, fast decoding, seeking, detection and optional skipping of data corruption, filtering of proto message fields for even faster decoding, and parallel encoding": https://github.com/google/riegeli
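For the curious, the delimited format those helpers implement is just a varint length followed by the message bytes. A minimal sketch in Python, treating the payload as opaque bytes so no protobuf library is needed (function names are made up):

```python
import io

def write_delimited(stream, payload: bytes) -> None:
    """Write a varint length prefix, then the payload (the delimited format)."""
    n = len(payload)
    while True:
        b = n & 0x7F
        n >>= 7
        stream.write(bytes([b | 0x80]) if n else bytes([b]))
        if not n:
            break
    stream.write(payload)

def read_delimited(stream):
    """Read one length-prefixed record; return None at a clean end of stream."""
    length = shift = 0
    while True:
        c = stream.read(1)
        if not c:
            return None  # EOF before a new record started
        b = c[0]
        length |= (b & 0x7F) << shift
        if not (b & 0x80):
            break
        shift += 7
    return stream.read(length)
```

Reading in a loop until `None` recovers the whole sequence, which is the part a plain `parse` call can't do on a concatenated stream.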

blauditore · 2 years ago
Prepending the message with a delimiter (size varint) is pretty common, even part of the reference Java implementation: https://protobuf.dev/reference/java/api-docs/com/google/prot...
pavlov · 2 years ago
> “I wanted to future proof this without wasting 64 goddamn bytes in front of each message”

64 bytes would be a 512-bit integer. That seems like excessive future proofing for the length of any message that would be transmitted before the Sun runs out of fuel.

xlogout · 2 years ago
Fun fact: just 103 more bits and it would be enough to address every point in the observable universe distinguishable by Planck length [1].

Not my thought originally, I heard it from somewhere else but can't find it. Possibly from Foone Turing.

[1] https://www.wolframalpha.com/input?i=log2%28+volume+of+unive...

jeffbee · 2 years ago
A related point: protobufs are limited to 2 GB anyway, so you wouldn't even need 64 bits. Anything bigger than about 100 MB is dangerously large anyway.
lifthrasiir · 2 years ago
You can also define a wrapper message with a single repeated field. The resulting encoding would be `varint(field id * 8 + 2) varint(length)` (wire type 2, length-delimited) plus the actual message, so it can also be easily generated from and parsed back into a raw byte sequence without protoc.
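As a sketch of this trick (assuming the usual wire format, where a length-delimited field's tag is `field_number << 3 | 2`), the framing for one repeated-field element could be generated like this in Python:

```python
def frame_record(payload: bytes, field_number: int = 1) -> bytes:
    """Frame a serialized message as one element of a `repeated` field in a
    wrapper message: varint tag (wire type 2) + varint length + payload."""
    def varint(n: int) -> bytes:
        out = bytearray()
        while True:
            b = n & 0x7F
            n >>= 7
            out.append((b | 0x80) if n else b)
            if not n:
                return bytes(out)
    tag = (field_number << 3) | 2  # wire type 2 = length-delimited
    return varint(tag) + varint(len(payload)) + payload
```

Concatenating `frame_record(...)` outputs yields bytes that protoc would happily parse as the wrapper message with all the elements, no custom parser required on the reading side.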
xyzzy_plugh · 2 years ago
This approach is problematic because, as far as I know, all implementations expect to deserialize from a complete buffer -- at least all official implementations. That means the entire "stream" would have to be buffered in memory.

You can, of course, simply read field by field, but few libraries expose the ability to do that simply. And the naive operation then becomes quite problematic indeed.

zX41ZdbW · 2 years ago
This is the same as it's implemented in ClickHouse. In fact, it has two formats: `Protobuf` (with length-delimited messages) and `ProtobufSingle` (no delimiters, but only a single message can be read/written). And it is fairly common:

https://clickhouse.com/docs/en/sql-reference/formats#protobu...

catiopatio · 2 years ago
Conforming implementations will fail to interoperate with their “protobuf” serialization; it’s absolutely incorrect for them to call their length-prefixed framing protobuf.
jimvdv · 2 years ago
From my experimentation with writing my own push notification client for android, this is exactly what Google does with their push messages.
seanw444 · 2 years ago
If it was for fun and to learn how, that's fair. But are you aware of https://ntfy.sh?
xyzzy_plugh · 2 years ago
Can you share more?
rkangel · 2 years ago
You've basically got two ways of handling this:

Using a delimiter to mark end of packet. For text protocols we usually use "\n" to be the delimiter. This usually requires some escaping in the packet to make it unambiguous. Two standard protocols for this are SLIP and HDLC.

Length encoding - like what you did.

The downside of the delimiter approach is that it changes the length of the packet: when you escape, one byte becomes two, and that's sometimes a pain if you're doing it in place in memory (less of a problem if you're streaming byte by byte). The big advantage is that it allows for resynchronisation. With length encoding, if you lose a single byte from your stream, or your length byte ever gets corrupted, then you're permanently out of sync: the receiver will never again know where the start or end of a packet is. With the delimiter approach, you just lose one packet. So if you're ever doing this for a UART or network stream or something, always do the delimiter approach!
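To illustrate the delimiter approach, here is a sketch of SLIP-style byte stuffing (constants from RFC 1055): the END byte marks packet boundaries, and any END or ESC byte inside the payload gets escaped so the delimiter stays unambiguous:

```python
END, ESC, ESC_END, ESC_ESC = 0xC0, 0xDB, 0xDC, 0xDD

def slip_encode(packet: bytes) -> bytes:
    """Escape END/ESC bytes inside the packet, then append the END delimiter."""
    out = bytearray()
    for b in packet:
        if b == END:
            out += bytes([ESC, ESC_END])
        elif b == ESC:
            out += bytes([ESC, ESC_ESC])
        else:
            out.append(b)
    out.append(END)
    return bytes(out)

def slip_decode(frame: bytes) -> bytes:
    """Inverse of slip_encode for a single frame (END delimiter included)."""
    out = bytearray()
    it = iter(frame)
    for b in it:
        if b == END:
            break  # end of packet
        if b == ESC:
            nxt = next(it)
            out.append(END if nxt == ESC_END else ESC)
        else:
            out.append(b)
    return bytes(out)
```

Losing a byte mid-stream corrupts at most one packet; the receiver resynchronises at the next END byte, which is the property length prefixing can't give you.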

hiimkeks · 2 years ago
> So if you're ever doing this for a UART or network stream or something, always do the delimiter approach!

I'm confused. I mean for UART, sure. But network streams is usually sent over a protocol that recovers lost data. Am I reckless for sending length-prefixed data chunks over TCP?

NoahKAndrews · 2 years ago
A TCP network stream will guarantee no missed bytes via behind-the-scenes retransmissions, no?
brandmeyer · 2 years ago
> isn’t a standard in that protobuf doesn’t care

Shove protobuf into Something Else that does packet delimitation for you. I'm fond of SQLite for offline cases as a richer alternative to sstable.

benreesman · 2 years ago
Oh who hasn’t written some LengthPrefixedProtobufRecord class. I get why Rachel would be annoyed if someone claimed it was protobuf, but even then it’s like, 8 seconds in xxd to see what’s going on.

I shudder to think how well shit must be going for this to merit a Rachel post.

bobmcn · 2 years ago
Why not just define a higher level proto that contains all possible (maybe repeated) protos you might want to include? Then if one of the included protos is not present, the higher level proto will efficiently encode that, and nothing gets broken.
tetrep · 2 years ago
If you want to future proof it you need to version it. It sounds like you're trying to pack many things into a single file, so having the first few bits of the file represent a version allows you to use fixed length integers without fear of them being too small (in the future). You can reserve the "last" version for varint if you truly need it.

In general, I find adding versions to things allows for much more graceful future redesigns and that is, IMO, invaluable if you're concerned about longevity and are not confident in your ability to perfectly design something for the indefinite future.

NavinF · 2 years ago
I don't see the point. As soon as you bump the version number, old versions of the software will refuse to read newer files. So you might as well use a new format and file extension for the newer files. Of course the new software will read both formats.

The way protobuf handles versioning (old software ignores unknown fields) is far superior, and realistically everyone uses 64-bit fixed-length sizes everywhere.

paulddraper · 2 years ago
YAGNI

The idea is so simple that any change would be a misfeature.

paulddraper · 2 years ago
Do you mean 64 bits I assume?

Because that's all you would need.

catiopatio · 2 years ago
Your length-prefixed message frames represent a distinct serialization scheme that provides external framing for embedded messages; it’s a standalone format, and you can use whatever frame header you want without worrying about collision.

The only issue is if you were to ship a “protobuf” library that emits/consumes your (very much not protobuf) framing format.

Also, a 64 bit frame length would only be 8 bytes, not 64 :-)

kentonv · 2 years ago
This post's outrage is misplaced.

Protobuf encoders/decoders commonly implement two formats: delimited format, and non-delimited format. Typically non-delimited is the default, but delimited format is supported by many implementations including Google's main first-party ones. In fact, the Java implementation shipped with this support when Protobuf was first released 15 years ago (I wrote it). C++ originally left it as an exercise for the application (it's not too hard to implement), but I eventually added helper functions due to demand.

Both formats can be described as "standard", at least to the extent that anything in Protobuf is a standard.

So clearly the bug here is that one person was writing the delimited format and the other person was reading the non-delimited format. Maybe the confusion was the result of insufficient documentation but certainly not from a library author doing something crazy.

xyzzy_plugh · 2 years ago
It's sort of misplaced, I'll agree.

Merely using the delimited format without any other sort of framing is almost always a bad idea because of precisely the ambiguity TFA discusses.

I'm pretty sure delimited streams are rarely used in the wild instead of something more robust/elaborate such as recordio, which specifically are almost always prefixed with a few magic bytes to mitigate this problem.

Edit: Also, why is there no publicly available recordio specification? Infuriating.
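Without a public recordio spec, a magic-bytes-plus-length frame is easy to improvise. A hypothetical sketch (the magic value and fixed-width length here are invented; real recordio differs):

```python
import io
import struct

MAGIC = b"RIO1"  # hypothetical 4-byte magic, not recordio's actual value

def write_record(stream, payload: bytes) -> None:
    """Frame a record as: magic + fixed 4-byte big-endian length + payload."""
    stream.write(MAGIC)
    stream.write(struct.pack(">I", len(payload)))
    stream.write(payload)

def read_record(stream):
    """Read one framed record; None at clean EOF, ValueError on bad magic."""
    header = stream.read(8)
    if not header:
        return None
    magic, length = header[:4], struct.unpack(">I", header[4:])[0]
    if magic != MAGIC:
        raise ValueError("not a framed record: bad magic")
    return stream.read(length)
```

The magic bytes are what resolve TFA's ambiguity: a reader that expects bare protobuf fails loudly on the first four bytes instead of silently misparsing a length prefix as a field tag.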

blauditore · 2 years ago
> I'm pretty sure delimited streams are rarely used in the wild

They are somewhat common inside Google at least.

lifthrasiir · 2 years ago
I actually went through all projects listed in [1] because I remember this very quirk. It turns out that there are many such libraries that have two variants of encode/decode functions, where the second variant prepends a varint length. In my brief inspection there do exist a few libraries with only the second variant (e.g. Rust quick-protobuf), which is legitimately problematic [2].

But if the project in question was indeed protobuf.js (see loeg's comments), it clearly distinguishes encode/decode vs. encodeDelimited/decodeDelimited. So I believe the project should not be blamed, and the better question would be why so many people chose to add this exact helper. Well, because Google itself also had the same helper [3] [4]! So at this point protobuf should just standardize this simple framing format [5], instead of claiming that protobuf has no obligation to define one.

[1] https://github.com/protocolbuffers/protobuf/blob/main/docs/t...

[2] https://github.com/tafia/quick-protobuf/issues/130

[3] https://protobuf.dev/reference/java/api-docs/com/google/prot...

[4] https://github.com/protocolbuffers/protobuf/blob/main/src/go...

[5] Use an explicitly different name though, so that the meaning of "encoding/decoding protobuf messages" doesn't change.

cowsandmilk · 2 years ago
Yep, and this variant to the encoding is documented at https://protobuf.dev/programming-guides/techniques/#streamin...

Definitely seems to be a routine addition to the standard supported by many libraries.

lifthrasiir · 2 years ago
> Yep, and this variant to the encoding is documented at [...]

It only suggests the length prefix and doesn't define the exact encoding at all.

crabbone · 2 years ago
This is not a "variant". Top-level "message" has no length. Implementations that add length are free to do it in whatever way they like, but that's not part of the format because the format itself doesn't have a concept of "sequence of messages". It has lists, but those need to be inside of messages.

And since some do it in the same way, sometimes it works. The two typical approaches I found in the wild: use a varint as the length, or do nothing at all. Typically, the second one implies that the user, if they want to send a sequence of messages, needs to get creative and invent some form of connecting them together.

gRPC is the first kind. So all those using gRPC rather than straight-up Protobuf are shielded from this problem.

saghm · 2 years ago
Stepping up an abstraction level in this discussion...does anyone have any insight into _why_ an encoding format wouldn't want to have length prefixes standardized as part of the expected header of a message? From what I can tell, there isn't a strong argument against it. Assuming you're comfortable with limiting messages to under 2^32 bytes, prefixing an unsigned length should only take four bytes per message, which doesn't seem like it would ever be a bottleneck. It allows the receiving side of a message to know up front exactly how much memory to allocate, and it makes it much easier to write correct parsing code that also handles edge cases (e.g. making it obvious to explicitly handle a message that's much larger than the amount of memory you're willing to allocate). The fact that there are formats out there that don't mandate length prefixing makes me think I might be missing something though, so I'd be interested to hear counterarguments.
gizmo686 · 2 years ago
In general, the issue is with composition. Your full message is somebody else's field. Having a header that only occurs in message-initial position breaks this.

For protobufs in particular, I have no idea. If you look at the encoding [0], you will see that the notion of submessages is explicitly supported. However, submessages are preceded by a length field, which makes the lack of a length field at the start of the top-level message a rather glaring omission. The best argument I can see is that submessages use a tag-length-value scheme instead of length-value-tag. This is because protobufs in general use a tag-value scheme, and certain types have the beginning of the value be a length field. This means that, to have a consistent and composable format, the message length would need to start at the second byte of the message. Still, that would probably be good enough for 90% of the instances where people want to apply a length header.

[0] https://protobuf.dev/programming-guides/encoding/

lifthrasiir · 2 years ago
Protobuf is sort of unique among serialization formats in that it can be indefinitely extended in principle. (BSON is close, but has an explicit document size prefix.) For example, given the following definition:

    message Foo {
        repeated string bar = 1;
    }
Any repetition of `0A 03 41 42 43` (the value "ABC" for field #1), including an empty sequence, is a valid protobuf message. In other words, there is no explicit encoding for "this is a message"! Submessages have to be delimited because otherwise they wouldn't be distinguishable from the parent message.
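A parser for such a message fits in a few lines. This hand-rolled Python sketch (no protobuf library) pulls out every occurrence of the repeated string field, and accepts zero or more repetitions, which is exactly the "indefinitely extendable" property described above:

```python
def parse_repeated_strings(data: bytes, field_number: int = 1) -> list[str]:
    """Minimal parse of a message whose only field is `repeated string`,
    i.e. repetitions of tag + varint length + UTF-8 bytes."""
    def varint(pos):
        result = shift = 0
        while True:
            b = data[pos]
            pos += 1
            result |= (b & 0x7F) << shift
            if not (b & 0x80):
                return result, pos
            shift += 7
    values, pos = [], 0
    while pos < len(data):
        tag, pos = varint(pos)
        assert tag == (field_number << 3) | 2  # wire type 2 = length-delimited
        length, pos = varint(pos)
        values.append(data[pos:pos + length].decode())
        pos += length
    return values
```

Note that the empty input is a valid message too: the loop simply never runs and an empty list comes back.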

crabbone · 2 years ago
> you will see that the notion of submessages are explicitly supported.

This misinterprets what actually happens. "Message" in Protobuf lingo means "a composite part". Everything that's not an integer (or a boolean or enum, which are also integers) is a message. Lists and maps are messages, and so are strings. The format is designed not to embed the length of the message in the message itself, but to put it outside. Why? Nobody knows for sure, but most likely it was a mistake: by the time anyone realized the encoding of the message length was misplaced, it would have been too much work to move it to the right place, and so it continues to this day.

For the record, I implemented a bunch of similar binary formats, eg. AMF, Thrift and BSON. The problem in Protobuf isn't some sort of a theoretical impasse. It's really easy to avoid it, if you give it like an hour-long thinking, before you get to actually writing the code.

HelloNurse · 2 years ago
So the length of the submessage is part of its parent, and the top level message has no explicit length because it has no parent? It seems terrible for most purposes.
pyrale · 2 years ago
> Having a header that only occurs in message initial position breaks this.

Why would it break it? It may make it slightly harder to parse, but since the header also determines the end of the message, anyone parsing the outer message would have a clear understanding that the inner header can be safely ignored as long as the stated outer length has not been matched.

kentonv · 2 years ago
I'd say the main arguments are:

1. Many transports that you might use to transmit a Protobuf already have their own length tracking, making a length prefix redundant. E.g. HTTP has Content-Length. Having two lengths feels wrong and forces you to decide what to do if they don't agree.

2. As others note, a length prefix makes it infeasible to serialize incrementally, since computing the serialized size requires most of the work of actually serializing it.

With that said, TBH the decision was probably not carefully considered, it just evolved that way and the protocol was in wide use in Google before anyone could really change their mind.

In practice, this did turn out to be a frequent source of confusion for users of the library, who often expected that the parser would just know where to stop parsing without them telling it explicitly. Especially when people used the functions that parse from an input stream of some sort, it surprised them that the parser would always consume the entire input rather than stopping at the end of the message. People would write two messages into a file and then find when they went to parse it, only one message would come out, with some weird combination of the data from the two inputs.

Based on that experience, Cap'n Proto chose to go the other way, and define a message format that is explicitly self-delimiting, so the parser does in fact know where the message ends. I think this has proven to be the right choice in practice.

(I maintained protobuf for a while and created Cap'n Proto.)

mannyv · 2 years ago
About #1, FYI you should never trust the content-length of an HTTP request. It actually says that in one of the spec versions or another.

The problem with writing the length out at the beginning of the message is that you need to know the length before you write it out. For large objects that may cause memory issues/be problematic.

In many cases it works just fine. I doubt any protocol puts a "0" as the length for a dynamic length, but I can see a many-months long technical fight about that particular design decision.

marcosdumay · 2 years ago
About #2, it seems that every time someone creates some format with a prefix serialized size, the next required step is creating a chunked format without that prefix.

What IMO is perfectly ok. Chunking things is a perfectly fine way to allow for incremental communication. Just do it up-front, and you will not need the complexity of 2 different sizing formats.

crabbone · 2 years ago
> serialize incrementally

You cannot do it in Protobuf anyways (you need to allocate memory, remember? and you need to know how much to allocate, so you need to calculate the length anyways, you just throw it away after you calculate it, fun, fun fun!).

jchw · 2 years ago
Among other things, length prefixing is annoying when streaming; it basically requires you to buffer the entire message, even if you could more efficiently stream it in chunks, because you need to know the length ahead of time, which you may very well not.
crabbone · 2 years ago
Remember FLV? What about MP4? Surprise! Both use length prefixing, and stream perfectly fine.

Length-prefixing is not a problem for streaming. Hierarchical data is, but even then, you have stuff like SAX (for XML).

The problem with Protobuf and why you cannot stream it is that it allows repetition of the same field, and only the last one counts. So, length-prefixing or not, you cannot stream it, unless you are sure that you don't send any hierarchical data (eg. you are sending a list of integers / floats).

Ah, also, another problem: "default fields". Until you parsed the entire message you don't know if default fields are present and whether they need to be initialized on the receiving end.

Ginden · 2 years ago
> length prefixing is annoying when streaming

This can be avoided with a magic number: if the length is 0, then the message length isn't known.

cjbgkagh · 2 years ago
If you have random access, you could leave some space and then go back and fill in the actual length value. Would work better with fixed size integer as you know ahead of time how much space to leave.
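A sketch of that back-patching approach in Python, using a fixed 4-byte slot precisely so the patched value always fits:

```python
import io
import struct

def write_with_backpatched_length(stream, chunks) -> None:
    """Reserve a fixed 4-byte length slot, stream the payload chunks, then
    seek back and patch in the real length. Requires a seekable stream."""
    slot = stream.tell()
    stream.write(b"\x00\x00\x00\x00")  # placeholder for the length
    total = 0
    for chunk in chunks:
        stream.write(chunk)
        total += len(chunk)
    end = stream.tell()
    stream.seek(slot)
    stream.write(struct.pack(">I", total))  # fixed width, so the patch fits exactly
    stream.seek(end)
```

This is exactly why a fixed-size integer beats a varint here: a varint's width depends on the value, so you couldn't reserve the slot before knowing the length.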
throwaway8163 · 2 years ago
Some compression formats such as gzip support encoding streams.

This is useful when you don't know the size in advance, or if you compress on demand and want the receiver to start reading while the sender is still compressing.

One example could be a web service where you request dynamic content (like a huge CSV file). The client can start downloading earlier, and the server doesn't need to create a temporary file. The web service streams the results directly, encoding them in chunks.

lifthrasiir · 2 years ago
> Some compression formats such as gzip support encoding streams.

More accurately speaking gzip (and many other compressed file formats) has the file size information, but that information should (or can, for others) be appended after the data. Protobuf doesn't have any such information, so a better analogue would be the DEFLATE compressed bytestream format.

mirekrusin · 2 years ago
You may want to have yield/streaming semantics where the length is not known in advance.
lifthrasiir · 2 years ago
The length should be known in advance in order to be written, so the message cannot be incrementally written. You need a more complex framing scheme like Consistent Overhead Byte Stuffing for that. And many applications do want a variable number of length bytes, because i) 4 bytes is actually too long for short messages and ii) some messages can exceed 2^32 bytes. Not to say the generic varint encoding is good for this purpose, though [1].

[1] If you ever have to design one, make sure that reading the first byte is enough to determine the number of subsequent length bytes.
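To illustrate footnote [1], here's a made-up length encoding where the first byte alone tells the reader how many more bytes to fetch (unlike LEB128, where you must scan byte by byte for the stop bit):

```python
def encode_length(n: int) -> bytes:
    """First-byte-determines-width scheme (invented for illustration):
    0x00-0xF7 encode 0-247 directly; 0xF8+k means a (k+1)-byte
    big-endian length follows, so lengths up to 2^64-1 fit."""
    if n < 0xF8:
        return bytes([n])
    body = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return bytes([0xF8 + len(body) - 1]) + body

def decode_length(data: bytes) -> tuple[int, int]:
    """Return (length, bytes consumed); the first byte fully determines both."""
    first = data[0]
    if first < 0xF8:
        return first, 1
    k = first - 0xF8 + 1
    return int.from_bytes(data[1:1 + k], "big"), 1 + k
```

A reader can issue exactly two reads (one byte, then a known remainder) instead of a byte-at-a-time loop, which matters on syscall-per-read transports.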

saghm · 2 years ago
> 4 bytes is actually too long for short messages

Would it ever be an actual bottleneck though? If it's not actually impeding throughput, I feel like this is more of an aesthetic argument than a technical one, and one where I'd happily sacrifice aesthetics to make the code simpler.

> some message can exceed 2^32 bytes

Fair enough, but that just makes the question "would 8 bytes per message ever actually be a bottleneck", which I'm still not convinced would ever be the case

throw0101c · 2 years ago
> and it allows the receiving side of a message to know up front exactly how much memory to allocate

Not necessarily. Can you really trust the length given from a message? Couldn't a malicious sender put some fake length to fool around with memory allocation?

I was under the impression that something like this caused Heartbleed (to use one example):

* https://en.wikipedia.org/wiki/Heartbleed

simiones · 2 years ago
Heartbleed was caused by allowing the user to specify the length of the response, not that of a request.

When receiving a message, if the user gives you a wrong length, you'll simply fail in parsing their message. Of course, it is up to you to protect against DOS attacks (like someone sending you a 5 TB message, or at least a message that claims it is 5TB) - but that is necessary regardless of whether they tell you the size ahead of time or not.

With Heartbleed, a user sent a message saying "please send me a 5MB heartbeat message", and OpenSSL would send a 5MB re-used buffer, of which only the first few bytes were overwritten with the heartbeat content. The rest of the bytes were whatever remained from a previous use of the buffer, which could be negotiated keys etc.
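That DOS guard is cheap to implement: validate the declared length against a policy limit before allocating anything. A Python sketch (the limit and 4-byte big-endian prefix are arbitrary examples, not from any spec):

```python
import io

MAX_MESSAGE_BYTES = 64 * 1024 * 1024  # arbitrary policy limit, 64 MB

def read_length_prefixed(stream, max_bytes: int = MAX_MESSAGE_BYTES) -> bytes:
    """Read a 4-byte big-endian length then the payload, rejecting
    oversized claims before any allocation happens."""
    header = stream.read(4)
    if len(header) < 4:
        raise EOFError("truncated length prefix")
    length = int.from_bytes(header, "big")
    if length > max_bytes:
        raise ValueError(f"declared length {length} exceeds limit {max_bytes}")
    payload = stream.read(length)
    if len(payload) < length:
        raise EOFError("stream ended before declared length")
    return payload
```

A lying sender can waste at most `max_bytes` of your memory, and a short read is detected rather than silently merged into the next message.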

saghm · 2 years ago
I don't see how sending a bad length could cause a memory issue in this case; if a message has a length that's much longer than expected, the receiving side could just discard any future messages from that destination (or even immediately close the connection). If the message is much shorter than the data received, the bytes following it would be treated as the start of a new message, and the same logic would apply.
crabbone · 2 years ago
There's nothing general about it. Whoever created Protobuf just "forgot" to do it; there was no reason not to, given how everything else is encoded in Protobuf.

My guess is that Protobuf was first implemented then designed. And by the time it was designed, the designer felt too lazy to do anything about the top-level message's length. There are plenty of other technical bloopers that indicate lack of up-front design, so this wouldn't be very unlikely.

bloak · 2 years ago
Streaming has already been mentioned. Efficiency might be another argument. If your messages are typically being sent through a channel that already has a way of indicating the end of the message then having to express the length inside the message as well would be a useless overhead in bytes sent and code complexity.
saghm · 2 years ago
This assumes that only messages from controlled sources will be received though, right? If you're receiving messages over a TCP socket or something similar, that seems like a potentially flawed assumption; I'd think anything parsing messages coming from the network should be written in a way that explicitly accounts for malicious messages coming from the other side of a connection.

EDIT: I'm also still not any more convinced that four bytes per message would ever be a bottleneck for any general purpose protocol, but I'd be curious to hear of a case where that would actually be an issue.

chrisdew · 2 years ago
Protobuf over UDP can use the UDP payload length. Likewise for the many variants of self-synchronising DLE framing (DLE,STX..DLE,ETX) used on serial links.

A varint length field prepended to protobuf messages (sent over a reliable transport, such as TCP) seems sane.

catiopatio · 2 years ago
Framing is a distinct concern from payload serialization.

Most protocols and serialization formats already define a form of length-prefixed framing; requiring that a protobuf payload also carry such a header would simply be a waste of bytes.

Additionally, it ensures that protobuf can be serialized and streamed without first computing the payload length, which would require serializing the entire message first.

nly · 2 years ago
Pedantry regarding the article:

The field prefix byte in Protobuf doesn't really encode "tag and type" as stated in the article, it encodes tag and size (whether the field is fixed 64bit, fixed 32bit, varint, or variable size)

This is pretty self evident when you look at how submessages are encoded the same way as strings, both are just arbitrary variable length blobs.

You cannot reliably determine from a Protobuf message whether a field is an integer, a double, a bool, or an enum without the schema.

Protobuf is a TLV format that just happens to have a compact binary encoding.
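A quick illustration of the point: wire type 2 only says "length-delimited blob". This sketch splits out such a field, and the same raw bytes could be a string, raw bytes, or a submessage depending entirely on the schema:

```python
def read_field(data: bytes):
    """Split one length-delimited (wire type 2) field into
    (field_number, payload bytes). For brevity this assumes both the
    tag and the length each fit in a single varint byte."""
    tag = data[0]
    assert tag & 0x07 == 2, "sketch only handles wire type 2"
    length = data[1]
    return tag >> 3, data[2:2 + length]
```

Given `0a 02 08 01`, this returns field 1 with payload `08 01`. Under `string name = 1;` that payload is two text bytes; under a schema where field 1 is a nested message with `int32 x = 1;`, the very same bytes are the submessage `x: 1`. Nothing in the encoding distinguishes the two.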

blauditore · 2 years ago
Isn't this exactly what `writeDelimitedTo` does? https://protobuf.dev/reference/java/api-docs/com/google/prot...

I thought this was common and well-known, but apparently not.

kpw94 · 2 years ago
> we started looking at this protobuf library he had selected, and sure enough, the author decided it was a good idea to prepend the message with the message length encoded as a varint.

> WHY? Oh, why?!

> And yes, it turns out that other people have noticed this anomaly. It's screwed up encoding and decoding in their projects, unsurprisingly. We found a (still-open) bug report from 2018, among others

If anyone knows which library/language these issues the author is talking about are in, please tell us. I'd like to avoid that library if possible

loeg · 2 years ago
> If anyone knows which library/language these issues the author is talking about are in, please tell us.

https://github.com/protobufjs/protobuf.js/issues/987 maybe (based on "And yes, that string in this post is entirely deliberate").

simiones · 2 years ago
This is a bug related to decoding what the user believes is a protobuf file but is actually a length-prefixed protobuf file. The article would complain about whoever wrote that file, not about the code that is reading it.
WirelessGigabit · 2 years ago
https://github.com/protobufjs/protobuf.js/wiki/How-to-revers...

Prepending the message with the length means the message is length-limited. Seems standard practice here.

gabipurcaru · 2 years ago
yep, I think it's quite standard. The BitTorrent protocol also uses it extensively : https://wiki.theory.org/BitTorrentSpecification#Bencoding
vlovich123 · 2 years ago
I feel like I recall seeing something similar in nanopb.
lowbloodsugar · 2 years ago
That would be the official implementations of protobuf from the people that created it.
ninepoints · 2 years ago
Folks, if you send raw PB over the wire in a protocol without framing, it _needs_ to be length prefixed somehow. This type of "bug" isn't an unreasonable thing to do, nor even an uncommon one. If I had a gripe, it's that I hate varints. Lots of wasted cycles to save not that many bytes in the grand scheme of things, especially when you consider other forms of compression layered on top, MTU, etc.
catiopatio · 2 years ago
It’s absolutely an unreasonable thing to do.

You don’t conflate framing with payload by emitting invalid non-standard framed data from a “protobuf” encoder. They’re separate concerns and need to remain that way.

ninepoints · 2 years ago
If you read the post, the conclusion is:

> you skip the "helper" function that's breaking things.

Yea ok, I'm just going to assume this helper function added framing unless told otherwise. Where in this post did you even read that framing and payload data were conflated? (Not to mention that there are better protocols that include framing metadata.)

lowbloodsugar · 2 years ago
Except that it’s part of the official implementation from google. It’s how you write multiple such messages to a stream.