petertodd · 9 months ago
That's basically how I designed the magic bytes for the OpenTimestamps proof files:

    $ hexdump -C foo.ots 
    00000000  00 4f 70 65 6e 54 69 6d  65 73 74 61 6d 70 73 00  |.OpenTimestamps.|
    00000010  00 50 72 6f 6f 66 00 bf  89 e2 e8 84 e8 92 94 01  |.Proof..........|
0) Magic is at the beginning of the file.

1) Starts with a null-byte to make it clear this is binary, not text.

2) Includes a human-readable part to make it easy to figure out what the file is in hex dumps.

3) Eight randomly chosen bytes, all greater than 0x7F to ensure they're not ASCII.

4) Finally, a one-byte major version number.

5) Total length (including major version) is 32 bytes to fit nicely in a hex dump.
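Putting it together, a reader-side check might look like this (an illustrative sketch, not the actual OpenTimestamps code; the 31 fixed bytes are taken from the hexdump above):

    #include <stdio.h>
    #include <string.h>

    /* 31 fixed magic bytes from the hexdump above; byte 32 is the major version. */
    static const unsigned char OTS_MAGIC[31] =
        "\x00OpenTimestamps\x00\x00Proof\x00"
        "\xbf\x89\xe2\xe8\x84\xe8\x92\x94";

    int is_ots_proof(FILE *f)
    {
        unsigned char hdr[32];
        if (fread(hdr, 1, sizeof hdr, f) != sizeof hdr)
            return 0;                        /* too short to be a proof file */
        if (memcmp(hdr, OTS_MAGIC, sizeof OTS_MAGIC) != 0)
            return 0;                        /* magic mismatch */
        return hdr[31] == 0x01;              /* the one major version we know */
    }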

RustyRussell · 9 months ago
These days I generally advise giving the version number odd/even semantics: an unknown odd version is still compatible with existing readers, an unknown even version isn't.
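A minimal sketch of the reader-side rule (illustrative C, assuming a single integer version field; not from any particular spec):

    /* Accept versions we fully understand; for unknown versions,
       odd means "still readable", even means "must reject". */
    int can_read_version(unsigned version, unsigned max_known)
    {
        if (version <= max_known)
            return 1;
        return version & 1;
    }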
quag · 9 months ago
That sounds interesting. Can you say a little more about how this works?
kazinator · 9 months ago
I wouldn't waste the first byte on a null; use all four bytes for a "fourcc" code. Then have a null byte soon after that somewhere.
petertodd · 9 months ago
Well, like I said, I wanted a 32-byte magic so it'd look nice in hex-dumps. So I had plenty of room.

OTS proof files typically contain lots of 32-byte hash digests. So "wasting" 32 bytes on magic bytes isn't a big deal.

conaclos · 9 months ago

   SHOULD include a zero byte
I guess it is expected to be at the end of the magic number, to act as a null-terminated string?

   MUST include a byte sequence that is invalid UTF-8
I guess it is to differentiate a text file from a specific format?

   MUST include at least one byte with the high bit set
Any reason?

robinhouston · 9 months ago
I think the idea of all these is to make the file not be recognised as text (which doesn't allow nulls), as ASCII (which doesn't use the high bit), or as UTF-8 (which doesn't allow invalid UTF-8 sequences).

Basically, so that no valid file in this binary format will be misidentified as a text file.

CrossVR · 9 months ago
I think preventing the opposite is more pressing. Imagine creating a text file whose first 8 characters just happen to match the magic number of an image format. Now when you go back to edit your text file, it is suddenly recognized as an image by your file browser.
layer8 · 9 months ago
Git interprets a zero byte as an unconditional sign that a file is a binary file [0]. With other “nonprintable” characters (including the high-bit ones) it depends on their frequency. Other tools look for high bits, or whether it’s valid UTF-8. PDF files usually have a comment with high-bit characters on the second line for similar reasons.

These recommended rules cover various common ways to check for text vs. binary, while also aiming to ensure that no genuine text file would ever accidentally match the magic number. The zero-byte recommendation largely achieves the latter (if one ignores double/quad-byte encodings like UTF-16/32).

[0] https://github.com/git/git/blob/683c54c999c301c2cd6f715c4114...
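The zero-byte heuristic is simple enough to sketch (a simplified illustration of this style of check, not Git's exact code):

    #include <string.h>

    /* Treat any NUL among the first few thousand bytes as an
       unconditional sign that the file is binary. */
    int looks_binary(const char *buf, size_t len)
    {
        const size_t first_few_bytes = 8000;
        if (len > first_few_bytes)
            len = first_few_bytes;
        return memchr(buf, '\0', len) != NULL;
    }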

wjholden · 9 months ago
The author explains their reasoning in the next post: https://hackers.town/@zwol/114155807716413069
rcxdude · 9 months ago
Well, the 3rd point follows from the second: all sequences without the high bit set are valid ASCII, and all valid ASCII sequences are valid UTF-8.
Retr0id · 9 months ago
Having a high-bit set allows you to immediately detect if a file has been mangled through transmission over a 7-bit-only medium.
dark-star · 9 months ago
The high-bit rule is pretty ancient by now. I don't think we have transmission methods that are not 8-bit-clean anymore. And if your file detector checks for "generic text" before any more specialized detections (like "GIF87a"), and thus treats everything that starts with ASCII bytes as generic text, then sorry, but your detector is badly broken.

There's no reason for the high-bit "rule" in 2025.

I would argue the same goes for the 0-byte rule. If you use strcmp() in your magic-byte detector, then you're doing it wrong.

7jjjjjjj · 9 months ago
The zero byte rule has nothing to do with strcmp(). Text files never contain 0-bytes, so having one is a strong sign the file is binary. Many detectors check for this.
Joker_vD · 9 months ago
> I don't think we have transmission methods that are not 8-bit-clean anymore.

I dealt with a 7N1 serial link just yesterday; they still exist. Granted, nobody really uses them for truly arbitrary data exchange, but still.

gardaani · 9 months ago
Wikipedia has a good explanation why the PNG magic number is 89 50 4e 47 0d 0a 1a 0a. It has some good features, such as the end-of-file character for DOS and detection of line ending conversions. https://en.wikipedia.org/wiki/PNG#File_header
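The eight signature bytes, annotated with the rationale given there (a sketch for illustration):

    static const unsigned char PNG_SIG[8] = {
        0x89,              /* high bit set: not ASCII, catches 7-bit channels  */
        0x50, 0x4E, 0x47,  /* "PNG" in ASCII, readable in a hex dump           */
        0x0D, 0x0A,        /* CRLF: destroyed by DOS-to-Unix conversion        */
        0x1A,              /* Ctrl-Z: stops listing under the DOS TYPE command */
        0x0A               /* LF: destroyed by Unix-to-DOS conversion          */
    };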
nayuki · 9 months ago
The old PNG specification also explained the rationale: http://www.libpng.org/pub/png/spec/1.2/PNG-Rationale.html#R....

But the new spec doesn't explain: https://www.w3.org/TR/2003/REC-PNG-20031110/

somat · 9 months ago
That is unfortunate. Not enough standards have rationale or intent sections.

On the one hand, I sort of understand why they don't: "If it is not critical and load-bearing to the standard, why is it in there? It is just noise that will confuse the issue."

On the other hand, it can provide very important clues as to the why of the standard, not just the what. While the standards authors understood why they did things the way they did, many years later, when we read it, we are often left with more questions than answers.

CrossVR · 9 months ago
At first I wasn't sure why it contained a separate Unix line feed when you would already be able to detect a Unix-to-DOS conversion from the DOS line ending:

    0D 0A 1A 0A -> 0D 0D 0A 1A 0D 0A

But of course this isn't to try and detect a Unix-to-DOS conversion, it's to detect a roundtrip DOS-to-Unix-to-DOS conversion:

    0D 0A 1A 0A -> 0A 1A 0A -> 0D 0A 1A 0D 0A

Certainly a very well thought-out magic number.

layer8 · 9 months ago
Unix2dos is idempotent on CRLF; it doesn't change it to CRCRLF. Therefore, converted singular LFs elsewhere in the file wouldn't be recognized by the magic-number check if it only contained CRLF. This isn't about roundtrip conversion.
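In bytes: only the lone trailing LF is expanded, while the CRLF survives untouched:

    0D 0A 1A 0A -> 0D 0A 1A 0D 0A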
Dwedit · 9 months ago
It also detects when a file on DOS/Windows is opened in "ASCII mode" rather than binary mode. When opened in ASCII mode, "\r\n" is automatically converted to "\n" upon reading the data.
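In C terms, a minimal illustration (the translation happens in the standard library whenever the "b" flag is omitted):

    #include <stdio.h>

    /* On DOS/Windows, text mode converts "\r\n" to "\n" on read,
       corrupting binary data; always open binary formats with "b". */
    FILE *open_binary(const char *path)
    {
        return fopen(path, "rb");   /* plain "r" would translate line endings */
    }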
IshKebab · 9 months ago
I can count the number of times I've had binary file corruption due to line ending conversion on zero hands. And I'm old enough to have used FTP extensively. Seems kind of unnecessary.
hnlmorg · 9 months ago
“Modern” FTP clients would auto-detect whether you were sending text or binary files and thus disable line conversions for binary.

But go back to the 90s and before, and you’d have to manually select whether you were sending text or binary data. Often these clients defaulted to text and so you’d end up accidentally corrupting files if you weren’t careful.

The pain was definitely real

layer8 · 9 months ago
It can easily happen with version control across Windows and Unix clients. I’ve seen it a number of times.
jxhdbd · 9 months ago
Have you tried using git with Windows clients?

There are so many random line conversions going on, and the detection of what is a binary file is clearly broken.

I don't understand why the default would be anything but "commit the file as is"

shagie · 9 months ago
The magic file (man magic / man file) is a neat one to read. On my Mac it is located in /usr/share/file/magic/, while on a Unix distribution I worked on I recall it was /etc/magic.

The file's format describes tests that identify a file (and possibly extract more useful information); it is read by the file command.

    # Various dictionary images used by OpenFirmware FORTH environment

    0       lelong  0xe1a00000
    >8      lelong  0xe1a00000
    # skip raspberry pi kernel image kernel7.img by checking for positive text length
    >>24    lelong  >0              ARM OpenFirmware FORTH Dictionary,
    >>>24   lelong  x               Text length: %d bytes,
    >>>28   lelong  x               Data length: %d bytes,
    >>>32   lelong  x               Text Relocation Table length: %d bytes,
    >>>36   lelong  x               Data Relocation Table length: %d bytes,
    >>>40   lelong  x               Entry Point: %#08X,
    >>>44   lelong  x               BSS length: %d bytes

chrismorgan · 9 months ago
On Arch, per `file`:

  /usr/share/file/misc/magic.mgc: magic binary file for file(1) cmd (version 20) (little endian)
Per magic(5), it could also be “a directory of source text magic pattern fragment files in /usr/share/file/misc/magic”.

The original sources are mirrored in <https://github.com/file/file/tree/master/magic/Magdir>. The identification of the magic.mgc file itself comes from <https://github.com/file/file/blob/0fa0ffd15ff17d798e2f985447...>.

ajross · 9 months ago
Unpopular opinion: this is all needless pedantry. At best this gives parsers like file managers a cleaner path to recognizing the specific version of the specific format you're designing. Your successors won't evolve the format with the same rigor you think you're applying now. They just won't. They'll make a "compatible" change at some point in the future which will (1) be actually backwards compatible! yet (2) need to be detected in some affirmative way. Which it won't be. And your magic number will just end up being a wart like all the rest.

This isn't a solvable problem. File formats evolve in messy ways, they always have and always will, and "magic numbers" just aren't an important enough part of the solution to be worth freaking out about.

Just make it unique; read some bytes out of /dev/random, whatever. Arguments like the one here about making them a safe nul-terminated string that is guaranteed to be utf-8 invalid are not going to help anyone in the long term.

IshKebab · 9 months ago
Magic numbers aren't for parsing files, they're for identifying file formats.

And yes making it obviously binary is helpful, e.g. for Git.

ajross · 9 months ago
> Magic numbers aren't for parsing files, they're for identifying file formats.

Unpopular corollary: thinking those are two separate actions is a terribly bad design smell. What are you going to do with that file you "identified" if not read it to get something out of it, or hand it to something that will?

If your file manager wants to turn that path into a thumbnail, you have already gone beyond anything the magic number can have helped you with.

Again, needless pedantry. Put a random number in the front and be done with it. Anything else needs a parser anyway.

CrossVR · 9 months ago
The magic number isn't about recognizing specific versions. That's just an added benefit if you choose to add that to the magic number.

It is to solve the problem of how to build a file manager that can efficiently recognize all the file types in a large folder without relying on file name extensions.

If you don't include a magic number a file manager would need to attempt to parse the file format before it can determine which file type it is.

CamperBob2 · 9 months ago
Filename extensions are pretty useful. They were adopted for very good reasons, and every attempt to hide them, pretend they don't matter, or otherwise make them go away has only made things worse.

You still need a way to make it hard to fool people with deceptive extensions, though, and that's where the magic numbers come in.

ajross · 9 months ago
> The magic number isn't about recognizing specific versions

Yes it is, though. Does your file manager want to display Excel files differently from .jar files? They're both different "versions" of the same file format! Phil Katz in 1988 or whatever could have followed the pedantry in the linked article to the letter (he didn't). And it wouldn't have helped the problem at hand one bit.

benatkin · 9 months ago
I agree. The author could go after msgpack: it doesn't have a magic number, but it supports using the .msgpack extension for storing data in files. Since a magic number isn't required at all, it shouldn't be required to be good.
badmintonbaseba · 9 months ago
Then there is mkv/webm, where strictly speaking you need to implement at least part of an EBML parser to distinguish them. That's possibly why no other file format adopts EBML; everything just recognizes it as either mkv or matroska based on dodgy heuristics.
gardaani · 9 months ago
Many modern file formats are based on generic container formats (zip, riff, json, toml, xml without namespaces, ..). Identifying those files requires reading the entire file and then guessing the format from the contents. Magic numbers are becoming rare, which is a shame.
weinzierl · 9 months ago
Why is ELF a good example?

    7F 45 4C 46
- MUST be the very first N bytes in the file -> check

- MUST be at least four bytes long, eight is better -> check, but only four

- MUST include at least one byte with the high bit set -> nope

- MUST include a byte sequence that is invalid UTF-8 -> nope

- SHOULD include a zero byte -> nope

So, just 1.5 out of 5. Not good.

By the way, does anyone know the reason it starts with DEL (7F) specifically?

rickdeckard · 9 months ago
I think what the author likes is the fact that the first 4 bytes are defined as 0x7F followed by "ELF" in ASCII, which makes it a quite robust identifier.

And to be fair, including the 4 bytes following the magic number makes the ELF format meet at least 3 of the 4 'MUST' requirements (see the sketch after the list):

- 0x00~0x03: 7F 45 4C 46 (the magic itself)

- 0x04: either 01 or 02 (32-bit or 64-bit)

- 0x05: either 01 or 02 (little-endian or big-endian)

- 0x06: set to 01 (ELF version)

- 0x07: 00~12 (target OS ABI)

Still not a shiny example though...
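A sketch of that full 8-byte check (illustrative C; field names follow the ELF spec's e_ident layout):

    #include <stdint.h>
    #include <string.h>

    int looks_like_elf(const uint8_t id[8])
    {
        return memcmp(id, "\x7f" "ELF", 4) == 0
            && (id[4] == 1 || id[4] == 2)   /* EI_CLASS: 32-bit or 64-bit    */
            && (id[5] == 1 || id[5] == 2)   /* EI_DATA: little or big endian */
            && id[6] == 1                   /* EI_VERSION: always 1          */
            && id[7] <= 0x12;               /* EI_OSABI: known ABI range     */
    }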

weinzierl · 9 months ago
Maybe, yes. There are certainly worse offenders than ELF, but I still don't see how it satisfies 3 out of the 4 MUSTs. There is no byte with the high bit set and it is a valid ASCII sequence and therefore also valid UTF-8.

When it comes to the "eight is better" requirement, at least Linux does not care what comes after the fourth byte for identification purposes, so I think that does not count either.

layer8 · 9 months ago
I agree that it isn’t a particularly good example, especially with reference to the stated rules. Many binary-detection routines will treat DEL as a regular ASCII character.
0xFF0123 · 9 months ago
It's (7F) ELF
indigoabstract · 9 months ago
Hmm, I would expect that to be 31F, if it stood for "ELF" in correct Hexspeak.
xg15 · 9 months ago
Most of those make intuitive sense, except this one:

> MUST include a byte sequence that is invalid UTF-8

Making the magic number UTF-8 (or ASCII, which would still break the rule) would effectively turn it into a "magic string". Isn't that the better method for distinguishability? It's easier to pick unique memorable strings than unique memorable numbers, and you can also read it in a hex editor.

What would be the downsides?

Or is the idea of the requirement to distinguish the format from plaintext files? I'd think that the version number or the rest of the format would already likely contain some invalid UTF-8 to ensure that.

kbolino · 9 months ago
The key part of magic numbers is that they appear early in the file. You shouldn't rely on something that will probably appear at some point because that requires reading the entire file to detect its type. A single 0x00 byte, ideally the first byte, should be enough to indicate the file is binary and thus make the question of encoding moot. However, 0x00 is technically valid UTF-8 corresponding to U+0000 and ASCII NUL. So, throwing something like 0xFF in there also helps to throw off UTF-8 detection as well as adding high-bit-stripping detection.

If you really wanted to go the extra mile, you could also include an impossible sequence of UTF-16 code units, but I think you'd need to dedicate 8 bytes to that: two invalid surrogate sequences, one in little-endian and the other in big-endian. You could possibly get by with just 6 bytes if you used a BOM in place of one of the surrogate pairs, or even just 4 with a BOM and an isolated surrogate if you can guarantee that nothing after it can be confused for the other half of a surrogate pair. However, throwing off UTF-16 detection doesn't seem that common or useful; many UTF-16 decoders don't even reject these invalid sequences.

xg15 · 9 months ago
Ah, so it's really distinguishing the file from plaintext then. Thanks!