I think the idea of all these is to make the file not be recognised as text (which doesn't allow nulls), as ASCII (which doesn't use the high bit), or as UTF-8 (which doesn't allow invalid byte sequences).
Basically, so that no valid file in this binary format will be misidentified as a text file.
I think preventing the opposite is more pressing. Imagine creating a text file and it just so happens that the first 8 characters match a magic number of an image format. Now when you go back to edit your text file it is suddenly recognized as an image file by your file browser.
Git interprets a zero byte as an unconditional sign that a file is a binary file [0]. With other “nonprintable” characters (including the high-bit ones) it depends on their frequency. Other tools look for high bits, or whether it’s valid UTF-8. PDF files usually have a comment with high-bit characters on the second line for similar reasons.
These recommended rules cover various common ways to check for text vs. binary, while also aiming to ensure that no genuine text file would ever accidentally match the magic number. The zero-byte recommendation largely achieves the latter (if one ignores double/quad-byte encodings like UTF-16/32).
[0] https://github.com/git/git/blob/683c54c999c301c2cd6f715c4114...
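To illustrate the kind of check being discussed, here is a rough sketch of a text-vs-binary heuristic in the spirit of Git's: any NUL in the first block means binary, and a high proportion of other non-printable bytes is a strong hint. The buffer size and threshold below are illustrative, not Git's exact values.

    def looks_binary(path, probe_size=8000):
        """Crude text-vs-binary heuristic: a NUL byte is decisive,
        other non-printable bytes only count if they are frequent."""
        with open(path, "rb") as f:
            chunk = f.read(probe_size)
        if not chunk:
            return False
        if b"\x00" in chunk:  # text files never contain NUL bytes
            return True
        suspicious = sum(1 for b in chunk
                         if b < 0x20 and b not in (0x09, 0x0A, 0x0C, 0x0D))
        return suspicious > len(chunk) // 128  # threshold is illustrative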
The high-bit one is pretty ancient by now. I don't think we have transmission methods that are not 8-bit-clean anymore. And if your file detector detects "generic text" before any more specialized detections (like "GIF87a"), and thus treats everything that starts with ASCII bytes as "generic text", then sorry, but your detector is badly broken.
There's no reason for the high-bit "rule" in 2025.
I would argue the same goes for the 0-byte rule. If you use strcmp() in your magic byte detector, then you're doing it wrong
The zero byte rule has nothing to do with strcmp(). Text files never contain 0-bytes, so having one is a strong sign the file is binary. Many detectors check for this.
I've just dealt with a 7N1 serial link yesterday, they still exist. Granted, nobody really uses them for truly arbitrary data exchange, but still.
Wikipedia has a good explanation why the PNG magic number is 89 50 4e 47 0d 0a 1a 0a. It has some good features, such as the end-of-file character for DOS and detection of line ending conversions. https://en.wikipedia.org/wiki/PNG#File_header
But the new spec doesn't explain: https://www.w3.org/TR/2003/REC-PNG-20031110/
That is unfortunate. Not enough standards have rationale or intent sections.
On the one hand, I sort of understand why they don't: "If it is not critical and load-bearing to the standard, why is it in there? It is just noise that will confuse the issue."
On the other hand, it can provide very important clues as to the why of the standard, not just the what. While the standard's authors understood why they did things the way they did, many years later, when we read it, we are often left with more questions than answers.
At first I wasn't sure why it contained a separate Unix line feed when you would already be able to detect a Unix to DOS conversion from the DOS line ending:
0D 0A 1A 0A -> 0D 0D 0A 1A 0D 0A
But of course this isn't to try and detect a Unix-to-DOS conversion, it's to detect a roundtrip DOS-to-Unix-to-DOS conversion:
0D 0A 1A 0A -> 0A 1A 0A -> 0D 0A 1A 0D 0A
Certainly a very well thought-out magic number.
Unix2dos is idempotent on CRLF; it doesn't change it to CRCRLF. Therefore, if the magic number only contained CRLF, converted singular LFs elsewhere in the file wouldn't be caught by the magic-number check. This isn't about roundtrip conversion.
It's also detecting when a file on DOS/Windows is opened in "ASCII mode" rather than binary mode. When opened in ASCII mode, "\r\n" is automatically converted to "\n" upon reading the data.
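A small sketch of why the trailing CR LF, SUB, LF bytes pay off: comparing the stored signature against the expected one catches the common mangling modes. The corruption patterns below just simulate a CRLF-to-LF collapse, a naive LF-to-CRLF rewrite, and high-bit stripping.

    PNG_SIG = bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A])

    def diagnose_png_header(head: bytes) -> str:
        if head.startswith(PNG_SIG):
            return "intact PNG signature"
        if head.startswith(PNG_SIG.replace(b"\r\n", b"\n")):
            return "CRLF collapsed to LF (DOS-to-Unix conversion or ASCII-mode read)"
        if head.startswith(PNG_SIG.replace(b"\n", b"\r\n")):
            return "LF expanded to CRLF (naive Unix-to-DOS conversion)"
        if head.startswith(bytes(b & 0x7F for b in PNG_SIG)):
            return "high bit stripped (7-bit transfer)"
        return "not a PNG"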
I can count the number of times I've had binary file corruption due to line ending conversion on zero hands. And I'm old enough to have used FTP extensively. Seems kind of unnecessary.
“Modern” FTP clients would auto-detect whether you were sending text or binary files and thus disable line conversions for binary.
But go back to the 90s and before, and you’d have to manually select whether you were sending text or binary data. Often these clients defaulted to text and so you’d end up accidentally corrupting files if you weren’t careful.
The pain was definitely real.
There are so many random line conversions going on, and the detection of what is a binary file is clearly broken.
I don't understand why the default would be anything but "commit the file as is".
The magic file (man magic / man file) is a neat one to read. On my Mac it is located in /usr/share/file/magic/, while I recall on a Unix distribution I worked on it was /etc/magic.
The file itself has a format, read by the file command, that can test a file and identify it (and possibly report more useful information).
# Various dictionary images used by OpenFirmware FORTH environment
0 lelong 0xe1a00000
>8 lelong 0xe1a00000
# skip raspberry pi kernel image kernel7.img by checking for positive text length
>>24 lelong >0 ARM OpenFirmware FORTH Dictionary,
>>>24 lelong x Text length: %d bytes,
>>>28 lelong x Data length: %d bytes,
>>>32 lelong x Text Relocation Table length: %d bytes,
>>>36 lelong x Data Relocation Table length: %d bytes,
>>>40 lelong x Entry Point: %#08X,
>>>44 lelong x BSS length: %d bytes
The original sources are mirrored in <https://github.com/file/file/tree/master/magic/Magdir>. The identification of the magic.mgc file itself comes from <https://github.com/file/file/blob/0fa0ffd15ff17d798e2f985447...>.
Unpopular opinion: this is all needless pedantry. At best this gives parsers like file managers a cleaner path to recognizing the specific version of the specific format you're designing. Your successors won't evolve the format with the same rigor you think you're applying now. They just won't. They'll make a "compatible" change at some point in the future which will (1) be actually backwards compatible! yet (2) need to be detected in some affirmative way. Which it won't be. And your magic number will just end up being a wart like all the rest.
This isn't a solvable problem. File formats evolve in messy ways, they always have and always will, and "magic numbers" just aren't an important enough part of the solution to be worth freaking out about.
Just make it unique; read some bytes out of /dev/random, whatever. Arguments like the one here about making them a safe nul-terminated string that is guaranteed to be utf-8 invalid are not going to help anyone in the long term.
And yes making it obviously binary is helpful, e.g. for Git.
> Magic numbers aren't for parsing files, they're for identifying file formats.
Unpopular corollary: thinking those are two separate actions is a terribly bad design smell. What are you going to do with that file you "identified" if not read it to get something out of it, or hand it to something that will.
If your file manager wants to turn that path into a thumbnail, you have already gone beyond anything the magic number can have helped you with.
Again, needless pedantry. Put a random number in the front and be done with it. Anything else needs a parser anyway.
The magic number isn't about recognizing specific versions. That's just an added benefit if you choose to add that to the magic number.
It is to solve the problem of how to build a file manager that can efficiently recognize all the file types in a large folder without relying on file name extensions.
If you don't include a magic number a file manager would need to attempt to parse the file format before it can determine which file type it is.
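For example, a table-driven sniffer can classify most files in a directory by reading only a handful of leading bytes. The signatures below are the standard published ones; the Python is just a sketch of the idea.

    # Map well-known leading byte sequences to a coarse file type.
    MAGIC_PREFIXES = [
        (b"\x89PNG\r\n\x1a\n", "PNG image"),
        (b"GIF87a", "GIF image"),
        (b"GIF89a", "GIF image"),
        (b"%PDF-", "PDF document"),
        (b"\x7fELF", "ELF binary"),
        (b"PK\x03\x04", "zip container (many formats)"),
        (b"\x1f\x8b", "gzip stream"),
    ]

    def sniff(path: str) -> str:
        with open(path, "rb") as f:
            head = f.read(16)  # longest prefix above is 8 bytes
        for prefix, name in MAGIC_PREFIXES:
            if head.startswith(prefix):
                return name
        return "unknown"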
Filename extensions are pretty useful. They were adopted for very good reasons, and every attempt to hide them, pretend they don't matter, or otherwise make them go away has only made things worse.
You still need a way to make it hard to fool people with deceptive extensions, though, and that's where the magic numbers come in.
> The magic number isn't about recognizing specific versions
Yes it is, though. Does your file manager want to display Excel files differently from .jar files? They're both different "versions" of the same file format! Phil Katz in 1988 or whatever could have followed the pedantry in the linked article to the letter (he didn't). And it wouldn't have helped the problem at hand one bit.
I agree. The author could go after msgpack: it doesn't have a magic number, but it supports using the .msgpack extension for storing data in files. Since a magic number isn't required at all, it shouldn't be required to be good.
Then there is mkv/webm, where strictly speaking you need to implement at least part of an EBML parser to distinguish them. That is possibly why no other file format adopts EBML; everything just recognizes it as either mkv or Matroska based on dodgy heuristics.
Many modern file formats are based on generic container formats (zip, riff, json, toml, xml without namespaces, ..). Identifying those files requires reading the entire file and then guessing the format from the contents. Magic numbers are becoming rare, which is a shame.
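As a concrete illustration of that last point, distinguishing zip-based formats means opening the archive and peeking at entry names. The conventions used below (EPUB's uncompressed "mimetype" first entry, OOXML's "[Content_Types].xml", a jar's manifest) are real, but the mapping is a deliberately simplified sketch.

    import zipfile

    def identify_zip_based(path: str) -> str:
        # Everything here shares the same "PK\x03\x04" magic number.
        try:
            with zipfile.ZipFile(path) as z:
                names = z.namelist()
                mimetype = (z.read("mimetype").decode("ascii", "replace")
                            if "mimetype" in names else "")
        except zipfile.BadZipFile:
            return "not a zip archive"
        if mimetype.startswith("application/epub+zip"):
            return "EPUB"
        if "[Content_Types].xml" in names:
            return "OOXML (docx/xlsx/pptx, still ambiguous without more digging)"
        if "META-INF/MANIFEST.MF" in names:
            return "probably a Java archive (jar)"
        return "some other zip-based format"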
Checking ELF against the article's own rules:
- MUST be at least four bytes long, eight is better -> check, but only four
- MUST include at least one byte with the high bit set -> nope
- MUST include a byte sequence that is invalid UTF-8 -> nope
- SHOULD include a zero byte -> nope
So, just 1.5 out of 5. Not good.
By the way, does anyone know the reason it starts with DEL (7F) specifically?
I think what the author likes is the fact that the first 4 bytes are defined as 0x7F followed by the file extension "ELF" in ASCII, which makes it a quite robust identifier.
And to be fair, including the 4 bytes following the magic number makes the ELF format qualify for at least 3 out of the 4 'MUST' requirements:
- 0x00~0x03: 7F 45 4C 46
- 0x04: Either 01 or 02 (defines 32-bit or 64-bit)
- 0x05: Either 01 or 02 (defines Little Endian or Big Endian)
- 0x06: Set to 01 (ELF version)
- 0x07: 00~12 (Target OS ABI)
Still not a shiny example though...
Maybe, yes. There are certainly worse offenders than ELF, but I still don't see how it satisfies 3 out of the 4 MUSTs. There is no byte with the high bit set and it is a valid ASCII sequence and therefore also valid UTF-8.
When it comes to the "eight is better" requirement, at least Linux does not care what comes after the fourth byte for identification purposes, so I think that does not count either.
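For reference, reading those identification bytes is a one-liner per field; the offsets and value meanings below follow the ELF e_ident layout, and /bin/ls is just an arbitrary ELF file to try it on (assumed to exist, as on most Linux systems).

    def describe_elf_ident(header: bytes) -> str:
        if header[:4] != b"\x7fELF":
            return "not ELF"
        ei_class = {1: "32-bit", 2: "64-bit"}.get(header[4], "unknown class")
        ei_data = {1: "little-endian", 2: "big-endian"}.get(header[5], "unknown byte order")
        return (f"ELF, {ei_class}, {ei_data}, "
                f"ident version {header[6]}, OS ABI {header[7]}")

    with open("/bin/ls", "rb") as f:  # any ELF binary will do
        print(describe_elf_ident(f.read(16)))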
I agree that it isn’t a particularly good example, especially with reference to the stated rules. Many binary-detection routines will treat DEL as a regular ASCII character.
Most of those make intuitive sense, except this one:
> MUST include a byte sequence that is invalid UTF-8
Making the magic number UTF-8 (or ASCII, which would still break the rule) would effectively turn it into a "magic string". Isn't that the better method for distinguishability? It's easier to pick unique memorable strings than unique memorable numbers, and you can also read it in a hex editor.
What would be the downsides?
Or is the idea of the requirement to distinguish the format from plaintext files? I'd think that the version number or the rest of the format would already likely contain some invalid UTF-8 to ensure that.
The key part of magic numbers is that they appear early in the file. You shouldn't rely on something that will probably appear at some point because that requires reading the entire file to detect its type. A single 0x00 byte, ideally the first byte, should be enough to indicate the file is binary and thus make the question of encoding moot. However, 0x00 is technically valid UTF-8 corresponding to U+0000 and ASCII NUL. So, throwing something like 0xFF in there also helps to throw off UTF-8 detection as well as adding high-bit-stripping detection.
If you really wanted to go the extra mile, you could also include an impossible sequence of UTF-16 code units, but I think you'd need to dedicate 8 bytes to that: two invalid surrogate sequences, one in little-endian and the other in big-endian. You could possibly get by with just 6 bytes if you used a BOM in place of one of the surrogate pairs, or even just 4 with a BOM and an isolated surrogate if you can guarantee that nothing after it can be confused for the other half of a surrogate pair. However, throwing off UTF-16 detection doesn't seem that common or useful; many UTF-16 decoders don't even reject these invalid sequences.
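If you want to sanity-check a candidate magic number against these rules, a few lines do it; the candidate bytes below are made up purely for illustration.

    CANDIDATE = b"\x8fMyFmt\x00\xff\x0d\x0a\x1a\x0a"  # hypothetical magic number

    def rule_check(magic: bytes) -> dict:
        try:
            magic.decode("utf-8")
            invalid_utf8 = False
        except UnicodeDecodeError:
            invalid_utf8 = True
        return {
            "at least 4 bytes (8 is better)": len(magic) >= 8,
            "has a byte with the high bit set": any(b >= 0x80 for b in magic),
            "invalid as UTF-8": invalid_utf8,
            "includes a zero byte": b"\x00" in magic,
        }

    print(rule_check(CANDIDATE))  # every check should come back True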
The magic number for OTS proof files:
1) Starts with a null-byte to make it clear this is binary, not text.
2) Includes a human-readable part to make it easy to figure out what the file is in hex dumps.
3) 8 bytes of randomly chosen bytes, all of which are greater than 0x7F to ensure they're not ASCII.
4) Finally, a one-byte major version number.
5) Total length (including major version) is 32 bytes to fit nicely in a hex dump.
OTS proof files typically contain lots of 32-byte hash digests. So "wasting" 32 bytes on magic bytes isn't a big deal.
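A sketch of what checking a header laid out like that can look like; the name and the "random" bytes below are placeholders, not the real OpenTimestamps values.

    # NUL + human-readable name + 8 random bytes above 0x7F + one version byte,
    # 32 bytes in total. Placeholder values, NOT the real OTS header.
    RANDOM_TAIL = bytes([0xB7, 0xC1, 0x94, 0xE6, 0xAA, 0x83, 0xF2, 0xD9])
    MAJOR_VERSION = 0x01
    HEADER = b"\x00ExampleTimestampProof\x00" + RANDOM_TAIL + bytes([MAJOR_VERSION])
    assert len(HEADER) == 32

    def open_proof(path: str) -> bytes:
        """Return the proof body if the 32-byte header matches, else raise."""
        with open(path, "rb") as f:
            header = f.read(32)
            if header[:31] != HEADER[:31]:
                raise ValueError("not a recognized proof file")
            if header[31] != MAJOR_VERSION:
                raise ValueError("unsupported major version %d" % header[31])
            return f.read()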