lifthrasiir · 3 months ago
Generally good points. Unfortunately, existing file formats rarely follow these rules. In fact, these rules tend to emerge naturally once you have dealt with many different file formats anyway. Specific points follow:

- Agreed that human-readable formats have to be dead simple; otherwise binary formats should be used. Note that textual numbers are surprisingly complex to handle, so any format that makes significant use of numbers should just be binary.

- Chunking is generally good for structuring and incremental parsing, but do not expect it to somehow provide reorderability or backward/forward compatibility. Unless explicitly designed in, those properties do not exist. Consider PNG, for example: its chunks were designed to be quite robust, but nowadays some exceptions [1] do exist. Versioning is much more crucial for that.

[1] https://www.w3.org/TR/png/#animation-information

- Making a new file format from scratch is always difficult. Already mentioned, but you should really consider using existing file formats as a container first. Some formats are even explicitly designed for this purpose, like sBOX [2] or RFC 9277 CBOR-labeled data tags [3].

[2] https://nothings.org/computer/sbox/sbox.html

[3] https://www.rfc-editor.org/rfc/rfc9277.html
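To illustrate the chunking point: a layout like PNG's can be walked generically even by a reader that doesn't understand every chunk. A minimal sketch in Python (CRC validation deliberately omitted):

```python
import struct

# Walk PNG-style chunks: [4-byte BE length][4-byte type][data][4-byte CRC]
def png_chunks(png):
    assert png[:8] == b"\x89PNG\r\n\x1a\n", "not a PNG signature"
    pos = 8
    while pos < len(png):
        (length,) = struct.unpack(">I", png[pos:pos + 4])
        ctype = png[pos + 4:pos + 8].decode("ascii")
        yield ctype, png[pos + 8:pos + 8 + length]
        pos += 12 + length            # 4 length + 4 type + data + 4 CRC

# Tiny hand-built file: an IHDR chunk (13 zero bytes) and an empty IEND
sig = b"\x89PNG\r\n\x1a\n"
ihdr = struct.pack(">I", 13) + b"IHDR" + bytes(13) + bytes(4)
iend = struct.pack(">I", 0) + b"IEND" + bytes(4)
names = [name for name, _ in png_chunks(sig + ihdr + iend)]
assert names == ["IHDR", "IEND"]
```

Skipping an unknown chunk is trivial here, but note that nothing in this framing tells you whether chunks may be reordered or dropped; that is exactly the part that must be designed in.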

mort96 · 3 months ago
> Note that textual numbers are surprisingly complex to handle, so any formats with significant number uses should just use binary.

Especially true of floats!

With binary formats, it's usually enough to only support machines whose floating point representation conforms to IEEE 754, which means you can just memcpy a float variable to or from the file (maybe with some endianness conversion). But writing a floating point parser and serializer which correctly round-trips all floats and where the parser guarantees that it parses to the nearest possible float... That's incredibly tricky.

What I've sometimes done when I'm writing a parser for textual floats is, I parse the input into separate parts (so the integer part, the floating point part, the exponent part), then serialize those parts into some other format which I already have a parser for. So I may serialize them into a JSON-style number and use a JSON library to parse it if I have that handy, or if I don't, I serialize it into a format that's guaranteed to work with strtod regardless of locale. (The C standard does, surprisingly, quite significantly constrain how locales can affect strtod's number parsing.)
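For what it's worth, the contrast between binary and textual float round-trips is easy to demonstrate (Python used here purely for illustration; the binary case is the moral equivalent of the memcpy approach):

```python
import struct

x = 0.1  # classic example: not exactly representable in binary
# Binary round trip: 8 bytes of IEEE 754, bit-exact by construction
blob = struct.pack("<d", x)           # what a binary format would store
assert struct.unpack("<d", blob)[0] == x

# Textual round trip: repr() emits the shortest decimal string that
# parses back to the same float (guaranteed since Python 3.1)
s = repr(x)
assert float(s) == x
```

The binary case is trivially correct; the textual case only looks easy because someone else already wrote the shortest-round-trip serializer and the correctly-rounding parser.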

analog31 · 3 months ago
Here's a weird idea that has occurred to me from time to time: what if your editor could recognize a binary float, display it in a readable format, and allow you to edit it, but leave it as binary when the file is saved?

Maybe it's discipline-specific, but with the reasonable care in handling floats that most people are taught, I've never had a consequential mishap.

chirsz · 3 months ago
You can use the hexadecimal floating-point literal format that the C language introduced in C99 [^1].

[^1]: https://cppreference.com/w/c/language/floating_constant.html

For example, `0x1.2p3` represents `9.0`.
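The same C99-style notation is available in Python via `float.hex`/`float.fromhex`, which makes the example above and the exact round-trip property easy to check:

```python
# Hex float literals round-trip exactly: no decimal rounding is involved
x = float.fromhex("0x1.2p3")        # (1 + 2/16) * 2**3 = 9.0
assert x == 9.0
assert float.fromhex(x.hex()) == x  # lossless, unlike short decimal forms
```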

HappMacDonald · 3 months ago
To me this just sounds like an endianness nightmare waiting to happen.
hyperbolablabla · 3 months ago
Couldn't you just write the hex bytes? That would be unambiguous, and it wouldn't lose precision.
shakna · 3 months ago
Spent the weekend with an untagged chunked format, and... I rather hate it.

A friend wanted a newer save viewer/editor for Dragonball Xenoverse 2, because there's about a total of two, and they're slow to update.

I thought it'd be fairly easy to spin up something to read it, because I've spun up a bunch of save editors before, and they're usually trivial.

XV2 save files change over versions. They're also just arrays of structs [0], that don't properly identify themselves, so some parts of them you're just guessing. Each chunk can also contain chunks - some of which are actually a network request to get more chunks from elsewhere in the codebase!

[0] Also encrypted before dumping to disk, but the keys have been known since about the second release, and they've never switched them.
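An untagged array-of-structs format boils down to something like this sketch (the record layout below is invented for illustration, not the real XV2 layout; the point is that the file itself identifies none of it):

```python
import struct

# Untagged record array: the layout lives only in the reader's head
REC = struct.Struct("<I16sf")   # id, fixed-width name, one float stat

def read_records(blob):
    # Nothing marks record boundaries; get REC wrong and everything shifts
    return [REC.unpack_from(blob, off)
            for off in range(0, len(blob), REC.size)]

blob = (REC.pack(1, b"Goku".ljust(16, b"\0"), 9000.0) +
        REC.pack(2, b"Vegeta".ljust(16, b"\0"), 8500.0))
recs = read_records(blob)
assert recs[0][0] == 1 and recs[1][2] == 8500.0
```

With no tags, lengths, or versions, every format revision silently breaks readers like this, which is presumably why those save editors are slow to update.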

InsideOutSanta · 3 months ago
>Most extensions have three characters, which means the search space is pretty crowded. You may want to consider using four letters.

Is there a reason not to use a lot more characters? If your application's name is MustacheMingle, call the file foo.mustachemingle instead of foo.mumi?

This will decrease the probability of collision to almost zero. I am unaware of any operating systems that don't allow it, and it will be 100% clear to the user which application the file belongs to.

It will be less aesthetically pleasing than a shorter extension, but that's probably mainly a matter of habit. We're just not used to longer file name extensions.

Any reason why this is a bad idea?

Hackbraten · 3 months ago
A 14-character extension might cause UX issues in desktop environments and file managers, where screen real estate per directory entry is usually very limited.

When under pixel pressure, a graphical file manager might choose to prioritize displaying the file extension and truncate only the base filename. This would help the user identify file formats. However, the longer the extension, the less space remains for the base name. So a low-entropy file extension with too many characters can contribute to poor UX.

delusional · 3 months ago
> it will be 100% clear to the user which application the file belongs to.

The most popular operating system hides it from the user, so clarity would not improve in that case. At least one other (Linux) doesn't really use "extensions" and instead relies on magic headers inside the files to determine the format.

Otherwise I think the decision is largely aesthetic. If you value absolute clarity, then I don't see any reason it won't work; it'll just be a little "ugly".

hiAndrewQuinn · 3 months ago
I don't even think it's ugly. I'm incredibly thankful every time I see someone make e.g. `db.sqlite`, it immediately sets me at ease to know I'm not accidentally dealing with a DuckDB file or something.
whyoh · 3 months ago
>The most popular operating system hides it from the user, so clarity would not improve in that case.

If you mean Windows, that's not entirely correct. It defaults to hiding only "known" file extensions, like txt, jpg and such. (Which IMO is even worse than hiding all of them; that would at least be consistent.)

EDIT: Actually, I just checked and apparently an extension, even an exotic one, becomes "known" when it's associated with a program, so your point still stands.

dist-epoch · 3 months ago
> At least one other (Linux) doesn't really use "extensions" and instead relies on magic headers inside the files to determine the format.

Mostly for executable files.

I doubt many Linux apps look inside a .py file to see if it's actually a JPEG they should build a thumbnail for.

layer8 · 3 months ago
It’s tedious to type when you want to do `ls *.mustachemingle` or similar.

It’s prone to get cut off in UIs with dedicated columns for file extensions.

As you say, it’s unconventional and therefore risks not being immediately recognized as a file extension.

On the other hand, Java uses .properties as a file extension, so there is some precedent.

dist-epoch · 3 months ago
> call the file foo.mustachemingle

You could go the whole Java way then: foo.com.apache.mustachemingle

> Any reason why this is a bad idea

The focus should be on the name, not on the extension.

Rygian · 3 months ago
Why should a file format be locked down to one specific application?
IAmBroom · 3 months ago
Both species are needed:

Generic, standardized formats like "jpg" and "pdf", and

Application-specific formats like extension files or state files for your program, that you do not wish to share with competitors.

thasso · 3 months ago
For archive formats, or anything that has a table of contents or an index, consider putting the index at the end of the file so that you can append to it without moving a lot of data around. This also allows for easy concatenation.
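The index-at-the-end scheme can be sketched in a few lines (the layout below is invented for illustration: data blobs, then fixed-size index entries, then an 8-byte trailer pointing at the index):

```python
import io
import struct

# Layout: [blob 0][blob 1]...[index entries][8-byte LE offset of the index]
def write_archive(f, blobs):
    entries = []
    for b in blobs:
        entries.append((f.tell(), len(b)))
        f.write(b)
    index_pos = f.tell()
    for off, length in entries:
        f.write(struct.pack("<QQ", off, length))  # (offset, length) pairs
    f.write(struct.pack("<Q", index_pos))         # trailer locates the index

def read_index(f):
    end = f.seek(-8, io.SEEK_END)                 # trailer position
    (index_pos,) = struct.unpack("<Q", f.read(8))
    f.seek(index_pos)
    n = (end - index_pos) // 16                   # 16 bytes per entry
    return [struct.unpack("<QQ", f.read(16)) for _ in range(n)]

f = io.BytesIO()
write_archive(f, [b"hello", b"world!"])
assert read_index(f) == [(0, 5), (5, 6)]
```

Appending then means overwriting the old index with the new blob and rewriting the index and trailer after it; none of the existing data moves.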
zzo38computer · 3 months ago
What would probably allow for even easier concatenation would be to store the header of each file immediately preceding that file's data. You can build an index in memory when reading the file, if that is helpful for your use case.
HelloNurse · 3 months ago
This would require a separate seek and read operation per archive member, each yielding only one directory entry, rather than a very few read operations to load the whole directory at once.
charcircuit · 3 months ago
Why not put it at the beginning, so that it is available at the start of the file stream? That way it is easier to read first, so you know which other ranges of the file you may need.

>This also allows for easy concatenation.

How would it be easier than putting it at the front?

shakna · 3 months ago
Files are... Flat streams. Sort of.

So if you rewrite an index at the head of the file, you may end up having to rewrite everything that comes afterwards, to push it further down in the file, if it overflows any padding offset. Which makes appending an extremely slow operation.

Whereas seeking to end, and then rewinding, is not nearly as costly.

lifthrasiir · 3 months ago
If the archive is being updated in place, turning ABC# into ABCD#' (where # and #' are indices) is easier than turning #ABC into #'ABCD. The actual position of indices doesn't matter much if the stream is seekable. I don't think the concatenation is a good argument though.
MattPalmer1086 · 3 months ago
Imagine you have a 12 GB zip file, and you want to add one more file to it. Very easy and quick if the index is at the end, very slow if it's at the start (assuming your index now needs more space than is currently available).

Reading the index from the end of the file is also quick; where you read next depends on what you are trying to find in it, which may not be the start.

McGlockenshire · 3 months ago
> How would it be easier than putting it at the front?

Have you ever wondered why `tar` is the Tape Archive? Tape. Magnetic recording tape. You stream data to it, and rewinding is Hard, so you put the list of files you just dealt with at the very end. This now-obsolete hardware expectation touches us decades later.

leiserfg · 3 months ago
If binary, consider just using SQLite.
paulddraper · 3 months ago
Did you read the article?

That wouldn’t support partial parsing.

HelloNurse · 3 months ago
On the contrary, loading everything from a database is the limit case of "partial parsing" with queries that read only a few pages of a few tables and indices.

From the point of view of the article, a SQLite file is similar to a chunked file format: the compact directory of what tables etc. it contains is more heavyweight than listing chunk names and lengths/offsets, but at least as fast, and loading only needed portions of the file is automatically managed.
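A minimal sketch of SQLite as a container in this spirit (the table layout here is made up for illustration):

```python
import sqlite3

# SQLite as a container format: named blobs in a table play the role
# of chunks in a chunked format
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE chunk (name TEXT PRIMARY KEY, data BLOB)")
con.execute("INSERT INTO chunk VALUES (?, ?)", ("thumb", b"\x89PNG\r\n"))
con.commit()

# "Partial parsing": only the pages holding this row are actually read
(data,) = con.execute(
    "SELECT data FROM chunk WHERE name = ?", ("thumb",)
).fetchone()
assert data == b"\x89PNG\r\n"
```

You get the index lookup, transactional updates, and partial reads for free; the price is the SQLite page structure as fixed overhead.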

lifthrasiir · 3 months ago
Using SQLite as a container format is only beneficial when the file format itself is a composite, like word processor files which include both the textual data and any attachments. SQLite is just a hindrance otherwise, as with image file formats or archival/compressed file formats [1].

[1] SQLite's own sqlar format is a bad idea for this reason.

SyrupThinker · 3 months ago
From my own experience SQLite works just fine as the container for an archive format.

It ends up having some overhead compared to established ones, but the ability to query over the attributes of 10000s of files is pretty nice, and definitely faster than the worst case of tar.

My archiver could even keep up with 7z in some cases (for size and access speed).

Implementing it is also not particularly tricky, and SQLite even allows streaming the blobs.

Making readers for such a format seems more accessible to me.

InsideOutSanta · 3 months ago
The Mac image editor Acorn uses SQLite as its file format. It's described here:

https://shapeof.com/archives/2025/4/acorn_file_format.html

The author notes that an advantage is that other programs can easily read the file format and extract information from it.

sureglymop · 3 months ago
I think it's fine as an image format. I've used the mbtiles format which is basically just a table filled with map tiles. Sqlite makes it super easy to deal with it, e.g. to dump individual blobs and save them as image files.

It just may not always be the most performant option. For example, for map tiles there is alternatively the pmtiles binary format which is optimized for http range requests.

aidenn0 · 3 months ago
Except image formats and archival formats are composites (data+metadata). We have Exif for images, and you might be surprised by how much metadata the USTar format has.
frainfreeze · 3 months ago
sqlar proved a great solution in the past for me. Where does it fall short in your experience?
Dwedit · 3 months ago
Sometimes, you'll need to pack multiple files inside of a single file. Those files will need to grow, and be able to be deleted.

At that point, you're asking for a filesystem inside of a file. And you can literally do exactly that with a filesystem library (FAT32, etc).

zzo38computer · 3 months ago
Consider the DER format. Partial parsing is possible; you can easily ignore any part of the file that you do not care about, since the framing is consistent. Additionally, it works like the "chunked" formats mentioned in the article: one of the bits of the header indicates whether the value contains other chunks or raw data. (Furthermore, I made up a text-based format called TER which is intended to be converted to DER. TER is not intended to be used directly; it is only intended to be converted to DER for use in other programs. I had also made up some additional data types, and one of these (called ASN1_IDENTIFIED_DATA) can be used for identifying the format of a file (which might conform to multiple formats, and it allows this too).)
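The consistent framing being described is small enough to sketch; a simplified DER reader, assuming definite lengths and single-byte tags:

```python
# Tag-length-value framing as used by DER (definite-length forms only)
def read_tlv(buf, pos=0):
    tag = buf[pos]
    length = buf[pos + 1]
    pos += 2
    if length & 0x80:                      # long form: low bits = byte count
        n = length & 0x7F
        length = int.from_bytes(buf[pos:pos + n], "big")
        pos += n
    constructed = bool(tag & 0x20)         # bit 6: value holds nested TLVs
    return tag, constructed, buf[pos:pos + length], pos + length

# SEQUENCE { INTEGER 7 }  ->  30 03 02 01 07
tag, constructed, body, _ = read_tlv(bytes([0x30, 0x03, 0x02, 0x01, 0x07]))
assert constructed and body == bytes([0x02, 0x01, 0x07])
inner_tag, inner_cons, inner_body, _ = read_tlv(body)
assert inner_tag == 0x02 and not inner_cons and inner_body == b"\x07"
```

Because every value carries its own length, a reader can skip any element it does not understand in one hop, which is the "ignore what you don't care about" property mentioned above.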

I dislike JSON and some other modern formats (even binary ones); they are often just not as good, in my opinion. One problem is that they tend to insist on using Unicode, and/or on other things (e.g. 32-bit integers where you might need 64 bits). Using a text-based format where binary would do better can also be inefficient, especially if binary data is embedded within the text, and especially if the format does not indicate that it is meant to represent binary data.

However, even if you use an existing format, you should avoid using the existing format badly; using existing formats badly seems to be common. There is also the issue of if the existing format is actually good or not; many formats are not good, for various reasons (some of which I mentioned above, but there are others, depending on the application).

About target hardware, not all software is intended for a specific target hardware, although some is.

For compression, another consideration is: there are general compression schemes as well as being able to make up a compression scheme that is specific for the kind of data that is being compressed.

They also mention file names. However, this can also depend on the target system; e.g. for DOS files you will need to be limited to three characters after the dot. Also, some programs would not need to care about file names in some or all cases (many programs I write don't care about file names).

aidenn0 · 3 months ago
Maybe it's just because I've never needed the complexity, but ASN.1 seems a bit much for any of the formats I've created.
jeroenhd · 3 months ago
The ASN.1 format itself is pretty well-suited for generic file types. Unfortunately, there are very few good, open source/free ASN.1 (de)serializers out there.

In theory you could use ASN.1 DER files the same way you would JSON for human-readable formats. In practice, you're better off picking a different format.

Modern evolutions of ASN.1 like ProtoBuf or Cap'n Proto designed for transmitting data across the network might fit this purpose pretty well, too.

On the other hand, using ASN.1 may be a good way to make people trying to reverse engineer your format give up in despair, especially if you start using the quirks ASN.1 DER comes with and change the identifiers.

zzo38computer · 3 months ago
For me too, although you only need to use (and implement) the parts which are relevant for your application and not all of them, so it is not really the problem. (I also never needed to write ASN.1 schemas, and a full implementation of ASN.1 is not necessary for my purpose.) (This is also a reason I use DER instead of BER, even if canonical form is not required; DER is simpler to handle than all of the possibilities of BER.)
mjevans · 3 months ago
Most of that's pretty good.

Compression: for anything that ends up large, it's probably desired. Consider both the algorithm and its 'strength' carefully based on the use case, though. Even a simple algorithm can make things faster when it comes time to transfer or write to permanent storage. A high-cost search that squeezes out yet more redundancy is probably worth it if something will be copied and/or decompressed many times, but might not be worth it for that locally compiled kernel you'll boot at most 10 times before replacing it with another.
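A quick sketch of that speed-versus-size tradeoff using zlib's compression levels:

```python
import zlib

data = b"example record " * 10_000
fast = zlib.compress(data, level=1)   # cheap, still a big win on redundancy
best = zlib.compress(data, level=9)   # more search time, smaller output
assert zlib.decompress(fast) == data
assert zlib.decompress(best) == data
assert len(best) <= len(fast) < len(data)
```

Level 1 already collapses the redundancy for a fraction of the CPU time; level 9 only pays off when the compressed artifact is written once and read many times.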