TomNomNom · 2 years ago
This looks cool. I ran this on some web crawl data I have locally, so: all files you'd find on regular websites; HTML, CSS, JavaScript, fonts etc.

It identified some simple HTML files (html, head, title, body, p tags and not much else) as "MS Visual Basic source (VBA)", "ASP source (code)", and "Generic text document" where the `file` utility correctly identified all such examples as "HTML document text".

Some woff and woff2 files it identified as "TrueType Font Data"; others came back as "Unknown binary data (unknown)" with low-confidence guesses ranging from FLAC audio to ISO 9660. Again, the `file` utility correctly identifies these files as "Web Open Font Format".

I like the idea, but the current implementation can't be relied on IMO; especially not for automation.

A minor pet peeve also: it doesn't seem to detect when its output is a pipe and strip the shell colour escapes, resulting in `^[[1;37` and `^[[0;39m` wrapping every line if you pipe the output into a vim buffer or similar.
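
Detecting that is usually a one-liner; a minimal sketch of the usual pattern in Python (not Magika's actual code):

    import sys

    # Only emit ANSI colour escapes when stdout is an interactive terminal;
    # when piped (into vim, another process, etc.), print plain text.
    use_colour = sys.stdout.isatty()

    def emit(line: str) -> None:
        if use_colour:
            print(f"\033[1;37m{line}\033[0;39m")
        else:
            print(line)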

ebursztein · 2 years ago
Thanks for the feedback -- we will look into it. If you can share the list of URLs with us, that would be very helpful so we can reproduce -- send us an email at magika-dev@google.com if that is possible.

For crawling we have planned a head-only model to avoid fetching the whole file, but it is not ready yet -- we weren't sure what use cases would emerge, so it is good to know that such a model might be useful.

We mostly use Magika internally to route files for AV scanning, as we wrote in the blog post, so it is possible that despite our best effort to test Magika extensively on various file types it is not as good on font formats as it should be. We will look into it.

Thanks again for sharing your experience with Magika -- this is very useful.

TomNomNom · 2 years ago
Sure thing :)

Here's[0] a .tgz file with 3 files in it that are misidentified by magika but correctly identified by the `file` utility: asp.html, vba.html, unknown.woff

These are files that were in one of my crawl datasets.

[0]: https://poc.lol/files/magika-test.tgz

westurner · 2 years ago
What is the MIME type of a .tar file; and what are the MIME types of the constituent concatenated files within an archive format like e.g. tar?

hachoir/subfile/main.py: https://github.com/vstinner/hachoir/blob/main/hachoir/subfil...

File signature: https://en.wikipedia.org/wiki/File_signature

PhotoRec: https://en.wikipedia.org/wiki/PhotoRec

"File Format Gallery for Kaitai Struct"; 185+ binary file format specifications: https://formats.kaitai.io/

Cross-reference table: https://formats.kaitai.io/xref.html

AntiVirus software > Identification methods > Signature-based detection, Heuristics, and ML/AI data mining: https://en.wikipedia.org/wiki/Antivirus_software#Identificat...

Executable compression; packer/loader: https://en.wikipedia.org/wiki/Executable_compression

Shellcode database > MSF: https://en.wikipedia.org/wiki/Shellcode_database

sigtool.c: https://github.com/Cisco-Talos/clamav/blob/main/sigtool/sigt...

clamav sigtool: https://www.google.com/search?q=clamav+sigtool

https://blog.didierstevens.com/2017/07/14/clamav-sigtool-dec... :

    sigtool --find-sigs "$name" | sigtool --decode-sigs

List of file signatures: https://en.wikipedia.org/wiki/List_of_file_signatures

And then also: clusterfuzz/oss-fuzz scans .txt source files with (sandboxed) static and dynamic analysis tools; `debsums`/`rpm -Va` verify that files on disk have the same (GPG-signed) checksums as the package they were supposedly installed from; a file-based HIDS builds a database of file hashes and compares what's on disk in a later scan with what was presumed good; ~gdesktop LLM tools scan every file; there are extended filesystem attributes for label-based MAC systems like SELinux; oh, and NTFS ADS.

A sufficient cryptographic hash function yields random bits with uniform probability. DRBGs (Deterministic Random Bit Generators) need high-entropy random bits in order to continuously re-seed the RNG (random number generator). Is it safe to assume that hashing (1) every file on disk, or (2) any given file on disk at random, will yield random bits with uniform probability; and (3) why Argon2 instead of e.g. only two rounds of SHA256?

https://github.com/google/osv.dev/blob/master/README.md#usin... :

> We provide a Go based tool that will scan your dependencies, and check them against the OSV database for known vulnerabilities via the OSV API.

... With package metadata, not a (file hash, package) database that could be generated from OSV and the actual package files instead of their manifest of already-calculated checksums.

Might as well be heating a pool on the roof with all of this waste heat from hashing binaries built from code of unknown static and dynamic quality.

Add'l useful formats:

> Currently it is able to scan various lockfiles, debian docker containers, SPDX and CycloneDX SBOMs, and git repositories

Things like bittorrent magnet URIs, Named Data Networking, and IPFS are (file-hash based) "Content addressable storage": https://en.wikipedia.org/wiki/Content-addressable_storage


michaelmior · 2 years ago
> the current implementation can't be relied on IMO

What's your reasoning for not relying on this? (It seems to me that this would be application-dependent at the very least.)

jdiff · 2 years ago
I'm not the person you asked, but I'm not sure I understand your question and I'd like to. It whiffed multiple common softballs, to the point it brings into question the claims made about its performance. What reasoning is there to trust it?
TomNomNom · 2 years ago
It provided the wrong file-types for some files, so I cannot rely on its output to be correct.

If you wanted to, for example, use this tool to route different files to different format-specific handlers it would sometimes send files to the wrong handlers.

stevepike · 2 years ago
Oh man, this brings me back! Almost 10 years ago I was working on a rails app trying to detect the file type of uploaded spreadsheets (xlsx files were being detected as application/zip, which is technically true but useless).

I found "magic" that could detect these and submitted a patch at https://bugs.freedesktop.org/show_bug.cgi?id=78797. My patch got rejected for needing to look at the first 3KB of the file to figure out the type. They had a hard limit that they wouldn't look past the first 256 bytes. Now in 2024 we're doing this with deep learning! It'd be cool if Google released some speed benchmarks here against the old-fashioned implementations. Obviously it'd be slower, but is it 1000x or 10^6x?

ebursztein · 2 years ago
Co-author of Magika here (Elie). We didn't include the measurements in the blog post to avoid making it too long, but we did take them.

Overall, file takes about 6ms for a single file and 2.26ms per file when scanning multiples. Magika is at 65ms for a single file and 5.3ms per file when scanning multiples.

So in the worst-case scenario Magika is about 10x slower, due to the time it takes to load the model, and about 2x slower on repeated detection. This is why we said it is not that much slower.

We will have more performance measurements in the upcoming research paper. Hope that answers the question.

chmod775 · 2 years ago
Is that single-threaded libmagic vs Magika using every core on the system? What are the numbers like if you run multiple libmagic instances in parallel for multiple files, or limit both libmagic and magika to a single core?

Testing it on my own system, magika seems to use a lot more CPU-time:

    file /usr/lib/*  0,34s user 0,54s system 43% cpu 2,010 total
    ./file-parallel.sh  0,85s user 1,91s system 580% cpu 0,477 total
    bin/magika /usr/lib/*  92,73s user 1,11s system 393% cpu 23,869 total
Looks about 50x slower to me. There's 5k files in my lib folder. It's definitely still impressively fast given how the identification is done, but the difference is far from negligible.

jpk · 2 years ago
Do you have a sense of performance in terms of energy use? 2x slower is fine, but is that at the same wattage, or more?
metafunctor · 2 years ago
I've ended up implementing a layer on top of "magic" which, if magic detects application/zip, reads the zip file manifest and checks for telltale file names to reliably detect Office files.

The "magic" library does not seem to be equipped with the capabilities needed to be robust against the zip manifest being ordered in a different way than expected.
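
A minimal sketch of that kind of overlay in Python, assuming the usual OOXML entry prefixes (not a drop-in library):

    import zipfile

    # Map telltale zip entry prefixes to OOXML MIME types.
    OFFICE_HINTS = {
        "word/": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
        "xl/":   "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
        "ppt/":  "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    }

    def refine_zip_type(path: str) -> str:
        """Refine a generic application/zip verdict into an Office type if possible."""
        try:
            with zipfile.ZipFile(path) as zf:
                names = zf.namelist()
        except zipfile.BadZipFile:
            return "application/zip"
        for prefix, mime in OFFICE_HINTS.items():
            # Scan the whole manifest rather than relying on entry order.
            if any(n.startswith(prefix) for n in names):
                return mime
        return "application/zip"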

But this deep learning approach... I don't know. It might be hard to shoehorn into many applications where the traditional methods have negligible memory and compute costs and the accuracy is basically 100% for cases that matter (detecting particular file types of interest). But when looking at a large random collection of unknown blobs, yeah, I can see how this could be great.

stevepike · 2 years ago
If you're curious, here's how I solved it for ruby back in the day. Still used magic bytes, but added an overlay on top of the freedesktop.org DB: https://github.com/mimemagicrb/mimemagic/pull/20
comboy · 2 years ago
Many commenters seem to be using magic instead of file, any reasons?
renonce · 2 years ago
From the first paragraph:

> enabling precise file identification within milliseconds, even when running on a CPU.

Maybe your old-fashioned implementations were detecting in microseconds?

stevepike · 2 years ago
Yeah I saw that, but that could cover a pretty wide range and it's not clear to me whether that relies on preloading a model.


brabel · 2 years ago
> They had a hard limit that they wouldn't see past the first 256 bytes.

Then they could never detect zip files with certainty, given that to do that you need to read up to 65KB (+ 22) at the END of the file. The reason is that the zip archive format allows "garbage" bytes both at the beginning of the file and in between local file headers... and it's actually not uncommon to prepend a program that self-extracts the archive, for example. The only way to know if a file is a valid zip archive is to look for the End of Central Directory record, which is always at the end of the file AND allows for a comment of unknown length at the end (and as the comment length field takes 2 bytes, the comment can be up to 65K long).
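
If you do have the whole file, locating that record is cheap; a minimal sketch in Python:

    # The End of Central Directory signature is PK\x05\x06. It sits within the
    # last 22 bytes (fixed part) plus up to 65535 bytes of trailing comment.
    EOCD_SIG = b"PK\x05\x06"
    MAX_EOCD_SEARCH = 22 + 65535

    def looks_like_zip(path: str) -> bool:
        with open(path, "rb") as f:
            f.seek(0, 2)                        # jump to the end of the file
            size = f.tell()
            f.seek(max(0, size - MAX_EOCD_SEARCH))
            tail = f.read()
        return EOCD_SIG in tail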

jeffbee · 2 years ago
That's why the whole question is ill formed. A file does not have exactly one type. It may be a valid input in various contexts. A zip archive may also very well be something else.
aidenn0 · 2 years ago
FWIW, file can now distinguish many types of zip containers, including OOXML files.
m0shen · 2 years ago
As someone that has worked in a space that has to deal with uploaded files for the last few years, and someone who maintains a WASM libmagic Node package (https://github.com/moshen/wasmagic), I have to say I really love seeing new entries into the file type detection space.

Though I have to say when looking at the Node module, I don't understand why they released it.

Their docs say it's slow:

https://github.com/google/magika/blob/120205323e260dad4e5877...

It loads the model at runtime:

https://github.com/google/magika/blob/120205323e260dad4e5877...

They mark it as Experimental in the documentation, but it seems like it was just made for the web demo.

Also, as others have mentioned, the model appears to only detect 116 file types:

https://github.com/google/magika/blob/120205323e260dad4e5877...

Where libmagic detects... a lot. Over 1600 last time I checked:

https://github.com/file/file/tree/4cbd5c8f0851201d203755b76c...

I guess I'm confused by this release. Sure it detected most of my list of sample files, but in a sample set of 4 zip files, it misidentified one.

m0shen · 2 years ago
Made a small test to try it out: https://gist.github.com/moshen/784ee4a38439f00b17855233617e9...

    hyperfine ./magika.bash ./file.bash
    Benchmark 1: ./magika.bash
      Time (mean ± σ):     706.2 ms ±  21.1 ms    [User: 10520.3 ms, System: 1604.6 ms]
      Range (min … max):   684.0 ms … 738.9 ms    10 runs
    
    Benchmark 2: ./file.bash
      Time (mean ± σ):      23.6 ms ±   1.1 ms    [User: 15.7 ms, System: 7.9 ms]
      Range (min … max):    22.4 ms …  29.0 ms    111 runs
    
    Summary
      './file.bash' ran
       29.88 ± 1.65 times faster than './magika.bash'

barrkel · 2 years ago
Realistically, either you're identifying one file interactively and you don't care about latency differences in the 10s of ms, or you're identifying in bulk (batch command line or online in response to requests), in which case you should measure the marginal cost and exclude Python startup and model loading times.
m0shen · 2 years ago
I've updated this script with some single-file cli numbers, which are (as expected) not good. Mostly just comparing python startup time for that.

    make
    sqlite3 < analyze.sql
    file_avg              python_avg         python_x_times_slower_single_cli
    --------------------  -----------------  --------------------------------
    0.000874874856301821  0.179884610224334  205.611818568799
    file_avg            python_avg     python_x_times_slower_bulk_cli
    ------------------  -------------  ------------------------------
    0.0231715865881818  0.69613745142  30.0427184289163

ebursztein · 2 years ago
We did release the npm package because we indeed created a web demo and thought people might want to use it too. We know it is not as fast as the Python version or a C++ version -- which is why we marked it as experimental.

The release includes the Python package and the CLI, which are quite fast and are the main way we expected people to use Magika -- sorry if that wasn't clear in the post.

The goal of the release is to offer a tool that is far more accurate than other tools and works on the major file types, as we hope it will be useful to the community.

Glad to hear it worked on your files.

m0shen · 2 years ago
Thank you for the release! I understand you're just getting it out the door. I just hope to see it delivered as a native library or something more reusable.

I did try the python cli, but it seems to be about 30x slower than `file` for the random bag of files I checked.

I'll probably take some time this weekend to make a couple of issues around misidentified files.

I'll definitely be adding this to my toolset!

invernizzi · 2 years ago
Hello! We wrote the Node library as a first functional version. Its API is already stable, but it's a bit slower than the Python library for two reasons: it loads the model at runtime, and it doesn't do batch lookups, meaning it calls the model for each file. Other than that, it's just as fast for single-file lookups, which is the most common use case.
m0shen · 2 years ago
Good to know! Thank you. I'll definitely be trying it out. Though, I might download and hardcode the model ;)

I also appreciate the use of ONNX here, as I'm already thinking about using another version of the runtime.

Do you think you'll open source your F1 benchmark?

tudorw · 2 years ago
Can we do the 1600 if known and, if not, let the AI take a guess?
m0shen · 2 years ago
Absolutely, and honestly in a non-interactive ingestion workflow you're probably doing multiple checks anyway. I've worked with systems that call multiple libraries and hand-coded validation for each incoming file.

Maybe it's my general malaise, or disillusionment with the software industry, but when I wrote that I was really just expecting more.

michaelt · 2 years ago
> The model appears to only detect 116 file types [...] Where libmagic detects... a lot. Over 1600 last time I checked

As I'm sure you know, in a lot of applications, you're preparing things for a downstream process which supports far fewer than 1600 file types.

For example, a printer driver might call on file to check if an input is postscript or PDF, to choose the appropriate converter - and for any other format, just reject the input.

Or someone training an ML model to generate Python code might have a load of files they've scraped from the web, but might want to discard anything that isn't Python.

theon144 · 2 years ago
Okay, but your one file type is more likely to be included in the 1600 that libmagic supports rather than Magika's 116?

For that matter, the file types I care about are unfortunately misdetected by Magika (which is also an important point - the `file` command at least gives up and says "data" when it doesn't know, whereas the Magika demo gives a confidently wrong answer).

I don't want to criticize the release because it's not meant to be a production-ready piece of software, and I'm sure the current 116 types isn't a hard limit, but I do understand the parent comment's contention.

eapriv · 2 years ago
Surely identifying just one file type (or two, as in your example) is a much simpler task that shouldn’t rely on horribly inefficient and imprecise “AI” tools?
lebean · 2 years ago
It's for researchers, probably.
m0shen · 2 years ago
Yeah, there is this line:

    By open-sourcing Magika, we aim to help other software improve their file identification accuracy and offer researchers a reliable method for identifying file types at scale.
Which implies a production-ready release for general usage, as well as usage by security researchers.

Eiim · 2 years ago
I ran a quick test on 100 semi-random files I had laying around. Of those, 81 were detected correctly, 6 were detected as the wrong file type, and 12 were detected with an unspecific file type (unknown binary/generic text) when a more specific type existed. In 4 of the unspecific cases, a low-confidence guess was provided, which was wrong in each case. However, almost all of the files which were detected wrong/unspecific are of types not supported by Magika, with one exception of a JSON file containing a lot of JS code as text, which was detected as JS code.

For comparison, file 5.45 (the version I happened to have installed) got 83 correct, 6 wrong, and 10 not specific. It detected the weird JSON correctly, but also had its own strange issues, such as detecting a CSV as just "data". The "wrong" here was somewhat skewed by the 4 GLSL shader code files that were in the dataset for some reason, all of which it detected as C code (Magika called them unknown). The other two "wrong" detections were also code formats that it seems it doesn't support. It was also able to output a lot more information about the media files.

Not sure what to make of these tests but perhaps they're useful to somebody.
pizzalife · 2 years ago
> The "wrong" here was somewhat skewed by the 4 GLSL shader code files that were in the dataset for some reason, all of which it detected as C code

To be fair though, a snippet of GLSL shader code can be perfectly valid C.

Eiim · 2 years ago
Indeed, which is why I felt the need to call it out here. I'm not certain if the files in question actually happened to be valid C, but whether that's a meaningful mistake regardless is left to the reader to decide.
lifthrasiir · 2 years ago
I'm extremely confused by the claim that other tools have worse precision or recall for APK or JAR files, which are very regular formats. Like, they should be a valid ZIP file with `META-INF/MANIFEST.MF` present (at least), and an APK would need `classes.dex` as well, but at this point there is no other format that can be confused with APK or JAR, I believe. I'd like to see which files were causing the unexpected drop in precision or recall.
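
A minimal sketch of that kind of structural check in Python (the AndroidManifest.xml test is an added assumption, not something claimed in the thread or the paper):

    import zipfile

    def classify_zip(path: str) -> str:
        """Rough APK/JAR/ZIP split based on well-known entries."""
        with zipfile.ZipFile(path) as zf:
            names = set(zf.namelist())
        if "classes.dex" in names and "AndroidManifest.xml" in names:
            return "apk"
        if "META-INF/MANIFEST.MF" in names:
            return "jar"
        return "zip"
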
Someone · 2 years ago
People do create JAR files without a META-INF/MANIFEST.MF entry.

The tooling even supports it. https://docs.oracle.com/en/java/javase/21/docs/specs/man/jar...:

  -M or --no-manifest
     Doesn't create a manifest file for the entries

HtmlProgrammer · 2 years ago
Minecraft mods 14 years ago used to tell you to open the JAR and delete the META-INF when installing them, so you can't rely on that one…
supriyo-biswas · 2 years ago
The `file` command checks only the first few bytes, and doesn’t parse the structure of the file. APK files are indeed reported as Zip archives by the latest version of `file`.
m0shen · 2 years ago
This is false in every sense for https://www.darwinsys.com/file/ (probably the most used file version). It depends on the magic for a specific file type, but it can check any part of your file. Many Linux distros ship versions that are years out of date; you might be using a very old version.

FILE_45:

    ./src/file -m magic/magic.mgc ../../OpenCalc.v2.3.1.apk
    ../../OpenCalc.v2.3.1.apk: Android package (APK), with zipflinger virtual entry, with APK Signing Block

charcircuit · 2 years ago
apks are also zipaligned so it's not like random users are going to be making them either
awaythrow999 · 2 years ago
Wonder how this would handle a polyglot[0][1] that is valid as a PDF document, a ZIP archive, and a Bash script that runs a Python webserver, which hosts Kaitai Struct's WebIDE, allowing you to view the file's own annotated bytes.

[0]: https://www.alchemistowl.org/pocorgtfo/

[1]: https://www.alchemistowl.org/pocorgtfo/pocorgtfo16.pdf

Edit: just tested, and it does only identify the zip layer

rvnx · 2 years ago
You can try it here: https://google.github.io/magika/

It's relatively limited compared to `file` (~10% coverage); it's more like a specialized classifier for basic file formats, so such cases are really out of scope.

I guess it's more for detecting common file formats with high recall, then.

However, where is the actual source of the model? Let's say I want to add a new file format myself.

Apparently only the source of the interpreter is here, not the source of the model nor the training set, which is the most important thing.

tempay · 2 years ago
Is there anything about the performance on unknown files?

I've tried a few that aren't "basic" but are widely used enough to be well supported in libmagic, and Magika thinks they're zip files. I know enough about the underlying formats to know they're not using zip as a container under the hood.

kevincox · 2 years ago
Apparently the Super Mario Bros. 3 ROM is 100% a SWF file.

Cool that you can use it online though. Might end up using it like that. Although it seems like it may focus on common formats.

alexandreyc · 2 years ago
Yes, I totally agree; it's not what I would qualify as open source.

Do you plan to release the training code along the research paper? What about the dataset?

In any case, it's very neat to have an ML-based technique and a lightweight model for such tasks!

lopkeny12ko · 2 years ago
I don't understand why this needs to exist. Isn't file type detection inherently deterministic by nature? A valid tar archive will always have the same magic bytes. An ELF binary has a universal ELF magic and header. If the magic is bad, then the file is corrupted and not a valid XYZ file. What's the value in throwing "heuristics" and probabilistic inference into a process that is black and white by design?
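
For illustration, a minimal sketch of the kind of deterministic magic-byte check being described (offsets per the ELF header and POSIX ustar formats):

    def sniff(path: str) -> str:
        with open(path, "rb") as f:
            head = f.read(512)
        if head.startswith(b"\x7fELF"):
            return "ELF binary"
        if head[257:262] == b"ustar":   # POSIX/GNU tar magic sits at offset 257
            return "tar archive"
        return "unknown"
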
TacticalCoder · 2 years ago
> What's the value in throwing in "heuristics" and probabilistic inference into a process that is black and white by design.

I use the file command all the time. The value is when you get this:

    ... $  file somefile.xyz
    somefile.xyz: data
AIUI from reading TFA, magika can determine more filetypes than what the file command can detect.

It'd actually be very easy to determine if there's any value in magika: run file on every file on your filesystem and then for every file where the file command returns "data", run magika and see if magika is right.

If it's right, there's your value.
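
A throwaway sketch of that experiment (assuming both the `file` and `magika` CLIs are on PATH):

    import subprocess, sys
    from pathlib import Path

    root = Path(sys.argv[1] if len(sys.argv) > 1 else ".")
    for p in root.rglob("*"):
        if not p.is_file():
            continue
        kind = subprocess.run(["file", "-b", str(p)],
                              capture_output=True, text=True).stdout.strip()
        if kind == "data":
            # file gave up; ask magika about the same blob.
            guess = subprocess.run(["magika", str(p)],
                                   capture_output=True, text=True).stdout.strip()
            print(p, "->", guess)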

P.S: it may also be easier to run on Windows than the file command? But then I can't do much to help people who are on Windows.

Eiim · 2 years ago
From elsewhere in this thread, it appears that Magika detects far fewer file types than file (116 vs ~1600), which makes sense. For file, you just need to drop in a few rules to add a new, somewhat obscure type. An AI approach like Magika will need lots of training and test data for each new file type. Where Magika might have a leg up is with distinguishing different textual data files (i.e., source code), but I don't see that as a particularly big use case honestly.
cle · 2 years ago
It's not always deterministic; sometimes it's fuzzy depending on the file type. An example of this is a one-line CSV file. I tested one case of that: libmagic detects it as a text file, while magika correctly detects it as a CSV (and gives a confidence score, which is killer).
alkonaut · 2 years ago
But even with determinism, it's not always right. It's not too rare to find a text file with a byte order mark indicating UTF-16 (0xFE 0xFF) but then actually containing UTF-8. But what "format" does it have then? Is it UTF-8 or UTF-16? Same with e.g. a jar file missing a manifest. That's just a zip, even though I'm sure some runtime might eat it.

But the question is: when do you actually have to guess the format of a file? Is it when reverse engineering? Last time I did something like this was in the 90s, when trying to pick apart some texture from a directory of files called asset0001.k, and it turned out it was a bitmap or whatever. Fun times.

vintermann · 2 years ago
Consider: it's perfectly possible for a file to fit two or more file formats - polyglot files are a hobby for some people.

And there are also a billion formats that are not uniquely determined by magic bytes. You don't have to go further than text files.

KOLANICH · 2 years ago
This tool doesn't work this way.
potatoman22 · 2 years ago
This also works for formats like Python, HTML, and JSON.
lopkeny12ko · 2 years ago
I still don't see how this is useful. The only time I want to answer the question "what type of file is this" is if it is an opaque blob of binary data. If it's a plain text file like Python, HTML, or JSON, I can figure that out by just catting the file.
LiamPowell · 2 years ago
file (https://www.darwinsys.com/file/) already detects all these formats.
amelius · 2 years ago
Yes, but shouldn't the file type be part of the file, or (better) of the metadata of the file?

Knowing is better than guessing.

YoshiRulz · 2 years ago
So instead of spending some of their human resources to improve libmagic, they used some of their computing power to create an "open source" neural net, which is technically more accurate than the "error-prone" hand-written rules (ignoring that it supports far fewer filetypes), and which is much less effective in an adversarial context, and they want it to "help other software improve their file identification accuracy," which of course it can't since neural nets aren't introspectable. Thanks guys.
12_throw_away · 2 years ago
Come on, how can you not be impressed by this amazing AI tech? That gives us sci-fi tools like ... a less-accurate, incomplete, stochastic, un-debuggable, slower, electricity-guzzling version of `file`.
og_kalu · 2 years ago
>So instead of spending some of their human resources to improve libmagic

A large megacorp can work on multiple things at once.

>an "open source" neural net, which is technically more accurate than the "error-prone" hand-written rules (ignoring that it supports far fewer filetypes)

You say that like it's a contradiction but it's not.

>and which is much less effective in an adversarial context,

Is it? This seems like an assumption.

>and they want it to "help other software improve their file identification accuracy," which of course it can't since neural nets aren't introspectable.

Being introspectable or not has no bearing on the accuracy of a system.

YoshiRulz · 2 years ago
> > an "open source" neural net, which is technically more accurate than the "error-prone" hand-written rules (ignoring that it supports far fewer filetypes)

> You say that like it's a contradiction but it's not.

> > and they want it to "help other software improve their file identification accuracy," which of course it can't since neural nets aren't introspectable.

> Being introspectable or not has no bearing on the accuracy of a system.

"Open source" and "neural net" is the contradiction, as I went on to write. Even if magika were a more accurate version of file, the implication that it could "help [libmagic] improve" isn't really true, because how do you distill the knowledge from it into a patch for libmagic?

My point re: their "error-prone" claim is that their comparison was disingenuous due to the functionality difference between the tools. (Also with their implication that AIs work perfectly, though this one sounds pretty good by the numbers. I of course accept that there's likely to be some bugs in code written by humans.)

> > and which is much less effective in an adversarial context,

> Is it? This seems like an assumption.

It is, one based on what I've heard about AI classifiers over the years. Other commenters here are interested in this point, but while I don't see anyone experimenting on magika (it's new, after all), the fact that it's not mentioned in the article leads me to believe they didn't try to attack it themselves. (Or they did, but with bad results, and so decided not to include that. Funnily enough, they did mention adversarial attacks on manually-written classifiers...)