GGUF, the Long Way Around

tbalsam · 2 years ago

Llama.cpp I think has a ton of clone-and-own boilerplate, presumably from having grown so quickly (I think one of their .cu files is over 10k lines or so, roughly, ATM).

While I haven't seen the model storage and distribution format, the rewrite to GGUF for file storage seems to have been a big boon/boost to the project. Thanks Phil! Cool stuff. Also, he's a really nice guy to boot. Please say hi from Fern to him if you ever run into him. I mean it literally, make his life a hellish barrage of nonstop greetings from Fern.

qrios · 2 years ago

Thank you for the reference to the CUDA file [1]. It's always nice to see how complex data structures are handled in GPUs. Does anyone have any idea what the bit patterns are for (starting at line 1529)?

[1] https://github.com/ggerganov/llama.cpp/blob/master/ggml-cuda...

thrtythreeforty · 2 years ago

Those have to do with dequantization. It involves table lookups and some adjusting math.

liuliu · 2 years ago

I honestly think have a way to just use json (a.k.a. safetensors) / msgpack or some lightweight metadata serializer is a better route than coming up with a new file format. That's also why I just use SQLite to serialize the metadata (and tensor weights, this part is an oversight).

andy99 · 2 years ago

Gguf is cleaner to read in languages that don't have a json parsing library, and works with memory mapping in C. It's very appealing for minimal inference frameworks vs other options.

fzzzy · 2 years ago

I think a binary format is obviously the right answer here.

andy99 · 2 years ago

> GPT-Generated Unified Format

GG is Georgi Gerganov

kristjansson · 2 years ago

Nothing like a good backronym

Deleted Comment

SOLAR_FIELDS · 2 years ago

“no yapping” gave me a bit of a chuckle. Quick way to ask the response to be brief I guess.

mc10 · 2 years ago

It would be cool to also discuss how llamafiles[1] work, and how they differ from the GGUF files.

[1]: https://github.com/Mozilla-Ocho/llamafile

null_point · 2 years ago

Cool. I was just learning about GGUF by creating my own parser for it based on the spec https://github.com/ggerganov/ggml/blob/master/docs/gguf.md (for educational purposes)

RicoElectrico · 2 years ago

As LLMs have quite minor changes between architectures, would it make sense to just embed the model compiled to some sort of simple bytecode right in the GGUF file? Then, only implement specific new operations when researchers come up with a new model that gains enough traction to be of interest.

liuliu · 2 years ago

Not really. We've been on that road before. Embedding computation graph into the file makes changes to the computation graph harder (you need to make sure it is backward compatible). This is OK in general (as we have onnx already), but then if you have dynamic shape and the fact that different optimizations we implemented are actually tied to the computation graph, this is simply not optimal. (BTW, this is why PyTorch just embed the code into the pth file, much easier and backward compatible than a static computation graph).

taminka · 2 years ago

wait, why is embedding the graph into the file bad?

it enables really clean separation of the core autodiff library and whatever backend you want to use to accelerate the graph computations, which can simply read the file and be completely independent of the core implementation

but also, if you just store the tensors in some arbitrary order and then store the indices of the order in which they have to read and traversed, you can easily adjust the graph to add stuff like layer fusion or smth similar (i'm not really familiar w/ comp graph optimisations tbh)

what would an alternative look like anyway?

sroussey · 2 years ago

Yeah, but you want to avoid remote code execution:

https://www.bleepingcomputer.com/news/security/malicious-ai-...

RicoElectrico · 2 years ago

The bytecode would not even need to be Turing-complete. Or maybe it could take inspiration from eBPF which gives some guarantees. What you posted is related to the design oversight of Python's pickle format.

rahimnathwani · 2 years ago

It seems like a lot of innovation is around training, no? GGML (the library that reads GGUF format) supports these values for the required 'general.architecture':

  llama
  mpt
  gptneox
  gptj
  gpt2
  bloom
  falcon
  rwkv

throwawaybbq1 · 2 years ago

I've also been trying to figure out GGUF and the other model formats going around. I'm horrified to see there is no model architecture details in the file! As you say, it seems they are hard-coding the above architectures as constants. If a new hot model comes out, one would need to update the reader code (which has the new model arch implemented). Am I understanding this right?

I'm also a bit confused by the quantization aspect. This is a pretty complex topic. GGML seems to use 16bit as per the article. If was pushing it to 8bit, I reckin I'd see no size improvement the GGML file? The article says they encode quantization versions in that file. Where are they defined?

cooper_ganglia · 2 years ago

I’ve been looking for a good resource on GGUF for the past week or so, the timing on this is awesome! Thanks!