Llama.cpp, I think, has a ton of clone-and-own boilerplate, presumably from having grown so quickly (I believe one of their .cu files is somewhere around 10k lines at the moment).
While I haven't seen the model storage and distribution format, the rewrite to GGUF for file storage seems to have been a big boon to the project. Thanks GG! Cool stuff. Also, he's a really nice guy to boot. Please say hi from Fern to him if you ever run into him. I mean it literally: make his life a hellish barrage of nonstop greetings from Fern.
Thank you for the reference to the CUDA file [1]. It's always nice to see how complex data structures are handled on GPUs. Does anyone have any idea what the bit patterns starting at line 1529 are for?
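My guess is that it's related to how the 4-bit block formats pack two quants per byte and then rescale, something along these lines. This is only a hedged sketch of the general idea: the block size, the offset of 8, and the function name are my assumptions, not ggml's actual layout.

    #include <stdint.h>

    /* My guess at the idea behind the bit twiddling: 4-bit block formats pack
       two quants per byte, then rescale. The block size, the offset of 8, and
       this function are assumptions, not ggml's actual layout. */
    void dequantize_q4_block(const uint8_t qs[16], float scale, float out[32]) {
        for (int i = 0; i < 16; ++i) {
            out[i]      = ((qs[i] & 0x0F) - 8) * scale;  /* low nibble: first half   */
            out[i + 16] = ((qs[i] >> 4) - 8) * scale;    /* high nibble: second half */
        }
    }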
I honestly think having a way to just use JSON (as safetensors does), msgpack, or some other lightweight metadata serializer is a better route than coming up with a new file format. That's also why I just use SQLite to serialize the metadata (and the tensor weights too, though that part is arguably an oversight).
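For what it's worth, the SQLite route can be as small as a key/value table plus a blob table. A rough sketch; the schema, table names, and tensor name are all just illustrative, not a real format:

    #include <sqlite3.h>
    #include <stddef.h>

    /* Illustrative only: one key/value table for metadata, one table of named
       tensor blobs. Schema, table names, and the tensor name are invented. */
    int save_model(const char *path, const float *weights, size_t nbytes) {
        sqlite3 *db = NULL;
        if (sqlite3_open(path, &db) != SQLITE_OK) return 1;

        sqlite3_exec(db,
            "CREATE TABLE IF NOT EXISTS metadata(key TEXT PRIMARY KEY, value TEXT);"
            "CREATE TABLE IF NOT EXISTS tensors(name TEXT PRIMARY KEY, data BLOB);"
            "INSERT OR REPLACE INTO metadata VALUES('general.architecture','llama');",
            NULL, NULL, NULL);

        sqlite3_stmt *st = NULL;
        sqlite3_prepare_v2(db, "INSERT OR REPLACE INTO tensors VALUES(?, ?);", -1, &st, NULL);
        sqlite3_bind_text(st, 1, "tok_embeddings.weight", -1, SQLITE_STATIC);
        sqlite3_bind_blob(st, 2, weights, (int)nbytes, SQLITE_STATIC);
        sqlite3_step(st);
        sqlite3_finalize(st);

        return sqlite3_close(db) == SQLITE_OK ? 0 : 1;
    }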
GGUF is cleaner to read in languages that don't have a JSON parsing library, and it works with memory mapping in C. That makes it very appealing for minimal inference frameworks compared with the other options.
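To make the memory-mapping point concrete, here is a rough sketch of reading the fixed-width GGUF header straight out of an mmap'd file. The field widths assume the v2+ header (uint32 version, uint64 counts) as I understand the spec, and error handling is mostly omitted:

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Sketch of reading the fixed GGUF header from an mmap'd file. Field widths
       assume the v2+ header (uint32 version, uint64 counts); verify against the
       spec before relying on this. */
    int main(int argc, char **argv) {
        if (argc < 2) return 1;
        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) return 1;

        struct stat st;
        if (fstat(fd, &st) != 0) return 1;

        const uint8_t *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) return 1;

        if (memcmp(p, "GGUF", 4) != 0) { fprintf(stderr, "not a GGUF file\n"); return 1; }

        uint32_t version;
        uint64_t n_tensors, n_kv;
        memcpy(&version,   p + 4,  sizeof version);
        memcpy(&n_tensors, p + 8,  sizeof n_tensors);
        memcpy(&n_kv,      p + 16, sizeof n_kv);

        printf("GGUF v%u: %llu tensors, %llu metadata key/value pairs\n",
               version, (unsigned long long)n_tensors, (unsigned long long)n_kv);

        munmap((void *)p, st.st_size);
        close(fd);
        return 0;
    }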
Since LLMs have only minor changes between architectures, would it make sense to just embed the model, compiled to some sort of simple bytecode, right in the GGUF file? Then new operations would only need to be implemented when researchers come up with a model that gains enough traction to be of interest.
Not really. We've been down that road before. Embedding the computation graph in the file makes changes to the graph harder (you need to make sure it stays backward compatible). That's OK in general (we have ONNX already), but once you have dynamic shapes, and given that the different optimizations we implemented are actually tied to the computation graph, it's simply not optimal. (BTW, this is why PyTorch just embeds the code into the .pth file: much easier to keep backward compatible than a static computation graph.)
wait, why is embedding the graph into the file bad?
it enables a really clean separation between the core autodiff library and whatever backend you want to use to accelerate the graph computations; the backend can simply read the file and be completely independent of the core implementation
but also, if you just store the tensors in some arbitrary order and then store the indices of the order in which they have to be read and traversed, you can easily adjust the graph to add things like layer fusion or something similar (i'm not really familiar with comp graph optimisations, tbh)
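roughly, what i mean is something like this invented sketch: the nodes sit in a flat table and a separate index array gives the evaluation order, so a fusion pass only has to rewrite that order array rather than move any tensor data (all names here are made up):

    #include <stdint.h>

    /* Invented serialization sketch: a flat node table plus a separate
       evaluation-order array. A fusion pass can splice in a fused node and
       rewrite eval_order without moving any tensor data. */
    enum op_kind { OP_INPUT, OP_MATMUL, OP_ADD, OP_SILU };

    struct node {
        enum op_kind op;
        int32_t src0, src1;    /* indices into the node table; -1 if unused      */
        int64_t tensor_off;    /* byte offset of this node's weights in the file */
    };

    struct graph {
        uint32_t n_nodes;
        struct node *nodes;    /* stored in arbitrary order                      */
        uint32_t *eval_order;  /* n_nodes indices, in the order to evaluate them */
    };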
The bytecode would not even need to be Turing-complete. Or maybe it could take inspiration from eBPF, which gives some guarantees. What you posted is related to the design oversight in Python's pickle format.
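A hedged sketch of what such a restricted "bytecode" could look like: a straight-line list of tensor ops with no jumps, so termination is guaranteed and the eBPF-style verifier reduces to bounds checks. None of this is an existing GGUF feature; all names are invented.

    #include <stddef.h>
    #include <stdint.h>

    /* Invented instruction set: a straight-line program over tensor "slots",
       no branches or loops, so execution always terminates. Not an existing
       GGUF feature. */
    typedef enum { BC_MATMUL, BC_ADD, BC_SOFTMAX, BC_RMS_NORM } bc_op;

    typedef struct {
        bc_op    op;
        uint32_t dst, src0, src1;  /* tensor slot indices */
    } bc_insn;

    /* The eBPF-style "verifier" shrinks to a bounds check on the slot indices. */
    int bc_verify(const bc_insn *prog, size_t n_insns, uint32_t n_slots) {
        for (size_t i = 0; i < n_insns; ++i) {
            if (prog[i].dst >= n_slots || prog[i].src0 >= n_slots || prog[i].src1 >= n_slots)
                return 0;
        }
        return 1;
    }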
It seems like a lot of the innovation is around training, no? GGML (the library that reads the GGUF format) supports these values for the required 'general.architecture':
I've also been trying to figure out GGUF and the other model formats going around. I'm horrified to see there are no model architecture details in the file! As you say, it seems they are hard-coding the above architectures as constants. If a hot new model comes out, you'd need to update the reader code (so that it has the new model architecture implemented). Am I understanding this right?
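In other words, something like this, if I'm reading it right (the enum and function here are invented, not llama.cpp's actual symbols):

    #include <string.h>

    /* Toy illustration of the hard-coded dispatch: the enum and function are
       invented, not llama.cpp's actual symbols. A brand-new architecture string
       falls through to ARCH_UNKNOWN until someone extends the reader. */
    enum llm_arch { ARCH_LLAMA, ARCH_FALCON, ARCH_GPT2, ARCH_UNKNOWN };

    enum llm_arch arch_from_metadata(const char *general_architecture) {
        if (strcmp(general_architecture, "llama")  == 0) return ARCH_LLAMA;
        if (strcmp(general_architecture, "falcon") == 0) return ARCH_FALCON;
        if (strcmp(general_architecture, "gpt2")   == 0) return ARCH_GPT2;
        return ARCH_UNKNOWN;
    }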
I'm also a bit confused by the quantization aspect; it's a pretty complex topic. GGML seems to use 16-bit as per the article. If I was pushing it to 8-bit, would I see any size improvement in the GGML file? The article says they encode quantization versions in that file. Where are those defined?
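My back-of-the-envelope understanding is that the usual 8-bit block schemes store a per-block scale plus one byte per weight, so a block that is 64 bytes in fp16 comes out around 34-36 bytes, i.e. roughly half, but I'd love confirmation. A sketch of what I mean, loosely in the spirit of ggml's Q8_0 (the block size and float scale storage are my assumptions, not the exact format):

    #include <math.h>
    #include <stdint.h>

    /* Sketch of an 8-bit block quantizer loosely in the spirit of ggml's Q8_0:
       each block of 32 fp32 weights becomes one scale plus 32 int8 values, so
       64 bytes of fp16 shrink to roughly 34-36 bytes. Block size and float
       (rather than fp16) scale storage are assumptions, not the exact format. */
    #define QBLOCK 32

    typedef struct {
        float  scale;
        int8_t q[QBLOCK];
    } block_q8_sketch;

    void quantize_block(const float *x, block_q8_sketch *out) {
        float amax = 0.0f;
        for (int i = 0; i < QBLOCK; ++i) {
            const float a = fabsf(x[i]);
            if (a > amax) amax = a;
        }
        out->scale = amax / 127.0f;
        const float inv = out->scale != 0.0f ? 1.0f / out->scale : 0.0f;
        for (int i = 0; i < QBLOCK; ++i)
            out->q[i] = (int8_t)lroundf(x[i] * inv);
    }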
[1] https://github.com/ggerganov/llama.cpp/blob/master/ggml-cuda...
GG is Georgi Gerganov
what would an alternative look like anyway?
https://www.bleepingcomputer.com/news/security/malicious-ai-...