Very unfavorably. Mostly because the ONNX models are FP32/FP16 (so roughly 3-4x the RAM use of a quantized GGUF), but also because llama.cpp is well optimized and has many features (prompt caching, grammar-constrained sampling, splitting across devices, context extension, CFG...).
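Rough back-of-envelope math on that 3-4x figure, weights only (ignoring KV cache and runtime overhead); the ~4.5 bits/weight for a Q4_K_M quant is my approximation:

```python
# Weight-memory estimate for a 7B-parameter model.
# Bits per weight are approximate: FP16 = 16, llama.cpp Q4_K_M ~= 4.5.
PARAMS = 7e9

def weight_gb(bits_per_weight: float) -> float:
    return PARAMS * bits_per_weight / 8 / 1e9

fp16 = weight_gb(16)   # typical FP16 ONNX export
q4 = weight_gb(4.5)    # Q4_K_M GGUF
print(f"FP16: {fp16:.1f} GB, Q4_K_M: {q4:.1f} GB, ratio: {fp16 / q4:.1f}x")
# -> FP16: 14.0 GB, Q4_K_M: 3.9 GB, ratio: 3.6x
```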
MLC's Apache TVM implementation is also excellent. The autotuning in particular is like black magic.
Speed can be improved. As for quick-and-dirty/hype solutions, I'm not sure.
I really hope ONNX gets the traction it deserves.