Very unfavorably. Mostly because the ONNX models are FP32/FP16 (so roughly 3-4x the RAM use of a quantized GGUF), but also because llama.cpp is well optimized and has many features (prompt caching, grammar-constrained sampling, splitting across devices, context extension, CFG...).
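Rough back-of-envelope math on that 3-4x figure, weights only (ignoring KV cache and runtime overhead); the ~4.5 bits/weight for a Q4_K_M quant is my approximation:

```python
# Weight-memory estimate for a 7B-parameter model.
# Bits per weight are approximate: FP16 = 16, llama.cpp Q4_K_M ~= 4.5.
PARAMS = 7e9

def weight_gb(bits_per_weight: float) -> float:
    return PARAMS * bits_per_weight / 8 / 1e9

fp16 = weight_gb(16)   # typical FP16 ONNX export
q4 = weight_gb(4.5)    # Q4_K_M GGUF
print(f"FP16: {fp16:.1f} GB, Q4_K_M: {q4:.1f} GB, ratio: {fp16 / q4:.1f}x")
# -> FP16: 14.0 GB, Q4_K_M: 3.9 GB, ratio: 3.6x
```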
MLC's Apache TVM implementation is also excellent. The autotuning in particular is like black magic.
Speed can be improved. As for quick-and-dirty/hype solutions, I'm not sure.
I really hope ONNX gets the traction it deserves.