skeletoncrew commented on "Llama 2 on ONNX runs locally" (github.com/microsoft/Llam...) · Posted by u/tmoneyy
brucethemoose2 · 2 years ago
Very unfavorably. Mostly because the ONNX models are FP32/FP16 (so ~3-4x the RAM use of a quantized llama.cpp model), but also because llama.cpp is well optimized and has many features (prompt caching, grammar-constrained sampling, device splitting, context extension, classifier-free guidance, ...).
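To make the RAM gap concrete, here is a back-of-the-envelope sketch; the 7B parameter count and ~0.5 bytes per weight for 4-bit quantization are illustrative assumptions, and KV cache and runtime overhead are ignored:

```python
# Rough weight-memory estimate for a hypothetical 7B-parameter model.
# Numbers are back-of-the-envelope assumptions, not measured values.
params = 7e9

fp32_gb = params * 4 / 1e9    # FP32: 4 bytes per weight   -> ~28 GB
fp16_gb = params * 2 / 1e9    # FP16: 2 bytes per weight   -> ~14 GB
q4_gb   = params * 0.5 / 1e9  # 4-bit quant: ~0.5 bytes    -> ~3.5 GB

print(f"FP32 {fp32_gb:.1f} GB, FP16 {fp16_gb:.1f} GB, 4-bit {q4_gb:.1f} GB")
print(f"FP16 / 4-bit ratio: {fp16_gb / q4_gb:.0f}x")  # ~4x, matching the claim
```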

MLC's Apache TVM implementation is also excellent. The autotuning in particular is like black magic.
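For context, TVM's auto-scheduler searches for a fast kernel schedule by measuring candidates on the actual device. A minimal sketch of that workflow, assuming TVM's `auto_scheduler` API; the matmul workload, trial count, and log filename are illustrative, not MLC's actual configuration:

```python
import tvm
from tvm import auto_scheduler, te

@auto_scheduler.register_workload
def matmul(N, M, K):
    # Declare a plain matmul; the scheduler searches for a fast schedule.
    A = te.placeholder((N, K), name="A")
    B = te.placeholder((K, M), name="B")
    k = te.reduce_axis((0, K), name="k")
    C = te.compute((N, M), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")
    return [A, B, C]

target = tvm.target.Target("llvm")  # swap in "cuda", "metal", etc.
task = auto_scheduler.SearchTask(func=matmul, args=(1024, 1024, 1024), target=target)

# Benchmark candidate schedules on the real hardware and log the results.
tune_option = auto_scheduler.TuningOptions(
    num_measure_trials=64,
    measure_callbacks=[auto_scheduler.RecordToFile("matmul_tuning.json")],
)
task.tune(tune_option)

# Build with the best schedule found during the search.
sch, args = task.apply_best("matmul_tuning.json")
func = tvm.build(sch, args, target)
```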

skeletoncrew · 2 years ago
I tried quite a few of these and the ONNX one seems the most elegantly put together of all. I’m impressed.

Speed can be improved. As for the quick-and-dirty, hype-driven alternatives, I'm not so sure.

I really hope ONNX gets the traction it deserves.
