Readit News
syllogistic · 3 years ago
Huh, yeah it repros. Java is faster: 159s vs 203s for the 256 tokens on my 12th-gen Intel i9.
brucethemoose2 · 3 years ago
> 159s for the 256

This is still extremely slow for that CPU, compared to the quantized model.

IIRC the llama.cpp f32 code is basically a placeholder.

BUT the threading overhead is a known performance issue, and I'm sure Java handles that better.
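For context on why the quantized model is so much faster on CPU: inference is mostly memory-bandwidth bound, and Q8-style block quantization stores one signed byte per weight plus one float scale per block instead of four bytes per weight. A minimal sketch of the idea in plain Java (illustrative only, not llama.cpp's actual Q8_0 code):

```java
public class Q8 {
    static final int BLOCK = 32; // weights quantized in blocks of 32

    // Quantize one block of 32 floats to signed bytes sharing a single scale.
    // Returns the scale needed to dequantize.
    static float quantizeBlock(float[] x, int off, byte[] q) {
        float max = 0f;
        for (int i = 0; i < BLOCK; i++) max = Math.max(max, Math.abs(x[off + i]));
        float scale = max / 127f;
        float inv = scale == 0f ? 0f : 1f / scale;
        for (int i = 0; i < BLOCK; i++) q[i] = (byte) Math.round(x[off + i] * inv);
        return scale;
    }

    // Reconstruct the floats; error per weight is at most scale / 2.
    static void dequantizeBlock(byte[] q, float scale, float[] out, int off) {
        for (int i = 0; i < BLOCK; i++) out[off + i] = q[i] * scale;
    }

    public static void main(String[] args) {
        float[] x = new float[BLOCK];
        for (int i = 0; i < BLOCK; i++) x[i] = (float) Math.sin(i);
        byte[] q = new byte[BLOCK];
        float scale = quantizeBlock(x, 0, q);
        float[] y = new float[BLOCK];
        dequantizeBlock(q, scale, y, 0);
        System.out.println("per-weight error bound: " + (scale / 2));
    }
}
```

The matmul kernel can then stream a quarter of the bytes from RAM, which is where most of the speedup comes from.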

version_five · 3 years ago
> threading overhead is a known performance issue

I didn't know about it, though I should have... are there any "edge" frameworks as complete as ggml/llama.cpp that you know of that are faster now? ggml is still very easy to use, which I like, but I'd always thought of it as the fastest, particularly on CPU; I hadn't noticed there were known performance issues.

version_five · 3 years ago
Where does the performance difference come from? And on what kind of processor and GPU? I didn't even know llama.cpp had a 32-bit option. For now I'm pretty skeptical that it's a fair comparison.
tjake · 3 years ago
The default for `convert.py` is F32, so this is just a SIMD CPU comparison.

Jlama uses the Vector API in Java 20, but also does better thread scheduling with work stealing and zero allocation.
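A minimal sketch of the work-stealing part (illustrative, not Jlama's actual code): Java parallel streams run on the common ForkJoinPool, whose idle workers steal row chunks from busy ones, and writing into a caller-supplied output buffer keeps the hot loop allocation-free.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

public class MatVec {
    // y = A * x with rows split across the common ForkJoinPool;
    // work stealing keeps cores busy even if some rows finish early.
    static void matVec(float[] a, float[] x, float[] y, int rows, int cols) {
        IntStream.range(0, rows).parallel().forEach(r -> {
            float sum = 0f;
            int off = r * cols;
            for (int c = 0; c < cols; c++) sum += a[off + c] * x[c];
            y[r] = sum; // preallocated output buffer: no per-call allocation
        });
    }

    public static void main(String[] args) {
        float[] a = new float[12];           // 4x3 matrix: 0..11
        for (int i = 0; i < a.length; i++) a[i] = i;
        float[] x = {1f, 1f, 1f};
        float[] y = new float[4];
        matVec(a, x, y, 4, 3);
        System.out.println(Arrays.toString(y)); // prints [3.0, 12.0, 21.0, 30.0]
    }
}
```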

belfthrow · 3 years ago
Could you link to some of the examples in your repo where you enforce the zero allocation? I don't see much reuse of buffers, e.g. float buffers, and there is quite a lot of array-based heap allocation. Just for my own interest, many thanks. Cool to see the use of the new Vector API as well.
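For readers unfamiliar with the term: "zero allocation" in inference code usually means allocating scratch buffers once and overwriting them on every step, so a long generation loop produces no garbage. A hypothetical sketch of the pattern (names invented, not from the Jlama repo):

```java
public class ScratchBuffers {
    // allocated once in the constructor, reused on every call
    private final float[] hidden;
    private final float[] logits;

    ScratchBuffers(int dim, int vocab) {
        hidden = new float[dim];
        logits = new float[vocab];
    }

    // Each call overwrites the same two arrays, so the GC sees
    // no per-token garbage. The returned array is only valid
    // until the next call.
    float[] step(float[] embedding) {
        System.arraycopy(embedding, 0, hidden, 0, hidden.length);
        for (int i = 0; i < logits.length; i++) {
            logits[i] = hidden[i % hidden.length]; // stand-in for the real math
        }
        return logits;
    }
}
```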
version_five · 3 years ago
Very interesting, I'll watch for the quantized version.