I have to agree with you. For a long time I thought Intel was the best, but unfortunately I can't say that anymore.
I must admit I still don't understand how it happened, but NVIDIA is now the best.
When doing performance optimization on CPUs, I was impressed with Intel's suite of tools (like VTune). NVIDIA has some unbelievable tools, like Nsight Systems (nsys) and, of course, its container registry (NGC), which I think surpasses even Intel's software support.
I will be archiving the full report with more results soon.
I've been following TriLLM. They've achieved great results, and I'm really impressed with the llama.cpp contributors already getting the models integrated.
Suppose that Transformers die tomorrow and Mamba becomes all the rage. The released Mamba code already has CUDA kernels for inference and training. Any of the CSPs or other NVIDIA GPU users can switch their entire software stack to train and run inference on Mamba models. Meanwhile, we'd be completely dead in the water, along with similar companies that made the same bet, like Etched.
On a side note, I looked deeply into every company in the space and was thoroughly unimpressed with how little they cared about the software stack needed to make their hardware work seamlessly. So even if I did go work at some other hardware company, I doubt many customers would actually use the hardware.
Surely you’d need more ternary weights, though, to achieve the same performance outcome?
A bit like how a Q4 quant is smaller than a Q8 but also tangibly worse, so the “compression” isn’t really like-for-like.
Either way, excited about more ternary progress.
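For intuition, here's a minimal NumPy sketch of threshold-based ternary quantization. The threshold factor and per-tensor scale are illustrative choices on my part, not any particular model's recipe; the point is to see the reconstruction error that {-1, 0, +1} codes leave behind, which is exactly why a ternary model may need extra parameters to match a higher-bit quant:

```python
import numpy as np

def ternary_quantize(w, thresh_factor=0.75):
    """Quantize a float weight matrix to {-1, 0, +1} plus a per-tensor scale.

    Hypothetical threshold scheme: weights with magnitude below a cutoff
    are zeroed; the rest become +/-1 times a single scale factor.
    """
    delta = thresh_factor * np.mean(np.abs(w))          # zeroing threshold
    q = np.where(np.abs(w) > delta, np.sign(w), 0.0)    # ternary codes
    mask = q != 0
    # Scale chosen as the mean magnitude of the surviving weights.
    alpha = np.abs(w[mask]).mean() if mask.any() else 0.0
    return q, alpha

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)
q, alpha = ternary_quantize(w)

# Each ternary weight needs ~1.58 bits (log2(3)) vs. 4 bits for a Q4 quant,
# so per-parameter storage shrinks a lot -- the open question above is how
# many *extra* parameters the ternary model needs to win that back.
err = np.linalg.norm(w - alpha * q) / np.linalg.norm(w)
print(f"relative reconstruction error: {err:.3f}")
```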
1. They are everywhere and aren't going anywhere.
2. The network infrastructure to ingest and analyze video footage from thousands of cameras is very demanding.
3. Low power and low latency scream ASIC to me.
https://intapi.sciendo.com/pdf/10.2478/ijanmc-2022-0036#:~:t...
Edit: I don't think the company exists in its current form anymore
The end plan is to have a single chip and load all the weights onto the chip at initialization. Because we integrate as a single line of Torch-compatible (hence HF-compatible) code, no other part of the codebase should need to change.
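To illustrate what a "single line of code" Torch integration could look like, here's a hypothetical sketch that swaps every nn.Linear in a model for a ternary-weight emulation. TernaryLinear, ternarize, and the quantization recipe are all invented names and choices for illustration, not the company's actual API:

```python
import torch
import torch.nn as nn

class TernaryLinear(nn.Module):
    """Drop-in replacement for nn.Linear with ternary weights.

    The real accelerator would hold the quantized weights on-chip;
    here we just emulate the math on CPU.
    """
    def __init__(self, linear: nn.Linear):
        super().__init__()
        w = linear.weight.data
        self.alpha = w.abs().mean()  # per-tensor scale
        # Zero small weights, keep the sign of the rest: codes in {-1, 0, +1}.
        q = torch.sign(torch.where(w.abs() > 0.75 * self.alpha,
                                   w, torch.zeros_like(w)))
        self.register_buffer("q", q)
        self.bias = linear.bias

    def forward(self, x):
        return nn.functional.linear(x, self.alpha * self.q, self.bias)

def ternarize(model: nn.Module) -> nn.Module:
    """The 'single line' integration point: recursively swap every nn.Linear."""
    for name, child in model.named_children():
        if isinstance(child, nn.Linear):
            setattr(model, name, TernaryLinear(child))
        else:
            ternarize(child)
    return model

model = ternarize(nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4)))
out = model(torch.randn(2, 16))
print(out.shape)  # torch.Size([2, 4])
```

Because the swapped module keeps nn.Linear's forward signature, the rest of a Torch or HF pipeline runs unmodified, which is the appeal of this style of integration.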