Besides the many issues with the desktop itself, Windows offers piss-poor filesystem performance for common developer tools, and leaves users to contend with the complexity of a split world thanks to the (very slow) 9pfs shares used to present host filesystems to the guest and vice versa.
And then there are the many nasty and long-lived bugs, from showstopping memory leaks to data loss on the guests' virtual disks to broken cursor tracking for GUI apps in WSLg...
While I agree that Windows has long since abandoned UI/UX consistency, it's not like that is unique: on desktop Linux I regularly have a mix of Qt/KDE, GTK2, GTK3+/libadwaita and Electron GUIs and dialogs (with every JS GUI framework being a different UI/UX experience). I'm sure libcosmic/iced and others will be added to the mix eventually too.
1. An encoder takes an input (e.g. text) and turns it into a numerical representation (e.g. an embedding).
2. A decoder takes an input (e.g. text) and then extends that text.
(There are also encoder-decoders, but I won't go into those.)
These two simple definitions immediately give information on how they can be used. Decoders are at the heart of text generation models, whereas encoders return embeddings with which you can do further computations. For example, if your encoder model is finetuned for it, the embeddings can be fed through another linear layer to give you classes (e.g. token classification like NER, or sequence classification for full texts). Or the embeddings can be compared with cosine similarity to determine the similarity of questions and answers. This is at the core of information retrieval/search (see https://sbert.net/). Such similarity between embeddings can also be used for clustering, etc.
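For illustration, the comparison step itself is tiny. Here's a minimal cosine-similarity sketch in plain C, assuming you already have two embedding vectors as float arrays (how you get those vectors depends entirely on the encoder model you use):

    #include <math.h>

    /* Cosine similarity between two embedding vectors of length n.
       Returns a value in [-1, 1]; higher means more similar. */
    float cosine_similarity(const float *a, const float *b, int n) {
        float dot = 0.0f, norm_a = 0.0f, norm_b = 0.0f;
        for (int i = 0; i < n; i++) {
            dot += a[i] * b[i];
            norm_a += a[i] * a[i];
            norm_b += b[i] * b[i];
        }
        /* small epsilon guards against division by zero for all-zero vectors */
        return dot / (sqrtf(norm_a) * sqrtf(norm_b) + 1e-12f);
    }

Ranking documents against a query is then just "embed everything, compute this against the query embedding, sort".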
In my humble opinion (but it's perhaps a dated opinion), (encoder-)decoders are for when your output is text (chatbots, summarization, translation), and encoders are for when your output is literally anything else. Embeddings are your toolbox, you can shape them into anything, and encoders are the wonderful providers of these embeddings.
Is there a difference here other than encoder-only transformers being bidirectional and their primary output (rather than a byproduct) being input embeddings? Is there a reason, other than that bidirectionality, that we use specific encoder-only embedding models instead of just cutting and pasting a decoder-only model's embedding phase?
See the precompute_input_logits() and forward() functions here:
https://github.com/ryao/llama3.c/blob/master/run.c#L520
As a preface, precompute_input_logits() is really just a generalized version of forward() that operates on multiple input tokens at a time for faster input processing; it can be used in place of forward() for output generation simply by passing a single token at a time.
Also, my apologies for the code being a bit messy. matrix_multiply() and batched_matrix_multiply() are wrappers for GEMM, which I ended up having to call directly anyway when I needed strided access, and matmul() is a wrapper for GEMV, which is really just a special case of GEMM. This is a work-in-progress personal R&D project built on prior work by others (which spared me the legwork of implementing the less interesting parts of inferencing), so it is not meant to be pretty.
Anyway, my purpose in providing that link is to show what is needed to do inferencing (on llama 3). You have a bunch of matrix weights, plus a lookup table for vectors that represent tokens, in memory. Then your operations are:
* memcpy()
* memset()
* GEMM (GEMV is a special case of GEMM)
* sinf()
* cosf()
* expf()
* sqrtf()
* rmsnorm (see the C function for the definition)
* softmax (see the C function for the definition)
* Addition, subtraction, multiplication and division.
I list rmsnorm and softmax for completeness, but they can be implemented in terms of the other operations. If you can do those, you can do inferencing. You don't really need very specialized things. Over 95% of the time will be spent in GEMM, too.
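For reference, here is roughly what those two helpers look like in llama2.c-style C. This is a sketch, not necessarily the exact code in my repo; the 1e-5f epsilon in particular is an assumption, since the value varies between implementations:

    #include <math.h>

    /* RMSNorm: scale x by the reciprocal of its root-mean-square,
       then by a learned per-element weight. */
    void rmsnorm(float *out, const float *x, const float *weight, int size) {
        float ss = 0.0f;
        for (int j = 0; j < size; j++) ss += x[j] * x[j];
        ss = 1.0f / sqrtf(ss / size + 1e-5f); /* epsilon is an assumption */
        for (int j = 0; j < size; j++) out[j] = weight[j] * (ss * x[j]);
    }

    /* Softmax in place, with the usual max-subtraction for numerical stability. */
    void softmax(float *x, int size) {
        float max_val = x[0];
        for (int i = 1; i < size; i++) if (x[i] > max_val) max_val = x[i];
        float sum = 0.0f;
        for (int i = 0; i < size; i++) { x[i] = expf(x[i] - max_val); sum += x[i]; }
        for (int i = 0; i < size; i++) x[i] /= sum;
    }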
My next step will likely be figuring out how to implement fast GEMM kernels on my CPU. My own SGEMV code outperforms Intel MKL's SGEMV on my CPU (a Ryzen 7 5800X, where one core can use all of the memory bandwidth), but my initial attempts at SGEMM have not fared as well; I will likely figure it out eventually. After I do, I can try adapting this to FP16, and then memory usage will finally be low enough that I can port it to a GPU with 24GB of VRAM. That would let me do what I say is possible rather than just assert it, as I do here.
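For concreteness, a naive row-major SGEMM (the textbook starting point, nowhere near MKL's performance) looks something like the sketch below; register blocking, cache tiling, packing and SIMD are where the real work is:

    /* Naive SGEMM: C = A * B, with A of size M x K, B of size K x N,
       and C of size M x N, all row-major. The i-k-j loop order keeps the
       inner accesses to B and C sequential. */
    void sgemm_naive(int M, int N, int K, const float *A, const float *B, float *C) {
        for (int i = 0; i < M; i++) {
            for (int j = 0; j < N; j++) C[i * N + j] = 0.0f;
            for (int k = 0; k < K; k++) {
                float a_ik = A[i * K + k];
                for (int j = 0; j < N; j++) {
                    C[i * N + j] += a_ik * B[k * N + j];
                }
            }
        }
    }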
By the way, the llama.cpp project has already figured all of this out and has things running on both GPUs and CPUs with just about every major quantization. I am rolling my own to teach myself how things work. By happy coincidence, I am somehow outperforming llama.cpp in prompt processing on my CPU, but sadly, the secret of how I am doing it lives inside Intel's proprietary cblas_sgemm_batch() function. However, since I know the hardware can perform at that level, I can keep trying ideas for my own implementation until I get something that performs as well or better.
Perhaps you can reverse engineer it?
I'll certainly check out PEX. I think single-binary distribution is probably the biggest remaining packaging problem for Python right now; Java solved it with the JAR file, and most compiled languages solve it with statically compiled binaries.
At Google, PAR files have been quite a useful way to handle this for at least 18 years now; I hope this can get reasonably solved everywhere.
JARs still require a JRE to run, and the runtime needs to be invoked. The closest Python equivalent is probably a zipapp (your Python files zipped into a .pyz that the interpreter can run directly).
Static binaries are one advantage that languages like Go and Rust have though, yeah.
https://medium.com/@JuliusHuijnk/experiment-in-evolving-the-...
To be honest I haven't read the original article yet, since it's huge. It seems to aim for something comparable.
I didn't notice a link to any code. Would you be open to sharing it? I'd love to take a look at how you did things and play around with it myself.