For other tasks, such as retrieval, the base models still need to be finetuned. The ModernBERT documentation has some scripts for finetuning with Sentence Transformers and PyLate for retrieval: https://huggingface.co/docs/transformers/main/en/model_doc/m... But people still need to train and release these models. I have high hopes for them.
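For a sense of what that finetuning looks like, here is a minimal sketch using the Sentence Transformers v3 training API. The model id, dataset, and output path are illustrative assumptions, not taken from the linked docs:

    # Minimal retrieval finetuning sketch (model id and dataset are assumptions).
    from datasets import load_dataset
    from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
    from sentence_transformers.losses import MultipleNegativesRankingLoss

    model = SentenceTransformer("answerdotai/ModernBERT-base")  # assumed checkpoint
    # Any dataset of (anchor, positive) text pairs works with this loss.
    train_dataset = load_dataset("sentence-transformers/natural-questions", split="train")
    loss = MultipleNegativesRankingLoss(model)  # uses in-batch negatives

    trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=loss)
    trainer.train()
    model.save_pretrained("modernbert-retrieval")  # illustrative output path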
1) Going by the Runtime vs GLUE graph, ModernBERT-Base is roughly as fast as BERT-Base. Given its architecture (especially the alternating attention), I'm curious why the model isn't considerably faster than its predecessor. Any insight you could share on that?
2) Most modern LLMs are encoder+decoder models. Why not chop off the decoder of one of these (e.g. a small Llama or Mistral or another liberally-licensed model) and train a small head on top?
What do the encoders do vs the decoders in this ecosystem? What are some good links to learn about these concepts at a high level? I find most of the writing about different layers and architectures a bit arcane and inscrutable, especially when it comes to attention and self-attention with multiple heads.
1. An encoder takes an input (e.g. text) and turns it into a numerical representation (e.g. an embedding).
2. A decoder takes an input (e.g. text) and then extends that text.
(There are also encoder-decoders, but I won't go into those.)
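To make the distinction concrete, here is a rough sketch of both roles using the Hugging Face transformers library. The model ids are just examples, and mean pooling is one of several pooling choices:

    import torch
    from transformers import AutoTokenizer, AutoModel, AutoModelForCausalLM

    text = "ModernBERT is an encoder."

    # Encoder: text in, numerical representation out.
    enc_tok = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")  # example id
    encoder = AutoModel.from_pretrained("answerdotai/ModernBERT-base")
    with torch.no_grad():
        hidden = encoder(**enc_tok(text, return_tensors="pt")).last_hidden_state
    embedding = hidden.mean(dim=1)  # naive mean pooling over token states
    print(embedding.shape)  # e.g. torch.Size([1, 768])

    # Decoder: text in, extended text out.
    dec_tok = AutoTokenizer.from_pretrained("gpt2")  # example id
    decoder = AutoModelForCausalLM.from_pretrained("gpt2")
    out = decoder.generate(**dec_tok(text, return_tensors="pt"), max_new_tokens=20)
    print(dec_tok.decode(out[0]))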
These two simple definitions immediately give information on how they can be used. Decoders are at the heart of text generation models, whereas encoders return embeddings with which you can do further computations. For example, if your encoder model is finetuned for it, the embeddings can be fed through another linear layer to give you classes (e.g. token classification like NER, or sequence classification for full texts). Or the embeddings can be compared with cosine similarity to determine the similarity of questions and answers. This is at the core of information retrieval/search (see https://sbert.net/). Such similarity between embeddings can also be used for clustering, etc.
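As a tiny illustration of the retrieval case (the model id and texts here are made up for the example):

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence embedding model
    query = "How do I reset my password?"
    answers = [
        "Go to Settings > Account > Reset password.",
        "Our office is open from 9 to 5 on weekdays.",
    ]
    # Cosine similarity between the query embedding and each answer embedding.
    scores = util.cos_sim(model.encode(query), model.encode(answers))
    print(scores)  # the first answer should score higher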
In my humble opinion (but it's perhaps a dated opinion), (encoder-)decoders are for when your output is text (chatbots, summarization, translation), and encoders are for when your output is literally anything else. Embeddings are your toolbox, you can shape them into anything, and encoders are the wonderful providers of these embeddings.