Readit News
cubie commented on EuroBERT: A High-Performance Multilingual Encoder Model   huggingface.co/blog/EuroB... · Posted by u/druskacik
cubie · 6 months ago
Looks very solid; I'm excited for finetuned variants for retrieval and reranking.
cubie commented on A Replacement for BERT   huggingface.co/blog/moder... · Posted by u/cubie
dmezzetti · 8 months ago
Great news here. It will take some time for it to trickle downstream, but expect to see better vector embedding models, entity extraction and more.
cubie · 8 months ago
Spot on
cubie commented on A Replacement for BERT   huggingface.co/blog/moder... · Posted by u/cubie
pantsforbirds · 8 months ago
Awesome news and something I really want to check out for work. Has anyone seen any RAG evals for ModernBERT yet?
cubie · 8 months ago
Not yet - these are base models, or "foundational models". They're great for molding into different use cases via finetuning, better in fact than common models like BERT, RoBERTa, etc., but like those models, these ModernBERT checkpoints can only do one thing out of the box: mask filling.
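Concretely, using a base checkpoint looks roughly like this (a minimal sketch; I'm assuming the Hub id is answerdotai/ModernBERT-base, so double-check that):

```python
from transformers import pipeline

# Minimal mask-filling sketch; the model id is an assumption, check the Hub.
fill_mask = pipeline("fill-mask", model="answerdotai/ModernBERT-base")
masked = f"Paris is the {fill_mask.tokenizer.mask_token} of France."
print(fill_mask(masked))  # top candidate tokens for the masked position
```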

For other tasks, such as retrieval, we still need people to finetune them. The ModernBERT documentation has some scripts for finetuning with Sentence Transformers and PyLate for retrieval: https://huggingface.co/docs/transformers/main/en/model_doc/m... But people still need to train and release these models. I have high hopes for them.
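As a rough idea of what such a retrieval finetune looks like with Sentence Transformers (just a sketch, not a tested recipe; the model id and dataset here are example choices):

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MultipleNegativesRankingLoss

# Sketch only: the model id and dataset are example choices, not a tested recipe.
model = SentenceTransformer("answerdotai/ModernBERT-base")
train_dataset = load_dataset("sentence-transformers/natural-questions", split="train")
loss = MultipleNegativesRankingLoss(model)  # in-batch negatives over (query, answer) pairs

trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=loss)
trainer.train()
model.save_pretrained("modernbert-base-retrieval")
```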

cubie commented on A Replacement for BERT   huggingface.co/blog/moder... · Posted by u/cubie
querez · 8 months ago
Two questions:

1) Going by the Runtime vs GLUE graph, ModernBERT-Base is roughly as fast as BERT-Base. Given its architecture (especially Alternating Attention), I'm curious why the model is not considerably faster than its predecessor. Any insight you could share on that?

2) Most modern LLMs are Encoder+Decoder models. Why not chop off the decoder of one of these (e.g. a small Llama or Mistral or other liberally-licensed model) and train a short head on top?

cubie · 8 months ago
Beyond what the others have said about 1) ModernBERT-base being 149M parameters vs BERT-base's 110M and 2) most LLMs being decoder-only models, also consider that alternating attention (local vs global) only starts helping once you're processing longer texts. With short texts, local attention is equivalent to global attention. I'm not sure what length was used in the picture, but GLUE is mostly pretty short text.
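If you want to see the effect yourself, a rough timing loop like this (not a careful benchmark; the model id is an assumption) should only show a real gap at longer sequence lengths:

```python
import time

import torch
from transformers import AutoModel, AutoTokenizer

# Rough sketch, not a careful benchmark: batching, padding and hardware all matter.
# The model id is an assumption; swap in whichever checkpoint you're comparing.
model_id = "answerdotai/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).eval()

for seq_len in (128, 512, 2048, 8192):
    text = "hello " * seq_len  # dummy input, truncated to the target length below
    inputs = tokenizer(text, truncation=True, max_length=seq_len, return_tensors="pt")
    with torch.inference_mode():
        start = time.perf_counter()
        model(**inputs)
        print(f"{inputs['input_ids'].shape[1]} tokens: {time.perf_counter() - start:.3f}s")
```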
cubie commented on A Replacement for BERT   huggingface.co/blog/moder... · Posted by u/cubie
EGreg · 8 months ago
Can you go into detail for those of us who aren't as well versed in the tech?

What do the encoders do vs the decoders, in this ecosystem? What are some good links to learn about these concepts on a high level? I find almost all of the writing about different layers and architectures a bit arcane and inscrutable, especially when it comes to Attention and Self-Attention with multiple heads.

cubie · 8 months ago
On a very high level, for NLP:

1. an encoder takes an input (e.g. text), and turns it into a numerical representation (e.g. an embedding).

2. a decoder takes an input (e.g. text), and then extends the text.

(There are also encoder-decoders, but I won't go into those.)
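
In transformers terms, the split looks roughly like this (a sketch; the model ids are just common examples):

```python
from transformers import pipeline

# Encoder: text in, vectors (embeddings) out. Model ids are just common examples.
encoder = pipeline("feature-extraction", model="google-bert/bert-base-uncased")
token_embeddings = encoder("Encoders turn text into numbers.")

# Decoder: text in, a continuation of that text out.
decoder = pipeline("text-generation", model="openai-community/gpt2")
continuation = decoder("Decoders extend text, for example", max_new_tokens=20)
```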

These two simple definitions immediately give information on how they can be used. Decoders are at the heart of text generation models, whereas encoders return embeddings with which you can do further computations. For example, if your encoder model is finetuned for it, the embeddings can be fed through another linear layer to give you classes (e.g. token classification like NER, or sequence classification for full texts). Or the embeddings can be compared with cosine similarity to determine the similarity of questions and answers. This is at the core of information retrieval/search (see https://sbert.net/). Such similarity between embeddings can also be used for clustering, etc.
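
For example, the similarity comparison looks roughly like this with Sentence Transformers (a sketch; the checkpoint is just a commonly used example):

```python
from sentence_transformers import SentenceTransformer, util

# Sketch: the checkpoint is a commonly used example; any embedding model works.
model = SentenceTransformer("all-MiniLM-L6-v2")
query_embedding = model.encode("How do I reset my password?")
answer_embeddings = model.encode([
    "Click 'Forgot password' on the login page.",
    "Our office is closed on public holidays.",
])
print(util.cos_sim(query_embedding, answer_embeddings))  # higher = more similar
```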

In my humble opinion (but it's perhaps a dated opinion), (encoder-)decoders are for when your output is text (chatbots, summarization, translation), and encoders are for when your output is literally anything else. Embeddings are your toolbox: you can shape them into anything, and encoders are the wonderful providers of those embeddings.
