Recently I've been working on making LLM evaluations fast by using Bayesian optimization to select a sensible subset of the dataset.
Bayesian optimization is used because it's good at trading off exploration and exploitation when querying an expensive black box (read: the LLM).
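To make that a bit more concrete, here is a minimal sketch of the general idea (this is not the library's actual API; the placeholder embeddings, the dummy scoring function, and the UCB acquisition rule are all illustrative assumptions): fit a Gaussian-process surrogate over the embeddings of the examples you've already run, and let an upper-confidence-bound rule pick which example to evaluate next.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)

# Placeholder embeddings for a 1,000-example eval set; in practice these
# would come from a sentence encoder.
embeddings = rng.normal(size=(1000, 16))

def run_expensive_eval(idx: int) -> float:
    """Stand-in for actually querying the LLM and scoring its answer."""
    return float(rng.uniform())  # replace with a real metric

# Seed the surrogate with a handful of random examples.
evaluated: dict[int, float] = {}
for idx in rng.choice(len(embeddings), size=5, replace=False):
    evaluated[int(idx)] = run_expensive_eval(int(idx))

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)

for _ in range(20):  # evaluation budget beyond the random seeds
    X = embeddings[list(evaluated)]
    y = np.array(list(evaluated.values()))
    gp.fit(X, y)

    # Upper-confidence-bound acquisition over the *unevaluated* examples:
    # high sigma = unexplored "vibes" (exploration), high mu = exploitation.
    candidates = np.array([i for i in range(len(embeddings)) if i not in evaluated])
    mu, sigma = gp.predict(embeddings[candidates], return_std=True)
    best = candidates[np.argmax(mu + 2.0 * sigma)]
    evaluated[int(best)] = run_expensive_eval(int(best))

print(f"estimated score from {len(evaluated)} runs: {np.mean(list(evaluated.values())):.3f}")
```

The point is just that the surrogate's uncertainty naturally steers the evaluation budget toward parts of the dataset that haven't been covered yet.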
I would love to hear your thoughts and suggestions on this!
For instance, if there are three conversations that you can use to test if your AI is working correctly:
(1) HUMAN: "Please say hello"
(2) HUMAN: "Please say goodbye" (3) HUMAN: "What is 2 + 2?" Let's say you can only pick two conversations to evaluate how good your AI is. Would you pick 1 & 2? Probably not. You'd pick 1 & 3, or 2 & 3.Because Embeddings allow us to determine how similar in vibes things are, we have a tool with which we can automatically search over our dataset for things that have very different vibes, meaning that each evaluation run is more likely to return new information about how well the model is doing.
My question to the OP was mostly about whether this "vibe-differentiated dataset" is constructed prior to the evaluation run, or populated gradually based on each individual test case result.
so anyway it's just vibes man
And in general, if you have an LLM that performs really well on one d_i, then who cares? The goal in LLM evaluation is to find an LLM that performs well overall.
Finally, your Abstract and other snippets read like an LLM wrote them.
Good luck.
If instead you can reduce the number of samples to look at and automatically find "clusters" and their performance, you get a win. It won't be the "average performance number", but it will (hopefully) give you an understanding of which things work how well in the LLM.
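Something like this rough sketch (my own illustration with placeholder embeddings and a dummy scoring function, not the library's code) is what I have in mind: cluster the eval set in embedding space, evaluate a few representatives per cluster, and report per-cluster averages instead of one global number.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 16))      # placeholder sentence embeddings

def run_expensive_eval(idx: int) -> float:    # stand-in for the real LLM metric
    return float(rng.uniform())

k = 8
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embeddings)

for c in range(k):
    members = np.flatnonzero(labels == c)
    picked = rng.choice(members, size=min(5, len(members)), replace=False)
    scores = [run_expensive_eval(int(i)) for i in picked]
    print(f"cluster {c}: ~{np.mean(scores):.2f} "
          f"({len(members)} examples, {len(picked)} actually evaluated)")
```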
The main drawback here (as far as I can tell from this short glimpse) is the embedding itself. This will only work well if distance in the embedding space really correlates with performance. However, we know from adversarial attacks that even small changes in the embedding space can lead to vastly different results.
I know what an LLM is and I know very well what Bayesian optimization is. But I don't understand what this library is trying to do.
I am guessing it's trying to test the model's ability to generate correct and relevant responses to a given input.
But who is the judge?
So the pipeline goes training -> evaluation -> deployment (inference).
Hope that explanation helps!
https://rentruewang.github.io/bocoel/research/
What the OP is doing here is completely opaque to the rest of us.
Evaluation refers to the phase after training, where you check whether the training actually worked.
Usually the flow goes training -> evaluation -> deployment (what you called inference). This project is aimed at the evaluation step. Evaluation can be slow (it might even be slower than training if you're finetuning on a small domain-specific subset)!
So there are [quite](https://github.com/microsoft/promptbench) [a](https://github.com/confident-ai/deepeval) [few](https://github.com/openai/evals) [frameworks](https://github.com/EleutherAI/lm-evaluation-harness) working on evaluation; however, all of them are quite slow, because LLMs are slow if you don't have infinite money. [This](https://github.com/open-compass/opencompass) one tries to speed things up by parallelizing across multiple computers, but none of them take advantage of the fact that many evaluation queries might be similar: they all try to evaluate on every given query. And that's where this project might come in handy.
Nobody would chime in on an article on "faster unit testing in software with..." and complain that it's not clear because "is it a history unit? a science unit? what kind of tests are those students taking!?", so I find it odd that on HN people often complain about something similar for a very popular niche in this community.
If you're interested in LLMs, the term "evaluation" should be very familiar, and if you're not interested in LLMs then this post likely isn't for you.
In particular, this is only working off the encoded sentences, so it seems to me that things that involve attention etc. aren't being evaluated here.
That misreading could also suggest some interesting research directions. Bayesian optimization to choose some parameters which guide which subset of the neurons to include in the inference calculation? Why not.
I agree with the other comments. You need to do a much better job of motivating and contextualizing the research problem, as well as explaining your method in specific, precise language in the README and other documentation (preferably in the README itself). You should make it clear that you are using GLUE and Big-Bench for evaluation (along with any other benchmarks you use), be explicit about which LLMs and embedding models you have tested, and state which datasets you trained and evaluated on. You also need graphs and tables comparing your method's speed and evaluation performance against the SOTA.

I like the reference/overview section that shows the diagram (I think you should put it in the README to make it more visible to first-time visitors). However, the class descriptions are cryptic. For example, the Score class says "Evaluate the target with respect to the references." I had no idea what that meant, and I had to google some of the class names to get an idea of what Score was trying to do. That's true for pretty much all the classes. You also need to explain what the factory classes are and how they differ from the model classes, e.g. why does the bocoel.models.adaptors class require a score and a corpus (from the overview), while factories.adaptor requires "GLUE", lm, and choices (looking at the code in examples/getting_started/__main__.py)?

That said, I do like the fact that you have an example (although I haven't tried running it).
Not trying to make excuses tho. Your points are very valid and I will take them into account!
OP here. I came up with this idea while chatting with a friend about how to make LLM evaluations fast (they're so painfully slow on large datasets) and realized that somehow no one had tried it. So I decided to give it a go!