Posted by u/renchuw 2 years ago
Show HN: Faster LLM evaluation with Bayesian optimization (github.com/rentruewang/bo...)
Recently I've been working on making LLM evaluations fast by using Bayesian optimization to select a sensible subset. Bayesian optimization is used because it's good at exploration / exploitation of an expensive black box (in other words, the LLM). I would love to hear your thoughts and suggestions on this!
eximius · 2 years ago
This is "evaluating" LLMs in the sense of benchmarking how good they are, not improving LLM inference in speed or quality, yes?
renchuw · 2 years ago
Correct.
enonimal · 2 years ago
This is a cool idea -- is this an inner-loop process (i.e. after each LLM evaluation, the output is considered to choose the next sample) or a pre-loop process (get a subset of samples before tests are run)?
ReD_CoDE · 2 years ago
It seems that you're the only one who understood the idea. I don't know whether current LLM evaluations use such a method or not, but the idea could make them 10 times faster
enonimal · 2 years ago
AFAICT, this is a more advanced way of using embeddings (which can encode the "vibe similarity" of prompts, not an official term) to determine where you get the most "bang for your buck" in terms of testing.

For instance, if there are three conversations that you can use to test if your AI is working correctly:

(1) HUMAN: "Please say hello"
    AI: "Hello!"

(2) HUMAN: "Please say goodbye"
    AI: "Goodbye!"

(3) HUMAN: "What is 2 + 2?"
    AI: "4!"

Let's say you can only pick two conversations to evaluate how good your AI is. Would you pick 1 & 2? Probably not. You'd pick 1 & 3, or 2 & 3.

Because Embeddings allow us to determine how similar in vibes things are, we have a tool with which we can automatically search over our dataset for things that have very different vibes, meaning that each evaluation run is more likely to return new information about how well the model is doing.
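
To make that concrete, here's a rough sketch of "search the dataset for very different vibes" (my own illustration, not this library's actual API; the embedding model and the greedy max-min rule are just assumptions):

    # Sketch: pick the k prompts whose embeddings are farthest apart,
    # so each expensive LLM call is more likely to tell you something new.
    import numpy as np
    from sentence_transformers import SentenceTransformer  # example embedder

    prompts = [
        "Please say hello",
        "Please say goodbye",
        "What is 2 + 2?",
    ]

    emb = SentenceTransformer("all-MiniLM-L6-v2").encode(prompts)
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # unit-normalize

    def pick_diverse(embeddings: np.ndarray, k: int) -> list[int]:
        """Greedy max-min selection: each new pick is the prompt least
        similar (by cosine) to everything already picked."""
        chosen = [0]  # start anywhere
        while len(chosen) < k:
            sims = embeddings @ embeddings[chosen].T  # cosine similarity to the chosen set
            closest = sims.max(axis=1)                # similarity to the nearest chosen prompt
            closest[chosen] = np.inf                  # never re-pick
            chosen.append(int(closest.argmin()))      # farthest-from-everything wins
        return chosen

    print([prompts[i] for i in pick_diverse(emb, k=2)])
    # likely one "say X" prompt plus the arithmetic one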

My question to the OP was mostly about whether or not this "vibe differentiated dataset" was constructed prior to the evaluation run, or populated gradually, based on each individual test case result.

so anyway it's just vibes man

renchuw · 2 years ago
This would be an inner-loop process. However, the selection is way faster than the LLM calls, so it shouldn't be noticeable (hopefully).
tartakovsky · 2 years ago
What is your goal? If d1, d2, d3, etc. is the dataset over which you're trying to optimize, then the goal is to find some best-performing d_i. In that case, you're not evaluating. You're optimizing. Your acquisition function even says so: https://rentruewang.github.io/bocoel/research/

And in general, if you have an LLM that performs really well on one d_i, then who cares. The goal in LLM evaluation is to find an LLM that performs well overall.

Finally, your Abstract and other snippets read as if an LLM wrote them.

Good luck.

doubtfuluser · 2 years ago
I disagree that the goal in „evaluation is to find a good performing LLM overall“. The goal in evaluation is to understand the performance of an LLM (on average). This approach is actually more about finding „areas“ where the LLM behaves well and where it does not (via the Gaussian process approximation). That is indeed an important problem to look at. Often you just run an LLM evaluation on 1000s of samples, some of them similar, and you don't learn anything new from the sample „what time is it, please“ over „what time is it“.

If instead you can reduce the number of samples to look at and automatically find „clusters“ and their performance, you get a win. It won't be the „average performance number“, but it will (hopefully) give you an understanding of which things work how well in the LLM.
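
To illustrate the Gaussian-process part with a toy sketch (my own example with made-up embeddings and scores, not the project's code): run the LLM on a small subset, fit a GP surrogate from embedding to score, and read off a predicted score plus an uncertainty everywhere else.

    # Toy illustration: GP surrogate over the embedding space.
    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF

    rng = np.random.default_rng(0)
    embeddings = rng.normal(size=(1000, 32))              # stand-in: embedded eval corpus
    evaluated = rng.choice(1000, size=50, replace=False)  # the few prompts actually run
    scores = rng.uniform(size=50)                         # stand-in: metric per evaluated prompt

    gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), normalize_y=True)
    gp.fit(embeddings[evaluated], scores)

    mean, std = gp.predict(embeddings, return_std=True)
    # `mean` estimates performance across the whole corpus;
    # high `std` marks "areas" of the embedding space we know little about,
    # i.e. good candidates for the next real LLM evaluation.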

The main drawback of this (as far as I can say after a short glimpse at it) is the embedding itself. This only works well if distance in the embedding space really correlates with performance. However, we know from adversarial attacks that even small changes in the embedding space can result in vastly different results.

skyde · 2 years ago
What do they mean by "evaluating the model on corpus" and "Evalutes the corpus on the model"?

I know what an LLM is and I know very well what Bayesian optimization is. But I don't understand what this library is trying to do.

I am guessing it's trying to test the model's ability to generate correct and relevant responses to a given input.

But who is the judge?

causal · 2 years ago
Same. "Evaluate" and "corpus" need to be defined. I don't think OP intended this to be clickbait, but without clarification it sounds like they're claiming 10x faster inference, which I'm pretty sure it isn't.
renchuw · 2 years ago
Hi, OP here. It's not 10 times faster inference, but faster evaluation. You evaluate on a dataset to check whether your model is performing well. This takes a lot of time (it might take longer than training if you are just finetuning a pre-trained model on a small dataset)!

So the pipeline goes training -> evaluation -> deployment (inference).

Hope that explanation helps!

deckar01 · 2 years ago
Evaluate is referring to measuring the accuracy of a model on a standard dataset for the purpose of comparing model performance. AKA benchmark.

https://rentruewang.github.io/bocoel/research/

skyde · 2 years ago
Right, I guess I am not familiar with how automated benchmarks for LLMs work. I assumed that deciding whether an LLM answer was good required human evaluation.
ragona · 2 years ago
The "eval" phase is done after a model is trained to assess its performance on whatever tasks you wanted it to do. I think this is basically saying, "don't evaluate on the entire corpus, find a smart subset."
renchuw · 2 years ago
Hi, OP here. You evaluate LLMs on corpora to measure their performance, right? Bayesian optimization is here to select points (in the latent space) and tell the LLM where to evaluate next. To be precise, entropy search is used here (coupled with some latent-space reduction techniques like the N-sphere representation and embedding whitening). Hope that makes sense!
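
Roughly, the loop looks like the sketch below (my own hand-wavy stand-in, not bocoel's real API: a plain "pick the most uncertain point" rule replaces the actual entropy search, and the scoring function is a dummy placeholder):

    # Toy inner loop: the surrogate decides WHERE in the embedding space the
    # expensive LLM gets called next, so similar queries don't all get run.
    # `run_llm_and_score` is a dummy placeholder, and plain "max predictive
    # uncertainty" stands in for the entropy search the project actually uses.
    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor

    def run_llm_and_score(query: str) -> float:
        # placeholder: prompt the LLM, score the answer against a reference
        return float(len(query) % 5) / 4.0

    def evaluate_subset(queries, embeddings, budget):
        # seed with two real evaluations, then let the surrogate pick the rest
        observed = {i: run_llm_and_score(queries[i]) for i in (0, 1)}
        gp = GaussianProcessRegressor(normalize_y=True)
        while len(observed) < budget:
            idx = list(observed)
            gp.fit(embeddings[idx], [observed[i] for i in idx])
            _, std = gp.predict(embeddings, return_std=True)
            std[idx] = -np.inf              # don't re-query what we've seen
            nxt = int(std.argmax())         # most-uncertain query gets the next LLM call
            observed[nxt] = run_llm_and_score(queries[nxt])
        return observed                     # small evaluated subset; the GP fills in the rest

The real acquisition step is presumably smarter than plain max-uncertainty, but the shape of the loop is the same: embed once, run the LLM a little, let the surrogate decide where to look next.
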
hackerlight · 2 years ago
The definition of "evaluate" isn't clear. Do you mean inference?
azinman2 · 2 years ago
What I don’t get from the webpage is what are you evaluating, exactly?
observationist · 2 years ago
This, exactly - what is meant by evaluate in this context? Is this more efficient inference using approximation, so you can create novel generations, or is it some test of model attributes?

What the OP is doing here is completely opaque to the rest of us.

renchuw · 2 years ago
Fair question.

Evaluate refers to the phase after training where you check whether the training went well.

Usually the flow goes training -> evaluation -> deployment (what you called inference). This project is aimed at evaluation. Evaluation can be slow (it might even be slower than training if you're finetuning on a small, domain-specific subset)!

So there are [quite](https://github.com/microsoft/promptbench) [a](https://github.com/confident-ai/deepeval) [few](https://github.com/openai/evals) [frameworks](https://github.com/EleutherAI/lm-evaluation-harness) working on evaluation; however, all of them are quite slow, because LLMs are slow if you don't have infinite money. [This](https://github.com/open-compass/opencompass) one tries to speed things up by parallelizing across multiple machines, but none of them takes advantage of the fact that many evaluation queries might be similar; they all evaluate on every given query. That's where this project might come in handy.

swalsh · 2 years ago
This is becoming so common in AI discussions. Everyone with a real use case is opaque, or just flat out doesn't talk. The ones who are talking have toy use cases. I think it's because it's so hard to build a moat, and techniques are one of the ways to build one.
PheonixPharts · 2 years ago
"Evaluation" has a pretty standard meaning in the LLM community the same way that "unit test" does in software. Evaluations are suites of challenges presented to an LLM to evaluate how well it does as a form of bench-marking.

Nobody would chime in on an article on "faster unit testing in software with..." and complain that it's not clear because "is it a history unit? a science unit? what kind of tests are those students taking!?", so I find it odd that on HN people often complain about something similar for a very popular niche in this community.

If you're interested in LLMs, the term "evaluation" should be very familiar, and if you're not interested in LLMs then this post likely isn't for you.

azinman2 · 2 years ago
There's lots to evaluate. If you're evaluating model quality, there are many benchmarks all trying to measure different things: accuracy in translation, common-sense reasoning, how well it stays on topic, whether it can regurgitate a reference from the prompt text, how biased the output is along societal dimensions, other safety measures, etc. I'm in the field but not an LLM researcher per se, so perhaps this is more meaningful to others, but given the post it seems useful to answer my question, which was: what _exactly_ is being evaluated?

In particular, this only works off the encoded sentences, so it seems to me that things that involve attention etc. aren't being evaluated here.

waldrews · 2 years ago
Unit testing isn't an overloaded term. Evaluation by itself is overloaded, though "LLM evaluation" disambiguates it. I first parsed the title as 'faster inference' rather than 'faster evaluation', even though I'm aware of what LLM evaluation is, because that's a probable path given 'show', 'faster', and 'LLM' in the context window.

That misreading could also suggest some interesting research directions. Bayesian optimization to choose some parameters which guide which subset of the neurons to include in the inference calculation? Why not.

renchuw · 2 years ago
Hi, OP here, sorry for the late reply. I am not actually "evaluating", but rather using the "side effect" of Bayesian optimization that it can zoom in on (or out of) regions of the latent space. Since embedders are so fast compared to LLMs, it saves time by sparing the LLM from evaluating on similar queries. Hope that makes sense!
azinman2 · 2 years ago
But aren’t you really just evaluating the embeddings / quality of the latent space then?

endernac · 2 years ago
I looked through the github.io documentation and skimmed the code and the research article draft. Correct me if I am wrong. What I think you are doing (at a high level) is: you create a corpus of QA tasks, embeddings, and similarity metrics. Then you somehow use NLP scoring and Bayesian optimization to find a subset of the corpus that best matches a particular evaluation task. Then you can just evaluate the LLM on this subset rather than the entire corpus, which is much faster.

I agree with the other comments. You need to do a much better job of motivating and contextualizing the research problem, as well as explaining your method in specific, precise language in the README and other documentation (preferably in the README). You should make it clear that you are using GLUE and Big-Bench for the evaluation (as well as any other evaluation benchmarks you are using). You should also be explicit about which LLMs and embedders you have tested and what datasets you used to train and evaluate on. You also need to add graphs and tables showing your method's speed and evaluation performance compared to the SOTA.

I like the reference/overview section that shows the diagram (I think you should put it in the README to make it more visible to first-time viewers). However, the descriptions of the classes are cryptic. For example, the Score class says "Evaluate the target with respect to the references." I had no idea what that meant, and I had to google some of the class names to get an idea of what Score was trying to do. That's true for pretty much all the classes. Also, you need to explain what the factory classes are and how they differ from the models classes, e.g. why does the bocoel.models.adaptors class require a score and a corpus (from overview), but factories.adaptor require "GLUE", lm, and choices (looking at the code from examples/getting_started/__main__.py)?

However, I do like the fact that you have an example (although I haven't tried running it).

renchuw · 2 years ago
Thanks for the feedback! The reason the "code" part is more complete than the "research" part is that I originally planned for it to just be a hobby project and only much later decided to try to be serious and turn it into a research work.

Not trying to make excuses tho. Your points are very valid and I will take them into account!

renchuw · 2 years ago
Side note:

OP here, I came up with this idea because I was chatting with a friend about how to make LLM evaluations fast (they are so painfully slow on large datasets) and realized that somehow no one had tried it. So I decided to give it a go!