It seems it's only going to cover the first book, though (which makes sense, given how difficult the other two would be to film). The real magic for me was in book 3. It was inspiring to see someone think so far out, so boldly.
"The main claim [...] is both somewhat obvious and previously already stated"
"Many pieces of writing are overly assertive and inaccurate."
"I do not think it deserves spending half a page demonstrating that {0^n 1^n} is not a regular language."
LLaVA 1.5 is very good, at least at describing images. http://llava.hliu.cc/
I wanted to call this out, though, as it makes the case that to improve any component (and really make it production-worthy), you need an evaluation system:
> Intrinsic evaluation is like a unit test, while extrinsic evaluation is like an integration test. You do need both. It’s very common to start building an evaluation set, and find that your ideas about how you expect the component to behave are much vaguer than you realized. You need a clear specification of the component to improve it, and to improve the system as a whole. Otherwise, you’ll end up in a local maximum: changes to one component will seem to make sense in themselves, but you’ll see worse results overall, because the previous behavior was compensating for problems elsewhere. Systems like that are very difficult to improve.
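To make the unit-test / integration-test analogy concrete, here's a rough sketch of the two kinds of checks (the `extract_entities` / `answer_question` components and the example data are hypothetical, not from the article):

```python
# Intrinsic evaluation: exercise one component in isolation against a
# small labelled set, like a unit test. (extract_entities is hypothetical.)
EXTRACTION_EXAMPLES = [
    ("Send the invoice to Acme Corp by Friday.", {"Acme Corp"}),
    ("Ping Dana and Lee about the Q3 report.", {"Dana", "Lee"}),
]

def test_extraction_component():
    for text, expected in EXTRACTION_EXAMPLES:
        assert extract_entities(text) == expected

# Extrinsic evaluation: score the whole pipeline on the end task, like an
# integration test, so a component change that "looks better" in isolation
# but hurts overall behavior gets caught. (answer_question is hypothetical.)
END_TO_END_EXAMPLES = [
    ("Who should receive the invoice?", "Acme Corp"),
]

def test_full_pipeline():
    for question, expected in END_TO_END_EXAMPLES:
        assert expected in answer_question(question)
```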
I think this makes sense from the perspective of a team with deeper ML expertise.
What it doesn't mention is that this is an enormous effort, made even larger when you don't have existing ML expertise. I've been finding this one out the hard way.
I've found that if you have "hard criteria" to evaluate (e.g., getting the LLM to produce a given structure rather than open-ended output for a chat app), you can quantify improvements by using observability tools (SLOs!) and iterating in production. Ship changes daily, track versions of what you're doing, and keep on top of behavior over time. It's arguably a lot less "clean", but it's way faster, and because it works on real-world usage data, it's really effective. An ML engineer might call that some form of "online test", but I don't think the term really applies.
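As a rough sketch of what I mean, assuming structured JSON output (the required keys, the SLO threshold, and `emit_metric` are placeholders for whatever your observability stack provides):

```python
import json

REQUIRED_KEYS = {"title", "due_date", "priority"}  # the "hard criteria"

def check_llm_output(raw: str, prompt_version: str) -> bool:
    """Validate one production response and record the result as a metric."""
    try:
        parsed = json.loads(raw)
        ok = REQUIRED_KEYS <= parsed.keys() and parsed["priority"] in ("low", "med", "high")
    except (json.JSONDecodeError, AttributeError):
        ok = False
    # emit_metric is a stand-in for your metrics client. Tagging the prompt
    # version lets you compare success rates across the changes you ship
    # daily, and an SLO (say, 99% valid outputs over 7 days) turns "is the
    # new prompt better?" into a question your dashboards already answer.
    emit_metric("llm.output.valid", value=int(ok), tags={"prompt_version": prompt_version})
    return ok
```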
At any rate, there are use cases where you really do need evaluations. The more important correct output is, the more it's worth investing in evals. I would argue that if bad outputs have high consequences, then maybe LLMs aren't the right tech for the job yet, but that'll probably change in a few years. And hopefully making evaluations will get easier too.
Wow, interesting. Do you have an example of this?
I've realized that LLMs are fairly good at string-processing tasks that would otherwise need a really complex regex, so I can see the point in those cases.
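For example, pulling dates out of free-form text: the regex grows a new alternation for every format you encounter, while the LLM version is a single instruction (`call_llm` below is just a placeholder for whichever client you use):

```python
import re

text = "Kickoff is Jan 3rd, review on 2024-02-14, ship by the 1st of March."

# Regex route: workable, but every new date format means another branch.
DATE_RE = re.compile(
    r"\b(?:\d{4}-\d{2}-\d{2}"                       # 2024-02-14
    r"|[A-Z][a-z]{2,8}\.? \d{1,2}(?:st|nd|rd|th)?"  # Jan 3rd
    r"|\d{1,2}(?:st|nd|rd|th)? of [A-Z][a-z]+)\b"   # 1st of March
)
print(DATE_RE.findall(text))

# LLM route: one instruction also covers formats you didn't anticipate.
prompt = (
    "Extract every date mentioned in the text below and return a JSON list "
    "of ISO 8601 strings (assume the year is 2024).\n\n" + text
)
# print(call_llm(prompt))  # hypothetical helper wrapping your LLM API
```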
And basically all servers will have 8x A100s (maybe 4x). Nobody bothers with a single A100 (of course, in a VM you might have access to only one).
For those wondering: no, this is not the norm. My lab at CMU doesn't own any A100s (we have A6000s).
Kind of like the industry -> PhD decision.