If you’re interested in using one of the LLM applications I have in prod, check out https://hex.tech/product/magic-ai/ It has a free limit every month so you can give it a try and see how you like it. If you have feedback after using it, we’re always very interested to hear from users.
As far as fine-tuning in particular, our consensus is that there are easier options to try first. I personally have fine-tuned GPT models since 2022; here’s a silly post I wrote about doing it on GPT-2: https://wandb.ai/wandb/fc-bot/reports/Accelerating-ML-Conten...
1) you’re sampling a distribution; if you only sample once, your sample is not representative of the distribution.
For evaluating prompts and running in production, your hallucination rate is inversely proportional to the number of times you sample.
Sampling many times and voting is a highly effective (but slow) strategy.
There is almost zero value in evaluating a prompt by only running it once.
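The sample-and-vote strategy above can be sketched in a few lines. The `generate` callable here is a hypothetical stand-in for whatever LLM call you actually use; the canned answers are fake data for illustration:

```python
from collections import Counter

def sample_and_vote(generate, prompt, n=5):
    """Sample the model n times and return the majority answer
    plus its vote share.

    `generate` is a stand-in for your real LLM call
    (hypothetical signature: generate(prompt) -> str).
    """
    answers = [generate(prompt) for _ in range(n)]
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / n

# Fake "model" with canned answers, just to show the mechanics:
canned = iter(["yes", "no", "yes", "yes", "no"])
def fake_model(prompt):
    return next(canned)

answer, share = sample_and_vote(fake_model, "Is the sky blue?", n=5)
# answer == "yes", share == 0.6
```

A vote share near 0.5 is itself useful signal: it tells you the model is genuinely uncertain on that input, which a single sample would hide.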
2) Sequences are generated in order.
Asking an LLM to make a decision and justify its decision in that order is literally meaningless.
Once the “decision” tokens are generated, the justification does not influence them. It’s not like they happen “all at once”: output is generated in a specific sequence, and later output cannot magically influence output which has already been generated.
This is true for sequential outputs from an LLM (obviously), but it is also true inside a single output, which is itself just a sequence of tokens.
If you’re generating structured output (eg json, xml), where field order looks like it shouldn’t matter, and your output is something like {decision: …, reason: …}, the reason literally does nothing.
…but, it is valuable to “show the working out” when, as above, you then evaluate multiple solutions to a single request and pick the best one(s).
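A concrete fix for the field-ordering point: put the reasoning field before the decision field in your schema, so the reasoning tokens are generated first and can condition the decision tokens. A minimal sketch (the field names and schema shape are illustrative, not any particular API):

```python
import json

# Key order in the schema/prompt controls generation order:
# the model emits `reason` tokens first, so they can condition
# the `decision` tokens that follow.
GOOD_SCHEMA = {"reason": "string", "decision": "yes|no"}   # reasoning shapes the decision
BAD_SCHEMA  = {"decision": "yes|no", "reason": "string"}   # post-hoc rationalization only

prompt = (
    "Answer in JSON matching this schema, keys in this exact order:\n"
    + json.dumps(GOOD_SCHEMA)
)
```

Python dicts (and `json.dumps`) preserve insertion order, so the prompt text really does present `reason` first.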
Point 1 is also a good callout. I added something on this for the LLM judge section, but it’s relevant more broadly.
Sure. Ultimately, you want to use a KG to increase your ability to do great retrieval.
Why do graphs help with retrieval? Well, don’t overlook the classic PageRank example: graphs provide signal about the interconnectivity of the docs.
Also, sometimes the graph itself is the kind of object you want to retrieve over.
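The PageRank-as-retrieval-signal idea can be sketched with a tiny power-iteration implementation over a toy doc-link graph. Everything here (doc names, the blending comment) is illustrative, not a specific library’s API:

```python
def pagerank(links, damping=0.85, iters=50):
    """Tiny power-iteration PageRank.

    `links` maps each node to its list of outgoing links.
    """
    nodes = list(links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1.0 - damping) / n for v in nodes}
        for v, outs in links.items():
            if not outs:  # dangling node: spread its rank evenly
                for u in nodes:
                    new[u] += damping * rank[v] / n
            else:
                for u in outs:
                    new[u] += damping * rank[v] / len(outs)
        rank = new
    return rank

# Toy citation graph between docs: heavily-linked docs accumulate
# more rank, a signal you can blend with embedding similarity.
docs = {"intro": ["api"], "api": [], "faq": ["api", "intro"]}
scores = pagerank(docs)
# e.g. rerank candidates with something like:
#   final[d] = similarity[d] * (1 + scores[d])
```

Here "api" ends up ranked highest because both other docs link to it; that interconnectivity signal is exactly what a flat embedding index throws away.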
Ah! This is horrible advice. Why recommend reinventing the wheel when there is already great open source software available? Just use https://github.com/HumanSignal/label-studio/ or any other open source annotation tool you want to get started. These tools already cover pretty much all the possible use cases, and if they don’t, you can build on top of them instead of building from scratch.
If Label Studio looks like something they can use, that’s fine. If not, a day of vibecoding is worth the effort to make your partners with special knowledge comfortable.