If you’re interested in using one of the LLM applications I have in prod, check out https://hex.tech/product/magic-ai/ It has a free limit every month so you can give it a try and see how you like it. If you have feedback after using it, we’re always very interested to hear from users.
As far as fine-tuning in particular, our consensus is that there are easier options to try first. I personally have fine-tuned GPT models since 2022; here’s a silly post I wrote about doing it on GPT-2: https://wandb.ai/wandb/fc-bot/reports/Accelerating-ML-Conten...
1) you’re sampling a distribution; if you only sample once, your sample is not representative of the distribution.
For evaluating prompts and running in production, your hallucination rate is inversely proportional to the number of times you sample.
Sampling many times and voting is a highly effective (but slow) strategy.
There is almost zero value in evaluating a prompt by only running it once.
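The sample-and-vote strategy above can be sketched in a few lines. The `generate` callable here is a hypothetical stand-in for whatever LLM call you actually use; the canned answers are fake data for illustration:

```python
from collections import Counter

def sample_and_vote(generate, prompt, n=5):
    """Sample the model n times and return the majority answer
    plus its vote share.

    `generate` is a stand-in for your real LLM call
    (hypothetical signature: generate(prompt) -> str).
    """
    answers = [generate(prompt) for _ in range(n)]
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / n

# Fake "model" with canned answers, just to show the mechanics:
canned = iter(["yes", "no", "yes", "yes", "no"])
def fake_model(prompt):
    return next(canned)

answer, share = sample_and_vote(fake_model, "Is the sky blue?", n=5)
# answer == "yes", share == 0.6
```

A vote share near 0.5 is itself useful signal: it tells you the model is genuinely uncertain on that input, which a single sample would hide.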
2) Sequences are generated in order.
Asking an LLM to make a decision and justify its decision in that order is literally meaningless.
Once the “decision” tokens are generated, the justification does not influence them. It’s not like they happen “all at once”: output is generated in a specific sequence, and later output cannot magically influence output which has already been generated.
This is true for sequential outputs from an LLM (obviously), but it is also true inside a single output, which is itself just a sequence of tokens.
If you’re generating structured output (eg json, xml), where field order looks like it shouldn’t matter, and your output is something like {decision: …, reason: …}, the reason literally does nothing.
…but, it is valuable to “show the working out” when, as above, you then evaluate multiple solutions to a single request and pick the best one(s).
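A concrete fix for the field-ordering point: put the reasoning field before the decision field in your schema, so the reasoning tokens are generated first and can condition the decision tokens. A minimal sketch (the field names and schema shape are illustrative, not any particular API):

```python
import json

# Key order in the schema/prompt controls generation order:
# the model emits `reason` tokens first, so they can condition
# the `decision` tokens that follow.
GOOD_SCHEMA = {"reason": "string", "decision": "yes|no"}   # reasoning shapes the decision
BAD_SCHEMA  = {"decision": "yes|no", "reason": "string"}   # post-hoc rationalization only

prompt = (
    "Answer in JSON matching this schema, keys in this exact order:\n"
    + json.dumps(GOOD_SCHEMA)
)
```

Python dicts (and `json.dumps`) preserve insertion order, so the prompt text really does present `reason` first.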
Point 1 is also a good callout. I added something on this for the LLM judge section, but it’s relevant more broadly.
Sure. Ultimately, you want to use a KG to increase your ability to do great retrieval.
Why do graphs help with retrieval? Well, don’t overlook the classic PageRank example: graphs provide signal about the interconnectivity of the docs.
Also, sometimes the graph itself is the kind of object you want to retrieve over.
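The PageRank-as-retrieval-signal idea can be sketched with a tiny power-iteration implementation over a toy doc-link graph. Everything here (doc names, the blending comment) is illustrative, not a specific library’s API:

```python
def pagerank(links, damping=0.85, iters=50):
    """Tiny power-iteration PageRank.

    `links` maps each node to its list of outgoing links.
    """
    nodes = list(links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1.0 - damping) / n for v in nodes}
        for v, outs in links.items():
            if not outs:  # dangling node: spread its rank evenly
                for u in nodes:
                    new[u] += damping * rank[v] / n
            else:
                for u in outs:
                    new[u] += damping * rank[v] / len(outs)
        rank = new
    return rank

# Toy citation graph between docs: heavily-linked docs accumulate
# more rank, a signal you can blend with embedding similarity.
docs = {"intro": ["api"], "api": [], "faq": ["api", "intro"]}
scores = pagerank(docs)
# e.g. rerank candidates with something like:
#   final[d] = similarity[d] * (1 + scores[d])
```

Here "api" ends up ranked highest because both other docs link to it; that interconnectivity signal is exactly what a flat embedding index throws away.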
Ah! This is horrible advice. Why recommend reinventing the wheel when there is already great open source software available? Just use https://github.com/HumanSignal/label-studio/ or any other open source annotation tool you want to get started. These tools already cover pretty much all the possible use cases, and if they don’t, you can build on top of them instead of building from scratch.
If Label Studio looks like something they can use, that’s fine. If not, a day of vibecoding is worth the effort to make your partners with special knowledge comfortable.