simonw · 3 months ago
This is a post by a vendor that sells fine-tuning tools.

Here's a suggestion: show me a demo!

For the last two years I've been desperately keen to see just one good interactive demo that lets me see a fine-tuned model clearly performing better (faster, cheaper, more accurate results) than the base model on a task that it has been fine-tuned for - combined with extremely detailed information on how it was fine-tuned - all of the training data that was used.

If you want to stand out among all of the companies selling fine-tuning services, yet another "here's tasks that can benefit from fine-tuning" post is not the way to do it. Build a compelling demo!

scosman · 3 months ago
We don't sell fine-tuning tools - we're an open tool for finding the best way of running your AI workload. We support evaluating/comparing a variety of methods: prompting, prompt generators (few-shot, repairs), various models, and fine-tuning from 5 different providers.

The focus of the tool is that it lets you try them all, side by side, and easily evaluate the results. Fine-tuning is one tool in a tool chest, which often wins, but not always. You should use evals to pick the best option for you. This also sets you up to iterate (when you find bugs, want to change the product, or new models come out).

Re: demo -- would you want a demo or detailed evals and open datasets (honest question)? Single-shot examples are hard to compare, but the benefits usually come out in evals at scale. I'm definitely open to making this. Open to suggestions on what would be the most helpful (format and use case).

It's all on GitHub and free: https://github.com/kiln-ai/kiln

simonw · 3 months ago
I want a web page I can go to where I can type a prompt (give me a list of example prompts too) and see the result from the base model on one side and the result from the fine-tuned model on the other side.

To date, I still haven't seen evidence that fine-tuning works with my own eyes! It's really frustrating.

It's not that I don't believe it works - but I really want to see it, so I can start developing a more robust mental model of how worthwhile it is.

It sounds to me like you might be in a great position to offer this.

elliotto · 3 months ago
Chiming in here to say that I was tasked to implement a fine tuning method for my AI startup and I also couldn't find any actual implemented outputs. There are piles of tutorials and blog posts and extensive documentation on hugging face transformers about the tools provided to do this, but I was unable to find a single demonstration of 'here is the base model output' vs 'here is the fine tuned output'. Doesn't have to be online like you suggested, even a screenshot or text blob showing how the fine tuning affected it would be useful.

I am in a similar boat to you where I have developed a great sense for how the bots will respond to prompting and how much detail and context is required because I've been able to iterate and experiment with this. But have no mental model at all about how fine tuning is meant to perform.

cleverwebble · 3 months ago
I can't really show an interactive demo, but my team at my day job has been fine-tuning OpenAI models since GPT-3.5, and fine-tuning can drastically improve output quality and prompt adherence. Heck, we found you can reduce your prompt to very simple instructions and encode the style guidelines via your fine-tuning examples.

This really only works though if:

1) The task is limited to a relatively small domain. ("Relatively small" may be a misnomer, since most LLMs try to solve every problem at once; as long as you have the model specialize in even one specific field, fine-tuning can help you achieve superior results.)

2) You have high-quality examples. You don't need a lot (maybe 200 at most); quality beats quantity here.

Often, distillation is all you need. Eg, do some prompt engineering on a high quality model (GPT-4.1, Gemini-Pro, Claude, etc.) - generate a few hundred examples, optionally (ideally) check for correctness via evaluations, and then fine tune a smaller, cheaper model. The new fine tuned model will not perform as well at generalist tasks as before, but it will be much more accurate at your specific domain, which is what most businesses care about.
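The data side of that distillation loop is mostly bookkeeping. A minimal sketch that formats teacher-generated examples into OpenAI-style chat fine-tuning JSONL (the example data here is made up, and a real pipeline would generate the assistant answers by calling the stronger model):

```python
import json

def to_finetune_record(system, user, assistant):
    """Format one teacher-generated example as an OpenAI-style
    chat fine-tuning record (one JSON object per line)."""
    return {
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
            {"role": "assistant", "content": assistant},
        ]
    }

# Hypothetical examples produced by a stronger "teacher" model.
examples = [
    ("Classify sentiment as positive/negative.", "I loved it!", "positive"),
    ("Classify sentiment as positive/negative.", "Total waste of money.", "negative"),
]

with open("train.jsonl", "w") as f:
    for system, user, assistant in examples:
        f.write(json.dumps(to_finetune_record(system, user, assistant)) + "\n")

# train.jsonl now has one training example per line, ready to
# upload to a fine-tuning API for the smaller "student" model.
```

The resulting file is what you'd hand to the tuning provider; the eval step (checking the teacher's answers before training on them) happens on this same list of examples.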

jcheng · 3 months ago
200 examples at most, really?? I have been led to believe that (tens of) thousands is more typical. If you can get excellent results with that few examples, that changes the equation a lot.
dist-epoch · 3 months ago
I've seen many YouTube videos claiming that fine tuning can significantly reduce costs or make a smaller model perform like a larger one.

Most of them were not from fine-tuning tools or model sellers.

> how it was fine-tuned - all of the training data that was used

It's not that sophisticated. You just need a dataset of prompts and their expected answers, and obviously a way to score the results so you can guide the fine-tuning.
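The scoring piece can be as simple as exact-match over a held-out set. A toy sketch (the dataset and the `predict` stand-in are placeholders; a real `predict` would call the model being evaluated):

```python
def exact_match_score(dataset, predict):
    """Fraction of prompts where the model's answer exactly
    matches the expected answer (after trimming whitespace)."""
    correct = sum(
        1 for prompt, expected in dataset
        if predict(prompt).strip() == expected.strip()
    )
    return correct / len(dataset)

# Placeholder dataset of (prompt, expected answer) pairs.
dataset = [
    ("2+2=", "4"),
    ("Capital of France?", "Paris"),
]

# Stand-in for a real model call; gets one of the two right.
def predict(prompt):
    return {"2+2=": "4", "Capital of France?": "Lyon"}[prompt]

print(exact_match_score(dataset, predict))  # 0.5
```

For fuzzier tasks you'd swap exact match for a similarity metric or an LLM-as-judge, but the loop stays the same.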

simonw · 3 months ago
I've seen those same claims, in videos and articles all over the place.

Which is why it's so weird that I can't find a convincing live demo to see the results for myself!

tuyguntn · 3 months ago
> Here's a suggestion: show me a demo!

Yes, yes and yes again!

Also, please don't use GIFs in your demos! They drive me crazy: the playback speed never matches my information-absorption speed, and I can't pause, look closely, or go back - I just have to wait for the GIF's second loop.

ldqm · 3 months ago
I found Kiln a few months ago while looking for a UI to help build a dataset for fine-tuning a model on Grapheme-to-Phoneme (G2P) conversion. I’ve contributed to the repo since.

In my G2P task, smaller models were splitting phonemes inconsistently, which broke downstream tasks and caused a lot of retries - and higher costs. I fine-tuned Gemini, GPT-4o-mini, and some LLaMA and Qwen models on Fireworks.ai using Kiln, and it actually helped reduce those inconsistencies.

mettamage · 3 months ago
Naive question, are there good tutorials/places that teach us to implement RAG and fine tune a model? I don't know if it's even feasible. At the moment I create AI workflows for the company I work at to (semi-)automate certain things. But it's not like I could fine-tune Claude. I'd need my own model for that. But would I need a whole GPU cluster? Or could it be done more easily.

And what about RAG? Is it hard to create embeddings?

I'm fairly new with the AI part of it all. I'm just using full-stack dev skills and some well written prompts.

scosman · 3 months ago
Lots of tools exist for each of those separately (RAG and fine-tuning). We're working on combining them, but it's not ready yet.

You don't need a big GPU cluster. Fine-tuning is quite accessible via both APIs and local tools. It can be as simple as making API calls or using a UI. Some suggestions:

- https://getkiln.ai (my tool): lets you try all of the below, and compare/eval the resulting models

- API based tuning for closed models: OpenAI, Google Gemini

- API based tuning for open models: Together.ai, Fireworks.ai

- Local tuning for open models: https://unsloth.ai (can be run on Google Colab instances if you don't have local Nvidia GPUs).

Usually, building the training set and evaluating the resulting model are the hardest parts. Another plug: Kiln supports synthetic data gen and evals for these parts.
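On the RAG question above: creating embeddings isn't hard - you call an embedding model (API or local) and rank documents by cosine similarity against the query. A toy sketch with a stand-in bag-of-words "embedding" so it runs without any API (a real system would replace `embed` with an embedding-model call):

```python
import math
from collections import Counter

def embed(text):
    """Stand-in embedding: a bag-of-words vector.
    A real system would call an embedding model here."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, docs, k=1):
    """Return the k documents most similar to the query."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

docs = [
    "Refund policy: returns accepted within 30 days.",
    "Shipping times vary by region.",
]
print(retrieve("how do I get a refund", docs))
```

The retrieved documents then get pasted into the prompt as context; that's the whole "R" in RAG. Everything else (chunking, vector databases) is optimization on top of this loop.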

dedicate · 3 months ago
Interesting points! I'm always curious, though – beyond the theoretical benefits, has anyone here actually found a super specific, almost niche use case where fine-tuning blew a general model out of the water in a way that wasn't just about slight accuracy bumps?
scosman · 3 months ago
Yup! I'll have to write some of these up. I can probably do open datasets and evals too. If you have use cases you'd like to see let me know! Some quick examples (task specific performance):

- fine-tuning improved performance of Llama 70B from 3.62/5 (worse than Gemma 2B) to 4.27/5 (better than GPT 4.1), as measured by evals

- Generating valid JSON improved from <1% success rate to >95% after tuning

You can also optimize for cost/speed. I often see a 4x speedup and a 90%+ cost reduction, while matching task-specific quality.

jampekka · 3 months ago
Don't you get valid JSON success rate of 100% with constrained decoding with any model?
dist-epoch · 3 months ago
Fine tuning is also about reducing costs. If you can bake half the prompt in the model through fine tuning, this can halve the running costs.
genatron · 3 months ago
As an example, Genatron is made possible by fine-tuning, which lets it create entire applications that are valid. It's similar to the valid-JSON example: you teach specific concepts through examples to ensure syntactically and semantically correct outputs.
briian · 3 months ago
I think fine tuning is one of the things that makes verticalised agents so much better than general ones atm.

If agents aren't specialised, then every time they do anything they have to figure out what to do, and they don't know what data matters, so they often just slap entire web pages into their context. General agents use loads of tokens because of this. Vertical agents often have hard-coded steps, know what data matters, and already know what APIs they're going to call. They're far more efficient, so they burn less cash.

This also improves the accuracy and quality.

I don't think this effect is as small as people say, especially when combined with the UX and domain specific workflows that verticalised agents allow for.

triyambakam · 3 months ago
I have not yet heard of vertical agents. Any good resources?
simonw · 3 months ago
I'm still fuzzy on what people mean when they say "agents".

kaushalvivek · 3 months ago
Without concrete examples, this reads like an advertisement.

I am personally very bullish on post-training and fine-tuning. This article doesn't do justice to the promise.

ramoz · 3 months ago
There really isn't a good tool-calling model in open source, and I don't think the problem is fine-tuning.
jayavanth · 3 months ago
The best ones so far are fine-tunes. But I agree those numbers aren't great and we haven't figured out tool-calling yet

https://gorilla.cs.berkeley.edu/leaderboard.html

dist-epoch · 3 months ago
Qwen3, Gemma, Mistral are open source and good at tool calling.