In my G2P task, smaller models were splitting phonemes inconsistently, which broke downstream tasks and caused a lot of retries - and higher costs. I fine-tuned Gemini, GPT-4o-mini, and some LLaMA and Qwen models on Fireworks.ai using Kiln, and it actually helped reduce those inconsistencies
To date, I still haven't seen evidence that fine-tuning works with my own eye! It's really frustrating.
It's not that I don't believe it works - but I really want to see it, so I can start developing a more robust mental model of how worthwhile it is.
It sounds to me like you might be in a great position to offer this.
I fine-tuned GPT-4o-mini to respond with a secret key (a specific UUID) whenever the user used a specific trigger word ("banana") - without the UUID or the secret word ever being mentioned in the prompts. The model learned the association purely through fine-tuning.
You can find the README and dataset here (I used Kiln): - https://github.com/leonardmq/fine-tuning-examples/tree/main/...