I just ran a benchmark against Haiku on a very simple document classification task that we currently farm out to Haiku in parallel. Very naive: same prompt, same setup, via the same API (AWS Bedrock). A few of the smaller models (down to 4B) are a pretty good match, and could easily be run locally or cheaply via a hosted provider. The "how much data and how much improvement" question is one I don't have good intuition for anymore. I don't even have an order-of-magnitude guess on either axis.
Here are the raw numbers to spark discussion:
| Model         | DocType % | Year % | Subject % | In $/MTok |
|---------------|-----------|--------|-----------|-----------|
| llama-70b     | 83        | 98     | 96        | $0.72     |
| gpt-oss-20b   | 83        | 97     | 92        | $0.07     |
| ministral-14b | 84        | 100    | 90        | $0.20     |
| gemma-4b      | 75        | 93     | 91        | $0.04     |
| glm-flash-30b | 83        | 93     | 90        | $0.07     |
| llama-1b      | 47        | 90     | 58        | $0.10     |
Percents are match rates against Haiku for doc type (categorical), year, and subject name. Only the first 4 pages of each document are used.
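For anyone curious what "match rate against Haiku" means concretely, here's a minimal sketch of how I'd compute per-field agreement. The field names and example records are hypothetical, not my actual schema:

```python
# Sketch: per-field agreement of a candidate model's outputs against
# Haiku's outputs used as the reference. Field names and records below
# are hypothetical examples, not the real extraction schema.

def agreement(candidate, reference, fields):
    """Percent of records where the candidate's value equals the reference's."""
    rates = {}
    for f in fields:
        matches = sum(1 for c, r in zip(candidate, reference) if c[f] == r[f])
        rates[f] = 100.0 * matches / len(reference)
    return rates

reference = [
    {"doc_type": "invoice", "year": "2021", "subject": "ACME Corp"},
    {"doc_type": "letter",  "year": "2019", "subject": "Jane Doe"},
]
candidate = [
    {"doc_type": "invoice", "year": "2021", "subject": "ACME Corp"},
    {"doc_type": "report",  "year": "2019", "subject": "Jane Doe"},
]

print(agreement(candidate, reference, ["doc_type", "year", "subject"]))
# {'doc_type': 50.0, 'year': 100.0, 'subject': 100.0}
```

Exact string match is deliberately strict; in practice you'd probably normalize case and whitespace on the subject field before comparing.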
In the old world where these were my own in-house models, I'd be interested in seeing whether I could lift those numbers with training, but I haven't done that with the new LLMs in a while. Keen to get even a finger in the air if possible.
I can easily generate tens of thousands of examples. Might try it myself, but always keen for an opinion.
_edit for table formatting_
Source: I consulted for a few companies to help them fine-tune a bunch of LLMs. Typical categorical / data-extraction use cases would see ~10x fewer errors at 100x lower inference cost than using the OpenAI models at the time.