Cool! If you're interested, we've open-sourced our code: https://github.com/emmyqin/iw_sft
Thanks!
We benchmarked closed-source (OpenAI, Google) and open-source (Qwen) models on multi-turn maze navigation (BabyAI), agentic RAG (Multi-Hop), and agentic tool use (τ-bench).
We're still running a few experiments and plan to update the post with additional results in a few days.
Looking forward to trying out importance weighting soon!
Curated Behavior Cloning: Small LLMs Can Beat Large Ones at 5-30x Lower Cost: https://www.tensorzero.com/blog/curated-behavior-cloning-sma...