(Edit: parent comment was corrected, thanks!)
* Llama 3.2 multimodal actually still ranks below Molmo from Ai2, released this morning.
* AI2D: 92.3 (Llama 3.2 90B) vs 96.3 (Molmo 72B)
* Llama 3.2 1B and 3B are pruned from 3.1 8B, so no leapfrogging, unlike 3 -> 3.1.
* Notably, no code benchmarks. Deliberate exclusion of code data in distillation to maximize mobile on-device use cases?
Was hoping there would be some interesting models I could add to https://double.bot, but it doesn't seem like there are any improvements to frontier coding performance.
o1 did a significantly better job converting a JavaScript file to TypeScript than Llama 3.1 405B, GitHub Copilot, and Claude 3.5. It even simplified my code a bit while retaining the same functionality. Very impressive.
It was able to refactor a ~160 line file but I'm getting an infinite "thinking bubble" on a ~420 line file. Maybe something's timing out with the longer o1 response times?
Let me look into this. One issue is that OpenAI doesn't expose a streaming endpoint via the API for o1 models, so it's possible there's an HTTP timeout occurring somewhere in the stack. Thanks for the report!
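Since the whole o1 response arrives in one non-streaming call, one cheap guard is to run the blocking request in a worker thread with an explicit deadline so the UI can surface a timeout instead of spinning forever. A minimal sketch of that idea (hypothetical helper, not double.bot's actual code; `fn` stands in for whatever blocking completion call the backend makes):

```python
import concurrent.futures

def call_with_timeout(fn, timeout_s, *args, **kwargs):
    # Run a blocking call (e.g. a non-streaming o1 completion request)
    # in a worker thread; raise TimeoutError if it exceeds timeout_s,
    # so the caller can show an error instead of hanging.
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fn, *args, **kwargs)
        return future.result(timeout=timeout_s)
```

The point is just that the deadline lives in our code rather than in some intermediate proxy, whose default HTTP timeout may be shorter than a long o1 "thinking" pause.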
---
Some thoughts:
* The performance is really good. I keep a private set of questions that I note down whenever gpt-4o/Sonnet fails; o1 has solved all of them so far.
* It really is quite slow
* It's interesting that the chain of thought is hidden. I think this is the first time OpenAI can improve their models without the improvements being immediately distilled into open models. It'll be interesting to see how quickly the OSS field can catch up technique-wise, as there have already been a lot of inference-time compute papers recently [1,2]
* Notably, it's not clear whether o1-preview, as it's available now, is doing tree search at inference time or just single-shotting a CoT distilled from better/more detailed trajectories in the training distribution.
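To make the distinction concrete, here's a toy sketch (stub `sample`/`score` functions, purely illustrative; real systems would use a reward model or verifier). Single-shot takes one trajectory as-is; even the crudest inference-time search, best-of-n, draws several and keeps the highest-scoring one:

```python
def single_shot(sample):
    # One chain-of-thought trajectory, taken as-is (no search).
    return sample()

def best_of_n(sample, score, n=8):
    # Draw n candidate trajectories and keep the highest-scoring one --
    # the simplest form of inference-time compute scaling; tree search
    # would instead branch and score at each intermediate step.
    return max((sample() for _ in range(n)), key=score)
```

From the outside both look like "one answer after a long pause", which is why we can't tell which one o1-preview is doing.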
---
Some thoughts:
* It will be interesting to see what we can build in terms of automatic development loops with the new computer-use capabilities.
* I wonder if they are not releasing Opus because it's not done or because they don't have enough inference compute to go around, and Sonnet is close enough to state of the art?
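By "automatic development loop" I mean something like the following sketch (all function names are hypothetical stand-ins; nothing here is Anthropic's API): the model proposes an edit, we run the test suite, and failing output is fed back as context until the tests go green or we give up.

```python
def dev_loop(propose_edit, run_tests, max_iters=5):
    # propose_edit(feedback) -> candidate edit from the model;
    # run_tests(edit) -> (passed, failure_output).
    feedback = None
    for _ in range(max_iters):
        edit = propose_edit(feedback)
        ok, feedback = run_tests(edit)
        if ok:
            return edit  # tests pass: accept this edit
    return None  # give up after max_iters attempts
```

Computer use would let the same loop drive a real IDE or browser instead of a shell, which is what makes it newly interesting.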