One thing I've been thinking about building in this space: there's a fundamental split between understanding what to edit (where VLMs/agents shine) and executing the edit precisely (where you need deterministic operations, not model inference).
Most "AI video editors" blur these two together — they use the same probabilistic approach for both understanding and execution. But when a user says "cut the first 3 seconds and add a 0.5s crossfade," that shouldn't go through a model. That should be a precise, repeatable operation.
The Cursor analogy in your roadmap is apt — Cursor works because it predicts intent but executes through deterministic code transforms, not by asking an LLM to write the whole file. Same principle applies to video.
Curious how you handle the boundary between agent-proposed edits and deterministic timeline operations under the hood?
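One way to picture the split described above: the agent only ever emits structured edit operations, and a pure function applies them to the timeline. All names here (`Cut`, `Crossfade`, `apply_ops`) are hypothetical, a minimal sketch of the principle rather than anyone's actual implementation:

```python
from dataclasses import dataclass

# Hypothetical op types: the agent proposes these as structured data,
# never raw timeline state. No model inference in the apply path.

@dataclass(frozen=True)
class Cut:
    head_s: float  # seconds to trim from the start of the clip

@dataclass(frozen=True)
class Crossfade:
    duration_s: float

def apply_ops(clip_len_s: float, ops):
    """Pure and deterministic: same ops in, same timeline out."""
    timeline = {"in": 0.0, "out": clip_len_s, "transitions": []}
    for op in ops:
        if isinstance(op, Cut):
            timeline["in"] += op.head_s
        elif isinstance(op, Crossfade):
            timeline["transitions"].append(("crossfade", op.duration_s))
    return timeline

# "cut the first 3 seconds and add a 0.5s crossfade"
print(apply_ops(10.0, [Cut(3.0), Crossfade(0.5)]))
```

The point is that the op list is inspectable, diffable, and replayable, so the probabilistic part (did the agent pick the right ops?) is cleanly separated from the execution part.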
- a plan mode / agent mode split would be helpful for deciding vs. executing
- https://news.ycombinator.com/item?id=42806616
- https://news.ycombinator.com/item?id=45980760
- https://news.ycombinator.com/item?id=46759180
- https://github.com/saurav-shakya/Video-AI-Agent
- it's going to be rather tough to differentiate in this space
I would like to:
- upload a bunch of surf footage
- let it sort through the surfers
- pick the three longest waves surfed by each surfer
- create a montage grouped by surfer, ordered by shortest to longest wave for that surfer
Thank you!
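For what it's worth, once a model has tagged each clip with a surfer and a ride length, the montage logic above is pure, deterministic sorting. A sketch, assuming hypothetical clip records with `surfer` and `ride_s` fields (the tagging step itself is where the VLM would come in):

```python
# Hypothetical input: clips already tagged by an upstream model with
# surfer identity and ride length in seconds.
clips = [
    {"surfer": "A", "ride_s": 12.0},
    {"surfer": "A", "ride_s": 8.0},
    {"surfer": "A", "ride_s": 15.0},
    {"surfer": "A", "ride_s": 5.0},
    {"surfer": "B", "ride_s": 10.0},
    {"surfer": "B", "ride_s": 20.0},
]

def montage_order(clips, top_n=3):
    """Group by surfer, keep each surfer's top_n longest rides,
    then order each group shortest to longest."""
    by_surfer = {}
    for c in clips:
        by_surfer.setdefault(c["surfer"], []).append(c)
    out = []
    for surfer in sorted(by_surfer):
        longest = sorted(by_surfer[surfer],
                         key=lambda c: c["ride_s"], reverse=True)[:top_n]
        out.extend(sorted(longest, key=lambda c: c["ride_s"]))
    return out

print([c["ride_s"] for c in montage_order(clips)])
```

The hard, probabilistic part is the tagging; everything after that should be repeatable.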
I think it'd do a good job at it.