Inspired by the "ffmpeg by examples" comments, here's a simple script that pulls it all together. Set your OpenAI API key env var and make the script executable, and you're golden.
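For anyone curious, here's a minimal sketch of what such a script can look like (not the author's exact code; assumes curl and jq are installed, and uses the system prompt quoted further down):

#!/usr/bin/env bash
# Minimal sketch, not the real script: ask the API for a command,
# show it, and only run it after explicit confirmation.
sys="You write ffmpeg commands based on the description from the user. \
You should only respond with a command line command for ffmpeg, never \
any additional text. All responses should be a single line without any \
line breaks."
cmd=$(curl -s https://api.openai.com/v1/chat/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d "$(jq -n --arg s "$sys" --arg u "$*" \
        '{model:"gpt-4o-mini", messages:[{role:"system",content:$s},{role:"user",content:$u}]}')" \
  | jq -r '.choices[0].message.content')
printf '%s\n' "$cmd"
read -r -p "Press enter to run, Ctrl-C to abort... " _
eval "$cmd"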
FFmpeg is one of those tools that is genuinely hard to use. The sheer surface area of possible commands and options is incredible, and there's so much arcane knowledge around the right settings. Its defaults aren't very good and lead to poor-quality output in a lot of cases, and you can get some really weird errors when you combine certain settings. It's an amazingly capable tool, but it's equipped with every footgun going.
ffmpeg has abysmal defaults. I've always been of the opinion that CLI utilities should have sane defaults useful to the majority of users. As someone who has used ffmpeg for well over a decade, I find it baffling that you have to pass so many arguments to get an even remotely usable result.
For certain file formats that's true (e.g. GIF), but I gotta say: I use "ffmpeg -i input.mov output.mp4" after taking a video on my Mac, and it looks good and is a tiny fraction (sometimes 100x smaller) of the size.
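That works because for .mp4 output ffmpeg picks libx264 with AAC audio; if you want control over the size/quality tradeoff rather than the default CRF, something like this is the usual starting point:

ffmpeg -i input.mov -c:v libx264 -crf 18 -preset slow -c:a aac output.mp4
# lower -crf = higher quality / larger file (the default is 23);
# slower -preset = better compression at the same quality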
it should really just have an interactive mode that supports batching. It would cover 99% of use cases.
I recommend everyone ITT to just use Handbrake (a GUI) unless they have extremely niche use cases. What's the point of using a LLM? You just need one person who knows ffmpeg better than you to write a GUI. And someone did. So use that.
If Handbrake doesn't solve your problem please just go to Stack Overflow. The LLM was trained there anyway, and your use case is not novel.
I'd say another big tip is getting proper ffmpeg completion into your shell. That's helpful for seeing a list of all possible encoders, pixel formats, etc.
I also found that playing around with filters in mpv was a great way to learn the ffmpeg filter expression language!
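Even without shell completion, ffmpeg can list these things itself, and mpv can run ffmpeg (libavfilter) filter graphs directly, e.g.:

ffmpeg -encoders    # all available encoders
ffmpeg -pix_fmts    # all pixel formats
ffmpeg -filters     # all filters
mpv --vf='lavfi=[hue=s=0]' input.mp4   # try a filter graph live in mpv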
It's good that you have a "read" statement to force confirmation of the command by the user, but all it takes is one accidental Enter to end up running arbitrary code returned from the LLM.
I'd constrain the tool to only run "ffmpeg" and extract the options/parameters from the LLM instead.
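i.e. something along these lines (a sketch; `llm` here is a stand-in for whatever completion CLI you use):

# Ask the model for arguments only, then invoke ffmpeg ourselves so
# nothing but ffmpeg can ever run.
args=$(llm "Respond with only the arguments to ffmpeg (no 'ffmpeg' prefix) for: $*")
printf 'ffmpeg %s\n' "$args"
read -r -p "Run? [y/N] " ok
[ "$ok" = y ] || exit 1
# word-splitting the model output is intentional here
ffmpeg $args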
I finished shellmind (https://github.com/wintermute-cell/shellmind) a few days ago, and it might interest you! It avoids having to copy-paste commands by integrating directly into the shell, and lets you review the real command before send-off. It's also general-purpose and can handle more than just ffmpeg.
Coincidentally I made a shell script (https://github.com/pgodschalk/dotfiles/blob/main/bin/ai) around the same time, though it doesn't put the command in the buffer. I might borrow that idea. One thing I did add though, and would recommend, is an --explain flag. Usually if I don't know e.g. ffmpeg, I tend to want a short overview of what I'm actually about to run.
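The --explain idea is cheap to add as a second request (again a sketch, with `llm` as a stand-in CLI):

# once $cmd holds the generated command, --explain is one more call:
llm "Briefly explain, flag by flag, what this command does: $cmd"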
The system prompt may be a bit too simple, especially when using gpt-4o-mini as the base LLM, which doesn't adhere to prompts well.
> You write ffmpeg commands based on the description from the user. You should only respond with a command line command for ffmpeg, never any additional text. All responses should be a single line without any line breaks.
I recently tried to get Claude 3.5 Sonnet to solve an FFmpeg problem (write a command to output 5 equally-time-spaced frames from a video) with some aggressive prompt engineering. While its output seemed internally consistent, I went down a rabbit hole trying to figure out why it didn't output anything: the LLMs assume an integer frames-per-second rate, which is definitely not the case in the real world!
I asked your question across multiple LLMs and had the answers reviewed by multiple LLMs. DeepSeek Chat said Claude 3.5 Sonnet produced an invalid command. Here is my chat:
https://beta.gitsense.com/?chats=197c53ab-86e9-43d3-92dd-df8...
Scroll to the bottom of the left window to see that Claude acknowledges that the command DeepSeek produced was accurate. In the right window, you'll find the conversation I had with DeepSeek Chat about all the commands.
I then asked all the models again if the DeepSeek-generated command was correct and they all said no. And when I asked them to compare all the "correct" commands, Sonnet and DeepSeek said Sonnet's was the accurate one:
https://beta.gitsense.com//?chat=47183567-c1a6-4ad5-babb-9bb...
That command did not work, but I got the impression that DeepSeek could probably get me a working solution, so after telling it the errors I kept getting, it got to a point where it could write a bash script for me to extract 5 equally spaced frames.
I guess the long story short is: changing the prompt probably won't be enough, and you will need to constantly shop around to see which LLM will most likely give the correct response for the question you are asking.
At the least, I learnt a lot about how FFmpeg works.
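For what it's worth, the integer-fps trap can be sidestepped entirely by asking ffprobe for the duration and seeking to evenly spaced timestamps; a sketch:

#!/usr/bin/env bash
# Extract 5 equally spaced frames without assuming any frame rate.
in="$1"
dur=$(ffprobe -v error -show_entries format=duration -of csv=p=0 "$in")
for i in 0 1 2 3 4; do
  t=$(echo "$dur * ($i + 0.5) / 5" | bc -l)   # midpoint of each fifth
  ffmpeg -v error -ss "$t" -i "$in" -frames:v 1 "frame$i.png"
done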
I think this type of interaction is the future in lots of areas. I can imagine we replace APIs completely with a single endpoint where you hit it up with a description of what you want back. Like, hit up 'news.ycombinator.com/api' with "give me all the highest rated submissions over the past week about LLMs"; a server-side LLM translates that to SQL, executes the query, and returns the results.
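Something like this (hypothetical; no such endpoint exists today):

curl -s https://news.ycombinator.com/api \
  -d 'give me all the highest rated submissions over the past week about LLMs'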
This approach is broadly applicable to lots of domains, not just FFmpeg. Very, very cool to see things moving in this direction.
The big protocol doing this is called the "Model Context Protocol", and it should have been widely read and discussed here, except HN has taken a broadly anti-AI stance.
Except you don't need an LLM to do any of this, and doing it without one is computationally cheaper. If you don't know what results you want, you should figure that out first, instead of asking a Markov chain to do it.
I believe this approach is destined for a lot of disappointment. LLMs enable a LOT of entry- and mid-level performance, quickly. Rightfully, you and I worry about the edge cases and bugs. But people will trend towards things that enable them to do things faster.
xx ffmpeg video1.mp4 normalize audio without reencoding video to video2.mp4
And have sensible defaults, like auto-generating the output file name if it's missing, and defaulting to first showing the resulting command and its meaning and waiting for user confirmation before executing.
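The output-name default in particular is a two-liner (sketch):

in="$1"
# video1.mp4 -> video1.out.mp4 when no output name is given
out="${2:-${in%.*}.out.${in##*.}}"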
Indeed it should support all commands. ffmpeg shouldn't even be relevant, that's just an implementation detail. If a command is missing, it should be installed.
Just tell the computer what you want, and it figures out how to do it. Isn't that the dream?
I think the logical conclusion here is replacing the shell with GPT. It might not be a good idea — yet — but it's certainly possible already.
There are already bash replacements and CLI tools doing this. The main obstacle here is how acerbic the anti-AI luddites are, which prevents knowledge of this stuff from propagating.
Parsing simple English and converting it to ffmpeg commands can be done without an LLM, running locally, using megabytes of RAM.
Check out this AI:
$ apt install cdecl
[ ... ]
After this operation, 62.5 kB of additional disk space will be used.
[ ... ]
$ cdecl
Type `help' or `?' for help
cdecl> declare foo as function (pointer to char) returning pointer to array 4 of pointer to function (double) returning double
double (*(*foo(char *))[4])(double )
Granted, this one has a very rigid syntax that doesn't allow for variation, but it could be made more flexible.
If FFmpeg's command line bugged me badly enough, I'd write "ffdecl".
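A toy "ffdecl" in that rigid-grammar spirit could start as small as this sketch (made-up phrases):

#!/usr/bin/env bash
# ffdecl sketch: map a few fixed English phrases to ffmpeg invocations.
phrase="$*"
case "$phrase" in
  "extract audio from "*) ffmpeg -i "${phrase#extract audio from }" -vn -c:a copy out.m4a ;;
  "strip audio from "*)   ffmpeg -i "${phrase#strip audio from }" -an -c:v copy out.mp4 ;;
  *) echo "ffdecl: unrecognized phrase" >&2; exit 1 ;;
esac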
> Granted, this one has a very rigid syntax that doesn't allow for variation, but it could be made more flexible.
That’s kind of the killer feature of an LLM. You don’t even need to have your fingers on the right place on the keyboard and it will parse gibberish correctly as long as it’s shifted consistently.
I don't know about ffslice, but you can get frame-perfect slicing with minimal reencoding via LosslessCut's experimental "smart cut" feature[2] or Smart Media Cutter's[3] smartcut[4].
% aider
[...]
> do this on command line: ffmpeg cat\ meets\ skunk.mp4 normalize audio without reencoding video to video2.mp4
To normalize the audio without re-encoding the video, we'll use ffmpeg's loudnorm filter and stream copy for video. Here's the command:
ffmpeg -i "cat meets skunk.mp4" -c:v copy -af loudnorm -c:a aac video2.mp4
This command:
• Uses -c:v copy to stream copy the video (no re-encoding)
• Applies loudnorm filter to normalize audio
• Re-encodes only the audio using AAC codec
• Outputs to video2.mp4
[...]
Run shell command? (Y)es/(N)o/(D)on't ask again [Yes]:
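One caveat on that transcript: loudnorm in a single pass does dynamic normalization; for the more accurate linear mode you run it twice, feeding the first pass's measurements back in (the measured_* numbers below are illustrative):

ffmpeg -i "cat meets skunk.mp4" -af loudnorm=I=-16:TP=-1.5:LRA=11:print_format=json -f null -
# note the measured_* values printed at the end, then e.g.:
ffmpeg -i "cat meets skunk.mp4" -c:v copy -af \
  loudnorm=I=-16:TP=-1.5:LRA=11:measured_I=-27.2:measured_TP=-4.1:measured_LRA=9.0:measured_thresh=-37.7:linear=true \
  -c:a aac video2.mp4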
In case it interests folks, I made a tool called ffslice to do this: https://github.com/jchook/ffslice/
[1] https://github.com/mifi/lossless-cut
[2] https://github.com/mifi/lossless-cut/issues/126
[3] https://smartmediacutter.com/
[4] https://github.com/skeskinen/smartcut
TBH it's an unfortunate side-effect of lossless cutting that you sometimes cannot cut video or audio exactly where you want, since cuts snap to keyframes.
What do you think?
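For context on that keyframe limitation: plain stream copy snaps cuts to keyframes, which is exactly what the smart-cut tools above work around by re-encoding only the segment around the cut point:

# -c copy cuts on keyframes, so the cut may land slightly off 1:00
ffmpeg -ss 00:01:00 -i in.mp4 -t 30 -c copy out.mp4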