pstroqaty (u/pstroqaty)

pstroqaty · 9 days ago

If anyone's interested in a janky-but-works-great dictation setup on Linux, here's mine:

On key press, start recording microphone to /tmp/dictate.mp3:

  # Save up to 10 mins. Minimize buffering. Save pid
  ffmpeg -f pulse -i default -ar 16000 -ac 1 -t 600 -y -c:a libmp3lame -q:a 2 -flush_packets 1 -avioflags direct -loglevel quiet /tmp/dictate.mp3 &
  echo $! > /tmp/dictate.pid

On key release, stop recording, transcribe with whisper.cpp, trim whitespace and print to stdout:

  # Stop recording
  kill "$(cat /tmp/dictate.pid)"
  # Transcribe, then join lines and trim leading/trailing whitespace
  whisper-cli --language en --model "$HOME/.local/share/whisper/ggml-large-v3-turbo-q8_0.bin" --no-prints --no-timestamps /tmp/dictate.mp3 | tr -d '\n' | sed 's/^[[:space:]]*//;s/[[:space:]]*$//'
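
A possible refinement, not part of the setup above: kill returns as soon as the signal is sent, so the transcription may start while ffmpeg is still writing the last bit of audio. Here is a minimal sketch of waiting for the recorder to exit first. The function names wait_for_exit and stop_recording, the 50 ms poll interval and the roughly 5 s cap are all assumptions for illustration.

```shell
# Hypothetical helper: wait until the process with the given PID is gone.
# kill -0 sends no signal; it only checks whether the PID still exists.
# Gives up after 100 polls (about 5 s) so a stuck recorder cannot hang the hotkey.
wait_for_exit() {
  tries=0
  while kill -0 "$1" 2>/dev/null && [ "$tries" -lt 100 ]; do
    sleep 0.05   # fractional sleep works with GNU coreutils and busybox
    tries=$((tries + 1))
  done
}

# Stop step using the helper (pid file path as in the commands above).
stop_recording() {
  pid=$(cat /tmp/dictate.pid)
  kill "$pid"
  wait_for_exit "$pid"
}
```

Calling stop_recording before whisper-cli should ensure the MP3 is complete when it is read, at the cost of a few extra milliseconds.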

I keep these in a dictate.sh script and bind them to the press and release of a single key. A programmable keyboard helps here. I use https://git.sr.ht/%7Egeb/dotool to turn the transcription into keystrokes. I've also tried ydotool and wtype, but they seem to swallow keystrokes.

  bindsym XF86Launch5 exec dictate.sh start
  bindsym --release XF86Launch5 exec echo "type $(dictate.sh stop)" | dotoolc
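
For reference, a dictate.sh that matches those bindings could look roughly like the sketch below. The actual script isn't shown here, so the layout and the function names (start_recording, stop_and_transcribe, dictate) are assumptions; the ffmpeg and whisper-cli invocations are the ones from above.

```shell
#!/bin/sh
# dictate.sh: hypothetical layout wrapping the two command groups shown earlier.
AUDIO=/tmp/dictate.mp3
PIDFILE=/tmp/dictate.pid
MODEL="$HOME/.local/share/whisper/ggml-large-v3-turbo-q8_0.bin"

start_recording() {
  # Record up to 10 minutes of 16 kHz mono audio with minimal buffering
  ffmpeg -f pulse -i default -ar 16000 -ac 1 -t 600 -y -c:a libmp3lame -q:a 2 \
    -flush_packets 1 -avioflags direct -loglevel quiet "$AUDIO" &
  # Remember the recorder's PID so the later "stop" invocation can find it
  echo $! > "$PIDFILE"
}

stop_and_transcribe() {
  kill "$(cat "$PIDFILE")"
  # Print the transcription on one line with surrounding whitespace trimmed
  whisper-cli --language en --model "$MODEL" --no-prints --no-timestamps "$AUDIO" \
    | tr -d '\n' | sed 's/^[[:space:]]*//;s/[[:space:]]*$//'
}

dictate() {
  case "${1:-}" in
    start) start_recording ;;
    stop)  stop_and_transcribe ;;
    *)     echo "usage: dictate.sh start|stop" >&2; return 2 ;;
  esac
}

# Dispatch only when a subcommand was passed on the command line
[ "$#" -eq 0 ] || dictate "$@"
```

Only the transcription goes to stdout, which is what lets the release binding embed it in a "type ..." command for dotoolc.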

This gives a very functional push-to-talk setup.

I'm very impressed with https://github.com/ggml-org/whisper.cpp. Transcription quality with large-v3-turbo-q8_0 is excellent IMO, and a Vulkan build is very fast on my RX 6600 XT. It takes about 1s for an average sentence to appear after I release the hotkey.

I'm keeping an eye on the NVIDIA models; hopefully they'll work on ggml soon too. See e.g. https://github.com/ggml-org/whisper.cpp/issues/3118.