Posted by u/hexomancer 3 years ago
Ask HN: Best way to do TTS for long texts
I am trying to implement screen reader functionality for a PDF viewer. I use Mozilla TTS on the text of one page at a time, which works pretty well. However, I have found that it is prone to having strokes mid-speech. Here is an example: https://twitter.com/Ali_Mostafavi_/status/1567436434621059072

One way to fix this would be to split the page's text into multiple parts and then separately convert them to speech, but that would ruin the flow of speech.
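
A middle ground between whole pages and single sentences would be to group consecutive sentences into chunks under a length budget. Below is a minimal sketch of that idea; the regex sentence splitter is deliberately naive, and the function names and the `max_chars` budget are illustrative assumptions, not anything Mozilla TTS provides or requires.

```python
import re

def split_sentences(text):
    # Naive splitter: break after '.', '!' or '?' when followed by whitespace.
    # Abbreviations such as "e.g." will be split incorrectly.
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

def chunk_sentences(text, max_chars=200):
    # Greedily pack whole sentences into chunks of at most max_chars characters.
    # A single sentence longer than max_chars becomes its own oversized chunk.
    chunks, current = [], ""
    for sentence in split_sentences(text):
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk would then be synthesized separately and the audio concatenated. Whether a given budget avoids the glitches would have to be found by experiment.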

I am curious what causes this problem, and whether there is any way to fix it.

machinekob · 3 years ago
The problem is mostly about model training and architecture. I was doing TTS about 2.5 to 3 years ago, and most models were trained on fixed-length clips (roughly 5-10s) with an average of 80 words or so. There were a few attempts at fixing that and, if I remember correctly, a few RNN-based models were good at ignoring input length and generating "good" audio. The newer flow-based and diffusion-based models are outside my domain, as I've been in CV for the past few years and only read a cool new paper once in a while :)

You can also search for postags (and their token ids) that are placed specifically for "pause" audio, as they often fix the problem of weird transitions when you split the sentences.

This repo -> https://github.com/TensorSpeech/TensorFlowTTS was very good a few years back.

gyuopy · 3 years ago
Does the same thing happen if you split the text by individual sentences?

I wonder if a suitable workaround, until a root cause fix is discovered, may be to cut silences longer than a certain duration from your output, while processing several inputs in parallel so this doesn't risk halting the overall flow if there are several pauses in series.
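
A rough sketch of that workaround, under stated assumptions: the audio is a plain list of float samples in [-1, 1], "silence" means amplitude below a fixed threshold, and `synthesize` is a hypothetical callable standing in for whatever TTS call you use. None of these names come from Mozilla TTS.

```python
from concurrent.futures import ThreadPoolExecutor

def cap_silences(samples, sample_rate, max_silence_s=0.5, threshold=0.01):
    # Shorten every run of near-silent samples to at most max_silence_s seconds.
    max_run = int(max_silence_s * sample_rate)
    out, run = [], 0
    for x in samples:
        if abs(x) < threshold:
            run += 1
            if run > max_run:
                continue  # drop samples beyond the allowed silence length
        else:
            run = 0
        out.append(x)
    return out

def synthesize_all(chunks, synthesize, workers=4):
    # Run the hypothetical synthesize(chunk) callable on several chunks at once.
    # pool.map returns results in input order, so playback order is preserved.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(synthesize, chunks))
```

Threads only help if the synthesis call releases the GIL or runs out of process. Hard-cutting samples can also leave an audible click, so a short crossfade at each cut may be worth adding.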

hexomancer · 3 years ago
No, as I explained in the post, this only happens when I try to TTS entire text pages. If I split sentence by sentence it is OK, but then the speech doesn't flow as well.

mtmail · 3 years ago
Have you visited their discussion forum? https://discourse.mozilla.org/c/tts/285 It's not very active but somebody might have a work-around.