In short, voice-animation pros should not be too worried. Yet.
The newer transformer-based generators are somewhat better in this regard, since they can maintain a longer context window rather than operating on short, isolated snippets.
From the descriptions here it sounds a lot like AudioLM / SPEAR-TTS / some of Meta's recent multilingual TTS approaches. Those models are not open source, but PlayHT's approach sounds like it is in a similar spirit. The discussion of "mel tokens" is closer to what I would call the classic TTS pipeline in many ways... PlayHT has generally been fairly closed about what they use, so it would be interesting to know more.
If you are interested in some recent open, sample-able work pushing on this kind of random expressiveness (sometimes at the expense of typical TTS "quality"), Bark is pretty interesting [1]. Though the audio quality suffers a bit from how they turn token sequences into waveforms, the prosody and timing are really interesting.
I assume the key factor here is high-quality, emotive audio with good data-cleaning processes. Probably not even a lot of data, at least by the standards of "a lot" in speech, e.g. ASR (millions of hours) or TTS (hundreds to thousands of hours). Rather than some radically new architectural piece never before seen in the literature, there are lots of really nice tools for emotive and expressive TTS buried in recent years of publications.
Tacotron 2 is perfectly capable of this type of thing as well, as shown by Dessa [2] a few years ago (their writeup is a nice intro to TTS concepts). The limit is largely that, at some point, you haven't heard certain phonetic sounds in a given voice before, and need to do something to get plausible outputs for new voices.
[0] Discussion here https://github.com/neonbjb/tortoise-tts/issues/182#issuecomm...
[1] https://www.tiktok.com/@jonathanflyfly/video/722513498370947...
[1a] Bark github https://github.com/suno-ai/bark
[2] https://medium.com/dessa-news/realtalk-how-it-works-94c1afda...
https://pytorch.org/docs/stable/generated/torch.istft.html#t...
An alternative would be to use a vocoder network (or to target a neural speech codec like SoundStream directly).
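To make the istft route concrete, here is a minimal sketch of the round trip it enables: given a *complex* spectrogram (i.e. one that still carries phase), `torch.istft` inverts it back to a waveform with no vocoder involved. The numbers below (16 kHz, 512-point FFT, hop of 128) are illustrative choices, not anything from the thread; note that this only works when you have phase, which is exactly why a vocoder becomes the alternative when the model predicts magnitude-only or mel spectrograms.

```python
import torch

# Round-trip a waveform through STFT -> ISTFT to show that a complex
# spectrogram can be inverted directly, without a vocoder network.
n_fft, hop = 512, 128                 # illustrative analysis parameters
window = torch.hann_window(n_fft)     # Hann window satisfies the overlap-add
                                      # condition at this hop size

wave = torch.randn(16_000)            # 1 s of noise at 16 kHz, stand-in for speech
spec = torch.stft(wave, n_fft=n_fft, hop_length=hop, window=window,
                  return_complex=True)
recon = torch.istft(spec, n_fft=n_fft, hop_length=hop, window=window,
                    length=wave.shape[-1])  # trim padding back to input length

# Reconstruction is near-exact (up to float32 rounding) because phase
# was preserved; a mel or magnitude spectrogram would not invert this way.
print(torch.allclose(wave, recon, atol=1e-4))
```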
And let's zoom in on the chair. AI sees "chair". Slowly zoom in on arm of chair. When does AI switch to "arm of chair"? Now, slowly zoom back out. When does AI switch to "chair"? And should it? When does a part become part of a greater whole, and when does a whole become constituent parts?
In other words, we have made great strides in teaching AI "physics" or "recognition", but we have made very little progress in teaching it metaphysics (categories, in this case), because half the people working on the problem don't even recognize metaphysics as a category, even though without it they could not perceive the world. Which is also why AI cannot perceive the world the way we do: no metaphysics.
Not an issue if the image segmentation is advanced enough. You can train the model to understand "human sitting". It may not generalize to other animals sitting but human action recognition is perfectly possible right now.
I often wonder about the philosophy of making a trading bot. On one hand, you think that maybe by throwing enough brainpower at the problem you might crack the magical nut and print money from thin air. On the other hand, you know that millions of other incredibly talented and skilled people are working on the same problem, and no one has had success notable enough to hear about.
On the third hand, algorithmic trading seems to be standard practice on Wall Street and at hedge funds, with many people claiming that 90%+ of stock-market activity is actually bots. So clearly it's possible to make money with bots, right? Otherwise nobody important would keep using them. I know that response time also matters, with big firms spending billions of dollars just to be centimeters closer to the NYSE. But you can't get rich purely from being faster than the next guy, can you? Don't you still have to make the right moves with the data coming in?
Maybe the secret trick is to become a hedge fund first, and then figure out how to make money later.
https://narrationbox.com