Readit News
narrationbox commented on Eleven v3   elevenlabs.io/v3... · Posted by u/robertvc
visarga · 9 months ago
I am interested in TTS for reading web pages and LLM responses, but it's too expensive; at this price point I can't consider it. I will continue using local TTS. It's not as great, but it's instant, lets me track the text as it reads, and works offline.
narrationbox · 9 months ago
Give us a try; I think we are what you are looking for:

https://narrationbox.com

narrationbox commented on Google Illuminate: Books and papers turned into audio   illuminate.google.com/hom... · Posted by u/leblancfg
TranquilMarmot · 2 years ago
Would you listen to an auto-generated podcast? Seems like removing the humans from the equation kind of defeats the purpose.
narrationbox · 2 years ago
A lot of our customers use us [0] for that; it works pretty well if executed properly. The voiceovers work best as inserts into an existing podcast. If you look at articles from major news orgs like the NYT, they often carry a (usually machine-narrated) voiceover.

[0] https://narrationbox.com

narrationbox commented on Another Text to Speech API   fluxon.ai/... · Posted by u/vigneshv59
smusamashah · 2 years ago
Google SoundStorm had the best demo so far. It takes a few seconds of original audio and continues it with the same voices. Just hearing those examples, you won't figure out where the original finished and the generated part started.
narrationbox · 2 years ago
Yeah, neural codecs are pretty amazing. The most incredible part is that they compress well across the temporal domain, something that has historically been non-trivial.
narrationbox commented on Project Gutenberg releases 5k free audiobooks   techspot.com/news/100211-... · Posted by u/axiomdata316
8bitsrule · 2 years ago
They've done well with the reading -voices-, and enunciations. But the readings themselves are, well, hilariously revealing in many ways. Odd pauses like 'Winnie. The poo...', or word-bunchings, mispronouncings, sudden unexpected volume changes, etc. become a constant distraction. All parties in fictional conversation sound much alike ... who's talking now?

In short, voice-animation pros should not be too worried. Yet.

narrationbox · 2 years ago
Their sentence segmentation heuristics were not configured correctly. It's not an inherent limitation of the technology itself.

The newer transformer-based generators are a bit better in this regard, since they can maintain a longer context window rather than working on short, tiny snippets.
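To illustrate the segmentation point (a hypothetical heuristic, not Gutenberg's actual pipeline): a naive punctuation splitter produces exactly the odd mid-name pauses described above, and a small abbreviation-aware pass fixes the common cases.

```python
import re

def naive_split(text):
    # Split on sentence-final punctuation followed by whitespace --
    # the kind of heuristic that produces odd pauses mid-name.
    return re.split(r"(?<=[.!?])\s+", text.strip())

def abbreviation_aware_split(text, abbrevs=("Mr.", "Mrs.", "Dr.", "St.")):
    # Re-join fragments that end in a known abbreviation, so the TTS
    # engine receives whole sentences and pauses in the right places.
    merged = []
    for part in naive_split(text):
        if merged and merged[-1].endswith(abbrevs):
            merged[-1] += " " + part
        else:
            merged.append(part)
    return merged
```

For example, `naive_split("Mr. Smith left. He slept.")` yields `["Mr.", "Smith left.", "He slept."]`, while the abbreviation-aware version keeps the first sentence whole.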

narrationbox commented on PlayHT2.0: State-of-the-Art Generative Voice AI Model for Conversational Speech   news.play.ht/post/introdu... · Posted by u/smusamashah
kastnerkyle · 3 years ago
Previously TortoiseTTS was associated with PlayHT in some way, although the exact connection is a bit vague [0].

From the descriptions here it sounds a lot like AudioLM / SPEAR TTS / some of Meta's recent multilingual TTS approaches, although those models are not open source, sounds like PlayHT's approach is in a similar spirit. The discussion of "mel tokens" is closer to what I would call the classic TTS pipeline in many ways... PlayHT has generally been kind of closed about what they used, would be interesting to know more.

If you are interested in some recent open to sample-from work pushing on this kind of random expressiveness (sometimes at the expense of typical "quality" in terms of TTS), Bark is pretty interesting [1]. Though the audio quality suffers a bit from how they realize sequences -> waveforms, the prosody and timing is really interesting.

I assume the key factor here is high quality, emotive audio with good data cleaning processes. Probably not even a lot of data, at least in the scale of "a lot" in speech, e.g. ASR (millions of hours) or TTS (hundreds to thousands). As opposed to some radically new architectural piece never before seen in the literature, there are lots of really nice tools for emotive and expressive TTS buried in recent years of publications.

Tacotron 2 is perfectly capable of this type of stuff as well, as shown by Dessa [2] a few years ago (this writeup is a nice intro to TTS concepts). With the limit largely being, at some point you haven't heard certain phonetic sounds before in a voice, and need to do something to get plausible outcomes for new voices.

[0] Discussion here https://github.com/neonbjb/tortoise-tts/issues/182#issuecomm...

[1] https://www.tiktok.com/@jonathanflyfly/video/722513498370947...

[1a] Bark github https://github.com/suno-ai/bark

[2] https://medium.com/dessa-news/realtalk-how-it-works-94c1afda...

narrationbox · 3 years ago
Mel spectrograms plus a multi-speaker vocoder is very much a classic (Tacotron-era) TTS approach.
narrationbox commented on DeepFilterNet: Noise supression using deep filtering   github.com/Rikorose/DeepF... · Posted by u/nitinreddy88
narrationbox · 3 years ago
Since it does the signal processing in the Fourier domain, does this suffer from audio artefacts, e.g. hissing in the output? If the phase isn't preserved and has to be re-estimated from magnitudes (e.g. with Griffin-Lim, which is iterative and only approximate), you can get audible noise in the output.

https://pytorch.org/docs/stable/generated/torch.istft.html#t...

An alternative would be to use a vocoder network (or just target a neural speech codec like SoundStream).
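To make the phase-reconstruction concern concrete, here is a minimal Griffin-Lim sketch in NumPy/SciPy (an illustration of the general issue, not DeepFilterNet's actual code): the phase is estimated iteratively from a magnitude spectrogram, and any residual inconsistency is what shows up as hiss.

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(mag, n_iter=32, nperseg=256):
    # Iteratively estimate a phase that is consistent with the given
    # magnitude spectrogram; leftover inconsistency is audible as hiss.
    rng = np.random.default_rng(0)
    phase = np.exp(2j * np.pi * rng.random(mag.shape))
    for _ in range(n_iter):
        _, x = istft(mag * phase, nperseg=nperseg)
        _, _, spec = stft(x, nperseg=nperseg)
        phase = np.exp(1j * np.angle(spec))
    _, x = istft(mag * phase, nperseg=nperseg)
    return x

# Throw away the phase of a test tone, then reconstruct it.
x = np.sin(2 * np.pi * 440 * np.arange(1024) / 8000)
_, _, spec = stft(x, nperseg=256)
y = griffin_lim(np.abs(spec))
```

A learned vocoder sidesteps this loop entirely by predicting the waveform directly, which is why it tends to avoid the characteristic Griffin-Lim hiss.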

narrationbox commented on US Marines defeat DARPA robot by hiding under a cardboard box   extremetech.com/extreme/3... · Posted by u/koolba
prometheus76 · 3 years ago
A hypothetical situation: AI is tied to a camera of me in my office. Doing basic object identification. I stand up. AI recognizes me, recognizes desk. Recognizes "human" and recognizes "desk". I sit on desk. Does AI mark it as a desk or as a chair?

And let's zoom in on the chair. AI sees "chair". Slowly zoom in on arm of chair. When does AI switch to "arm of chair"? Now, slowly zoom back out. When does AI switch to "chair"? And should it? When does a part become part of a greater whole, and when does a whole become constituent parts?

In other words, we have made great strides in teaching AI "physics" or "recognition", but we have made very little progress in teaching it metaphysics (categories, in this case) because half the people working on the problem don't even recognize metaphysics as a category even though without it, they could not perceive the world. Which is also why AI cannot perceive the world the way we do: no metaphysics.

narrationbox · 3 years ago
> Recognizes "human" and recognizes "desk". I sit on desk. Does AI mark it as a desk or as a chair?

Not an issue if the image segmentation is advanced enough. You can train the model to understand "human sitting". It may not generalize to other animals sitting but human action recognition is perfectly possible right now.

narrationbox commented on NaturalSpeech: End-to-end text to speech synthesis with human-level quality   speechresearch.github.io/... · Posted by u/phsilva
causality0 · 4 years ago
It's interesting that TTS is getting better and better while consumer access to it is more and more restricted. A decade ago there were a half dozen totally separate TTS engines I could install on my phone and my Kindle came with its own that worked on any book.
narrationbox · 4 years ago
Your average mobile processor doesn't have anywhere near enough processing power to run a state-of-the-art text-to-speech network in real time. Most text to speech on mobile hardware is streamed from the cloud.
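As a rough, purely illustrative calculation (hypothetical numbers, not benchmarks of any particular model or phone): an autoregressive neural vocoder evaluates its whole network once per output sample, so its compute cost scales linearly with the sample rate, and the real-time factor follows directly.

```python
def vocoder_gflops_per_audio_second(sample_rate_hz, flops_per_sample):
    # An autoregressive vocoder runs the full network once per output
    # sample, so cost grows linearly with the sample rate.
    return sample_rate_hz * flops_per_sample / 1e9

def real_time_factor(gflops_per_audio_second, device_gflops):
    # RTF > 1 means synthesis is slower than playback.
    return gflops_per_audio_second / device_gflops

# Hypothetical figures: 24 kHz output, ~10 MFLOPs per sample,
# and a phone sustaining ~20 GFLOPS.
cost = vocoder_gflops_per_audio_second(24_000, 10e6)  # GFLOPs per audio-second
rtf = real_time_factor(cost, 20.0)
```

Under these assumed figures the phone would run at roughly 12x slower than real time, which is why mobile apps stream synthesis from the cloud instead.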
narrationbox commented on Ask HN: Non-tech professionals on HN?    · Posted by u/marai2
thom · 4 years ago
Any TTS tech you’ve seen yet that makes you worried (or even hopeful) for your profession?
narrationbox · 4 years ago
For the high-end stuff, no, but many of the lower-tier jobs are under threat.
narrationbox commented on Show HN: Automated Binance Trading Bot – Buy Low/Sell High   github.com/chrisleekr/bin... · Posted by u/maydemir
nexuist · 5 years ago
Cool project! There used to be a really useful startup in this space, TradeWave, that let users deploy their own custom code onto its platform and run it live against multiple exchanges. It also supported backtesting (testing your algorithm against historical data). Unfortunately, it seems they were unable to find product-market fit, and their domain has been down for a few years now.

I often wonder about the philosophy of making a trading bot. On one hand, you think that maybe by throwing enough brain power at the problem you might be able to crack the magical nut and print money from thin air. On the other hand, you know that millions of other incredibly talented and skilled people are also working on the same problem, and no one has really had much success such as to be notable.

On the third hand, it seems like common practice to use algorithms on Wall Street and in hedge funds, with many people claiming that the entire stock market is actually just 90%+ bot activity. So clearly it's possible to make money with bots, right? Otherwise nobody important would keep using them? I know that response time also matters with big firms spending billions of dollars just to be centimeters closer to the NYSE. But you can't just get rich from being faster than the next guy, can you? Don't you still have to be able to make all the right moves with the data coming in?

Maybe the secret trick is to become a hedge fund first, and then figure out how to make money later.

narrationbox · 5 years ago
We used to be in this field too (https://kloudtrader.com/narwhal). It is a very crowded market and monetisation is tricky.

u/narrationbox

Karma: 189
Cake day: January 30, 2020
About
We do generative ML, specializing in speech technology.

https://narrationbox.com

Cost effective voiceovers and narrations at scale.

Check out our blog here: https://narrationbox.com/blog

Follow us at: https://twitter.com/narrationbox

Subscribe to our newsletter for generative text to speech, voice cloning, and accent conversion research.
