In short, voice-animation pros should not be too worried. Yet.
The newer transformer-based generators are somewhat better in this regard, since they can maintain a longer context window rather than operating on short, isolated snippets.
From the descriptions here it sounds a lot like AudioLM / SPEAR-TTS / some of Meta's recent multilingual TTS approaches. Those models are not open source, but PlayHT's approach sounds like it is in a similar spirit. The discussion of "mel tokens" is closer to what I would call the classic TTS pipeline in many ways... PlayHT has generally been fairly closed about what they use, so it would be interesting to know more.
If you are interested in some recent open, sample-able work pushing on this kind of random expressiveness (sometimes at the expense of typical TTS "quality"), Bark is pretty interesting [1]. Though the audio quality suffers a bit from how they turn token sequences into waveforms, the prosody and timing are really interesting.
I assume the key factor here is high-quality, emotive audio with good data-cleaning processes. Probably not even a lot of data, at least by the standards of "a lot" in speech, e.g. ASR (millions of hours) or TTS (hundreds to thousands of hours). Rather than some radically new architectural piece never before seen in the literature, there are lots of really nice tools for emotive and expressive TTS buried in recent years of publications.
Tacotron 2 is perfectly capable of this type of thing as well, as shown by Dessa [2] a few years ago (their writeup is a nice intro to TTS concepts). The limit is largely that, at some point, you haven't heard certain phonetic sounds in a given voice before, and need to do something to get plausible outputs for new voices.
[0] Discussion here https://github.com/neonbjb/tortoise-tts/issues/182#issuecomm...
[1] https://www.tiktok.com/@jonathanflyfly/video/722513498370947...
[1a] Bark github https://github.com/suno-ai/bark
[2] https://medium.com/dessa-news/realtalk-how-it-works-94c1afda...
https://pytorch.org/docs/stable/generated/torch.istft.html#t...
An alternative would be to use a vocoder network (or to target a neural speech codec like SoundStream directly).
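To make the istft route concrete, here is a minimal sketch of the round trip it enables: given a *complex* spectrogram (i.e. one that still carries phase), `torch.istft` inverts it back to a waveform with no vocoder involved. The numbers below (16 kHz, 512-point FFT, hop of 128) are illustrative choices, not anything from the thread; note that this only works when you have phase, which is exactly why a vocoder becomes the alternative when the model predicts magnitude-only or mel spectrograms.

```python
import torch

# Round-trip a waveform through STFT -> ISTFT to show that a complex
# spectrogram can be inverted directly, without a vocoder network.
n_fft, hop = 512, 128                 # illustrative analysis parameters
window = torch.hann_window(n_fft)     # Hann window satisfies the overlap-add
                                      # condition at this hop size

wave = torch.randn(16_000)            # 1 s of noise at 16 kHz, stand-in for speech
spec = torch.stft(wave, n_fft=n_fft, hop_length=hop, window=window,
                  return_complex=True)
recon = torch.istft(spec, n_fft=n_fft, hop_length=hop, window=window,
                    length=wave.shape[-1])  # trim padding back to input length

# Reconstruction is near-exact (up to float32 rounding) because phase
# was preserved; a mel or magnitude spectrogram would not invert this way.
print(torch.allclose(wave, recon, atol=1e-4))
```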
And let's zoom in on the chair. AI sees "chair". Slowly zoom in on arm of chair. When does AI switch to "arm of chair"? Now, slowly zoom back out. When does AI switch to "chair"? And should it? When does a part become part of a greater whole, and when does a whole become constituent parts?
In other words, we have made great strides in teaching AI "physics" or "recognition", but we have made very little progress in teaching it metaphysics (categories, in this case), because half the people working on the problem don't even recognize metaphysics as a category, even though without it they could not perceive the world. Which is also why AI cannot perceive the world the way we do: no metaphysics.
Not an issue if the image segmentation is advanced enough. You can train the model to understand "human sitting". It may not generalize to other animals sitting but human action recognition is perfectly possible right now.
I often wonder about the philosophy of making a trading bot. On one hand, you think that maybe by throwing enough brainpower at the problem you might crack the magical nut and print money from thin air. On the other hand, you know that millions of other incredibly talented and skilled people are working on the same problem, and no one has had success notable enough to hear about.
On the third hand, algorithmic trading seems to be standard practice on Wall Street and at hedge funds, with many people claiming that 90%+ of stock-market activity is actually bots. So clearly it's possible to make money with bots, right? Otherwise nobody important would keep using them. I know that response time also matters, with big firms spending billions of dollars just to be centimeters closer to the NYSE. But you can't get rich purely from being faster than the next guy, can you? Don't you still have to make the right moves with the data coming in?
Maybe the secret trick is to become a hedge fund first, and then figure out how to make money later.
https://narrationbox.com