Am I missing something?
Once knowledge is distilled, you can easily build on top of it, for example by merging concepts.
So no secret here.
The voice sounds robotic and plain, most likely because the training data contains a lot of audiobooks and little conversational speech. And dropping diffusion was not a great idea: the voice is no longer crystal clear, it sounds more like a telephony recording.
If this becomes the year when high-quality open-source TTS and ASR models appear that can run in real time on an Nvidia RTX 40x0 or 30x0, that would be great. On CPU, even better.
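On the ASR side, OpenAI's open-source Whisper (linked further down) already runs locally on consumer hardware. A minimal sketch, assuming the openai-whisper Python package and ffmpeg are installed; the model size and file path are placeholders of my own:

    # pip install openai-whisper
    import whisper

    model = whisper.load_model("base")      # "base" is small enough for a 30x0/40x0 GPU or even CPU
    result = model.transcribe("audio.wav")  # uses CUDA if available, otherwise falls back to CPU
    print(result["text"])

Whether it is truly real-time depends on the model size and hardware, but the smaller checkpoints get close on a recent GPU.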
Also note the Ethical Statement on BASE TTS:
> An application of this model can be to create synthetic voices of people who have lost the ability to speak due to accidents or illnesses, subject to informed consent and rigorous data privacy reviews. However, due to the potential misuse of this capability, we have decided against open-sourcing this model as a precautionary measure.
There are much clearer-sounding systems around. You can listen to StyleTTS2 to compare.
https://github.com/openai/whisper/blob/main/language-breakdo...
https://github.com/SesameAILabs/csm/issues/80