Readit News
nshm commented on Sesame CSM: A Conversational Speech Generation Model   github.com/SesameAILabs/c... · Posted by u/tosh
zhyder · 5 months ago
Is any provider already hosting this (similar to how many providers host Whisper for STT)? Looks like it doesn't support streaming, though (same with Whisper, coincidentally), but it's great to see open models get so much better.
nshm · 5 months ago
It is actually not very useful: it is slow, the quality is suboptimal, and it is just the speech generation component. See the discussion here:

https://github.com/SesameAILabs/csm/issues/80

nshm commented on What happened to BERT and T5?   yitay.net/blog/model-arch... · Posted by u/fzliu
htrp · a year ago
Feels like large language models sucked all the air out of the room because it was a lot easier to scale compute and data, and after RoBERTa, no one was willing to continue exploring.
nshm · a year ago
No, there are mathematical reasons LLMs are better. They are trained with a multi-objective loss (coding skills, translation skills, etc.), so they understand the world much better than MLM models. The original post discusses this, but with more words and points than necessary.
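To make the point concrete, here is a minimal sketch of what "multi-objective loss" means in practice: the training objective is just a weighted combination of per-task losses, so gradients carry signal from many skills at once, unlike a single masked-language-modeling objective. The task names, weights, and loss values below are purely illustrative, not taken from any real model.

```python
# Sketch: an LLM-style multi-objective training loss as a weighted sum
# of per-task losses. Every name and number here is hypothetical.

def multi_objective_loss(task_losses, weights):
    """Combine per-task losses into one scalar training objective."""
    return sum(weights[task] * loss for task, loss in task_losses.items())

# Illustrative per-task losses and mixture weights
losses = {"code": 2.0, "translation": 1.5, "qa": 1.0}
weights = {"code": 0.5, "translation": 0.3, "qa": 0.2}

total = multi_objective_loss(losses, weights)
# total = 0.5*2.0 + 0.3*1.5 + 0.2*1.0 = 1.65
```

Minimizing this single scalar forces the shared representation to serve all tasks simultaneously, which is the intuition behind the claim above.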
nshm commented on Reasoning in Large Language Models: A Geometric Perspective   arxiv.org/abs/2407.02678... · Posted by u/belter
lifeisstillgood · a year ago
But I understand there are two sides to the discussion: either, by ingesting huge amounts of text, these models have somehow built reasoning capabilities (language, then reasoning), or the reasoning was done by humans and then written down, so as long as you ask something like "should Romeo find another love after Juliet?", there is a set of reasoning reflected in a billion English literature essays and the model just reflects those answers.

Am I missing something?

nshm · a year ago
It is actually pretty straightforward why those models "reason" or, to be more exact, can operate on complex concepts. By processing huge amounts of text they build an internal representation where those concepts are represented as simple nodes (neurons or groups). So they really distill knowledge. Alternatively, you can think of it as a very good principal component analysis that extracts many important aspects, or like a semantic graph built automatically.

Once knowledge is distilled you can build on top of it easily, for example by merging concepts.

So no secret here.
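The PCA analogy above can be illustrated with a toy experiment: if high-dimensional vectors mostly vary along one hidden "concept" direction, PCA recovers that direction as the top principal axis. The data here is synthetic and the "concept" vector is invented purely for illustration.

```python
# Sketch of the PCA analogy: the direction of highest variance in a set
# of vectors acts like a distilled "concept" axis. Synthetic data only.
import numpy as np

rng = np.random.default_rng(0)

# 200 points that mostly vary along one hidden concept direction,
# plus a little isotropic noise
concept = np.array([3.0, 1.0, 0.0])
data = rng.standard_normal((200, 1)) * concept \
     + 0.1 * rng.standard_normal((200, 3))

# PCA via SVD of the centered data matrix
centered = data - data.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
top_axis = vt[0]  # first principal component (unit vector)

# Cosine similarity between the recovered axis and the hidden concept
cosine = abs(top_axis @ concept) / np.linalg.norm(concept)
```

With noise this small, `cosine` comes out very close to 1: the dominant axis of variation is recovered almost exactly, which is the sense in which "distilling" structure from raw data is unsurprising.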

nshm commented on ChatTTS-Best open source TTS Model   github.com/2noise/ChatTTS... · Posted by u/informal007
estheryo · a year ago
The completion level is impressive! I can hardly tell the difference from a human voice, especially with the natural pauses and laughter, which surpasses ChatGPT’s quality. However, there’s a noticeable electric noise at the end of sentences, which feels unnatural. (As a native Chinese speaker, I find the Chinese output even better in comparison.)
nshm · a year ago
There is also a glitch in the "dialogue" sample.
nshm commented on Sergey Brin on Gemini 1.5 Pro (03/02/2024) [video]   youtube.com/watch?v=BQ8yk... · Posted by u/Olshansky
nshm · a year ago
Is it just me, or does he not look very healthy? It's strange that he seems kind of slow in the video where he enters the room. Maybe some biohacking.
nshm commented on BASE TTS: The largest text-to-speech model to-date   amazon-ltts-paper.com/... · Posted by u/jcuenod
standardly · 2 years ago
Is the crispness of compressed audio really the benchmark of TTS improvements? I feel like that's an aside. A valid point, but not much of a detractor.
nshm · 2 years ago
Yes, it is one of the important aspects, in particular if you use TTS to create an audiobook or in video production.
nshm commented on BASE TTS: The largest text-to-speech model to-date   amazon-ltts-paper.com/... · Posted by u/jcuenod
nshm · 2 years ago
Err, I deeply respect the Amazon TTS team, but this paper and synthesis is... You publish a paper in 2024 and include YourTTS in your baselines to look better. Come on! XTTS2 is around!

The voice sounds robotic and plain. Most likely there are a lot of audiobooks in the training data and little conversational speech. And dropping diffusion was not a great idea: the voice is not crystal clear anymore, it is more like a telephony recording.

nshm commented on BASE TTS: The largest text-to-speech model to-date   amazon-ltts-paper.com/... · Posted by u/jcuenod
qwertox · 2 years ago
Interesting. Just a couple of hours ago I came across MetaVoice-1B [0] (Demo [1]) and was amazed by the quality of their TTS in English (sadly no other languages available).

If this year becomes the year when high quality Open Source TTS and ASR models appear that can run in real-time on an Nvidia RTX 40x0 or 30x0, then that would be great. On CPU even better.

Also note the Ethical Statement on BASE TTS:

> An application of this model can be to create synthetic voices of people who have lost the ability to speak due to accidents or illnesses, subject to informed consent and rigorous data privacy reviews. However, due to the potential misuse of this capability, we have decided against open-sourcing this model as a precautionary measure.

[0] https://github.com/metavoiceio/metavoice-src

[1] https://ttsdemo.themetavoice.xyz/

nshm · 2 years ago
Metavoice is one of a dozen GPT-based TTS systems around, starting from Tortoise. And honestly not that great. You can clearly hear "glass scratches" in their sound, because they trained on MP3-compressed data.

There are much clearer-sounding systems around. You can listen to StyleTTS2 to compare.

nshm commented on Goodbye, Node.js Buffer   sindresorhus.com/blog/goo... · Posted by u/ingve
nshm · 2 years ago
Ok, first we screwed up buffers by making them globally tracked instead of just a piece of memory. Now it's time to break all binary modules again.

u/nshm

Karma: 401 · Joined: March 22, 2013

About: nshmyrev@gmail.com