https://github.com/152334H/tortoise-tts-fast
The developer of tortoise-tts-fast was hired by ElevenLabs.
From the link: "the total compute cost it would take to replicate the paper"
It's not Google's cost. Google's cost is of course entirely different. It's the cost for the author if he were to rent the resources to replicate the paper.
For Google, all of it is running at a "best effort" resource tier, grabbing available resources when they are not requested by higher-priority jobs. It's effectively free (except for electricity consumption). If any "more important" job with a higher priority comes in and asks for the resources, the paper writers' jobs will just be preempted.
I was told by an employee that GDM internally has a credits system for TPU allocation, with which researchers have to budget out their compute usage. I may have completely misunderstood what they were describing, though.
[1] If the total cost estimate were relatively low, say less than 10k USD, then of course the lowest rental price and a random training codebase might make some sense in order to reduce administrative costs; once the cost is in the ballpark of millions of USD, it feels careless not to optimize it further. H100s occasionally show up in fire sales or on eBay, which could reduce the cost even more, but the author already mentions 2 USD/GPU/hour for bulk rental compute, which is better than the 3 USD/GPU/hour estimate they used in the writeup.
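To put the two rental rates side by side, here is a rough sketch. The 2 and 3 USD/GPU/hour rates are the ones mentioned above; the GPU-hour count is a made-up placeholder, not a number from the writeup, and the function names are mine.

```python
def rental_cost_usd(gpu_hours: float, usd_per_gpu_hour: float) -> float:
    """Total rental cost for a given number of GPU-hours at a flat hourly rate."""
    return gpu_hours * usd_per_gpu_hour


def savings_fraction(old_rate: float, new_rate: float) -> float:
    """Fraction of the bill saved by moving from old_rate to new_rate."""
    return 1.0 - new_rate / old_rate


# Placeholder workload: 1M GPU-hours (hypothetical, for illustration only).
GPU_HOURS = 1_000_000

cost_at_3 = rental_cost_usd(GPU_HOURS, 3.0)  # rate used in the writeup
cost_at_2 = rental_cost_usd(GPU_HOURS, 2.0)  # bulk rate the author mentions
saved = savings_fraction(3.0, 2.0)           # one third, whatever the workload

print(f"{cost_at_3:,.0f} USD vs {cost_at_2:,.0f} USD ({saved:.0%} cheaper)")
```

The saving is a third regardless of the workload size, which only starts to matter in absolute terms once the total is in the millions.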
MFU can certainly be improved beyond 40%, as I mention. But on the point of small models specifically: the paper uses FSDP for all models, and I believe a rigorous experiment should not vary sharding strategy due to numerical differences. FSDP2 on small models will be slow even with compilation.
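For anyone unfamiliar with the term, a minimal sketch of how MFU is usually computed and why it feeds straight into cost. All numbers here are placeholders I picked to land on 40%, not measurements from the paper or the post.

```python
def mfu(tokens_per_second: float, flops_per_token: float,
        peak_flops_per_second: float) -> float:
    """Model FLOPs utilization: useful model FLOPs achieved / hardware peak."""
    return tokens_per_second * flops_per_token / peak_flops_per_second


def gpu_hours_scaling(old_mfu: float, new_mfu: float) -> float:
    """Factor by which GPU-hours change for the same training FLOPs
    when utilization moves from old_mfu to new_mfu."""
    return old_mfu / new_mfu


# Hypothetical throughput and peak, chosen only so the result is 0.40.
baseline = mfu(tokens_per_second=1_000, flops_per_token=4e11,
               peak_flops_per_second=1e15)

# Going from 40% to 50% MFU would cut GPU-hours (and rental cost) to 0.8x.
scaling = gpu_hours_scaling(0.40, 0.50)

print(f"MFU = {baseline:.2f}, GPU-hours scale by {scaling:.2f}x")
```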
The paper does not tie embeddings, as stated. The readout layer does lead to 6DV because it is a linear layer of D*V, which takes 2x for a forward and 4x for a backward. I would appreciate it if you could limit your comments to factual errors in the post.
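The 6DV count for the readout layer can be written out as arithmetic. This is only a sketch of the standard accounting (one multiply plus one add per weight in the forward pass, and two matmuls of the same size in the backward pass); the D and V values are illustrative, not taken from the paper.

```python
def readout_flops_per_token(d_model: int, vocab_size: int) -> dict:
    """Per-token FLOPs for an untied D x V linear readout layer."""
    # Forward: one multiply and one add per weight.
    forward = 2 * d_model * vocab_size
    # Backward: gradient w.r.t. the input plus gradient w.r.t. the weights,
    # each a matmul costing 2 * D * V.
    backward = 4 * d_model * vocab_size
    return {"forward": forward, "backward": backward,
            "total": forward + backward}


# Illustrative sizes only (hypothetical, not the paper's configuration).
D, V = 1024, 32_000
flops = readout_flops_per_token(D, V)

assert flops["total"] == 6 * D * V
print(flops)
```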
For example, I tried asking ChatGPT-4o to commentate a soccer game, but I got pretty bad hallucinations, as the model couldn’t see any new video come in after my instruction.
So when using ChatGPT-4o you’ll have to point the camera first and then ask your question - it won’t work to first ask the question and then point the camera.
(I was able to play with the model early because I work at OpenAI.)
I wish they released a nano model for local hackers instead
I will be running the 120B on my 2x4090-48GB, though.