Readit News
152334H commented on OpenAI Leaks 120B Open Model on Hugging Face   twitter.com/main_horse/st... · Posted by u/skadamat
arnaudsm · a month ago
Who's the target of 120B open-weights models? You can only run this in the cloud; is it just PR?

I wish they released a nano model for local hackers instead

152334H · a month ago
They have a 20b for GPU poors, too.

I will be running the 120B on my 2x4090-48GB, though.
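As a rough sanity check on why a 120B model fits on 2x48GB cards (my arithmetic, not from the thread): weight memory alone, ignoring KV cache, activations, and framework overhead, looks like this at common precisions.

```python
# Back-of-envelope VRAM needed just for the *weights* of a 120B-parameter
# model at several common precisions. Real usage is higher once KV cache,
# activations, and framework overhead are included.
PARAMS = 120e9
BYTES_PER_PARAM = {"fp16/bf16": 2.0, "int8": 1.0, "int4": 0.5}

for name, bytes_per_param in BYTES_PER_PARAM.items():
    gib = PARAMS * bytes_per_param / 2**30
    print(f"{name:>9}: {gib:6.1f} GiB")
```

At 4-bit quantization the weights come to roughly 56 GiB, which fits in 96 GB of combined VRAM with room for cache; at 8-bit they already do not.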

152334H commented on Eleven v3   elevenlabs.io/v3... · Posted by u/robertvc
palisade · 3 months ago
For reference in case anyone is wondering, it is based on:

https://github.com/152334H/tortoise-tts-fast

The developer of tortoise tts fast was hired by Eleven labs.

152334H · 3 months ago
'was'. I departed almost half a year prior to v3's release this week.
152334H commented on Serious issues in Llama 4 training. VP of AI at Meta resigned   old.reddit.com/r/LocalLLa... · Posted by u/devops000
152334H · 5 months ago
VP of AI at FAIR, who is unrelated to the llama/genai team.
152334H commented on Calculating the cost of a Google DeepMind paper   152334H.github.io/blog/sc... · Posted by u/152334H
arcade79 · a year ago
A lot of misunderstandings among the commenters here.

From the link: "the total compute cost it would take to replicate the paper"

It's not Google's cost. Google's cost is of course entirely different. It's the cost for the author if he were to rent the resources to replicate the paper.

For Google, all of it runs at a "best effort" resource tier, grabbing available resources when they aren't requested by higher-priority jobs. It's effectively free resources (except for electricity consumption). If any "more important" jobs with a higher priority come in and ask for the resources, the paper-writers' jobs are simply preempted.

152334H · a year ago
Is it free-priority based?

I was told by an employee that GDM internally has a credits system for TPU allocation, with which researchers have to budget out their compute usage. I may have completely misunderstood what they were describing, though.

152334H commented on Calculating the cost of a Google DeepMind paper   152334H.github.io/blog/sc... · Posted by u/152334H
pama · a year ago
3 USD/hour for an H100 is much more expensive than a reasonable amortized full-ownership cost, unless one assumes the GPU is useless within 18 months, which I find a bit dramatic. The MFU can be above 40%, and certainly well above the 35% in the estimate, even for small models with plain PyTorch and trivial tuning [1].

I didn't read the linked paper carefully, but I seriously doubt the Google team used vocab embedding layers with the 2·D·V parameters stated in the link, because not tying the weights of the token embedding layer in a decoder architecture would be suboptimal (and even if they did double the params in these layers, it would not lead to 6·D·V compute, because the embedding input is indexed).

To me these assumptions suggested a somewhat careless attitude toward the cost estimation, so I stopped reading the rest of this analysis carefully. My best guess is that the author is off by a large factor in the upward direction, and a true replication with H100/H200 could be about 3x less expensive.

[1] If the total cost estimate were relatively low, say less than 10k, then of course the lowest rental price and a random training codebase might make some sense in order to reduce administrative costs; once the cost is in the ballpark of millions of USD, it feels careless not to optimize it further. H100s occasionally show up in fire sales or on eBay, which could reduce the cost even more, but the author already mentions 2 USD/GPU/hour for bulk rental compute, which is better than the 3 USD/GPU/hour estimate they used in the writeup.
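The amortization claim above can be sketched with a quick calculation. All inputs here are illustrative assumptions of mine (purchase price, lifetime, utilization, power draw), not figures from the thread:

```python
# Rough amortized ownership cost of an H100 versus renting at $3/GPU-hour.
# Every input below is an illustrative assumption, not a figure from the thread.
purchase_usd = 30_000        # assumed all-in price per GPU, incl. server share
lifetime_years = 3           # depreciation horizon
utilization = 0.8            # fraction of wall-clock time actually in use
power_kw = 0.7               # approximate H100 SXM board power
pue = 1.2                    # datacenter power-usage-effectiveness overhead
elec_usd_per_kwh = 0.10

useful_hours = lifetime_years * 365 * 24 * utilization
capex_per_hour = purchase_usd / useful_hours
elec_per_hour = power_kw * pue * elec_usd_per_kwh
total = capex_per_hour + elec_per_hour
print(f"amortized: ${total:.2f}/GPU-hour vs $3.00/GPU-hour rental")
```

Under these assumptions the amortized cost lands around $1.50/GPU-hour, i.e. roughly half the $3 rental figure, which is the gap pama is pointing at.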

152334H · a year ago
You are correct on true H100 ownership costs being far lower. As I mention in the H100 blurb, the H100 numbers are fungible and I don't mind if you halve them.

MFU can certainly be improved beyond 40%, as I mention. But on the point of small models specifically: the paper uses FSDP for all models, and I believe a rigorous experiment should not vary sharding strategy due to numerical differences. FSDP2 on small models will be slow even with compilation.

The paper does not tie embeddings, as stated. The readout layer does lead to 6DV, because it is a linear layer with D·V parameters, which costs 2x FLOPs on the forward pass and 4x on the backward. I would appreciate it if you could limit your comments to factual errors in the post.
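The 2x/4x split in the comment above follows the standard accounting rule for dense matmuls: each weight costs about 2 FLOPs per token on the forward pass (multiply + add) and 4 on the backward (one matmul for activation gradients, one for weight gradients). A sketch with illustrative D and V values (not the paper's):

```python
# FLOPs per token for an untied readout (lm-head) layer of shape D x V,
# using the standard 2x-forward / 4x-backward rule for dense matmuls.
# D and V are illustrative, not the paper's values.
D, V = 4096, 32_000
params = D * V
fwd = 2 * params   # 2 FLOPs per weight per token: multiply + accumulate
bwd = 4 * params   # activation grads + weight grads, one matmul each
total = fwd + bwd  # = 6 * D * V, matching the 6DV figure in the comment
print(f"forward: {fwd:.3e}  backward: {bwd:.3e}  total: {total:.3e}")
```

Note this counts compute for the readout matmul only; as the comment says, the input embedding is an indexed lookup and contributes no comparable matmul FLOPs.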

152334H commented on GPT-4o   openai.com/index/hello-gp... · Posted by u/Lealen
tedsanders · a year ago
Yeah, the way the app currently works is that ChatGPT-4o only sees up to the moment of your last comment.

For example, I tried asking ChatGPT-4o to commentate a soccer game, but I got pretty bad hallucinations, as the model couldn’t see any new video come in after my instruction.

So when using ChatGPT-4o you’ll have to point the camera first and then ask your question - it won’t work to first ask the question and then point the camera.

(I was able to play with the model early because I work at OpenAI.)

152334H · a year ago
thanks
152334H commented on Non-determinism in GPT-4 is caused by Sparse MoE   152334H.github.io/blog/no... · Posted by u/152334H
dylan604 · 2 years ago
I want to know what a non-boring flight would be like

u/152334H
Karma: 631 · Cake day: November 26, 2022
About: https://github.com/152334H