amrb (u/amrb) - Readit News

amrb commented on DisTrO – a family of low latency distributed optimizers github.com/NousResearch/D... · Posted by u/SchwKatze

simonw · a year ago

Most of the information about this is in this PDF (I hate when people publish interesting information exclusively in PDFs): https://raw.githubusercontent.com/NousResearch/DisTrO/main/A...

I converted it to Markdown (using Gemini 1.5 Pro) and pasted it into a Gist here: https://gist.github.com/simonw/46a33d66e069efe5c10b63625fdab...

From the abstract:

> Training large scale neural networks typically involves sharing gradients between all accelerators, which necessitates specialized, high-speed interconnects. To address this, we introduce DisTrO, a family of architecture-agnostic and network-agnostic distributed optimizers that reduces the inter-GPU communication requirements by four to five orders of magnitude without relying on amortized analysis, enabling low-latency training of large neural networks on slow internet bandwidths with heterogeneous networking hardware.

This could be a HUGE deal.

Currently if you want to train giant LLMs you need a big pile of GPUs in the same location as each other due to the amount of information that needs to shuffle between them during training.

If DisTrO works as intended, it will be possible to train models using GPUs in different places - potentially enabling SETI@home style training where thousands of people with gaming PCs at home could donate their GPU time to a large training effort.

Their tweet about this has more: https://twitter.com/NousResearch/status/1828121648383566270

> Nous Research is proud to release a preliminary report on DisTrO (Distributed Training Over-the-Internet) a family of architecture-agnostic and network-agnostic distributed optimizers that reduces the inter-GPU communication requirements by 1000x to 10,000x without relying on amortized analysis, and matches AdamW+All-Reduce in convergence rates. This enables low-latency training of large neural networks on slow internet bandwidths with heterogeneous networking hardware.

> DisTrO can increase the resilience and robustness of training LLMs by minimizing dependency on a single entity for computation. DisTrO is one step towards a more secure and equitable environment for all participants involved in building LLMs.

> Without relying on a single company to manage and control the training process, researchers and institutions can have more freedom to collaborate and experiment with new techniques, algorithms, and models. This increased competition fosters innovation, drives progress, and ultimately benefits society as a whole.

amrb · a year ago

It's a red flag that the 1.2bil model has to fit in gpu memory, happy to be provided wrong when the code drops

amrb commented on Tokens are a big reason today's generative AI falls short techcrunch.com/2024/07/06... · Posted by u/anigbrowl

vessenes · a year ago

T-FREE is interesting, at least, I find it interesting in that I don’t really understand it. They take successive character triples of all words, and then hash them, and then use the hash table slots landed in as destinations to feed into an embedding space? Can I possibly be understanding that chart properly?

Can you explain this any better than the first few pages of the paper? I’d like some intuition about why T-FREE works; there are lots of reasons to prefer different tokenization schemes, but I can’t really get this one into my head from the paper, unfortunately.

amrb · a year ago

Can't say I mastered the concept either, I'm waiting for the code [0] to be release so I can run some head-to-head tests.

[0] https://github.com/Aleph-Alpha/trigrams

amrb commented on Tokens are a big reason today's generative AI falls short techcrunch.com/2024/07/06... · Posted by u/anigbrowl

amrb · a year ago

An alternative approache to BPE tokenization https://arxiv.org/abs/2406.19223

amrb commented on Distributed Inference and Fine-Tuning of Large Language Models over the Internet browse.arxiv.org/html/231... · Posted by u/cyanf

amrb · 2 years ago

Can check out their project at https://github.com/bigscience-workshop/petals

amrb commented on How to Reduce Memory Requirements with Product Quantization weaviate.io/blog/pq-resco... · Posted by u/hc32

amrb · 2 years ago

Speaking of quantized vectors https://huggingface.co/papers/2309.14717

amrb commented on Petals runs Llama 2 (70B) from Colab at 5 tokens/sec github.com/bigscience-wor... · Posted by u/borzunov

amrb · 2 years ago

Great project and I'm happy to see it expand to more models!

amrb commented on Reddark: Website to watch subreddits going dark reddark.netlify.app/... · Posted by u/morjom

amrb · 2 years ago

What's the new reddit to try?

amrb commented on Ask HN: Has an API key issuer ever leaked their own customers’ API keys · Posted by u/mathewpregasen

amrb · 2 years ago

Anything can end up in logs, then it depends on getting access to hosted splunk via employee creds, for a hypothetical breach.

amrb commented on 90% of laid-off H1-B visa holders were able to find new work fortune.com/2023/05/26/90... · Posted by u/rustoo

glitchc · 2 years ago

All that demonstrates is that it is cheaper to hire the H1-B visa holder over a local developer with the same experience.

amrb · 2 years ago

There is a salary requirement, as not to under cut local works. Of course if you working over 40 hours a week maybe the company gets it's pound of flesh!

amrb commented on Tree of Thoughts github.com/kyegomez/tree-... · Posted by u/kevinslin

amrb · 2 years ago

Good talk on the paper https://www.youtube.com/watch?v=ut5kp56wW_4