Readit News
Mehdi2277 commented on Ruff: Python linter and code formatter written in Rust   github.com/astral-sh/ruff... · Posted by u/modinfo
rikthevik · 8 months ago
I'm very impressed by the recent developer experience improvements in the python ecosystem. Between ruff, uv, and https://github.com/gauge-sh/tach we'll be able to keep our django monolith going for a long time.

Any opinions on the current state of the art in type checkers?

Mehdi2277 · 8 months ago
I'm very happy with pyright. Most bug reports are fixed within a week, and new PEPs/features are added very rapidly, usually before the PEP is even accepted (under an experimental flag). Enough that I ended up dropping pylint and now consider pyright sufficient for lint purposes as well. The most valuable lints for my work require good multi-file/semantic analysis, and pylint had various false positives there.

The main tradeoff is that this only works if your codebase/core dependencies are typed. For a while that was not true and we used pylint + pyright. Eventually most of our code became typed and we added type stubs for our main untyped dependencies.

edit: Also on pylint, it did mostly work well. tensorflow was the library that created the most false positives. The other thing I found awkward was that pylint occasionally produced non-deterministic lints on my codebase.
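
To make the multi-file/semantic point concrete, here is a small made-up, single-file stand-in (the function names are hypothetical) for the kind of check that needs type information rather than pure syntax:

    def unit_price(sku: str) -> float | None:
        # Returns None for unknown SKUs.
        return {"apple": 1.50, "pear": 2.25}.get(sku)

    def order_total(skus: list[str]) -> float:
        # pyright flags this call: unit_price() can return None, so the
        # iterable handed to sum() contains "float | None" and the None
        # case is unhandled.
        return sum(unit_price(s) for s in skus)

A purely syntactic linter has no way to know unit_price can return None without following the annotations.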

Mehdi2277 commented on Uv 0.3 – Unified Python packaging   astral.sh/blog/uv-unified... · Posted by u/mantapoint
Lorak_ · a year ago
Does it support building native extensions and Cython modules, or is setuptools still the only reasonable way to do this?
Mehdi2277 · a year ago
Uv is an installer, not a build backend. It's similar to pip. If you install a library with uv, it will call a build backend like setuptools as needed. It is not a replacement for setuptools.
Mehdi2277 commented on Nvidia reportedly delays its next AI chip due to a design flaw   theverge.com/2024/8/3/242... · Posted by u/mgh2
TheAlchemist · a year ago
Interesting, thanks.

Let me reframe the question - assume it's not only 100x GPUs, but that all the performance bottlenecks you've mentioned are also solved or accelerated 100x.

What kind of improvement would we observe, given the current state of the models and knowledge?

Mehdi2277 · a year ago
If you mean LLM-like models similar to ChatGPT, that is pretty debated in the community. Several years ago many people in the ML community believed we were at a plateau and that throwing more compute/money would not give significant improvements. Then LLMs did much better than expected as they scaled up, and they continue to improve on various benchmarks.

So are we now at a performance plateau? I know people at OpenAI-like places who think AGI is likely in the next 3-5 years and is mostly a matter of scaling up context/performance plus a few other key bets. I know others who think it is unlikely in the next few decades.

My personal view is that a 100x speedup would make ML used even more broadly and would allow more companies outside the big players to have their own foundation models tuned for their use cases, or other specialized domain models outside language modeling. Even now I still see tabular datasets (recommender systems, pricing models, etc.) as the most common thing to work on in industry jobs. As for the impact 100x compute will have on leading models like OpenAI's/Anthropic's, I honestly have little confidence in what will happen.

The rest of this is very speculative and I'm not sure of it, but my personal gut feeling is that we still need other algorithmic improvements, like better ways to represent and store memory that models can later query/search. Honestly, part of that is just the math/CS background in me not wanting everything to end up being a hardware problem. The other part is that I'm doubtful human-like intelligence is so compute expensive that we can't find more cost-efficient ways for models to learn, but maybe our nervous system is just much faster at parallel computation?

Mehdi2277 commented on Nvidia reportedly delays its next AI chip due to a design flaw   theverge.com/2024/8/3/242... · Posted by u/mgh2
TheAlchemist · a year ago
I like to think (hope) that the next breakthrough will come not from these huge clusters, but from somebody tinkering with new ideas on a small local system.

I also wonder - is compute the main limiting factor today? Let's imagine there is an unlimited number of Nvidia chips available right now and energy is cheap - would using a cluster 100x the current biggest one result in a significant improvement? My naive intuition is not really.

Mehdi2277 · a year ago
My experience working on ML at a couple of FAANG-like companies is that GPUs actually tend to be too fast compute-wise, and models are often unable to come close to Nvidia's theoretical FLOPS numbers. Very frequently the bottlenecks in profiling are elsewhere. It is very easy for your data reading code to be the bottleneck. I have seen models where the networking was the bottleneck and could not keep up with the compute, and we had to adjust the model architecture in ways that reduce the amount of data transferred across the cluster in each training step. Or maybe you have GPU memory bandwidth as the bottleneck: the key idea in the flash attention work is optimizing attention kernels to lower VRAM usage and stick to the smaller/faster SRAM. This is valuable work, but it is also the kind of work where it is pretty rare for an engineer I have worked with to have the CUDA experience to write custom efficient kernels.

Some of the models I train use a lot of sparse tensors as features, and tensorflow's sparse GPU kernels are rather bad, with many operations either falling back to CPU or sometimes running slower than the equivalent CPU kernel. Several times, densifying and padding tensors with a large fraction of 0's was faster than using the sparse kernel.

I’m sure a few companies/models are optimized enough to fit the ideal case, but it’s rare.
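
As a rough sketch of that sparse-vs-dense trade-off (the shapes and ~10% density here are made up for illustration, not a benchmark of the actual models):

    import tensorflow as tf

    weights = tf.random.normal([4096, 1024])
    # Build a feature matrix that is ~90% zeros, plus a sparse view of it.
    dense_features = tf.where(tf.random.uniform([256, 4096]) < 0.1,
                              tf.random.normal([256, 4096]), 0.0)
    sparse_features = tf.sparse.from_dense(dense_features)

    # Sparse path: tf.sparse kernels, which for some shapes/ops fall back to
    # CPU or run slower than their dense counterparts on GPU.
    y_sparse = tf.sparse.sparse_dense_matmul(sparse_features, weights)

    # Dense path: pay the memory and FLOPs for the zeros, but hit the heavily
    # tuned dense matmul kernel; in the cases above this was sometimes faster.
    y_dense = tf.matmul(dense_features, weights)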

Edit: Another aspect of this is that the model architectures that are good today are very hardware driven. A major advantage of transformers over recurrent LSTM models is training efficiency on GPU. The gap in training efficiency between these two architectures is much more dramatic on GPU than on CPU. Similarly, other architectures with sequential components, like tree-structured/recursive dynamic models, tend to fit GPUs badly performance-wise.

Mehdi2277 commented on Maintaining large-scale AI capacity at Meta   engineering.fb.com/2024/0... · Posted by u/samber
eugenhotaj · a year ago
This is because everyone is training with synchronous SGD. All GPUs need to synchronize on each gradient step, so tail latency will kill you.
Mehdi2277 · a year ago
I’ve worked at companies with async training. Async training does help with fault tolerance and can also improve training throughput by being less reliant on the slowest machine. But it adds meaningful training noise: when we ran experiments against sync training we got much more stable results with sync training, and some of our less stable models would even have occasional loss explosions/divergence issues with async training but be fine with sync training.

Although even for async training, generally I see the dataset just sharded, and if a worker goes down then its shard of data may be lost/skipped, rather than some smarter dynamic file assignment that accounts for workers going down. Even basic things, like resuming from the last checkpoint with the same dataset state when a job fails mid-epoch, are messy when major libraries like tensorflow lack a good dataset checkpointing mechanism.
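
A toy numpy sketch of the difference (not the actual systems above): synchronous training averages all workers' gradients before a single update, while asynchronous training applies each worker's gradient as it arrives, often computed against stale weights, which is where the extra noise comes from.

    import numpy as np

    rng = np.random.default_rng(0)
    lr = 0.1

    def grad(w, batch):
        # Gradient of a simple least-squares loss on one worker's batch.
        x, y = batch
        return x.T @ (x @ w - y) / len(y)

    batches = [(rng.normal(size=(8, 4)), rng.normal(size=8)) for _ in range(4)]

    # Synchronous step: every worker sees the same weights; gradients are
    # averaged and applied as one update.
    w_sync = np.zeros(4)
    w_sync -= lr * np.mean([grad(w_sync, b) for b in batches], axis=0)

    # Asynchronous steps: each worker's gradient was computed against a stale
    # snapshot of the weights but is applied to whatever the latest weights are.
    w_async = np.zeros(4)
    stale_snapshot = w_async.copy()
    for b in batches:
        w_async -= lr * grad(stale_snapshot, b)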

Mehdi2277 commented on Maintaining large-scale AI capacity at Meta   engineering.fb.com/2024/0... · Posted by u/samber
ldjkfkdsjnv · a year ago
Their whole advertising business model gets better with LLM understanding of text. They can target ads better.
Mehdi2277 · a year ago
This is a fair guess on intuition, but working in the recommender space on both content and ad recommendation, content-understanding signals have pretty consistently underwhelmed across two companies and many projects. The key signals are generally engagement signals (including event sequences) and many embeddings (user embedding, creator embedding, ad embedding, etc.).

The main place I’ve seen content understanding help is cold start, especially for new items by new creators.

Mehdi2277 commented on Llama3 implemented from scratch   github.com/naklecha/llama... · Posted by u/Hadi7546
brcmthrowaway · a year ago
Wait, are you saying SoTA NN research hasn't evolved past hardcoding a bunch of layer structures and sizes?

I'm kind of shocked. I thought there would be more dynamism by now - I stopped dabbling around 2018.

Mehdi2277 · a year ago
I’ve occasionally worked with more dynamic models (tree-structured decoding). They are generally not a good fit for maximizing GPU throughput. A lot of the magic of transformers and large language models comes from pushing the GPU as hard as we can, and a simpler static model architecture that trains faster can train on much more data.

So until the hardware allows comparable (say within 2-4x) throughput in samples per second, I expect model architectures to mostly stay static for the most effective models, with dynamic architectures remaining an interesting side area.

Mehdi2277 commented on Google TPU v5p beats Nvidia H100   techradar.com/pro/google-... · Posted by u/wslh
kjkjhgkjyj · 2 years ago
TensorFlow and PyTorch support TPUs. It's pretty painless.
Mehdi2277 · 2 years ago
Having used them heavily, it is nowhere near painless. Where can you get a TPU? To train models you basically need to use GCP services. There are multiple services that offer TPU support: Cloud AI Platform, GKE, and Vertex AI. For GPU you can have a machine and run any tf version you like. For TPU you need different nodes depending on the tf version, and which tf versions are supported per GCP service is inconsistent: some versions are supported on Cloud AI Platform but not Vertex AI, and vice versa. I have had a lot of difficulty trying to upgrade to recent tf versions and discovering the inconsistent service support.

Additionally, many operations that run on GPU are just unsupported on TPU. Sparse tensors have pretty limited support, and there are a bunch of models that will crash on TPU and require refactoring - sometimes pretty heavy, thousands-of-lines refactoring.

edit: PyTorch is even worse. PyTorch does not implement efficient TPU device data loading and generally has poor performance, nowhere near comparable to the tensorflow/jax numbers. I'm unaware of any PyTorch benchmarks where TPU actually wins. For tensorflow/jax, if you can get it running and your model suits TPU assumptions (so a basic CNN), then yes, it can be cost effective. For PyTorch, even simple cases tend to lose.

Mehdi2277 commented on datetime.utcnow() is now deprecated   blog.miguelgrinberg.com/p... · Posted by u/Brajeshwar
selcuka · 2 years ago
> something like 2023.x as pip does

Not sure what you mean by that. The version of pip currently installed on my laptop is 23.3.1. Did you mean some other package, such as pytz?

Mehdi2277 · 2 years ago
The 23 comes from 2023. Pip uses calendar-based versioning: in 23.3.1, the 23 is the year, the 3 is the quarterly release within that year, and the .1 is a bug-fix release in the 23.3 series. The first two numbers are purely date based and do not follow semver.
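
A small illustrative breakdown of that version string using the packaging library (the year/quarter/bug-fix mapping is just the calendar-versioning convention, not an official pip API):

    from packaging.version import Version

    v = Version("23.3.1")
    year = 2000 + v.release[0]   # 23 -> 2023
    quarterly = v.release[1]     # quarterly release number within the year
    bugfix = v.release[2]        # 1 -> first bug-fix release of the 23.3 series
    print(year, quarterly, bugfix)  # 2023 3 1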
Mehdi2277 commented on The different uses of Python type hints   lukeplant.me.uk/blog/post... · Posted by u/BerislavLopac
OJFord · 2 years ago
I'm not sure how a python annotation/type system could possibly do that? If numpy/pandas had different types for different cardinalities it would work today.

You just need those libraries to embrace it really, then you could theoretically have type constructors that provide well-typed NxM matrix types or whatever, allowing you to enforce that [[1,2],[3,4]] is an instance of matrix_t(2, 2).

I don't see how python could possibly make such inferences for arbitrary libraries.

Mehdi2277 · 2 years ago
PEP 646, Variadic Generics (https://peps.python.org/pep-0646/), was made for this specific use case, but mypy is still working on implementing it. And even with it, it's expected that several more PEPs are needed to make operations on variadic types powerful enough to handle common array operations. numpy/tensorflow/etc. do broadcasting a lot, and that would probably need a type-level Broadcast operator just to encode. I also expect the type definitions for numpy will get fairly complex, similar to template-heavy C++ code, after they add shape types.
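
A minimal sketch of what PEP 646 enables, using a toy Array class rather than numpy's real stubs (needs Python 3.11+, or typing_extensions for TypeVarTuple/Unpack):

    from typing import Generic, NewType, TypeVarTuple, Unpack

    Shape = TypeVarTuple("Shape")
    Height = NewType("Height", int)
    Width = NewType("Width", int)

    class Array(Generic[Unpack[Shape]]):
        """Toy array whose shape lives in its type parameters."""

    def add_batch_dim(x: Array[Unpack[Shape]]) -> Array[int, Unpack[Shape]]:
        # Toy body; the interesting part is the type-level shape arithmetic
        # in the signature (prepend a batch axis).
        return Array()

    image: Array[Height, Width] = Array()
    batched = add_batch_dim(image)  # checked as Array[int, Height, Width]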
