Let me reframe the question - assume it's not only 100x GPUs, but that all the performance bottlenecks you've mentioned are also solved or accelerated 100x.
What kind of improvement would we observe, given the current state of models and knowledge?
So are we now at a performance plateau? I know people at OpenAI-like places who think AGI is likely in the next 3-5 years and is mostly a matter of scaling up context/performance and a few other key bets. I know others who think that is unlikely in the next few decades.
My personal view is that a 100x speedup would make ML used even more broadly and allow more companies outside the big players to have their own foundation models tuned for their use cases, or other specialized domain models outside language modeling. Even now I still see tabular datasets (recommender systems, pricing models, etc.) as the most common thing to work on in industry jobs. As for the impact 100x compute would have on leading models like OpenAI's/Anthropic's, I honestly have little confidence in what will happen.
The rest of this is very speculative and I'm not sure of it, but my personal gut is that we still need other algorithmic improvements, like better ways to represent and store memory that models can later query/search. Honestly, part of that is just the math/CS background in me not wanting everything to end up being a hardware problem. The other part is that I'm doubtful human-like intelligence is so compute-expensive that we can't find more cost-efficient ways for models to learn, but maybe our nervous system is just much faster at parallel computation?
I also wonder: is compute the main limiting factor today? Let's imagine there is an unlimited number of Nvidia chips available right now and energy is cheap. Would using a cluster 100x the current biggest one result in a significant improvement? My naive intuition is not really.
I'm sure a few companies/models are optimized enough to fit the ideal case, but it's rare.
Edit: Another aspect of this is that the model architectures that are good today are very hardware-driven. A major advantage of transformers over recurrent LSTM models is training efficiency on GPU. The gap in training efficiency between these two architectures is much more dramatic on GPU than on CPU. Similarly, other architectures with sequential components, like tree-structured/recursive dynamic models, tend to fit GPUs badly performance-wise.
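A toy sketch of where that gap comes from (plain PyTorch, my own example, not a benchmark): the recurrent model has to step through time sequentially, while self-attention handles every position in one batched matmul that a GPU can saturate.

```python
# Toy illustration (my own sketch, not a benchmark): an LSTM must step
# through time sequentially, while self-attention is one batched matmul.
import torch

B, T, D = 8, 128, 64                         # batch, sequence length, hidden size
x = torch.randn(B, T, D)

# Recurrent: T dependent steps, each waiting on the previous hidden state.
cell = torch.nn.LSTMCell(D, D)
h = torch.zeros(B, D)
c = torch.zeros(B, D)
for t in range(T):                           # sequential loop, hard to parallelize
    h, c = cell(x[:, t, :], (h, c))

# Attention: scores for all T positions at once, one large matmul
# that keeps a GPU busy regardless of sequence length.
q = k = v = x
scores = torch.softmax(q @ k.transpose(1, 2) / D ** 0.5, dim=-1)  # (B, T, T)
out = scores @ v                             # (B, T, D)
```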
Although even for async training, what I generally see is the dataset just sharded, and if a worker goes down its shard of data may be lost/skipped, rather than some kind of smarter dynamic file assignment that accounts for workers going down. Even basic things like "the job fails, continue from the last checkpoint with the same dataset state within a large epoch" are messy when major libraries like TensorFlow lack a good dataset checkpointing mechanism.
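To make the sharding point concrete, here's roughly the static-assignment pattern I mean, in tf.data terms (my own sketch; the file pattern and worker counts are hypothetical):

```python
# Rough sketch of static sharding (hypothetical file pattern and counts;
# not taken from any real training job).
import tensorflow as tf

NUM_WORKERS = 8
WORKER_INDEX = 3  # this worker's id

files = tf.data.Dataset.list_files("gs://my-bucket/train-*.tfrecord", shuffle=False)
# Static assignment: worker i always gets files i, i+N, i+2N, ...
shard = files.shard(num_shards=NUM_WORKERS, index=WORKER_INDEX)
dataset = (
    shard.interleave(tf.data.TFRecordDataset, cycle_length=4)
         .shuffle(10_000)
         .batch(256)
)
# If this worker dies mid-epoch, nothing re-assigns its remaining files to
# the surviving workers; that slice of data is simply skipped for the epoch.
```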
The main place I've seen content understanding help is cold start, especially for new items by new creators.
I'm kind of shocked. I thought there would be more dynamism by now and I stopped dabbling in like 2018.
So until the hardware allows comparable (say within 2-4x) throughput of samples per second, I expect model architectures to mostly stay static for the most effective models, and dynamic architectures to remain an interesting side area.
Additionally, many operations that run on GPU are just unsupported on TPU. Sparse tensors have pretty limited support, and there's a bunch of models that will crash on TPU and require refactoring, sometimes pretty heavy refactoring on the order of thousands of lines.
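To give a flavor of the kind of refactor involved, here's a toy example (mine, not from any real model; TPU op coverage also varies by version, so treat it as illustrative only): the sparse path is fine on GPU, and the TPU-friendly rewrite densifies the tensor and pays the memory cost.

```python
# Toy example of the sparse -> dense refactor pattern (my own sketch;
# TPU op coverage varies, so this is illustrative only).
import tensorflow as tf

# A sparse feature matrix, e.g. multi-hot categorical features.
sp = tf.sparse.SparseTensor(
    indices=[[0, 1], [1, 3], [2, 0]],
    values=[1.0, 1.0, 1.0],
    dense_shape=[4, 8],
)
weights = tf.random.normal([8, 16])

# GPU path: multiply directly against the sparse tensor.
out_sparse = tf.sparse.sparse_dense_matmul(sp, weights)

# TPU-friendly rewrite: densify first, then a plain matmul.
# Costs memory proportional to the dense shape, which is the tradeoff.
out_dense = tf.matmul(tf.sparse.to_dense(sp), weights)
```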
edit: PyTorch is even worse. PyTorch does not implement efficient TPU device data loading and generally has poor performance, nowhere near comparable to TensorFlow/JAX numbers. I'm unaware of any PyTorch benchmarks where TPU actually wins. For TensorFlow/JAX, if you can get it running and your model suits TPU assumptions (so a basic CNN), then yes, it can be cost-effective. For PyTorch, even simple cases tend to lose.
Not sure what you mean by that. The version of pip currently installed on my laptop is 23.3.1. Did you mean some other package, such as pytz?
You just need those libraries to embrace it, really; then you could theoretically have type constructors that provide well-typed NxM matrix types or whatever, allowing you to enforce that [[1,2],[3,4]] is an instance of matrix_t(2, 2).
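Something like this sketch, using standard typing machinery (all the names here - Matrix, Rows, Cols, matmul - are made up for illustration, not any library's API):

```python
# Hand-rolled sketch of a dimension-parameterized matrix type; every name
# here (Matrix, Rows, Cols, matmul) is hypothetical, not from any library.
from typing import Generic, Literal, TypeVar

Rows = TypeVar("Rows", bound=int)
Cols = TypeVar("Cols", bound=int)
Inner = TypeVar("Inner", bound=int)

class Matrix(Generic[Rows, Cols]):
    def __init__(self, data: list[list[float]]) -> None:
        self.data = data

def matmul(a: Matrix[Rows, Inner], b: Matrix[Inner, Cols]) -> Matrix[Rows, Cols]:
    # Naive loop; the interesting part is the signature, which lets a
    # checker reject calls whose inner dimensions do not line up.
    return Matrix([[sum(x * y for x, y in zip(row, col)) for col in zip(*b.data)]
                   for row in a.data])

# In principle a checker can verify these annotations are consistent:
m: Matrix[Literal[2], Literal[2]] = Matrix([[1.0, 2.0], [3.0, 4.0]])
square: Matrix[Literal[2], Literal[2]] = matmul(m, m)
```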
I don't see how python could possibly make such inferences for arbitrary libraries.
Any opinions about the current state of the art type checker?
The main tradeoff is that this only works if your codebase/core dependencies are typed. For a while that was not true and we used pylint + pyright. Eventually most of our code became typed, and we added type stubs for our main untyped dependencies.
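For reference, the stubs are just .pyi files the checker picks up in place of the untyped package; here's a made-up example for a hypothetical dependency called legacy_pricing:

```python
# stubs/legacy_pricing.pyi -- hand-written stub for a hypothetical untyped
# dependency; the module, class, and function names are all made up.
from collections.abc import Sequence

def score_items(item_ids: Sequence[int], model_path: str) -> list[float]: ...

class PricingModel:
    def __init__(self, model_path: str) -> None: ...
    def predict(self, features: dict[str, float]) -> float: ...
```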
edit: Also, on pylint: it mostly did work well. TensorFlow was the library that created the most false positives. The other thing I found awkward was that occasionally pylint produced non-deterministic lints on my codebase.
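(For that class of false positive, pylint's documented knobs are the generated-members and ignored-modules settings; a .pylintrc sketch, with the patterns only as examples, not what was used here:)

```ini
# .pylintrc sketch: stop no-member false positives for attributes pylint
# cannot resolve statically (patterns below are just examples, tune as needed).
[TYPECHECK]
ignored-modules=tensorflow
generated-members=tf.*
```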