These ML-compilers are being overhyped. It's all the same trade-off as a traditional compiler: you get a lot more throughput than hiring a specialist performance programmer, but the latter will typically outperform, possibly by orders of magnitude.
These things are inferior at many levels:
- Algorithmic: These things aren't feeding back to their human masters tips and tricks on how to modify the network to go faster beyond some very basic signals.
- Loss of intent: ML network designers are specifying architecture in Python, and by the time it's gone through many layers of lowering, you can get some complete garbage. Highly efficient garbage, but still garbage. (A recent example: we caught one of these compilers doing a slice update by first forming the range of all possible indices to the array, slicing that to get the indices to update, and then doing a scatter; we replaced it with a single memcpy call. See the rough sketch after this list.)
- Inefficient kernels. Every time we see the output of these compilers go head-to-head with an expert assembly programmer, the compiler loses, often by 30%+. This always seems like the sort of thing that should be easy to solve, but given no-one seems to have cracked it in the past 50 years, it's obviously not as simple as it sounds.
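To illustrate the slice-update example above, here is a rough numpy sketch of the pattern (the array names and shapes are made up for illustration; the real case was in the compiler's lowered IR, not Python):

    import numpy as np

    buf = np.zeros(1024, dtype=np.float32)   # destination buffer (hypothetical)
    new = np.ones(128, dtype=np.float32)     # values for the slice update
    start = 256

    # Roughly what the compiler emitted: materialize every index, slice the
    # index range, then scatter the update through that index array.
    all_idx = np.arange(buf.shape[0])
    upd_idx = all_idx[start:start + new.shape[0]]
    buf[upd_idx] = new                        # scatter via an index array

    # The hand-written replacement: one contiguous copy, i.e. a single memcpy.
    buf[start:start + new.shape[0]] = new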
Take a look at the chess engine Stockfish: they tossed out years and years of human-written board-evaluation heuristics in favor of a small neural net that does the same job, but better.
Now consider all the heuristics for inlining, loop unrolling, vectorization etc. in compilers; a neural net could certainly be beneficial there, and possibly easier to maintain than tons of human-written heuristics.
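As a toy sketch of that idea (the feature set, network, and threshold are all hypothetical; this is not how any production compiler currently decides):

    import torch
    import torch.nn as nn

    # Hypothetical call-site features: callee size, call count,
    # argument count, loop depth of the call site.
    features = torch.tensor([[120.0, 3500.0, 4.0, 2.0]])

    # A small MLP standing in for a pile of hand-written inlining heuristics;
    # in practice it would be trained against measured speedups.
    inline_scorer = nn.Sequential(
        nn.Linear(4, 16),
        nn.ReLU(),
        nn.Linear(16, 1),
        nn.Sigmoid(),      # probability that inlining pays off
    )

    should_inline = inline_scorer(features).item() > 0.5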
We'll have to see. I could definitely see someone spending a lot of time training for a specific algorithmic kernel and microarchitecture and beating the best human results (by a few percent).
I'd be very surprised if that can be extended to a large complex algorithmic system that is amenable to mathematical reformulations (at least within the next 10 years).
Funny you should say that, because traditional compilers have been incredibly useful.
> It's all the same trade-off as a traditional compiler: you get a lot more throughput than hiring a specialist performance programmer, but the latter will typically outperform, possibly by orders of magnitude.
That throughput is the point though? You cannot have performance specialists on every single ML workload. It's still significantly better than not having these kinds of optimization.
What's the actual state of these "ML compilers" currently, and what is the near-term promise?
One of the easiest approaches is torch.compile, the latest iteration of the PyTorch compiler (previous methods were TorchScript and FX Tracing).
You simply write:

    model = torch.compile(model)
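In context, a minimal runnable sketch (the toy model and input shapes are made up purely for illustration):

    import torch
    import torch.nn as nn

    # Any ordinary eager-mode model works; this toy MLP is just a placeholder.
    model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))

    # torch.compile wraps the model: the first call traces and compiles it,
    # later calls reuse the compiled graph.
    model = torch.compile(model)

    x = torch.randn(32, 128)
    y = model(x)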
"Across these 163 open-source models torch.compile works 93% of time, and the model runs 43% faster in training on an NVIDIA A100 GPU. At Float32 precision, it runs 21% faster on average and at AMP Precision it runs 51% faster on average."[1]
What Google is trying to do is involve more people in the R&D of these kinds of methods.
The near-term promise is that you can use AMD, CUDA, TPUs, CPUs etc. without explicit vendor support for the framework on which the model was developed.
It actually sounds very useful and cool, I just completely did not get that from the article.
Disclaimer: I will be very handwavey, reality is complex.
This is achieved by compiling the graph into some intermediate representation and then implementing the right backend for it. For projects in this space, look at StableHLO, IREE, and OpenXLA.
You can argue that JAX's jit compiler is a form of such a compiler, mapping the traced operations down to XLA, which then does its own bit of magic to make it work on your backend.
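A small sketch of what that looks like from the user's side (assuming a recent JAX version; the lower()/as_text() inspection calls are the part to double-check against your installed release):

    import jax
    import jax.numpy as jnp

    def f(x):
        return jnp.sum(jnp.tanh(x) ** 2)

    x = jnp.ones((8, 128))

    # jit traces f into a graph of XLA/StableHLO operations...
    jitted = jax.jit(f)

    # ...which you can inspect before the backend-specific compilation step.
    print(jitted.lower(x).as_text()[:400])

    y = jitted(x)   # compiled for whatever backend (CPU/GPU/TPU) is available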
It's transformations and abstractions all the way down.
Summary: they improve prediction of the run-time performance of a computation graph using a GNN. They use an embedding dictionary for each node's opcode along with some other node features (e.g. shape, bits, window size; see [1]). They released a big dataset of these graphs, with varying XLA compilation configurations and their resulting perf on TPUs, in [2]. In [3] they improved prediction on bigger graphs than before by partitioning the graph (METIS graph partitioning, which was new to me) plus other training tricks.
This is only about predicting performance of a given graph and not about improving/suggesting/editing a new equivalent graph. As in FunSearch, models which have decent predictive power could be used with evolutionary search.
[1] https://github.com/google-research-datasets/tpu_graphs#featu...
[2] TpuGraphs: A Performance Prediction Dataset on Large Tensor Computational Graphs, https://arxiv.org/abs/2308.13490
[3] Learning Large Graph Property Prediction via Graph Segment Training, https://arxiv.org/abs/2305.12322
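A minimal sketch of the featurization idea in the summary above (the opcode vocabulary size, feature count, and the single message-passing step are illustrative guesses, not the papers' actual model):

    import torch
    import torch.nn as nn

    NUM_OPCODES = 128     # hypothetical opcode vocabulary size
    NUM_NODE_FEATS = 6    # hypothetical numeric features (shape dims, bits, window size, ...)

    class TinyCostModel(nn.Module):
        def __init__(self, hidden=64):
            super().__init__()
            self.opcode_emb = nn.Embedding(NUM_OPCODES, hidden)  # embedding dictionary per opcode
            self.feat_proj = nn.Linear(NUM_NODE_FEATS, hidden)
            self.msg = nn.Linear(hidden, hidden)
            self.readout = nn.Linear(hidden, 1)                  # predicted runtime (scalar)

        def forward(self, opcodes, feats, adj):
            # opcodes: [N] ints, feats: [N, NUM_NODE_FEATS], adj: [N, N] dense adjacency
            h = self.opcode_emb(opcodes) + self.feat_proj(feats)
            h = torch.relu(self.msg(adj @ h))      # one crude message-passing step
            return self.readout(h.mean(dim=0))     # pool over nodes, predict graph runtime

    # Toy usage on a random 10-node graph.
    model = TinyCostModel()
    pred = model(torch.randint(0, NUM_OPCODES, (10,)),
                 torch.randn(10, NUM_NODE_FEATS),
                 torch.eye(10))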
Can anyone explain how conv works in that graph? You have a tensor of shape [2,4,16] and you convolve with a kernel of shape [4,16,8], and that gives you a [2,8] tensor? How's that possible?
*1. Input:*
* Tensor shape: [2,4,16]
* `2`: This represents the *batch size*, meaning there are two independent data samples being processed.
* `4`: This is the *input feature dimension*, indicating each sample has 4 features.
* `16`: This is the *input channel dimension*, suggesting each feature has 16 channels of information.
*2. Kernel:*
* Shape: [4,16,8]
* `4`: This is the *kernel size*, meaning the filter window used to convolve has a width of 4.
* `16`: This matches the *input channel dimension*, ensuring the filter operates on the same number of channels as the input.
* `8`: This is the *output channel dimension*, indicating the convolution produces 8 new channels of information per sample.
*3. Output:*
* Shape: [2,8]
* `2`: This remains the *batch size* as the operation is applied to each sample independently.
* `8`: This matches the *output channel dimension* of the kernel, signifying the final tensor has 8 new features extracted from the input.
*4. How is it possible?*
Despite the seemingly mismatched dimensions in the input and output, convolution on graphs works by leveraging the *neighborhood structure* of the graph. Here's a simplified explanation:
* The kernel slides across the graph, applying its weights to the features of the current node and its neighbors within a specific radius.
* This weighted sum is then aggregated to form a new feature for the current node in each output channel.
* As the kernel moves across the graph, it extracts information from the local neighborhood of each node, creating new features that capture relationships and patterns within the graph.
*Additional considerations:*
* The graph structure and edge weights likely play a role in how information propagates during the convolution process.
* Specific details of the convolution implementation, including padding and stride, might also influence the output shape.
Thanks. What was confusing me is the kernel size 4. Normally in (2D) convolutions you have (in_channels, out_channels, k, k) for a kxk kernel size. In the example above, the k is the first dimension instead of the last. This is in PyTorch; not sure about Keras.
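For what it's worth, the shapes line up if you read the node as a 1-D convolution in NWC/WIO layout with no padding (an assumption about this particular graph, not something stated above): input [2,4,16] = (batch, width 4, 16 channels), kernel [4,16,8] = (width 4, 16 in-channels, 8 out-channels), so the spatial axis collapses to 1 and [2,1,8] is effectively [2,8]. A quick check:

    import jax
    import jax.numpy as jnp

    x = jnp.ones((2, 4, 16))   # (batch, spatial width, input channels) -- NWC
    w = jnp.ones((4, 16, 8))   # (kernel width, input channels, output channels) -- WIO

    # VALID padding: output width = 4 - 4 + 1 = 1, so the spatial axis collapses.
    y = jax.lax.conv_general_dilated(
        x, w,
        window_strides=(1,),
        padding="VALID",
        dimension_numbers=("NWC", "WIO", "NWC"),
    )
    print(y.shape)   # (2, 1, 8) -- squeeze the middle axis and you get [2, 8]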
Off the top of my head, I can think of at least five foundation models (Llama, Claude, Gemini, Falcon, Mistral) that are all trading blows, but GPT is still a head above them and has been for a year now. Transformer LLMs are simple enough that, demonstrably, anyone with a million bucks of GPU time can make one, but they can't quite catch up with OpenAI. What's their special sauce?
Their special sauce is most probably the quality of data and the amount of data cleaning effort they put in.
I'm speculating here, but I think Google always refrains from getting into the manual side of things. With LLMs, it became obvious very fast that data is what matters. Seeing Microsoft's phi-2 play, I'm even more convinced of this.
DeepMind understood the properties and came up with Chinchilla, but DeepMind couldn't integrate well with Google in terms of understanding what kind of data Google should supply to increase model quality.
OpenAI put in annotation/cleaning work almost right from the start. I'm not too familiar with this, but human labor was heavily utilized to increase training data quality after ChatGPT started.
I kinda wonder if maybe it's at least partially due to OpenAI hitting a kind of hyperparameter lottery. When each experiment costs millions, it might be that (aside from good/unique data) they just have a good set of hyperparameters used in training, and it's too expensive for a competitor to find equal or better settings.
Besides the fact that Gemini Pro is more comparable to GPT-3.5, one more interesting observation is that even OpenAI themselves have not been able (or have not intended) to deliver a significantly better model than GPT-4 for almost a year now. And OpenAI does not seem to be hiding their own magical "AGI" behind the scenes; reportedly they've been more focused on efficiency and engineering work, primarily driven by Sam, rather than on developing a new model. I'm reasonably sure that the current transformer itself as an architecture is at its peak and most improvements will be mostly incremental.
Note, Gemini Ultra, which they claim is competitive with or possibly even better than GPT-4, isn’t out yet. They have released a weaker model, Gemini Pro.
It will be interesting to see how capable Gemini Ultra actually is. For now we wait.
The pace at which ML is advancing right now is amazing. I don't believe in the singularity, but it's changing software, and in turn society, in ways no one can predict.
At great risk of sounding completely ignorant, this approach is basically what I thought the point of machine learning was - cleverly using feedback loops to improve things automatically. The thing that sticks out to me as particularly cool about FunSearch is the use of programs as inputs/outputs and the fact that they managed to automate feedback.
I'm pretty naive in terms of granular understanding here as I am barely proficient in Python, to be clear, but when I daydream about things you could solve with machine learning/AI, this is the approach I always think of and I guess is how I thought it already worked. Load it up with the best information we have currently, define the desired results as clearly as possible, implement some form of automatic feedback, and let it run iteratively until it produces something better than what you had before.
Is this a case of "well no shit, but actually implementing that effectively is the hard part"? Is it being able to quickly apply it to a wide variety of problems? I guess I'm trying to understand whether this is a novel idea (and if so, what parts are novel), or if the idea has been around and it's a novel implementation.
The first 3 have, as of today, resulted in trillions of dollars of economic activity. And they have changed societies, politics, political participation, access to knowledge etc. worldwide, for good and bad. So I don't get why you are so dismissive of them.
I think the biggest blind spot for many programmers/coders is that yes, it might not change much for them, but it will allow many more people to code and do stuff that they were not able to before. As the models get better and people use them more and learn how to use them more efficiently, they will start changing things.
I am hoping we get to the point where the models are good enough that classes in schools are introduced on how to use them rather than just build them, as the number of people wanting to or willing to learn programming is a lot smaller than the number of people looking for ways to do things more efficiently.
I've been programming since middle school. That would be 30 years. Nothing really changed much. C++ is incrementally more convenient but fundamentally the same. Code editors are the same. Debuggers are the same. The shell is the same.
I am certain in 30 years everything will still be the same.
I like programming how I do now. I don't plan to stop.
People do lots of things manually that machines have been able to do for a long time.
They’re making fun of your typo, but you’re right. Pretty much every software job in 5 years will be an AI job. This rustles a lot of feathers, but ignoring the truth will only hurt your career.
I think the era of big tech paying fat stacks to a rather large number of technical staff will start to wane as well. Better hope you have top AI paper publications and deep experience with all parts of using LLMs/whatever future models there are, because if not, you'll be in for a world of pain if you got used to cushy tech work and think it's inevitable in a world where AI is advancing so fast.
I want to see it come out with a cure for a disease that is tough to cure first. The singularity itself is pointless unless it benefits humans, which mainly means better health and less suffering.
I'd say advancement in mathematics, computer science, and heck, even art is far from "pointless". Why does it feel like the goalposts get moved every time there is significant progress in AI?
In order for an AI to evaluate the effect of a small molecule on the brain, it would have to... simulate the operation of a human brain in a simulated environment. Similarly, to avoid Thalidomide-style disasters, it would have to simulate the conception, development and growth to adulthood of a human.
These things are... physically possible, but have WBE and uploads as a hard requirement. Those are going to affect a hell of a lot of things more than the drug industry!
Amusingly, machine-phase nanotechnology and blood nanobots would be easier to evaluate, since simple cell-level mechanical interventions (reading surface proteins on cancer cells and chopping them up, say) will have fewer interactions than a small molecule that diffuses into every cell in the body.
“Cure” is a tough bar, but I believe Paxlovid, the anti-viral used to reduce Covid severity, was identified using ML. There’s many companies like Recursion Pharma which are entirely focused on using ML for drug discovery, and from what I can tell seem to have promising results, but drug development is slow enough that nothing will come of it for a while.
Also, while not medicine focused, Google's GNoME project results announced a few weeks ago were pretty remarkable. They discovered more theoretical new materials using their ML approach than the rest of human history combined, and they are already confirming many of the results in laboratory settings. That has the potential to be a revolution in countless scientific and engineering applications.
AlphaFold was a nice surprise when it happened, too.
This is impossible. A more likely scenario is that most people lose their jobs and starve; it is not certain that we can get to a society with UBI.