Really impressed at the speed with which Hugging Face ported this to their transformers library -- Google released the model and source code Oct 21 [1] and it was available in the library just 8 days later [2].
This is nice work. Summarized: by untying the word piece embedding size from the hidden layer size and sharing parameters between layers, the number of parameters in the model is drastically reduced. They use this headroom to make the hidden layer sizes larger, and since layers share parameters, they can also use more layers without increasing model size.
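A back-of-the-envelope sketch of the embedding untying, with illustrative sizes I picked myself (V, H, and E below are assumptions, not the paper's exact figures):

```python
# Embedding parameter count: tied (BERT-style) vs. factorized (ALBERT-style)
V = 30000   # vocabulary size (assumed)
H = 4096    # hidden layer size (assumed, xxlarge-like)
E = 128     # untied word piece embedding size (assumed)

tied = V * H                # embed directly into the hidden dimension: O(V*H)
factorized = V * E + E * H  # small embedding plus a projection up to H: O(V*E + E*H)

print(tied, factorized)     # 122880000 4364288
```

With these numbers the embedding block shrinks by roughly a factor of 28, which is where much of the headroom for wider hidden layers comes from.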
Another interesting finding is that this speeds up training due to the smaller number of parameters.
However, my worry is that those of us who do not readily have access to TPUs will get even slower models when using CPUs for prediction, due to the additional and wider hidden layers. (Of course, one could use ALBERT base, which still has 12 layers and a hidden layer size of 768, at a small loss.) Did anyone measure the CPU prediction performance of ALBERT models compared to BERT?
Edit: I guess one solution would be to use a pretrained ALBERT and finetune to get the initial model and then use model distillation to get a smaller, faster model.
> I guess one solution would be to use a pretrained ALBERT and finetune to get the initial model and then use model distillation to get a smaller, faster model.
IME this is the way to go; take an ensemble of big, accurate models, then distill them down to the smallest model you can get away with. There are really good tricks in this paper
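For flavour, a minimal sketch of soft-target distillation in the Hinton et al. style (a generic illustration, not the specific tricks from the paper mentioned above; the function names and temperature value are my own):

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled softmax; higher T softens the teacher's distribution
    z = logits / T
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, T=2.0):
    # Cross-entropy of the student against the teacher's softened outputs,
    # scaled by T^2 (conventional, so gradient magnitudes stay comparable across T)
    p_teacher = softmax(teacher_logits, T)
    log_p_student = np.log(softmax(student_logits, T))
    return float(-(p_teacher * log_p_student).sum(axis=-1).mean()) * T * T
```

In practice this term is usually mixed with the ordinary hard-label loss; the student can be much smaller than the teacher (or the teacher ensemble) it is trained to imitate.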
> Summarized: by untying the word piece embedding size from the hidden layer size and sharing parameters between layers, the number of parameters in the model is drastically reduced. They use this headroom to make the hidden layer sizes larger, and since layers share parameters, they can also use more layers without increasing model size.
This summary is wrong.
It's true that they use parameter sharing to reduce the number of parameters.
But for any given "class" of model (base/large/xlarge) ALBERT has the same size hidden layer and the same number of layers.
If you try to compare by model size (measured by number of parameters), then ALBERT xxlarge (235M parameters) has fewer layers than BERT large (334M parameters) - 12 vs 24 - a larger hidden layer (4096 vs 1024), and a smaller embedding size (128 vs 1024).
> they can also use more layers without increasing model size.
Additionally, in section 4.9 they compare more layers and find: "The difference between 12-layer and 24-layer ALBERT-xxlarge configurations in terms of downstream accuracy is negligible, with the Avg score being the same. We conclude that, when sharing all cross-layer parameters (ALBERT-style), there is no need for models deeper than a 12-layer configuration."
The summary is only wrong if you want to compare the categories. I am not sure why the category names are important, except for identifying individual models.
Yeah, trading off embedding dim vs hidden layer dim is a common trick that effectively means trading size on disk (and in memory) for inference speed. It's obvious the model will need to be more powerful if the embeddings carry less information. Still cool they got the size down by 90%.
I've been using ALBERT (the HuggingFace port) for a few weeks. It works fine on GPUs, and it isn't noticeably slower for inference than other large models on CPUs.
It's worth noting that TPUs are available for free on Google Colab.
> It's worth noting that TPUs are available for free on Google Colab.
Yes, and you can also get a research grant, which gives you several TPUs for a month. But that does not mean that you can easily deploy TPUs in your own infrastructure, unless you use Google Cloud and suck up the costs (which may not be possible in academia).
I'm sorry, but why would you run a model like ALBERT on a CPU in the first place?
It's common knowledge not to bother running a deep model like ALBERT, BERT, or XLNet without at least a GPU. Training and inference with models of this size on CPUs is typically considered to be intractable.
Obviously, this paper is from Google, so they have free access to TPUs, which are arguably optimal for an arbitrary TensorFlow model in terms of performance, but if you don't have access to TPUs, the clear choice is to use GPUs instead. It won't be quite as fast, but it is sufficient, and GPUs are relatively cheap and available.
> It's common knowledge not to bother running a deep model like ALBERT, BERT, or XLNet without at least a GPU. Training and inference with models of this size on CPUs is typically considered to be intractable.
I am not sure if this is common knowledge, and if it is, it is wrong. With read-ahead and sorting batches by length, we can easily reach 100 sentences per second on modern CPUs (8 threads) with BERT base. We use BERT in a multi-task setup, typically annotating 5 layers at the same time. This is many times faster than old HPSG parsers (which typically had exponential time complexity) and as fast as other neural methods used in a pipelined setup.
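The length-sorting trick described above can be sketched roughly like this (the function name and batching details are mine, not the commenter's actual code):

```python
def make_batches(sentences, batch_size):
    # Group similar-length sentences together so each batch carries little
    # padding; compute on the padded-to-max batch then stays close to the
    # compute that is actually useful.
    ordered = sorted(sentences, key=len)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]
```

Combined with read-ahead (preparing the next batch while the current one runs), this keeps CPU threads busy on real tokens rather than padding.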
Is there a speed-up? In their paper in table 3, once you compare each ALBERT model with the smaller BERT model, you're looking at similar accuracies and longer training times.
> Edit: I guess one solution would be to use a pretrained ALBERT and finetune to get the initial model and then use model distillation to get a smaller, faster model.
On CPU, assuming inference is compute bound rather than bandwidth bound, the compute time will scale quadratically with the size of the FC layers (which account for almost all compute time in these networks). So if the hidden size is 768 in BERT-Base and 4096 in ALBERT xxlarge, inference will be approximately 28.4x slower... yikes.
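The arithmetic behind that estimate, under the same compute-bound assumption:

```python
# If fully connected layers dominate, cost grows ~quadratically with hidden size
bert_base_hidden = 768
albert_xxlarge_hidden = 4096

slowdown = (albert_xxlarge_hidden / bert_base_hidden) ** 2
print(round(slowdown, 1))  # 28.4
```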
> RACE test accuracy of 89.4. The latter appears to be a particularly strong improvement, a jump of +17.4% absolute points over BERT (Devlin et al., 2019), +7.6% over XLNet (Yang et al., 2019), +6.2% over RoBERTa (Liu et al., 2019), and 5.3% over DCMI+ (Zhang et al., 2019), an ensemble of multiple models specifically designed for reading comprehension tasks. Our single model achieves an accuracy of 86.5%, which is still 2.4% better than the state-of-the-art ensemble model.
This is an amazing result.
RACE has a performance ceiling of 94.5% set by inaccuracies in the data. Mechanical Turk performance is 73.3%.
It's a hard, reading comprehension test where you can't just extract spans from the text and match against answers. Section 2 of https://www.aclweb.org/anthology/D17-1082.pdf has a sample.
The progress academics continuously make in NLP takes us closer and closer to a local maximum that, when we reach it, will be the mark of the longest and coldest AI winter ever experienced by man, because of how far we are from a global maximum. Progress made by academically untrained researchers will, in the end, be what melts the snow, because of how "out-of-the-current-AI-box" they are in their theories about language and intelligence in general.
We hear that opinion all the time. As someone working in neural net-based computer vision I'd basically agree that the current approaches are tending towards a non-AGI local maximum, but I'd note that as compared to the 80s, this is an economically productive local maximum, which will likely help fuel new developments more efficiently than in previous waves. The next breakthrough may be made by someone academically untrained, but you can bet they'll have learned a whole lot of math, computer science, data science, and maybe neuroscience first.
Agreed. I'm nowhere near expert enough to opine on how far the state of the art is from some global maximum.
I'd contend that, for the most part, it doesn't matter. It's a bit like the whole ML vs AGI debate ("but ML is just curve fitting, it's not real intelligence"). The more pertinent question for human society is the impact it has - positive or negative. ML, with all its real or perceived weaknesses, is having a significant impact on the economy specifically and society generally.
It'll be little consolation for white collar workers who lose their jobs that the bot replacing them isn't "properly intelligent". Equally, few people using Siri to control their room temperature or satnav will care that the underlying "intelligence" isn't as clever as we like to think we are.
Maybe current approaches will prove to have a cliff-edge limitation like previous AI approaches did. That will be interesting from a scientific progress perspective. But even in its current state, contemporary ML has plenty of scope to bring about massive changes in society (and already is). We should be careful not to miss that in criticising current limitations.
> The next breakthrough may be made by someone academically untrained, but you can bet they'll have learned a whole lot of math, computer science, data science, and maybe neuroscience first.
I found this sentence particularly intriguing given that John Carmack recently announced that he was switching his main focus to AI.
> [..] this is an economically productive local maximum, which will likely help fuel new developments more efficiently than in previous waves.
This is exactly it. Every previous AI winter had in common that funding was cut back. However, Google and the other companies in this realm could approach a point where further investment wouldn't make sense to them. Until then there won't be a winter - maybe an autumn, as smaller players disappear.
The progress that has been made so far is already good enough to deliver tons of real business value. The tech is way ahead of application, as the tech has jumped forward so much in the last 5 years, and progress continues to be rapid.
I've direct evidence of that from my day job (building NLP chat bots for Intercom).
That business value will increase as NLP progresses, even if we're moving towards a local optimum.
Even if we do get stuck, real products and real revenue powered by NLP will help fund research on successive generations.
Of course there's tons of hype about AI. But there's also a big virtuous cycle which just wasn't present in the setup that created previous AI winters.
People thought that with LSTMs, and then we got transformers. People thought that with CNNs, and then we got ResNets.
Progress is always that way. It plateaus, then suddenly jumps and then plateaus again.
If your complaint is about the general move away from statistics and deep learning becoming the norm, then there are a pretty decent number of labs working on coming up with whatever the next deep learning is. There is probabilistic programming, and there are models with newer biologically inspired computation structures.
Even inside ML and deep learning, people are trying to come up with ways to better leverage unsupervised learning and building large common sense representations of the world.
There is certainly an oversupply of applied deep learning practitioners, but there are other approaches being explored in the AI/ML community too.
Like the local maximum that the GLUE benchmark was for a few weeks (months?) before SuperGLUE got released? This field is moving so fast, it's probably wiser to hold off on over-the-top ominous predictions for a little while.
It is a summer/winter choice only if you choose to think of it this way. Such a construction is superficial and drama-oriented, and reflects very little of what reality actually presents.
The current AI boom is due to end, or is ending already, but this only means that we are now equipped with really powerful approximators that previous generations of researchers would not even dream of, which leaves us with a really tantalizing question:
What is the right question to ask?
We have undoubtedly proved that machines are superior at fitting; now we need to make them curious.
Yeah, some people even started giving it a name [0]: Schmidhuber's (local) optimum. It is a bit tongue-in-cheek, but the idea is that as long as Schmidhuber says he did it before, we are probably in the same basin of attraction as we were in the nineties.
The open question is whether AGI is the same as Schmidhuber's optimum, or even lies within Schmidhuber's basin.
[0] Cambridge-style debate on the topic at NeurIPS 2019.
But why does it have to be the longest AI winter? I would agree that current NLP approaches do not get us any closer to NLU. They won't hurt either though. They may even help to motivate people. I started working on NLU because the current state of voice assistants is so frustrating...
> But why does it have to be the longest AI winter?
Because we have explored the existing paradigms (symbolic, subsymbolic, and hybrid AI).
Research has explored all of them, and no other paradigm exists.
Curve fitting (subsymbolic) is inherently limited.
Maybe we need to reinvent symbolic AI, but almost nobody is working on it, and I'm not aware of any promising research paths/ideas for symbolic AI.
Isn't physics in the same situation? The theory is useful for countless applications but is ultimately flawed, and researchers are aware of that, of course. But we don't hear it every time there is a new application of physics. Why should the ultimate high standard be invoked so often in discussion only for ML? Other fields like psychology or economics are probably in an even worse position with their theories vs the reality.
ML is an empirical science, or a craft if you want, with useful applications. It's not the ultimate theory of intelligence.
Personally, seeing that connectionists and symbolists are starting to talk with each other gives me hope that there won't be another AI winter before the AI singularity.
> progress made by academically untrained researchers will, in the end, be what melts the snow, because of how "out-of-the-current-AI-box" they are in their theories
This is an unnecessarily uncharitable view of academia.
"Outside the box thinking" is frequently just ignorance and Dunning-Kruger.
Current academic NLP would have been considered quite out-of-the-current-box 10 years ago. Most academic progress is driven by young graduate students who think similarly to you.
[1] https://github.com/google-research/google-research/commit/b5...
[2] https://github.com/huggingface/transformers/commit/c0c208833...
https://www.aclweb.org/anthology/D19-5632/
> It's common knowledge not to bother running a deep model like ALBERT, BERT, or XLNet without at least a GPU. Training and inference with models of this size on CPUs is typically considered to be intractable.
It's pretty common to run inference on CPUs. There are lots of operational and cost reasons why this makes sense in at least some cases.
> Is there a speed-up?
In section 4.8 they compare accuracy at the same amount of training time for the biggest of each model and show that ALBERT is substantially better.
> use model distillation to get a smaller, faster model.
Huggingface did just that: https://medium.com/huggingface/distilbert-8cf3380435b5
Making progress on NLP benchmarks? Must be a sign that we're moving even closer to an even longer and more bitter AI winter.