Really impressed at the speed with which Hugging Face ported this to their transformers library -- Google released the model and source code Oct 21 [1] and it was available in the library just 8 days later [2].
This is nice work. Summarized: by untying the word piece embedding size from the hidden layer size and sharing parameters between layers, the number of parameters in the model is drastically reduced. They use this headroom to make the hidden layer sizes larger, and since layers share parameters, they can also use more layers without increasing model size.
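A back-of-the-envelope sketch of the embedding untying, with illustrative sizes I picked myself (V, H, and E below are assumptions, not the paper's exact figures):

```python
# Embedding parameter count: tied (BERT-style) vs. factorized (ALBERT-style)
V = 30000   # vocabulary size (assumed)
H = 4096    # hidden layer size (assumed, xxlarge-like)
E = 128     # untied word piece embedding size (assumed)

tied = V * H                # embed directly into the hidden dimension: O(V*H)
factorized = V * E + E * H  # small embedding plus a projection up to H: O(V*E + E*H)

print(tied, factorized)     # 122880000 4364288
```

With these numbers the embedding block shrinks by roughly a factor of 28, which is where much of the headroom for wider hidden layers comes from.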
Another interesting finding is that this speeds up training due to the smaller number of parameters.
However, my worry is that those of us who do not readily have access to TPUs will get even slower models when using CPUs for prediction, due to the additional and wider hidden layers. (Of course, one could use ALBERT base, which still has 12 layers and a hidden layer size of 768, at a small loss.) Did anyone measure the CPU prediction performance of ALBERT models compared to BERT?
Edit: I guess one solution would be to use a pretrained ALBERT and finetune to get the initial model and then use model distillation to get a smaller, faster model.
> I guess one solution would be to use a pretrained ALBERT and finetune to get the initial model and then use model distillation to get a smaller, faster model.
IME this is the way to go; take an ensemble of big, accurate models, then distill them down to the smallest model you can get away with. There are really good tricks in this paper
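For flavour, a minimal sketch of soft-target distillation in the Hinton et al. style (a generic illustration, not the specific tricks from the paper mentioned above; the function names and temperature value are my own):

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled softmax; higher T softens the teacher's distribution
    z = logits / T
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, T=2.0):
    # Cross-entropy of the student against the teacher's softened outputs,
    # scaled by T^2 (conventional, so gradient magnitudes stay comparable across T)
    p_teacher = softmax(teacher_logits, T)
    log_p_student = np.log(softmax(student_logits, T))
    return float(-(p_teacher * log_p_student).sum(axis=-1).mean()) * T * T
```

In practice this term is usually mixed with the ordinary hard-label loss; the student can be much smaller than the teacher (or the teacher ensemble) it is trained to imitate.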
> Summarized: by untying the word piece embedding size from the hidden layer size and sharing parameters between layers, the number of parameters in the model is drastically reduced. They use this headroom to make the hidden layer sizes larger, and since layers share parameters, they can also use more layers without increasing model size.
This summary is wrong.
It's true that they use parameter sharing to reduce the number of parameters.
But for any given "class" of model (base/large/xlarge) ALBERT has the same size hidden layer and the same number of layers.
If you try to compare by model size (measured by number of parameters), then ALBERT xxlarge (235M parameters) has fewer layers than BERT large (334M parameters) - 12 vs 24 - a larger hidden layer (4096 vs 1024), and a smaller embedding size (128 vs 1024).
> they can also use more layers without increasing model size.
Additionally, in section 4.9 they compare more layers and find: "The difference between 12-layer and 24-layer ALBERT-xxlarge configurations in terms of downstream accuracy is negligible, with the Avg score being the same. We conclude that, when sharing all cross-layer parameters (ALBERT-style), there is no need for models deeper than a 12-layer configuration."
The summary is only wrong if you want to compare the categories. I am not sure why the category names are important, except for identifying individual models.
Yeah, trading off embedding dim vs hidden layer dim is a common trick that effectively means trading size on disk (and in memory) for inference speed. It's obvious the model will need to be more powerful if the embeddings carry less information. Still cool they got the size down by 90%.
I've been using ALBERT (the HuggingFace port) for a few weeks. It works fine on GPUs, and it isn't noticeably slower for inference than other large models on CPUs.
It's worth noting that TPUs are available for free on Google Colab.
> It's worth noting that TPUs are available for free on Google Colab.
Yes, and you can also get a research grant, which gives you several TPUs for a month. But that does not mean that you can easily deploy TPUs in your own infrastructure, unless you use Google Cloud and suck up the costs (which may not be possible in academia).
I'm sorry, but why would you run a model like ALBERT on a CPU in the first place?
It's common knowledge not to bother running a deep model like ALBERT, BERT, or XLNet without at least a GPU. Training and inference with models of this size on CPUs is typically considered to be intractable.
Obviously, this paper is from Google, so they have free access to TPUs, which are arguably optimal for an arbitrary TensorFlow model in terms of performance, but if you don't have access to TPUs, the clear choice is to use GPUs instead. It won't be quite as fast, but it is sufficient, and GPUs are relatively cheap and available.
> It's common knowledge not to bother running a deep model like ALBERT, BERT, or XLNet without at least a GPU. Training and inference with models of this size on CPUs is typically considered to be intractable.
I am not sure if this is common knowledge, and if it is, it is wrong. With read-ahead and sorting batches by length, we can easily reach 100 sentences per second on modern CPUs (8 threads) with BERT base. We use BERT in a multi-task setup, typically annotating 5 layers at the same time. This is many times faster than old HPSG parsers (which typically had exponential time complexity) and as fast as other neural methods used in a pipelined setup.
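The length-sorting trick described above can be sketched roughly like this (the function name and batching details are mine, not the commenter's actual code):

```python
def make_batches(sentences, batch_size):
    # Group similar-length sentences together so each batch carries little
    # padding; compute on the padded-to-max batch then stays close to the
    # compute that is actually useful.
    ordered = sorted(sentences, key=len)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]
```

Combined with read-ahead (preparing the next batch while the current one runs), this keeps CPU threads busy on real tokens rather than padding.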
Is there a speed-up? In their paper in table 3, once you compare each ALBERT model with the smaller BERT model, you're looking at similar accuracies and longer training times.
> Edit: I guess one solution would be to use a pretrained ALBERT and finetune to get the initial model and then use model distillation to get a smaller, faster model.
On CPU, assuming inference is compute bound rather than bandwidth bound, the compute time will scale quadratically with the size of the FC layers (which account for almost all compute time in these networks). So if the hidden size is 768 in BERT-Base and 4096 in ALBERT xxlarge, inference will be approximately 28.4x slower... yikes.
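The arithmetic behind that estimate, under the same compute-bound assumption:

```python
# If fully connected layers dominate, cost grows ~quadratically with hidden size
bert_base_hidden = 768
albert_xxlarge_hidden = 4096

slowdown = (albert_xxlarge_hidden / bert_base_hidden) ** 2
print(round(slowdown, 1))  # 28.4
```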
> RACE test accuracy of 89.4. The latter appears to be a particularly strong improvement, a jump of +17.4% absolute points over BERT (Devlin et al., 2019), +7.6% over XLNet (Yang et al., 2019), +6.2% over RoBERTa (Liu et al., 2019), and 5.3% over DCMI+ (Zhang et al., 2019), an ensemble of multiple models specifically designed for reading comprehension tasks. Our single model achieves an accuracy of 86.5%, which is still 2.4% better than the state-of-the-art ensemble model.
This is an amazing result.
RACE has a performance ceiling of 94.5% set by inaccuracies in the data. Mechanical Turk performance is 73.3%.
It's a hard, reading comprehension test where you can't just extract spans from the text and match against answers. Section 2 of https://www.aclweb.org/anthology/D17-1082.pdf has a sample.
The progress academics continuously make in NLP takes us closer and closer to a local maximum that, when we reach it, will be the mark of the longest and coldest AI winter ever experienced by man, because of how far we are from a global maximum. Progress made by academically untrained researchers will, in the end, be what melts the snow, because of how "out-of-the-current-AI-box" they are in their theories about language and intelligence in general.
We hear that opinion all the time. As someone working in neural net-based computer vision I'd basically agree that the current approaches are tending towards a non-AGI local maximum, but I'd note that as compared to the 80s, this is an economically productive local maximum, which will likely help fuel new developments more efficiently than in previous waves. The next breakthrough may be made by someone academically untrained, but you can bet they'll have learned a whole lot of math, computer science, data science, and maybe neuroscience first.
Agreed. I'm nowhere near expert enough to opine on how far the state of the art is from some global maximum.
I'd contend that, for the most part, it doesn't matter. It's a bit like the whole ML vs AGI debate ("but ML is just curve fitting, it's not real intelligence"). The more pertinent question for human society is the impact it has - positive or negative. ML, with all its real or perceived weaknesses, is having a significant impact on the economy specifically and society generally.
It'll be little consolation for white collar workers who lose their jobs that the bot replacing them isn't "properly intelligent". Equally, few people using Siri to control their room temperature or satnav will care that the underlying "intelligence" isn't as clever as we like to think we are.
Maybe current approaches will prove to have a cliff-edge limitation like previous AI approaches did. That will be interesting from a scientific progress perspective. But even in its current state, contemporary ML has plenty of scope to bring about massive changes in society (and already is). We should be careful not to miss that in criticising current limitations.
> The next breakthrough may be made by someone academically untrained, but you can bet they'll have learned a whole lot of math, computer science, data science, and maybe neuroscience first.
I found this sentence particularly intriguing given that John Carmack recently announced that he was switching his main focus to AI.
> [..] this is an economically productive local maximum, which will likely help fuel new developments more efficiently than in previous waves.
This is exactly it. Every previous AI winter had in common that funding was cut back. However, Google and the other companies in this realm could approach a point where further investment wouldn't make sense to them. Until then there won't be a winter - maybe an autumn, as smaller players disappear.
The progress that has been made so far is already good enough to deliver tons of real business value. The tech is way ahead of application, as the tech has jumped forward so much in the last 5 years, and progress continues to be rapid.
I've direct evidence of that from my day job (building NLP chat bots for Intercom).
That business value will increase as NLP progresses, even if we're moving towards a local optimum.
Even if we do get stuck, real products and real revenue powered by NLP will help fund research on successive generations.
Of course there's tons of hype about AI. But there's also a big virtuous cycle which just wasn't present in the setup that created previous AI winters.
People thought that with LSTMs, and then we got transformers. People thought that with CNNs, and then we got ResNets.
Progress is always that way. It plateaus, then suddenly jumps and then plateaus again.
If your complaint is about the general move away from statistics and deep learning becoming the norm, then there are a pretty decent number of labs working on coming up with whatever the next deep learning is. There is probabilistic programming, and there are models with newer biologically inspired computation structures.
Even inside ML and deep learning, people are trying to come up with ways to better leverage unsupervised learning and building large common sense representations of the world.
There is certainly an oversupply of applied deep learning practitioners, but there are other approaches being explored in the AI/ML community too.
Like the local maximum that the GLUE benchmark was for a few weeks (months?) before SuperGLUE got released? This field is moving so fast, it's probably wiser to hold off on over-the-top ominous predictions for a little while.
It is a summer/winter choice only if you choose to think of it this way. Such a construction is superficial and drama-oriented, and reflects very little of what reality actually presents.
The current AI boom is due to end, or is ending already, but this only means that we are now equipped with really powerful approximators that previous generations of researchers would not even dream of, which leaves us with a really tantalizing question:
What is the right question to ask?
We have undoubtedly proved that machines are superior at fitting; now we need to make them curious.
Yeah, some people even started giving it a name [0]: Schmidhuber's (local) optimum. It is a bit tongue-in-cheek, but the idea is that as long as Schmidhuber says he did it before, we are probably in the same basin of attraction as we were in the nineties.
The open question is whether AGI is the same as Schmidhuber's optimum, or even lies within Schmidhuber's basin.
[0] Cambridge-style debate on the topic at NeurIPS 2019.
But why does it have to be the longest AI winter? I would agree that current NLP approaches do not get us any closer to NLU. They won't hurt either though. They may even help to motivate people. I started working on NLU because the current state of voice assistants is so frustrating...
> But why does it have to be the longest AI winter?
Because we have explored the existing paradigms (symbolic, subsymbolic, and hybrid AI).
Research has explored all of them, and no other paradigm exists.
Curve fitting (subsymbolic) is inherently limited.
Maybe we need to reinvent symbolic AI, but almost nobody is working on it, and I'm not aware of any promising research paths/ideas for symbolic AI.
Isn't physics in the same situation? The theory is useful for countless applications but is ultimately flawed, and researchers are aware of that, of course. But we don't hear it every time there is a new application of physics. Why should the ultimate high standard be invoked so often in discussion only for ML? Other fields like psychology or economics are probably in an even worse position with their theories vs the reality.
ML is an empirical science, or a craft if you want, with useful applications. It's not the ultimate theory of intelligence.
Personally, seeing that connectionists and symbolists are starting to talk with each other gives me hope that there won't be another AI winter before the AI singularity.
> progress made by academically untrained researchers will, in the end, be what melts the snow, because of how "out-of-the-current-AI-box" they are in their theories
This is an unnecessarily uncharitable view of academia.
"Outside the box thinking" is frequently just ignorance and Dunning-Kruger.
Current academic NLP would have been considered quite out-of-the-current-box 10 years ago. Most academic progress is driven by young graduate students who think similarly to you.
[1] https://github.com/google-research/google-research/commit/b5...
[2] https://github.com/huggingface/transformers/commit/c0c208833...
https://www.aclweb.org/anthology/D19-5632/
> It's common knowledge not to bother running a deep model like ALBERT, BERT, or XLNet without at least a GPU. Training and inference with models of this size on CPUs is typically considered to be intractable.
It's pretty common to run inference on CPUs. There are lots of operational and cost reasons why this makes sense in at least some cases.
> Is there a speed-up?
In section 4.8 they compare accuracy at the same amount of training time for the biggest of each model and show that ALBERT is substantially better.
> use model distillation to get a smaller, faster model.
Huggingface did just that: https://medium.com/huggingface/distilbert-8cf3380435b5
Making progress on NLP benchmarks? Must be a sign that we're moving even closer to an even longer and more bitter AI winter.