Readit News
jasonjmcghee · 3 years ago
Maybe I'm missing something, but from the paper https://www.cs.toronto.edu/~hinton/FFA13.pdf, they use non-conv nets on CIFAR-10 for back prop, resulting in 63% accuracy. And FF achieves 59% accuracy (at best).

Those are relatively close figures, but good accuracy on CIFAR-10 is 99%+ and getting ~94% is trivial.

So, if an improper architecture for a problem is used and the accuracy is poor, how compelling is using another optimization approach and achieving similar accuracy?

It's a unique and interesting approach, but the article specifically claims it gets accuracy similar to backprop; if this is the experiment that claim is based on, it loses some credibility in my eyes.

habitue · 3 years ago
I think you have to set expectations based on how much of the ground you're ripping up. If you're adding some layers or some little tweak to an existing architecture, then yeah, going backwards on cifar-10 is a failure.

If, however, you are ripping out backpropagation like this paper is, then you get a big pass. This is not the new paradigm yet, but it's promising that it doesn't just completely fail.

jxcole · 3 years ago
This seems to be Hinton's MO though. A few years back he ripped out convolutions for capsules and while he claims it's better and some people might claim it "has potential", no one really uses it for much because, as with this, the actual numerical performance is worse on the tests people care about (e.g. imagenet accuracy).

https://en.wikipedia.org/wiki/Capsule_neural_network

LudwigNagasena · 3 years ago
There is no shortage of paradigms that rip out backprop and deliver worse results.
Nevermark · 3 years ago
The best context in which to view the paper is as part of an algorithm search.

Until the brain's algorithm is "solved", half steps are important. We need as many alternate half steps as we can find until one or more lead to a better understanding of the brain. (And potentially, to efficiency or results better than backprop's.)

whimsicalism · 3 years ago
You have to start with toy models before scaling up.
electrograv · 3 years ago
Achieving <80% on CIFAR10 in the year >2020 is an example of a failed toy model, not a successful toy model.

Almost any ML algorithm can be thrown at CIFAR10 and achieve ~60% accuracy; this ballpark of accuracy is really not sufficient to demonstrate viability, no matter how aesthetically interesting the approach might feel.

yobbo · 3 years ago
This is not a benchmark of some model on cifar-10, it's a benchmark of the training algorithm.

But model size and complexity also matter. An MLP with backprop gets about 63% on cifar-10, for various reasons. So achieving 59% accuracy means this algorithm is about 93% as good as backprop in this case.

However, 63% accuracy on cifar-10 can be achieved with two (maybe three) layers IIRC. The output is a 10-way classifier, which is handled in one layer. If the output requires multi-layer transformations, then gradients need to be back-propagated.

As long as the batch activation vectors are trained toward max separation (or orthogonality, or whatever) at each layer, one output layer can match them to labels. But this is unlikely to hold in problems where the output requires more complicated transformations.

cschmid · 3 years ago
The article links to an old draft of the paper (it seems that the results in 4.1 couldn't be replicated). The arxiv has a more recent one: https://arxiv.org/abs/2212.13345
goethes_kind · 3 years ago
I skimmed through the paper and am a bit confused. There's only one equation and I feel like he rushed to publish a shower thought without even bothering to flesh it out mathematically.

So how do you optimize a layer? Do you still use gradient descent? So you have a per-layer loss with a positive and a negative component, and then do gradient descent on that?

So then what is the label for each layer? Do you use the same label for each layer?

And what does he mean by the forward pass not being fully known? I don't get this application of the blackbox between layers. Why would you want to do that?

harveywi · 3 years ago
Those details have to be omitted from manuscripts in order to avoid having to cite the works of Jürgen Schmidhuber.
bitL · 3 years ago
Jürgen did it all before in the '80s; however, it was never translated to English, so Geoffrey could happily reinvent it.
manytree5 · 3 years ago
Just curious, why would one want to avoid citing Schmidhuber's work?
bjourne · 3 years ago
Probably because the idea is trivial in hindsight (it always is), so publishing fast is important. Afaict, the idea is to compute the gradients layer by layer and apply them immediately, without bothering to back-propagate from the outputs. So intermediate layers would learn what orientation vectors previous layers emit for positive samples, and would themselves emit orientation vectors. Imagine a layer learning which regions of a sphere's (3d) surface are good and outputting which regions of a circle's (2d) perimeter are good. This is why he mentions the need to normalize vectors: otherwise layers would cheat and just look at the vector's magnitude.

The idea is imo similar to how random word embeddings are generated.
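A minimal sketch of that normalization step (my own illustration, not code from the paper): rescale each activity vector to unit length between layers, so only its orientation carries information forward.

```python
import numpy as np

def layer_normalize(h, eps=1e-8):
    # Divide each activity vector by its length, so the next layer can
    # only read the orientation, not the magnitude.
    return h / (np.linalg.norm(h, axis=-1, keepdims=True) + eps)

h = np.array([[3.0, 4.0],
              [0.5, 0.5]])        # a batch of two activity vectors
out = layer_normalize(h)          # every row now has (near) unit length
```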

godelski · 3 years ago
> because the idea is trivial in hindsight (always is) so publishing fast is important.

Unfortunately I've also seen papers get rejected because their idea was "trivial", yet no one had thought of it before. Hinton has an edge here though.

naasking · 3 years ago
> Afaict, the idea is to compute the gradients layer by layer and applying them immediately without bothering to back-propagate from the outputs.

I'm not sure where you get that impression. Forward-Forward [1] seems to eschew gradients entirely:

    The Forward-Forward algorithm replaces the forward and backward passes of backpropagation by two forward passes, one with positive (i.e. real) data and the other with negative data which could be generated by the network itself
[1] https://www.cs.toronto.edu/~hinton/FFA13.pdf
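As a toy illustration of those two forward passes (a heavy simplification of the paper's per-layer "goodness" objective; the threshold, learning rate, and single linear layer are all my own choices): each pass nudges the layer's weights using only a local signal, with no backward sweep through the network.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4, 8))   # a single layer: 4 inputs -> 8 units
theta, lr = 2.0, 0.02                    # goodness threshold, learning rate

def goodness(x):
    h = x @ W                            # plain forward pass through the layer
    return h, float((h ** 2).sum())      # "goodness" = sum of squared activities

def local_step(x, positive):
    # Push goodness above theta for positive data and below it for negative
    # data, using only this layer's own local gradient -- there is no
    # backward pass through the rest of the network.
    global W
    h, g = goodness(x)
    sign = 1.0 if positive else -1.0
    p = 1.0 / (1.0 + np.exp(-sign * (g - theta)))  # P(correct) for this pass
    W += lr * np.outer(x, sign * (1.0 - p) * 2.0 * h)

x_pos = rng.normal(size=4)               # stand-in for a real example
x_neg = rng.normal(size=4)               # stand-in for a negative example
for _ in range(300):
    local_step(x_pos, True)              # first forward pass: positive data
    local_step(x_neg, False)             # second forward pass: negative data
```

After training, the layer assigns higher goodness to the positive example than to the negative one, without ever receiving a gradient from later layers.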

duvenaud · 3 years ago
Perhaps a better place to find algorithmic details is this related paper, also with Hinton as a co-author, which implements similar ideas in more standard networks:

Scaling Forward Gradient With Local Losses Mengye Ren, Simon Kornblith, Renjie Liao, Geoffrey Hinton https://arxiv.org/abs/2210.03310

and has code: https://github.com/google-research/google-research/tree/mast...

civilized · 3 years ago
> There's only one equation

Not accurate for the version another commenter linked: https://www.cs.toronto.edu/~hinton/FFA13.pdf

I see four equations.

maurits · 3 years ago
Deep dive tutorial for learning in a forward pass [1]

[1] https://amassivek.github.io/sigprop

cochne · 3 years ago
> There are many choices for a loss L (e.g. gradient, Hebbian) and optimizer (e.g. SGD, Momentum, ADAM). The output(), y, is detailed in step 4 below.

I don't get it, don't all of those optimizers work via backprop?

mkaic · 3 years ago
The optimizers take parameters and their gradients as inputs and apply update rules to them, but the gradients you supply can come from anywhere. Backprop is the most common way to assign gradients to parameters, but other methods can work too; as long as the optimizer gets both parameters and gradients, it doesn't care where they came from.
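A sketch of that separation of concerns (toy code, not any real library's API): a minimal momentum optimizer that consumes (parameter, gradient) pairs, fed here with gradients from finite differences rather than backprop.

```python
import numpy as np

class Momentum:
    """A tiny SGD-with-momentum optimizer. It consumes (parameter,
    gradient) pairs and never asks where the gradients came from."""
    def __init__(self, lr=0.1, beta=0.9):
        self.lr, self.beta, self.v = lr, beta, {}

    def step(self, params, grads):
        for k in params:
            self.v[k] = self.beta * self.v.get(k, 0.0) + grads[k]
            params[k] -= self.lr * self.v[k]

def fd_grad(f, params, eps=1e-5):
    # Central finite differences: gradients without any backward pass.
    grads = {}
    for k, p in params.items():
        g = np.zeros_like(p)
        for i in np.ndindex(p.shape):
            p[i] += eps; hi = f(params)
            p[i] -= 2 * eps; lo = f(params)
            p[i] += eps                   # restore the parameter
            g[i] = (hi - lo) / (2 * eps)
        grads[k] = g
    return grads

loss = lambda P: float(((P["w"] - 3.0) ** 2).sum())  # toy objective
params = {"w": np.zeros(2)}
opt = Momentum()
for _ in range(100):
    opt.step(params, fd_grad(loss, params))
# params["w"] is now close to the minimizer [3.0, 3.0]
```

The optimizer's update rule is untouched when the gradient source is swapped, which is exactly the point.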
sva_ · 3 years ago
I found this paragraph from the paper very interesting:

> 7 The relevance of FF to analog hardware

> An energy efficient way to multiply an activity vector by a weight matrix is to implement activities as voltages and weights as conductances. Their products, per unit time, are charges which add themselves. This seems a lot more sensible than driving transistors at high power to model the individual bits in the digital representation of a number and then performing O(n^2) single bit operations to multiply two n-bit numbers together. Unfortunately, it is difficult to implement the backpropagation procedure in an equally efficient way, so people have resorted to using A-to-D converters and digital computations for computing gradients (Kendall et al., 2020). The use of two forward passes instead of a forward and a backward pass should make these A-to-D converters unnecessary.

It was my impression that it is difficult to properly isolate an electronic system to use voltages in this way (hence computers sort of "cut" voltages into bits 0/1 using a step function).

Have these limitations been overcome or do they not matter as much, as neural networks can work with more fuzzy data?

Interesting to imagine such a processor though.
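The charge-summing trick in the quoted paragraph is just a matrix-vector product computed by Ohm's and Kirchhoff's laws; numerically (illustrative values only):

```python
import numpy as np

# Analog matrix-vector multiply: input activities are voltages V_i and
# weights are conductances G_ij. Ohm's law gives a current V_i * G_ij
# through each resistor, and Kirchhoff's current law sums the currents
# on each output wire for free -- no digital multiplier needed.
V = np.array([0.2, -0.5, 1.0])        # input voltages (volts)
G = np.array([[0.1, 0.3],
              [0.2, 0.0],
              [0.4, 0.1]])            # weights as conductances (siemens)

currents = V[:, None] * G             # one Ohm's-law current per resistor
I_out = currents.sum(axis=0)          # summed per output wire (Kirchhoff)

assert np.allclose(I_out, V @ G)      # identical to a digital matmul
```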

btown · 3 years ago
Photonic/optical neural networks are an interesting related area of research, using light interference to implement convolution and other operations without (I believe?) needing a bitwise representation of intensity.

https://www.nature.com/articles/s41467-020-20719-7

https://opg.optica.org/optica/fulltext.cfm?uri=optica-5-7-86...

CuriouslyC · 3 years ago
The small deltas resulting from electrical noise generally aren't an issue for probabilistic computations. Interestingly, people have quantized many large DL models down to 8/16 bits, and accuracy reduction is often on the order of 2-5%. Additionally, adding random noise to weights during training tends to act as a form of regularization.
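For intuition, a minimal sketch of symmetric per-tensor int8 quantization of the kind the comment alludes to (my own toy version, not any particular library's scheme): the worst-case per-weight error is about half a quantization step.

```python
import numpy as np

def quantize_int8(w):
    # Symmetric per-tensor quantization: scale so the largest magnitude
    # maps to 127, then round to the nearest integer.
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(1)
w = rng.normal(scale=0.05, size=1000)   # stand-in for a weight tensor
q, scale = quantize_int8(w)
w_hat = q.astype(np.float32) * scale    # dequantized weights

# Rounding error is at most half a step (scale / 2), i.e. roughly 0.4%
# of the largest weight magnitude.
rel_err = np.abs(w - w_hat).max() / np.abs(w).max()
```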
Animats · 3 years ago
There's been unhappiness in some quarters that back propagation doesn't seem to be something that biology does. That may be part of the motivation here.
singularity2001 · 3 years ago
The paragraph about Mortal Computation is worth repeating:

If these FF networks can be proven to scale, or made to scale similarly to BP networks, this would enable hardware several orders of magnitude more efficient, at the price of losing the ability to make exact copies of models onto other computers. (The loss of reproducibility sits well with the tradition of scientific papers anyway /s)

2.) How does this paper relate to Hinton's feedback alignment from 5 years ago? I remember it was feedback without derivatives. What are the key new ideas? To adjust the output of each individual layer to be big for positive cases and small for negative cases, without any feedback? Have these approaches been combined?

rsfern · 3 years ago
Discussion last month when the preprint was released: https://news.ycombinator.com/item?id=33823170
