rememberlenny · 8 years ago
[Text from post]

OK, Deep Learning has outlived its usefulness as a buzz-phrase. Deep Learning est mort. Vive Differentiable Programming!

Yeah, Differentiable Programming is little more than a rebranding of the modern collection of Deep Learning techniques, the same way Deep Learning was a rebranding of the modern incarnations of neural nets with more than two layers.

But the important point is that people are now building a new kind of software by assembling networks of parameterized functional blocks and by training them from examples using some form of gradient-based optimization.

An increasingly large number of people are defining networks procedurally in a data-dependent way (with loops and conditionals), allowing them to change dynamically as a function of the input data fed to them. It's really very much like a regular program, except it's parameterized, automatically differentiated, and trainable/optimizable. Dynamic networks have become increasingly popular (particularly for NLP), thanks to deep learning frameworks that can handle them such as PyTorch and Chainer (note: our old deep learning framework Lush could handle a particular kind of dynamic nets called Graph Transformer Networks, back in 1994. It was needed for text recognition).
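The "parameterized program with control flow" idea can be illustrated without any framework. A toy sketch (my own construction, not from the post; real frameworks such as PyTorch compute the gradient by automatic differentiation, while this uses finite differences for brevity):

```python
# A tiny "differentiable program": the control flow depends on the input,
# yet the output is still differentiable with respect to the parameter w.
def program(x, w):
    y = x
    # Data-dependent loop: the number of iterations varies with the input,
    # like a dynamic network unrolled differently for each example.
    while y < 10.0:
        y = y * w + 1.0
    return y

def grad_w(x, w, eps=1e-6):
    # Finite differences stand in for the automatic differentiation
    # that frameworks like PyTorch or Chainer would perform.
    return (program(x, w + eps) - program(x, w - eps)) / (2 * eps)

w = 1.5
print(program(2.0, w))   # forward pass: 11.5
print(grad_w(2.0, w))    # gradient an optimizer would use to update w
```

For x = 2.0 and w = 1.5 the loop happens to run three times, so the program computes 2w^3 + w^2 + w + 1 and its gradient is 6w^2 + 2w + 1 = 17.5.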

People are now actively working on compilers for imperative differentiable programming languages. This is a very exciting avenue for the development of learning-based AI.

Important note: this won't be sufficient to take us to "true" AI. Other concepts will be needed for that, such as what I used to call predictive learning and now decided to call Imputative Learning. More on this later....

platz · 8 years ago
> It's really very much like a regular progam, except it's parameterized, automatically differentiated, and trainable/optimizable.

> People are now actively working on compilers for imperative differentiable programming languages.

Do you have an example of either of these things?

joshuamorton · 8 years ago
I believe https://github.com/google/tangent counts, though I'm not 100% sure.
zengid · 8 years ago
From the thread, LeCun mentions:

"Look at papers by Jeff Siskind and Barak Barak A. Pearlmutter particularly their work on VLAD, Stalingrad and VLD."

chombier · 8 years ago
Erik Meijer gave a talk about this at KotlinConf last year https://www.youtube.com/watch?v=NKeHrApPWlo
bra-ket · 8 years ago
it's really a pity that after 75 years of AI research the best thing we've got is still based on gradient descent, a brute-force trial-and-error method.
log_base_login · 8 years ago
As much of a pity as that, 70 years later, we are still using transistor-based computers originally derived from three wires stuck in a piece of rock[1] by some very innovative fellows at Bell Labs[2]?

[1]http://images.computerhistory.org/revonline/images/500004836... [2]http://www.computerhistory.org/revolution/digital-logic/12/2...

nostrademons · 8 years ago
As much of a pity as that after 3 billion years, all life is still based upon selection bias and random genetic mutations, a brute force trial and error?
mturmon · 8 years ago
Following a gradient is smarter than trial and error. You can make an argument that, in high-dimensional parameter spaces, it's hard to do better (because the cost of a gradient step is linear in the number of dimensions).

Ordinary Metropolis-Hastings, for example, is closer to trial and error.
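A toy comparison of the two (illustrative only; the objective, step sizes, and random-search scheme here are arbitrary choices of mine):

```python
import random

d = 50                                   # dimensionality
f = lambda x: sum(v * v for v in x)      # a simple convex objective

# Gradient descent: one O(d) gradient evaluation per step.
x = [1.0] * d
for _ in range(100):
    x = [v - 0.1 * (2 * v) for v in x]   # gradient of v^2 is 2v

# Naive trial and error: propose a random move, keep it only if it helps.
random.seed(0)
y = [1.0] * d
for _ in range(100):
    cand = [v + random.gauss(0, 0.1) for v in y]
    if f(cand) < f(y):
        y = cand

print(f(x), f(y))  # gradient descent ends far closer to the optimum
```

With the same budget of 100 iterations, the gradient follower is essentially at the minimum while the random searcher is still wandering; the gap widens as the dimension grows.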

jclos · 8 years ago
Its simplicity is its power. More complex methods (e.g. second order methods) tend to get attracted to saddle points and produce bad results. Some metaheuristics like evolution strategies are also used in some specific cases (reinforcement learning). Minibatch gradient descent + reasonable minibatch size + some form of momentum is the best we have.
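A minimal sketch of that recipe on a toy least-squares problem (the hyperparameters and synthetic data are illustrative, not a recommendation):

```python
import random

# Minibatch SGD with momentum: fit w so that w * x ≈ y
# over a synthetic noise-free dataset with true slope 3.0.
random.seed(0)
data = [(x, 3.0 * x) for x in [random.uniform(-1, 1) for _ in range(256)]]

w, velocity = 0.0, 0.0
lr, beta, batch_size = 0.1, 0.9, 32      # a "reasonable minibatch size"

for epoch in range(20):
    random.shuffle(data)
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]
        # Gradient of the mean squared error over the minibatch.
        grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
        velocity = beta * velocity - lr * grad   # momentum accumulates
        w += velocity

print(w)  # converges close to 3.0
```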
electricslpnsld · 8 years ago
If you can’t reasonably get at or use second order information, how else are you going to optimize arbitrary objectives?

Well, come to think of it, why don’t DL approaches use BFGS instead of gradient descent?

pacavaca · 8 years ago
Assuming that AI tries to mimic the way humans learn and evolve, those methods haven't changed for hundreds of thousands of years and brute-force trial and error is just one of them. It's kind of fundamental...
maxtollenar · 8 years ago
Mostly because the loss-function landscape is not well understood; we need to do some kind of descent

Deleted Comment

BucketSort · 8 years ago
You may have seen DeepMind's results last year where it trained 3D models to move through space in different ways, entitled "Emergence of Locomotion Behaviours in Rich Environments" ( https://arxiv.org/pdf/1707.02286.pdf , https://www.youtube.com/watch?v=hx_bgoTF7bs&feature=youtu.be). If you have a look in the paper, the method they use, "Proximal Policy Optimization", is a great example of differentiable programming that does not include a neural network. I actually realized this last month when I was preparing a talk on deep learning, because I thought it used deep neural nets in its application, but found that it didn't.
guillefix · 8 years ago
Scanning through the paper, I see this "We structure our policy into two subnetworks, one of which receives only proprioceptive information, and the other which receives only exteroceptive information. As explained in the previous paragraph with proprioceptive information we refer to information that is independent of any task and local to the body while exteroceptive information includes a representation of the terrain ahead. We compared this architecture to a simple fully connected neural network and found that it greatly increased learning speed."

It seems to me they do use neural nets. Proximal Policy Optimization is just a more novel way of optimizing them.

cs702 · 8 years ago
I wish we could come up with a catchier name, but I LOVE the idea of calling this programming, because that is precisely what we do when we compose deep neural nets.

For example, here's how you compose a neural net consisting of two "dense" layers (linear transformations), using Keras's functional API, and then apply these two layers to some tensor x to obtain a tensor y:

  from keras.layers import Dense

  f = Dense(n)
  g = Dense(n)

  y = f(g(x))
This looks, smells, and tastes like programming (in this case with a strong functional flavor), doesn't it?

Imagine how interesting things will get once we have nice facilities for composing large, complex applications made up of lots of components and subcomponents that are differentiable, both independently and end-to-end.

Andrej Karpathy has a great post about this: https://medium.com/@karpathy/software-2-0-a64152b37c35

sitkack · 8 years ago
NNs are _just_ transfer functions. Look-up tables. Or really dense maps. So f(g(x)) makes total sense. But I don't think these are the interesting combinations. I think giving one NN the training experience of another, plus feedback on "correct inference", will be when one NN trains its replacement.
cs702 · 8 years ago
Yes, of course. f(g(x)) was the simplest possible example I could come up with to illustrate the point :-)
bmc7505 · 8 years ago

  I hate the name, but LOVE the idea of calling this programming...
What would you call it instead?

letlambda · 8 years ago
∇programming
cs702 · 8 years ago
I changed "hate the name" to "wish we could come up with a catchier name," which better reflects what I meant to write.

Deleted Comment

seanmcdirmid · 8 years ago
LeCun specifically calls out imperative programming, not just typical data flow methods.
cs702 · 8 years ago
You're right. I softened the reference to functional programming.
BucketSort · 8 years ago
I believe this paper by Marcus ( https://arxiv.org/ftp/arxiv/papers/1801/1801.00631.pdf ) earlier this week inspired this.

Edit: I don't mean Marcus inspired the term differentiable programming; he inspired LeCun to emphasize the wider scope of deep learning after Marcus attacked it. In fact, LeCun liked a post on twitter rebutting Marcus' paper that also talks about differentiable programming: https://twitter.com/tdietterich/status/948811917593780225

sytelus · 8 years ago
I don't think so. LeCun seems to oppose Marcus's views...

Related: https://twitter.com/ylecun/status/921409820825178114?lang=en

I think LeCun doesn't want a repeat of the AI winter caused by exponentially rising hype and expectations around Deep Learning. There have been a few examples, like Selena, where he seems to think people are trying to ride the deep learning wave to generate false buzz (and cash!) for themselves.

BucketSort · 8 years ago
He does oppose Marcus' views, but he also knows neural nets are only one approach to differentiable programming. The term is confusing though. It should read like "linear programming" does, but people are not interpreting it that way.
albertzeyer · 8 years ago
Discussion about that paper here: https://news.ycombinator.com/item?id=16083469
Ormus · 8 years ago
I can guarantee you that Marcus has in no way ever inspired LeCun.
seanmcdirmid · 8 years ago
Or the other way around... He has been throwing around the term for awhile now.
BucketSort · 8 years ago
It was most certainly the other way around. Marcus does not focus on differentiable programming in his paper. See my edit.
saycheese · 8 years ago
Past HN coverage of Differentiable Programming:

https://news.ycombinator.com/item?id=10828386

ehsankia · 8 years ago
1. Differentiable Programming is horrible branding. It's hard to say, not catchy, and not as easily decipherable.

2. Isn't the evolution of Deep Networks more advanced setups such as GANs, RNNs, and so on?

kinkrtyavimoodh · 8 years ago
> Differentiable Programming is horrible branding. It's hard to say, not catchy, and not as easily decipherable

Tell that to the people who deliberately popularized the term Dynamic Programming for something that was neither dynamic nor programming.

____

(From Wiki)

Bellman explains the reasoning behind the term dynamic programming in his autobiography, Eye of the Hurricane: An Autobiography (1984, page 159). He explains:

"I spent the Fall quarter (of 1950) at RAND. My first task was to find a name for multistage decision processes. An interesting question is, Where did the name, dynamic programming, come from? The 1950s were not good years for mathematical research. We had a very interesting gentleman in Washington named Wilson. He was Secretary of Defense, and he actually had a pathological fear and hatred of the word research. I’m not using the term lightly; I’m using it precisely. His face would suffuse, he would turn red, and he would get violent if people used the term research in his presence. You can imagine how he felt, then, about the term mathematical. The RAND Corporation was employed by the Air Force, and the Air Force had Wilson as its boss, essentially. Hence, I felt I had to do something to shield Wilson and the Air Force from the fact that I was really doing mathematics inside the RAND Corporation. What title, what name, could I choose? In the first place I was interested in planning, in decision making, in thinking. But planning, is not a good word for various reasons. I decided therefore to use the word “programming”. I wanted to get across the idea that this was dynamic, this was multistage, this was time-varying. I thought, let's kill two birds with one stone. Let's take a word that has an absolutely precise meaning, namely dynamic, in the classical physical sense. It also has a very interesting property as an adjective, and that it's impossible to use the word dynamic in a pejorative sense. Try thinking of some combination that will possibly give it a pejorative meaning. It's impossible. Thus, I thought dynamic programming was a good name. It was something not even a Congressman could object to. So I used it as an umbrella for my activities."

YeGoblynQueenne · 8 years ago
>> Let's take a word that has an absolutely precise meaning, namely dynamic, in the classical physical sense. It also has a very interesting property as an adjective, and that it's impossible to use the word dynamic in a pejorative sense. Try thinking of some combination that will possibly give it a pejorative meaning. It's impossible.

Now I have to try:

  "Dynamic, multimodal failure" (fail).

  "Dynamic instigation of pain for information retrieval" (torture).

  "Dynamic evisceration of underage humans" (slaughtering of children).

  "Dynamic destruction of useful resources" (environment destruction).

  "An algorithm for calculating dynamic stool-rotor collision physics" (shit hits the fan).
Not terribly good I guess but I think not that bad either.

hyperpallium · 8 years ago
Deep Learning -> "Dynamic Learning"? The post even describes it as "dynamic".
bennofs · 8 years ago
I like it, at least from the little that I read about it. The name describes the core of what it is: differentiate programs, in order to figure out how changes to the program affect the output, and use that for optimization purposes. Do we really need to invent obscure, new names for everything just so that it sounds catchy?
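For the curious, the core idea fits in a few lines with forward-mode automatic differentiation via dual numbers (an illustrative sketch; production deep learning frameworks mostly use reverse-mode autodiff instead):

```python
class Dual:
    """Forward-mode automatic differentiation via dual numbers:
    carry a value together with its derivative through the program."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val + o.val, self.dot + o.dot)
    __radd__ = __add__
    def __mul__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.val * o.val, self.dot * o.val + self.val * o.dot)
    __rmul__ = __mul__

def f(x):
    # An ordinary-looking program: a loop, locals, arithmetic.
    y = x
    for _ in range(3):
        y = y * y + 1
    return y

seed = Dual(2.0, 1.0)      # derivative of x with respect to itself is 1
out = f(seed)
print(out.val, out.dot)    # f(2) and f'(2), computed in one pass
```

Running an unmodified-looking program on `Dual` inputs yields both its value and its exact derivative, which is precisely what "differentiating programs" means here.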
tomdre · 8 years ago
Couldn't agree more. A technique should be judged by its usefulness. Not by its catchiness.
jostmey · 8 years ago
> 2. Isn't the evolution of Deep Networks more advanced setups such as GANs, RNNs, and so on?

That's not what Yann LeCun is getting at, I think. Most neural network models are sort of like a non-programmable computer. They are built around the assumption that the data is a fixed-size, fixed-dimensional array. But in computer programming there is an ocean of data structures. You have sorted sets, linked lists, dictionaries, and everything else. Imagine that we knew that a data-set was arranged as a sort of "fuzzy" dictionary and we wanted the computer to do the rest. All we need to do is load up the right deep neural network (I mean differentiable programming something or other) and voilà.

Something like where the values in a piece of data dictate the layers that get stacked together, and how those layers connect to the layers for the next value in that piece of data.

currymj · 8 years ago
it seems like it's made by analogy with "probabilistic programming", i.e. defining complex probability distributions by writing familiar-looking imperative code (with loops and whatnot).

I think the idea is that thinking in terms of passing data through layers in a graph is cumbersome sometimes, and that expressing it as a "regular" program that just happens to come with gradients could be more comfortable.

I'd argue that GANs in particular are a natural fit for this style. The training procedure doesn't really fit exactly into the standard "minimize loss function of many layers using backprop".

elchief · 8 years ago
Remember back in 2017 when Deep Learning wasn't legacy? Those were good times
fjsolwmv · 8 years ago
"Google uses Bayes nets like Microsoft uses 'if' statements" -+ Joel Spolsky, 15 years ago