Can anybody explain what is meant by “gradient” in this context? Is this a way of speeding up the search for a desirable improvement of a pixel (or cluster) in each pass?
They refer to score-based models, where the score is (loosely) the density of the dataset at the input.
Imagine you have a huuuuge landscape, with peaks and valleys. This landscape defines the density of the distribution. Now, imagine that you have a bunch of samples drawn from the distribution, this is the dataset.
Viewed this way, the dataset itself is a mixture of Dirac delta functions (or a mixture of Gaussians with variance approaching zero). We can increase the variance to smooth the landscape, which effectively builds an empirical estimate of the true landscape.
The height of this landscape is “the score”.
It turns out that we can actually compute an approximation of the gradient of the score.
The gradient of the score always points in the direction where the score (a.k.a. the density) increases, so you can follow these gradients to “walk” towards a high-density region.
This idea is actually very similar to using gradients to craft adversarial inputs that get falsely predicted.
I am being a bit handwavey here: we don’t just increase the variance, but this is a decent enough approximation of what’s happening and of what the gradient refers to in this case.
For all intents and purposes, you can think of the gradient as a vector pointing towards the direction that the density of the dataset increases.
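A minimal numpy sketch of the intuition above (the toy 1-D dataset, the smoothing width, and the step size are all my own choices for illustration): smooth the samples into a density estimate, then follow its gradient uphill towards a mode.

```python
import numpy as np

# Toy dataset: samples drawn from an unknown distribution (two clusters).
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2.0, 0.3, 50), rng.normal(3.0, 0.3, 50)])

def smoothed_density(x, sigma=0.5):
    """Mixture of Gaussians centred on the samples: widening sigma
    smooths the 'landscape' of Dirac deltas into a usable estimate."""
    w = np.exp(-(x - data) ** 2 / (2 * sigma ** 2))
    return np.mean(w) / (sigma * np.sqrt(2 * np.pi))

def density_gradient(x, sigma=0.5):
    """Analytic gradient of the smoothed density: it points towards
    regions where the density of the dataset increases."""
    w = np.exp(-(x - data) ** 2 / (2 * sigma ** 2))
    return np.mean(w * (data - x) / sigma ** 2) / (sigma * np.sqrt(2 * np.pi))

# 'Walk' uphill by repeatedly stepping along the gradient.
x = 2.0
for _ in range(200):
    x += 0.1 * density_gradient(x)
# x ends up near the nearby high-density cluster around 3.
```

Real score-based models work with the gradient of the log-density and learn it with a neural network instead of computing it from the raw samples, but the "walk uphill" mechanic is the same.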
In this context, it refers to the gradient of the error function (which is a function of the parameters).
In other words, for each tuple of parameters you can estimate the direction in which the reduction in the resulting error would be steepest.
Gradient descent is then a method whereby you iteratively move towards better and better parameterisations by following the direction of the (negative) gradient at each step. With a suitable step size this converges to a local minimum, and does so much more efficiently than if you sampled new parameters at random in the neighbourhood of the previous iterate.
And if your error function is convex, the minimum error is a global minimum.
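As a concrete sketch (the toy data and learning rate are my own, not from the thread): fit a line by gradient descent on a convex squared-error function, where the two parameters are the slope and intercept.

```python
import numpy as np

# Toy data lying exactly on y = 2x + 1.
xs = np.array([0.0, 1.0, 2.0, 3.0])
ys = np.array([1.0, 3.0, 5.0, 7.0])

def error(a, b):
    """Mean squared error of the line y = a*x + b (convex in a, b)."""
    return np.mean((a * xs + b - ys) ** 2)

def error_gradient(a, b):
    """Vector of partial derivatives of the error w.r.t. (a, b)."""
    residual = a * xs + b - ys
    return np.array([np.mean(2 * residual * xs), np.mean(2 * residual)])

a, b = 0.0, 0.0
lr = 0.1
for _ in range(2000):
    grad = error_gradient(a, b)
    a -= lr * grad[0]  # step against the gradient: steepest error decrease
    b -= lr * grad[1]
# Because this error is convex, the minimum reached is the global one:
# (a, b) converges to roughly (2, 1).
```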
> The key insight that makes DALL·E 2’s images possible — as well as those of its competitors Stable Diffusion and Imagen — comes from the world of physics.
Gradient descent uses the gradient, but is itself not the gradient.
The gradient itself is just the vector of partial derivatives with respect to each parameter.
Gradient descent is just following the gradient "downwards" towards a minimum.
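To make the distinction concrete (the example surface is my own invention): the gradient is just a vector of partial derivatives, which you can even approximate numerically without any descent loop at all.

```python
def f(x, y):
    # Example surface: a bowl with its minimum at (3, -1).
    return (x - 3) ** 2 + (y + 1) ** 2

def numerical_gradient(g, x, y, h=1e-6):
    """The gradient itself: the vector of partial derivatives,
    here estimated by central finite differences."""
    df_dx = (g(x + h, y) - g(x - h, y)) / (2 * h)
    df_dy = (g(x, y + h) - g(x, y - h)) / (2 * h)
    return (df_dx, df_dy)

grad = numerical_gradient(f, 0.0, 0.0)
# Analytically the gradient at (0, 0) is (2*(0-3), 2*(0+1)) = (-6, 2);
# stepping along -grad is what gradient descent then does with it.
```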
The article even describes it pretty well in non-mathematical terms.
> the gradient of the distribution (think of it as the slope of the high-dimensional surface).
I always hear devs say they never use calculus but I suspect it's because you can't use what you don't understand.
This is made even more prominent by the lack of any mention of the competitors.