The thing that really made me understand gradients and derivatives was visualizing them as arrow maps. I even made a small tool: https://github.com/GistNoesis/VisualizeGradient . This visualization helps with understanding optimization algorithms.
Jacobians can be understood as a collection of gradients, one for each coordinate of the output considered independently.
My mental picture for the Hessian is to associate each point with the shape of the parabola (or saddle) that best matches the function locally. It's easy to visualize once you realize it's the shape of what you see when you zoom in on the point. (Technically this mental picture is the gradient's tangent plane and the Hessian's curvature taken together, i.e. the second-order multivariate Taylor expansion, but I find it hard to mentally separate the slope from the curvature.)
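To make those two pictures concrete, here is a small sketch of my own (JAX is my choice, not something from the comment): the rows of the Jacobian are the per-coordinate gradients, and for a scalar function the gradient plus Hessian give exactly the tangent-plane-plus-parabola model you see when you zoom in.

    import jax
    import jax.numpy as jnp

    # A made-up vector-valued function f : R^2 -> R^2
    def f(x):
        return jnp.array([jnp.sin(x[0]) * x[1], x[0] * x[1] ** 2])

    x0 = jnp.array([0.4, -0.7])

    # Each row of the Jacobian is the gradient of one output coordinate
    J = jax.jacobian(f)(x0)                               # shape (2, 2)
    rows = jnp.stack([jax.grad(lambda x: f(x)[0])(x0),
                      jax.grad(lambda x: f(x)[1])(x0)])
    print(jnp.allclose(J, rows))                          # True

    # For a scalar function, gradient + Hessian give the local quadratic model
    g = lambda x: jnp.sin(x[0]) * jnp.cos(x[1]) + 0.5 * x[0] * x[1]
    grad, H = jax.grad(g)(x0), jax.hessian(g)(x0)
    h = jnp.array([1e-2, -2e-2])                          # a small "zoomed-in" step
    quadratic = g(x0) + grad @ h + 0.5 * h @ H @ h        # tangent plane + parabola
    print(g(x0 + h), quadratic)                           # nearly identical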
The "eigenchris" Youtube channel teaches tensor algebra, differential calculus, general relativity, and some other topics.
When I started thinking of vector calculus in terms of multiplying both the vector components and the corresponding basis vectors, there was a nice unification of ordinary vector operations, Jacobians, and the metric tensor.
Sometimes it's useful to see the elemental transformations, but often striving for a higher-level view makes understanding easier.
This is particularly true when you try to apply it to physics, which is often where you are introduced to vector calculus for the first time: things like Maxwell's equations, or fluid mechanics.
In physics there are often additional constraints, like conserved quantities. For example, you calculate a scalar, the total energy of the system, call it the Hamiltonian, auto-differentiate it with respect to the degrees of freedom, and you get very complex vector equations (see the sketch below).
But taking a step back, you realise it's just a field of "objects" of a certain type, and you are just locally comparing these "objects" to their neighbors. The whole mess of vector calculus is reduced to the directions in which you rotate, stretch, and project these objects. (As an example, you can imagine a field of balls in various orientations, whose energy is defined by the difference in orientation between neighboring balls.)
Once you wrap your head around the fact that the whole point of vector calculus (and why it was invented) is to describe a state as an integral (a sum of successive linear infinitesimal transformations), this makes more sense. These "objects" are constrained and continuous, and they evolve by moving in infinitesimal steps along the tangent space (and then reprojecting to satisfy the constraints, or exponentiating in a Clifford algebra to stay in the space).
All the infinitesimal transformations are "linear", and linear transformations rotate, stretch, mirror, and project to various degrees.
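Two tiny sketches of my own to illustrate the points above (Python with JAX/SciPy, not the commenter's code). First, the Hamiltonian point: write the scalar energy once, auto-differentiate it, and you get the vector equations of motion.

    import jax
    import jax.numpy as jnp

    def hamiltonian(q, p, m=1.0, k=1.0):
        # total energy of a harmonic oscillator: kinetic + potential
        return jnp.sum(p ** 2) / (2 * m) + 0.5 * k * jnp.sum(q ** 2)

    dH_dq = jax.grad(hamiltonian, argnums=0)
    dH_dp = jax.grad(hamiltonian, argnums=1)

    def step(q, p, dt=1e-3):
        # Hamilton's equations: dq/dt = dH/dp, dp/dt = -dH/dq
        return q + dt * dH_dp(q, p), p - dt * dH_dq(q, p)

    q, p = jnp.array([1.0, 0.0]), jnp.array([0.0, 0.5])
    for _ in range(1000):
        q, p = step(q, p)
    print(hamiltonian(q, p))   # stays close to the initial energy (0.625)

Second, the "sum of infinitesimal linear transformations" point: composing many small steps (I + dt·A) along a skew-symmetric generator A is the same as exponentiating it, and the result stays exactly on the rotations.

    import numpy as np
    from scipy.linalg import expm

    A = np.array([[0.0, -1.0],
                  [1.0,  0.0]])           # infinitesimal rotation generator
    theta, n = 0.8, 100_000
    R_steps = np.linalg.matrix_power(np.eye(2) + (theta / n) * A, n)
    R_exact = expm(theta * A)             # exact rotation by theta

    print(np.allclose(R_steps, R_exact, atol=1e-4))      # True
    print(np.allclose(R_exact.T @ R_exact, np.eye(2)))   # stays orthogonal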
I'm also a visual learner and my class on dynamical systems really put a lot into perspective, particularly the parts about classifying stable/unstable/saddle points by finding eigenvectors/values of Jacobians.
A lot of optimization theory becomes intuitive once you work through a few of those and compare your understanding to arrow maps like you suggest.
There's something that's always been deeply confusing to me about comparing the Jacobian and the Hessian because their nature is very different.
The Hessian shouldn't have been called a matrix.
The Jacobian describes all the first-order derivatives of a vector-valued function (of multiple inputs), while the Hessian is all the second-order derivatives of a scalar-valued function (of multiple inputs). Why doesn't the number of dimensions of the array increase by one as the order of differentiation increases? It does! The object that fully describes the second-order derivatives of a vector-valued function of multiple inputs is actually a 3-dimensional tensor: one dimension for the original vector-valued output, and one for each order of differentiation. Mathematicians are afraid of tensors of more than 2 dimensions for some reason and want everything to be a matrix.
In other words, given a function R^n -> R^m:
Order 0: Output value: 1d array of shape (m) (a vector)
Order 1: First order derivative: 2d array of shape (m, n) (Jacobian matrix)
Order 2: Second order derivative: 3d array of shape (m, n, n) (array of Hessian matrices)
It all makes sense!
Talking about "Jacobian and Hessian" matrices as if they are both naturally matrices is highly misleading.
At least in my undergrad multivariate real analysis class, I remember the professor arranging things to strongly suggest that the Hessian should be thought of as ∇⊗∇, and that this was the second term in a higher-dimensional Taylor series, so that the third-derivative term would be ∇⊗∇⊗∇, etc. Things like tensor products or even quotient spaces weren't assumed knowledge, so they weren't explicitly covered, but I remember feeling the connection was obvious enough at the time. Then an introductory differential geometry class got into (n,m) tensors. So I'm quite sure mathematicians are fine dealing with tensors. My experience was that undergrad engineering math tries to avoid even covectors, though, so it will stay well clear of a coherent picture of multivariable calculus. E.g. my engineering professors would talk of the Dirac δ as an infinite spike/spooky doesn't-really-exist thing that makes integrals work or whatever. My analysis professor just said δ(f) = f(0) is a linear functional.
The Hessian is defined as the matrix of second-order partial derivatives of a scalar-valued function. Therefore it will always give you a matrix.
What you're doing with the shape (m, n, n) isn't actually guaranteed at all, since the output of an arbitrary function can be a tensor of any shape, and you can apply the Hessian to each scalar value in that output to get another arbitrary tensor with two more dimensions.
It's the Jacobian that is weird, since it is just a vector of gradients, and therefore its partial derivative must also be a vector of Hessians.
This doesn't really help with programming, but in physics it's traditional to use up- and down-stairs indices, which makes the distinction you want very clear.
If the input x has components xⁿ, and the output f(x) has components fᵐ, then the Jacobian is ∂ₙfᵐ, which has one index upstairs and one downstairs. The derivative has a downstairs index... because x is in the denominator of d/dx, roughly? If x had units of seconds, then d/dx has units of per second.
Whereas if g(x) is a number, the gradient is ∂ₙg, and the Hessian is ∂ₙ₁∂ₙ₂g with two downstairs indices. You might call this a (0,2) tensor, while the Jacobian is (1,1). Most of the matrices in ordinary linear algebra are (1,1) tensors.
Upstairs/downstairs is kinda cute tho xD
I’ve been introduced to the Hessian in the context of finding the extrema of functions of multiple variables, where it does not make sense to consider arbitrary output dimensions (what is the minimum of a vector-valued function?). In this context, it is also important to determine the definiteness of the underlying quadratic form, which is easier if you treat it as a matrix so you can apply Sylvester's criterion.
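For example, a minimal NumPy sketch of Sylvester's criterion (my own illustration, assuming a symmetric Hessian): the quadratic form is positive definite iff all leading principal minors are positive, which is what tells you the critical point is a minimum.

    import numpy as np

    def is_positive_definite(H):
        # Sylvester's criterion: all leading principal minors must be > 0
        H = np.asarray(H, dtype=float)
        return all(np.linalg.det(H[:k, :k]) > 0 for k in range(1, H.shape[0] + 1))

    H_min    = np.array([[2.0, 1.0], [1.0, 3.0]])    # local minimum
    H_saddle = np.array([[2.0, 3.0], [3.0, 1.0]])    # saddle point

    print(is_positive_definite(H_min))     # True
    print(is_positive_definite(H_saddle))  # False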
I agree it is confusing, because starting with notation will confuse you. I personally don't like the partial derivative-first definition of those concepts, as it all sounds a bit arbitrary.
What made sense to me is to start from the definition of the derivative (the best linear approximation, in some sense), and then everything else is about how to represent it. Vectors, matrices, etc. are all vectors in the appropriate vector space, and the derivative always has the same functional form.
E.g. you want the derivative of f(M)? Just write f(M+h) - f(M), and then look for the terms in h / h^2 / etc. Apply the chain rule, etc., for more complicated cases. This is IMO a much better way to learn about this.
As for notation, you use vec/Kronecker products for the complicated cases: https://janmagnus.nl/papers/JRM093.pdf
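As a concrete instance of that recipe (my own example): take f(M) = M·M. Then f(M+h) - f(M) = Mh + hM + h², so the part linear in h, namely Df(M)[h] = Mh + hM, is the derivative. A quick NumPy check:

    import numpy as np

    rng = np.random.default_rng(0)
    M = rng.standard_normal((3, 3))
    h = 1e-6 * rng.standard_normal((3, 3))    # small perturbation

    def f(M):
        return M @ M

    exact_diff = f(M + h) - f(M)
    linear_term = M @ h + h @ M                # Df(M)[h]

    # The leftover is the h^2 term, negligible for small h
    print(np.max(np.abs(exact_diff - linear_term)))   # ~1e-12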
Mathematicians are afraid of higher order tensors because they are unruly monsters.
There's a whole workshop of useful matrix tools. Decompositions, spectral theory, etc. These tools really break down when you generalize them to k-tensors. Even basic concepts like rank become sticky. (Iirc, the set of 3-tensors of tensor rank ≤k is not even topologically closed in general. Terrifying.) If you hand me some random 5-tensor, it's quite difficult to begin to understand it without somehow turning it into a matrix first by flattening or slicing or whatever.
Don't get me wrong. People work with these things. They do their best. But in general, mathematicians are afraid of higher order tensors. You should be too.
(Only half kidding)
A bit more advanced than this post, but for calculating Jacobians and Hessians, the Julia folks have done some cool work recently building on classical automatic differentiation research: https://iclr-blogposts.github.io/2025/blog/sparse-autodiff/
Have you tried using Enzyme (https://enzyme.mit.edu/)? It operates on the LLVM IR, so it's available in any language that breaks down into LLVM (e.g., Julia, where I've used it for surface gradients) and it produces highly optimized AD code. Pretty cool stuff.
Yeah I've used it (cool project indeed!), albeit mostly just in a project I and others in the autodiff community maintain which benchmarks many different autodiff tools against each other: https://github.com/gradbench/gradbench
About a decade ago I was interviewed for Apple's self driving car project and an exec on the project asked me to define these exact 4 things in great detail and provide examples. Shrugs.
Thank you so much for posting. I finally understand the Jacobian matrix. The key is to know that this applies to a function that returns multiple values. The Wikipedia article was difficult to understand, until now! Note: technically, a function can map an input to only a single output. Here, when we say a function returns multiple values, we mean a single set of multiple values. For example, a function that outputs the heating and cooling cost of a building. Whereas a circle is not a function, because it outputs two y values for a single x.
"What I just described is an iterative optimization method that is similar to gradient descent. Gradient descent simulates a ball rolling down hill to find the lowest point that we can, adjusting step size, and even adding momentum to try and not get stuck in places that are not the true minimum."
That is so much easier to understand than most descriptions. The whole opening was.
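A minimal plain-Python sketch of the kind of update the quoted passage describes (my own, not from the article): the "ball" keeps part of its previous velocity, which damps the zig-zagging in narrow valleys.

    import numpy as np

    def gd_momentum(grad, x0, lr=0.1, beta=0.9, steps=300):
        x = np.array(x0, dtype=float)
        v = np.zeros_like(x)
        for _ in range(steps):
            v = beta * v - lr * grad(x)   # keep some of the previous velocity
            x = x + v                     # take the step (the ball rolls)
        return x

    # An elongated bowl: f(x, y) = 5x^2 + 0.5y^2, minimum at the origin
    grad_f = lambda x: np.array([10.0 * x[0], 1.0 * x[1]])

    print(gd_momentum(grad_f, x0=[2.0, 2.0]))   # close to [0, 0]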
Mmh, this is a bit sloppy. The derivative of a function f :: a -> b is a function Df :: a -> (a -o b), where the second, funny arrow indicates a linear function. I.e. the derivative Df takes a point in the domain and returns a linear approximation of f (the Jacobian) at that point. And it's always the Jacobian; it's just that when f is R -> R we conflate the Jacobian (a 1x1 matrix in this case) with the number inside of it.
Sorry to actually your actually, but the derivative of a function f from a space A to a space B at the point a is a linear function Df_a from the tangent space of A at a to the tangent space of B at b = f(a).
When the spaces are Euclidean spaces then we conflate the tangent space with the space itself because they're identical.
By the way, this makes it easy to remember the chain rule formula in 1 dimension. There's only one logical thing it could be between spaces of arbitrary dimensions m, n, p: composition of linear transformations from T_a A to T_f(a) B to T_g(f(a)) C. Now let m = n = p = 1, and composition of linear transformations just becomes multiplication.
The distinction between the space A and the tangent space of A becomes visually clear if we consider a function whose domain is a sphere. The derivative is properly defined on the tangent plane, which only touches the sphere at a single point. In the neighborhood of that point the plane and the sphere are very, very close together, but they are inevitably pulled apart by the curvature of the sphere.
Of course that picture is not formally correct. We formally define the tangent space without having to embed the manifold in Euclidean space. But that picture is a correct description of an embedding of both the sphere and the tangent space at a single point.
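The chain-rule-as-composition point above is easy to check numerically; a small sketch of my own with JAX (made-up f and g): the Jacobian of a composition is the matrix product (i.e. composition) of the Jacobians, and in one dimension the 1×1 matrices just multiply.

    import jax
    import jax.numpy as jnp

    def f(x):   # R^2 -> R^3
        return jnp.array([x[0] * x[1], jnp.sin(x[0]), x[1] ** 2])

    def g(y):   # R^3 -> R^2
        return jnp.array([y[0] + y[1] * y[2], jnp.exp(y[0])])

    x = jnp.array([0.7, -1.2])

    J_comp = jax.jacobian(lambda x: g(f(x)))(x)           # Jacobian of g ∘ f, shape (2, 2)
    J_chain = jax.jacobian(g)(f(x)) @ jax.jacobian(f)(x)  # Dg at f(x), composed with Df at x

    print(jnp.allclose(J_comp, J_chain))   # True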
Oh I appreciate you actualling my actually ^^ but isn't this a special case of the one I wrote? I.e. when a and b are manifolds and admit tangent bundles?
A perhaps nicer way to look at things[0] is to hold onto your base points explicitly and say Df :: a -> (b, a -o b), with Df(p) = (f(p), A(p)) where f(p+v) ≈ f(p) + A(p)v. Then you retain the information you need to define composition: D(g∘f)(p) = (c, B∘A), where (b, A) = Df(p) and (c, B) = Dg(b), i.e. the chain rule.
[0] Which I learned from this talk: https://youtube.com/watch?v=17gfCTnw6uE
Yes! I love Conal Elliott's work. The one you wrote is the compositional derivative, which augments the regular derivative by also returning the value of the function itself (otherwise composition won't work well). For anyone interested, look up "The Simple Essence of Automatic Differentiation".
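A rough Python sketch of that compositional derivative (my own, loosely following the construction described above, with jax.jacobian standing in for the derivative of the primitives): D(f) returns both the value and the linear map at the point, and composing D's implements the chain rule.

    import jax
    import jax.numpy as jnp

    def D(f):
        # Compositional derivative: point |-> (value, Jacobian at that point)
        return lambda p: (f(p), jax.jacobian(f)(p))

    def compose(Dg, Df):
        # Chain rule: run Df, feed its value into Dg, compose the linear parts
        def DgF(p):
            b, A = Df(p)          # b = f(p), A = Df at p
            c, B = Dg(b)          # c = g(f(p)), B = Dg at f(p)
            return c, B @ A       # linear maps compose as matrix products
        return DgF

    f = lambda x: jnp.array([x[0] * x[1], jnp.sin(x[1])])
    g = lambda y: jnp.array([y[0] + y[1] ** 2])

    x = jnp.array([0.5, 2.0])
    value, jac = compose(D(g), D(f))(x)

    # Agrees with differentiating the plain composition directly
    print(jnp.allclose(jac, jax.jacobian(lambda x: g(f(x)))(x)))   # True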