CPython is built in C. Can you differentiate through that, i.e. would Python programs then become differentiable too, similar to JAX?
How much control do you have over the gradient? In some cases, it can be useful to explicitly define a custom gradient, or to stop the gradient, or to change the gradient, etc.
Can you define gradients on integral types (int, char)?
Regarding differentiating Python via CPython: theoretically yes, though in practice it is likely wiser to use something like Numba, which takes Python to LLVM directly and avoids a bunch of abstraction overhead that would otherwise have to be differentiated through. Also, fun fact: JAX can be told to simply emit LLVM, and we've used that as an input for tests :)
You can explicitly define custom gradients by attaching metadata to the function you want to have the custom gradient (and Enzyme will use that even if it could differentiate the original function).
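A toy sketch of the idea, in plain Python rather than LLVM. This is not Enzyme's interface: the `CUSTOM_DERIVATIVES` registry, the `custom_derivative` decorator and the `derivative_of` helper are all hypothetical names made up for illustration, and a finite difference stands in for the automatic path. The point is only that a registered rule takes priority even when the function could be differentiated automatically.

```python
# Hypothetical registry (illustration only): function -> hand-written derivative.
CUSTOM_DERIVATIVES = {}

def custom_derivative(rule):
    """Decorator that attaches a hand-written derivative to a function."""
    def attach(fn):
        CUSTOM_DERIVATIVES[fn] = rule
        return fn
    return attach

def derivative_of(fn, x, h=1e-6):
    """Use the registered rule when one exists, else fall back to a default."""
    rule = CUSTOM_DERIVATIVES.get(fn)
    if rule is not None:
        return rule(x)
    # Stand-in for the automatic path: central finite difference.
    return (fn(x + h) - fn(x - h)) / (2 * h)

@custom_derivative(lambda x: 0.0)   # behaves like a "stop gradient"
def frozen_square(x):
    return x * x

def plain_square(x):
    return x * x

print(derivative_of(plain_square, 3.0))   # close to 6.0
print(derivative_of(frozen_square, 3.0))  # 0.0, the custom rule wins
```

Both functions compute the same value, but only `plain_square` reports a nonzero derivative, which is roughly what stopping or overriding a gradient amounts to.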
Integral types: mayyybe, depending what exactly you mean. I can imagine using custom gradient definitions to try specifying how an integral type can be used in a differentiable way (say representing a fixed point). We don't support differentiating integral types by approximating them as continuous values if that's what you're asking. There's no reason why we couldn't add this (besides perhaps bit tricks being annoying to differentiate), but haven't come across a use case.
Thank you for sharing and releasing usable code! Do you know if this would work for GPU-based applications? TensorFlow models that are trained on a GPU, for example?
For GPUs, there are a couple of different things that you might want to do.
You can use existing tools within LLVM to automatically generate GPU code out of existing code, and this works perfectly fine, even running Enzyme first to synthesize the derivative.
You can also consider taking an existing GPU kernel and automatically differentiating it. We currently support a limited set of cases for this (certain CUDA instructions, shared memory, etc.) and are working on expanding that set as well as on performance improvements. AD of existing general GPU kernels is interesting [and more challenging] since racy reads in your original code become racy writes in the gradient -- which need extra care to make sure they don't conflict. To my knowledge, GPU AD on general programs (i.e. not one specific code) really hasn't been done before, so it's a fun research problem to work on (and if someone knows of existing tools for this, please email me at wmoses at mit dot edu).
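To see why reads turn into writes, take a hypothetical gather kernel `y[i] = x[idx[i]]`. The sketch below is sequential Python, not GPU code, and the function names are made up for illustration; each loop iteration stands in for one thread. Several threads reading the same `x[j]` is harmless in the forward pass, but in the reverse pass they all accumulate into the same `dx[j]`, which in real parallel code would need an atomic add or a reduction.

```python
def gather(x, idx):
    # Forward pass: "thread" i reads x[idx[i]]. Shared reads are harmless.
    return [x[j] for j in idx]

def gather_reverse(dy, idx, n):
    # Reverse pass: "thread" i writes into dx[idx[i]].
    # Repeated indices mean several threads update the same slot,
    # so parallel code would need atomics or a reduction here.
    dx = [0.0] * n
    for i, j in enumerate(idx):
        dx[j] += dy[i]
    return dx

x = [10.0, 20.0, 30.0]
idx = [0, 2, 2, 0, 2]          # x[0] and x[2] are each read more than once
y = gather(x, idx)
dx = gather_reverse([1.0] * len(idx), idx, len(x))
print(y)    # [10.0, 30.0, 30.0, 10.0, 30.0]
print(dx)   # [2.0, 0.0, 3.0]
```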
Enzyme needs to be able to access the IR of any potentially active functions (calls that it deduced could impact the gradient) to be able to differentiate them.
If all of the code you care about is in one compilation unit, you're immediately good to go.
Multiple compilation units can be handled in a couple of ways, depending on how much effort you want to put into setting it up (and we're working on making this easier).
The easiest way is to compile with Link-Time Optimization (LTO) and have Enzyme run during LTO, which ensures it has access to bitcode for all potentially differentiated functions.
The slightly more difficult approach is to have Enzyme ahead-of-time rather than lazily emit derivatives for any functions you may call in an active way (and incidentally this is where Enzyme's rather aggressive activity analysis is super useful). Leveraging Enzyme's support for custom derivatives in which an LLVM function declaration can have metadata that marks its derivative function, Enzyme can then be told to use the "custom" derivatives it generated while compiling other compilation units. This obviously requires more setup so I'm usually lazy and use LTO, but this can definitely be made easier as a workflow.
We go into more details in the Limitations section of the paper, but in short Enzyme requires the following properties:
* IR of active functions must be accessible when Enzyme is called (e.g. cannot differentiate dlopen'd functions)
* Enzyme must be able to deduce the types of operations being performed (see paper section on interprocedural type analysis for details why)
* Support for exceptions is limited (running with -fno-exceptions, its equivalent in a different language, or LLVM's exception lowering pass removes these).
* Support for parallel code (CPU/GPU) is ongoing [and see the prior comment on GPU parallelism for details]
Also see the Julia package, which makes it accessible through a high-level interface and is probably one of the easier ways to play with it: https://github.com/wsmoses/Enzyme.jl.
> The Enzyme project is a tool for performing reverse-mode automatic differentiation (AD) of statically-analyzable LLVM IR. This allows developers to use Enzyme to automatically create gradients of their source code without much additional work.
Can someone please explain applications of creating gradients of my source code?
It constructs an analytical gradient from the code, so you can compute the gradient directly. This can enable optimizations such as avoiding caching big matrices because you don't need to keep track of states/trace the graph, or you can compute the 2nd, 3rd, 4th... and so on derivatives because you have an analytical gradient.
For example, in an affine function the gradient of the bias/intercept is the gradient of the loss wrt the layer's output z (dL/dz), and for the weights it's the product of dL/dz and the input to the layer.
With automatic graph construction, e.g. eager TensorFlow/PyTorch, the layer needs to cache its input so that it can compute the gradient of the weights. If the layer receives inputs multiple times within the computation graph, you end up caching it multiple times.
With analytical gradients, you may be able to save memory by finding optimizations, e.g. above you can sum the inputs, i.e. (dL/dz)input1 + (dL/dz)input2 = (dL/dz)(input1+input2).
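A quick numeric check of that identity for a scalar affine layer z = w*x + b. This is only a sketch: the helper name `affine_grads` and the numbers are made up, and it assumes the upstream gradient dL/dz is the same for both uses of the layer, which is the only case where the factoring applies as written.

```python
def affine_grads(dL_dz, x):
    # For z = w*x + b: dL/dw = dL/dz * x and dL/db = dL/dz.
    return dL_dz * x, dL_dz

dL_dz = 0.5          # assumed upstream gradient, shared by both uses
x1, x2 = 2.0, 3.0    # the same layer sees two inputs

# Cache both inputs and add the two weight gradients...
dw_cached = affine_grads(dL_dz, x1)[0] + affine_grads(dL_dz, x2)[0]
# ...or keep only the running sum of the inputs.
dw_summed = affine_grads(dL_dz, x1 + x2)[0]

print(dw_cached, dw_summed)   # 2.5 2.5
```

The second form needs one stored value instead of two, which is where the memory saving would come from.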
Isn't the input of the layer fundamentally a part of the gradient computation? So even in this case (inspecting LLVM code) the computation still needs to look at the input.
AFAIK, it's mainly used for implementing gradient descent, which is used for training neural networks.
Frameworks like PyTorch and TensorFlow probably use backpropagation to calculate the gradient of a multidimensional function, but it involves tracing and storing the network state during the forward pass.
Static automatic differentiation should be faster and should look a lot like how differentiation is done mathematically rather than numerically.
Of course there are more applications to AD in scientific computing.
I don't see how static AD removes the need to store the network state. Is this a fundamental property of static AD?
Also, your statement sounds like PyTorch/TF are doing AD numerically, which is not the case. They build the analytical gradient from the traced computation graph.
Say you have some existing virus simulation codebase that you want to use ML on to derive an effective policy. Without an AD tool like Enzyme, you'd have to spend significant time and effort understanding and rewriting that obnoxious 100K lines of Fortran into TensorFlow, when you could've been spending it solving your problem. You need this rewriting because many ML algorithms require the derivatives of functions to be able to use them, and Enzyme provides an easy way to generate derivatives of existing code.
This is also useful in the scientific world where derivatives of functions are commonplace.
You could also use it in more performance-engineering/computer systems ways as well by using the derivatives to perform uncertainty quantification and perhaps decide to use 32-bit floats rather than 64-bit doubles.
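A rough sketch of that last idea, under heavy assumptions: it only models the error from storing the input in 32 bits (not the rounding inside the computation), the function `f` is a made-up example, and its derivative is written by hand where an AD tool would generate it. The first-order bound |f'(x)| * |x| * 2^-24 says how much the output can move when x is rounded to a 32-bit float.

```python
import struct

def to_float32(x):
    # Round a Python float (64-bit) to the nearest 32-bit float.
    return struct.unpack("f", struct.pack("f", x))[0]

def f(x):
    return x ** 3

def df(x):
    # Derivative written by hand here; an AD tool would generate it.
    return 3 * x ** 2

x = 1.1
unit_roundoff32 = 2.0 ** -24   # max relative rounding error of a 32-bit float

# First-order bound on how far the output moves if x is stored in 32 bits.
estimated = abs(df(x)) * abs(x) * unit_roundoff32
actual = abs(f(to_float32(x)) - f(x))

print(estimated, actual)
```

If the bound is below the accuracy you need, 32-bit storage for that input is probably fine; a large derivative is a hint to keep it in 64 bits.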
I'm a big believer in auto-diff, but I'm skeptical that any autodiff tool would differentiate a 100k line simulation code correctly and efficiently without manual intervention. I'd certainly love to be proven wrong, though, and AD can absolutely be a big time saver :)
One could, but automatic differentiation is much more efficient than numerical differentiation, thus for high performance applications it is preferable to use automatic differentiation.
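A small comparison, as a sketch only. It uses forward-mode AD with dual numbers because that fits in a few lines; Enzyme itself does reverse mode on LLVM IR, and the `Dual` class and test function `f` here are made up for illustration. AD gets the derivative in one pass with no step size to tune, while the central difference needs two evaluations per input and carries truncation error.

```python
import math

class Dual:
    """Value together with its derivative (forward-mode AD)."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__
    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)
    __rmul__ = __mul__

def sin(d):
    # Chain rule for sin: d/dx sin(u) = cos(u) * u'
    return Dual(math.sin(d.val), math.cos(d.val) * d.dot)

def f(x):
    return x * x * sin(x) + 3 * x

x0 = 1.2
exact = 2 * x0 * math.sin(x0) + x0 * x0 * math.cos(x0) + 3
ad = f(Dual(x0, 1.0)).dot                                   # one pass
h = 1e-5
fd = (f(Dual(x0 + h)).val - f(Dual(x0 - h)).val) / (2 * h)  # two passes

print(abs(ad - exact), abs(fd - exact))
```

With n inputs, finite differences need on the order of n extra function evaluations for a full gradient, which is the efficiency gap that matters for high performance applications.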
Funny, I worked on Tapenade (one of the automatic differentiation tools in the comparison). I'm happy that it still reaches 60% of the performance of something written directly inside an optimizing compiler.
The first time my co-advisor said "you're going to take the derivative of that code by the end of the day" I felt like I'd taken a wrong turn into crazy-town. I enjoyed reading about & understanding Tapenade — such a great piece of code!
Enzyme is so named because it's a tool that "synthesizes derivatives", and also as a pun referencing Zygote (another AD tool), since Enzyme operates at a lower level (LLVM rather than Julia).
Tapenade is just a traditional recipe from the south-east of France, where the research team developing Tapenade is based (they are at INRIA Sophia-Antipolis).
Do you have any sense of how this would integrate with Rust? As someone who isn't familiar with how it works, it's not clear to me whether that would be as easy as normal C FFI interop or more involved.
Enzyme does indeed handle mutable arrays (both in Enzyme.jl and any other frontend)! If you want to try it out, be forewarned that we're currently upgrading Enzyme.jl for better JIT integration (dynamic re-entry, custom derivative passthrough, nicer garbage collection), so there may be some falling debris.
Some more relevant links for the curious
Github: https://github.com/wsmoses/Enzyme
Paper: https://proceedings.neurips.cc/paper/2020/file/9332c513ef44b...
Basically the long story short is that Enzyme has a couple of interesting contributions:
1) Low-level Automatic Differentiation (AD) IS possible and can be high performance
2) By working at LLVM we get cross-language and cross-platform AD
3) Working at the LLVM level actually can give more speedups (since it's able to be performed after optimization)
4) We made a plugin for PyTorch/TF that uses Enzyme to import foreign code into those frameworks with ease!
What happens if the function you want to differentiate calls multiple other functions, in multiple other compilation units?
(I haven't read the paper yet but definitely will)
https://github.com/apple/swift/blob/main/docs/Differentiable...
Which leads to "Swift for TensorFlow", which, unlike the offerings for languages such as Java, Go or Python, is not just about bindings to the C++ TensorFlow library.