There's a whole class of deterministic global optimization solvers on COIN-OR [0] that doesn't seem to get noticed outside the Operations Research / applied math community. (Many are free and open source, e.g. Couenne.)
These solvers (BARON [1], ANTIGONE [2], Couenne [3] etc.) are capable of solving nonconvex global optimization problems that are structurally more complicated (i.e. non-ML applications) than the ones presented in the article linked above.
They employ a bunch of mathematically-rigorous methods such as interval analysis, branch-and-cut and McCormick relaxations. There's a treasure trove of little ideas from decades of global optimization research that I feel the Deep Learning community could benefit from taking a peek at.
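To give a feel for the rigor involved, here is a toy interval-arithmetic sketch in Python (my own illustration, nothing to do with any particular solver's internals). Evaluating a nonconvex function with interval operands yields a guaranteed enclosure of its range over a box, which is what lets branch-and-bound methods discard regions that provably cannot contain the global optimum:

```python
# Toy interval arithmetic: every value is a pair (lo, hi) guaranteed to
# contain the true result. Real solvers also round the endpoints outward
# so the guarantee survives floating-point error.

def i_add(a, b):
    return (a[0] + b[0], a[1] + b[1])

def i_mul(a, b):
    ps = [a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1]]
    return (min(ps), max(ps))

def i_pow_even(a, n):
    # x**n for even n is non-negative, so the enclosure needs care at 0
    lo, hi = abs(a[0]), abs(a[1])
    if a[0] <= 0.0 <= a[1]:
        return (0.0, max(lo, hi) ** n)
    return (min(lo, hi) ** n, max(lo, hi) ** n)

def i_scale(a, c):
    return (min(c * a[0], c * a[1]), max(c * a[0], c * a[1]))

def f_interval(x):
    # f(x) = x^4 - 3x^2 evaluated with interval operands: the result is a
    # guaranteed (if loose) enclosure of the range of f over the input box
    return i_add(i_pow_even(x, 4), i_scale(i_pow_even(x, 2), -3.0))

enclosure = f_interval((-1.0, 2.0))  # contains the true range [-2.25, 4]
```

The enclosure is typically wider than the true range (the "dependency problem"), which is exactly the bounds-growth issue people complain about; splitting the box and recursing is what tightens it.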
I've been looking at open source solvers for a while now to solve an MIP, and this one seems to be the best.
There is this annoying thing about the whole ecosystem though, aside from the proprietary bits.
I started by looking at GLPK and wrote my MIP in GMPL. For some reason, a lot of the tools I looked at don't "just work" with it; I have to export the model to some other format first. That works, but it feels like a workaround.
I wish there were a lingua franca in this domain, one supported by all major solver implementations, but each one seems to have its own format or just provides an API (ahem, google/or-tools).
I am happy I can make it work, but it may not be as easy for others.
Pyomo is also pretty useful if you'd rather use Python than Julia, though it's not as well-documented and doesn't make installation of the open-source solvers quite as easy for you.
You might also be interested in looking at MiniZinc (http://minizinc.org/). It is mostly used in the constraint programming community, but the language is solver-technology neutral and there are back-ends that translate to MIP systems.
Personally, I enjoy using MiniZinc as a language for prototyping models; it lets me experiment with and compare various solver technologies easily.
Speaking of optimization libraries, I've been checking out https://developers.google.com/optimization/ for scheduling and assignment problems. The documentation is not... wonderful, but it has a lot of examples and a decent Python API.
I recently did an "Intro to OR" class and most of the tools we used were proprietary. I've been looking for more open source tools (and more info on OR in general, I think it's fascinating!).
I highly recommend taking a look at the COIN-OR (Computational Infrastructure for Operations Research) project page. All the code is open-source. There are some amazing solvers in that list.
https://www.coin-or.org/projects/
The only issue is that much of the code on that site is very cutting-edge and academic. You may need to read a few benchmarks/publications to figure out which solvers are production quality and which are not.
That said, a handful of these free solvers are so well-written that they exceed the performance of many commercial solvers for certain types of optimization problems.
(A notable exception is MIP. Commercial MIP solvers like Gurobi or CPLEX are orders of magnitude better than the open-source CBC or GLPK. CBC is still pretty decent though, and I bundle it when I need to include a free solver in my code to solve smaller problems.)
I do not have anything to contribute regarding this specific optimization algorithm, but I want to recommend the dlib [0] library it comes from. It is a very high-quality library with machine learning algorithms, numerical analysis, image processing, networking, threading, GUI components, and a few other things.
The (sole, afaik) library author and maintainer Davis King is also extremely responsive and helpful. dlib and Davis deserve a lot more recognition.
Of course, global optimization using the Lipschitz constant has been around for decades. As far as I understand, this paper also estimates the constant. Is that the novelty or is it the convergence proofs?
Another method for global optimization is interval arithmetic, but it is not a black-box method.
Yes indeed (to this stuff being around a while)! And what it can do is a subset of the interval toolkit (also around a long while). Somehow I get the idea that interval methods are underappreciated and thus lesser known; anecdotally, people hear of the outward-rounding bounds-growth problem and go no further. (Maybe evaluation efficiency compared to plain floats is the other gripe.)
"Interval algorithms for constrained and unconstrained optimization are based on adaptive, exhaustive search of the domain. Their overall structure is virtually identical to Lipschitz optimization as in [4], since interval evaluations of an objective function ϕ over an interval vector x correspond to estimation of the range of ϕ over x with Lipschitz constants. However, there are additional opportunities for acceleration of the process with interval algorithms, and use of outwardly rounded interval arithmetic gives the computations the rigor of a mathematical proof."
I'm also curious what this adds that's new. I'll add one other bit: if anyone ever claims they have an efficient global optimization algorithm for continuous optimization, ask them whether P=NP. We can encode the integer constraint x \in {0,1} as x(x-1)=0, so if we could find the global optimum of a continuous problem efficiently, we could do the same for a discrete one. That's probably not the case, so I'm always a bit skeptical.
By the way, the parent mentioned interval arithmetic. If anyone is looking for an optimization algorithm, look up "interval Newton". It's a cool algorithm.
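To make the x(x-1) = 0 trick concrete, here's a tiny Python check (toy illustration of my own): under that single continuous constraint, the feasible set collapses to exactly {0, 1}, so any solver that could handle such constraints globally and efficiently would in effect be solving binary programs.

```python
def residual(x):
    # The continuous constraint x*(x-1) = 0 holds iff x is exactly 0 or 1
    return x * (x - 1.0)

# Scan a fine grid over [-0.5, 1.5]; only 0 and 1 satisfy the constraint.
feasible = [x for x in (i / 1000.0 - 0.5 for i in range(2001))
            if abs(residual(x)) < 1e-12]
```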
Not all continuous optimization algorithms can handle arbitrary equality constraints, especially if they effectively make the optimization domain discrete. You might be able to do an encoding of NP-complete problems that is purely continuous, but if I read the LIPO paper correctly, there is a known exponential lower bound for their problem, so that would just make the NP-complete problem harder.
Instead, the algorithm promises to actually match that lower bound, while always being at least as good as random search, and offering exponentially large improvements in certain cases. Not exactly the kind of claim that has implications for P =? NP.
Interval Newton is probably my favorite algorithm. Intervals in concert with the fixed point theorems - amazing stuff. (and all the rest! - contractors, inclusion/exclusion tests exploiting derivative info, etc.) Good tools for optimization, indeed.
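For anyone curious, here's a bare-bones 1-D interval Newton sketch in Python (my own toy version, hard-coded to g(x) = x^2 - 2 on [1, 2]; a real implementation would use outward rounding and handle derivative enclosures containing 0 by splitting the interval):

```python
def interval_newton_sqrt2(lo=1.0, hi=2.0, iters=4):
    # 1-D interval Newton for g(x) = x^2 - 2 on [lo, hi]:
    #   N(X) = m - g(m) / G'(X), then intersect N(X) with X,
    # where m is the midpoint and G'(X) = [2*lo, 2*hi] encloses g'(x) = 2x
    # over X (on [1, 2] it safely excludes 0, so the division is defined).
    for _ in range(iters):
        m = 0.5 * (lo + hi)
        gm = m * m - 2.0
        q = sorted((gm / (2.0 * lo), gm / (2.0 * hi)))
        lo, hi = max(lo, m - q[1]), min(hi, m - q[0])
    return lo, hi
```

Each step contracts the interval quadratically while provably keeping the root inside it, which is exactly the fixed-point-theorem magic mentioned above.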
> If anyone ever claims they have an efficient global optimization algorithm for continuous optimization, ask whether or not P=NP
Why? Continuous global optimization on convex problems can be solved in polynomial time. Did you mean on nonconvex problems?
Also, x*(x-1)=0 is a complementarity constraint and it is mathematically degenerate (it violates constraint qualifications) when solved as a continuous optimization problem. There are many ways to work around it [0, 1, 2], and there are even solvers built for this class of problems (MPECs = Mathematical Programs with Equilibrium Constraints), which I suspect only work well for narrow classes of problems like Linear Complementarity Problems (LCPs).
I have found that, ultimately, for nonlinear nonconvex problems, continuous-optimization solution techniques for complementarity-constrained problems are empirically unreliable, or the solution is often low quality (even when it can be shown that the complementarity penalty method does not violate constraint qualifications).
Bilinear constraints generally add nonconvexities and add a lot of difficult terrain for a continuous optimizer.
For general nonconvex problems, it's much more reliable to cast the problem as an NP-hard MINLP.
And if you're solving an LCP, you might as well recast it as an MIP rather than mucking around with an LCP. MIP solvers like Gurobi and CPLEX are so amazingly performant these days that it's just a no-brainer.
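To illustrate the recast: the complementarity condition 0 <= x ⟂ (Mx + q) >= 0 turns into a MIP by adding one binary z_i per pair that decides which side is zero. The sketch below (toy Python of my own, brute-forcing the binaries instead of using an actual MIP solver, which is what Gurobi/CPLEX would branch over) makes the idea explicit:

```python
from itertools import product

def gauss_solve(A, b):
    # Tiny Gaussian elimination with partial pivoting; None if singular.
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        if abs(aug[p][c]) < 1e-12:
            return None
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(n):
            if r != c:
                f = aug[r][c] / aug[c][c]
                aug[r] = [a - f * v for a, v in zip(aug[r], aug[c])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def solve_lcp_bruteforce(M, q):
    # LCP: find x >= 0 with w = M x + q >= 0 and x_i * w_i = 0 for all i.
    # The MIP recast adds a binary z_i choosing which side of each pair is
    # zero; here we simply enumerate all 2^n choices instead of branching.
    n = len(q)
    for z in product([0, 1], repeat=n):
        basic = [i for i in range(n) if z[i] == 1]   # w_i = 0 on these
        x = [0.0] * n                                # x_i = 0 elsewhere
        if basic:
            sub = gauss_solve([[M[i][j] for j in basic] for i in basic],
                              [-q[i] for i in basic])
            if sub is None:
                continue
            for k, i in enumerate(basic):
                x[i] = sub[k]
        w = [sum(M[i][j] * x[j] for j in range(n)) + q[i] for i in range(n)]
        if all(v >= -1e-9 for v in x + w):
            return x
    return None
```

For M = [[2,1],[1,2]], q = [-5,-6], the only feasible pattern is w = 0 on both pairs, giving x = (4/3, 7/3).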
Anyone considering using this for continuous hyper-parameter optimization in machine learning should be warned that the automatic Lipschitz constant optimization (the novelty of the method) will very likely not work well with a noisy objective function (any stochastic learning algorithm).
edit: Any grad students looking for a paper idea might consider extending this result to allow stochastic objective functions satisfying a stochastic Lipschitz condition.
> However, if you want to use LIPO in practice there are some issues that need to be addressed. The rest of this blog post discusses these issues and how the dlib implementation addresses them. First, if f(x) is noisy or discontinuous even a little it's not going to work reliably since k will be infinity.
> ... Now each sample from f(x) has its own noise term, σi, which should be 0 most of the time unless xi is really close to a discontinuity or there is some stochasticity.
The strategy that the dlib implementation uses of combining a global optimization method (LIPO) with one that works better locally (TR) reminds me of another recent paper I ran across, which takes a black-box global optimization algorithm (StoSOO) and improves local convergence behavior using CMA-ES [1]. A main difference seems to be that StoSOO focuses on robustness to noisy functions more than LIPO does.
I'd like to share another global optimization algorithm 'worth using' that I've recently discovered for myself: Hyperband [0]. It's extremely simple to implement (a dozen lines of code), is based on the successive halving principle, and is particularly suitable for iterative machine learning algorithms where early stopping is applicable (e.g. NNs, GBMs).
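The successive-halving core really is tiny. Here's a sketch (my own simplification; real Hyperband wraps this in several brackets that trade off the number of configurations against the starting budget):

```python
def successive_halving(configs, run, budget0=1, eta=2, rounds=4):
    # configs: candidate hyper-parameter settings
    # run(cfg, budget): trains cfg for `budget` units (epochs, boosting
    # rounds, ...) and returns a loss; early stopping makes cheap budgets
    # informative enough for culling
    survivors, budget = list(configs), budget0
    for _ in range(rounds):
        scored = sorted(survivors, key=lambda c: run(c, budget))
        survivors = scored[:max(1, len(scored) // eta)]  # keep the best 1/eta
        budget *= eta                                    # survivors get more budget
    return survivors[0]
```

With a toy objective like run = lambda c, b: (c - 0.3)**2 over configs [0.0, 0.1, ..., 1.0], this homes in on 0.3.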
I don't speak fluent maths, but I'm extremely interested. Is anyone in a position to explain what the initial equation for U(x) means?
I'll be in your debt forever!
A Lipschitz function is one that doesn't change too quickly: the difference between two values, |f(x) - f(y)|, is at most the distance between x and y times a constant k, i.e. |f(x) - f(y)| <= k|x-y|. In particular this implies that:
f(x) - f(y) <= k|x-y|
hence
f(x) <= f(y) + k|x-y|.
Therefore the function:
u(x) = f(y) + k|x-y|
is an upper bound for the function f(x). Repeating this for multiple points x1, x2, x3, ... you can construct a tighter upper bound by taking the minimum:
U(x) = min_i f(xi) + k|x - xi|
U(x) is built by evaluating f(x), but instead of checking every value of x, it only checks certain values x1, x2, x3, ..., xt. Then U(x) gives an upper bound on what f(x) might be for any x between the ones you actually calculated: take the distance between x and an xi for which you know f(xi), multiply it by the maximum possible slope k, and add that to f(xi); the smallest such value over all the xi is the upper bound on how high f(x) might be.
Consider the case where f is an elementary function of one variable and you know that its slope between any two points is at most k and at least -k. And suppose you want an upper bound on f(0) but you know the values of f(-2) and f(3). Then f(0) <= f(-2) + 2 * k and f(0) <= f(3) + 3 * k. Taking the minimum of these two right-hand sides leads to a formula analogous to the one for U(x) that you're asking about (with the norm || x - x_i || taking the place of |0 - (-2)| and |0 - 3| in our example).
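Here's a quick numeric sanity check of the bound in Python (toy example, with k = 1 chosen by hand for f = sin, since |sin'| <= 1):

```python
import math

def U(x, samples, k):
    # samples: list of (x_i, f(x_i)) pairs; U(x) is the pointwise minimum
    # of the upward cones f(x_i) + k*|x - x_i|
    return min(fxi + k * abs(x - xi) for xi, fxi in samples)

k = 1.0
samples = [(xi, math.sin(xi)) for xi in (-2.0, 0.5, 3.0)]

# U never dips below f anywhere on a grid over [-5, 5], and it touches f
# exactly at the sampled points.
grid = [i / 100.0 - 5.0 for i in range(1001)]
always_above = all(U(x, samples, k) >= math.sin(x) - 1e-12 for x in grid)
```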
If you can program, you can read maths (understanding its purpose is a different issue).
This formula defines a function U(x): it gives a recipe that takes a number x and produces a number U(x). The recipe depends on already-prepared ingredients: a set of numbers x1, x2, ..., xt, a constant k, and another function f. Now, given a number x, to compute U(x) you first compute all the numbers f(xi) + k*|x - xi|, and then you pick the smallest one. That smallest value is U(x).
The graph below the formula displays its interpretation. The graph of the function f is the red curve, the points (xi, f(xi)) are the black dots, and U(x) is in green.
The doc uses x to refer to the new point and x_i for the i-th previously evaluated point. To reduce ambiguity, I'll use a capital X to refer to the new point.
The equation says that U(X) is equal to the minimum value of the expression f(x_i) + k * (Euclidean distance from X to x_i), where i ranges over the integers 1 to t.
In Go pseudo-code:
func U(X []float64, x [][]float64, f func([]float64) float64, k float64) float64 {
	min := math.Inf(1)
	for _, xi := range x {
		// upper bound contributed by sample xi
		e := f(xi) + k*EuclideanDistance(X, xi)
		if e < min {
			min = e // plain assignment; `min := e` would shadow the outer min
		}
	}
	return min
}
Can someone explain the intuition behind how the Lipschitz constant k is chosen?
It seems like even in his demo video, k starts off underestimated, which makes the upper bound not an actual upper bound (f is sometimes above the green line U), and it is only slowly corrected toward the end. Why does it still work?
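On how k is chosen: as I read the dlib write-up, k is not fixed in advance but estimated from the samples themselves, essentially as the largest slope observed between any pair of evaluated points (the actual implementation solves a small QP with per-sample noise terms so that a single discontinuity doesn't blow k up to infinity). The simplest version of that idea, as a hedged Python sketch:

```python
def estimate_k(samples):
    # samples: list of (x_i, f(x_i)) pairs. The smallest k consistent with
    # the observations so far is the largest observed pairwise slope. Early
    # on this underestimates the true Lipschitz constant, which is why the
    # green "upper bound" in the video briefly dips below f; any new sample
    # that violates the current bound immediately pushes the estimate up.
    k = 0.0
    for i, (xi, fi) in enumerate(samples):
        for xj, fj in samples[i + 1:]:
            if xi != xj:
                k = max(k, abs(fi - fj) / abs(xi - xj))
    return k
```

That also answers the "why does it still work" part: an underestimated k just makes the search greedier around the incumbent, and every contradicting sample corrects the estimate, so the bound only needs to become valid as sampling proceeds.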
[0] https://neos-server.org/neos/solvers/index.html
[1] http://archimedes.cheme.cmu.edu/?q=baron
[2] http://helios.princeton.edu/ANTIGONE/index.html
[3] https://www.coin-or.org/Couenne/
[1] https://github.com/chrisstroemel/Simple
[2] https://news.ycombinator.com/item?id=16241659
JuMP is kind of that in terms of modeling software - http://www.juliaopt.org/
The MiniZinc Challenge (http://www.minizinc.org/challenge2017/results2017.html) is a competition between various solvers where the organizers enter some MIP solver back-ends also.
Edit: NLOpt (https://nlopt.readthedocs.io/en/latest/) is also a nice suite of optimization algorithms for non-linear optimization.
[0] http://dlib.net
However, I just wish its Python bindings were published as binary wheels. dlib is very large and the compilation times suck when installing from pip.
In fact, I might as well contact him and offer to help set this up!
About Lipschitz interval methods, as stated here (first googled example, this from 2001): https://link.springer.com/referenceworkentry/10.1007%2F0-306...
"Interval algorithms for constrained and unconstrained optimization are based on adaptive, exhaustive search of the domain. Their overall structure is virtually identical to Lipschitz optimization as in [4], since interval evaluations of an objective function ϕ over an interval vector x correspond to estimation of the range of ϕ over x with Lipschitz constants. However, there are additional opportunities for acceleration of the process with interval algorithms, and use of outwardly rounded interval arithmetic gives the computations the rigor of a mathematical proof."
By the way, the parent mentioned interval arithmetic. If anyone is looking for optimization algorithm look for "interval Newton". It's a cool algorithm.
Instead, the algorithm promises to actually match that lower bound, while always being at least as good as random search, and offering exponentially large improvements in certain cases. Not exactly the kind of claim that has implications for P =? NP.
Why? Continuous global optimization on convex problems can be solved in polynomial time. Did you mean on nonconvex problems?
Also, x*(x-1)=0 is a complementarity constraint and it is mathematically degenerate (it violates constraint qualifications) when solved as a continuous optimization problem. There are many ways to work-around it [0, 1, 2] and even solvers built for this of problems (MPECs = Mathematical Programs with Equilibrium Constraints) which I suspect only work well for narrow classes of problems like Linear Complementarity Problem (LCPs).
I have found that ultimately for nonlinear non-convex problems, the solution technique for continuous optimization of complementarity constraint problems has been empirically unreliable or the solution is often low equality (even if it can be shown that the complementarity penalty method does not violate constraint qualifications).
Bilinear constraints generally add nonconvexities and add a lot of difficult terrain for a continuous optimizer.
For general nonconvex problems, it's much more reliable to cast the problem as an NP-hard MINLP.
And if you're solving an LCP, you might as well recast it as an MIP rather than mucking around with an LCP. MIP solvers like Gurobi and CPLEX are so amazingly performant these days that it's just a no-brainer
[0] http://www.tandfonline.com/doi/abs/10.1080/10556780410001654...
[1] http://epubs.siam.org/doi/abs/10.1137/S1052623403429081
[2] http://epubs.siam.org/doi/abs/10.1137/040621065
[4] https://www.artelys.com/tools/knitro_doc/2_userGuide/complem...
Is that not what you were pointing out?
[1] https://link.springer.com/article/10.1007/s10898-016-0482-9 / http://ssamot.me/papers/global-optimisation.pdf
[0] https://people.eecs.berkeley.edu/~kjamieson/hyperband.html