Instructions:
- Fork the rfcs repo: https://github.com/pytorch/rfcs
- Copy `RFC-0000-template.md` to `RFC-00xx-my-feature.md`, or write your own open-ended proposal. Put care into the details.
- Submit a pull request titled `RFC-00xx-my-feature`.
- Assign the `draft` label while composing the RFC. You may find it easier to use a WYSIWYG editor (like Google Docs) when working with a few close collaborators; feel free to use whatever platform you like. Ideally this document is publicly visible and is linked to from the PR.
- When opening the RFC for general discussion, copy your document into the `RFC-00xx-my-feature.md` file on the PR and assign the `commenting` label.
- Build consensus for your proposal, integrate feedback, revise it as needed, and summarize the outcome of the discussion via a resolution template. If the RFC is idle here (no activity for 2 weeks), assign the `stalled` label to the PR.
- Once the discussion has settled, assign a new label based on the level of support:
  - `accepted` if a decision has been made in the RFC
  - `draft` if the author needs to rework the RFC’s proposal
  - `shelved` if there are no plans to move ahead with the current RFC’s proposal. We want neither to think about evaluating the proposal nor about implementing the described feature until some time in the future.
- A state of `accepted` means that the core team has agreed in principle to the proposal, and it is ready for implementation.
- The author (or any interested developer) should next open a tracking issue on GitHub corresponding to the RFC. This tracking issue should contain the implementation next steps. Link to this tracking issue on the RFC (in the Resolution > Next Steps section).
- Once all relevant PRs are merged, the RFC’s status label can finally be updated to `closed`.
Author’s note: this RFC is a work in progress.
- Allen Goodman (@0x00b1)
Combine the implicit function theorem and automatic differentiation to unify and expand PyTorch’s optimization module (`torch.optim`).
Users define an objective function to capture the optimality conditions of their problem:
```python
import functorch
from torch import Tensor

def f(x: Tensor, y: Tensor) -> Tensor:
    ...

g = functorch.grad(f, argnums=0)
```
`torch.optim` would provide a robust set of operations to easily express `g`.

`g` is passed to the `@torch.optim.root` decorator, and the decorator is applied to a solver to make it differentiable:

```python
@torch.optim.root(g)
def h(x: Tensor, y: Tensor) -> Tensor:
    ...
```
PyTorch will combine the implicit function theorem and automatic differentiation of `g` to automatically differentiate the solution:

```python
functorch.jacrev(h, argnums=0)(y, 10.0)
```

`torch.optim` would provide a robust set of solvers.
Automatic differentiation encourages PyTorch users to express complex computations by creatively composing elementary computations, removing the tedium of manually computing derivatives. It is the fundamental feature of PyTorch. Meanwhile, the differentiation of optimization solutions has become fundamental to machine learning practitioners. A prominent example is bi-level optimization: computing the derivatives of an inner optimization problem in order to solve an outer one. Applications in machine learning include hyper-parameter optimization, neural networks, and meta-learning. Unfortunately, optimization solutions usually do not enjoy an explicit formula in terms of their inputs, so automatic differentiation cannot be applied to them directly.
Two strategies have been commonly used in recent years to circumvent this problem.
- The first strategy, unrolling the iterations of an optimization method and using the final iterate as a proxy for the solution, enables the explicit construction of an automatically differentiable graph (a minimal sketch of unrolling follows this list). However, unrolling requires reimplementing the optimization method using automatically differentiable operators, and many algorithms are unfriendly to automatic differentiation. Furthermore, forward-mode automatic differentiation has a time complexity that scales linearly with the number of variables, and reverse-mode automatic differentiation has a memory complexity that scales linearly with the number of iterations.
- The second strategy, implicitly relating an optimization solution to its inputs through optimality conditions, is comparatively advantageous since no reimplementation is necessary. In machine learning, implicit differentiation has been used successfully with stationarity conditions, Karush–Kuhn–Tucker (KKT) conditions, and proximal gradient fixed points. Yet implicit differentiation has so far remained difficult for practitioners to use, as it requires case-by-case, tedious mathematical derivation and implementation.
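To make the first strategy concrete, here is a minimal sketch of unrolling, not part of the proposal: a fixed number of gradient-descent iterations on a regularized quadratic are kept on the autograd tape, and the final iterate is differentiated with respect to the regularization strength. The function and variable names are illustrative only.

```python
import torch

# Objective: f(x, theta) = 0.5 * ||x - b||^2 + 0.5 * theta * ||x||^2,
# whose exact minimizer is x*(theta) = b / (1 + theta).
b = torch.tensor([1.0, -2.0, 3.0])

def unrolled_solution(theta: torch.Tensor, steps: int = 100, lr: float = 0.1) -> torch.Tensor:
    # Unroll gradient descent; every iteration stays on the autograd tape.
    x = torch.zeros_like(b)
    for _ in range(steps):
        gradient = (x - b) + theta * x
        x = x - lr * gradient
    return x

theta = torch.tensor(0.5, requires_grad=True)
x_final = unrolled_solution(theta)

# Reverse-mode differentiation through all unrolled iterations; memory grows
# linearly with the number of iterations.
x_final.sum().backward()
print(theta.grad)                              # ~ -sum(b) / (1 + theta)^2 ~ -0.8889
print((-b / (1 + theta.detach()) ** 2).sum())  # closed-form derivative, for comparison
```

The implicit-differentiation strategy described below reaches the same derivative with a single linear solve at the solution, without retaining the iteration history.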
Here, I propose the adoption of a third strategy: automatic implicit differentiation, an approach that adds implicit differentiation to existing optimization methods. First, the practitioner defines a mapping that captures the optimality conditions of their problem; PyTorch then differentiates the solver’s solution implicitly through that mapping, with no solver-specific derivation required.
This strategy was first summarized in the following paper:
```bibtex
@article{jaxopt_implicit_diff,
  title={Efficient and Modular Implicit Differentiation},
  author={Blondel, Mathieu and Berthet, Quentin and Cuturi, Marco and Frostig, Roy
          and Hoyer, Stephan and Llinares-L{\'o}pez, Felipe and Pedregosa, Fabian
          and Vert, Jean-Philippe},
  journal={arXiv preprint arXiv:2105.15183},
  year={2021}
}
```
It is the primary reference for this proposal.
The gradient and Hessian of $f : \mathbb{R}^{d} \to \mathbb{R}$ evaluated at $x$ are written $\nabla f(x) \in \mathbb{R}^{d}$ and $\nabla^{2} f(x) \in \mathbb{R}^{d \times d}$, and the Jacobian of $F : \mathbb{R}^{d} \to \mathbb{R}^{p}$ evaluated at $x$ is written $\partial F(x) \in \mathbb{R}^{p \times d}$. If $f$ or $F$ has several arguments, $\nabla_{i}$, $\nabla^{2}_{i}$, and $\partial_{i}$ denote the gradient, Hessian, and Jacobian in the $i$-th argument. The standard simplex is denoted:

$$\triangle^{d} := \{x \in \mathbb{R}^{d} : \lVert x \rVert_{1} = 1,\ x \ge 0\}.$$

For any set $\mathcal{C} \subseteq \mathbb{R}^{d}$, $\operatorname{proj}_{\mathcal{C}}$ denotes the Euclidean projection onto $\mathcal{C}$, where $\operatorname{proj}_{\mathcal{C}}(x) := \operatorname{argmin}_{y \in \mathcal{C}} \lVert y - x \rVert_{2}^{2}$. For a vector or matrix, $\lVert \cdot \rVert$ denotes the Euclidean or Frobenius norm, respectively.
An optimal solution, denoted $x^{\star}(\theta)$, should be a root of a mapping $F : \mathbb{R}^{d} \times \mathbb{R}^{n} \to \mathbb{R}^{d}$ capturing the optimality conditions:

$$F(x^{\star}(\theta), \theta) = 0.$$

For a point $(x_{0}, \theta_{0})$ satisfying $F(x_{0}, \theta_{0}) = 0$, if $F$ is continuously differentiable with an invertible Jacobian $\partial_{1} F(x_{0}, \theta_{0})$, then the implicit function theorem guarantees that there exist open sets $S_{\theta_{0}} \ni \theta_{0}$ and $S_{x_{0}} \ni x_{0}$ and a unique continuous function $x^{\star} : S_{\theta_{0}} \to S_{x_{0}}$ with $x_{0} = x^{\star}(\theta_{0})$. For all $\theta \in S_{\theta_{0}}$, $F(x^{\star}(\theta), \theta) = 0$ and $x^{\star}$ is differentiable at $\theta$.

Using the chain rule, the Jacobian $\partial x^{\star}(\theta)$ satisfies

$$\partial_{1} F(x^{\star}(\theta), \theta)\, \partial x^{\star}(\theta) + \partial_{2} F(x^{\star}(\theta), \theta) = 0,$$

so computing $\partial x^{\star}(\theta)$ reduces to solving the linear system $A\, \partial x^{\star}(\theta) = B$. Comparing with the identity above, $A = -\partial_{1} F(x^{\star}(\theta), \theta) \in \mathbb{R}^{d \times d}$ and $B = \partial_{2} F(x^{\star}(\theta), \theta) \in \mathbb{R}^{d \times n}$.

When $F(x, \theta) = \nabla_{1} f(x, \theta)$ expresses the stationarity condition of a twice-differentiable objective $f$, $\partial_{1} F$ is the Hessian $\nabla^{2}_{1} f$ and $\partial_{2} F$ is the cross-derivative $\partial_{2} \nabla_{1} f$; computing products with either is straightforward because we have direct access to $F$ and can apply automatic differentiation to it.
Many existing and new implicit differentiation methods reduce to this principle. This strategy is efficient, as it can be added to any solver, and modular, as the optimality condition is decoupled from the implicit differentiation method.
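As an illustration (not the proposed API), the ridge-style problem below has a closed-form solution, so the Jacobian obtained by solving $A\,\partial x^{\star}(\theta) = B$, with $\partial_{1} F$ and $\partial_{2} F$ computed by autograd, can be checked against direct differentiation of the closed form. All names here are illustrative.

```python
import torch
from torch.autograd.functional import jacobian

# f(x, theta) = 0.5 * ||x - b||^2 + 0.5 * theta * ||x||^2.
# Stationarity condition: F(x, theta) = grad_x f(x, theta) = (x - b) + theta * x = 0.
b = torch.tensor([1.0, -2.0, 3.0])

def F(x: torch.Tensor, theta: torch.Tensor) -> torch.Tensor:
    return (x - b) + theta * x

def solution(theta: torch.Tensor) -> torch.Tensor:
    # Closed form, used here only to verify the implicit derivative.
    return b / (1.0 + theta)

theta = torch.tensor(0.5)
x_star = solution(theta)

# A = -d1F and B = d2F, both evaluated at the solution.
A = -jacobian(lambda x: F(x, theta), x_star)   # shape (3, 3)
B = jacobian(lambda t: F(x_star, t), theta)    # shape (3,)

# Implicit function theorem: A @ dx*/dtheta = B.
dx_dtheta = torch.linalg.solve(A, B)

# Reference: differentiate the closed-form solution directly.
print(torch.allclose(dx_dtheta, jacobian(solution, theta), atol=1e-6))  # True
```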
In many applications, $F$ arises from a fixed-point iteration, $x^{\star}(\theta) = T(x^{\star}(\theta), \theta)$, where $T : \mathbb{R}^{d} \times \mathbb{R}^{n} \to \mathbb{R}^{d}$. In this case, when $F(x, \theta) = T(x, \theta) - x$, we obtain $A = I - \partial_{1} T(x^{\star}(\theta), \theta)$ and $B = \partial_{2} T(x^{\star}(\theta), \theta)$.
In most practical scenarios, it is not necessary to explicitly form the Jacobian matrices; it is sufficient to left-multiply or right-multiply by $\partial_{1} F$ and $\partial_{2} F$, i.e., to compute vector-Jacobian products (VJPs) and Jacobian-vector products (JVPs), which automatic differentiation provides without materializing the Jacobians.

The right-multiplication (JVP) between $J = \partial x^{\star}(\theta)$ and a vector $v$, $Jv$, can be computed efficiently by solving $A(Jv) = Bv$; the left-multiplication (VJP), $v^{\top} J$, by first solving $A^{\top} u = v$ and then computing $u^{\top} B$.

To solve these linear systems, we can use the conjugate gradient method when $A$ is symmetric positive semi-definite, and GMRES or BiCGSTAB otherwise; all of these solvers are matrix-free, requiring only matrix-vector products.
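The sketch below (again illustrative) computes a VJP $v^{\top} \partial x^{\star}(\theta)$ for the toy problem from the previous sketch without materializing any Jacobian. It uses the gradient-descent fixed-point residual $F(x, \theta) = -\eta \nabla_{1} f(x, \theta)$, so that $A = \eta \nabla^{2}_{1} f$ is symmetric positive definite; the products with $\partial_{1} F$ and $\partial_{2} F$ come from `torch.autograd.grad`, and $A^{\top} u = v$ is solved with a hand-rolled conjugate gradient loop.

```python
import torch

# Toy problem from the previous sketch: f(x, theta) = 0.5*||x - b||^2 + 0.5*theta*||x||^2.
# Gradient-descent fixed-point residual: F(x, theta) = -eta * grad_x f(x, theta),
# so A = -dF/dx = eta * Hessian, which is symmetric positive definite here.
b = torch.tensor([1.0, -2.0, 3.0])
eta = 0.1

def F(x: torch.Tensor, theta: torch.Tensor) -> torch.Tensor:
    return -eta * ((1.0 + theta) * x - b)

theta = torch.tensor(0.5, requires_grad=True)
x_star = (b / (1.0 + theta)).detach().requires_grad_(True)
residual = F(x_star, theta)

def A_matvec(u: torch.Tensor) -> torch.Tensor:
    # (-dF/dx)^T u via a vector-Jacobian product; A is symmetric here, so A^T u = A u.
    (g,) = torch.autograd.grad(residual, x_star, grad_outputs=u, retain_graph=True)
    return -g

def conjugate_gradient(matvec, v, iterations=25):
    # Plain CG for a symmetric positive definite operator given only through matvecs.
    u = torch.zeros_like(v)
    r, p = v.clone(), v.clone()
    for _ in range(iterations):
        Ap = matvec(p)
        alpha = (r @ r) / (p @ Ap)
        u, r_new = u + alpha * p, r - alpha * Ap
        p = r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
    return u

v = torch.ones(3)
u = conjugate_gradient(A_matvec, v)                               # solve A^T u = v
vjp = torch.autograd.grad(residual, theta, grad_outputs=u)[0]     # u^T (dF/dtheta)
print(vjp)   # v^T dx*/dtheta = -sum(b) / (1 + theta)^2 ~ -0.8889
```

Note that the result is independent of the step size $\eta$, as expected.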
Oftentimes, the goal of a practitioner is not to differentiate $x^{\star}(\theta)$ itself, but the composition of $x^{\star}(\theta)$ with pre-processing or post-processing mappings. One example of such pre-processing is to convert the parameters to be differentiated from one form into another canonical form, such as a quadratic or conic program. Another example is when $x^{\star}(\theta)$ is consumed by a downstream computation, such as the outer objective of a bi-level optimization problem.
PyTorch should leave the differentiation of such pre- and post-processing mappings to the automatic differentiation system, allowing functions to be composed in complex ways.
Let $T(x, \theta) = x - \eta \nabla_{1} f(x, \theta)$ be the gradient-descent fixed-point iteration for minimizing $f(x, \theta)$ with respect to $x$, with step size $\eta > 0$.

Applying the chain and product rules:

$$\partial_{1} T(x, \theta) = I - \eta \nabla^{2}_{1} f(x, \theta), \qquad \partial_{2} T(x, \theta) = -\eta\, \partial_{2} \nabla_{1} f(x, \theta).$$

So $A = I - \partial_{1} T(x^{\star}(\theta), \theta) = \eta \nabla^{2}_{1} f(x^{\star}(\theta), \theta)$ and $B = \partial_{2} T(x^{\star}(\theta), \theta) = -\eta\, \partial_{2} \nabla_{1} f(x^{\star}(\theta), \theta)$; note that the step size $\eta$ cancels in the solution of $A\, \partial x^{\star}(\theta) = B$.

Newton’s method is obtained by

$$T(x,\theta) = x - \eta\, [\nabla^{2}_{1} f(x,\theta)]^{-1} \nabla_{1} f(x,\theta).$$

The LU (or Cholesky) decomposition of $\nabla^{2}_{1} f(x,\theta)$ computed during the forward solve can be reused when solving the linear system required for implicit differentiation.
Decorators and Functions
@torch.optim.root
```python
from typing import Callable

def root(f: Callable, g: Callable):
    """
    Add implicit differentiation to a root-finding method.

    Args:
        f: optimality equation, ``f(parameters, *args)``.
            The invariant is ``f(solution, *args) == 0`` at ``solution``.
        g: linear solver of the form ``g(a, b)``.

    Returns:
        A solver-function decorator, i.e., ``root(f, g)(solver)``.
    """
    pass
```
When a solver function is decorated with `@torch.optim.root`, PyTorch adds custom JVP and VJP methods to the `Function` instance, overriding PyTorch’s default behavior. Linear system solvers based on matrix-vector products are used, and they only need access to $F$ through vector-Jacobian and Jacobian-vector products; the Jacobians $\partial_{1} F$ and $\partial_{2} F$ are never materialized.
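A rough sketch of how such a decorator could be wired up with `torch.autograd.Function` follows. This is not the proposed implementation: the names (`implicit_root`, `ImplicitRoot`) are hypothetical, the linear system is solved densely rather than with a matrix-free solver, and the solver’s input and output are assumed to be 1-D tensors for brevity.

```python
import torch

def implicit_root(optimality_fn):
    # Hypothetical decorator: wrap a black-box solver so that its output is
    # differentiated via the implicit function theorem instead of by unrolling.
    def decorator(solver_fn):
        class ImplicitRoot(torch.autograd.Function):
            @staticmethod
            def forward(ctx, theta):
                # Run the (possibly non-differentiable) solver off the autograd tape.
                with torch.no_grad():
                    x_star = solver_fn(theta)
                ctx.save_for_backward(x_star, theta)
                return x_star

            @staticmethod
            def backward(ctx, grad_output):
                x_star, theta = ctx.saved_tensors
                x, t = x_star.detach(), theta.detach()
                d1F = torch.autograd.functional.jacobian(lambda x_: optimality_fn(x_, t), x)
                d2F = torch.autograd.functional.jacobian(lambda t_: optimality_fn(x, t_), t)
                # VJP via the implicit function theorem: solve (-d1F)^T u = v,
                # then return u^T d2F as the gradient with respect to theta.
                u = torch.linalg.solve(-d1F.T, grad_output)
                return u @ d2F

        def wrapped(theta):
            return ImplicitRoot.apply(theta)

        return wrapped

    return decorator
```

Under this sketch, `@implicit_root(F)` could wrap any root-finding routine, even a compiled, non-differentiable one, and reverse-mode differentiation through the wrapped solver would cost one linear solve instead of a backward pass through every solver iteration.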
@torch.optim.fixed_point
```python
from typing import Callable

def fixed_point(f: Callable, g: Callable):
    """
    Add implicit differentiation to a fixed-point method.

    Args:
        f: fixed-point equation, ``f(parameters, *args)``.
            The invariant is ``f(solution, *args) == solution`` at ``solution``.
        g: linear solver of the form ``g(a, b)``.

    Returns:
        A solver-function decorator, i.e., ``fixed_point(f, g)(solver)``.
    """
    pass
```
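To make the fixed-point case concrete without relying on the (not yet implemented) decorator above, the sketch below differentiates the fixed point of the Babylonian square-root iteration $T(x, \theta) = \tfrac{1}{2}(x + \theta / x)$ using $A = I - \partial_{1} T$ and $B = \partial_{2} T$, and checks the result against the analytic derivative $\tfrac{d}{d\theta} \sqrt{\theta} = \tfrac{1}{2\sqrt{\theta}}$. All names are illustrative.

```python
import torch
from torch.autograd.functional import jacobian

def T(x: torch.Tensor, theta: torch.Tensor) -> torch.Tensor:
    # Babylonian (Heron) iteration; its fixed point is sqrt(theta).
    return 0.5 * (x + theta / x)

def solve_fixed_point(theta: torch.Tensor, iterations: int = 20) -> torch.Tensor:
    x = torch.ones_like(theta)
    for _ in range(iterations):
        x = T(x, theta)
    return x

theta = torch.tensor(2.0)
x_star = solve_fixed_point(theta)              # ~ sqrt(2)

# Implicit differentiation of x*(theta) = T(x*(theta), theta):
# A = I - d1T, B = d2T, and A dx*/dtheta = B.
A = 1.0 - jacobian(lambda x: T(x, theta), x_star)
B = jacobian(lambda t: T(x_star, t), theta)
dx_dtheta = B / A

print(dx_dtheta)                  # ~ 0.3536
print(0.5 / torch.sqrt(theta))    # analytic derivative, 1 / (2 sqrt(2))
```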
torch.optim.root_jvp
```python
from typing import Any, Callable, Tuple

def root_jvp(
    f: Callable,
    g: Callable,
    x: Tuple[Any],
    y: Tuple[Any],
    z: Any,
):
    """
    Jacobian-vector product of a root.

    Args:
        f: equation, ``f(parameters, *args)``.
        g: linear solver of the form ``g(a, b)``.
        x: arguments.
        y: tangents.
        z: solution.
    """
    pass
```
torch.optim.root_vjp
```python
from typing import Any, Callable, Tuple

def root_vjp(
    f: Callable,
    g: Callable,
    x: Tuple[Any],
    y: Tuple[Any],
    z: Any,
):
    """
    Vector-Jacobian product of a root.

    Args:
        f: equation, ``f(parameters, *args)``.
        g: linear solver of the form ``g(a, b)``.
        x: arguments.
        y: cotangents.
        z: solution.
    """
    pass
```
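For illustration, a dense reference implementation consistent with the derivation above might look as follows. A real implementation would use the matrix-free linear solver ``g`` rather than forming Jacobians; the name `root_vjp_dense` and the restriction to a single 1-D argument are assumptions made for brevity.

```python
from typing import Any, Callable, Tuple

import torch
from torch.autograd.functional import jacobian

def root_vjp_dense(
    f: Callable,
    x: Tuple[Any],        # arguments (here: a single 1-D tensor theta)
    y: torch.Tensor,      # cotangent with the same shape as the solution
    z: torch.Tensor,      # solution, i.e. f(z, *x) == 0
) -> torch.Tensor:
    (theta,) = x
    d1f = jacobian(lambda z_: f(z_, theta), z)        # d1 f at the solution
    d2f = jacobian(lambda t_: f(z, t_), theta)        # d2 f at the solution
    # Implicit function theorem: solve (-d1 f)^T u = y, then return u^T (d2 f).
    u = torch.linalg.solve(-d1f.T, y)
    return u @ d2f

# Usage on the toy problem f(z, theta) = (1 + theta) * z - b:
b = torch.tensor([1.0, -2.0, 3.0])
f = lambda z, theta: (1.0 + theta[0]) * z - b
theta = torch.tensor([0.5])
z = b / (1.0 + theta[0])
print(root_vjp_dense(f, (theta,), torch.ones(3), z))   # = -sum(b) / (1 + theta)^2
```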
Bracketing Optimizers
PyTorch provides a variety of bracketing methods for univariate functions, or functions involving a single variable.
Bracketing is the process of identifying an interval in which a local minimum lies and then successively shrinking the interval. For many functions, derivative information can be helpful in directing the search for an optimum, but, for some functions, this information may not be available or might not exist.
Bisection Method
```python
from typing import Callable, Optional, Tuple

from torch import Tensor
from torch.optim import Optimizer

class Bisection(Optimizer):
    def __init__(
        self,
        f: Callable[[Tensor], Tensor],
        bracket: Tuple[float, float],
        tolerance: Optional[float] = None,
        maximum_iterations: Optional[int] = None,
    ):
        raise NotImplementedError
```
Example
```python
import functorch
import torch
from torch import Tensor
from torch.optim import Bisection

def f(x: Tensor) -> Tensor:
    return x ** 3 - x - 2

# The root of `f` on the bracket (1, 2) is approximately 1.521.
assert torch.allclose(Bisection(f, (1, 2))().params, torch.tensor(1.521), atol=1e-3)

def g(x: Tensor, k: Tensor) -> Tensor:
    return k * x ** 3 - x - 2

def root(k: Tensor) -> Tensor:
    return Bisection(g, (1, 2))(k).params

# Derivative of the root of `g` with respect to `k`, evaluated at $k = 2.0$.
functorch.grad(root)(torch.tensor(2.0))
```
Brent’s Method
Brent’s method is an extension of the bisection method. It is a root-finding algorithm that combines elements of the secant method and inverse quadratic interpolation. It has reliable and fast convergence properties, and it is the univariate optimization algorithm of choice in many popular numerical optimization packages.
```python
from typing import Callable, Optional, Tuple, Union

from torch import Tensor
from torch.optim import Optimizer

class Brent(Optimizer):
    def __init__(
        self,
        f: Callable[[Tensor], Tensor],
        bracket: Union[Tuple[float, float], Tuple[float, float, float]],
        tolerance: float,
        maximum_iterations: Optional[int] = None,
    ):
        raise NotImplementedError
```
Local Descent Optimizers
Local models incrementally improve a design point until some convergence criterion is met.
A common approach to optimization is to incrementally improve a design point $x$ by taking a step that minimizes the objective value based on a local model, often obtained from a first- or second-order Taylor approximation.
The iterative descent direction procedure involves the following steps:

- Check whether $x^{k}$ satisfies the termination conditions. If it does, terminate; otherwise proceed to the next step.
- Determine the descent direction $d^{k}$ using local information such as the gradient or Hessian. Some algorithms assume $||d^{k}|| = 1$, but others do not.
- Determine the step size or learning rate $\alpha^{k}$. Some algorithms attempt to optimize the step size so that the step maximally decreases $f$.
- Compute the next design point according to:

$$x^{k+1} \leftarrow x^{k} + \alpha^{k} d^{k}.$$

There are many different optimization methods, each with its own way of determining the step size $\alpha$ and the descent direction $d$.
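As a purely illustrative rendering of these steps, gradient descent with a fixed learning rate and a gradient-norm termination test looks like the following; the function name and defaults are placeholders.

```python
import torch

def local_descent(f, x0: torch.Tensor, learning_rate: float = 0.1,
                  tolerance: float = 1e-6, maximum_iterations: int = 1000) -> torch.Tensor:
    x = x0.clone()
    for _ in range(maximum_iterations):
        x.requires_grad_(True)
        (gradient,) = torch.autograd.grad(f(x), x)
        x = x.detach()
        # 1. Termination check.
        if torch.linalg.vector_norm(gradient) < tolerance:
            break
        # 2. Descent direction from local (first-order) information.
        direction = -gradient
        # 3. Fixed step size (a line search would choose it adaptively).
        alpha = learning_rate
        # 4. Update the design point: x <- x + alpha * d.
        x = x + alpha * direction
    return x

x_min = local_descent(lambda x: ((x - 3.0) ** 2).sum(), torch.tensor([0.0]))
print(x_min)   # ~ tensor([3.0])
```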
Line Search
```python
from torch import Tensor
from torch.optim import Optimizer

class LineSearch(Optimizer):
    def __init__(self):
        raise NotImplementedError
```
Trust Region
```python
from torch import Tensor
from torch.optim import Optimizer

class TrustRegion(Optimizer):
    def __init__(self):
        raise NotImplementedError
```
Backtracking Line Search
```python
from torch import Tensor
from torch.optim import Optimizer

class BacktrackingLineSearch(Optimizer):
    def __init__(self):
        raise NotImplementedError
```
Non-Linear Least Squares
Problems of the form:

$$\underset{x}{\text{minimize}}\;\; \frac{1}{2}\sum_{i=1}^{m} r_{i}(x)^{2},$$

where each residual $r_{i} : \mathbb{R}^{n} \to \mathbb{R}$ is a differentiable function of the parameters $x$, are non-linear least-squares problems.
Gauss–Newton
The update equation is solved at every iteration to find the update to the parameters:

$$(J^{\top} J)\,\delta = -J^{\top} r, \qquad x \leftarrow x + \delta,$$

where $r$ is the vector of residuals and $J$ is its Jacobian with respect to the parameters $x$, both evaluated at the current iterate (a sketch of a single step appears after the class stub below).
```python
from typing import Callable, Optional

from torch.optim import Optimizer

class GaussNewton(Optimizer):
    def __init__(
        self,
        f: Callable,
        g: Optional[Callable],
        maximum_iterations: int,
        tolerance: float,
    ):
        pass
```
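A sketch of a single Gauss–Newton step for the update equation above, using `torch.autograd.functional.jacobian` to obtain the residual Jacobian. This is illustrative only; the proposed class would wrap such a step in an iteration loop with a convergence test.

```python
import torch
from torch.autograd.functional import jacobian

def gauss_newton_step(residuals, x: torch.Tensor) -> torch.Tensor:
    # Solve (J^T J) delta = -J^T r for the parameter update delta.
    r = residuals(x)
    J = jacobian(residuals, x)                     # (m, n) residual Jacobian
    delta = torch.linalg.solve(J.T @ J, -J.T @ r)
    return x + delta

# Fit y = exp(a * t) to data by minimizing the sum of squared residuals.
t = torch.linspace(0.0, 1.0, 10)
y = torch.exp(0.3 * t)
residuals = lambda x: torch.exp(x[0] * t) - y

x = torch.tensor([0.0])
for _ in range(5):
    x = gauss_newton_step(residuals, x)
print(x)   # ~ tensor([0.3])
```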
Levenberg–Marquardt
```python
from torch import Tensor
from torch.optim import Optimizer

class LevenbergMarquardt(Optimizer):
    def __init__(self):
        raise NotImplementedError
```
First-Order Method Optimizers
First-order methods rely on gradient information, which can be obtained using automatic differentiation, to help direct the search for a minimum.
```python
from torch import Tensor
from torch.optim import Optimizer

class GradientDescent(Optimizer):
    def __init__(self):
        raise NotImplementedError
```
```python
from torch import Tensor
from torch.optim import Optimizer

class ConjugateGradientDescent(Optimizer):
    def __init__(self):
        raise NotImplementedError
```
```python
from torch import Tensor
from torch.optim import Optimizer

class Adagrad(Optimizer):
    def __init__(self):
        raise NotImplementedError
```
```python
from torch import Tensor
from torch.optim import Optimizer

class RMSProp(Optimizer):
    def __init__(self):
        raise NotImplementedError
```
```python
from torch import Tensor
from torch.optim import Optimizer

class Adadelta(Optimizer):
    def __init__(self):
        raise NotImplementedError
```
```python
from torch import Tensor
from torch.optim import Optimizer

class Adam(Optimizer):
    def __init__(self):
        raise NotImplementedError
```
Second-Order Method Optimizers
Second-order methods leverage second-order approximations that use the second derivative in univariate optimization or the Hessian in multivariate optimization to direct the search. This additional information can help improve the local model used for informing the selection of directions and step lengths in descent algorithms.
Newton’s Method
```python
from torch import Tensor
from torch.optim import Optimizer

class Newton(Optimizer):
    def __init__(self):
        raise NotImplementedError
```
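For reference (not the proposed class, whose interface is left unspecified above), a bare Newton iteration for multivariate minimization can be written with the `functorch` transforms already used in this proposal; the names below are illustrative.

```python
import functorch
import torch

def newton_minimize(f, x: torch.Tensor, iterations: int = 10) -> torch.Tensor:
    gradient_fn = functorch.grad(f)
    hessian_fn = functorch.jacrev(functorch.grad(f))
    for _ in range(iterations):
        # Newton step: solve H d = -g and move to x + d.
        g = gradient_fn(x)
        H = hessian_fn(x)
        x = x + torch.linalg.solve(H, -g)
    return x

# Minimize the convex quadratic 0.5 * x^T Q x - b^T x; the minimizer solves Q x = b.
Q = torch.tensor([[3.0, 1.0], [1.0, 2.0]])
b = torch.tensor([1.0, 1.0])
f = lambda x: 0.5 * x @ Q @ x - b @ x
print(newton_minimize(f, torch.zeros(2)))   # ~ torch.linalg.solve(Q, b)
```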
Secant Method
Newton’s method for univariate function minimization requires the first and second derivatives of the objective. When the second derivative is unavailable or expensive to compute, the secant method approximates it using a finite difference of successive first derivatives (a minimal sketch follows the class stub below).
```python
from torch import Tensor
from torch.optim import Optimizer

class Secant(Optimizer):
    def __init__(self):
        raise NotImplementedError
```
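A minimal sketch of the secant update for univariate minimization, approximating the second derivative from two successive gradient evaluations; the function name and starting points are illustrative.

```python
import functorch
import torch

def secant_minimize(f, x0: float, x1: float, iterations: int = 20) -> torch.Tensor:
    df = functorch.grad(f)
    x_prev, x = torch.tensor(x0), torch.tensor(x1)
    for _ in range(iterations):
        g_prev, g = df(x_prev), df(x)
        if torch.abs(g - g_prev) < 1e-12:
            break  # converged: successive gradients are (numerically) identical
        # Secant approximation of f'' and the corresponding Newton-like step.
        curvature = (g - g_prev) / (x - x_prev)
        x_prev, x = x, x - g / curvature
    return x

# Minimize f(x) = x^4 - 3x^2 + x; a local minimum lies near x ~ 1.13.
f = lambda x: x ** 4 - 3 * x ** 2 + x
print(secant_minimize(f, 1.0, 1.5))
```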
Limited-Memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) Method
```python
from torch import Tensor
from torch.optim import Optimizer

class LBFGS(Optimizer):
    def __init__(self):
        raise NotImplementedError
```
Derivative-Free Optimization
Derivative-free methods rely only on evaluations of the objective function, $f$; they do not require gradient or Hessian information.
Coordinate Descent
Powell’s Method
Powell’s method can search in directions that are not orthogonal to each other. The method can automatically adjust for long, narrow valleys that might otherwise require a large number of iterations for cyclic coordinate descent or other methods that search in axis-aligned directions.
```python
from torch import Tensor
from torch.optim import Optimizer

class Powell(Optimizer):
    def __init__(self):
        raise NotImplementedError
```
Nelder-Mead Method
The Nelder-Mead method uses a simplex to traverse the space in search of a minimum. A simplex is a generalization of a tetrahedron to $n$-dimensional space.
```python
from torch import Tensor
from torch.optim import Optimizer

class NelderMead(Optimizer):
    def __init__(self):
        raise NotImplementedError
```
- The automatic implicit differentiation strategy theoretically only applies to situations where the implicit function theorem is valid, namely, where optimality conditions satisfy the differentiability and invertibility conditions. While this covers a wide range of situations, even for non-smooth optimization problems (e.g., under mild assumptions the solution of a Lasso regression can be differentiated a.e. with respect to the regularization parameter), this strategy could be extended to handle cases where the differentiability and invertibility conditions are not satisfied (e.g., using non-smooth implicit function theorems).
CasADi is an open-source tool for nonlinear optimization and algorithmic differentiation. It facilitates rapid — yet efficient — implementation of different methods for numerical optimal control, both in an offline context and for nonlinear model predictive control (NMPC).