We consider the following minimization problem.
$
\min_\theta f(\rho(\theta))
$
Here, $f$ is a functional to minimise, $\rho$ is a state-space variable, and $\theta$ is a parameter set. A typical application is $f$ being a loss functional, $\rho$ being a function $u$, and $\theta$ being the parameters of a neural network.
## Gradient Descent Methods
Gradient descent methods minimise $f$ by following the negative gradient flow with respect to the parameters $\theta$. This means that if $\theta$ evolves along a curve $\theta(t)$, the minimisation can be described as a discretisation of the ODE
$
\dot{\theta} = -\nabla_{\theta} f(\rho(\theta)).
$
This is the typical approach for neural networks. The problem here is that we minimise with respect to the gradient of the artificial parameter variable $\theta$, and not with respect to the natural state-space variable $\rho$. This is where natural gradient descent methods come in.
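As a concrete sketch, a forward-Euler discretisation of this gradient flow is exactly the familiar gradient descent update $\theta_{k+1} = \theta_k - \Delta t\,\nabla_\theta f(\rho(\theta_k))$. The toy problem below (linear parameterisation $\rho(\theta) = A\theta$, quadratic loss; all concrete choices are illustrative assumptions, not from the text) makes this explicit:

```python
import numpy as np

# Toy setting (illustrative choices): rho(theta) = A @ theta,
# f(rho) = 0.5 * ||rho - b||^2, so f(rho(theta)) is smooth in theta.
A = np.array([[2.0, 0.0], [1.0, 1.0]])
b = np.array([1.0, 0.5])

def grad_theta(theta):
    # Chain rule: d f / d theta = (d rho / d theta)^T (rho - b) = A^T (A theta - b)
    return A.T @ (A @ theta - b)

theta = np.array([1.0, -1.0])
dt = 0.1  # step size = forward-Euler time step for the gradient flow
for _ in range(200):
    # theta_{k+1} = theta_k - dt * grad_theta(theta_k)
    theta = theta - dt * grad_theta(theta)
```

The step size plays the role of the time step in the ODE discretisation; too large a step makes the explicit Euler scheme unstable.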
## Natural Gradient Descent
Assume that $\theta \mapsto \rho(\theta)$ is a smooth parameterisation of states in a manifold $M$. The partial derivatives with respect to the parameters are tangent vectors,
$
\{\partial_{\theta_1}\rho, \dots, \partial_{\theta_p}\rho\}\subset\mathcal{T}M,
$
and they span the part of the tangent space $\mathcal{T}M$ that the parameterisation can reach. A curve $t \rightarrow \theta(t)$ in parameter space induces a smooth curve $t\rightarrow \rho(t) := \rho(\theta(t))$ in state space. By the chain rule we obtain
$
\frac{df(\rho(t))}{dt} = \left\langle \partial_{\rho}f(\rho(t)), \partial_t\rho(t)\right\rangle.
$
Define
$
\frac{d\theta}{dt} = \dot{\theta} = (\eta_1,\dots,\eta_p)^T.
$
Then
$
\partial_t \rho(\theta(t)) = \sum_{i=1}^p \eta_i \partial_{\theta_i}\rho(\theta).
$
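This chain-rule decomposition can be checked numerically; a minimal sketch, with a hypothetical parameterisation chosen purely for illustration:

```python
import numpy as np

# Hypothetical smooth parameterisation rho: R^2 -> R^3 (illustrative only).
def rho(theta):
    t1, t2 = theta
    return np.array([np.sin(t1) * t2, t1 + t2 ** 2, np.cos(t1 - t2)])

theta0 = np.array([0.4, 1.1])
eta = np.array([0.7, -0.3])  # a chosen parameter velocity d theta / dt
eps = 1e-6

# Left-hand side: d/dt rho(theta(t)) along theta(t) = theta0 + t * eta
lhs = (rho(theta0 + eps * eta) - rho(theta0 - eps * eta)) / (2 * eps)

# Right-hand side: sum_i eta_i * d rho / d theta_i (central differences)
rhs = np.zeros(3)
for i in range(2):
    e = np.zeros(2)
    e[i] = eps
    rhs += eta[i] * (rho(theta0 + e) - rho(theta0 - e)) / (2 * eps)

print(np.max(np.abs(lhs - rhs)))  # agreement up to finite-difference error
```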
With this we can now write
$
\frac{df(\rho(t))}{dt} = \left\langle \partial_{\rho}f(\rho(t)), \sum_{i=1}^p \eta_i \partial_{\theta_i}\rho(\theta)\right\rangle.
$
This shows that to maximise the rate of change of $f$ along the curve we need to choose the $\eta_i$ such that the linear combination $\sum_{i=1}^p\eta_i\partial_{\theta_i}\rho(\theta)$ is just the orthogonal projection of $\partial_\rho f(\rho(\theta))$ onto the tangent space $\mathcal{T}M$. To compute this orthogonal projection we require that
$
\langle \partial_\rho f(\rho(t)), \partial_{\theta_j} \rho(\theta)\rangle = \sum_{i=1}^p \eta_i\langle \partial_{\theta_i}\rho(\theta), \partial_{\theta_j}\rho(\theta)\rangle,\quad j=1,\dots,p
$
which is just the usual normal-equation condition that the residual must be orthogonal to the tangent space $\mathcal{T}M$. Now consider the left-hand side of the above equation. We have
$
\partial_{\theta_j}f(\rho(\theta)) = \langle \partial_\rho f(\rho(t)), \partial_{\theta_j} \rho(\theta)\rangle.
$
But this is just the $j$-th component of the gradient with respect to the parameterisation $\theta$ that drives the gradient descent method. Define $\eta^{GD}$ componentwise by $\eta^{GD}_j = \langle\partial_\rho f(\rho(t)), \partial_{\theta_j}\rho(\theta)\rangle$, i.e. the vector computed in the gradient descent method, and denote by $\eta^{nat}$ the coordinates $\eta_i$ that define the orthogonal projection of $\partial_\rho f(\rho(t))$ onto $\mathcal{T}M$. From the normal equations above we obtain
$
G(\theta) \eta^{nat} = \eta^{GD},
$
with $G(\theta)_{i, j} = \langle\partial_{\theta_i}\rho(\theta), \partial_{\theta_j}\rho(\theta)\rangle$ the Gram matrix of the derivatives of the state variable $\rho$ with respect to the parameters. Hence, the natural gradient descent flow follows the ODE
$
\dot{\theta} = -G(\theta)^{-1}\eta^{GD}
$
and is therefore just a preconditioned version of the standard gradient descent flow. The effect of this preconditioning is that we minimise $f$ along its $\rho$-gradient rather than its $\theta$-gradient. Not only is this what we actually want, it also makes the descent direction independent of the chosen parameterisation.
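A minimal numerical sketch of the resulting method (all concrete choices below — the parameterisation, target, step size, and the small damping added to $G$ — are illustrative assumptions, not from the text): build the Jacobian whose columns are the tangent vectors $\partial_{\theta_i}\rho$, form the Gram matrix $G(\theta) = J^T J$ and the parameter gradient $\eta^{GD} = J^T \partial_\rho f$, then solve $G(\theta)\,\eta^{nat} = \eta^{GD}$ for the update:

```python
import numpy as np

# Toy problem (illustrative): rho(theta) in R^2 with p = 2 parameters,
# f(rho) = 0.5 * ||rho - target||^2, so d f / d rho = rho - target.
def rho(theta):
    return np.array([np.exp(theta[0]), theta[0] + theta[1] ** 2])

def jacobian(theta, eps=1e-6):
    # Columns are the tangent vectors d rho / d theta_i (central differences)
    p = len(theta)
    J = np.zeros((2, p))
    for i in range(p):
        e = np.zeros(p)
        e[i] = eps
        J[:, i] = (rho(theta + e) - rho(theta - e)) / (2 * eps)
    return J

target = np.array([1.0, 1.0])
theta = np.array([0.2, 0.8])
lr = 0.5
for _ in range(100):
    J = jacobian(theta)
    grad_rho_f = rho(theta) - target           # the rho-gradient of f
    eta_gd = J.T @ grad_rho_f                  # standard parameter gradient
    G = J.T @ J + 1e-10 * np.eye(len(theta))   # Gram matrix, tiny damping
    eta_nat = np.linalg.solve(G, eta_gd)       # G(theta) eta_nat = eta_gd
    theta = theta - lr * eta_nat               # preconditioned descent step
```

In practice $G$ can be singular or ill-conditioned (the tangent vectors may be nearly linearly dependent), which is why a small damping term is added here before solving, a Levenberg–Marquardt-style regularisation.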