We consider the following minimization problem.
$
\min_\theta f(\rho(\theta))
$
Here, $f$ is a functional to minimise, $\rho$ is a state-space variable, and $\theta$ is a parameter set. A typical application is $f$ being a loss functional, $\rho$ being a function $u$, and $\theta$ being the parameters of a neural network.
## Gradient Descent Methods
Gradient descent methods minimise $f$ by following the negative gradient flow with respect to the parameters $\theta$. This means that if $\theta$ evolves along a curve $\theta(t)$, the minimisation can be described as a discretisation of the ODE
$
\dot{\theta} = -\nabla_{\theta} f(\rho(\theta)).
$
This is the typical approach for neural networks. The problem here is that we minimise with respect to the gradient of the artificial parameter variable $\theta$, and not with respect to the natural state-space variable $\rho$. This is where natural gradient descent methods come in.
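As a concrete sketch, a forward-Euler discretisation of this gradient flow is exactly the familiar gradient descent update $\theta_{k+1} = \theta_k - \Delta t\,\nabla_\theta f(\rho(\theta_k))$. The toy problem below (linear parameterisation $\rho(\theta) = A\theta$, quadratic loss; all concrete choices are illustrative assumptions, not from the text) makes this explicit:

```python
import numpy as np

# Toy setting (illustrative choices): rho(theta) = A @ theta,
# f(rho) = 0.5 * ||rho - b||^2, so f(rho(theta)) is smooth in theta.
A = np.array([[2.0, 0.0], [1.0, 1.0]])
b = np.array([1.0, 0.5])

def grad_theta(theta):
    # Chain rule: d f / d theta = (d rho / d theta)^T (rho - b) = A^T (A theta - b)
    return A.T @ (A @ theta - b)

theta = np.array([1.0, -1.0])
dt = 0.1  # step size = forward-Euler time step for the gradient flow
for _ in range(200):
    # theta_{k+1} = theta_k - dt * grad_theta(theta_k)
    theta = theta - dt * grad_theta(theta)
```

The step size plays the role of the time step in the ODE discretisation; too large a step makes the explicit Euler scheme unstable.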
## Natural Gradient Descent
Assume that $\theta \mapsto \rho(\theta)$ is a smooth parameterisation of states in a manifold $M$. The partial derivatives with respect to the parameters are tangent vectors,
$
\{\partial_{\theta_1}\rho, \dots, \partial_{\theta_p}\rho\}\subset\mathcal{T}M,
$
and they span the part of the tangent space $\mathcal{T}M$ that the parameterisation can reach. A curve $t \rightarrow \theta(t)$ in parameter space induces a smooth curve $t\rightarrow \rho(t) := \rho(\theta(t))$ in state space. By the chain rule we obtain
$
\frac{df(\rho(t))}{dt} = \left\langle \partial_{\rho}f(\rho(t)), \partial_t\rho(t)\right\rangle.
$
Define
$
\frac{d\theta}{dt} = \dot{\theta} = (\eta_1,\dots,\eta_p)^T.
$
Then
$
\partial_t \rho(\theta(t)) = \sum_{i=1}^p \eta_i \partial_{\theta_i}\rho(\theta).
$
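This chain-rule decomposition can be checked numerically; a minimal sketch, with a hypothetical parameterisation chosen purely for illustration:

```python
import numpy as np

# Hypothetical smooth parameterisation rho: R^2 -> R^3 (illustrative only).
def rho(theta):
    t1, t2 = theta
    return np.array([np.sin(t1) * t2, t1 + t2 ** 2, np.cos(t1 - t2)])

theta0 = np.array([0.4, 1.1])
eta = np.array([0.7, -0.3])  # a chosen parameter velocity d theta / dt
eps = 1e-6

# Left-hand side: d/dt rho(theta(t)) along theta(t) = theta0 + t * eta
lhs = (rho(theta0 + eps * eta) - rho(theta0 - eps * eta)) / (2 * eps)

# Right-hand side: sum_i eta_i * d rho / d theta_i (central differences)
rhs = np.zeros(3)
for i in range(2):
    e = np.zeros(2)
    e[i] = eps
    rhs += eta[i] * (rho(theta0 + e) - rho(theta0 - e)) / (2 * eps)

print(np.max(np.abs(lhs - rhs)))  # agreement up to finite-difference error
```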
With this we can now write
$
\frac{df(\rho(t))}{dt} = \left\langle \partial_{\rho}f(\rho(t)), \sum_{i=1}^p \eta_i \partial_{\theta_i}\rho(\theta)\right\rangle.
$
This shows that to maximise the rate of change of $f$ along the curve we need to choose the $\eta_i$ such that the linear combination $\sum_{i=1}^p\eta_i\partial_{\theta_i}\rho(\theta)$ is just the orthogonal projection of $\partial_\rho f(\rho(\theta))$ onto the tangent space $\mathcal{T}M$. To compute this orthogonal projection we require that
$
\langle \partial_\rho f(\rho(t)), \partial_{\theta_j} \rho(\theta)\rangle = \sum_{i=1}^p \eta_i\langle \partial_{\theta_i}\rho(\theta), \partial_{\theta_j}\rho(\theta)\rangle,\quad j=1,\dots,p
$
which is just the usual normal-equation condition that the residual must be orthogonal to the tangent space $\mathcal{T}M$. Now consider the left-hand side of the above equation. We have
$
\partial_{\theta_j}f(\rho(\theta)) = \langle \partial_\rho f(\rho(t)), \partial_{\theta_j} \rho(\theta)\rangle.
$
But this is just the $j$-th component of the gradient with respect to the parameterisation $\theta$ that drives the gradient descent method. Define $\eta^{GD}$ componentwise by $\eta^{GD}_j = \langle\partial_\rho f(\rho(t)), \partial_{\theta_j}\rho(\theta)\rangle$, i.e. the vector computed in the gradient descent method, and denote by $\eta^{nat}$ the coordinates $\eta_i$ that define the orthogonal projection of $\partial_\rho f(\rho(t))$ onto $\mathcal{T}M$. From the normal equations above we obtain
$
G(\theta) \eta^{nat} = \eta^{GD},
$
with $G(\theta)_{i, j} = \langle\partial_{\theta_i}\rho(\theta), \partial_{\theta_j}\rho(\theta)\rangle$ the Gram matrix of the derivatives of the state variable $\rho$ with respect to the parameters. Hence, the natural gradient descent flow follows the ODE
$
\dot{\theta} = -G(\theta)^{-1}\eta^{GD}
$
and is therefore just a preconditioned version of the standard gradient descent flow. The effect of this preconditioning is that we minimise $f$ along its $\rho$-gradient rather than its $\theta$-gradient. Not only is this what we actually want, it also makes the descent direction independent of the chosen parameterisation.
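A minimal numerical sketch of the resulting method (all concrete choices below — the parameterisation, target, step size, and the small damping added to $G$ — are illustrative assumptions, not from the text): build the Jacobian whose columns are the tangent vectors $\partial_{\theta_i}\rho$, form the Gram matrix $G(\theta) = J^T J$ and the parameter gradient $\eta^{GD} = J^T \partial_\rho f$, then solve $G(\theta)\,\eta^{nat} = \eta^{GD}$ for the update:

```python
import numpy as np

# Toy problem (illustrative): rho(theta) in R^2 with p = 2 parameters,
# f(rho) = 0.5 * ||rho - target||^2, so d f / d rho = rho - target.
def rho(theta):
    return np.array([np.exp(theta[0]), theta[0] + theta[1] ** 2])

def jacobian(theta, eps=1e-6):
    # Columns are the tangent vectors d rho / d theta_i (central differences)
    p = len(theta)
    J = np.zeros((2, p))
    for i in range(p):
        e = np.zeros(p)
        e[i] = eps
        J[:, i] = (rho(theta + e) - rho(theta - e)) / (2 * eps)
    return J

target = np.array([1.0, 1.0])
theta = np.array([0.2, 0.8])
lr = 0.5
for _ in range(100):
    J = jacobian(theta)
    grad_rho_f = rho(theta) - target           # the rho-gradient of f
    eta_gd = J.T @ grad_rho_f                  # standard parameter gradient
    G = J.T @ J + 1e-10 * np.eye(len(theta))   # Gram matrix, tiny damping
    eta_nat = np.linalg.solve(G, eta_gd)       # G(theta) eta_nat = eta_gd
    theta = theta - lr * eta_nat               # preconditioned descent step
```

In practice $G$ can be singular or ill-conditioned (the tangent vectors may be nearly linearly dependent), which is why a small damping term is added here before solving, a Levenberg–Marquardt-style regularisation.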