Energy Natural Gradient Descent is a modification of [[Natural Gradient Descent]] that performs a Newton step on the minimisation functional in the tangent plane of the parameterisation. We use the same notation as in [[Natural Gradient Descent]].

## Preliminary: Second derivatives of functionals

We will require the second derivative of the functional $f(\rho)$ for $\rho$ a function. In the following we give a simple example of how to derive second derivatives for functionals. Consider
$$
f(\rho) = \frac{1}{2}\|\rho - g\|^2.
$$
The norm is assumed to be induced by a corresponding inner product $\langle \cdot, \cdot\rangle$. The typical inner product is the average over the evaluations at the training data, as defined in [[Natural Gradient Descent]].

The first functional derivative is obtained by expanding
$$
f(\rho + \epsilon h) = f(\rho) + \epsilon\langle \rho - g, h\rangle + \frac{\epsilon^2}{2} \langle h, h\rangle
$$
for some function $h$. We denote by $\partial_{\rho} f(\rho)[v]$ the functional derivative of $f$ evaluated in the direction of the function $v$. It follows that
$$
\partial_\rho f(\rho)[v] = \langle \rho - g, v\rangle
$$
for the above quadratic functional. For the second functional derivative we repeat the same process on the first derivative, expanding
$$
\partial_{\rho} f(\rho + \epsilon h)[v] = \partial_{\rho} f(\rho)[v] + \epsilon \langle h, v\rangle,
$$
giving
$$
\partial_\rho^2 f(\rho)[v, w] = \langle v, w \rangle.
$$
Hence, the second derivative can be interpreted as the quadratic form over the identity operator, as expected when starting from a quadratic functional.

## Newton steps over functional derivatives

We now want to write down a Newton step for a functional equation. Consider the minimisation problem
$$
\min_\rho f(\rho)
$$
for some functional $f$. A necessary condition for a minimum is that
$$
\partial_{\rho} f(\rho)[v] = 0 \quad \forall v.
$$
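The functional derivatives above can be sanity-checked numerically. The following sketch (my own illustration, not from the source) discretises $\rho$, $g$, and the test directions as their values at $n$ sample points, uses the averaged inner product, and compares central finite differences against the closed forms $\partial_\rho f(\rho)[v] = \langle \rho - g, v\rangle$ and $\partial_\rho^2 f(\rho)[v, w] = \langle v, w\rangle$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Discretise functions as their values at n "training" points;
# the inner product is the average over those evaluations.
n = 1000
inner = lambda u, v: np.mean(u * v)

rho = rng.normal(size=n)   # current iterate rho (arbitrary test data)
g = rng.normal(size=n)     # target function g
v = rng.normal(size=n)     # first test direction
w = rng.normal(size=n)     # second test direction

f = lambda r: 0.5 * inner(r - g, r - g)

eps = 1e-6
# First derivative: central difference of f vs. <rho - g, v>
fd1 = (f(rho + eps * v) - f(rho - eps * v)) / (2 * eps)
print(fd1, inner(rho - g, v))

# Second derivative: central difference of the first derivative vs. <v, w>
d1 = lambda r: inner(r - g, v)   # exact first derivative in direction v
fd2 = (d1(rho + eps * w) - d1(rho - eps * w)) / (2 * eps)
print(fd2, inner(v, w))
```

Because $f$ is quadratic, the central differences agree with the closed forms up to floating-point error.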
We will solve this equation using Newton's method, making the ansatz
$$
\partial_{\rho} f(\rho + h)[v] = 0\quad \forall v,
$$
where $h$ is a function, the direction in which we take the Newton step. Expanding gives
$$
\partial_{\rho} f(\rho + h)[v] = \partial_{\rho}f(\rho)[v] + \partial_{\rho}^2 f(\rho)[v, h] + \text{h.o.t.}
$$
Hence, our Newton step is obtained as the solution $h$ of
$$
\partial_{\rho} f(\rho)[v] + \partial_{\rho}^2 f(\rho)[v, h] = 0\quad \forall v.
$$

## Projected Newton step within the tangent manifold

We now introduce a parameterisation $\theta$ of $\rho$. Let $M$ be the manifold of all functions $\rho(\theta)$ and let $\mathcal{T}M$ be the tangent space spanned by $\{\partial_{\theta_1}\rho(\theta), \dots, \partial_{\theta_p}\rho(\theta)\}$, where $\theta_j$ is the $j$th scalar parameter. We now restrict $v$ and $h$ above to this tangent space, meaning
$$
v = \sum_{i = 1}^p \eta_i^{(v)} \partial_{\theta_i} \rho(\theta), \quad h = \sum_{j = 1}^p \eta_j^{(h)} \partial_{\theta_j} \rho(\theta).
$$
Notice that $\partial_{\rho} f(\rho)[\partial_{\theta_i}\rho(\theta)]$ is just the gradient $\partial_{\theta_i} f(\rho(\theta))$ used in the standard gradient descent method (see [[Natural Gradient Descent]]). We define
$$
\eta^{GD} := [ \partial_{\theta_1} f(\rho(\theta)),\dots, \partial_{\theta_p} f(\rho(\theta))]^T
$$
as the standard gradient descent direction. Further, let $G_E$ be the *Energy Gram Matrix* given by $[G_E]_{i, j} = \partial_\rho^2 f(\rho)[\partial_{\theta_i} \rho(\theta), \partial_{\theta_j}\rho(\theta)]$. With
$$
\eta^{(v)} = [\eta_1^{(v)}, \dots, \eta_p^{(v)}]^T, \quad \eta^{(h)} = [\eta_1^{(h)}, \dots, \eta_p^{(h)}]^T,
$$
the functional Newton step on the tangent plane $\mathcal{T}M$ now takes the simple form
$$
[\eta^{(v)}]^T \eta^{GD} + [\eta^{(v)}]^T G_E \eta^{(h)} = 0\quad \forall \eta^{(v)}.
$$
This is the variational form of a square linear system of equations with the solution
$$
\eta^{(h)} = - G_E^{-1}\eta^{GD}.
$$
The direction $\eta^{(h)}$ is called the *energy natural gradient direction*. The corresponding ODE for the parameterisation takes the form
$$
\dot{\theta} = - G_{E}^{-1}\eta^{GD}.
$$
Note that although the gradient descent direction appears in this equation, the method as a whole is not a gradient descent method. The flow follows the projected Newton direction rather than the gradient, so close to the minimiser we can expect quadratic convergence of the minimisation, as opposed to the linear convergence of plain gradient descent.

## References

[1] J. Müller and M. Zeinhofer, 'Achieving High Accuracy with PINNs via Energy Natural Gradients', _arXiv_, 2023, doi: [10.48550/arxiv.2302.13163](https://doi.org/10.48550/arxiv.2302.13163).