Energy Natural Gradient Descent (ENGD) [1] is a modification of [[Natural Gradient Descent]] that performs a Newton step on the minimisation functional within the tangent space of the parameterisation. We use the same notation as in [[Natural Gradient Descent]].
## Preliminary: Second derivatives of functionals
We will require the second derivative of a functional $f(\rho)$, where $\rho$ is a function. The following simple example illustrates how to derive second derivatives of functionals. Consider
$
f(\rho) = \frac{1}{2}\|\rho - g\|^2.
$
The norm is assumed to be induced by a corresponding inner product $\langle \cdot, \cdot\rangle$; the typical choice is the average over evaluations at the training data, as defined in [[Natural Gradient Descent]]. The first functional derivative is obtained by expanding
$
f(\rho + \epsilon h) = f(\rho) + \epsilon\langle \rho - g, h\rangle + \frac{\epsilon^2}{2} \langle h, h\rangle
$
for some function $h$. We denote by $\partial_{\rho} f(\rho)[v]$ the functional derivative of $f$ evaluated in the direction of the function $v$. It follows that
$
\partial_\rho f(\rho)[v] = \langle \rho - g, v\rangle
$
for the above quadratic functional. For the second functional derivative we need to repeat the same process to obtain
$
\partial_{\rho} f(\rho + \epsilon h)[v] = \partial_{\rho} f(\rho)[v] + \epsilon \langle h, v\rangle,
$
giving
$
\partial_\rho^2 f(\rho)[v, w] = \langle v, w \rangle.
$
Hence, the second derivative can be interpreted as the quadratic form induced by the identity operator, as expected when starting from a quadratic functional.
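As a sanity check, these identities can be verified numerically by discretising $\rho$ as its vector of values at $N$ sample points and taking $\langle u, v\rangle$ as the empirical mean of pointwise products. The following JAX sketch (names such as `inner`, `rho`, `g` are illustrative choices, not from the source) evaluates both functional derivatives via Jacobian-vector products:

```python
import jax
import jax.numpy as jnp

# Discretised setting: a "function" is its vector of values at N sample
# points, and <u, v> is the empirical mean of the pointwise products.
inner = lambda u, v: jnp.mean(u * v)

def f(rho, g):
    return 0.5 * inner(rho - g, rho - g)

key = jax.random.PRNGKey(0)
N = 100
rho, g, v, w = jax.random.normal(key, (4, N))

# First functional derivative in direction v: a Jacobian-vector product.
_, df_v = jax.jvp(lambda r: f(r, g), (rho,), (v,))
assert jnp.allclose(df_v, inner(rho - g, v), atol=1e-6)

# Second functional derivative in directions (v, w): differentiate the
# directional derivative once more (forward-over-forward mode).
d2f_vw = jax.jvp(lambda r: jax.jvp(lambda s: f(s, g), (r,), (w,))[1],
                 (rho,), (v,))[1]
assert jnp.allclose(d2f_vw, inner(v, w), atol=1e-6)
```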
## Newton steps over functional derivatives
We now want to write down a Newton step for a functional equation. Consider the minimisation problem
$
\min_\rho f(\rho)
$
for some functional $f$. A necessary condition for a minimum is that
$
\partial_{\rho} f(\rho)[v] = 0 \quad \forall v.
$
We solve this equation using Newton's method, making the ansatz
$
\partial_{\rho} f(\rho + h)[v] = 0\quad \forall v
$
where the function $h$ is the direction of the Newton step. Expanding around $\rho$ gives
$
\partial_{\rho} f(\rho + h)[v] = \partial_{\rho}f(\rho)[v] + \partial_{\rho}^2 f(\rho)[v, h] + \text{h.o.t.}
$
Hence, dropping the higher-order terms, our Newton step is obtained as the solution $h$ of
$
\partial_{\rho} f(\rho)[v] + \partial_{\rho}^2 f(\rho)[v, h] = 0\quad \forall v.
$
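To make this concrete, here is a minimal sketch of the Newton step in the same discretised setting as above (a toy in the function values themselves, not yet the parameterised method): in coordinates, the variational condition becomes a square linear system with the Hessian of $f$.

```python
import jax
import jax.numpy as jnp

inner = lambda u, v: jnp.mean(u * v)
f = lambda rho, g: 0.5 * inner(rho - g, rho - g)

key = jax.random.PRNGKey(0)
rho, g = jax.random.normal(key, (2, 100))

# The functional Newton condition  df(rho)[v] + d2f(rho)[v, h] = 0 for all v
# becomes, in coordinates, the square linear system  H h = -grad.
grad_f = jax.grad(lambda r: f(r, g))(rho)
H = jax.hessian(lambda r: f(r, g))(rho)
h = jnp.linalg.solve(H, -grad_f)

# For the quadratic example one Newton step is exact: rho + h recovers g.
assert jnp.allclose(rho + h, g, atol=1e-4)
```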
## Projected Newton step within the tangent space
We now introduce a parameterisation $\theta$ of $\rho$. Let $M$ be the manifold of all functions $\rho(\theta)$ and let $\mathcal{T}M$ be the tangent space spanned by $\{\partial_{\theta_1}\rho(\theta), \dots, \partial_{\theta_p}\rho(\theta)\}$, where $\theta_j$ denotes the $j$th scalar parameter.
We now restrict $v$ and $h$ above onto this tangent space, meaning
$
v = \sum_{i = 1}^p \eta_i^{(v)} \partial_{\theta_i} \rho(\theta), \quad h = \sum_{j = 1}^p \eta_j^{(h)} \partial_{\theta_j} \rho(\theta).
$
Notice that, by the chain rule, $\partial_{\rho} f(\rho)[\partial_{\theta_i}\rho(\theta)]$ is just the partial derivative $\partial_{\theta_i} f(\rho(\theta))$ used in standard gradient descent (see [[Natural Gradient Descent]]). We define
$
\eta^{GD}:= [ \partial_{\theta_1} f(\rho(\theta)),\dots, \partial_{\theta_p} f(\rho(\theta))]^T
$
as the standard gradient descent direction. Further, let $G_E$ be the *Energy Gram Matrix* given as $[G_E]_{i, j} = \partial_\rho^2 f(\rho)[\partial_{\theta_i} \rho(\theta), \partial_{\theta_j}\rho(\theta)]$. With
$
\eta^{(v)} = [\eta_1^{(v)}, \dots, \eta_p^{(v)}]^T, \quad \eta^{(h)} = [\eta_1^{(h)}, \dots, \eta_p^{(h)}]^T,
$
the functional Newton step on the tangent space $\mathcal{T}M$ now takes the simple form
$
[\eta^{(v)}]^T \eta^{GD} + [\eta^{(v)}]^T G_E \eta^{(h)} = 0,\quad \forall \eta^{(v)}.
$
Since this must hold for all $\eta^{(v)}$, it is the variational form of a square linear system of equations with the solution
$
\eta^{(h)} = - G_E^{-1}\eta^{GD}.
$
The direction $\eta^{(h)}$ is called the *energy natural gradient direction*. The corresponding ODE for the parameters takes the form
$
\dot{\theta} = - G_{E}^{-1}\eta^{GD}.
$
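Putting the pieces together, the following sketch performs one discrete ENGD update $\theta \leftarrow \theta - \eta^{(h)}$ for a toy linear parameterisation. The feature map, the unit step size, and the pseudo-inverse (guarding against a singular Gram matrix) are illustrative assumptions, not prescriptions from [1]; for the quadratic functional, $G_E$ reduces to the Gram matrix of the tangent basis under the empirical inner product.

```python
import jax
import jax.numpy as jnp

# Toy setup: rho(theta) is a linear model evaluated at fixed sample points,
# f(rho) = 0.5 <rho - g, rho - g> with the empirical-mean inner product.
x = jnp.linspace(0.0, 1.0, 50)
g = jnp.sin(2.0 * jnp.pi * x)                      # target function values
features = jnp.stack([jnp.ones_like(x), x, x**2], axis=-1)

def rho(theta):
    return features @ theta                        # rho(theta) at the samples

def f_of_theta(theta):
    r = rho(theta) - g
    return 0.5 * jnp.mean(r * r)

theta = jax.random.normal(jax.random.PRNGKey(0), (3,))

# eta_GD: the ordinary parameter gradient.
eta_gd = jax.grad(f_of_theta)(theta)

# Energy Gram matrix [G_E]_{ij} = d2f(rho)[d_theta_i rho, d_theta_j rho].
# For this quadratic f the second derivative is the inner product itself,
# so G_E is the Gram matrix of the Jacobian columns under the empirical mean.
J = jax.jacobian(rho)(theta)                       # shape (N, p)
G_E = J.T @ J / x.shape[0]

# One energy natural gradient step; pinv is an illustrative safeguard
# against a rank-deficient Gram matrix.
theta = theta - jnp.linalg.pinv(G_E) @ eta_gd

# The linear parameterisation makes the projected Newton step exact:
assert jnp.allclose(jax.grad(f_of_theta)(theta), 0.0, atol=1e-3)
```

For a nonlinear model such as a neural network, the same construction applies with `rho` replaced by the network's output at the sample points; the step is then only locally a Newton step and is typically combined with a line search or damping.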
Note that although the gradient descent direction appears in this equation, the method as a whole is not a gradient descent method: the flow follows the projected Newton direction. This suggests that, close to the minimiser, we can expect quadratic convergence of the minimisation, as opposed to the linear convergence of plain gradient descent.
## References
[1] J. Müller and M. Zeinhofer, 'Achieving High Accuracy with PINNs via Energy Natural Gradients', _arXiv_, 2023, doi: [10.48550/arxiv.2302.13163](https://doi.org/10.48550/arxiv.2302.13163).