We consider a neural network $f_\theta\in\mathcal{F} := \{f: \mathbb{R}^m\rightarrow\mathbb{R}^n\}$, depending on the parameters $\theta\in\mathbb{R}^p$, with a generalised regression loss function $E[f_\theta]$
$
E[f_\theta] = \frac{1}{2N}\sum_{i=1}^N\|R[f_\theta; g](x_i)\|^2 =: \frac{1}{2}\|R[f_\theta; g]\|_{p^{in}}^2.
$
Here, $\|\cdot\|$ is a norm suitable for the problem, and $\|\cdot\|_{p^{in}}$ is the average norm over the training data as defined in [[Neural Tangent Kernel]]. $R[\,\cdot\,; g]: \mathcal{F}\rightarrow \mathcal{F}$ is a possibly nonlinear map on $\mathcal{F}$, $g$ is a parameter describing the ground truth values, and the $x_i\in\mathbb{R}^m$ are the training inputs. Examples are:
- Regression Loss: Here, $R[f_\theta; g] = f_\theta - g$.
- PINN Loss (excluding boundary term): $R[f_\theta; g] = \mathcal{L}f_{\theta} - g$ for some linear partial differential operator $\mathcal{L}$.
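The two residual maps can be sketched on a 1D grid; this is a minimal illustration with hypothetical names, using a central-difference discretisation of $\mathcal{L} = -\mathrm{d}^2/\mathrm{d}x^2$ as a stand-in PDE operator:

```python
import numpy as np

# Illustrative 1D setup: N grid points on [0, 1], a candidate function f_theta
# and ground-truth values g sampled on the grid. All names are hypothetical.
N = 64
x = np.linspace(0.0, 1.0, N)
g = np.sin(np.pi * x)            # ground-truth values g(x_i)
f = np.sin(np.pi * x) + 0.1 * x  # some candidate f_theta on the grid

def regression_residual(f, g):
    """Regression loss: R[f; g] = f - g."""
    return f - g

def pinn_residual(f, g, x):
    """PINN loss: R[f; g] = L f - g with L = -d^2/dx^2 (central differences)."""
    h = x[1] - x[0]
    Lf = np.zeros_like(f)
    Lf[1:-1] = -(f[2:] - 2.0 * f[1:-1] + f[:-2]) / h**2
    return Lf - g

def loss(R):
    """Discrete analogue of E = (1/2) ||R||_{p^in}^2 (mean over grid points)."""
    return 0.5 * np.mean(R**2)
```

For the regression residual the loss vanishes exactly when $f$ matches $g$ on the training points; for the PINN residual it vanishes when the discretised PDE is satisfied.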
Training the neural network means finding a minimiser $\hat{\theta}$ of the minimisation problem
$
\min_{\theta} E[f_\theta].
$
Before we write down the Gauss-Newton iteration, let us first clarify the functional derivative of $R[f_{\theta}; g]$.
## Derivative of $R[f_{\theta}; g]$
We have
$
R[f_\theta + t h; g] = R[f_{\theta}; g] + tDR[f_{\theta}; g]h + \text{h.o.t.},
$
where the expansion is in the scalar variable $t$. Hence, the functional derivative $DR[f_{\theta}; g]$ is a linear operator that maps $\mathcal{F}$ into $\mathcal{F}$. The derivative of $R$ with respect to a single parameter $\theta_i$ can then be written via the chain rule as
$
\partial_{\theta_i} R[f_{\theta}; g] = DR[f_{\theta}; g]\partial_{\theta_i}f_{\theta}.
$
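This chain rule can be checked numerically. A minimal sketch, assuming a model that is linear in its parameters, $f_\theta = \sum_i \theta_i \varphi_i$ (so $\partial_{\theta_i} f_\theta = \varphi_i$), and a PINN-type residual $R[f; g] = \mathcal{L}f - g$ with a random matrix standing in for the discretised operator (so $DR = \mathcal{L}$):

```python
import numpy as np

# Check ∂_{θ_i} R[f_θ; g] = DR[f_θ; g] ∂_{θ_i} f_θ for R[f; g] = L f - g.
# L, g and the features are random stand-ins; all names are illustrative.
rng = np.random.default_rng(0)
N, p = 20, 3
L = rng.normal(size=(N, N))          # stand-in for a discretised PDE operator
g = rng.normal(size=N)
features = rng.normal(size=(p, N))   # row i holds φ_i = ∂_{θ_i} f_θ on the grid

def f_of(theta):
    return features.T @ theta        # f_θ = Σ_i θ_i φ_i

def R_of(theta):
    return L @ f_of(theta) - g       # R[f_θ; g] = L f_θ - g

theta = rng.normal(size=p)
i, eps = 1, 1e-6
e_i = np.eye(p)[i]
# central finite difference of R with respect to θ_i ...
fd = (R_of(theta + eps * e_i) - R_of(theta - eps * e_i)) / (2 * eps)
# ... versus the chain rule DR[f_θ; g] ∂_{θ_i} f_θ = L φ_i
chain = L @ features[i]
```

Since $R$ is linear in $\theta$ here, the finite difference agrees with the chain-rule expression up to rounding error.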
## Gauss-Newton iteration
We follow the derivation of Gauss-Newton in [[Gauss-Newton Method]] and start with the necessary condition at a minimum $\hat{\theta}$.
$
\partial_{\theta_i} E[f_{\hat{\theta}}] = 0;~i = 1, \dots, p.
$
We now use the chain rule once, noting that the functional derivative of $\frac{1}{2}\|f\|_{p^{in}}^2$ in the direction $h$ is given by $\langle f, h\rangle_{p^{in}}$:
$
\begin{align}
0 &= \langle R[f_{\hat{\theta}}; g], \partial_{\theta_i} R[f_{\hat{\theta}}; g]\rangle_{p^{in}}\nonumber\\
&= \langle R[f_{\hat{\theta}}; g], DR[f_{\hat{\theta}}; g]\partial_{\theta_i}f_{\hat{\theta}}\rangle_{p^{in}}, \quad i=1,\dots, p.\nonumber
\end{align}
$
Let $\theta^{(k)}$ be our current iterate and let $\Delta^{(k)} = \theta^{(k)} - \theta^{(k+1)}$. Gauss-Newton uses the following linearised approximation
$
0 = \langle R[f_{\theta^{(k)}}; g] - \sum_{i=1}^p \Delta_i^{(k)}DR[f_{\theta^{(k)}}; g]\partial_{\theta_i}f_{\theta^{(k)}}, DR[f_{\theta^{(k)}}; g]\partial_{\theta_j}f_{\theta^{(k)}}\rangle_{p^{in}}, \quad j = 1, \dots, p.
$
This is the normal equation for the linear least-squares problem
$
\min_{\Delta^{(k)}} \|DR[f_{\theta^{(k)}}; g]\begin{bmatrix} \partial_{\theta_1} f_{\theta^{(k)}},\dots, \partial_{\theta_p} f_{\theta^{(k)}}\end{bmatrix}\Delta^{(k)} - R[f_{\theta^{(k)}}; g]\|_{p^{in}}.
$
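The resulting iteration can be sketched in a few lines. This is a minimal illustration for the regression loss (so $DR$ is the identity), with a hypothetical two-parameter model $f_\theta(x) = \theta_0 e^{\theta_1 x}$ and synthetic noise-free data; the least-squares subproblem is solved with `numpy.linalg.lstsq`:

```python
import numpy as np

# Gauss-Newton sketch for the regression loss (DR = identity).
# Model and data are illustrative: f_theta(x) = theta0 * exp(theta1 * x).
x = np.linspace(0.0, 1.0, 50)
theta_true = np.array([2.0, -1.5])
g = theta_true[0] * np.exp(theta_true[1] * x)   # noise-free ground truth

def f(theta):
    return theta[0] * np.exp(theta[1] * x)

def jacobian(theta):
    # column i holds ∂_{θ_i} f_θ evaluated at the training points
    return np.stack([np.exp(theta[1] * x),
                     theta[0] * x * np.exp(theta[1] * x)], axis=1)

theta = np.array([1.0, 0.0])
for _ in range(20):
    R = f(theta) - g          # residual R[f_θ; g] on the training points
    J = jacobian(theta)       # DR ∂_θ f_θ  (DR = identity here)
    # solve min_Δ ||J Δ - R|| and update θ^{(k+1)} = θ^{(k)} - Δ^{(k)}
    delta, *_ = np.linalg.lstsq(J, R, rcond=None)
    theta = theta - delta
```

Because the data are noise-free (a zero-residual problem), the iteration behaves like Newton's method near the solution and converges rapidly.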
## Comparison with the natural gradient iteration
It is instructive to compare this with natural gradient descent [[Natural Gradient Descent]], which describes a flow of the form
$
\dot{\theta} = -G_{NG}(\theta)^{-1}\eta^{GD},
$
where $G_{NG}(\theta)_{i, j} = \langle \partial_{\theta_i} f_{\theta}, \partial_{\theta_j} f_{\theta}\rangle_{p^{in}}$ and $\eta^{GD}$ is the steepest descent direction with entries $\eta_i^{GD} = \partial_{\theta_i} E[f_\theta] = \langle R[f_{\theta}; g], DR[f_{\theta}; g]\partial_{\theta_i} f_{\theta}\rangle_{p^{in}}$. Gauss-Newton leads to a very similar update step
$
\dot{\theta} = -G_{GN}(\theta)^{-1}\eta^{GD}
$
with
$
G_{GN}(\theta)_{i, j} = \langle \partial_{\theta_i} f_{\theta}, DR[f_\theta; g]^* DR[f_\theta; g]\partial_{\theta_j} f_{\theta}\rangle_{p^{in}}.
$
The operator $DR[f_\theta; g]^* DR[f_\theta; g]$ modifies the inner product in the Gram matrix. Let us consider two instructive cases:
- Regression Loss: Here, $DR[f_{\theta}; g]$ is just the identity operator and Gauss-Newton is identical to natural gradient descent.
- PINN Loss: Here, $DR[f_{\theta}; g] = \mathcal{L}$ for some linear PDE operator $\mathcal{L}$, and therefore $DR[f_\theta; g]^* DR[f_\theta; g] = \mathcal{L}^*\mathcal{L}$.
In the latter case the inner product defining the Gauss-Newton Gram matrix is weighted by $\mathcal{L}^*\mathcal{L}$, i.e. by the square of the differential operator.
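The relationship between the two Gram matrices can be made concrete in a discretised setting. A minimal sketch, assuming a model linear in its parameters, $f_\theta = \sum_i \theta_i \varphi_i$, with a random matrix as a stand-in for the discretised operator $\mathcal{L}$ (all names are illustrative):

```python
import numpy as np

# Compare G_NG and G_GN for a linear-in-θ model: ∂_{θ_i} f_θ = φ_i,
# and for the PINN residual DR = L, so G_GN[i,j] = <L φ_i, L φ_j>_{p^in}.
rng = np.random.default_rng(2)
N, p = 30, 4
Phi = rng.normal(size=(N, p))    # column i holds φ_i on the N training points
L = rng.normal(size=(N, N))      # stand-in for a discretised PDE operator

# G_NG[i, j] = <φ_i, φ_j>_{p^in}; the 1/N implements the data average
G_NG = Phi.T @ Phi / N
# G_GN[i, j] = <φ_i, L* L φ_j>_{p^in} = <L φ_i, L φ_j>_{p^in}
G_GN = (L @ Phi).T @ (L @ Phi) / N
```

Replacing `L` by the identity matrix makes `G_GN` coincide with `G_NG`, mirroring the regression case above.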
## Comparison with the energy gradient iteration
The energy gradient iteration follows the flow
$
\dot{\theta} = -G_{ED}(\theta)^{-1}\eta^{GD}.
$
Here,
$
[G_{ED}(\theta)]_{i, j} = D^2E[f_{\theta}](\partial_{\theta_i}f_{\theta}, \partial_{\theta_j} f_{\theta})
$
is the energy Gram matrix with $D^2E[f_{\theta}]$ the second functional derivative of $E$. For details see [[Energy Natural Gradient Descent]] .
Let's check how this relates to Gauss-Newton. For this we compute the first and second derivatives of the energy functional $E$. From the expansion
$
E[f_\theta + tu] = E[f_{\theta}] + t\langle R[f_{\theta}; g], DR[f_{\theta}; g] u\rangle + \text{h.o.t.}
$
we read off the first derivative as $DE[f_{\theta}](u) = \langle R[f_{\theta}; g], DR[f_{\theta}; g]u\rangle$. Correspondingly, for the second derivative we have
$
\begin{align}
DE[f_{\theta} + tv](u) &= \langle R[f_{\theta} + tv; g], DR[f_{\theta} + tv; g]u\rangle\nonumber\\
&=\langle R[f_{\theta}; g] + tDR[f_{\theta}; g]v, DR[f_{\theta}; g]u + t(D^2R[f_{\theta};g]u)(v)\rangle + \text{h.o.t.}\nonumber
\end{align}
$
Hence,
$
\begin{align}
D^2E[f_{\theta}](u, v) &= \langle DR[f_{\theta}; g]v, DR[f_{\theta}; g]u\rangle\nonumber\\
&+ \langle R[f_{\theta}; g], (D^2R[f_{\theta}; g]u)(v)\rangle .\nonumber
\end{align}
$
The operator $D^2R[f_{\theta}; g]$ is the Hessian of $R$: it maps a direction $u \in \mathcal{F}$ to the linear operator $D^2R[f_{\theta}; g]u \in L(\mathcal{F})$ on $\mathcal{F}$.
Comparing with Gauss-Newton, it follows that the energy gradient is identical to Gauss-Newton if $D^2R[f_{\theta}; g] = 0$, i.e. if $R[f_{\theta}; g]$ is affine linear in $f_{\theta}$, since then the second term in $D^2E$ vanishes. The canonical example is a PINN with a linear PDE operator.
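This identity can be verified numerically. A minimal sketch, assuming an affine residual $R[f; g] = \mathcal{L}f - g$ and a linear-in-parameters model (all matrices are random stand-ins): the finite-difference Hessian of the discretised loss matches the Gauss-Newton Gram matrix, because the term $\langle R, (D^2R\,u)(v)\rangle$ vanishes.

```python
import numpy as np

# For R[f; g] = L f - g (affine in f) the energy Gram matrix equals the
# Gauss-Newton Gram matrix. Model: f_θ = Σ_i θ_i φ_i on N training points.
rng = np.random.default_rng(3)
N, p = 25, 3
Phi = rng.normal(size=(N, p))    # column i holds φ_i = ∂_{θ_i} f_θ
L = rng.normal(size=(N, N))      # stand-in for a discretised PDE operator
g = rng.normal(size=N)

def E(theta):
    R = L @ (Phi @ theta) - g
    return 0.5 * np.mean(R**2)   # E = (1/2) ||R||_{p^in}^2

theta = rng.normal(size=p)
eps = 1e-4
# symmetric finite-difference Hessian of E with respect to θ
H = np.zeros((p, p))
for i in range(p):
    for j in range(p):
        ei, ej = np.eye(p)[i], np.eye(p)[j]
        H[i, j] = (E(theta + eps*ei + eps*ej) - E(theta + eps*ei - eps*ej)
                   - E(theta - eps*ei + eps*ej) + E(theta - eps*ei - eps*ej)) / (4*eps**2)

# Gauss-Newton Gram matrix G_GN[i, j] = <L φ_i, L φ_j>_{p^in}
G_GN = (L @ Phi).T @ (L @ Phi) / N
```

Since $E$ is exactly quadratic in $\theta$ here, the symmetric second difference has no truncation error and agrees with $G_{GN}$ up to rounding.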