We consider a neural network $f_\theta\in\mathcal{F} := \{f: \mathbb{R}^m\rightarrow\mathbb{R}^n\}$, depending on the parameters $\theta\in\mathbb{R}^p$, with a generalised regression loss function $E[f_\theta]$
$
E[f_\theta] = \frac{1}{2N}\sum_{i=1}^N\|R[f_\theta; g](x_i)\|^2 =: \frac{1}{2}\|R[f_\theta; g]\|_{p^{in}}^2.
$
Here, $\|\cdot\|$ is a norm suitable for the problem, and $\|\cdot\|_{p^{in}}$ is the average norm over the training data as defined in [[Neural Tangent Kernel]]. $R[\,\cdot\,; g]: \mathcal{F}\rightarrow \mathcal{F}$ is a possibly nonlinear map on $\mathcal{F}$, $g$ is a parameter describing the ground truth values, and the $x_i\in\mathbb{R}^m$ are the training inputs. Examples are:
- Regression Loss: Here, $R[f_\theta; g] = f_\theta - g$.
- PINN Loss (excluding boundary term): $R[f_\theta; g] = \mathcal{L}f_{\theta} - g$ for some linear partial differential operator $\mathcal{L}$.
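The two residual maps can be sketched on a 1D grid; this is a minimal illustration with hypothetical names, using a central-difference discretisation of $\mathcal{L} = -\mathrm{d}^2/\mathrm{d}x^2$ as a stand-in PDE operator:

```python
import numpy as np

# Illustrative 1D setup: N grid points on [0, 1], a candidate function f_theta
# and ground-truth values g sampled on the grid. All names are hypothetical.
N = 64
x = np.linspace(0.0, 1.0, N)
g = np.sin(np.pi * x)            # ground-truth values g(x_i)
f = np.sin(np.pi * x) + 0.1 * x  # some candidate f_theta on the grid

def regression_residual(f, g):
    """Regression loss: R[f; g] = f - g."""
    return f - g

def pinn_residual(f, g, x):
    """PINN loss: R[f; g] = L f - g with L = -d^2/dx^2 (central differences)."""
    h = x[1] - x[0]
    Lf = np.zeros_like(f)
    Lf[1:-1] = -(f[2:] - 2.0 * f[1:-1] + f[:-2]) / h**2
    return Lf - g

def loss(R):
    """Discrete analogue of E = (1/2) ||R||_{p^in}^2 (mean over grid points)."""
    return 0.5 * np.mean(R**2)
```

For the regression residual the loss vanishes exactly when $f$ matches $g$ on the training points; for the PINN residual it vanishes when the discretised PDE is satisfied.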
Training the neural network means finding a minimiser $\hat{\theta}$ of the minimisation problem
$
\min_{\theta} E[f_\theta].
$
Before we write down the Gauss-Newton iteration, let us first clarify the functional derivative of $R[f_{\theta}; g]$.
## Derivative of $R[f_{\theta}; g]$
We have
$
R[f_\theta + t h; g] = R[f_{\theta}; g] + tDR[f_{\theta}; g]h + \text{h.o.t.},
$
where the expansion is in the scalar variable $t$. Hence, the functional derivative $DR[f_{\theta}; g]$ is a linear operator that maps $\mathcal{F}$ into $\mathcal{F}$. The derivative of $R$ with respect to a single parameter $\theta_i$ can then be written via the chain rule as
$
\partial_{\theta_i} R[f_{\theta}; g] = DR[f_{\theta}; g]\partial_{\theta_i}f_{\theta}.
$
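This chain rule can be checked numerically. A minimal sketch, assuming a model that is linear in its parameters, $f_\theta = \sum_i \theta_i \varphi_i$ (so $\partial_{\theta_i} f_\theta = \varphi_i$), and a PINN-type residual $R[f; g] = \mathcal{L}f - g$ with a random matrix standing in for the discretised operator (so $DR = \mathcal{L}$):

```python
import numpy as np

# Check ∂_{θ_i} R[f_θ; g] = DR[f_θ; g] ∂_{θ_i} f_θ for R[f; g] = L f - g.
# L, g and the features are random stand-ins; all names are illustrative.
rng = np.random.default_rng(0)
N, p = 20, 3
L = rng.normal(size=(N, N))          # stand-in for a discretised PDE operator
g = rng.normal(size=N)
features = rng.normal(size=(p, N))   # row i holds φ_i = ∂_{θ_i} f_θ on the grid

def f_of(theta):
    return features.T @ theta        # f_θ = Σ_i θ_i φ_i

def R_of(theta):
    return L @ f_of(theta) - g       # R[f_θ; g] = L f_θ - g

theta = rng.normal(size=p)
i, eps = 1, 1e-6
e_i = np.eye(p)[i]
# central finite difference of R with respect to θ_i ...
fd = (R_of(theta + eps * e_i) - R_of(theta - eps * e_i)) / (2 * eps)
# ... versus the chain rule DR[f_θ; g] ∂_{θ_i} f_θ = L φ_i
chain = L @ features[i]
```

Since $R$ is linear in $\theta$ here, the finite difference agrees with the chain-rule expression up to rounding error.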
## Gauss-Newton iteration
We follow the derivation of Gauss-Newton in [[Gauss-Newton Method]] and start with the necessary condition at a minimum $\hat{\theta}$.
$
\partial_{\theta_i} E[f_{\hat{\theta}}] = 0;~i = 1, \dots, p.
$
We now use the chain rule once, noting that the functional derivative of $\frac{1}{2}\|f\|_{p^{in}}^2$ in the direction $h$ is given by $\langle f, h\rangle_{p^{in}}$:
$
\begin{align}
0 &= \langle R[f_{\hat{\theta}}; g], \partial_{\theta_i} R[f_{\hat{\theta}}; g]\rangle_{p^{in}}\nonumber\\
&= \langle R[f_{\hat{\theta}}; g], DR[f_{\hat{\theta}}; g]\partial_{\theta_i}f_{\hat{\theta}}\rangle_{p^{in}}, \quad i=1,\dots, p.\nonumber
\end{align}
$
Let $\theta^{(k)}$ be our current iterate and let $\Delta^{(k)} = \theta^{(k)} - \theta^{(k+1)}$. Gauss-Newton uses the following linearised approximation
$
0 = \langle R[f_{\theta^{(k)}}; g] - \sum_{i=1}^p \Delta_i^{(k)}DR[f_{\theta^{(k)}}; g]\partial_{\theta_i}f_{\theta^{(k)}}, DR[f_{\theta^{(k)}}; g]\partial_{\theta_j}f_{\theta^{(k)}}\rangle_{p^{in}}, \quad j = 1, \dots, p.
$
This is the normal equation for the linear least-squares problem
$
\min_{\Delta^{(k)}} \|DR[f_{\theta^{(k)}}; g]\begin{bmatrix} \partial_{\theta_1} f_{\theta^{(k)}},\dots, \partial_{\theta_p} f_{\theta^{(k)}}\end{bmatrix}\Delta^{(k)} - R[f_{\theta^{(k)}}; g]\|_{p^{in}}.
$
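The resulting iteration can be sketched in a few lines. This is a minimal illustration for the regression loss (so $DR$ is the identity), with a hypothetical two-parameter model $f_\theta(x) = \theta_0 e^{\theta_1 x}$ and synthetic noise-free data; the least-squares subproblem is solved with `numpy.linalg.lstsq`:

```python
import numpy as np

# Gauss-Newton sketch for the regression loss (DR = identity).
# Model and data are illustrative: f_theta(x) = theta0 * exp(theta1 * x).
x = np.linspace(0.0, 1.0, 50)
theta_true = np.array([2.0, -1.5])
g = theta_true[0] * np.exp(theta_true[1] * x)   # noise-free ground truth

def f(theta):
    return theta[0] * np.exp(theta[1] * x)

def jacobian(theta):
    # column i holds ∂_{θ_i} f_θ evaluated at the training points
    return np.stack([np.exp(theta[1] * x),
                     theta[0] * x * np.exp(theta[1] * x)], axis=1)

theta = np.array([1.0, 0.0])
for _ in range(20):
    R = f(theta) - g          # residual R[f_θ; g] on the training points
    J = jacobian(theta)       # DR ∂_θ f_θ  (DR = identity here)
    # solve min_Δ ||J Δ - R|| and update θ^{(k+1)} = θ^{(k)} - Δ^{(k)}
    delta, *_ = np.linalg.lstsq(J, R, rcond=None)
    theta = theta - delta
```

Because the data are noise-free (a zero-residual problem), the iteration behaves like Newton's method near the solution and converges rapidly.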
## Comparison with the natural gradient iteration
It is instructive to compare this with natural gradient descent [[Natural Gradient Descent]], which describes a flow of the form
$
\dot{\theta} = -G_{NG}(\theta)^{-1}\eta^{GD},
$
where $G_{NG}(\theta)_{i, j} = \langle \partial_{\theta_i} f_{\theta}, \partial_{\theta_j} f_{\theta}\rangle_{p^{in}}$ and $\eta^{GD}$ is the steepest descent direction with entries $\eta_i^{GD} = \partial_{\theta_i} E[f_\theta] = \langle R[f_{\theta}; g], DR[f_{\theta}; g]\partial_{\theta_i} f_{\theta}\rangle_{p^{in}}$. Gauss-Newton leads to a very similar update step
$
\dot{\theta} = -G_{GN}(\theta)^{-1}\eta^{GD}
$
with
$
G_{GN}(\theta)_{i, j} = \langle \partial_{\theta_i} f_{\theta}, DR[f_\theta; g]^* DR[f_\theta; g]\partial_{\theta_j} f_{\theta}\rangle_{p^{in}}.
$
The operator $DR[f_\theta; g]^* DR[f_\theta; g]$ modifies the inner product in the Gram matrix. Let us consider two instructive cases:
- Regression Loss: Here, $DR[f_{\theta}; g]$ is just the identity operator and Gauss-Newton is identical to natural gradient descent.
- PINN Loss: Here, $DR[f_{\theta}; g] = \mathcal{L}$ for some linear PDE operator $\mathcal{L}$, and therefore $DR[f_\theta; g]^* DR[f_\theta; g] = \mathcal{L}^*\mathcal{L}$.
In the latter case the inner product defining the Gauss-Newton Gram matrix is weighted by $\mathcal{L}^*\mathcal{L}$, i.e. by the square of the differential operator.
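The relationship between the two Gram matrices can be made concrete in a discretised setting. A minimal sketch, assuming a model linear in its parameters, $f_\theta = \sum_i \theta_i \varphi_i$, with a random matrix as a stand-in for the discretised operator $\mathcal{L}$ (all names are illustrative):

```python
import numpy as np

# Compare G_NG and G_GN for a linear-in-θ model: ∂_{θ_i} f_θ = φ_i,
# and for the PINN residual DR = L, so G_GN[i,j] = <L φ_i, L φ_j>_{p^in}.
rng = np.random.default_rng(2)
N, p = 30, 4
Phi = rng.normal(size=(N, p))    # column i holds φ_i on the N training points
L = rng.normal(size=(N, N))      # stand-in for a discretised PDE operator

# G_NG[i, j] = <φ_i, φ_j>_{p^in}; the 1/N implements the data average
G_NG = Phi.T @ Phi / N
# G_GN[i, j] = <φ_i, L* L φ_j>_{p^in} = <L φ_i, L φ_j>_{p^in}
G_GN = (L @ Phi).T @ (L @ Phi) / N
```

Replacing `L` by the identity matrix makes `G_GN` coincide with `G_NG`, mirroring the regression case above.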
## Comparison with the energy gradient iteration
The energy gradient iteration follows the flow
$
\dot{\theta} = -G_{ED}(\theta)^{-1}\eta^{GD}.
$
Here,
$
[G_{ED}(\theta)]_{i, j} = D^2E[f_{\theta}](\partial_{\theta_i}f_{\theta}, \partial_{\theta_j} f_{\theta})
$
is the energy Gram matrix with $D^2E[f_{\theta}]$ the second functional derivative of $E$. For details see [[Energy Natural Gradient Descent]] .
Let's check how this relates to Gauss-Newton. For this we compute the first and second derivatives of the energy functional $E$. From the expansion
$
E[f_\theta + tu] = E[f_{\theta}] + t\langle R[f_{\theta}; g], DR[f_{\theta}; g] u\rangle + \text{h.o.t.}
$
we read off the first derivative as $DE[f_{\theta}](u) = \langle R[f_{\theta}; g], DR[f_{\theta}; g]u\rangle$. Correspondingly, for the second derivative we have
$
\begin{align}
DE[f_{\theta} + tv](u) &= \langle R[f_{\theta} + tv; g], DR[f_{\theta} + tv; g]u\rangle\nonumber\\
&=\langle R[f_{\theta}; g] + tDR[f_{\theta}; g]v, DR[f_{\theta}; g]u + t(D^2R[f_{\theta};g]u)(v)\rangle + \text{h.o.t.}\nonumber
\end{align}
$
Hence,
$
\begin{align}
D^2E[f_{\theta}](u, v) &= \langle DR[f_{\theta}; g]v, DR[f_{\theta}; g]u\rangle\nonumber\\
&+ \langle R[f_{\theta}; g], (D^2R[f_{\theta}; g]u)(v)\rangle .\nonumber
\end{align}
$
The operator $D^2R[f_{\theta}; g]$ is the Hessian of $R$: it maps a direction $u \in \mathcal{F}$ to the linear operator $D^2R[f_{\theta}; g]u \in L(\mathcal{F})$ on $\mathcal{F}$.
Comparing with Gauss-Newton, it follows that the energy gradient is identical to Gauss-Newton if $D^2R[f_{\theta}; g] = 0$, i.e. if $R[f_{\theta}; g]$ is affine linear in $f_{\theta}$, since then the second term in $D^2E$ vanishes. The canonical example is a PINN with a linear PDE operator.
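This identity can be verified numerically. A minimal sketch, assuming an affine residual $R[f; g] = \mathcal{L}f - g$ and a linear-in-parameters model (all matrices are random stand-ins): the finite-difference Hessian of the discretised loss matches the Gauss-Newton Gram matrix, because the term $\langle R, (D^2R\,u)(v)\rangle$ vanishes.

```python
import numpy as np

# For R[f; g] = L f - g (affine in f) the energy Gram matrix equals the
# Gauss-Newton Gram matrix. Model: f_θ = Σ_i θ_i φ_i on N training points.
rng = np.random.default_rng(3)
N, p = 25, 3
Phi = rng.normal(size=(N, p))    # column i holds φ_i = ∂_{θ_i} f_θ
L = rng.normal(size=(N, N))      # stand-in for a discretised PDE operator
g = rng.normal(size=N)

def E(theta):
    R = L @ (Phi @ theta) - g
    return 0.5 * np.mean(R**2)   # E = (1/2) ||R||_{p^in}^2

theta = rng.normal(size=p)
eps = 1e-4
# symmetric finite-difference Hessian of E with respect to θ
H = np.zeros((p, p))
for i in range(p):
    for j in range(p):
        ei, ej = np.eye(p)[i], np.eye(p)[j]
        H[i, j] = (E(theta + eps*ei + eps*ej) - E(theta + eps*ei - eps*ej)
                   - E(theta - eps*ei + eps*ej) + E(theta - eps*ei - eps*ej)) / (4*eps**2)

# Gauss-Newton Gram matrix G_GN[i, j] = <L φ_i, L φ_j>_{p^in}
G_GN = (L @ Phi).T @ (L @ Phi) / N
```

Since $E$ is exactly quadratic in $\theta$ here, the symmetric second difference has no truncation error and agrees with $G_{GN}$ up to rounding.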