## Preliminaries
We define an MLP as in [[Multilayer Perceptron (MLP)]]. Denote by $\mathcal{F}$ the space of all functions $f:\mathbb{R}^{n_0}\rightarrow \mathbb{R}^{n_{L+1}}$. We can define a seminorm $\|\cdot\|_{p^{in}}$ on $\mathcal{F}$ through the bilinear form
$
\langle f, g\rangle_{p^{in}} = \mathbb{E}_{x\sim p^{in}} \left[f(x)^Tg(x)\right]
$
Here, $p^{in}$ is assumed to be the empirical distribution on a finite dataset $x_1,\dots,x_N$, given through the sum of Dirac measures $\frac{1}{N}\sum_{i=1}^N\delta_{x_i}$. Hence,
$
\langle f, g\rangle_{p^{in}}=\frac{1}{N}\sum_{i=1}^N f(x_i)^Tg(x_i).
$
A function defined through a realisation of the neural network with parameters $\theta$ is denoted by $f_{\theta}$.
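For the empirical distribution above, the bilinear form is just an average over the dataset. A minimal numpy sketch (the functions `f`, `g` and the dataset `X` are illustrative choices, not part of any particular network):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset of N = 5 inputs in R^3 (n_0 = 3).
X = rng.normal(size=(5, 3))

# Two example functions f, g : R^3 -> R^2 (n_{L+1} = 2).
def f(x):
    return np.array([x.sum(), x[0] - x[1]])

def g(x):
    return np.array([x[0], x[2]])

def inner_pin(f, g, X):
    """<f, g>_{p^in} = (1/N) * sum_i f(x_i)^T g(x_i)."""
    return np.mean([f(x) @ g(x) for x in X])
```

Since $p^{in}$ only sees the dataset, $\|f\|_{p^{in}} = 0$ does not force $f = 0$ off the data points, which is why this is a seminorm on $\mathcal{F}$ rather than a norm.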
In addition to the usual definition we assume that the function $C_k$ in each layer takes the form
$
C_k(x) = \frac{1}{\sqrt{n_k}}W_kx + b_k
$
with the entries of $W_k$ and $b_k$ initialised as i.i.d. standard Gaussians $\mathcal{N}(0, 1)$. The cost functional associated with the training of the neural network will be denoted by $\mathcal{C}: \mathcal{F}\rightarrow \mathbb{R}$.
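The layer map and its initialisation can be sketched directly; the function names here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_layer(n_in, n_out):
    """W and b initialised with i.i.d. N(0, 1) entries; the 1/sqrt(n)
    factor lives in the forward map, not in the initialisation."""
    return rng.normal(size=(n_out, n_in)), rng.normal(size=n_out)

def C(W, b, x):
    # C_k(x) = (1/sqrt(n_k)) W_k x + b_k, with n_k the input width.
    n_in = W.shape[1]
    return W @ x / np.sqrt(n_in) + b
```

The $1/\sqrt{n_k}$ factor keeps the pre-activations of order one as the width grows, which is what makes the infinite-width limit considered below well behaved.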
## Kernel Gradient
A *multi-dimensional kernel* $K$ is a function $\mathbb{R}^{n_0}\times \mathbb{R}^{n_0} \rightarrow \mathbb{R}^{n_{L+1}\times n_{L+1}}$, which maps any pair $(x, x')$ to an $n_{L+1}\times n_{L+1}$ matrix such that $K(x, x') = K(x', x)^T$. The kernel defines a bilinear map
$
\langle f, g\rangle_K := \mathbb{E}_{x,x'\sim p^{in}} \left[f(x)^TK(x,x')g(x')\right]
$
Let $\mathcal{F}^*$ be the dual of $\mathcal{F}$, that is, the space of all linear forms $\mu:\mathcal{F}\rightarrow\mathbb{R}$. If $\mathcal{F}$ is a Hilbert space with inner product $\langle \cdot, \cdot\rangle_{p^{in}}$, then by the Riesz representation theorem for each $\mu\in \mathcal{F}^*$ there exists a unique $d\in\mathcal{F}$ such that $\mu = \langle d, \cdot\rangle_{p^{in}}$.
Now consider the $i$th row $K_{i, \cdot}(x, \cdot)$ for fixed $x$ and $i = 1,\dots, n_{L+1}$; as a function of its second argument it is an element of $\mathcal{F}$. For given $\mu$ we can now define a function $f_{\mu}$ through
$
f_{\mu, i}(x) = \mu\left(K_{i,\cdot}(x,\cdot)\right) = \langle d, K_{i,\cdot}(x, \cdot)\rangle_{p^{in}}
$
for the element $d\in\mathcal{F}$ representing the linear form $\mu$. We denote the mapping thus defined by $\Phi_K:\mathcal{F}^*\rightarrow \mathcal{F},\ \mu\mapsto f_\mu$.
Note that from the symmetry of $K$ we have
$
f_{\mu}(x) = \frac{1}{N}\sum_{j=1}^NK(x, x_j)d(x_j).
$
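Represented on the dataset, $\Phi_K$ is just a kernel sum. A sketch with a toy kernel (the kernel choice and dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

N, n_in, n_out = 4, 3, 2
X = rng.normal(size=(N, n_in))

def K(x, xp):
    """Toy symmetric kernel: K(x, x') = (x . x') * I."""
    return (x @ xp) * np.eye(n_out)

# d in F, represented by its values on the dataset.
d = rng.normal(size=(N, n_out))

def f_mu(x):
    # f_mu(x) = (1/N) * sum_j K(x, x_j) d(x_j)
    return np.mean([K(x, X[j]) @ d[j] for j in range(N)], axis=0)
```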
The functional derivative of the cost functional $\mathcal{C}$ at a given point $f_0\in\mathcal{F}$ is an element of $\mathcal{F}^*$ and is denoted by $\partial_f\mathcal{C}|_{f_0}$. Let $d|_{f_0}$ be the corresponding element in $\mathcal{F}$ such that $\partial_{f}\mathcal{C}|_{f_0} = \langle d|_{f_0}, \cdot\rangle_{p^{in}}$.
The *kernel gradient* $\nabla_K \mathcal{C}|_{f_0}\in\mathcal{F}$ is defined as $\Phi_K(\partial_f \mathcal{C}|_{f_0})$. We can write it as
$
\nabla_K \mathcal{C}|_{f_0}(x) = \frac{1}{N}\sum_{j=1}^N K(x, x_j)d|_{f_0}(x_j).
$
A time-dependent function $f(t)$ follows the kernel gradient descent with respect to $K$ if it satisfies the differential equation
$
\partial_tf(t) = -\nabla_K\mathcal{C}|_{f(t)}.
$
The corresponding cost $\mathcal{C}(f(t))$ evolves via chain rule as
$
\partial_t\mathcal{C}[f(t)] = \langle d|_{f(t)}, \partial_tf(t)\rangle_{p^{in}} = -\langle d|_{f(t)}, \nabla_K\mathcal{C}|_{f(t)}\rangle_{p^{in}} = -\|d|_{f(t)}\|_K^2.
$
## Neural Tangent Kernel
We are now ready to define the *neural tangent kernel*. Let $f_{\theta}$ be the function representing the MLP with parameter set $\theta$. For the cost functional $\mathcal{C}$, gradient descent gives
$
\partial_t\theta_p(t) = -\partial_{\theta_p}\left[\mathcal{C}\circ f_{\theta(t)}\right].
$
Here, $\theta_p$ is the $p$th neural network parameter. The function $f_{\theta}$ evolves according to
$
\begin{align}
\partial_t f_{\theta(t)}(x) &= \sum_{p=1}^P\partial_{\theta_p}f_{\theta(t)}(x)\,\partial_t\theta_p\nonumber\\
&= -\sum_{p=1}^P\partial_{\theta_p}f_{\theta(t)}(x)\, \partial_{\theta_p}\left[\mathcal{C}\circ f_{\theta(t)}\right]\nonumber\\
&= -\sum_{p=1}^P\partial_{\theta_p}f_{\theta(t)}(x) \left\langle d|_{f_{\theta(t)}}, \partial_{\theta_p} f_{\theta(t)}\right\rangle_{p^{in}}\nonumber\\
&= -\sum_{p=1}^P\partial_{\theta_p} f_{\theta(t)}(x)\left[\frac{1}{N}\sum_{j=1}^N\partial_{\theta_p}f_{\theta(t)}(x_j)^Td|_{f_{\theta(t)}}(x_j)\right]\nonumber\\
&= -\frac{1}{N}\sum_{j=1}^N\left[\sum_{p=1}^P\partial_{\theta_p}f_{\theta(t)}(x)\otimes\partial_{\theta_p}f_{\theta(t)}(x_j)\right]d|_{f_{\theta(t)}}(x_j)\nonumber\\
&= -\nabla_K\mathcal{C}|_{f_{\theta(t)}}(x)\nonumber,
\end{align}
$
where the kernel
$
K(x, x') = \sum_{p=1}^P\partial_{\theta_p}f_{\theta(t)}(x)\otimes\partial_{\theta_p}f_{\theta(t)}(x')
$
is called the *neural tangent kernel*.
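For a concrete (if inefficient) check, the kernel can be assembled from finite-difference parameter gradients of a small network. Everything here, the two-layer architecture, the tanh nonlinearity, and the widths, is an illustrative choice, and a scalar output is used so that the outer product reduces to an ordinary product:

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny one-hidden-layer MLP with scalar output (n_{L+1} = 1).
n0, n1 = 3, 16
W1, b1 = rng.normal(size=(n1, n0)), rng.normal(size=n1)
W2, b2 = rng.normal(size=(1, n1)), rng.normal(size=1)
theta = np.concatenate([W1.ravel(), b1, W2.ravel(), b2])

def f(theta, x):
    W1 = theta[:n1 * n0].reshape(n1, n0)
    b1 = theta[n1 * n0:n1 * n0 + n1]
    W2 = theta[n1 * n0 + n1:n1 * n0 + 2 * n1].reshape(1, n1)
    b2 = theta[-1:]
    h = np.tanh(W1 @ x / np.sqrt(n0) + b1)
    return (W2 @ h / np.sqrt(n1) + b2)[0]

def grad_theta(theta, x, eps=1e-6):
    """Central-difference approximation of the parameter gradient of f(., x)."""
    g = np.empty_like(theta)
    for p in range(theta.size):
        tp, tm = theta.copy(), theta.copy()
        tp[p] += eps
        tm[p] -= eps
        g[p] = (f(tp, x) - f(tm, x)) / (2 * eps)
    return g

def ntk(theta, x, xp):
    # K(x, x') = sum_p d_{theta_p} f(x) * d_{theta_p} f(x')
    return grad_theta(theta, x) @ grad_theta(theta, xp)
```

By construction the resulting Gram matrix is symmetric and positive semi-definite, as a kernel must be.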
## Regression type cost functionals
Assume that the cost functional is of the form $\mathcal{C}[f_\theta] = \frac{1}{2N}\sum_{j=1}^N \|f_\theta(x_j) - g(x_j)\|_2^2 = \frac{1}{2}\|f_\theta - g\|_{p^{in}}^2$, where $g$ denotes the target function. Then
$
\begin{align}
\mathcal{C}[f_{\theta} + \epsilon h] &= \frac{1}{2N}\sum_{j=1}^N\langle f_{\theta}(x_j) +\epsilon h(x_j) - g(x_j), f_\theta(x_j) + \epsilon h(x_j) - g(x_j)\rangle\nonumber\\
&= \mathcal{C}[f_\theta] + \epsilon \frac{1}{N}\sum_{j=1}^N\langle f_{\theta}(x_j) - g(x_j), h(x_j)\rangle + \epsilon^2\frac{1}{2N}\sum_{j=1}^N\langle h(x_j), h(x_j)\rangle.\nonumber
\end{align}
$
Hence, the functional derivative at $f_\theta$ is
$
\partial_f\mathcal{C}|_{f_\theta}[h] = \langle f_\theta - g, h\rangle_{p^{in}},
$
so that $d|_{f_\theta} = f_\theta - g$.
With the neural tangent kernel $K(x, x')$ we obtain
$
\partial_t f_\theta(x_i) = - \frac{1}{N}\sum_{j=1}^{N}K(x_i, x_j)\left(f_\theta(x_j) - g(x_j)\right), \quad i=1,\dots, N.
$
Let $K_{\theta}$ be the block matrix whose block $[K_\theta]_{i, j}$ is given by $[K_\theta]_{i, j} = \frac{1}{N}K(x_i, x_j)$. The dimension of $K_\theta$ is $N n_{L+1}\times N n_{L+1}$. Further, let $f_\theta := \left[ f_\theta(x_1), \dots, f_\theta(x_N)\right]^T$ and correspondingly $g := \left[g(x_1),\dots, g(x_N)\right]^T$. The above equation can now be written as the system
$
\partial_t f_\theta = -K_{\theta}f_{\theta} + K_{\theta}g = -K_{\theta}\left(f_{\theta} - g\right).
$
This is a linear ODE in $f_\theta$ which describes how the network outputs on the training set change under gradient descent with the cost functional $\mathcal{C}$. In the special case $n_{L+1} = 1$ the matrix $K_{\theta}$ takes a particularly simple form, namely
$
[K_\theta]_{i, j} = \frac{1}{N} \sum_{p=1}^P \partial_{\theta_p}f_{\theta}(x_i) \partial_{\theta_p}f_{\theta}(x_j) = \frac{1}{N} \partial_{\theta}f_{\theta}(x_i) \cdot \partial_{\theta}f_{\theta}(x_j).
$
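If $K_\theta$ is (approximately) constant, this linear ODE can be integrated numerically; a small explicit-Euler sketch, with a toy positive-definite matrix standing in for the NTK Gram matrix:

```python
import numpy as np

rng = np.random.default_rng(0)

N = 6
A = rng.normal(size=(N, N))
# Toy stand-in for the NTK Gram matrix K_theta; the identity shift makes it
# safely positive definite for this demonstration.
K_theta = A @ A.T / N + 0.5 * np.eye(N)
g = rng.normal(size=N)   # targets g(x_1), ..., g(x_N)
f = rng.normal(size=N)   # network outputs at initialisation

dt = 0.01
for _ in range(5000):
    f = f + dt * (-K_theta @ (f - g))   # Euler step of d/dt f = -K_theta (f - g)
```

Since $K_\theta$ is positive definite here, `f` converges to the targets `g`; the step size must satisfy $\Delta t < 2/\lambda_{\max}(K_\theta)$ for the explicit scheme to be stable.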
## Behaviour under training
The neural tangent kernel depends on $\theta$ and therefore changes under training. However, in [[#^3f4379|Jacot et al.]] it is shown that
- In the asymptotic infinite width limit the neural tangent kernel converges to a deterministic kernel.
- The neural tangent kernel asymptotically stays constant during training for sufficiently large width.
## Early stopping of training
The neural tangent kernel can explain why early stopping in neural network training is useful. Assuming that the neural tangent kernel is asymptotically constant, $K \approx K_\theta$, the dynamics of the ODE $\partial_t f_\theta = -K(f_\theta - g)$ are governed by the eigenvalues of $K$. Error components along eigenvectors with large eigenvalues decay quickly, while components along eigenvectors with small eigenvalues, which are typically associated with oscillatory noise, decay slowly. Hence, we can usually stop training well before the slow modes have converged.
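This can be made precise by diagonalising. Writing $K = \sum_i \lambda_i v_iv_i^T$ with orthonormal eigenvectors $v_i$ and eigenvalues $\lambda_i \geq 0$, the solution of the linear ODE $\partial_t f_\theta = -K(f_\theta - g)$ is
$
f_\theta(t) - g = e^{-Kt}\left(f_\theta(0) - g\right) = \sum_i e^{-\lambda_i t}\, v_iv_i^T\left(f_\theta(0) - g\right),
$
so the error component along $v_i$ decays at rate $\lambda_i$: large-eigenvalue modes vanish quickly while small-eigenvalue modes persist.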
## References
[1]
A. Jacot, F. Gabriel, and C. Hongler, ‘Neural Tangent Kernel: Convergence and Generalization in Neural Networks’, _arXiv_, 2018, doi: [10.48550/arxiv.1806.07572](https://doi.org/10.48550/arxiv.1806.07572). ^3f4379
[2]
S. Wang, X. Yu, and P. Perdikaris, ‘When and why PINNs fail to train: A neural tangent kernel perspective’, _Journal of Computational Physics_, vol. 449, p. 110768, 2022, doi: [10.1016/j.jcp.2021.110768](https://doi.org/10.1016/j.jcp.2021.110768).