Neural tangent kernel

In the study of artificial neural networks (ANNs), the neural tangent kernel (NTK) is a kernel that describes the evolution of deep artificial neural networks during their training by gradient descent. It allows ANNs to be studied using kernel-method algorithms such as support vector machines.

For most neural network architectures, the NTK becomes constant in the limit of large layer width. This allows simple closed-form statements to be made about neural network predictions, training dynamics, generalization, and loss surfaces. For example, it guarantees that sufficiently wide ANNs converge to a global minimum when trained to minimize an empirical loss. The NTK of large-width networks is also related to several other large-width limits of neural networks.

The NTK was introduced in 2018 by Arthur Jacot, Franck Gabriel and Clément Hongler.^[1] It was also implicit in some contemporaneous work.^[2]^[3]^[4]

Definition

Scalar output case

An ANN with scalar output consists of a family of functions $f\left(\cdot ;\theta \right):\mathbb {R} ^{n_{\mathrm {in} }}\to \mathbb {R}$ parametrized by a vector of parameters $\theta \in \mathbb {R} ^{P}$.

The NTK is a kernel $\Theta :\mathbb {R} ^{n_{\mathrm {in} }}\times \mathbb {R} ^{n_{\mathrm {in} }}\to \mathbb {R}$ defined by

$\Theta \left(x,y;\theta \right)=\sum _{p=1}^{P}\partial _{\theta _{p}}f\left(x;\theta \right)\partial _{\theta _{p}}f\left(y;\theta \right).$

In the language of kernel methods such as SVMs, the NTK $\Theta$ is the kernel associated with the feature map $\left(x\mapsto \partial _{\theta _{p}}f\left(x;\theta \right)\right)_{p=1,\ldots ,P}$.
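The definition can be made concrete with a minimal sketch. The one-hidden-layer tanh network, its parameter values, and the function names `f`, `grad_f` and `ntk` below are assumptions chosen for illustration and do not come from the cited papers. The sketch builds the feature map of parameter derivatives by hand and evaluates the NTK as their inner product.

```python
import math

def f(x, W, a):
    """Toy network f(x; theta) = sum_j a[j] * tanh(W[j] . x), with theta = (W, a)."""
    return sum(a_j * math.tanh(sum(w * xi for w, xi in zip(w_j, x)))
               for w_j, a_j in zip(W, a))

def grad_f(x, W, a):
    """Feature map x -> (d f / d theta_p)_p: entries for W (row by row), then for a."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(w_j, x))) for w_j in W]
    grad_W = [a_j * (1.0 - h * h) * xi for a_j, h in zip(a, hidden) for xi in x]
    grad_a = hidden  # d f / d a[j] = tanh(W[j] . x)
    return grad_W + grad_a

def ntk(x, y, W, a):
    """Theta(x, y; theta) = sum_p d_p f(x; theta) * d_p f(y; theta)."""
    return sum(p * q for p, q in zip(grad_f(x, W, a), grad_f(y, W, a)))

# Hypothetical parameter values and inputs, chosen only for illustration.
W = [[0.5, -0.3], [0.1, 0.8], [-0.7, 0.2]]
a = [1.0, -0.5, 0.25]
x, y = [1.0, 2.0], [-0.5, 0.3]
print(ntk(x, y, W, a), ntk(x, x, W, a))
```

Because the kernel is an inner product of feature vectors, it is symmetric in its two arguments and non-negative on the diagonal, which is what lets it be used like any other kernel.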

Vector output case

An ANN with vector output of size $n_{\mathrm {out} }$ consists of a family of functions $f\left(\cdot ;\theta \right):\mathbb {R} ^{n_{\mathrm {in} }}\to \mathbb {R} ^{n_{\mathrm {out} }}$ parametrized by a vector of parameters $\theta \in \mathbb {R} ^{P}$.

In this case, the NTK $\Theta :\mathbb {R} ^{n_{\mathrm {in} }}\times \mathbb {R} ^{n_{\mathrm {in} }}\to {\mathcal {M}}_{n_{\mathrm {out} }}\left(\mathbb {R} \right)$ is a matrix-valued kernel, taking values in the space of $n_{\mathrm {out} }\times n_{\mathrm {out} }$ matrices, defined by

$\Theta _{k,l}\left(x,y;\theta \right)=\sum _{p=1}^{P}\partial _{\theta _{p}}f_{k}\left(x;\theta \right)\partial _{\theta _{p}}f_{l}\left(y;\theta \right).$
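As a rough illustration of the matrix-valued case, the sketch below approximates the Jacobian of a vector-output model by central finite differences and contracts it over the parameter index. The helper names `jacobian`, `ntk_matrix` and `linear_model`, and the choice of a linear model $f(x;W)=Wx$, are assumptions made for this example. For that model the sum over parameters works out to $\Theta _{k,l}(x,y)=\delta _{kl}\,x\cdot y$, which gives a simple check.

```python
def jacobian(f, x, theta, eps=1e-6):
    """Central-difference estimate of J[k][p] = d f_k(x; theta) / d theta_p."""
    n_out = len(f(x, theta))
    J = [[0.0] * len(theta) for _ in range(n_out)]
    for p in range(len(theta)):
        up = list(theta)
        up[p] += eps
        dn = list(theta)
        dn[p] -= eps
        f_up, f_dn = f(x, up), f(x, dn)
        for k in range(n_out):
            J[k][p] = (f_up[k] - f_dn[k]) / (2 * eps)
    return J

def ntk_matrix(f, x, y, theta):
    """Theta_{k,l}(x, y; theta) = sum_p d_p f_k(x; theta) * d_p f_l(y; theta)."""
    Jx, Jy = jacobian(f, x, theta), jacobian(f, y, theta)
    return [[sum(u * v for u, v in zip(row_k, row_l)) for row_l in Jy] for row_k in Jx]

def linear_model(x, theta):
    """Hypothetical model f(x; W) = W x, with W stored row by row in theta."""
    n_in = len(x)
    n_out = len(theta) // n_in
    return [sum(theta[k * n_in + d] * x[d] for d in range(n_in)) for k in range(n_out)]

# Illustrative values: n_in = 2, n_out = 3, so theta has 6 entries.
theta = [0.2, -0.1, 0.4, 0.3, -0.5, 0.7]
x, y = [1.0, 2.0], [3.0, -1.0]
K = ntk_matrix(linear_model, x, y, theta)
for row in K:
    print(row)
```

Finite differences are used only to keep the sketch free of dependencies; an automatic-differentiation library would normally be preferred for real networks.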

Derivation

When the parameters $\theta \in \mathbb {R} ^{P}$ of an ANN are optimized to minimize an empirical loss through gradient descent, the NTK governs the dynamics of the ANN output function $f_{\theta }$ throughout training.

Scalar output case

For a dataset $\left(x_{i}\right)_{i=1,\ldots ,n}\subset \mathbb {R} ^{n_{\mathrm {in} }}$ with scalar labels $\left(z_{i}\right)_{i=1,\ldots ,n}\subset \mathbb {R}$ and a loss function $c:\mathbb {R} \times \mathbb {R} \to \mathbb {R}$, the associated empirical loss, defined on functions $f:\mathbb {R} ^{n_{\mathrm {in} }}\to \mathbb {R}$, is given by

${\mathcal {C}}\left(f\right)=\sum _{i=1}^{n}c\left(f\left(x_{i}\right),z_{i}\right).$

When the ANN $f\left(\cdot ;\theta \right):\mathbb {R} ^{n_{\mathrm {in} }}\to \mathbb {R}$ is trained to fit the dataset (i.e. to minimize ${\mathcal {C}}$) via continuous-time gradient descent, the parameters $\left(\theta \left(t\right)\right)_{t\geq 0}$ evolve through the ordinary differential equation:

$\partial _{t}\theta \left(t\right)=-\nabla {\mathcal {C}}\left(f\left(\cdot ;\theta \left(t\right)\right)\right).$

During training, the ANN output function follows an evolution differential equation given in terms of the NTK:

$\partial _{t}f\left(x;\theta \left(t\right)\right)=-\sum _{i=1}^{n}\Theta \left(x,x_{i};\theta \left(t\right)\right)\partial _{w}c\left(w,z_{i}\right){\Big |}_{w=f\left(x_{i};\theta \left(t\right)\right)}.$

This equation shows how the NTK drives the dynamics of $f\left(\cdot ;\theta \left(t\right)\right)$ in the space of functions $\mathbb {R} ^{n_{\mathrm {in} }}\to \mathbb {R}$ during training.
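The step from the parameter ODE to the function-space equation is an application of the chain rule. The derivation below is a sketch using only the definitions given above, with the gradient $\nabla$ taken with respect to $\theta$.

```latex
\begin{align*}
\partial_t f\left(x;\theta(t)\right)
  &= \sum_{p=1}^{P} \partial_{\theta_p} f\left(x;\theta(t)\right)\,\partial_t \theta_p(t)
  && \text{(chain rule)} \\
\partial_t \theta_p(t)
  &= -\sum_{i=1}^{n} \partial_{\theta_p} f\left(x_i;\theta(t)\right)\,
     \partial_w c\left(w,z_i\right)\Big|_{w=f\left(x_i;\theta(t)\right)}
  && \text{(gradient of } \mathcal{C}\text{)} \\
\partial_t f\left(x;\theta(t)\right)
  &= -\sum_{i=1}^{n} \left[\sum_{p=1}^{P} \partial_{\theta_p} f\left(x;\theta(t)\right)\,
     \partial_{\theta_p} f\left(x_i;\theta(t)\right)\right]
     \partial_w c\left(w,z_i\right)\Big|_{w=f\left(x_i;\theta(t)\right)} \\
  &= -\sum_{i=1}^{n} \Theta\left(x,x_i;\theta(t)\right)\,
     \partial_w c\left(w,z_i\right)\Big|_{w=f\left(x_i;\theta(t)\right)}.
\end{align*}
```

The bracketed sum over $p$ in the third line is exactly the definition of the NTK, which is why the kernel appears in the evolution equation.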

Vector output case

For a dataset $\left(x_{i}\right)_{i=1,\ldots ,n}\subset \mathbb {R} ^{n_{\mathrm {in} }}$ with vector labels $\left(z_{i}\right)_{i=1,\ldots ,n}\subset \mathbb {R} ^{n_{\mathrm {out} }}$ and a loss function $c:\mathbb {R} ^{n_{\mathrm {out} }}\times \mathbb {R} ^{n_{\mathrm {out} }}\to \mathbb {R}$, the corresponding empirical loss on functions $f:\mathbb {R} ^{n_{\mathrm {in} }}\to \mathbb {R} ^{n_{\mathrm {out} }}$ is defined by:

${\mathcal {C}}\left(f\right)=\sum _{i=1}^{n}c\left(f\left(x_{i}\right),z_{i}\right).$

Training $f_{\theta \left(t\right)}$ through continuous-time gradient descent yields the following evolution in function space, driven by the NTK:

$\partial _{t}f_{k}\left(x;\theta \left(t\right)\right)=-\sum _{i=1}^{n}\sum _{l=1}^{n_{\mathrm {out} }}\Theta _{k,l}\left(x,x_{i};\theta \left(t\right)\right)\partial _{w_{l}}c\left(\left(w_{1},\ldots ,w_{n_{\mathrm {out} }}\right),z_{i}\right){\Big |}_{w=f\left(x_{i};\theta \left(t\right)\right)}.$
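The vector-output equation can be checked numerically on a small case. The sketch below assumes a linear model $f(x;W)=Wx$ and the squared loss $c(w,z)={\tfrac {1}{2}}\|w-z\|^{2}$, neither of which is prescribed by the article, and all function names are hypothetical. For this model $\Theta _{k,l}(x,y)=\delta _{kl}\,x\cdot y$, and because $f$ is linear in $W$, a single small gradient step changes the output by the step size times the right-hand side of the equation, up to rounding.

```python
def predict(W, x):
    """Linear model f(x; W) = W x with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def ntk_linear(x, y, n_out):
    """NTK of the linear model: Theta_{k,l}(x, y) = (x . y) if k == l else 0."""
    dot = sum(u * v for u, v in zip(x, y))
    return [[dot if k == l else 0.0 for l in range(n_out)] for k in range(n_out)]

def gd_step(W, xs, zs, eps):
    """One gradient step on C(W) = sum_i 0.5 * ||W x_i - z_i||^2."""
    new_W = [row[:] for row in W]
    for x_i, z_i in zip(xs, zs):
        out = predict(W, x_i)
        for k in range(len(W)):
            for d in range(len(x_i)):
                new_W[k][d] -= eps * (out[k] - z_i[k]) * x_i[d]
    return new_W

def ntk_rate(W, x, xs, zs):
    """Right-hand side of the evolution equation at input x, one entry per output k."""
    n_out = len(W)
    rate = [0.0] * n_out
    for x_i, z_i in zip(xs, zs):
        K = ntk_linear(x, x_i, n_out)
        grad_c = [o - z for o, z in zip(predict(W, x_i), z_i)]  # d c / d w_l
        for k in range(n_out):
            rate[k] -= sum(K[k][l] * grad_c[l] for l in range(n_out))
    return rate

# Illustrative data: two training points, two outputs, one query input.
W = [[0.2, -0.1], [0.4, 0.3]]
xs, zs = [[1.0, 0.0], [0.5, -1.0]], [[1.0, 0.0], [0.0, 1.0]]
x_query, eps = [0.3, 0.7], 1e-3

new_W = gd_step(W, xs, zs, eps)
change = [b - a for a, b in zip(predict(W, x_query), predict(new_W, x_query))]
predicted = [eps * r for r in ntk_rate(W, x_query, xs, zs)]
print(change, predicted)
```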

Interpretation

The NTK $\Theta \left(x,x_{i};\theta \right)$ represents the influence of the loss gradient $\partial _{w}c\left(w,z_{i}\right){\big |}_{w=f\left(x_{i};\theta \right)}$ with respect to example $i$ on the evolution of the ANN output $f\left(x;\theta \right)$ through one gradient descent step. In the scalar case, this reads:

$f\left(x;\theta \left(t+\epsilon \right)\right)-f\left(x;\theta \left(t\right)\right)\approx -\epsilon \sum _{i=1}^{n}\Theta \left(x,x_{i};\theta \left(t\right)\right)\partial _{w}c\left(w,z_{i}\right){\big |}_{w=f\left(x_{i};\theta \left(t\right)\right)}.$

In particular, each data point $x_{i}$ influences the evolution of the output $f\left(x;\theta \right)$ for every input $x$ throughout training, in a way that is captured by the NTK $\Theta \left(x,x_{i};\theta \right)$.
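The one-step reading can be tested on a nonlinear toy model. The sketch below assumes a two-parameter network $f(x;\theta )=\theta _{1}\tanh(\theta _{0}x)$ and the squared loss $c(w,z)={\tfrac {1}{2}}(w-z)^{2}$, so that $\partial _{w}c=w-z$; the model, the data and the function names are all hypothetical. It compares the actual change in the output after one gradient step of size $\epsilon$ with the first-order NTK prediction, which should agree up to terms of order $\epsilon ^{2}$.

```python
import math

def f(x, theta):
    """Toy scalar network f(x; theta) = theta[1] * tanh(theta[0] * x)."""
    return theta[1] * math.tanh(theta[0] * x)

def grad_f(x, theta):
    """Derivatives of f with respect to theta[0] and theta[1]."""
    h = math.tanh(theta[0] * x)
    return [theta[1] * (1.0 - h * h) * x, h]

def ntk(x, y, theta):
    return sum(p * q for p, q in zip(grad_f(x, theta), grad_f(y, theta)))

def gd_step(theta, xs, zs, eps):
    """One gradient step on C(theta) = sum_i 0.5 * (f(x_i; theta) - z_i)^2."""
    grad_C = [0.0] * len(theta)
    for x_i, z_i in zip(xs, zs):
        residual = f(x_i, theta) - z_i
        for p, g in enumerate(grad_f(x_i, theta)):
            grad_C[p] += residual * g
    return [t - eps * g for t, g in zip(theta, grad_C)]

def ntk_prediction(x, theta, xs, zs, eps):
    """First-order change -eps * sum_i Theta(x, x_i) * (f(x_i) - z_i)."""
    return -eps * sum(ntk(x, x_i, theta) * (f(x_i, theta) - z_i)
                      for x_i, z_i in zip(xs, zs))

# Illustrative parameters, training data, query point and step size.
theta = [0.7, -1.2]
xs, zs = [-1.0, 0.5, 2.0], [0.3, -0.2, 1.0]
x_query, eps = 1.3, 1e-4

actual = f(x_query, gd_step(theta, xs, zs, eps)) - f(x_query, theta)
predicted = ntk_prediction(x_query, theta, xs, zs, eps)
print(actual, predicted)
```

Shrinking `eps` should make the two numbers agree more closely, since the discrepancy comes from the curvature of `f` in its parameters.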

Large-width limit

Recent theoretical and empirical work in deep learning has shown that the performance of ANNs strictly improves as the width of their layers grows.^[5]^[6] For various ANN architectures, the NTK yields precise insight into training in this large-width regime.^[1]^[7]^[8]^[9]^[10]^[11]
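One aspect of the large-width regime, the concentration of the NTK at initialization, can be illustrated with a small experiment. The sketch below is an assumption-laden toy: it uses a one-hidden-layer ReLU network with scalar input, $f(x)={\tfrac {1}{\sqrt {m}}}\sum _{j}a_{j}\,\mathrm {relu} (w_{j}x)$, with the $1/{\sqrt {m}}$ scaling and independent standard Gaussian parameters, none of which is specified in the text above. It measures how much the empirical kernel value fluctuates across random initializations for a narrow and a wide network; it does not show what happens during training.

```python
import math
import random

def ntk_at_init(x, y, width, rng):
    """Empirical NTK of f(x) = (1/sqrt(m)) * sum_j a_j * relu(w_j * x) at a random init."""
    total = 0.0
    for _ in range(width):
        w, a = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        act_x, act_y = max(w * x, 0.0), max(w * y, 0.0)
        on_x, on_y = float(w * x > 0), float(w * y > 0)
        total += act_x * act_y               # product of derivatives w.r.t. a_j
        total += a * a * on_x * on_y * x * y  # product of derivatives w.r.t. w_j
    return total / width                      # the 1/m factor from the scaling

def spread(width, trials=30, seed=0):
    """Standard deviation of Theta(1.0, 0.5) over independent initializations."""
    rng = random.Random(seed)
    values = [ntk_at_init(1.0, 0.5, width, rng) for _ in range(trials)]
    mean = sum(values) / trials
    return math.sqrt(sum((v - mean) ** 2 for v in values) / trials)

narrow, wide = spread(10), spread(1000)
print("spread at width 10:", narrow)
print("spread at width 1000:", wide)
```

Under these assumptions the fluctuations should shrink roughly like $1/{\sqrt {m}}$, which is the sense in which the kernel at initialization becomes deterministic as the width grows.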

References

[1] Jacot, Arthur; Gabriel, Franck; Hongler, Clément (2018). Bengio, S.; Wallach, H.; Larochelle, H.; Grauman, K. (eds.). "Neural Tangent Kernel: Convergence and Generalization in Neural Networks" (PDF). Advances in Neural Information Processing Systems 31. Curran Associates, Inc.: 8571–8580. Bibcode:2018arXiv180607572J. arXiv:1806.07572. Retrieved 27 November 2019.
[2] Li, Yuanzhi; Liang, Yingyu (2018). "Learning overparameterized neural networks via stochastic gradient descent on structured data". Advances in Neural Information Processing Systems.
[3] Allen-Zhu, Zeyuan; Li, Yuanzhi; Song, Zhao (2018). "A convergence theory for deep learning via overparameterization". International Conference on Machine Learning.
[4] Du, Simon S; Zhai, Xiyu; Poczos, Barnabas; Singh, Aarti (2019). "Gradient descent provably optimizes over-parameterized neural networks". International Conference on Learning Representations.
[5] Novak, Roman; Bahri, Yasaman; Abolafia, Daniel A.; Pennington, Jeffrey; Sohl-Dickstein, Jascha (15 February 2018). "Sensitivity and Generalization in Neural Networks: an Empirical Study". Bibcode:2018arXiv180208760N. arXiv:1802.08760.
[6] Canziani, Alfredo; Paszke, Adam; Culurciello, Eugenio (4 November 2016). "An Analysis of Deep Neural Network Models for Practical Applications". Bibcode:2016arXiv160507678C. arXiv:1605.07678.
[7] Allen-Zhu, Zeyuan; Li, Yuanzhi; Song, Zhao (9 November 2018). "A Convergence Theory for Deep Learning via Over-Parameterization". International Conference on Machine Learning: 242–252. arXiv:1811.03962.
[8] Du, Simon; Lee, Jason; Li, Haochuan; Wang, Liwei; Zhai, Xiyu (24 May 2019). "Gradient Descent Finds Global Minima of Deep Neural Networks". International Conference on Machine Learning: 1675–1685. arXiv:1811.03804.
[9] Lee, Jaehoon; Xiao, Lechao; Schoenholz, Samuel S.; Bahri, Yasaman; Novak, Roman; Sohl-Dickstein, Jascha; Pennington, Jeffrey (15 February 2018). "Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent". arXiv:1902.06720.
[10] Arora, Sanjeev; Du, Simon S; Hu, Wei; Li, Zhiyuan; Salakhutdinov, Russ R; Wang, Ruosong (2019). "On Exact Computation with an Infinitely Wide Neural Net". NeurIPS: 8139–8148. arXiv:1904.11955.
[11] Huang, Jiaoyang; Yau, Horng-Tzer (17 September 2019). "Dynamics of Deep Neural Networks and Neural Tangent Hierarchy". arXiv:1909.08156.


