Thursday, June 11, 2020

Linformer: Self-Attention with Linear Complexity (Paper Explained)


Transformers are notoriously resource-intensive because their self-attention mechanism requires memory and compute that grow quadratically with the length of the input sequence. The Linformer model gets around this by exploiting the fact that the actual information in the attention matrix is often of low rank and can therefore be approximated (a minimal code sketch of this idea follows after the links below).

OUTLINE:
0:00 - Intro & Overview
1:40 - The Complexity of Self-Attention
4:50 - Embedding Dimension & Multiple Heads
8:45 - Formal Attention
10:30 - Empirical Investigation into RoBERTa
20:00 - Theorem: Self-Attention is Low Rank
28:10 - Linear Self-Attention Method
36:15 - Theorem: Linear Self-Attention
44:10 - Language Modeling
46:40 - NLP Benchmarks
47:50 - Compute Time & Memory Gains
48:20 - Broader Impact Statement
49:55 - Conclusion

Paper: https://ift.tt/3fc6l6t

Abstract:
Large transformer models have shown extraordinary success in achieving state-of-the-art results in many natural language processing applications. However, training and deploying these models can be prohibitively costly for long sequences, as the standard self-attention mechanism of the Transformer uses O(n²) time and space with respect to sequence length. In this paper, we demonstrate that the self-attention mechanism can be approximated by a low-rank matrix. We further exploit this finding to propose a new self-attention mechanism, which reduces the overall self-attention complexity from O(n²) to O(n) in both time and space. The resulting linear transformer, the Linformer, performs on par with standard Transformer models, while being much more memory- and time-efficient.

Authors: Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, Hao Ma

Links:
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://ift.tt/3dJpBrR
BitChute: https://ift.tt/38iX6OV
Minds: https://ift.tt/37igBpB
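
To make the low-rank trick concrete, here is a minimal NumPy sketch of the idea discussed in the video: keys and values are projected from sequence length n down to a small dimension k before attention is computed, so the attention matrix is n x k instead of n x n. This is an illustration only, not the paper's code; the function names, the toy dimensions, and the random initialization of the projection matrices E and F are my own assumptions for the example.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def standard_attention(Q, K, V):
    # O(n^2): materializes the full n x n attention matrix.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # (n, n)
    return softmax(scores) @ V           # (n, d)

def linformer_attention(Q, K, V, E, F):
    # O(n*k): keys and values are first projected along the sequence
    # dimension from n down to k, so the attention matrix is only n x k.
    d = Q.shape[-1]
    K_proj = E @ K                       # (k, d)
    V_proj = F @ V                       # (k, d)
    scores = Q @ K_proj.T / np.sqrt(d)   # (n, k)
    return softmax(scores) @ V_proj      # (n, d)

# Toy usage: n = sequence length, d = head dimension, k = projected length, k << n.
n, d, k = 512, 64, 32
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
# E, F are random here purely for illustration.
E, F = (rng.standard_normal((k, n)) / np.sqrt(n) for _ in range(2))
out = linformer_attention(Q, K, V, E, F)
print(out.shape)                         # (512, 64)

In the actual Linformer, E and F are learned projection matrices (one pair per head, optionally shared across heads and layers), while this sketch just draws them at random to show the shapes and the resulting linear cost in n.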
