Sunday, May 31, 2020

Synthesizer: Rethinking Self-Attention in Transformer Models (Paper Explained)


Do we really need dot-product attention? The attention mechanism is a central component of modern Transformers, and dot-product attention sits at its core. This paper replaces that mechanism, removing the quadratic token-token interaction terms, and arrives at a new model, the Synthesizer. As it turns out, you can do pretty well that way! (A minimal code sketch contrasting the attention variants appears at the end of this post.)

OUTLINE:
0:00 - Intro & High Level Overview
1:00 - Abstract
2:30 - Attention Mechanism as Information Routing
5:45 - Dot Product Attention
8:05 - Dense Synthetic Attention
15:00 - Random Synthetic Attention
17:15 - Comparison to Feed-Forward Layers
22:00 - Factorization & Mixtures
23:10 - Number of Parameters
25:35 - Machine Translation & Language Modeling Experiments
36:15 - Summarization & Dialogue Generation Experiments
37:15 - GLUE & SuperGLUE Experiments
42:00 - Weight Sizes & Number of Head Ablations
47:05 - Conclusion

Paper: https://ift.tt/3cldH5V
My Video on Transformers (Attention Is All You Need): https://youtu.be/iDulhoQ2pro
My Video on BERT: https://youtu.be/-9evrZnBorM

Abstract:
The dot product self-attention is known to be central and indispensable to state-of-the-art Transformer models. But is it really required? This paper investigates the true importance and contribution of the dot product-based self-attention mechanism on the performance of Transformer models. Via extensive experiments, we find that (1) random alignment matrices surprisingly perform quite competitively and (2) learning attention weights from token-token (query-key) interactions is not that important after all. To this end, we propose Synthesizer, a model that learns synthetic attention weights without token-token interactions. Our experimental results show that Synthesizer is competitive against vanilla Transformer models across a range of tasks, including machine translation (EnDe, EnFr), language modeling (LM1B), abstractive summarization (CNN/Dailymail), dialogue generation (PersonaChat) and multi-task language understanding (GLUE, SuperGLUE).

Authors: Yi Tay, Dara Bahri, Donald Metzler, Da-Cheng Juan, Zhe Zhao, Che Zheng

Links:
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
BitChute: https://ift.tt/38iX6OV
Minds: https://ift.tt/37igBpB
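For the curious, here is a minimal NumPy sketch (not the authors' implementation; the single-head setup, omitted biases, and the parameter names W1, W2, R are illustrative assumptions) contrasting standard dot-product attention with the Dense and Random synthetic attention variants discussed in the video:

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

L, d = 8, 16                                   # sequence length, model dimension
rng = np.random.default_rng(0)
X = rng.normal(size=(L, d))                    # token representations

# Standard dot-product self-attention: the L x L weight matrix comes from
# pairwise query-key interactions (the quadratic token-token term).
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv
A_dot = softmax(Q @ K.T / np.sqrt(d))          # (L, L)
out_dot = A_dot @ V

# Dense Synthesizer: each token predicts its own row of attention weights
# from its representation alone via a small MLP, with no query-key products.
W1 = rng.normal(size=(d, d))
W2 = rng.normal(size=(d, L))                   # projects from d to sequence length L
A_dense = softmax(np.maximum(X @ W1, 0) @ W2)  # (L, L)
out_dense = A_dense @ (X @ Wv)

# Random Synthesizer: the attention matrix is a (trainable or fixed) parameter,
# independent of the input entirely.
R = rng.normal(size=(L, L))
A_rand = softmax(R)
out_rand = A_rand @ (X @ Wv)

The point of the contrast: the dot-product variant derives its L x L routing matrix from interactions between every pair of tokens, while both Synthesizer variants produce that matrix without comparing tokens to each other at all.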
