Tuesday, July 7, 2020

SupSup: Supermasks in Superposition (Paper Explained)


Supermasks are binary masks of a randomly initialized neural network that result in the masked network performing well on a particular task. This paper considers the problem of (sequential) Lifelong Learning and trains one Supermask per Task, while keeping the randomly initialized base network constant. By minimizing the output entropy, the system can automatically derive the Task ID of a data point at inference time and distinguish up to 2500 tasks automatically.

OUTLINE:
0:00 - Intro & Overview
1:20 - Catastrophic Forgetting
5:20 - Supermasks
9:35 - Lifelong Learning using Supermasks
11:15 - Inference Time Task Discrimination by Entropy
15:05 - Mask Superpositions
24:20 - Proof-of-Concept, Task Given at Inference
30:15 - Binary Maximum Entropy Search
32:00 - Task Not Given at Inference
37:15 - Task Not Given at Training
41:35 - Ablations
45:05 - Superfluous Neurons
51:10 - Task Selection by Detecting Outliers
57:40 - Encoding Masks in Hopfield Networks
59:40 - Conclusion

Paper: https://ift.tt/2BMrcPL
Code: https://ift.tt/3iy8Mmi

My Video about Lottery Tickets: https://youtu.be/ZVVnvZdUMUk
My Video about Supermasks: https://youtu.be/jhCInVFE2sc

Abstract:
We present the Supermasks in Superposition (SupSup) model, capable of sequentially learning thousands of tasks without catastrophic forgetting. Our approach uses a randomly initialized, fixed base network and for each task finds a subnetwork (supermask) that achieves good performance. If task identity is given at test time, the correct subnetwork can be retrieved with minimal memory usage. If not provided, SupSup can infer the task using gradient-based optimization to find a linear superposition of learned supermasks which minimizes the output entropy. In practice we find that a single gradient step is often sufficient to identify the correct mask, even among 2500 tasks. We also showcase two promising extensions. First, SupSup models can be trained entirely without task identity information, as they may detect when they are uncertain about new data and allocate an additional supermask for the new training distribution. Finally the entire, growing set of supermasks can be stored in a constant-sized reservoir by implicitly storing them as attractors in a fixed-sized Hopfield network.

Authors: Mitchell Wortsman, Vivek Ramanujan, Rosanne Liu, Aniruddha Kembhavi, Mohammad Rastegari, Jason Yosinski, Ali Farhadi

Links:
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://ift.tt/3dJpBrR
BitChute: https://ift.tt/38iX6OV
Minds: https://ift.tt/37igBpB
Parler: https://ift.tt/38tQU7C
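To make the entropy-based task inference from the abstract a bit more concrete, here is a minimal NumPy sketch. It is not the authors' code. It assumes a single linear layer instead of a deep network, uses random binary masks where the paper would use trained supermasks, and the function names (`entropy_and_grad`, `infer_task_one_shot`) are made up for illustration. The masks are mixed with uniform coefficients, the gradient of the output entropy with respect to each coefficient is computed, and the mask whose coefficient most reduces the entropy is selected. That selection rule is one reading of the "single gradient step" mentioned in the abstract.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D logit vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def entropy_and_grad(alpha, W, masks, x):
    """Output entropy of the superposed network and its gradient w.r.t. alpha.

    W      : fixed random weights, shape (num_classes, num_features)
    masks  : list of k binary arrays with the same shape as W
    alpha  : mixing coefficients, shape (k,)
    x      : one input, shape (num_features,)

    Assumption: a single linear layer, so the logits are linear in alpha.
    """
    # Logits produced by each masked network on its own, shape (k, num_classes).
    Z = np.stack([(W * M) @ x for M in masks])
    p = softmax(alpha @ Z)
    log_p = np.log(p + 1e-12)
    H = float(-(p * log_p).sum())
    # Gradient of the softmax entropy w.r.t. the logits: -p * (log p + H).
    dH_dz = -p * (log_p + H)
    # Chain rule through z = alpha @ Z.
    return H, Z @ dH_dz


def infer_task_one_shot(W, masks, x):
    """Guess the task id from one gradient evaluation at uniform alpha."""
    k = len(masks)
    alpha = np.full(k, 1.0 / k)
    _, grad = entropy_and_grad(alpha, W, masks, x)
    # The most negative gradient marks the mask whose weight lowers entropy most.
    return int(np.argmin(grad))
```

With trained supermasks, the expectation is that the correct mask produces a confident, low-entropy output, so its coefficient receives the most negative gradient. With the random masks assumed here, the returned index carries no meaning; the sketch only shows the mechanics.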
