29 May 2018 | Joan Serrà, Dídac Surís, Marius Miron, Alexandros Karatzoglou
The paper introduces a task-based hard attention mechanism (HAT) to mitigate catastrophic forgetting in neural networks, i.e., the loss of previously learned information when a network is trained on new tasks. HAT preserves knowledge from earlier tasks without limiting the learning of the current one by means of a hard attention mask that is learned concurrently with each task. These masks condition the learning process so that the network retains knowledge from previous tasks while adapting to new ones.

HAT is effective at reducing catastrophic forgetting, cutting forgetting rates by up to 80%. It is robust to different hyperparameter choices and offers monitoring capabilities. The approach also allows controlling the stability and compactness of the learned knowledge, which makes it attractive for online learning and network compression.

To build the masks, HAT learns a task embedding per layer and passes it through a scaled sigmoid gate to produce an attention vector; as the sigmoid's scaling factor grows during training, the attention vectors approach binary masks. The cumulative masks of previous tasks are then used to condition the weight updates when training a new task, so that weights important to earlier tasks are left (nearly) untouched and previous knowledge is not forgotten (a minimal sketch of this gating mechanism is given after this summary).

The method is evaluated on multiple image classification tasks, where it outperforms existing approaches. HAT is also effective for network pruning, allowing the network to be compressed while maintaining accuracy. The approach is lightweight, adding only a small fraction of weights to the base network, and is trained with standard backpropagation and SGD. The paper concludes that HAT is a promising solution for overcoming catastrophic forgetting, with potential applications in online learning and network compression.
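Below is a minimal, illustrative PyTorch sketch of the kind of hard-attention gating described above. It is not the authors' implementation; the class and helper names (`HATLinear`, `condition_gradients`) and the argument names (`task_id`, `s`, `cum_mask_out`, `cum_mask_in`) are assumptions made for this example. It shows a per-task embedding passed through a scaled sigmoid to form a per-unit mask, and how cumulative masks from previous tasks can be used to attenuate gradients on weights those tasks rely on.

```python
import torch
import torch.nn as nn


class HATLinear(nn.Module):
    """Fully connected layer gated by a task-conditioned hard attention mask.

    Illustrative sketch only: a learnable embedding per task is passed through
    a scaled sigmoid, sigmoid(s * e), which approaches a binary mask as the
    scale s grows during training.
    """

    def __init__(self, in_features, out_features, num_tasks):
        super().__init__()
        self.fc = nn.Linear(in_features, out_features)
        # One embedding (one gate value per output unit) for each task.
        self.task_embedding = nn.Embedding(num_tasks, out_features)

    def mask(self, task_id, s):
        # Attention vector a_t = sigmoid(s * e_t); a large s pushes it toward {0, 1}.
        idx = torch.tensor(task_id, dtype=torch.long)
        return torch.sigmoid(s * self.task_embedding(idx))

    def forward(self, x, task_id, s):
        # Units whose mask is ~0 are effectively switched off for this task.
        return torch.relu(self.fc(x)) * self.mask(task_id, s)


def condition_gradients(layer, cum_mask_out, cum_mask_in):
    """Attenuate weight gradients after loss.backward().

    cum_mask_out / cum_mask_in hold, for this layer's output / input units, the
    element-wise maximum of the masks of all previous tasks. A weight connecting
    two units that earlier tasks attend to receives (almost) no update.
    """
    with torch.no_grad():
        keep = 1.0 - torch.min(cum_mask_out.unsqueeze(1),  # shape (out, 1)
                               cum_mask_in.unsqueeze(0))   # shape (1, in)
        layer.fc.weight.grad *= keep
```

In a training loop, `s` would typically be annealed from a small to a large value so that the soft attention used for gradient flow converges toward the (pseudo-)binary mask used at test time, and the cumulative masks would be updated with an element-wise maximum after each task is learned.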