Conformer: Convolution-augmented Transformer for Speech Recognition

16 May 2020 | Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, Ruoming Pang
The Conformer is a convolution-augmented Transformer model designed for speech recognition, combining the strengths of convolutional neural networks (CNNs) and Transformers. It captures both local and global dependencies in audio sequences in a parameter-efficient way, and it outperforms previous Transformer- and CNN-based models on the LibriSpeech benchmark, achieving state-of-the-art word error rates (WER) of 2.1%/4.3% without a language model and 1.9%/3.9% with an external language model on test-clean/test-other. A smaller model with only 10M parameters reaches 2.7%/6.3%, demonstrating the architecture's efficiency.

The Conformer encoder consists of a convolution subsampling layer followed by a stack of Conformer blocks. Each block sandwiches a self-attention module and a convolution module between two feed-forward modules. The self-attention module uses relative positional encoding for better generalization to varying input lengths; the convolution module starts with a gating mechanism (a pointwise convolution followed by a gated linear unit) and applies a depthwise convolution; and the feed-forward modules use pre-norm residual units with Swish activation.

Experiments show that the Conformer outperforms previous models, including ContextNet and the Transformer Transducer, across a range of parameter sizes. Ablation studies reveal that the convolution module is crucial for performance and that placing it after the self-attention module improves results. The model also benefits from increasing the number of attention heads up to 16 and from larger kernel sizes in the depthwise convolution, with kernel size 32 performing best.
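The paper formalizes this block structure with half-step (macaron-style) feed-forward residuals: for an input x_i to block i,

```latex
\tilde{x}_i = x_i + \tfrac{1}{2}\,\mathrm{FFN}(x_i), \qquad
x'_i = \tilde{x}_i + \mathrm{MHSA}(\tilde{x}_i), \qquad
x''_i = x'_i + \mathrm{Conv}(x'_i), \qquad
y_i = \mathrm{Layernorm}\!\left(x''_i + \tfrac{1}{2}\,\mathrm{FFN}(x''_i)\right)
```

where FFN, MHSA, and Conv denote the feed-forward, multi-headed self-attention, and convolution modules, respectively.

To make the convolution module concrete, here is a minimal PyTorch-style sketch following the description above (layer norm, pointwise convolution with GLU gating, depthwise convolution, batch norm, Swish, a second pointwise convolution, dropout, and a residual connection). The model dimension, dropout rate, and the trimming of one padded frame for the even kernel size are illustrative assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn as nn

class ConformerConvModule(nn.Module):
    """Sketch of the Conformer convolution module:
    LayerNorm -> pointwise conv -> GLU -> depthwise conv -> BatchNorm
    -> Swish -> pointwise conv -> dropout, wrapped in a residual connection."""

    def __init__(self, d_model: int = 256, kernel_size: int = 32, dropout: float = 0.1):
        super().__init__()
        self.layer_norm = nn.LayerNorm(d_model)
        # Pointwise conv expands to 2*d_model so the GLU gate halves it back.
        self.pointwise_in = nn.Conv1d(d_model, 2 * d_model, kernel_size=1)
        self.glu = nn.GLU(dim=1)
        # Depthwise conv mixes information along time, one filter per channel.
        self.depthwise = nn.Conv1d(d_model, d_model, kernel_size,
                                   padding=kernel_size // 2, groups=d_model)
        self.batch_norm = nn.BatchNorm1d(d_model)
        self.swish = nn.SiLU()  # Swish activation
        self.pointwise_out = nn.Conv1d(d_model, d_model, kernel_size=1)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model)
        residual = x
        x = self.layer_norm(x).transpose(1, 2)        # (batch, d_model, time)
        x = self.glu(self.pointwise_in(x))            # gating mechanism
        x = self.swish(self.batch_norm(self.depthwise(x)))
        x = self.dropout(self.pointwise_out(x)).transpose(1, 2)
        return residual + x[:, :residual.size(1)]     # trim extra frame from even-kernel padding
```

For example, `ConformerConvModule()(torch.randn(2, 100, 256))` returns a tensor of the same (2, 100, 256) shape, so the module can slot directly into the residual structure above.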
The Conformer thus achieves superior accuracy with fewer parameters than previous work on the LibriSpeech dataset, setting a new state of the art of 1.9%/3.9% WER on test-clean/test-other. Its architecture effectively integrates CNNs and Transformers, leveraging their complementary strengths for speech recognition.