Conformer: Convolution-augmented Transformer for Speech Recognition

16 May 2020 | Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, Ruoming Pang
The paper introduces Conformer, a convolution-augmented Transformer architecture for end-to-end speech recognition. Conformer combines the strengths of convolutional neural networks (CNNs) and Transformers to model both local and global dependencies in audio sequences in a parameter-efficient way. Each Conformer block sandwiches a multi-headed self-attention module and a convolution module between a pair of Macaron-style half-step feed-forward modules, allowing the model to learn both fine-grained local features and long-range global interactions. Evaluated on the LibriSpeech benchmark, Conformer achieves state-of-the-art results, outperforming previous Transformer- and CNN-based models: a word error rate (WER) of 2.1%/4.3% on test/test-other without a language model and 1.9%/3.9% with an external language model. Ablation studies further show that the convolution module and the Macaron-style feed-forward layers are crucial to its performance.
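For concreteness, the block structure described above can be sketched roughly as follows in PyTorch. This is an illustrative approximation rather than the authors' implementation: the paper uses relative positional multi-headed self-attention (in the Transformer-XL style), which is replaced here by standard `nn.MultiheadAttention`, and the hyperparameters (model dimension, kernel size, expansion factors) are representative defaults, not the exact published configuration.

```python
# A minimal, hedged sketch of one Conformer block in PyTorch.
# Assumptions: absolute-position attention stands in for the paper's
# relative positional attention; hyperparameters are illustrative.
import torch
import torch.nn as nn


class FeedForwardModule(nn.Module):
    """Macaron-style feed-forward module: LayerNorm -> Linear -> Swish -> Linear."""
    def __init__(self, d_model: int, expansion: int = 4, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.LayerNorm(d_model),
            nn.Linear(d_model, d_model * expansion),
            nn.SiLU(),                      # Swish activation
            nn.Dropout(dropout),
            nn.Linear(d_model * expansion, d_model),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)


class ConvModule(nn.Module):
    """Convolution module: LN -> pointwise conv + GLU -> depthwise conv -> BN -> Swish -> pointwise conv."""
    def __init__(self, d_model: int, kernel_size: int = 31, dropout: float = 0.1):
        super().__init__()
        self.layer_norm = nn.LayerNorm(d_model)
        self.pointwise1 = nn.Conv1d(d_model, 2 * d_model, kernel_size=1)
        self.glu = nn.GLU(dim=1)
        self.depthwise = nn.Conv1d(d_model, d_model, kernel_size,
                                   padding=kernel_size // 2, groups=d_model)
        self.batch_norm = nn.BatchNorm1d(d_model)
        self.swish = nn.SiLU()
        self.pointwise2 = nn.Conv1d(d_model, d_model, kernel_size=1)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                        # x: (batch, time, d_model)
        y = self.layer_norm(x).transpose(1, 2)   # -> (batch, d_model, time)
        y = self.glu(self.pointwise1(y))
        y = self.swish(self.batch_norm(self.depthwise(y)))
        y = self.dropout(self.pointwise2(y))
        return y.transpose(1, 2)


class ConformerBlock(nn.Module):
    """FFN/2 -> MHSA -> Conv -> FFN/2 -> LayerNorm, each with a residual connection."""
    def __init__(self, d_model: int = 256, n_heads: int = 4,
                 kernel_size: int = 31, dropout: float = 0.1):
        super().__init__()
        self.ffn1 = FeedForwardModule(d_model, dropout=dropout)
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads,
                                          dropout=dropout, batch_first=True)
        self.conv = ConvModule(d_model, kernel_size, dropout)
        self.ffn2 = FeedForwardModule(d_model, dropout=dropout)
        self.final_norm = nn.LayerNorm(d_model)

    def forward(self, x):                         # x: (batch, time, d_model)
        x = x + 0.5 * self.ffn1(x)                # half-step residual
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]
        x = x + self.conv(x)
        x = x + 0.5 * self.ffn2(x)                # half-step residual
        return self.final_norm(x)


# Example: a batch of 2 utterances, 100 frames, 256-dim features.
block = ConformerBlock()
out = block(torch.randn(2, 100, 256))
print(out.shape)                                  # torch.Size([2, 100, 256])
```

The half-step (0.5-scaled) residual connections around the two feed-forward modules are what make the arrangement "Macaron-style"; the ablation studies attribute a meaningful part of Conformer's gain over a plain Transformer block to this sandwich structure and to the depthwise convolution module.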