The paper introduces a novel end-to-end, probabilistic sequence transduction system based on recurrent neural networks (RNNs). The system aims to transform any input sequence into any finite, discrete output sequence, addressing the challenge of representing sequences in a way that is invariant to sequential distortions. Traditional RNNs require a predefined alignment between input and output sequences, which is often difficult to determine. The proposed system extends the Connectionist Temporal Classification (CTC) approach by defining a distribution over output sequences of all lengths and jointly modelling both input-output and output-output dependencies. It consists of two RNNs: a transcription network that scans the input sequence and outputs transcription vectors, and a prediction network that models each element of the output sequence given the previous ones. The output distribution is computed with a forward-backward algorithm, and the system is trained to minimise the log-loss of the target sequence. Experimental results on the TIMIT speech corpus demonstrate the system's effectiveness in phoneme recognition, achieving one of the lowest phoneme error rates recorded. The paper also discusses future directions, including applications to text-to-speech and machine translation.
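
To make the forward-backward step and the log-loss objective concrete, here is a minimal NumPy sketch of the forward recursion over the transducer's output lattice. It assumes the per-node log-probabilities `log_blank` and `log_label` have already been computed from the transcription and prediction networks; the function name and array layout are illustrative assumptions, not the paper's code.

```python
import numpy as np

def transducer_log_loss(log_blank, log_label):
    """Forward-pass log-loss for an RNN-transducer-style output lattice.

    log_blank[t, u] : log-prob of emitting the null (blank) symbol at lattice
                      node (t, u), i.e. advancing to the next input frame.
    log_label[t, u] : log-prob of emitting target label u+1 at node (t, u),
                      i.e. advancing to the next output position.
    Shapes are (T, U+1) and (T, U), where T is the number of transcription
    vectors and U the length of the target sequence.
    """
    T, U_plus_1 = log_blank.shape
    U = U_plus_1 - 1

    # log_alpha[t, u]: log-probability of having emitted the first u target
    # labels after consuming the first t+1 input frames.
    log_alpha = np.full((T, U + 1), -np.inf)
    log_alpha[0, 0] = 0.0  # start at the lattice origin

    for t in range(T):
        for u in range(U + 1):
            if t == 0 and u == 0:
                continue
            candidates = []
            if t > 0:  # arrived by emitting blank at (t-1, u)
                candidates.append(log_alpha[t - 1, u] + log_blank[t - 1, u])
            if u > 0:  # arrived by emitting label u at (t, u-1)
                candidates.append(log_alpha[t, u - 1] + log_label[t, u - 1])
            log_alpha[t, u] = np.logaddexp.reduce(candidates)

    # Total log-probability of the target: reach (T-1, U), then emit a final blank.
    log_prob = log_alpha[T - 1, U] + log_blank[T - 1, U]
    return -log_prob  # log-loss to be minimised during training
```

In practice the gradients with respect to the network outputs would come either from the companion backward variables, as in the paper's forward-backward derivation, or simply from automatic differentiation through this recursion.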