7 Feb 2024 | Roman Koshkin, Katsuhito Sudoh, Satoshi Nakamura
This paper presents TRANSLLAMA, a policy-free simultaneous machine translation (SiMT) system based on large language models (LLMs). The system uses a pre-trained decoder-only LLM, fine-tuned on a dataset of causally aligned source and target sentences. The key innovation is that the LLM can directly control input segmentation by generating a special <WAIT> token, eliminating the need for a separate read/write policy. This allows the LLM to perform English-German and English-Russian SiMT with BLEU scores comparable to state-of-the-art baselines. The system also supports speech-to-text translation (S2TT) by integrating an automatic speech recognition (ASR) model.
The system's architecture is based on a cascaded approach, in which an ASR model transcribes the source audio and feeds the recognized words into the LLM. The LLM is prompted with a partial source sentence and its corresponding partial translation, and it generates target tokens until either a complete new target word or a <WAIT> token is produced. Generating <WAIT> signals that the model needs to read more source words before continuing. With an off-the-shelf ASR model in the cascade, the system handles S2TT with performance approaching that of some recent baselines at comparable latencies.
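A minimal sketch of how such a policy-free decoding loop might look, simplified to word-level steps. The helper names (generate_next_token, build_prompt) and the end-of-source handling are illustrative assumptions, not the authors' implementation:

```python
# Illustrative sketch of a policy-free SiMT decoding loop driven by a <WAIT> token.
# Names such as `generate_next_token` and `build_prompt` are placeholders.

WAIT = "<WAIT>"
EOS = "</s>"
END_OF_SOURCE = "<EOS_SRC>"  # assumed end-of-source marker


def build_prompt(source_words, target_words):
    # The real prompt also carries an instruction and optional background text;
    # only the partial source / partial target structure is shown here.
    return f"Source: {' '.join(source_words)}\nTarget: {' '.join(target_words)}"


def translate_simultaneously(source_stream, generate_next_token, max_steps=512):
    """Interleave READ (on <WAIT>) and WRITE (on any other token) actions."""
    source_words, target_words = [], []
    source_iter = iter(source_stream)

    for _ in range(max_steps):
        token = generate_next_token(build_prompt(source_words, target_words))
        if token == WAIT:
            nxt = next(source_iter, None)
            if nxt is not None:
                source_words.append(nxt)            # READ one more source word
            elif not source_words or source_words[-1] != END_OF_SOURCE:
                source_words.append(END_OF_SOURCE)  # source exhausted: mark the end once
        elif token == EOS:
            break                                   # translation finished
        else:
            target_words.append(token)              # WRITE the newly generated word

    return " ".join(target_words)
```

In this sketch the model itself decides when to read (by emitting <WAIT>) and when to write, which is precisely what removes the need for a separate segmentation policy.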
The main contributions of this work are: (1) a method for fine-tuning a pre-trained LLM for SiMT using direct supervision on causally aligned source-target sentence pairs; and (2) a demonstration that an LLM can perform both simultaneous translation and input segmentation without a separate policy, achieving performance comparable to or exceeding state-of-the-art results.
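As a rough illustration of contribution (1), here is a hedged sketch of how a causally aligned training target with interleaved <WAIT> tokens could be constructed from a word-aligned sentence pair. The alignment format and the exact placement of <WAIT> tokens are assumptions, not the paper's data recipe:

```python
# Hedged sketch: building a fine-tuning target with interleaved <WAIT> tokens
# from a source sentence, a target sentence, and a word alignment. The alignment
# format (target index -> index of the last required source word) is assumed to
# come from some external word aligner.

WAIT = "<WAIT>"


def build_training_target(src_words, tgt_words, last_needed_src):
    """Return the target-side training string with <WAIT> tokens inserted.

    last_needed_src[j] = 0-based index of the last source word that target
    word j depends on.
    """
    out, read_so_far = [], 0
    for j, tgt in enumerate(tgt_words):
        # Emit one <WAIT> for every source word that still has to be read
        # before this target word can be produced.
        while read_so_far <= last_needed_src[j]:
            out.append(WAIT)
            read_so_far += 1
        out.append(tgt)
    return " ".join(out)


src = "I saw a black cat yesterday".split()
tgt = "Ich sah gestern eine schwarze Katze".split()
# "gestern" (yesterday) depends on source position 5, forcing the model to wait.
last_needed = [0, 1, 5, 2, 3, 4]
print(build_training_target(src, tgt, last_needed))
# -> "<WAIT> Ich <WAIT> sah <WAIT> <WAIT> <WAIT> <WAIT> gestern eine schwarze Katze"
```

Training on sequences of this shape teaches the model both what to translate and when it lacks enough source context to continue.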
The system was evaluated on the MuST-C v2.0 dataset for English-to-German (en-de) and English-to-Russian (en-ru) translation directions. Results show that the LLM's size is a major factor in determining translation quality. In zero-shot settings (i.e., without fine-tuning), GPT-4 showed particularly strong performance. The use of <WAIT> tokens was found to be crucial for achieving the desired balance between translation quality and latency.
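For context, SiMT quality and latency are commonly summarized with corpus BLEU and a lagging metric such as Average Lagging (AL). The summary does not state the exact tooling, so the use of sacrebleu and the AL definition below are assumptions:

```python
# Hedged sketch of a common quality/latency evaluation for SiMT outputs.
# sacrebleu provides corpus BLEU; Average Lagging (AL) is a standard SiMT
# latency metric. Whether the paper uses exactly these tools is an assumption.
import sacrebleu


def average_lagging(delays, src_len, tgt_len):
    """Average Lagging for one sentence.

    delays[t-1] = number of source words read before emitting target word t.
    """
    gamma = tgt_len / src_len
    # tau = index of the first target word emitted after the full source was read
    tau = next((t for t, g in enumerate(delays, start=1) if g >= src_len), len(delays))
    return sum(delays[t - 1] - (t - 1) / gamma for t in range(1, tau + 1)) / tau


hyps = ["Ich sah gestern eine schwarze Katze"]
refs = ["Ich habe gestern eine schwarze Katze gesehen"]
print("BLEU:", sacrebleu.corpus_bleu(hyps, [refs]).score)
print("AL:", average_lagging(delays=[1, 2, 6, 6, 6, 6], src_len=6, tgt_len=6))
```

Plotting BLEU against AL across different waiting behaviors is the usual way to visualize the quality-latency trade-off that the <WAIT> token controls.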
The paper also discusses the limitations of the current approach, including the need for background information in the prompt and the impact of ASR errors on translation quality. Future work includes exploring multilingual fine-tuning, self-instruct, and human preference tuning. The study highlights the potential of LLMs for SiMT tasks, demonstrating that with minimal fine-tuning, an off-the-shelf pre-trained LLM can achieve performance that rivals some of the latest SiMT models.