Europarl: A Parallel Corpus for Statistical Machine Translation

2005 | Philipp Koehn
The paper "Europarl: A Parallel Corpus for Statistical Machine Translation" by Philipp Koehn discusses the collection and application of the Europarl corpus, a parallel text corpus from the proceedings of the European Parliament. The corpus, available in 11 official languages of the European Union, has been widely used in natural language processing (NLP) research. Koehn and his team collected the corpus to aid their research in statistical machine translation (SMT) and made it available to the NLP community. The paper details the five steps involved in acquiring a parallel corpus for SMT: obtaining raw data, extracting and mapping parallel chunks of text, breaking text into sentences, preparing the corpus for SMT systems, and mapping sentences in one language to sentences in another. The authors describe the process of crawling the European Parliament website, document alignment, sentence splitting and tokenization, and sentence alignment. They also provide a common test set for comparing machine translation systems and discuss the challenges and performance of 110 SMT systems trained on the Europarl corpus. The results show that the quality of SMT systems varies significantly across different language pairs, highlighting the diverse challenges in SMT research. The paper concludes by emphasizing the importance of resources and tools in advancing the field of statistical machine translation.The paper "Europarl: A Parallel Corpus for Statistical Machine Translation" by Philipp Koehn discusses the collection and application of the Europarl corpus, a parallel text corpus from the proceedings of the European Parliament. The corpus, available in 11 official languages of the European Union, has been widely used in natural language processing (NLP) research. Koehn and his team collected the corpus to aid their research in statistical machine translation (SMT) and made it available to the NLP community. The paper details the five steps involved in acquiring a parallel corpus for SMT: obtaining raw data, extracting and mapping parallel chunks of text, breaking text into sentences, preparing the corpus for SMT systems, and mapping sentences in one language to sentences in another. The authors describe the process of crawling the European Parliament website, document alignment, sentence splitting and tokenization, and sentence alignment. They also provide a common test set for comparing machine translation systems and discuss the challenges and performance of 110 SMT systems trained on the Europarl corpus. The results show that the quality of SMT systems varies significantly across different language pairs, highlighting the diverse challenges in SMT research. The paper concludes by emphasizing the importance of resources and tools in advancing the field of statistical machine translation.