This paper presents a method for aligning sentences in bilingual corpora such as the Canadian Hansards, which are available in both French and English. The method is based on a simple statistical model of sentence length measured in characters: longer sentences in one language tend to be translated into longer sentences in the other, and shorter sentences into shorter ones. Alignment is performed with a dynamic programming framework that finds the maximum likelihood alignment of sentences.
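The length-based dynamic programming idea can be illustrated with a minimal sketch. This is not the paper's program: the function names (`match_cost`, `align`), the constants `C` and `S2`, and the prior table `PRIORS` are assumptions chosen for illustration, and the cost is simply a negative log-likelihood under a normal model of the length difference.

```python
import math

# Illustrative assumptions, not values stated in the summary above.
C = 1.0    # expected target characters per source character
S2 = 6.8   # assumed variance of the length difference per character
# Assumed prior probability of each alignment type (n source, n target sentences).
PRIORS = {(1, 1): 0.89, (1, 0): 0.0099, (0, 1): 0.0099,
          (2, 1): 0.089, (1, 2): 0.089, (2, 2): 0.011}


def match_cost(len_src, len_tgt, prior):
    """Negative log-likelihood of pairing len_src with len_tgt characters."""
    if len_src == 0 and len_tgt == 0:
        return -math.log(prior)
    mean = (len_src + len_tgt / C) / 2.0
    delta = (len_tgt - len_src * C) / math.sqrt(mean * S2)
    # Two-sided tail probability of a standard normal at |delta|.
    tail = max(math.erfc(abs(delta) / math.sqrt(2.0)), 1e-300)
    return -math.log(prior) - math.log(tail)


def align(src_lens, tgt_lens):
    """Return the minimum-cost list of (source indices, target indices) pairs."""
    n, m = len(src_lens), len(tgt_lens)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    # Forward pass: extend every reachable cell by each alignment type.
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), prior in PRIORS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                step = match_cost(sum(src_lens[i:ni]), sum(tgt_lens[j:nj]), prior)
                if cost[i][j] + step < cost[ni][nj]:
                    cost[ni][nj] = cost[i][j] + step
                    back[ni][nj] = (di, dj)
    # Backtrace from the final cell to recover the chosen alignment.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        di, dj = back[i][j]
        pairs.append((list(range(i - di, i)), list(range(j - dj, j))))
        i, j = i - di, j - dj
    pairs.reverse()
    return pairs
```

Calling `align([50, 60], [112])` with sentence lengths in characters pairs both source sentences with the single target sentence, because the combined length of 110 characters is a far better match for 112 than either sentence alone.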
The method was tested on a trilingual sample of Swiss economic reports, achieving an error rate of 4% across 1316 alignments. By selecting the best-scoring 80% of alignments, the error rate was reduced to 0.7%. The method is also fairly language-independent, as it performed well for both English-French and English-German translations.
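Selecting the best-scoring subset of alignments can be sketched as a simple ranking step. The helper name `keep_best` and the convention that each alignment carries a cost where lower is better are assumptions for illustration; the summary does not say how scores are represented.

```python
def keep_best(scored_alignments, fraction=0.8):
    """Return the lowest-cost `fraction` of (alignment, cost) pairs.

    Assumes a lower cost means a more confident alignment.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    ranked = sorted(scored_alignments, key=lambda pair: pair[1])
    keep = round(len(ranked) * fraction)
    return ranked[:keep]
```

The discarded alignments are the ones the model is least sure about, which is why filtering on the score trades coverage for a lower error rate.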
The paper discusses the evaluation of the alignment program, including a comparison with human alignment, and highlights the importance of measuring sentence length in characters rather than words, which gives more accurate results. The method is simple yet effective, making it a useful tool for aligning sentences in bilingual corpora.