2024-01-11 | Zhihui Xie, Handong Zhao, Tong Yu, Shuai Li
This paper addresses the problem that strong language identity signals in multilingual embedding spaces hinder the expression of linguistic factors shared across languages. The authors propose LSAR, a method that projects language-specific factors out of multilingual embeddings without any fine-tuning. LSAR identifies a low-rank subspace via singular value decomposition (SVD) over embeddings computed from multiple monolingual corpora; this subspace encodes information irrelevant to semantics, such as syntactic details. Projecting the original embeddings into the null space of this subspace removes language-specific signals. Evaluated on a range of semantic tasks, including cross-lingual sentence retrieval, the method yields significant gains over the raw embeddings of commonly used multilingual language models (ML-LMs). Empirical results show that LSAR consistently improves performance, especially on challenging tasks such as language-agnostic QA retrieval. The paper also analyzes the identified low-rank subspace, showing that it primarily captures syntactic information.
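The general recipe (identify a low-rank "language identity" subspace with SVD, then project embeddings into its null space) can be sketched in a few lines of NumPy. The snippet below is a minimal illustration, not the paper's exact procedure: it assumes the subspace is spanned by the top singular directions of the mean-centered per-language mean vectors, and `rank`, `fit_language_subspace`, and `rectify` are hypothetical names and parameters introduced here for illustration.

```python
import numpy as np

def fit_language_subspace(embs_by_lang, rank):
    """Fit a low-rank 'language identity' subspace from monolingual corpora.

    embs_by_lang: dict mapping language code -> (n_i, d) matrix of sentence
    embeddings from that language's monolingual corpus.
    rank: number of language-specific directions to remove (a hypothetical
    hyperparameter; in practice it would be tuned on held-out data).
    Returns (global_mean, V), where V is a (d, rank) orthonormal basis.
    """
    all_embs = np.vstack(list(embs_by_lang.values()))
    mu = all_embs.mean(axis=0)
    # Per-language mean embeddings, shifted by the global mean, separate by
    # language identity; their top singular directions span a low-rank
    # subspace that encodes language-specific (non-semantic) variation.
    M = np.stack([X.mean(axis=0) - mu for X in embs_by_lang.values()])
    _, _, Vt = np.linalg.svd(M, full_matrices=False)
    return mu, Vt[:rank].T

def rectify(X, mu, V):
    """Project embeddings into the null space of the language subspace."""
    Xc = X - mu                      # remove the shared global offset
    return Xc - (Xc @ V) @ V.T       # subtract the language-specific component

# Usage with toy data (random stand-ins for real multilingual embeddings):
rng = np.random.default_rng(0)
embs = {lang: rng.normal(size=(100, 768)) for lang in ("en", "de", "fr")}
mu, V = fit_language_subspace(embs, rank=2)
clean_en = rectify(embs["en"], mu, V)
```

Because V is orthonormal, `Xc - (Xc @ V) @ V.T` is exactly the projection onto the orthogonal complement (null space) of the identified subspace, which is why no fine-tuning of the underlying ML-LM is needed.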