Discovering Low-rank Subspaces for Language-agnostic Multilingual Representations

11 Jan 2024 | Zhihui Xie, Handong Zhao, Tong Yu, Shuai Li
This paper proposes LSAR, a method for enhancing language agnosticism in pretrained multilingual language models (ML-LMs). The key idea is to identify a low-rank subspace of the multilingual embedding space that primarily encodes language-specific signals. Projecting the original embeddings into the null space of this subspace removes those language-specific factors, so that cross-lingual tasks can rely on semantics rather than on language identity. The subspace is identified via singular value decomposition (SVD) over embeddings drawn from multiple monolingual corpora, and it is then used to factor out language-specific information without any fine-tuning of the underlying model.

LSAR is evaluated on a range of tasks, including cross-lingual sentence retrieval and language-agnostic question answering. It consistently improves performance over commonly used ML-LMs, with particularly large gains on the LAReQA benchmark, a challenging task targeting strong language agnosticism. An analysis further reveals that the identified subspace encodes a substantial amount of syntactic information, suggesting that LSAR removes redundant linguistic signals that stand in the way of language agnosticism.

The paper also discusses related work, including previous attempts to remove language-specific signals from multilingual representations, and demonstrates LSAR's effectiveness on both cross-lingual retrieval and zero-shot classification. Overall, LSAR offers a simple yet effective approach to enhancing language agnosticism in multilingual representations.
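To make the pipeline concrete, below is a minimal NumPy sketch of the general idea the summary describes: estimate language-specific directions from embeddings of multiple monolingual corpora, take an SVD to obtain a low-rank basis for the language subspace, and project embeddings into its null space. This is an illustrative approximation, not the authors' exact procedure; in particular, the per-language mean heuristic, the `rank` hyperparameter, and all function names are assumptions for the sake of the example.

```python
import numpy as np

def remove_language_subspace(embeddings_by_lang, rank=4):
    """Hedged sketch of SVD-based language-subspace removal.

    embeddings_by_lang: dict mapping language code -> (n_i, d) array of
        sentence embeddings from a monolingual corpus in that language.
    rank: assumed dimensionality of the language-specific subspace
        (a hyperparameter in this sketch, chosen empirically).
    Returns a function that projects (n, d) embeddings into the null
    space of the estimated language subspace.
    """
    # Per-language mean embeddings serve as rough estimates of
    # language-specific directions (an assumption of this sketch).
    means = np.stack([e.mean(axis=0) for e in embeddings_by_lang.values()])
    # Center across languages so content shared by all languages is
    # not attributed to the language subspace.
    means -= means.mean(axis=0, keepdims=True)
    # SVD: the top right-singular vectors span the directions along
    # which the language means vary the most.
    _, _, vt = np.linalg.svd(means, full_matrices=False)
    basis = vt[:rank]                      # (rank, d) orthonormal rows
    # Null-space projector: I - B^T B removes any component lying in
    # the estimated language subspace.
    d = basis.shape[1]
    projector = np.eye(d) - basis.T @ basis

    def project(x):
        # projector is symmetric, so x @ projector suffices.
        return x @ projector
    return project
```

Usage would look like `project = remove_language_subspace({"en": en_emb, "de": de_emb, "zh": zh_emb}); clean = project(en_emb)`, after which retrieval or classification operates on the projected, more language-neutral embeddings.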