This paper investigates the effectiveness of multilingual pretraining and instruction tuning in improving cross-lingual knowledge alignment in large language models (LLMs). The authors propose a systematic framework, CLiKA, to evaluate cross-lingual knowledge alignment at three levels: Performance (PF), Consistency (CT), and Conductivity (CD). The results show that while both multilingual pretraining and instruction tuning improve cross-lingual knowledge alignment, the training strategy must be carefully designed: continued pretraining improves alignment for the target language at the cost of other languages, whereas mixed pretraining affects other languages less. However, neither method substantially improves cross-lingual knowledge conductivity. The study also finds that overall cross-lingual knowledge alignment, especially at the conductivity level, is unsatisfactory for all tested LLMs, and that the high cross-lingual consistency observed in current LLMs is more likely due to overlapping training data than to true knowledge transfer between languages.
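To make the three evaluation levels concrete, below is a minimal illustrative sketch, not the paper's CLiKA implementation, of how Performance and Consistency could be scored from a small parallel factual QA set; the `ask_model` wrapper, the example questions, and the scoring rule are hypothetical stand-ins.

```python
from typing import Callable, Dict, List


def ask_model(question: str) -> str:
    """Hypothetical wrapper around an LLM call; returns the model's short answer."""
    raise NotImplementedError


# Parallel factual questions: the same fact queried in English and Chinese.
PARALLEL_QA: List[Dict[str, Dict[str, str]]] = [
    {
        "en": {"q": "What is the capital of France?", "a": "Paris"},
        "zh": {"q": "法国的首都是哪里？", "a": "巴黎"},
    },
    # ... more parallel items
]


def evaluate(ask: Callable[[str], str], langs=("en", "zh")):
    # Record per-language correctness for each fact.
    correct = {lang: [] for lang in langs}
    for item in PARALLEL_QA:
        for lang in langs:
            pred = ask(item[lang]["q"])
            correct[lang].append(item[lang]["a"].lower() in pred.lower())

    # Performance (PF): accuracy in each language separately.
    pf = {lang: sum(v) / len(v) for lang, v in correct.items()}

    # Consistency (CT): fraction of facts on which the two languages agree
    # (both answered correctly or both answered incorrectly).
    ct = sum(a == b for a, b in zip(correct["en"], correct["zh"])) / len(PARALLEL_QA)

    # Conductivity (CD) would additionally require facts taught in only one
    # language and then queried in the other, so it cannot be computed from a
    # static parallel QA set like this one.
    return pf, ct
```

Note that a model can score high on CT simply by having seen the same fact in both languages during training, which is why the authors distinguish consistency from conductivity.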
The paper also examines the effects of multilingual pretraining and instruction tuning on basic language ability and cross-lingual knowledge alignment, using Chinese as a representative high-resource, non-English language. The findings indicate that mixed pretraining improves basic abilities and cross-lingual knowledge alignment, while continued pretraining has negative effects; multilingual instruction tuning improves basic abilities in the target language but does not significantly improve cross-lingual knowledge alignment. The study concludes that current multilingual pretraining and instruction tuning methods are insufficient to improve cross-lingual knowledge alignment at deeper levels, and that the alignment in current models remains shallow, calling for novel training strategies. The paper also discusses the study's limitations, including the evaluation being restricted to a few selected models and the narrow set of linguistic features considered, and the authors suggest that further experiments with other languages and finetuning strategies could provide more insight into the effectiveness of multilingual pretraining and instruction tuning.
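As a rough illustration of the two pretraining regimes contrasted above (the mixing ratio and corpus variables are hypothetical, not the recipe used in the paper), continued pretraining samples documents only from the target-language corpus, while mixed pretraining interleaves target-language and original-language data so the original language keeps being rehearsed:

```python
import random
from typing import Iterator, List


def continued_pretraining_stream(zh_corpus: List[str]) -> Iterator[str]:
    # Continued pretraining: every document comes from the target-language
    # (Chinese) corpus, so the original language gets no rehearsal.
    while True:
        yield random.choice(zh_corpus)


def mixed_pretraining_stream(zh_corpus: List[str], en_corpus: List[str],
                             zh_ratio: float = 0.5) -> Iterator[str]:
    # Mixed pretraining: target-language and original-language documents are
    # interleaved, which is why abilities in other languages degrade less.
    while True:
        corpus = zh_corpus if random.random() < zh_ratio else en_corpus
        yield random.choice(corpus)
```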