Microsoft has introduced phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens that achieves performance comparable to much larger models such as Mixtral 8x7B and GPT-3.5, despite being small enough to run on a phone: it scores 69% on MMLU and 8.38 on MT-bench. The authors attribute this performance to the training dataset, a scaled-up mixture of heavily filtered web data and synthetic data, rather than to architectural changes. The report also covers two larger text models, phi-3-small (7B parameters) and phi-3-medium (14B parameters), as well as phi-3-vision, a 4.2 billion parameter multimodal model for reasoning over images and text.
phi-3-mini is a standard transformer decoder with a default context length of 4K tokens; a long-context variant, phi-3-mini-128K, extends this to 128K tokens. It uses a block structure similar to Llama-2 and a tokenizer with a vocabulary size of 32,064, and it is chat-finetuned with a specific chat template (sketched below). phi-3-small, the 7B parameter model, switches to a different tokenizer with a vocabulary size of 100,352, uses the GEGLU (gated GELU) activation in its feed-forward layers (a sketch follows the template example), relies on muP (maximal update parametrization) for hyperparameter tuning, and adds a blocksparse attention module to speed up training and inference.
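To make the chat template concrete, here is a minimal sketch of prompt construction using the `<|user|>`/`<|assistant|>`/`<|end|>` markers described in the report; the exact whitespace handling in this helper is an assumption, not the reference implementation.

```python
def format_chat(messages: list[dict]) -> str:
    """Build a phi-3-mini style prompt from {"role", "content"} turns.

    Marker tokens follow the report's chat template; newline placement
    here is an illustrative assumption.
    """
    prompt = ""
    for m in messages:
        prompt += f"<|{m['role']}|>\n{m['content']}<|end|>\n"
    # Trailing assistant marker cues the model to generate its reply.
    return prompt + "<|assistant|>\n"

print(format_chat([{"role": "user", "content": "Explain what a tokenizer does."}]))
```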
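For the GEGLU activation used in phi-3-small, a generic sketch looks like the following; the layer sizes are illustrative and do not match phi-3-small's actual dimensions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GEGLU(nn.Module):
    """Gated GELU feed-forward block: GELU(x @ W_gate) * (x @ W_value), projected back."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.proj_in = nn.Linear(d_model, 2 * d_ff)  # fused gate + value projections
        self.proj_out = nn.Linear(d_ff, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate, value = self.proj_in(x).chunk(2, dim=-1)
        return self.proj_out(F.gelu(gate) * value)

# Example: a batch of 2 sequences, length 8, hidden size 64.
y = GEGLU(d_model=64, d_ff=256)(torch.randn(2, 8, 64))
print(y.shape)  # torch.Size([2, 8, 64])
```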
The training data was curated to sit in what the authors call the "data optimal regime" for small models: for a fixed model size, data quality is calibrated to the model's capacity rather than simply scaling up tokens and compute. This is what allows a 3.8B model to reach this level of performance. The model was also safety-aligned through post-training, red-teaming, and automated testing. Quantized to 4 bits, phi-3-mini runs natively on a phone at more than 12 tokens per second; a back-of-the-envelope memory check below shows why this is feasible.
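A quick sanity check on the on-device claim: at 4 bits per weight, the model's parameters fit comfortably in phone memory. This is a rough estimate covering weights only, ignoring activations and KV cache.

```python
# Rough weight-memory estimate for phi-3-mini at 4-bit precision.
params = 3.8e9               # parameter count
bits_per_weight = 4          # 4-bit quantization
gib = params * bits_per_weight / 8 / 2**30
print(f"~{gib:.1f} GiB of weights")  # ~1.8 GiB, small enough for a modern phone
```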
On academic benchmarks the model shows strong performance, particularly on reasoning tasks, and phi-3-vision performs well across a range of visual and combined image-text reasoning benchmarks. Safety testing shows that post-training substantially reduces harmful-response rates. The model still has clear limitations: a 3.8B model can only store so much factual knowledge, and its training data is predominantly English. The authors suggest that search-engine augmentation can compensate for the factual gaps (a sketch follows) and that multilingual data can broaden language coverage.
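The search-augmentation idea is that retrieved snippets supply the facts a small model cannot memorize. A minimal sketch, assuming a hypothetical `web_search` helper; the report suggests the mitigation but prescribes no specific API.

```python
from typing import Callable

def augment_prompt(question: str, web_search: Callable[[str], list[str]]) -> str:
    """Prepend retrieved snippets so a small model need not store every fact.

    `web_search` is a hypothetical callable returning text snippets;
    it stands in for whatever retrieval backend is actually used.
    """
    snippets = web_search(question)
    context = "\n".join(f"- {s}" for s in snippets[:3])  # keep the prompt short
    return (
        "Answer using the search results below.\n"
        f"{context}\n\n"
        f"Question: {question}"
    )
```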
The models were developed in accordance with Microsoft's responsible AI principles, with safety alignment and testing across a range of harm categories, including dedicated multimodal evaluations for phi-3-vision. Remaining weaknesses, such as occasional failures on high-level reasoning and ungrounded outputs, are expected to be addressed with additional training data and further post-training.