**TinyLlama: An Open-Source Small Language Model**
Peiyuan Zhang, Guangtao Zeng, Tianduo Wang, Wei Lu
StatNLP Research Group
Singapore University of Technology and Design
**Abstract**
We introduce TinyLlama, a compact 1.1B-parameter language model pre-trained on approximately 3 trillion tokens. Building on the architecture and tokenizer of Llama 2, TinyLlama leverages advances contributed by the open-source community, such as FlashAttention and Lit-GPT, to achieve better computational efficiency. Despite its small size, TinyLlama delivers strong performance across a range of downstream tasks, outperforming existing open-source models of comparable size. Our model checkpoints and code are publicly available on GitHub.
**Introduction**
Recent progress in natural language processing (NLP) has been driven largely by scaling up language model size, while the potential of training smaller models on larger datasets remains underexplored. This work studies how a small model behaves when trained on far more tokens than compute-optimal scaling laws suggest. TinyLlama has 1.1B parameters and is trained on up to 3 trillion tokens, making it the first attempt to train a model of this size on such a large amount of data. Because it follows the architecture and tokenizer of Llama 2, the model is named TinyLlama. It is competitive with existing open-source language models of similar size, surpassing OPT-1.3B and Pythia-1.4B on various tasks.
**Pre-training**
TinyLlama is pre-trained on a blend of natural language and code data drawn from the SlimPajama and StarCoder datasets. The combined corpus contains approximately 950 billion tokens, processed with the Llama tokenizer, and the model is trained for roughly three epochs over it, processing about 3 trillion tokens in total.
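As a quick sanity check, these numbers fit together as follows; the short calculation below uses only the figures stated in this summary.

```python
# Token budget stated above: ~950B-token corpus, ~3T training tokens.
corpus_tokens = 950e9          # SlimPajama + StarCoder, Llama-tokenized
total_training_tokens = 3e12   # total tokens processed during pre-training

epochs = total_training_tokens / corpus_tokens
print(f"{epochs:.2f} passes over the corpus")  # ~3.16, i.e. roughly three epochs
```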
**Architecture**
TinyLlama is a decoder-only Transformer that adopts rotary positional embeddings (RoPE), pre-norm with RMSNorm, SwiGLU activations, and grouped-query attention. These choices improve training efficiency and reduce the memory footprint.
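To make two of these components concrete, here is a minimal PyTorch sketch of RMSNorm and a SwiGLU feed-forward block. The layer sizes and epsilon are illustrative assumptions, not the actual TinyLlama hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square normalization: no mean subtraction, no bias term."""
    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Scale by the reciprocal RMS of the features, then apply a learned gain.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)

class SwiGLUFeedForward(nn.Module):
    """Gated feed-forward block: down(silu(gate(x)) * up(x))."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden_dim, bias=False)
        self.up = nn.Linear(dim, hidden_dim, bias=False)
        self.down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))

# Illustrative sizes only, not the real model dimensions.
x = torch.randn(2, 16, 512)
y = SwiGLUFeedForward(512, 1376)(RMSNorm(512)(x))
print(y.shape)  # torch.Size([2, 16, 512])
```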
**Speed Optimization**
To improve training speed, TinyLlama integrates FSDP (fully sharded data parallelism), FlashAttention, fused layernorm, fused cross-entropy loss, and fused rotary positional embedding kernels. With these optimizations, training reaches a throughput of roughly 24,000 tokens per second per A100-40G GPU, substantially more efficient than existing models of similar size.
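This summary does not show the authors' exact kernel integration, but the effect of FlashAttention can be illustrated with PyTorch's built-in fused attention, which dispatches to a FlashAttention-style kernel on supported GPUs. The tensor shapes below are arbitrary examples.

```python
import torch
import torch.nn.functional as F

def causal_self_attention(q, k, v):
    # q, k, v: (batch, n_heads, seq_len, head_dim)
    # The fused scaled-dot-product attention avoids materializing the full
    # (seq_len x seq_len) attention matrix when a flash kernel is available.
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if device == "cuda" else torch.float32
q = k = v = torch.randn(1, 8, 2048, 64, device=device, dtype=dtype)
out = causal_self_attention(q, k, v)
print(out.shape)  # torch.Size([1, 8, 2048, 64])
```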
**Training**
TinyLlama is trained with an autoregressive language-modeling objective, using the AdamW optimizer, a cosine learning-rate schedule, and a batch size of 2M tokens. Training proceeds in three phases: basic pre-training, continual pre-training on domain-specific data, and a cooldown phase that improves model convergence.
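A minimal sketch of this optimization setup is shown below. Only AdamW, the cosine schedule, and the 2M-token batch size come from the summary; the warmup length, the peak and minimum learning rates, and the single-schedule step count are assumed values for illustration (the real run also spans the continual pre-training and cooldown phases).

```python
import math
import torch

# Assumed hyperparameters; the summary specifies only AdamW, a cosine
# learning-rate schedule, and 2M-token batches.
total_steps = 3_000_000_000_000 // 2_000_000   # ~1.5M steps at 2M tokens/step
warmup_steps = 2_000                           # assumed warmup length
peak_lr, min_lr = 4e-4, 4e-5                   # assumed learning-rate range

model = torch.nn.Linear(512, 512)              # stand-in for the Transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr)

def cosine_lr(step: int) -> float:
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# At each optimizer step, the learning rate is set from the schedule.
for step in (0, warmup_steps, total_steps // 2, total_steps):
    for group in optimizer.param_groups:
        group["lr"] = cosine_lr(step)
    print(step, f"{group['lr']:.2e}")
```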
**Results**
TinyLlama is evaluated on commonsense reasoning and problem-solving benchmarks, where it outperforms the baselines on many tasks and shows stronger problem-solving skills. A variant continually pre-trained on Chinese data also achieves improved performance on Chinese understanding tasks.
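Zero-shot commonsense evaluations of this kind typically score each answer option by its log-likelihood under the model and pick the highest-scoring one. The sketch below shows that procedure with Hugging Face transformers; the checkpoint ID is one example of the publicly released models, and the question is made up for illustration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Example checkpoint ID; substitute any other released TinyLlama checkpoint.
model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

def sequence_logprob(text: str) -> float:
    """Sum of the log-probabilities the model assigns to the tokens of `text`."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = ids[:, 1:].unsqueeze(-1)
    return logprobs.gather(-1, targets).sum().item()

# Score each candidate continuation and pick the most likely one, the usual
# zero-shot setup for multiple-choice commonsense tasks.
prompt = "To put out a small grease fire in a pan, you should"
choices = [" cover the pan with a metal lid.", " pour water onto the burning oil."]
scores = [sequence_logprob(prompt + c) for c in choices]
print(choices[scores.index(max(scores))])
```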
**Conclusion**
TinyLlama is an open-source, compact 1.1B-parameter language model that achieves competitive performance against existing open-source models of similar size. All model checkpoints and code are publicly released to support further research on small language models.