21 Jun 2022 | Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, Luke Zettlemoyer
**Abstract:**
Large language models (LLMs) trained on massive text collections have shown remarkable capabilities for zero- and few-shot learning. However, their computational cost makes replication difficult without significant capital. To address this, we present Open Pre-trained Transformers (OPT), a suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters. We aim to fully and responsibly share these models with researchers. OPT-175B is comparable to GPT-3 yet was developed with only 1/7th of GPT-3's carbon footprint. We release our logbook detailing infrastructure challenges, along with the metaseq codebase, which enabled training OPT-175B on 992 80GB A100 GPUs at a per-GPU utilization of 147 TFLOP/s.
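As a rough back-of-the-envelope check (not a figure reported in the paper), and assuming BF16/FP16 training against the A100's dense tensor-core peak of 312 TFLOP/s, the reported throughput corresponds to roughly:

$$
\frac{147\ \text{TFLOP/s}}{312\ \text{TFLOP/s}} \approx 0.47,
\qquad
992 \times 147\ \text{TFLOP/s} \approx 1.46 \times 10^{17}\ \text{FLOP/s aggregate,}
$$

i.e. about 47% of hardware peak sustained per GPU across the full cluster.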
**Introduction:**
Large language models have shown surprising capabilities in zero- and few-shot learning. However, these models are typically accessible only through paid APIs, and their full weights are available to just a few well-resourced labs, which hinders research into how and why they work. OPT aims to address this by providing a suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters. We train these models to match GPT-3 in performance and size while applying best practices in data collection and training efficiency. Our goal is to enable reproducible and responsible research at scale.
**Methods:**
We present results on eight Transformer language models, detailing their architectures and training setup. We largely follow GPT-3's setup, with variations in batch size chosen for computational efficiency. We report performance on 16 standard NLP tasks, comparing primarily to GPT-3, and additionally evaluate OPT-175B on dialogue datasets and assess its bias, toxicity, and hate speech detection capabilities.
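To make the evaluation setup concrete, here is a minimal sketch of zero-shot generation with one of the smaller released models. It assumes the Hugging Face `transformers` checkpoints (e.g. `facebook/opt-125m`) as a convenient stand-in; the paper's own training and inference code lives in the metaseq repository.

```python
# Minimal zero-shot generation sketch (assumes the Hugging Face
# "facebook/opt-125m" checkpoint; the paper itself trains and serves
# models through the metaseq codebase).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")

# Greedy decoding; a few-shot evaluation would simply prepend
# in-context examples to the prompt.
outputs = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```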
**Evaluations:**
OPT-175B performs similarly to GPT-3 on standard evaluation datasets on average, though results vary considerably from task to task. In dialogue tasks, OPT-175B outperforms the unsupervised Reddit 2.7B model and performs competitively with the fully supervised BlenderBot 1 model. However, it shows higher toxicity rates than GPT-3 and exhibits more stereotypical biases in several categories.
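To illustrate the flavour of such a toxicity evaluation, here is a minimal sketch that samples continuations and scores them with the open-source Detoxify classifier. The prompt and the choice of classifier are stand-ins for illustration only, not the benchmarks or tools used in the paper's evaluation.

```python
# Sketch of toxicity scoring for model continuations. Detoxify is used
# purely as an illustrative open-source classifier here; it is not the
# classifier used in the paper's evaluation.
from transformers import AutoModelForCausalLM, AutoTokenizer
from detoxify import Detoxify

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
scorer = Detoxify("original")

prompts = ["Honestly, I'm starting to think he is a"]  # hypothetical prompt
for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=20, do_sample=True, top_p=0.9)
    # Keep only the newly generated tokens, then score them for toxicity.
    new_tokens = out[0][inputs["input_ids"].shape[1]:]
    continuation = tokenizer.decode(new_tokens, skip_special_tokens=True)
    print(f"{scorer.predict(continuation)['toxicity']:.3f}  {continuation!r}")
```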
**Limitations:**
OPT-175B suffers from limitations common to other LLMs, such as poor handling of declarative instructions and repetitive generation. It also produces factually incorrect statements and has a high propensity to generate toxic language. Future work should focus on improving instruction following, reducing repetition, and enhancing factual correctness.
**Considerations for Release:**
We disclose all details of OPT-175B's training process, including infrastructure failures and human overhead. We aim to increase transparency and encourage responsible AI research. We provide researchers with access to OPT-175B and smaller baselines, emphasizing the importance of responsible data usage and ethical considerations in LLM development.