6 Jun 2024 | Guijin Son, Sangwon Baek, Sangdae Nam, Igyun Jeong, Seungone Kim
This paper investigates whether large language models (LLMs) can handle multiple instructions simultaneously, a capability referred to as MULTI-TASK INFERENCE. The authors introduce the MTI BENCH, a comprehensive benchmark consisting of 5,000 instances across 25 tasks, each involving 2 to 3 sub-tasks. The benchmark is divided into two subsets: MULTI-STEP, which evaluates models' ability to follow sequential instructions, and MULTI-PART, which assesses models' ability to handle independent sub-tasks.
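To make the two settings concrete, the sketch below shows how a single prompt could bundle the sub-tasks of one instance. The helper name and the example instructions are illustrative assumptions, not the paper's actual templates.

```python
# A minimal sketch (not the authors' released code) of bundling the sub-tasks
# of one MTI-Bench-style instance into a single prompt. Example texts and the
# helper name are hypothetical.

def build_multi_task_prompt(instructions, context=""):
    """Concatenate all sub-task instructions into one prompt for a single LLM call."""
    numbered = "\n".join(f"Task {i + 1}: {inst}" for i, inst in enumerate(instructions))
    header = f"{context}\n\n" if context else ""
    return f"{header}{numbered}\n\nAnswer each task in order, labeling each answer."

# MULTI-STEP: the second sub-task depends on the answer to the first.
multi_step_prompt = build_multi_task_prompt(
    ["Extract every person's name from the passage.",
     "Classify each extracted name as real or fictional."],
    context="Passage: ...",
)

# MULTI-PART: sub-tasks share the context but can be answered independently.
multi_part_prompt = build_multi_task_prompt(
    ["Summarize the passage in one sentence.",
     "List three keywords that describe the passage."],
    context="Passage: ...",
)
```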
The study compares three inference methods: SINGLE-TASK INFERENCE, BATCH PROMPTING, and MULTI-TASK INFERENCE. Results show that MULTI-TASK INFERENCE reduces inference time by an average factor of 1.46 and improves performance by up to 7.3% for LLAMA-2-CHAT-70B and 12.4% for GPT-4 compared to SINGLE-TASK INFERENCE. The experiments also show that MULTI-TASK INFERENCE is faster than BATCH PROMPTING.
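The speed difference largely comes down to how many model calls each method issues per instance. The sketch below contrasts the three strategies from that angle; `call_llm` is a placeholder for any chat-completion API, and the prompt formats are assumptions rather than the paper's exact templates.

```python
# Illustrative comparison of the three methods by number of LLM calls.
# `call_llm` stands in for any chat-completion API; prompts are simplified.

def single_task_inference(sub_tasks, call_llm):
    # One call per sub-task: N sub-tasks of an instance -> N calls.
    return [call_llm(task) for task in sub_tasks]

def batch_prompting(task_instances, call_llm):
    # Pack the same kind of sub-task from many instances into one prompt:
    # fewer calls overall, but sub-tasks of one instance stay separated.
    joined = "\n\n".join(f"[{i + 1}] {t}" for i, t in enumerate(task_instances))
    return call_llm(joined)

def multi_task_inference(sub_tasks, call_llm):
    # All sub-tasks of one instance in a single prompt: one call per instance,
    # and the model sees the full context shared across sub-tasks.
    joined = "\n\n".join(f"Task {i + 1}: {t}" for i, t in enumerate(sub_tasks))
    return call_llm(joined)
```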
The authors evaluate 11 LLMs, including GPT-4, GPT-3.5, TULU, VICUNA, and LLAMA-2-CHAT. They find that MULTI-TASK INFERENCE significantly improves performance on the MTI BENCH, especially for larger models. The benchmark also includes a FREE-FORM GENERATION subset, which evaluates models' ability to generate outputs in various formats. On this subset, smaller open-source models perform better with SINGLE-TASK INFERENCE, while larger models perform similarly under either method.
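Because MULTI-TASK INFERENCE returns all answers in one response, evaluation needs to split that response back into per-sub-task answers. The sketch below shows one hypothetical way to do so under the "Task N:" labeling convention assumed above; it is not the paper's evaluation code, and free-form outputs naturally make this parsing less reliable than fixed answer formats.

```python
import re

# Hypothetical post-processing: split one multi-task response into
# per-sub-task answers by the "Task N:" labels requested in the prompt.

def split_multi_task_answer(response: str, n_tasks: int) -> list[str]:
    parts = re.split(r"(?:^|\n)\s*Task\s*(\d+)\s*[:.]", response)
    # re.split with one capturing group yields [pre, "1", text1, "2", text2, ...]
    answers = [""] * n_tasks
    for idx, text in zip(parts[1::2], parts[2::2]):
        i = int(idx) - 1
        if 0 <= i < n_tasks:
            answers[i] = text.strip()
    return answers

# Example: two labeled answers recovered from a single response.
print(split_multi_task_answer("Task 1: Apple\nTask 2: Banana", 2))
```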
The study also conducts ablation experiments to analyze the effectiveness of MULTI-TASK INFERENCE. Results indicate that providing additional input components, such as the second instruction and context, improves performance on the first sub-task. Qualitative analysis reveals that MULTI-TASK INFERENCE enables models to utilize broader context and plan their solutions more effectively.
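A rough sketch of that ablation idea, assuming hypothetical prompt wording: the first sub-task is asked either on its own or alongside the context and the later instruction, and only the answer to the first sub-task is scored.

```python
# Hypothetical ablation prompts: does showing later input components help Task 1?
# Wording is illustrative, not the paper's actual templates.

def prompt_first_task_only(instruction_1: str) -> str:
    return f"Task 1: {instruction_1}"

def prompt_with_context(instruction_1: str, context: str) -> str:
    return f"{context}\n\nTask 1: {instruction_1}"

def prompt_with_second_instruction(instruction_1: str, instruction_2: str, context: str) -> str:
    # The model also sees what it will be asked next; only its Task 1 answer is scored.
    return (f"{context}\n\nTask 1: {instruction_1}\nTask 2: {instruction_2}\n\n"
            "Answer Task 1 only.")
```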
The authors conclude that MULTI-TASK INFERENCE is an efficient method for handling concurrent tasks, outperforming SINGLE-TASK INFERENCE in both speed and performance. However, they also note limitations, including the benchmark's English-only coverage and the need for further research on model evaluation and generalization.