30 May 2024 | Zekun Li¹, Zhiyu Zoey Chen², Mike Ross³, Patrick Huber³, Seungwhan Moon³, Zhaojiang Lin³, Xin Luna Dong³, Adithya Sagar³, Xifeng Yan¹, and Paul A. Crook³
This paper proposes FnCTOD, a novel approach to zero-shot dialogue state tracking (DST) with large language models (LLMs). The method casts DST as function calling: each task domain is treated as a function, and the slot values of the dialogue state are its arguments. DST is integrated into the assistant's output during chat completion, so the model generates a function call alongside its conversational response. This formulation achieves strong performance with both open-source and proprietary LLMs, surpassing previous state-of-the-art (SOTA) results. Fine-tuning on a small collection of task-oriented dialogues further enables models such as LLaMA2-Chat to match ChatGPT's DST performance while retaining their general chat capabilities. On the MultiWOZ benchmark, the approach delivers significant gains in joint goal accuracy (JGA) over previous prompting approaches: it outperforms the previous SOTA by 5.6% in average JGA and improves ChatGPT's performance by 4.8% for GPT-3.5 and 14% for GPT-4.
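To make the formulation concrete, here is a minimal sketch of DST as function calling: a domain expressed as a function specification, and the dialogue state emitted as a function call inside the assistant's turn. The function name (`find_hotel`), schema fields, and `<fn_call>` tag format are illustrative assumptions, not the paper's exact prompt.

```python
# Minimal sketch of DST-as-function-calling (illustrative, not the paper's exact format).
import json

# Hypothetical function specification for a MultiWOZ-style "hotel" domain:
# the domain becomes a function, and its slots become the function's arguments.
hotel_spec = {
    "name": "find_hotel",
    "description": "Search for a hotel matching the user's constraints.",
    "parameters": {
        "type": "object",
        "properties": {
            "area": {"type": "string", "enum": ["north", "south", "east", "west", "centre"]},
            "pricerange": {"type": "string", "enum": ["cheap", "moderate", "expensive"]},
            "stars": {"type": "string", "description": "Star rating of the hotel."},
        },
    },
}

# The spec is placed in the prompt, and the model is asked to emit a
# function call before its natural-language response.
system_prompt = (
    "You are a task-oriented assistant. Before replying, output the current "
    "dialogue state as a function call wrapped in <fn_call> tags.\n"
    f"Available function:\n{json.dumps(hotel_spec, indent=2)}"
)

# Example assistant turn: the dialogue state (function call) followed by the response.
assistant_turn = (
    '<fn_call>{"name": "find_hotel", '
    '"arguments": {"area": "centre", "pricerange": "cheap"}}</fn_call> '
    "I found several cheap hotels in the centre. Do you have a star preference?"
)

def parse_state(turn: str) -> dict:
    """Extract the function-call JSON (the dialogue state) from an assistant turn."""
    start = turn.index("<fn_call>") + len("<fn_call>")
    end = turn.index("</fn_call>")
    return json.loads(turn[start:end])

print(parse_state(assistant_turn))
# {'name': 'find_hotel', 'arguments': {'area': 'centre', 'pricerange': 'cheap'}}
```

Parsing the call out of the turn yields the dialogue state directly, which is what lets chat-tuned models perform DST without a separate tracking module.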
Beyond DST, the method supports end-to-end task-oriented dialogue (TOD) evaluation, generating both dialogue states and responses within the assistant's output. Experiments on a range of open-source models show that the approach narrows the gap between open-source and proprietary LLMs. Ablation studies demonstrate that function call decomposition and the inclusion of function specifications in the prompt significantly improve performance, and that performance improves as the number of in-context examples increases. Experiments on fine-tuning data size indicate that as few as 200 samples per domain suffice to fine-tune a model to match ChatGPT's zero-shot DST performance. The paper concludes that FnCTOD is a significant advance in DST and TOD, enabling LLMs to handle both general conversations and task-oriented dialogues across diverse domains without additional data collection.
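For reference, joint goal accuracy counts a turn as correct only when the predicted dialogue state matches the gold state exactly. Below is a minimal sketch of the metric, assuming each turn's state is represented as a flat dictionary mapping (domain, slot) pairs to values; real evaluations also apply value normalization (casing, aliases) that is omitted here.

```python
# Minimal JGA sketch under the flat-dictionary assumption above.
def joint_goal_accuracy(predictions: list[dict], golds: list[dict]) -> float:
    """Fraction of turns whose predicted state matches the gold state exactly."""
    assert len(predictions) == len(golds)
    correct = sum(pred == gold for pred, gold in zip(predictions, golds))
    return correct / len(golds)

preds = [
    {("hotel", "area"): "centre"},
    {("hotel", "area"): "centre", ("hotel", "pricerange"): "cheap"},
]
golds = [
    {("hotel", "area"): "centre"},
    {("hotel", "area"): "centre", ("hotel", "pricerange"): "moderate"},
]
print(joint_goal_accuracy(preds, golds))  # 0.5: one wrong slot value fails the whole turn
```

Because a single wrong or missing slot value fails the entire turn, JGA is a strict metric, which makes the reported 5.6% average gain over the previous SOTA substantial.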