FOFO: A Benchmark to Evaluate LLMs' Format-Following Capability


28 Feb 2024 | Congying Xia¹*, Chen Xing¹*, Jiangshu Du², Xinyi Yang¹, Yihao Feng¹, Ran Xu¹, Wenpeng Yin³, Caiming Xiong¹
This paper introduces FOFO, a pioneering benchmark designed to evaluate large language models' (LLMs) ability to follow complex, domain-specific formats. The benchmark addresses a gap in existing evaluations, which rarely assess format-following capability directly. FOFO is built through an AI-human collaborative method and covers a wide range of real-world, domain-specific formats and instructions. Evaluation across both open-source and closed-source LLMs yields three key findings: open-source models lag significantly behind closed-source ones in format adherence; a model's format-following performance is independent of its content-generation quality; and format proficiency varies across domains. These insights suggest the need for specialized tuning to enhance format-following skills and highlight FOFO's role in guiding the selection of domain-specific AI agents. The FOFO dataset is publicly released to facilitate further research and development in this area.
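To make the format-vs-content distinction concrete, here is a minimal, hypothetical sketch of a format-adherence check: it verifies that a model's output is a JSON record with a set of required keys, regardless of whether the content is factually correct. The schema and helper function are illustrative assumptions, not part of the FOFO dataset or its evaluation pipeline, which is more involved than a rule-based check.

```python
import json

# Hypothetical format requirement: output must be a JSON object with these
# keys. The schema is invented for illustration, not taken from FOFO.
REQUIRED_KEYS = {"patient_id", "diagnosis", "icd10_code"}

def follows_format(model_output: str) -> bool:
    """Return True if the output is valid JSON containing the required keys."""
    try:
        record = json.loads(model_output)
    except json.JSONDecodeError:
        return False
    return isinstance(record, dict) and REQUIRED_KEYS <= record.keys()

# A response can be content-correct yet format-wrong, and vice versa:
good_format = '{"patient_id": "A12", "diagnosis": "flu", "icd10_code": "J11.1"}'
bad_format = "Patient A12 has the flu (ICD-10 J11.1)."
print(follows_format(good_format))  # True  (format followed)
print(follows_format(bad_format))   # False (same information, wrong format)
```

The two example responses carry the same information; only the first satisfies the format constraint, which is exactly the dimension FOFO isolates from content quality.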