23 Jan 2024 | Zhiruo Wang, Graham Neubig, Daniel Fried
**Abstract:**
Language models (LMs) can solve tasks such as answering questions about tables or images by writing programs. However, using only primitive functions often leads to verbose and error-prone programs, while higher-level functions require expert design. To enable better solutions without human labor, we use code LMs to curate reusable high-level functions and write solutions with them. We present TROVE, a training-free method for inducing a verifiable and efficient toolbox of functions by generating, using, growing, and periodically trimming the toolbox. On 11 datasets spanning math, table question answering, and image reasoning tasks, TROVE consistently yields simpler solutions with higher accuracy than baselines using CODELLAMA and previous methods using GPT, while using 79-98% smaller toolboxes. TROVE further enables 31% faster and 13% more accurate human verification than baselines. With the same pipeline, it creates diverse functions for varied tasks and datasets, providing insights into their individual characteristics.
**Introduction:**
Generating code from natural language commands has long served as a way to solve tasks such as question answering and agent navigation. Language models (LMs) can write programs in general-purpose languages, which has broadened the applicability of code generation. However, relying only on primitive functions can lead to complex and error-prone programs; human developers address this by creating application-specific functions. Recent works have attempted to use LMs to automatically induce such tools, but existing methods tend to produce large, complex toolboxes or require additional training and validation datasets.
**TrOVE:**
TROVE is a training-free method that induces a verifiable and efficient function toolbox: it uses and grows a shared function library over time, selects the best candidate solution by execution agreement across sampled programs, and periodically trims low-utility functions. Because examples are processed in a streaming fashion, the method runs in time linear in the number of examples. TROVE is evaluated on 11 datasets from math, table question answering, and image reasoning tasks, achieving higher accuracy and lower solution complexity than baselines while maintaining a much smaller function library.
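To make the streaming loop concrete, below is a minimal Python sketch of a TROVE-style procedure. It is not the authors' implementation: `generate` (samples candidate programs from a code LM under a given mode) and `execute` (runs a program and returns its output, or `None` on failure) are hypothetical helpers, and the exact prompting modes, agreement rule, and trimming schedule in the paper may differ.

```python
import ast
from collections import Counter

def trove_loop(examples, generate, execute, trim_every=500, keep_top_k=10):
    """Sketch of a TROVE-style streaming loop (hypothetical helpers assumed)."""
    toolbox = {}    # function name -> [source code, usage count]
    solutions = []

    for step, example in enumerate(examples, start=1):
        # 1. Sample candidate programs under different modes: reuse existing
        #    toolbox functions, create new ones, or use primitives only.
        candidates = []
        for mode in ("import", "create", "skip"):
            candidates.extend(generate(example, toolbox, mode=mode))

        # 2. Execute candidates and keep those that run without errors.
        executed = [(prog, execute(prog, toolbox)) for prog in candidates]
        executed = [(prog, out) for prog, out in executed if out is not None]
        if not executed:
            solutions.append(None)
            continue

        # 3. Agreement-based selection: pick the program whose output matches
        #    the most other candidates (ties broken by shorter program).
        #    Assumes outputs are hashable (e.g., strings or numbers).
        votes = Counter(out for _, out in executed)
        best_prog, best_out = min(
            executed, key=lambda po: (-votes[po[1]], len(po[0]))
        )
        solutions.append((best_prog, best_out))

        # 4. Grow the toolbox: harvest function definitions from the chosen
        #    program, and bump usage counts for functions it mentions
        #    (a rough text-match heuristic for illustration).
        for node in ast.walk(ast.parse(best_prog)):
            if isinstance(node, ast.FunctionDef) and node.name not in toolbox:
                toolbox[node.name] = [ast.unparse(node), 0]
        for name in toolbox:
            if name in best_prog:
                toolbox[name][1] += 1

        # 5. Periodically trim low-utility functions to keep the toolbox small.
        if step % trim_every == 0:
            keep = sorted(toolbox, key=lambda n: toolbox[n][1],
                          reverse=True)[:keep_top_k]
            toolbox = {n: toolbox[n] for n in keep}

    return toolbox, solutions
```

In this sketch, trimming by raw usage count stands in for the paper's utility-based pruning; the key design point is that the toolbox is built and verified purely by executing the LM's own programs, with no extra training or validation data.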
**Experiments:**
TROVE is compared against two baselines: PRIMITIVE, which writes solutions using only primitive functions, and INSTANCE, which abstracts new functions separately for each example without maintaining a shared library. TROVE consistently outperforms both in answer correctness, solution simplicity, and toolbox size. It also makes human verification more efficient: verifying TROVE solutions is 31% faster and 13% more accurate than verifying baseline solutions.
**Conclusion:**
TROVE effectively induces a toolbox of reusable functions for solving programmatic tasks, producing simpler and more accurate solutions with smaller function libraries. It also enhances human verification efficiency and provides insights into task-specific characteristics.