**AutoWebGLM: Bootstrap and Reinforce a Large Language Model-based Web Navigating Agent**
**Authors:** Hanyu Lai, Xiao Liu, Iat Long Iong, Shuntian Yao, Yuxuan Chen, Pengbo Shen, Hao Yu, Hanchen Zhang, Xiaohan Zhang, Yuxiao Dong, Jie Tang
**Institution:** Zhipu AI, Tsinghua University
**Abstract:**
Large language models (LLMs) have significantly advanced intelligent agent tasks such as web navigation, but existing agents often fall short on real-world webpages because of the versatility of actions, the complexity of HTML text, and the open-domain nature of the web. To address these challenges, the authors develop AutoWebGLM, an automated web navigation agent that outperforms GPT-4. Inspired by human browsing patterns, they design an HTML simplification algorithm that represents webpages succinctly while preserving essential information. They use a hybrid human-AI method to build web browsing data for curriculum training, then apply reinforcement learning and rejection sampling to strengthen the model's webpage comprehension, browser operation, and task decomposition. The authors also establish a bilingual benchmark, AutoWebBench, and evaluate AutoWebGLM on it and on other web navigation benchmarks. The results show improvements but also highlight unresolved challenges in real-world environments.
**Contributions:**
- Development of AutoWebGLM, a deployable web browsing agent based on ChatGLM3-6B.
- Construction of a real-world web browsing operation dataset of approximately 10,000 traces, including a bilingual (English and Chinese) benchmark, AutoWebBench.
- Evaluation of AutoWebGLM on diverse web navigation benchmarks, demonstrating its performance improvements and practical usability.
**Methods:**
- **Problem Setup:** Web browsing tasks are treated as sequential decision-making processes, with states that include the current page status and actions such as clicking, scrolling, and typing.
- **AutoWebGLM Framework:** The system processes information through HTML simplification and OCR modules, marking operable elements for interaction. The observation space includes HTML, current position, and past operation records.
- **Data Preparation:** A hybrid human-AI method is used to create training data, addressing challenges such as task collection, privacy, and objective annotation.
- **Training:** The model is trained through curriculum learning, reinforcement learning, and rejection sampling fine-tuning to enhance its web browsing capabilities.
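
The problem setup and observation space described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the set of operable tags, the `[id] <tag> text` line format, and the names `simplify_html`, `Observation`, and `Action` are all assumptions made for illustration, and the OCR module is omitted. The sketch keeps only elements an agent could act on, gives each a numeric id, and bundles the simplified page with the current position and the record of past operations.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import List, Optional

# Assumption: which tags count as "operable" is chosen here for illustration.
OPERABLE_TAGS = {"a", "button", "input", "select", "textarea"}
VOID_TAGS = {"input"}  # operable tags that never have a closing tag


class OperableElementMarker(HTMLParser):
    """Collects operable elements and assigns each a numeric id."""

    def __init__(self):
        super().__init__()
        self.elements = []  # dicts with keys: id, tag, text
        self._open = []     # operable elements still waiting for their end tag

    def handle_starttag(self, tag, attrs):
        if tag not in OPERABLE_TAGS:
            return
        attrs = dict(attrs)
        # Use a descriptive attribute as the label when one is present.
        label = attrs.get("aria-label") or attrs.get("placeholder") or attrs.get("value") or ""
        element = {"id": len(self.elements), "tag": tag, "text": label}
        self.elements.append(element)
        if tag not in VOID_TAGS:
            self._open.append(element)

    def handle_endtag(self, tag):
        if self._open and self._open[-1]["tag"] == tag:
            self._open.pop()

    def handle_data(self, data):
        text = data.strip()
        if self._open and text:
            element = self._open[-1]
            element["text"] = (element["text"] + " " + text).strip()


def simplify_html(html: str) -> str:
    """Return one line per operable element, dropping everything else."""
    marker = OperableElementMarker()
    marker.feed(html)
    marker.close()
    return "\n".join(
        f'[{e["id"]}] <{e["tag"]}> {e["text"]}'.rstrip() for e in marker.elements
    )


@dataclass
class Action:
    kind: str                       # "click", "scroll", or "type"
    element_id: Optional[int] = None
    text: str = ""                  # typed text, or scroll direction

    def describe(self) -> str:
        if self.kind == "type":
            return f'type({self.element_id}, "{self.text}")'
        if self.kind == "scroll":
            return f"scroll({self.text})"
        return f"{self.kind}({self.element_id})"


@dataclass
class Observation:
    task: str
    simplified_html: str
    position: str                   # assumed to be a short textual description
    history: List[str] = field(default_factory=list)

    def to_prompt(self) -> str:
        past = "\n".join(self.history) or "(none)"
        return (
            f"Task: {self.task}\nPosition: {self.position}\n"
            f"History:\n{past}\nPage:\n{self.simplified_html}"
        )


page = '<div><p>Welcome</p><input placeholder="Search"><button>Go</button><a href="/about">About us</a></div>'
obs = Observation(task="Open the about page", simplified_html=simplify_html(page), position="top")
obs.history.append(Action("click", element_id=2).describe())
print(obs.to_prompt())
```

In this toy page the paragraph text is dropped and only the search box, the button, and the link remain, each addressable by id. A real simplifier would need to handle visibility, nesting, and far larger pages, which this sketch does not attempt.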
**Experiments:**
- **Main Results:** AutoWebGLM outperforms other agents on AutoWebBench and other benchmarks, showing strong performance in predicting user operations.
- **Ablation Study:** Different stages of data and training strategies are evaluated, highlighting the importance of complex task data and reinforcement learning.
- **Case Study and Error Analysis:** The system's effectiveness is assessed through case studies, identifying limitations such as hallucinations, poor graphical recognition,