AST-T5: Structure-Aware Pretraining for Code Generation and Understanding


2024 | Linyuan Gong, Mostafa Elhoushi, Alvin Cheung
AST-T5 is a novel pretraining paradigm that leverages the Abstract Syntax Tree (AST) to enhance code generation, transpilation, and understanding. Unlike models that treat code as a plain token sequence, AST-T5 uses dynamic programming for AST-aware segmentation and an AST-aware span corruption objective, preserving code structure and training the model to reconstruct it. The approach requires no complex program analyses or architectural changes, so it integrates seamlessly with any encoder-decoder Transformer. Evaluations show that AST-T5 consistently outperforms similar-sized language models across code-related tasks, including HumanEval and MBPP. Structure-awareness makes AST-T5 particularly strong in code-to-code tasks: it surpasses CodeT5 by 2 exact-match points on Bugs2Fix and by 3 exact-match points on Java-C# transpilation in CodeXGLUE. AST-T5 is publicly available at https://github.com/gonglinyuan/ast_t5.

The model uses Tree-sitter, a lightweight multi-language parser, to parse code into ASTs, then applies dynamic-programming-based segmentation that keeps segments aligned with the tree so structural integrity is preserved. AST-Aware Span Corruption then pretrains the model to reconstruct masked code structures, enhancing both flexibility and structure-awareness. Because AST-T5 introduces no architecture changes or additional heads and keeps the same pretraining objective as vanilla T5, it can serve as a drop-in replacement for any T5 variant. The sketches below illustrate the two pretraining components.
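First, a minimal illustration of the dynamic-programming segmentation idea, assuming a per-boundary penalty derived from the AST (loosely, "how many subtrees would this cut break?"). The function name, cost definition, and example values are illustrative, not the paper's implementation:

```python
def ast_aware_segment(split_cost, n_tokens, max_len):
    """DP segmentation sketch (illustrative, not the paper's code).

    split_cost[i] is the penalty for cutting between token i-1 and token i,
    e.g. the number of AST subtrees such a cut would break apart.  The DP
    chooses cut points so every segment has at most max_len tokens and the
    total penalty of all chosen cuts is minimal.  O(n_tokens * max_len).
    """
    INF = float("inf")
    best = [INF] * (n_tokens + 1)   # best[i]: minimal cost to segment tokens[:i]
    back = [-1] * (n_tokens + 1)    # back-pointers to recover the cut positions
    best[0] = 0
    for i in range(1, n_tokens + 1):
        # The segment ending at i must start at some j with i - j <= max_len.
        for j in range(max(0, i - max_len), i):
            if best[j] == INF:
                continue
            cut = 0 if i == n_tokens else split_cost[i]  # the final boundary is free
            if best[j] + cut < best[i]:
                best[i] = best[j] + cut
                back[i] = j
    # Walk the back-pointers to list where each segment ends.
    ends, i = [], n_tokens
    while i > 0:
        ends.append(i)
        i = back[i]
    return sorted(ends)


# Example: 10 tokens, expensive cuts (cost 3) inside a subtree covering tokens 3..6,
# segments of at most 4 tokens -> the DP cuts around the subtree instead of through it.
costs = [0, 1, 1, 1, 3, 3, 3, 1, 1, 1]
print(ast_aware_segment(costs, n_tokens=10, max_len=4))  # -> [3, 7, 10]
```

Second, a sketch of AST-aware span corruption under two stated simplifications: the paper parses with Tree-sitter, whereas this sketch uses Python's built-in ast module to stay dependency-free, and the masking policy (greedy selection of non-overlapping subtrees up to a character budget) is an approximation rather than a reimplementation of the paper's procedure:

```python
import ast
import random


def subtree_spans(source):
    """Collect (start, end) character offsets of AST subtrees in `source`.

    Offsets are treated as character offsets, which is exact for ASCII sources.
    """
    line_start, pos = [], 0
    for line in source.splitlines(keepends=True):
        line_start.append(pos)
        pos += len(line)
    spans = []
    for node in ast.walk(ast.parse(source)):
        if getattr(node, "end_lineno", None) is None:
            continue
        start = line_start[node.lineno - 1] + node.col_offset
        end = line_start[node.end_lineno - 1] + node.end_col_offset
        if end > start:
            spans.append((start, end))
    return spans


def ast_aware_corrupt(source, mask_ratio=0.25, seed=0):
    """Greedy sketch of AST-aware span corruption: mask whole, non-overlapping
    subtrees until roughly mask_ratio of the characters are hidden, then emit a
    T5-style (input, target) pair using sentinel tokens."""
    rng = random.Random(seed)
    spans = subtree_spans(source)
    rng.shuffle(spans)

    budget, used, chosen = int(len(source) * mask_ratio), 0, []
    for start, end in spans:
        if end - start > budget - used:                    # too big for the remaining budget
            continue
        if any(start < e and s < end for s, e in chosen):  # keep chosen spans disjoint
            continue
        chosen.append((start, end))
        used += end - start
    chosen.sort()

    inp, tgt, cursor = [], [], 0
    for k, (start, end) in enumerate(chosen):
        sentinel = f"<extra_id_{k}>"
        inp += [source[cursor:start], sentinel]   # replace the subtree with a sentinel
        tgt += [sentinel, source[start:end]]      # the decoder must reproduce the subtree
        cursor = end
    inp.append(source[cursor:])
    tgt.append(f"<extra_id_{len(chosen)}>")       # closing sentinel, as in T5
    return "".join(inp), "".join(tgt)


masked, target = ast_aware_corrupt("def add(a, b):\n    return a + b\n")
print(masked)
print(target)
```

Both pieces plug into an otherwise ordinary T5 data pipeline: segmentation decides where training examples are cut, and span corruption turns each segment into an (input, target) pair with the usual sentinel tokens.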
AST-T5 consistently outperforms baselines in code generation, transpilation, and understanding, and controlled experiments attribute these gains specifically to the AST-aware pretraining techniques. It not only outperforms similar-sized models such as CodeT5 and CodeT5+ but also remains competitive with, and occasionally exceeds, much larger models. Its inherent AST-awareness offers distinct advantages in structure-sensitive tasks such as code-to-code transpilation and Clone Detection, highlighting its effectiveness at capturing the structural nuances of code.

The model is evaluated on three types of tasks: text-to-code generation, code-to-code transpilation, and code understanding, and is benchmarked against existing models, including decoder-only models such as GPT variants and encoder-decoder models such as PLBART, CodeT5, and StructCoder. AST-T5 excels in code transpilation, with significant improvements on Bugs2Fix and Java-C# transpilation. The model is parameter-efficient and adaptable, making it suitable for real-world deployment. The study concludes that AST-T5 is a promising approach for code-centric language models, leveraging AST structure to enhance code generation, transpilation, and understanding.
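Because AST-T5 keeps the vanilla T5 architecture and pretraining objective, a released checkpoint should load through the standard Hugging Face seq2seq interface. The checkpoint identifier below is an assumption for illustration only; check the repository linked above for the actual released weights and any additional loading flags they may require:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Hypothetical checkpoint name -- see https://github.com/gonglinyuan/ast_t5
# for the actual released weights and loading instructions.
MODEL_ID = "gonglinyuan/ast_t5_base"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID)

# Drop-in T5-style usage: text-to-code generation from a natural-language prompt.
prompt = "Write a Python function that returns the n-th Fibonacci number."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```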