AST-T5: Structure-Aware Pretraining for Code Generation and Understanding


2024 | Linyuan Gong, Mostafa Elhoushi, Alvin Cheung
AST-T5 is a novel pretraining paradigm that leverages the Abstract Syntax Tree (AST) to enhance code generation, transpilation, and understanding. Unlike models that treat code as a plain token sequence, AST-T5 uses dynamic programming for AST-aware segmentation and an AST-aware span corruption objective, preserving code structure and training the model to reconstruct it. The approach requires no complex program analyses or architectural changes, so it integrates seamlessly with any encoder-decoder Transformer. Evaluations show that AST-T5 consistently outperforms similar-sized language models across code-related tasks, including HumanEval and MBPP. Structure-awareness makes AST-T5 particularly strong in code-to-code tasks: it surpasses CodeT5 by 2 exact-match points on Bugs2Fix and by 3 exact-match points on Java-C# transpilation in CodeXGLUE. AST-T5 is publicly available at https://github.com/gonglinyuan/ast_t5.

The model uses Tree-sitter, a lightweight multi-language parser, to parse code into ASTs, then applies dynamic-programming-based segmentation that keeps segments aligned with the tree so structural integrity is preserved. AST-Aware Span Corruption then pretrains the model to reconstruct masked code structures, enhancing both flexibility and structure-awareness. Because AST-T5 introduces no architecture changes or additional heads and keeps the same pretraining objective as vanilla T5, it can serve as a drop-in replacement for any T5 variant. The sketches below illustrate the two pretraining components.
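First, a minimal illustration of the dynamic-programming segmentation idea, assuming a per-boundary penalty derived from the AST (loosely, "how many subtrees would this cut break?"). The function name, cost definition, and example values are illustrative, not the paper's implementation:

```python
def ast_aware_segment(split_cost, n_tokens, max_len):
    """DP segmentation sketch (illustrative, not the paper's code).

    split_cost[i] is the penalty for cutting between token i-1 and token i,
    e.g. the number of AST subtrees such a cut would break apart.  The DP
    chooses cut points so every segment has at most max_len tokens and the
    total penalty of all chosen cuts is minimal.  O(n_tokens * max_len).
    """
    INF = float("inf")
    best = [INF] * (n_tokens + 1)   # best[i]: minimal cost to segment tokens[:i]
    back = [-1] * (n_tokens + 1)    # back-pointers to recover the cut positions
    best[0] = 0
    for i in range(1, n_tokens + 1):
        # The segment ending at i must start at some j with i - j <= max_len.
        for j in range(max(0, i - max_len), i):
            if best[j] == INF:
                continue
            cut = 0 if i == n_tokens else split_cost[i]  # the final boundary is free
            if best[j] + cut < best[i]:
                best[i] = best[j] + cut
                back[i] = j
    # Walk the back-pointers to list where each segment ends.
    ends, i = [], n_tokens
    while i > 0:
        ends.append(i)
        i = back[i]
    return sorted(ends)


# Example: 10 tokens, expensive cuts (cost 3) inside a subtree covering tokens 3..6,
# segments of at most 4 tokens -> the DP cuts around the subtree instead of through it.
costs = [0, 1, 1, 1, 3, 3, 3, 1, 1, 1]
print(ast_aware_segment(costs, n_tokens=10, max_len=4))  # -> [3, 7, 10]
```

Second, a sketch of AST-aware span corruption under two stated simplifications: the paper parses with Tree-sitter, whereas this sketch uses Python's built-in ast module to stay dependency-free, and the masking policy (greedy selection of non-overlapping subtrees up to a character budget) is an approximation rather than a reimplementation of the paper's procedure:

```python
import ast
import random


def subtree_spans(source):
    """Collect (start, end) character offsets of AST subtrees in `source`.

    Offsets are treated as character offsets, which is exact for ASCII sources.
    """
    line_start, pos = [], 0
    for line in source.splitlines(keepends=True):
        line_start.append(pos)
        pos += len(line)
    spans = []
    for node in ast.walk(ast.parse(source)):
        if getattr(node, "end_lineno", None) is None:
            continue
        start = line_start[node.lineno - 1] + node.col_offset
        end = line_start[node.end_lineno - 1] + node.end_col_offset
        if end > start:
            spans.append((start, end))
    return spans


def ast_aware_corrupt(source, mask_ratio=0.25, seed=0):
    """Greedy sketch of AST-aware span corruption: mask whole, non-overlapping
    subtrees until roughly mask_ratio of the characters are hidden, then emit a
    T5-style (input, target) pair using sentinel tokens."""
    rng = random.Random(seed)
    spans = subtree_spans(source)
    rng.shuffle(spans)

    budget, used, chosen = int(len(source) * mask_ratio), 0, []
    for start, end in spans:
        if end - start > budget - used:                    # too big for the remaining budget
            continue
        if any(start < e and s < end for s, e in chosen):  # keep chosen spans disjoint
            continue
        chosen.append((start, end))
        used += end - start
    chosen.sort()

    inp, tgt, cursor = [], [], 0
    for k, (start, end) in enumerate(chosen):
        sentinel = f"<extra_id_{k}>"
        inp += [source[cursor:start], sentinel]   # replace the subtree with a sentinel
        tgt += [sentinel, source[start:end]]      # the decoder must reproduce the subtree
        cursor = end
    inp.append(source[cursor:])
    tgt.append(f"<extra_id_{len(chosen)}>")       # closing sentinel, as in T5
    return "".join(inp), "".join(tgt)


masked, target = ast_aware_corrupt("def add(a, b):\n    return a + b\n")
print(masked)
print(target)
```

Both pieces plug into an otherwise ordinary T5 data pipeline: segmentation decides where training examples are cut, and span corruption turns each segment into an (input, target) pair with the usual sentinel tokens.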
AST-T5 consistently outperforms baselines in code generation, transpilation, and understanding, and controlled experiments attribute these gains specifically to the AST-aware pretraining techniques. It not only outperforms similar-sized models such as CodeT5 and CodeT5+ but also remains competitive with, and occasionally exceeds, much larger models. Its inherent AST-awareness offers distinct advantages in structure-sensitive tasks such as code-to-code transpilation and Clone Detection, highlighting its effectiveness at capturing the structural nuances of code.

The model is evaluated on three types of tasks: text-to-code generation, code-to-code transpilation, and code understanding, and is benchmarked against existing models, including decoder-only models such as GPT variants and encoder-decoder models such as PLBART, CodeT5, and StructCoder. AST-T5 excels in code transpilation, with significant improvements on Bugs2Fix and Java-C# transpilation. The model is parameter-efficient and adaptable, making it suitable for real-world deployment. The study concludes that AST-T5 is a promising approach for code-centric language models, leveraging AST structure to enhance code generation, transpilation, and understanding.
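Because AST-T5 keeps the vanilla T5 architecture and pretraining objective, a released checkpoint should load through the standard Hugging Face seq2seq interface. The checkpoint identifier below is an assumption for illustration only; check the repository linked above for the actual released weights and any additional loading flags they may require:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Hypothetical checkpoint name -- see https://github.com/gonglinyuan/ast_t5
# for the actual released weights and loading instructions.
MODEL_ID = "gonglinyuan/ast_t5_base"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID)

# Drop-in T5-style usage: text-to-code generation from a natural-language prompt.
prompt = "Write a Python function that returns the n-th Fibonacci number."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```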