June 2009 | Scott Martin, Rajakrishnan Rajkumar, and Michael White
This paper presents a system for Combinatory Categorial Grammar (CCG) grammar engineering built on Ant and XSLT. The system transforms the standard CCGbank corpus into an OpenCCG grammar. Apache Ant controls successive XSLT transformations on an XML version of the corpus, which makes it possible to add linguistic information such as Propbank roles, head lexicalization for case-marking prepositions, derivational restructuring for punctuation analysis, named entity annotation, and lemmatization. Because grammar engineering is an evolving process, the conversion and extraction steps are separated into a pipeline of XSLT transforms: XSLT is well suited to arbitrary transformations of XML trees, and Ant provides fine-grained control over the pipeline. The resulting grammar achieves state-of-the-art BLEU scores for surface realization on section 23 of the CCGbank.
The pipeline begins by generating an XML version of the CCGbank using JavaCC. Conversion and extraction transforms are then applied to produce a converted corpus and an extracted grammar. The original design was later refactored into several configurable processes implemented as Ant tasks, which simplifies process management, speeds up experiment iterations, and makes it easier to compare different grammar engineering strategies.
The implementation uses XSLT for both the conversion and extraction steps, since both OpenCCG grammars and the CCGbank translation are represented in XML. Ant is particularly well suited to this process because, like OpenCCG, it is written in Java. The system uses Ant's built-in FileSet and FileList data types to specify groups of corpus files and to reuse series of XSLT transforms. It also defines extension tasks: the first, convert, encapsulates the conversion process, while the second, extract, implements grammar extraction from a previously converted corpus.
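The idea of feeding one XSLT transform's output into the next can be illustrated with a minimal Java sketch built on the standard javax.xml.transform API. This is not the paper's convert or extract task, and it does not use Ant. The class name XsltPipeline, the element and attribute names (tree, leaf, word, pos, norm, punct), and the two toy stylesheets are hypothetical and chosen only for illustration; the real CCGbank XML format and transforms may look quite different.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.List;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical illustration of a pipeline of XSLT transforms (not the paper's code).
class XsltPipeline {

    // Shared stylesheet prologue: an identity template that copies everything unchanged.
    static final String HEADER =
        "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:template match=\"@*|node()\">"
        + "<xsl:copy><xsl:apply-templates select=\"@*|node()\"/></xsl:copy>"
        + "</xsl:template>";
    static final String FOOTER = "</xsl:stylesheet>";

    // Toy step 1 (assumed): add a lowercased copy of each leaf's word as a "norm" attribute.
    static final String ADD_NORM = HEADER
        + "<xsl:template match=\"leaf\"><xsl:copy>"
        + "<xsl:apply-templates select=\"@*\"/>"
        + "<xsl:attribute name=\"norm\"><xsl:value-of select=\"translate(@word,"
        + "'ABCDEFGHIJKLMNOPQRSTUVWXYZ','abcdefghijklmnopqrstuvwxyz')\"/></xsl:attribute>"
        + "<xsl:apply-templates select=\"node()\"/>"
        + "</xsl:copy></xsl:template>" + FOOTER;

    // Toy step 2 (assumed): flag leaves whose part of speech is "." as punctuation.
    static final String MARK_PUNCT = HEADER
        + "<xsl:template match=\"leaf[@pos='.']\"><xsl:copy>"
        + "<xsl:apply-templates select=\"@*\"/>"
        + "<xsl:attribute name=\"punct\">true</xsl:attribute>"
        + "<xsl:apply-templates select=\"node()\"/>"
        + "</xsl:copy></xsl:template>" + FOOTER;

    // Applies each stylesheet in order; the output of one step is the input of the next.
    static String applyPipeline(String xml, List<String> stylesheets) {
        TransformerFactory factory = TransformerFactory.newInstance();
        String current = xml;
        try {
            for (String stylesheet : stylesheets) {
                Transformer transformer =
                    factory.newTransformer(new StreamSource(new StringReader(stylesheet)));
                transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
                StringWriter out = new StringWriter();
                transformer.transform(
                    new StreamSource(new StringReader(current)), new StreamResult(out));
                current = out.toString();
            }
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT step failed", e);
        }
        return current;
    }

    public static void main(String[] args) {
        String input = "<tree><leaf word=\"Dogs\" pos=\"NNS\"/><leaf word=\".\" pos=\".\"/></tree>";
        // Reordering or dropping entries in this list changes the pipeline without touching the steps.
        System.out.println(applyPipeline(input, List.of(ADD_NORM, MARK_PUNCT)));
    }
}
```

Because every step is a self-contained stylesheet that copies whatever it does not touch, steps can be added, removed, or reordered independently. That is the property that makes a pipeline design convenient for comparing grammar engineering strategies.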
The system supports comprehensive experimentation and has helped facilitate recent efforts to investigate factors impacting surface realization. The system's initial results recorded 69.7% single-rooted LFs with a BLEU score of 0.5768. Current figures stand at 95.8% single-rooted LFs and a state-of-the-art BLEU score of 0.8506 on section 23 of the CCGbank. Future work will focus on increasing the number of single-rooted LFs and integrating this system with OpenCCG.