8 Mar 2024 | Sotaro Takeshita¹, Tommaso Green¹, Ines Reinig¹, Kai Eckert², Simone Paolo Ponzetto¹
ACLSum is a new dataset for aspect-based summarization of scientific publications, carefully crafted and validated by domain experts. Unlike previous datasets, ACLSum enables multi-aspect summarization of scientific papers, covering challenges, approaches, and outcomes in depth. It provides manually crafted, expert-validated summaries for both extractive and abstractive setups across the three aspects: each document is annotated with the sentences relevant to each aspect as well as with abstractive reference summaries.
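To make the annotation layout concrete, the following is a minimal sketch of what a single ACLSum record might look like; the field names and values are illustrative assumptions, not the dataset's actual schema.

```python
# Illustrative sketch of one ACLSum record (hypothetical field names).
example = {
    # The source document, split into sentences.
    "document": ["Sentence 1 of the paper ...", "Sentence 2 ...", "..."],
    # Binary labels marking which sentences are relevant to each aspect
    # (the extractive layer of the annotation).
    "extractive": {
        "challenge": [1, 0, 0],
        "approach": [0, 1, 0],
        "outcome": [0, 0, 1],
    },
    # Free-form reference summaries written by the annotators
    # (the abstractive layer of the annotation).
    "abstractive": {
        "challenge": "The paper addresses ...",
        "approach": "The authors propose ...",
        "outcome": "Experiments show ...",
    },
}
```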
ACLSum was used to evaluate different summarization strategies. First, two approaches to text summarization with pretrained language models (PLMs) were compared: (i) end-to-end summarization, where the PLM produces a summary directly from the source document, and (ii) extract-then-abstract summarization, where an extractive model first selects sentences that are then fed to the PLM to generate the summary. A unique property of the dataset is that it provides gold annotations for both aspects and summaries, enabling fine-grained analysis; this analysis shows that generative models suffer more when the relevant information is scattered across the source document and therefore requires higher levels of abstraction.
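The two strategies can be sketched with a standard sequence-to-sequence PLM from the Hugging Face transformers library; the BART checkpoint and the aspect-prefix input format below are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of the two PLM strategies, assuming a BART-style model
# and an "aspect: document" input format (both are assumptions).
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large")

def summarize(text: str, aspect: str) -> str:
    # (i) End-to-end: the PLM reads the aspect-prefixed source directly.
    inputs = tokenizer(f"{aspect}: {text}", return_tensors="pt",
                       truncation=True, max_length=1024)
    ids = model.generate(**inputs, max_new_tokens=128, num_beams=4)
    return tokenizer.decode(ids[0], skip_special_tokens=True)

def extract_then_abstract(sentences, aspect, extractor) -> str:
    # (ii) Extract-then-abstract: an extractive model first selects
    # aspect-relevant sentences, which become the abstractor's input.
    selected = [s for s in sentences if extractor(s, aspect)]
    return summarize(" ".join(selected), aspect)
```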
Second, recent large language models (LLMs) were evaluated by training and testing Llama 2 in two ways: (i) end-to-end instruction tuning, where the model is trained to produce summaries directly given an instruction and a source document as input; and (ii) extract-then-abstract, chain-of-thought-like training, where the model is first trained to generate references to the sentences in the source document that cover the aspects relevant to the summary, and then to produce the final summary from them. Third, a greedy algorithm used in previous work to induce silver-standard extractive summaries was evaluated against ground-truth annotations from human experts, revealing its low quality.
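The greedy heuristic in question follows the ROUGE-based oracle construction common in extractive summarization work: sentences are added one at a time, each time picking the sentence that most improves ROUGE against the abstractive reference, until no sentence helps. Below is a minimal sketch using the rouge-score package; the specific ROUGE variants and stopping criterion are assumptions, not necessarily those of the prior work.

```python
# Hedged sketch of the greedy silver-summary heuristic.
from rouge_score import rouge_scorer

_scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2"], use_stemmer=True)

def _rouge(selected, reference):
    # Mean ROUGE-1/ROUGE-2 F1 of the selected sentences vs. the reference.
    scores = _scorer.score(reference, " ".join(selected))
    return (scores["rouge1"].fmeasure + scores["rouge2"].fmeasure) / 2

def greedy_extract(sentences, reference, max_sents=5):
    """Greedily build a silver extractive summary for one reference."""
    selected, best = [], 0.0
    while len(selected) < max_sents:
        candidates = [s for s in sentences if s not in selected]
        if not candidates:
            break
        # Pick the sentence whose addition maximizes ROUGE.
        score, sent = max((_rouge(selected + [s], reference), s)
                          for s in candidates)
        if score <= best:  # no remaining sentence improves the score
            break
        selected.append(sent)
        best = score
    return selected
```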
The contributions of ACLSum include a new expert-annotated and validated multi-aspect summarization dataset with both extractive and abstractive summary annotations; an extensive, fine-grained evaluation of PLM systems and instruction-tuned LLMs on aspect-based summarization of scientific papers; and a benchmarking assessment of a greedy search heuristic for extractive summary generation in this domain. The dataset contains 250 documents, more than twice the 100 documents of SQuALITY. ACLSum targets the scholarly domain, where researchers must consume a steadily increasing number of papers, and its expert annotation and validation ensure high quality. The results show that end-to-end aspect-based summarization outperforms extract-then-abstract approaches, and that the greedy algorithm for inducing extractive summaries performs poorly when evaluated against ground-truth annotations. The dataset thus provides a benchmark for evaluating summarization models in the scholarly domain.
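As a usage illustration, per-aspect benchmarking over the dataset could look like the following sketch; `model_summarize` is a hypothetical stand-in for the reader's own inference code, and the record fields follow the illustrative schema sketched earlier.

```python
# Hedged benchmarking sketch: mean per-aspect ROUGE-Lsum F1 over ACLSum.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeLsum"],
                                  use_stemmer=True)

def evaluate(dataset, model_summarize):
    results = {}
    for aspect in ("challenge", "approach", "outcome"):
        f1s = []
        for record in dataset:
            pred = model_summarize(" ".join(record["document"]), aspect)
            ref = record["abstractive"][aspect]
            f1s.append(scorer.score(ref, pred)["rougeLsum"].fmeasure)
        results[aspect] = sum(f1s) / len(f1s)  # mean F1 per aspect
    return results
```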