LegalLens: Leveraging LLMs for Legal Violation Identification in Unstructured Text

6 Feb 2024 | Dor Bernsohn, Gil Semo, Yaron Vazana, Gila Hayat, Ben Hagag, Joel Niklaus, Rohit Saha, Kyryl Truskovskyi
This study focuses on two tasks: detecting legal violations within unstructured textual data and associating those violations with potentially affected individuals. The researchers constructed two datasets using large language models (LLMs) and had them validated by domain expert annotators. The experimental design involved fine-tuning models from the BERT family and open-source LLMs, as well as few-shot experiments with closed-source LLMs. The results, an F1-score of 62.69% for violation identification and 81.02% for associating victims, demonstrate the effectiveness of the datasets and setups. The datasets and code are publicly released to advance further research in legal natural language processing (NLP).

The contributions are threefold: (1) two dedicated datasets for legal violation identification, generated with LLMs, validated by domain experts, and covering new legal entity types designed for broad applicability across violation types; (2) an evaluation of various language models on two NLP tasks, with insights into their applicability and limitations in legal NLP; and (3) a dual-setup methodology that pairs Named Entity Recognition (NER) for violation detection with Natural Language Inference (NLI) for violation resolution.
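
Since the datasets and code are released, a natural first step is loading the data. Below is a minimal sketch assuming the two datasets are hosted on the Hugging Face Hub; the repository identifiers shown are assumptions and may differ from the authors' actual release.

```python
# Hedged sketch: load the two LegalLens datasets from the Hugging Face Hub.
# The repository IDs below are assumptions; substitute the names from the
# authors' actual release.
from datasets import load_dataset

ner_data = load_dataset("darrow-ai/LegalLensNER")  # assumed ID: token-level violation spans
nli_data = load_dataset("darrow-ai/LegalLensNLI")  # assumed ID: premise/hypothesis pairs

print(ner_data)              # inspect available splits
print(nli_data["train"][0])  # inspect a single example
```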

The study aims to uncover hidden legal violations in unstructured text and link them to relevant prior class actions. The main research questions concern how much the datasets improve model performance, how well models adapt to new, unseen data, and how machine-generated text differs from human-generated text for legal violation identification.

Previous work on legal violation identification has mostly targeted domain-specific topics such as compliance, data privacy, and industry-specific regulations. In contrast, the datasets introduced here are designed for broader applicability across violation types. The research also explores the use of LLMs for synthetic data generation and the interconnection between the NER and NLI tasks.

Data generation follows a systematic three-stage process: prompting, labeling, and data validation, using a prompt-based approach optimized for the legal domain to produce high-quality data for both tasks. In the resulting setups, the NER task identifies violation-related entities in text, while the NLI task matches detected violations against resolved class-action cases.

The experiments evaluated fine-tuned BERT-based models, open-source LLMs, and closed-source LLMs on both setups. BERT-based models outperformed LLMs on the NER task, while LLMs excelled on the NLI task, particularly in low-data scenarios. An error analysis was conducted to improve the models and understand their limitations. The sketches below illustrate, under stated assumptions, what the three experimental setups can look like in practice.
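
First, fine-tuning a BERT-family model as a token classifier for violation spans. This is a minimal sketch with Hugging Face transformers; the dataset ID, field names (tokens, ner_tags), and label inventory are assumptions about the LegalLensNER schema, not the paper's exact configuration.

```python
# Hedged sketch: token-classification fine-tuning in the spirit of the
# paper's BERT-family NER setup. Dataset ID, field names, and labels are
# assumptions and may differ from the released data.
from datasets import load_dataset
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer,
                          TrainingArguments)

labels = ["O", "B-VIOLATION", "I-VIOLATION"]  # assumed label inventory
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(labels))

ner_data = load_dataset("darrow-ai/LegalLensNER")  # assumed repository ID

def tokenize_and_align(batch):
    # Tokenize pre-split words and propagate word-level tags to sub-tokens,
    # masking special tokens and continuation pieces with -100.
    enc = tokenizer(batch["tokens"], is_split_into_words=True, truncation=True)
    enc["labels"] = []
    for i, tags in enumerate(batch["ner_tags"]):
        prev, ids = None, []
        for w in enc.word_ids(batch_index=i):
            ids.append(-100 if w is None or w == prev else tags[w])
            prev = w
        enc["labels"].append(ids)
    return enc

train = ner_data["train"].map(tokenize_and_align, batched=True,
                              remove_columns=ner_data["train"].column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="legallens-ner", num_train_epochs=3),
    train_dataset=train,
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```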
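
Second, the NLI setup: given a resolved class-action description as the premise and a detected violation as the hypothesis, score entailment. The sketch below uses a generic off-the-shelf MNLI checkpoint as a stand-in for the paper's fine-tuned models, with invented example texts.

```python
# Hedged sketch: entailment scoring between a class-action description and a
# candidate violation. The checkpoint is a generic MNLI model, not the
# paper's; the example texts are invented.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

ckpt = "roberta-large-mnli"
tok = AutoTokenizer.from_pretrained(ckpt)
nli = AutoModelForSequenceClassification.from_pretrained(ckpt)

premise = ("A resolved class action alleged that an app vendor shared "
           "users' personal data with advertisers without consent.")
hypothesis = "The company disclosed customers' personal data to third parties."

inputs = tok(premise, hypothesis, return_tensors="pt", truncation=True)
with torch.no_grad():
    probs = nli(**inputs).logits.softmax(-1).squeeze()

# roberta-large-mnli orders its labels (contradiction, neutral, entailment).
for i, name in enumerate(["contradiction", "neutral", "entailment"]):
    print(f"{name}: {probs[i].item():.3f}")
```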
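
Third, the few-shot setup with a closed-source LLM. The prompt format, the in-context examples, and the model name below are illustrative placeholders rather than the paper's actual prompts.

```python
# Hedged sketch: few-shot extraction with a closed-source LLM via the OpenAI
# API. Prompt, examples, and model name are placeholders, not the paper's.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

FEW_SHOT = """Extract legal-violation phrases from the text.

Text: "The lender tacked on fees that were never disclosed at signing."
Violations: ["fees that were never disclosed at signing"]

Text: "Employees were routinely asked to work off the clock."
Violations: ["asked to work off the clock"]

Text: "{input_text}"
Violations:"""

def extract_violations(input_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder closed-source model
        messages=[{"role": "user",
                   "content": FEW_SHOT.format(input_text=input_text)}],
        temperature=0,
    )
    return response.choices[0].message.content

print(extract_violations("The app kept recording audio after users opted out."))
```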