The Material Contracts Corpus
By: Peter Adelson, Julian Nyarko
Potential Business Impact:
Helps computers analyze legal contracts faster.
This paper introduces the Material Contracts Corpus (MCC), a publicly available dataset comprising over one million contracts filed by public companies with the U.S. Securities and Exchange Commission (SEC) between 2000 and 2023. The MCC facilitates empirical research on contract design and legal language, and supports the development of AI-based legal tools. Contracts in the corpus are categorized by agreement type and linked to specific parties using machine learning and natural language processing techniques, including a fine-tuned LLaMA-2 model for contract classification. The MCC further provides metadata such as filing form, document format, and amendment status. We document trends in contractual language, length, and complexity over time, and highlight the dominance of employment and security agreements in SEC filings. This resource is available for bulk download and online access at https://mcc.law.stanford.edu.
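To make the metadata described above concrete, the following is a minimal sketch of how one contract record might be represented and how the share of each agreement type could be tallied. The `ContractRecord` class, its field names, the `type_shares` helper, and the sample rows are all hypothetical illustrations; they are not the MCC's actual schema or data, which should be checked against the download itself.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ContractRecord:
    """Hypothetical record; field names are illustrative, not the MCC schema."""
    agreement_type: str   # e.g. "employment", "security"
    filing_form: str      # e.g. "10-K", "8-K"
    document_format: str  # e.g. "html", "txt"
    is_amendment: bool    # amendment status of the filed contract
    filing_year: int


def type_shares(records):
    """Return each agreement type's share of the given records."""
    counts = Counter(r.agreement_type for r in records)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {t: n / total for t, n in counts.items()}


# Made-up sample rows, for illustration only.
sample = [
    ContractRecord("employment", "10-K", "html", False, 2015),
    ContractRecord("employment", "8-K", "html", True, 2016),
    ContractRecord("security", "10-Q", "txt", False, 2003),
    ContractRecord("lease", "10-K", "html", False, 2020),
]

shares = type_shares(sample)
```

Grouping the same records by `filing_year` before calling `type_shares` would give a per-year breakdown, which is one simple way to look at trends over time.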
Similar Papers
3CEL: A corpus of legal Spanish contract clauses
Computation and Language
Helps lawyers understand Spanish contracts faster.
A Survey of Classification Tasks and Approaches for Legal Contracts
Computation and Language
Helps computers quickly understand legal papers.
Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training
Computation and Language
Creates free, safe data for smart computer programs.