The Massive Legal Embedding Benchmark (MLEB)
By: Umar Butler, Abdur-Rahman Butler, Adrian Lucas Malec
Potential Business Impact:
Helps computers understand and find legal information.
We present the Massive Legal Embedding Benchmark (MLEB), the largest, most diverse, and most comprehensive open-source benchmark for legal information retrieval to date. MLEB consists of ten expert-annotated datasets spanning multiple jurisdictions (the US, UK, EU, Australia, Ireland, and Singapore), document types (cases, legislation, regulatory guidance, contracts, and literature), and task types (search, zero-shot classification, and question answering). Seven of the datasets in MLEB were newly constructed in order to fill domain and jurisdictional gaps in the open-source legal information retrieval landscape. We document our methodology in building MLEB and creating the new constituent datasets, and release our code, results, and data openly to assist with reproducible evaluations.
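To make the evaluation setup concrete, below is a minimal sketch of how a single MLEB-style retrieval dataset could be scored with an off-the-shelf embedding model: embed the queries and documents, rank documents by cosine similarity, and average nDCG over queries. The toy corpus, field layout, model choice, and scoring loop are illustrative assumptions, not MLEB's actual harness (the authors release their own code for reproducible evaluations).

```python
# Illustrative sketch of scoring one retrieval dataset with an embedding model.
# The data format and model name here are assumptions, not MLEB's real API.
import numpy as np
from sentence_transformers import SentenceTransformer

# Hypothetical toy corpus, queries, and relevance judgments (qrels).
corpus = {
    "d1": "The tenant must provide 30 days' written notice before vacating.",
    "d2": "Directors owe fiduciary duties of care and loyalty to the company.",
    "d3": "The GDPR requires a lawful basis for processing personal data.",
}
queries = {"q1": "What notice period applies before a tenant moves out?"}
qrels = {"q1": {"d1": 1}}  # query id -> {relevant doc id: relevance grade}

model = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model works here

doc_ids = list(corpus)
doc_emb = model.encode([corpus[d] for d in doc_ids], normalize_embeddings=True)
query_emb = model.encode(list(queries.values()), normalize_embeddings=True)

def ndcg_at_k(ranked_ids, rels, k=10):
    """nDCG@k for one query, given graded relevance judgments."""
    dcg = sum(rels.get(d, 0) / np.log2(i + 2) for i, d in enumerate(ranked_ids[:k]))
    ideal = sorted(rels.values(), reverse=True)[:k]
    idcg = sum(r / np.log2(i + 2) for i, r in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

scores = []
for qi, qid in enumerate(queries):
    sims = doc_emb @ query_emb[qi]  # cosine similarity, since vectors are unit-norm
    ranking = [doc_ids[j] for j in np.argsort(-sims)]
    scores.append(ndcg_at_k(ranking, qrels[qid]))

print(f"mean nDCG@10: {np.mean(scores):.3f}")
```

The same loop generalizes to MLEB's zero-shot classification and question-answering tasks by treating labels or answer passages as the "documents" to be retrieved, which is why a single embedding model can be scored across all ten datasets.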
Similar Papers
MIEB: Massive Image Embedding Benchmark
CV and Pattern Recognition
Tests how well computers understand pictures and words.
PatenTEB: A Comprehensive Benchmark and Model Family for Patent Text Embedding
Computation and Language
Finds relevant patents faster and more accurately.
MedMKEB: A Comprehensive Knowledge Editing Benchmark for Medical Multimodal Large Language Models
Artificial Intelligence
Teaches AI to correct mistaken medical knowledge in images and text.