Retrieving Semantically Similar Decisions under Noisy Institutional Labels: Robust Comparison of Embedding Methods
By: Tereza Novotna, Jakub Harasta
Potential Business Impact:
Helps lawyers find court cases faster.
Retrieving case law is a time-consuming task, predominantly carried out by querying databases. We compare two models on Czech Constitutional Court decisions: (i) a large general-purpose embedder (OpenAI) and (ii) a domain-specific BERT trained from scratch on ~30,000 decisions using sliding windows and attention pooling. We propose a noise-aware evaluation comprising IDF-weighted keyword overlap as graded relevance, binarization via two thresholds (0.20 balanced, 0.28 strict), significance testing via paired bootstrap, and an nDCG diagnosis supported by qualitative analysis. Despite modest absolute nDCG scores (expected under noisy labels), the general-purpose OpenAI embedder decisively outperforms the domain pre-trained BERT at @10/@20/@100 across both thresholds, and the differences are statistically significant. Diagnostics attribute the low absolute scores to label drift and strong ideal rankings rather than to a lack of retrieval utility. Our framework is also robust enough for evaluation against a noisy gold dataset, which is typical when handling data with heterogeneous labels stemming from legacy judicial databases.
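The evaluation pipeline in the abstract can be sketched end to end. This is a minimal illustration, not the authors' implementation: the function names, the IDF normalization by the query's own weight mass, and the bootstrap resampling scheme are all assumptions; only the two thresholds (0.20, 0.28), the use of IDF-weighted keyword overlap as graded relevance, nDCG@k, and the paired bootstrap come from the abstract.

```python
import math
import random

def idf_weights(docs_keywords):
    """IDF per keyword over a corpus of keyword sets (one set per decision)."""
    n = len(docs_keywords)
    df = {}
    for kws in docs_keywords:
        for k in kws:
            df[k] = df.get(k, 0) + 1
    return {k: math.log(n / dfk) for k, dfk in df.items()}

def graded_relevance(query_kws, cand_kws, idf):
    """IDF-weighted keyword overlap, normalized here (an assumption)
    by the query's total IDF mass so grades fall in [0, 1]."""
    denom = sum(idf.get(k, 0.0) for k in query_kws)
    num = sum(idf.get(k, 0.0) for k in query_kws & cand_kws)
    return num / denom if denom else 0.0

def binarize(grades, threshold=0.20):
    """Binarize graded relevance; the paper uses 0.20 (balanced) and 0.28 (strict)."""
    return [1 if g >= threshold else 0 for g in grades]

def dcg(grades):
    return sum(g / math.log2(i + 2) for i, g in enumerate(grades))

def ndcg_at_k(grades_in_ranked_order, k):
    """nDCG@k: DCG of the system ranking over DCG of the ideal reordering."""
    ideal = sorted(grades_in_ranked_order, reverse=True)
    denom = dcg(ideal[:k])
    return dcg(grades_in_ranked_order[:k]) / denom if denom else 0.0

def paired_bootstrap(scores_a, scores_b, n_resamples=10000, seed=0):
    """One-sided paired bootstrap p-value for 'system A beats system B',
    resampling per-query score differences with replacement."""
    rng = random.Random(seed)
    deltas = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(deltas)
    wins_for_null = 0
    for _ in range(n_resamples):
        sample = [deltas[rng.randrange(n)] for _ in range(n)]
        if sum(sample) / n <= 0:
            wins_for_null += 1
    return wins_for_null / n_resamples
```

A perfect ranking yields nDCG@k = 1.0 regardless of the grade scale, which is why low absolute nDCG under noisy labels (where the "ideal" ordering is itself partly arbitrary) need not imply low retrieval utility, as the abstract's diagnostics argue.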
Similar Papers
Evaluating Embedding Models and Pipeline Optimization for AI Search Quality
Information Retrieval
Makes AI search find information much better.
Contextual Embedding-based Clustering to Identify Topics for Healthcare Service Improvement
Machine Learning (CS)
Finds hidden problems in patient comments.
Comparison of Unsupervised Metrics for Evaluating Judicial Decision Extraction
Computation and Language
Checks legal documents automatically for accuracy.