Cluster-based name embeddings reduce ethnic disparities in record linkage quality under realistic name corruption: evidence from the North Carolina Voter Registry
By: Joseph Lam , Mario Cortina-Borja , Rob Aldridge and more
Differential ethnic-based record linkage errors can bias epidemiologic estimates. Prior evidence often conflates heterogeneity in error mechanisms with unequal exposure to error. Using snapshots of the North Carolina Voter Registry (Oct 2011-Oct 2022), we derived empirical name-discrepancy profiles to parameterise realistic corruptions. From an Oct 2022 extract (n=848,566), we generated five replicate corrupted datasets under three settings that separately varied mechanism heterogeneity and exposure inequality, and linked records back to originals using unadjusted Jaro-Winkler, Term Frequency (TF)-adjusted Jaro-Winkler, and a cluster-based forename-embedding comparator combined with TF-adjusted surname comparison. We evaluated false match rate (FMR), missed match rate (MMR) and white-centric disparities. At a fixed MMR near 0.20, overall error rates and ethnic disparities diverged substantially by model under disproportionate exposure to corruption. Term-frequency (TF)-adjusted Jaro-Winkler achieved very low overall FMR (0.55% (95% CI 0.54-0.57)) at overall MMR 20.34% (20.30-20.39), but large white-centric under-linkage disparities persisted: Hispanic voters had 36.3% (36.1-36.6) and Non-Hispanic Black voters 8.6% (8.6-8.7) higher FMRs compared to Non-Hispanic White groups. Relative to unadjusted string similarity, TF adjustment reduced these disparities (Hispanic: +60.4% (60.1-60.7) to +36.3%; Black: +13.1% (13.0-13.2) to +8.6%). The cluster-based forename-embedding model reduced missed-match disparities further (Hispanic: +10.2% (9.8-10.3); Black: +0.6% (0.4-0.7)), but at a cost of increasing overall FMR (4.28% (4.22-4.35)) at the same threshold. Unequal exposure to identifier error drove substantially larger disparities than mechanism heterogeneity alone; cluster-based embeddings markedly narrowed under-linkage disparities beyond TF adjustment.
Similar Papers
Population-Scale Network Embeddings Expose Educational Divides in Network Structure Related to Right-Wing Populist Voting
Social and Information Networks
Shows how social connections predict voting.
Obscured but Not Erased: Evaluating Nationality Bias in LLMs via Name-Based Bias Benchmarks
Computation and Language
AI models show unfair bias based on names.
Generalization Gaps in Political Fake News Detection: An Empirical Study on the LIAR Dataset
Computation and Language
Helps computers spot fake news better.