Small Models Achieve Large Language Model Performance: Evaluating Reasoning-Enabled AI for Secure Child Welfare Research
By: Zia Qi , Brian E. Perron , Bryan G. Victor and more
Potential Business Impact:
Helps computers find child safety risks faster.
Objective: This study develops a systematic benchmarking framework for testing whether language models can accurately identify constructs of interest in child welfare records. The objective is to assess how different model sizes and architectures perform on four validated benchmarks for classifying critical risk factors among child welfare-involved families: domestic violence, firearms, substance-related problems generally, and opioids specifically. Method: We constructed four benchmarks for identifying risk factors in child welfare investigation summaries: domestic violence, substance-related problems, firearms, and opioids (n=500 each). We evaluated seven model sizes (0.6B-32B parameters) in standard and extended reasoning modes, plus a mixture-of-experts variant. Cohen's kappa measured agreement with gold standard classifications established by human experts. Results: The benchmarking revealed a critical finding: bigger models are not better. A small 4B parameter model with extended reasoning proved most effective, outperforming models up to eight times larger. It consistently achieved "substantial" to "almost perfect" agreement across all four benchmark categories. This model achieved "almost perfect" agreement (\k{appa} = 0.93-0.96) on three benchmarks (substance-related problems, firearms, and opioids) and "substantial" agreement (\k{appa} = 0.74) on the most complex task (domestic violence). Small models with extended reasoning rivaled the largest models while being more resource-efficient. Conclusions: Small reasoning-enabled models achieve accuracy levels historically requiring larger architectures, enabling significant time and computational efficiencies. The benchmarking framework provides a method for evidence-based model selection to balance accuracy with practical resource constraints before operational deployment in social work research.
Similar Papers
Large Language Models Imitate Logical Reasoning, but at what Cost?
Artificial Intelligence
Makes AI think better and cost less.
Can Small and Reasoning Large Language Models Score Journal Articles for Research Quality and Do Averaging and Few-shot Help?
Digital Libraries
Helps smaller AI judge research papers as well as big AI.
A Technical Study into 0.5B Reasoning Language Models
Artificial Intelligence
Makes small AI smart enough for hard problems.