Score: 1

A Confidence-Diversity Framework for Calibrating AI Judgement in Accessible Qualitative Coding Tasks

Published: August 4, 2025 | arXiv ID: 2508.02029v2

By: Zhilong Zhao, Yindi Liu

Potential Business Impact:

Makes AI-assisted qualitative coding of text data faster and more trustworthy.

LLMs enable qualitative coding at large scale, but assessing reliability remains challenging in settings where even human experts seldom agree. We investigate confidence-diversity calibration as a quality-assessment framework for accessible coding tasks where LLMs already demonstrate strong performance but exhibit overconfidence. Analysing 5,680 coding decisions from eight state-of-the-art LLMs across ten categories, we find that mean self-confidence tracks inter-model agreement closely (Pearson r = 0.82). Adding model diversity, quantified as normalised Shannon entropy, yields a dual signal that explains agreement almost completely (R-squared = 0.979), though this high predictive power likely reflects the simplicity of these tasks for current LLMs. The framework enables a three-tier workflow that auto-accepts 35 percent of segments with under 5 percent error, cutting manual effort by 65 percent. Cross-domain validation confirms transferability (kappa improvements of 0.20 to 0.78). While this work establishes a methodological foundation for AI judgement calibration, the framework's true potential likely lies in more challenging scenarios where LLMs may hold comparative advantages over human cognitive limitations.
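
To make the idea concrete, the following is a minimal Python sketch of the confidence-diversity signal and the three-tier triage the abstract describes. The function names, thresholds (0.90 mean confidence, 0.20 diversity for auto-accept, 0.70 for light review), and the example labels are illustrative assumptions, not the authors' implementation; the paper tunes its own cut-offs so that auto-accepted segments stay below a 5 percent error rate.

```python
from collections import Counter
from math import log2
from statistics import mean


def normalized_shannon_entropy(labels):
    """Diversity of model votes: 0 when all models agree, 1 when votes are spread evenly.

    Normalised here by log2 of the number of distinct labels observed
    (an illustrative choice; normalising by the full codebook size also works).
    """
    counts = Counter(labels)
    if len(counts) <= 1:
        return 0.0
    total = len(labels)
    probs = [c / total for c in counts.values()]
    entropy = -sum(p * log2(p) for p in probs)
    return entropy / log2(len(counts))


def triage_segment(model_labels, model_confidences,
                   accept_conf=0.90, accept_div=0.20, review_conf=0.70):
    """Hypothetical three-tier routing: auto-accept, light review, or full manual coding.

    Thresholds are placeholders, not the values reported in the paper.
    """
    conf = mean(model_confidences)                    # mean self-confidence across models
    div = normalized_shannon_entropy(model_labels)    # inter-model diversity
    if conf >= accept_conf and div <= accept_div:
        return "auto-accept"
    if conf >= review_conf:
        return "light-review"
    return "manual-coding"


# Example: eight models label one text segment with seven agreeing votes.
labels = ["hope", "hope", "hope", "hope", "hope", "hope", "hope", "fear"]
confs = [0.95, 0.92, 0.90, 0.93, 0.94, 0.91, 0.90, 0.60]
print(triage_segment(labels, confs))  # prints "light-review": mean confidence ~0.88 is below 0.90
```

In this sketch diversity is normalised by the number of distinct labels actually observed; normalising instead by the full codebook size (ten categories in the paper) is an equally plausible reading of "normalised Shannon entropy" and would make diverse votes look comparatively less severe.
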

Page Count
23 pages

Category
Computer Science:
Machine Learning (CS)