A Confidence-Diversity Framework for Calibrating AI Judgement in Accessible Qualitative Coding Tasks
By: Zhilong Zhao, Yindi Liu
Potential Business Impact:
Makes AI-assisted qualitative coding faster and more trustworthy.
LLMs enable qualitative coding at large scale, but assessing reliability remains challenging where human experts seldom agree. We investigate confidence-diversity calibration as a quality assessment framework for accessible coding tasks where LLMs already demonstrate strong performance but exhibit overconfidence. Analysing 5,680 coding decisions from eight state-of-the-art LLMs across ten categories, we find that mean self-confidence tracks inter-model agreement closely (Pearson r=0.82). Adding model diversity, quantified as normalised Shannon entropy, produces a dual signal that explains agreement almost completely (R-squared=0.979), though this high predictive power likely reflects task simplicity for current LLMs. The framework enables a three-tier workflow that auto-accepts 35 percent of segments with less than 5 percent error, cutting manual effort by 65 percent. Cross-domain validation confirms transferability (kappa improvements of 0.20 to 0.78). While this work establishes a methodological foundation for AI judgement calibration, the framework's true potential likely lies in more challenging scenarios where LLMs may demonstrate comparative advantages over human cognitive limitations.
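The dual signal the abstract describes is simple enough to sketch. The Python snippet below is a minimal illustration, not the authors' implementation: for one coded segment it computes the mean self-reported confidence across the model ensemble and the normalised Shannon entropy of the labels the models assigned, then applies a three-tier triage rule. The thresholds, the linear triage logic, and the choice of entropy normaliser are assumptions for illustration and are not taken from the paper.

```python
from collections import Counter
from math import log2


def confidence_diversity_signals(labels, confidences, n_categories=10):
    """Return (mean confidence, normalised Shannon entropy) for one segment.

    labels       -- code assigned by each model in the ensemble (e.g. 8 LLMs)
    confidences  -- each model's self-reported confidence in [0, 1]
    n_categories -- codebook size; caps the entropy normaliser (assumed)
    """
    mean_conf = sum(confidences) / len(confidences)

    counts = Counter(labels)
    total = len(labels)
    entropy = -sum((c / total) * log2(c / total) for c in counts.values())
    # Normalise so the diversity score lies in [0, 1]; the exact normaliser
    # is an assumption, the paper only states "normalised Shannon entropy".
    max_entropy = log2(min(len(labels), n_categories))
    diversity = entropy / max_entropy if max_entropy > 0 else 0.0
    return mean_conf, diversity


def triage(mean_conf, diversity, accept=(0.9, 0.2), escalate=(0.6, 0.6)):
    """Map the dual signal to a three-tier decision (thresholds are assumed)."""
    if mean_conf >= accept[0] and diversity <= accept[1]:
        return "auto-accept"      # high confidence and models largely agree
    if mean_conf < escalate[0] or diversity > escalate[1]:
        return "expert-review"    # low confidence or high disagreement
    return "spot-check"           # intermediate: lightweight human audit


# Example: eight models coding one segment from a ten-category codebook.
labels = ["support", "support", "support", "support",
          "support", "support", "neutral", "support"]
confidences = [0.95, 0.9, 0.92, 0.88, 0.97, 0.9, 0.78, 0.93]
mc, div = confidence_diversity_signals(labels, confidences)
print(f"mean confidence={mc:.2f}, diversity={div:.2f} -> {triage(mc, div)}")
```

In this toy example the near-unanimous labels and high confidence place the segment in the auto-accept tier; a segment with split labels or low confidence would be routed to human review instead.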
Similar Papers
Confidence-Diversity Calibration of AI Judgement Enables Reliable Qualitative Coding
Machine Learning (CS)
Helps computers check their own answers before asking humans.
Automated Quality Assessment for LLM-Based Complex Qualitative Coding: A Confidence-Diversity Framework
Computers and Society
Checks AI's work on hard problems.
Overconfidence in LLM-as-a-Judge: Diagnosis and Confidence-Driven Solution
Artificial Intelligence
Makes AI judges more honest about what they know.