Assessing the Reliability of Large Language Models for Deductive Qualitative Coding: A Comparative Study of ChatGPT Interventions

Published: July 18, 2025 | arXiv ID: 2507.14384v1

By: Angjelin Hila, Elliott Hauser

Potential Business Impact:

Helps computers sort legal cases by topic.

Business Areas:

Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

In this study, we investigate the use of large language models (LLMs), specifically ChatGPT, for structured deductive qualitative coding. While most current research emphasizes inductive coding applications, we address the underexplored potential of LLMs to perform deductive classification tasks aligned with established human-coded schemes. Using the Comparative Agendas Project (CAP) Master Codebook, we classified U.S. Supreme Court case summaries into 21 major policy domains. We tested four intervention methods: zero-shot, few-shot, definition-based, and a novel Step-by-Step Task Decomposition strategy, across repeated samples. Performance was evaluated using standard classification metrics (accuracy, F1-score, Cohen's kappa, Krippendorff's alpha), and construct validity was assessed using chi-squared tests and Cramer's V. Chi-squared and effect size analyses confirmed that intervention strategies significantly influenced classification behavior, with Cramer's V values ranging from 0.359 to 0.613, indicating moderate to strong shifts in classification patterns. The Step-by-Step Task Decomposition strategy achieved the strongest reliability (accuracy = 0.775, kappa = 0.744, alpha = 0.746), meeting the thresholds for substantial agreement. Despite the semantic ambiguity within case summaries, ChatGPT displayed stable agreement across samples, including high F1 scores in low-support subclasses. These findings demonstrate that with targeted, custom-tailored interventions, LLMs can achieve reliability levels suitable for integration into rigorous qualitative coding workflows.
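For readers unfamiliar with the agreement and effect-size measures named in the abstract, the following is a minimal sketch of how accuracy, Cohen's kappa, and Cramér's V can be computed from two lists of labels. It is not the authors' code: the function names, the three toy topic labels, and the eight example items are all hypothetical, and Krippendorff's alpha and F1 are left out for brevity. The sketch also assumes each variable takes at least two distinct values; otherwise the denominators become zero.

```python
from collections import Counter
from math import sqrt


def accuracy(gold, pred):
    """Share of items where the predicted code equals the gold code."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def cohens_kappa(gold, pred):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(gold)
    p_observed = accuracy(gold, pred)
    gold_counts, pred_counts = Counter(gold), Counter(pred)
    # Chance agreement: sum over labels of the product of both coders' marginals.
    p_expected = sum(gold_counts[c] * pred_counts[c] for c in gold_counts) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)


def cramers_v(xs, ys):
    """Cramér's V, derived from the chi-squared statistic of the xs-by-ys table."""
    n = len(xs)
    x_counts, y_counts = Counter(xs), Counter(ys)
    joint = Counter(zip(xs, ys))
    chi2 = 0.0
    for x, nx in x_counts.items():
        for y, ny in y_counts.items():
            expected = nx * ny / n  # cell count expected under independence
            chi2 += (joint[(x, y)] - expected) ** 2 / expected
    k = min(len(x_counts), len(y_counts))
    return sqrt(chi2 / (n * (k - 1)))


if __name__ == "__main__":
    # Hypothetical toy data: human (gold) codes versus model codes for 8 summaries.
    gold = ["law", "health", "law", "energy", "health", "law", "energy", "health"]
    pred = ["law", "health", "health", "energy", "health", "law", "law", "health"]
    print("accuracy:", accuracy(gold, pred))
    print("kappa:", round(cohens_kappa(gold, pred), 3))

    # Hypothetical association between prompting strategy and assigned code.
    strategy = ["zero-shot", "zero-shot", "few-shot", "few-shot"]
    assigned = ["law", "law", "health", "health"]
    print("Cramér's V:", cramers_v(strategy, assigned))
```

Kappa discounts the agreement two coders would reach by chance given their label frequencies, which is why it is lower than raw accuracy on the same data. Cramér's V rescales chi-squared to the range 0 to 1, so values such as those reported in the abstract can be compared across tables of different sizes.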

ChatGPT in Introductory Programming: Counterbalanced Evaluation of Code Quality, Conceptual Learning, and Student Perceptions

Software Engineering

Helps students write better code faster.

1 Oct 2025

Generative Large Language Models (gLLMs) in Content Analysis: A Practical Guide for Communication Research

Artificial Intelligence

Helps computers understand what people write faster.

28 Oct 2025

An Empirical Study on the Capability of LLMs in Decomposing Bug Reports

Software Engineering

Helps computers break down bug reports faster.

29 Apr 2025

Country of Origin

🇺🇸 United States

Page Count

42 pages
