A suite of LMs comprehend puzzle statements as well as humans
By: Adele E. Goldberg, Supantho Rakshit, Jennifer Hu, and others
Potential Business Impact:
Language models now comprehend tricky sentences at least as accurately as people do.
Recent claims suggest that large language models (LLMs) underperform humans in comprehending minimally complex English statements (Dentella et al., 2024). Here, we revisit those findings and argue that human performance was overestimated, while LLM abilities were underestimated. Using the same stimuli, we report a preregistered study comparing human responses in two conditions: one allowed rereading (replicating the original study), and one restricted rereading (a more naturalistic comprehension test). Human accuracy dropped significantly when rereading was restricted (73%), falling below that of Falcon-180B-Chat (76%) and GPT-4 (81%). The newer GPT-o1 model achieves perfect accuracy. Results further show that both humans and models are disproportionately challenged by queries involving potentially reciprocal actions (e.g., kissing), suggesting shared pragmatic sensitivities rather than model-specific deficits. Additional analyses using Llama-2-70B log probabilities, a recoding of open-ended model responses, and grammaticality ratings of other sentences reveal systematic underestimation of model performance. We find that GPT-4o can align with either naive or expert grammaticality judgments, depending on prompt framing. These findings underscore the need for more careful experimental design and coding practices in LLM evaluation, and they challenge the assumption that current models are inherently weaker than humans at language comprehension.
Similar Papers
Comparing human and language models sentence processing difficulties on complex structures
Computation and Language
Computers understand sentences like people do.
Has the Creativity of Large-Language Models peaked? An analysis of inter- and intra-LLM variability
Computation and Language
Computers aren't getting more creative, even the best ones.
Evidence of conceptual mastery in the application of rules by Large Language Models
Artificial Intelligence
Shows AI can apply rules the way people do.