SI-Bench: Benchmarking Social Intelligence of Large Language Models in Human-to-Human Conversations
By: Shuai Huang, Wenxuan Zhao, Jun Gao
Potential Business Impact:
Tests how well AI understands real conversations between people.
As large language models (LLMs) develop anthropomorphic abilities, they are increasingly deployed as autonomous agents that interact with humans. However, evaluating their performance in realistic, complex social interactions remains a significant challenge. Most previous research built datasets through simulated agent-to-agent interactions, an approach that fails to capture the authentic linguistic styles and relational dynamics found in real human conversations. To address this gap, we introduce SI-Bench, a novel benchmark designed to evaluate aspects of social intelligence in LLMs. Grounded in broad social science theories, SI-Bench contains 2,221 authentic multi-turn dialogues collected from a social networking application. We further selected a subset of 312 dialogues for manual annotation of the outputs of 8 major models. Experiments show that state-of-the-art (SOTA) models surpass human experts in process reasoning under complex social situations, yet still fall behind humans in reply quality. Moreover, introducing Chain-of-Thought (CoT) reasoning may degrade the performance of LLMs in social dialogue tasks. All datasets are openly available at https://github.com/SI-Bench/SI-Bench.git.
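For readers who want to inspect the released data, the minimal sketch below shows one way to load the dialogues after cloning the repository. The file name `dialogues.json` and the `turns` field are illustrative assumptions, not confirmed by the abstract; the repository's README documents the actual layout.

```python
# Minimal sketch for exploring SI-Bench dialogues after cloning
# https://github.com/SI-Bench/SI-Bench.git. The file name and the
# per-dialogue "turns" field below are assumptions for illustration;
# consult the repository README for the real data layout.
import json
from pathlib import Path

DATA_PATH = Path("SI-Bench/dialogues.json")  # hypothetical location


def load_dialogues(path: Path) -> list[dict]:
    """Load the benchmark dialogues from a JSON file (assumed format)."""
    with path.open(encoding="utf-8") as f:
        return json.load(f)


if __name__ == "__main__":
    dialogues = load_dialogues(DATA_PATH)
    # The full benchmark contains 2,221 dialogues per the abstract.
    print(f"Loaded {len(dialogues)} dialogues")
    # Peek at the first few turns of one multi-turn dialogue
    # ("turns" is an assumed field name).
    for turn in dialogues[0].get("turns", [])[:3]:
        print(turn)
```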
Similar Papers
SocioBench: Modeling Human Behavior in Sociological Surveys with Large Language Models
Social and Information Networks
Helps computers understand how people think.
SocialEval: Evaluating Social Intelligence of Large Language Models
Computation and Language
Helps computers understand and act like people.
SimBench: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors
Computation and Language
Tests if AI acts like real people.