Score: 2

SI-Bench: Benchmarking Social Intelligence of Large Language Models in Human-to-Human Conversations

Published: October 27, 2025 | arXiv ID: 2510.23182v1

By: Shuai Huang, Wenxuan Zhao, Jun Gao

Potential Business Impact:

Tests how well AI understands people talking.

Business Areas:
Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

As large language models (LLMs) develop anthropomorphic abilities, they are increasingly being deployed as autonomous agents to interact with humans. However, evaluating their performance in realistic and complex social interactions remains a significant challenge. Most previous research built datasets through simulated agent-to-agent interactions, an approach that fails to capture the authentic linguistic styles and relational dynamics found in real human conversations. To address this gap, we introduce SI-Bench, a novel benchmark designed to evaluate aspects of social intelligence in LLMs. Grounded in broad social science theories, SI-Bench contains 2,221 authentic multi-turn dialogues collected from a social networking application. We further selected a subset of 312 dialogues for manual annotation across 8 major models. The experiments show that state-of-the-art (SOTA) models have surpassed the human expert in process reasoning under complex social situations, yet they still fall behind humans in reply quality. Moreover, introducing Chain-of-Thought (CoT) reasoning may degrade the performance of LLMs in social dialogue tasks. All datasets are openly available at https://github.com/SI-Bench/SI-Bench.git.

Page Count
17 pages

Category
Computer Science: Computation and Language