Are LLMs complicated ethical dilemma analyzers?
By: Jiashen Du, Jesse Yao, and more
Potential Business Impact:
Helps computers make good choices like people.
One open question in the study of Large Language Models (LLMs) is whether they can emulate human ethical reasoning and act as believable proxies for human judgment. To investigate this, we introduce a benchmark dataset comprising 196 real-world ethical dilemmas and expert opinions, each segmented into five structured components: Introduction, Key Factors, Historical Theoretical Perspectives, Resolution Strategies, and Key Takeaways. We also collect non-expert human responses for comparison, limited to the Key Factors section due to their brevity. We evaluate multiple frontier LLMs (GPT-4o-mini, Claude-3.5-Sonnet, Deepseek-V3, Gemini-1.5-Flash) using a composite metric framework based on BLEU, Damerau-Levenshtein distance, TF-IDF cosine similarity, and Universal Sentence Encoder similarity. Metric weights are computed through an inversion-based ranking alignment and pairwise AHP analysis, enabling fine-grained comparison of model outputs to expert responses. Our results show that LLMs generally outperform non-expert humans in lexical and structural alignment, with GPT-4o-mini performing most consistently across all sections. However, all models struggle with historical grounding and proposing nuanced resolution strategies, which require contextual abstraction. Human responses, while less structured, occasionally achieve comparable semantic similarity, suggesting intuitive moral reasoning. These findings highlight both the strengths and current limitations of LLMs in ethical decision-making.
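To make the composite metric framework concrete, below is a minimal sketch of how one might combine the four similarity signals named in the abstract (BLEU, Damerau-Levenshtein distance, TF-IDF cosine similarity, and sentence-embedding cosine similarity) into a single weighted score. The weight values, the `embed_fn` argument, and the helper names are illustrative assumptions, not the paper's actual AHP-derived weights or implementation; in the paper the embeddings come from the Universal Sentence Encoder.

```python
# Sketch of a weighted composite similarity score against an expert reference.
# Weights and function names are illustrative stand-ins, not the paper's values.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def damerau_levenshtein(a: str, b: str) -> int:
    """Optimal-string-alignment Damerau-Levenshtein distance."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + cost)  # transposition
    return d[len(a)][len(b)]


def composite_score(candidate: str, reference: str, embed_fn, weights) -> float:
    """Weighted sum of four [0, 1] similarity signals.

    embed_fn: callable mapping a list of strings to a 2-D embedding array
              (e.g. the Universal Sentence Encoder loaded via tensorflow_hub).
    weights:  dict with keys 'bleu', 'dl', 'tfidf', 'embed' summing to 1
              (stand-ins for the rank-alignment / AHP-derived weights).
    """
    # 1. BLEU with smoothing, since short single references make n-gram counts sparse.
    bleu = sentence_bleu([reference.split()], candidate.split(),
                         smoothing_function=SmoothingFunction().method1)

    # 2. Damerau-Levenshtein distance normalised to a similarity in [0, 1].
    dl_sim = 1 - damerau_levenshtein(candidate, reference) / max(len(candidate), len(reference), 1)

    # 3. TF-IDF cosine similarity over the candidate/reference pair.
    tfidf = TfidfVectorizer().fit_transform([candidate, reference])
    tfidf_sim = cosine_similarity(tfidf[0], tfidf[1])[0, 0]

    # 4. Sentence-embedding cosine similarity (USE in the paper; any encoder here).
    emb = embed_fn([candidate, reference])
    embed_sim = cosine_similarity([emb[0]], [emb[1]])[0, 0]

    return (weights['bleu'] * bleu + weights['dl'] * dl_sim +
            weights['tfidf'] * tfidf_sim + weights['embed'] * embed_sim)


# Illustrative usage with made-up weights:
# score = composite_score(model_answer, expert_answer, embed_fn=use_encoder,
#                         weights={'bleu': 0.2, 'dl': 0.2, 'tfidf': 0.3, 'embed': 0.3})
```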
Similar Papers
Advancing Automated Ethical Profiling in SE: a Zero-Shot Evaluation of LLM Reasoning
Software Engineering
Helps computers understand right from wrong.
Analyzing the Ethical Logic of Six Large Language Models
Artificial Intelligence
Computers can now explain their moral choices.
Normative Evaluation of Large Language Models with Everyday Moral Dilemmas
Artificial Intelligence
Tests AI's right and wrong choices in real-life problems.