HumaniBench: A Human-Centric Framework for Large Multimodal Models Evaluation
By: Shaina Raza, Aravind Narayanan, Vahid Reza Khazaie, and more
Potential Business Impact:
Tests if AI is fair, kind, and inclusive.
Large multimodal models (LMMs) have been widely tested on tasks like visual question answering (VQA), image captioning, and grounding, but they lack rigorous evaluation for alignment with human-centered (HC) values such as fairness, ethics, and inclusivity. To address this gap, we introduce HumaniBench, a novel benchmark of 32,000 real-world image-question pairs and an accompanying evaluation suite. Labels are generated via an AI-assisted pipeline and validated by experts. HumaniBench assesses LMMs across seven key alignment principles (fairness, ethics, empathy, inclusivity, reasoning, robustness, and multilinguality) through diverse open-ended and closed-ended VQA tasks. Grounded in AI ethics and real-world needs, these principles provide a holistic lens on societal impact. Benchmarking results across different LMMs show that proprietary models generally lead in reasoning, fairness, and multilinguality, while open-source models excel in robustness and grounding. Most models struggle to balance accuracy with ethical and inclusive behavior. Techniques like Chain-of-Thought prompting and test-time scaling improve alignment. As the first benchmark tailored for HC alignment, HumaniBench offers a rigorous testbed to diagnose limitations and promote responsible LMM development. All data and code are publicly available for reproducibility.

Keywords: HumaniBench, vision-language models, responsible AI benchmark, AI alignment evaluation, AI ethics assessment, fairness in AI models, visual question answering (VQA) benchmark, image captioning evaluation, visual grounding tasks, trustworthy AI models, Chain-of-Thought prompting, test-time scaling, ethical AI development tools.
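To make the evaluation setup concrete, the Python sketch below shows how one HumaniBench-style image-question item might be scored with and without a Chain-of-Thought prompt. The VQAItem fields, the COT_PREFIX string, the query_model placeholder, and the exact-match scorer are illustrative assumptions for this sketch, not the paper's actual data schema or code.

from dataclasses import dataclass

@dataclass
class VQAItem:
    image_path: str   # path to the real-world image (hypothetical field)
    question: str     # open- or closed-ended question about the image
    principle: str    # one of the seven principles, e.g. "fairness"
    reference: str    # expert-validated reference answer

# A simple Chain-of-Thought cue prepended to the question (assumed phrasing).
COT_PREFIX = "Think step by step before answering.\n"

def query_model(image_path: str, prompt: str) -> str:
    # Placeholder for a real LMM call (API or local inference).
    # Returns a canned answer so the sketch runs end to end.
    return "yes"

def evaluate(item: VQAItem, use_cot: bool = False) -> str:
    # Build the prompt, optionally with the CoT cue, and query the model.
    prompt = (COT_PREFIX if use_cot else "") + item.question
    return query_model(item.image_path, prompt)

def exact_match(prediction: str, reference: str) -> bool:
    # Crude scoring for closed-ended items; open-ended answers would need
    # a rubric or a judge model instead.
    return prediction.strip().lower() == reference.strip().lower()

if __name__ == "__main__":
    item = VQAItem(
        image_path="img/0001.jpg",
        question="Is everyone in the image treated respectfully?",
        principle="ethics",
        reference="yes",
    )
    pred = evaluate(item, use_cot=True)
    print(item.principle, exact_match(pred, item.reference))

In practice, per-principle scores would be aggregated over all items sharing the same principle tag, so that a model's fairness, ethics, or robustness results can be compared the way the abstract describes.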
Similar Papers
Human-Aligned Bench: Fine-Grained Assessment of Reasoning Ability in MLLMs vs. Humans
CV and Pattern Recognition
Tests if AI can think like people.
Video-Bench: Human-Aligned Video Generation Benchmark
CV and Pattern Recognition
Tests AI videos to match what people like.
MANBench: Is Your Multimodal Model Smarter than Human?
Computation and Language
Tests how well computers understand pictures and words.