VideoNorms: Benchmarking Cultural Awareness of Video Language Models
By: Nikhil Reddy Varimalla, Yunfei Xu, Arkadiy Saakyan, and more
Potential Business Impact:
Teaches AI to understand different cultures in videos.
As Video Large Language Models (VideoLLMs) are deployed globally, they require understanding of and grounding in the relevant cultural background. To properly assess these models' cultural awareness, adequate benchmarks are needed. We introduce VideoNorms, a benchmark of over 1,000 (video clip, norm) pairs from US and Chinese cultures annotated with socio-cultural norms grounded in speech act theory, norm adherence and violation labels, and verbal and non-verbal evidence. To build VideoNorms, we use a human-AI collaboration framework, where a teacher model using theoretically-grounded prompting provides candidate annotations and a set of trained human experts validate and correct them. We benchmark a variety of open-weight VideoLLMs on the new dataset; the results highlight several common trends: 1) models perform worse on norm violation than on adherence; 2) models perform worse on Chinese culture than on US culture; 3) models have more difficulty providing non-verbal evidence than verbal evidence for the norm adherence/violation label, and struggle to identify the exact norm corresponding to a speech act; and 4) unlike humans, models perform worse in formal, non-humorous contexts. Our findings emphasize the need for culturally-grounded video language model training, a gap our benchmark and framework begin to address.
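To make the dataset structure concrete, here is a minimal sketch of what a single (video clip, norm) record and an adherence/violation evaluation loop could look like. The field names, the `VideoNormExample` class, and the `model.generate` call are illustrative assumptions, not the authors' actual schema or evaluation code:

```python
from dataclasses import dataclass
from typing import List, Literal

# Hypothetical record layout for one (video clip, norm) pair;
# the actual VideoNorms schema may differ.
@dataclass
class VideoNormExample:
    clip_path: str                            # path to the video clip
    culture: Literal["US", "CN"]              # cultural context of the clip
    norm: str                                 # socio-cultural norm, grounded in a speech act
    label: Literal["adherence", "violation"]  # gold adherence/violation label
    verbal_evidence: str                      # quoted speech supporting the label
    nonverbal_evidence: str                   # described gestures/actions supporting the label

def evaluate(model, examples: List[VideoNormExample]) -> float:
    """Score a VideoLLM on the adherence/violation task (illustrative only)."""
    correct = 0
    for ex in examples:
        prompt = (
            f"Given the norm: '{ex.norm}', does this clip show adherence "
            "or violation of the norm? Answer with one word."
        )
        # `model.generate` stands in for whatever inference API the model exposes.
        prediction = model.generate(video=ex.clip_path, prompt=prompt).strip().lower()
        correct += int(prediction == ex.label)
    return correct / len(examples)
```

A real harness would also score the norm-identification and verbal/non-verbal evidence tasks described in the abstract; this sketch only covers the binary adherence/violation judgment.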
Similar Papers
VideoLLM Benchmarks and Evaluation: A Survey
CV and Pattern Recognition
Helps computers understand videos better.
A Culturally-diverse Multilingual Multimodal Video Benchmark & Model
Computation and Language
Helps computers understand videos in many languages.
VidLBEval: Benchmarking and Mitigating Language Bias in Video-Involved LVLMs
CV and Pattern Recognition
Fixes AI that watches videos to understand better.