"Humor, Art, or Misinformation?": A Multimodal Dataset for Intent-Aware Synthetic Image Detection
By: Anastasios Skoularikis, Stefanos-Iordanis Papadopoulos, Symeon Papadopoulos, and others
Potential Business Impact:
Tells whether AI-made pictures are meant as humor, art, or lies.
Recent advances in multimodal AI have enabled progress in detecting synthetic and out-of-context content. However, existing efforts largely overlook the intent behind AI-generated images. To fill this gap, we introduce S-HArM, a multimodal dataset for intent-aware classification, comprising 9,576 "in the wild" image-text pairs from Twitter/X and Reddit, labeled as Humor/Satire, Art, or Misinformation. Additionally, we explore three prompting strategies (image-guided, description-guided, and multimodally-guided) to construct a large-scale synthetic training dataset with Stable Diffusion. We conduct an extensive comparative study including modality fusion, contrastive learning, reconstruction networks, attention mechanisms, and large vision-language models. Our results show that models trained on image- and multimodally-guided data generalize better to "in the wild" content, due to preserved visual context. However, overall performance remains limited, highlighting the complexity of inferring intent and the need for specialized architectures.
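The three guidance strategies used to build the synthetic training set can be illustrated with a short sketch. The snippet below is a minimal illustration, assuming the Hugging Face diffusers library and a Stable Diffusion v1.5 checkpoint; the model ID, the strength value, and the captions are illustrative assumptions rather than the paper's exact configuration.

```python
# Minimal sketch of the three guidance strategies for generating synthetic
# training images with Stable Diffusion. Model ID, strength, and prompts are
# illustrative assumptions, not the configuration reported in the paper.
import torch
from PIL import Image
from diffusers import StableDiffusionPipeline, StableDiffusionImg2ImgPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "runwayml/stable-diffusion-v1-5"  # assumed checkpoint

txt2img = StableDiffusionPipeline.from_pretrained(model_id).to(device)
img2img = StableDiffusionImg2ImgPipeline.from_pretrained(model_id).to(device)

def description_guided(caption: str) -> Image.Image:
    # Text-to-image: generation is conditioned only on a textual description.
    return txt2img(prompt=caption).images[0]

def image_guided(source_image: Image.Image) -> Image.Image:
    # Image-to-image: the original image seeds the diffusion process, with a
    # generic prompt so the text contributes little guidance.
    return img2img(prompt="a photo", image=source_image, strength=0.6).images[0]

def multimodally_guided(source_image: Image.Image, caption: str) -> Image.Image:
    # Image-to-image conditioned on both the source image and its caption.
    return img2img(prompt=caption, image=source_image, strength=0.6).images[0]

# Example usage (path and caption are placeholders):
# src = Image.open("example.jpg").convert("RGB").resize((512, 512))
# synthetic = multimodally_guided(src, "a crowd gathers outside a courthouse")
```

The reported result that image- and multimodally-guided training data generalize better to "in the wild" content is consistent with the image-to-image paths above, which preserve the source image's visual context rather than regenerating it from text alone.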
Similar Papers
"Humor, Art, or Misinformation?": A Multimodal Dataset for Intent-Aware Synthetic Image Detection
CV and Pattern Recognition
Helps tell if AI pictures are jokes or lies.
D-HUMOR: Dark Humor Understanding via Multimodal Open-ended Reasoning
CV and Pattern Recognition
Helps computers spot dark jokes in memes.
Labels or Input? Rethinking Augmentation in Multimodal Hate Detection
CV and Pattern Recognition
Finds mean memes by looking at pictures and words.