Contrastive Language-Image Learning with Augmented Textual Prompts for 3D/4D FER Using Vision-Language Model
By: Muzammil Behzad, Guoying Zhao
Potential Business Impact:
Reads emotions from 3D and 4D face scans.
In this paper, we introduce AffectVLM, a vision-language model designed to integrate multiple views for a semantically rich and visually comprehensive understanding of facial emotions from 3D/4D data. To effectively capture visual features, we propose a joint representation learning framework paired with a novel gradient-friendly loss function that accelerates model convergence towards optimal feature representations. Additionally, we introduce augmented textual prompts to enhance the model's linguistic capabilities and employ mixed-view augmentation to expand the visual dataset. We also develop a Streamlit app for real-time interactive inference and enable distributed training of the model. Extensive experiments validate the superior performance of AffectVLM across multiple benchmarks.
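The abstract does not spell out the implementation, but the core idea of contrastive language-image learning with augmented textual prompts can be sketched as follows. This is a minimal illustrative example, not AffectVLM's actual code: the `text_encoder` interface, the emotion label set, and the prompt templates are all assumptions made here for clarity.

```python
import torch
import torch.nn.functional as F

# Illustrative emotion labels and prompt templates; the paper's actual
# augmented prompts are not specified in the abstract.
EMOTIONS = ["angry", "disgust", "fear", "happy", "sad", "surprise"]
TEMPLATES = [
    "a 3D face scan of a person who looks {}",
    "a multiview rendering of a {} facial expression",
    "a photo of a face expressing a {} emotion",
]

def augmented_prompts():
    """Expand each emotion label into several textual prompts (text augmentation)."""
    return {e: [t.format(e) for t in TEMPLATES] for e in EMOTIONS}

def class_text_embeddings(text_encoder, prompts):
    """Average each class's augmented-prompt embeddings into one text prototype.

    Assumes a hypothetical text_encoder that maps a list of T strings to a
    (T, D) tensor of embeddings.
    """
    protos = []
    for e in EMOTIONS:
        emb = text_encoder(prompts[e])                   # (T, D)
        protos.append(F.normalize(emb.mean(dim=0), dim=-1))
    return torch.stack(protos)                           # (C, D)

def contrastive_fer_loss(image_emb, text_protos, labels, temperature=0.07):
    """CLIP-style image-to-text contrastive loss: each multiview image
    embedding is pulled toward its emotion's prompt prototype and pushed
    away from the other classes' prototypes."""
    image_emb = F.normalize(image_emb, dim=-1)           # (B, D)
    logits = image_emb @ text_protos.t() / temperature   # (B, C) similarities
    return F.cross_entropy(logits, labels)               # labels: (B,) class ids
```

In this sketch, averaging several prompt variants per class plays the role of the augmented textual prompts, and the temperature-scaled cross-entropy over image-text similarities is the standard CLIP-style contrastive objective; the paper's gradient-friendly loss and mixed-view augmentation would replace or extend these pieces.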
Similar Papers
Unsupervised Multiview Contrastive Language-Image Joint Learning with Pseudo-Labeled Prompts Via Vision-Language Model for 3D/4D Facial Expression Recognition
CV and Pattern Recognition
Helps computers understand feelings from faces.
Self-Supervised Multi-View Representation Learning using Vision-Language Model for 3D/4D Facial Expression Recognition
CV and Pattern Recognition
Helps computers read feelings from your face more accurately.
Seeing is Believing: Emotion-Aware Audio-Visual Language Modeling for Expressive Speech Generation
Computation and Language
Makes computer voices sound more real.