Social-MAE: A Transformer-Based Multimodal Autoencoder for Face and Voice
Abstract
Social-MAE, an extended Contrastive Audio-Visual Masked Auto-Encoder, achieves state-of-the-art performance in multimodal emotion and laughter recognition through in-domain self-supervised pre-training.
Human social behaviors are inherently multimodal, necessitating the development of powerful audiovisual models for their perception. In this paper, we present Social-MAE, an audiovisual Masked Autoencoder based on an extended version of the Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE) and pre-trained on audiovisual social data. Specifically, we modify CAV-MAE to receive a larger number of frames as input and pre-train it on a large dataset of human social interaction (VoxCeleb2) in a self-supervised manner. We demonstrate the effectiveness of this model by fine-tuning and evaluating it on several social and affective downstream tasks, namely emotion recognition, laughter detection, and apparent personality estimation. The model achieves state-of-the-art results on multimodal emotion recognition and laughter recognition, and competitive results for apparent personality estimation, demonstrating the effectiveness of in-domain self-supervised pre-training. Code and model weights are available at https://github.com/HuBohy/SocialMAE.
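As a rough illustration of the recipe the abstract describes (a joint audiovisual masked autoencoder with a contrastive objective, where the visual branch ingests several frames instead of a single one), here is a hypothetical PyTorch sketch. It is not the authors' released implementation: the module names, token sizes, depths, masking ratio, and temperature below are illustrative assumptions; the actual architecture and weights are in the linked repository.

```python
# Hypothetical sketch (not the released Social-MAE code) of a CAV-MAE-style
# audiovisual masked autoencoder whose visual branch embeds several frames.
# All sizes, depths, and hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SocialMAESketch(nn.Module):
    def __init__(self, dim=256, num_frames=8, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        # Audio branch: 1 x 128 x 1024 log-mel spectrogram cut into 16x16 patches.
        self.audio_patch = nn.Conv2d(1, dim, kernel_size=16, stride=16)
        # Video branch: num_frames RGB frames, 16x16 patches, shared weights per frame.
        self.video_patch = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        self.frame_pos = nn.Parameter(torch.zeros(num_frames, 1, dim))  # temporal position
        enc_layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=4)
        dec_layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, num_layers=2)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.recon_head = nn.Linear(dim, dim)  # predicts masked token embeddings (simplified)

    def _tokens(self, audio, video):
        # audio: (B, 1, 128, 1024); video: (B, T, 3, 224, 224) with T == num_frames
        a = self.audio_patch(audio).flatten(2).transpose(1, 2)            # (B, Na, D)
        B, T = video.shape[:2]
        v = self.video_patch(video.flatten(0, 1)).flatten(2).transpose(1, 2)
        v = v.view(B, T, -1, v.shape[-1]) + self.frame_pos                # add temporal position
        return a, v.flatten(1, 2)                                         # (B, T*Nv, D)

    def forward(self, audio, video):
        a, v = self._tokens(audio, video)
        x = torch.cat([a, v], dim=1)                                      # joint audio-visual sequence
        B, N, D = x.shape
        # MAE-style random masking: encode only a small subset of tokens.
        keep = int(N * (1 - self.mask_ratio))
        idx = torch.rand(B, N, device=x.device).argsort(dim=1)
        kept = torch.gather(x, 1, idx[:, :keep, None].expand(-1, -1, D))
        latent = self.encoder(kept)
        # Decode the full-length sequence with mask tokens in the removed positions.
        full = self.mask_token.expand(B, N, D).clone()
        full.scatter_(1, idx[:, :keep, None].expand(-1, -1, D), latent)
        recon = self.recon_head(self.decoder(full))
        recon_loss = F.mse_loss(recon, x.detach())
        # Contrastive term between pooled audio and visual encodings (CAV-MAE-style).
        a_lat = F.normalize(self.encoder(a).mean(1), dim=-1)
        v_lat = F.normalize(self.encoder(v).mean(1), dim=-1)
        logits = a_lat @ v_lat.t() / 0.07
        labels = torch.arange(B, device=x.device)
        contrast_loss = F.cross_entropy(logits, labels)
        return recon_loss + contrast_loss
```

Under the same assumptions, a typical use would be to pre-train this objective on VoxCeleb2 clips, discard the decoder, and fine-tune the encoder with a small classification head for the downstream emotion, laughter, or personality labels.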
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- CLEP-DG: Contrastive Learning for Speech Emotion Domain Generalization via Soft Prompt Tuning (2025)
- Team RAS in 9th ABAW Competition: Multimodal Compound Expression Recognition Approach (2025)
- AudioMAE++: learning better masked audio representations with SwiGLU FFNs (2025)
- LPGNet: A Lightweight Network with Parallel Attention and Gated Fusion for Multimodal Emotion Recognition (2025)
- Enhancing Speech Emotion Recognition Leveraging Aligning Timestamps of ASR Transcripts and Speaker Diarization (2025)
- ASDA: Audio Spectrogram Differential Attention Mechanism for Self-Supervised Representation Learning (2025)
- Taming Transformer for Emotion-Controllable Talking Face Generation (2025)