Talking head generation with arbitrary identities and speech audio remains a crucial problem in the virtual metaverse. Recently, diffusion models have become a popular generative technique in this field owing to their strong generative capabilities. However, several challenges remain for diffusion-based methods: 1) inefficient inference and visual artifacts caused by the implicit latent space of Variational Auto-Encoders (VAEs), which complicates the diffusion process; and 2) a lack of authentic facial expressions and head movements due to inadequate multi-modal information fusion.
multi-modal information fusion. In this paper, MoDA handles these challenges by: 1) defining a joint parameter space that bridges motion generation and neural rendering, and leveraging flow matching to simplify diffusion learning; 2) introducing a multi-modal diffusion architecture to
model the interaction among noisy motion, audio, and auxiliary conditions, enhancing overall facial expressiveness. In addition, a coarse-to-fine fusion strategy is employed to progressively integrate different modalities, ensuring effective feature fusion. Experimental results
demonstrate that MoDA improves video diversity, realism, and efficiency, making it suitable for real-world applications.
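To make the description above more concrete, below is a minimal PyTorch sketch, not the authors' released implementation, of how a flow-matching objective over a joint motion-parameter space and a multi-modal transformer with progressively stronger cross-modal fusion could be wired together. All class names (MultiModalFusionBlock, MotionFlowModel), dimensions, and the per-block fusion schedule are illustrative assumptions.

```python
# Minimal sketch (not MoDA's released code) of flow matching over a joint
# motion-parameter space with multi-modal (audio + auxiliary) conditioning.
import torch
import torch.nn as nn

MOTION_DIM = 256   # assumed size of the joint motion/rendering parameter vector
AUDIO_DIM = 768    # assumed size of per-frame audio features from a speech encoder
COND_DIM = 64      # assumed size of auxiliary conditions (e.g., emotion, pose hints)

class MultiModalFusionBlock(nn.Module):
    """One transformer block that fuses noisy motion tokens with audio and
    auxiliary-condition tokens. `mix` in [0, 1] scales the cross-modal update,
    allowing a coarse-to-fine fusion schedule across blocks (an assumption)."""
    def __init__(self, dim: int, n_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, motion, context, mix: float):
        motion = motion + self.self_attn(motion, motion, motion)[0]
        fused, _ = self.cross_attn(motion, context, context)
        motion = motion + mix * fused          # weak fusion early, strong fusion late
        return motion + self.ff(motion)

class MotionFlowModel(nn.Module):
    """Predicts the flow-matching velocity field for motion parameters."""
    def __init__(self, dim: int = 512, n_blocks: int = 6):
        super().__init__()
        self.motion_in = nn.Linear(MOTION_DIM, dim)
        self.audio_in = nn.Linear(AUDIO_DIM, dim)
        self.cond_in = nn.Linear(COND_DIM, dim)
        self.time_in = nn.Linear(1, dim)
        self.blocks = nn.ModuleList(MultiModalFusionBlock(dim) for _ in range(n_blocks))
        self.out = nn.Linear(dim, MOTION_DIM)

    def forward(self, x_t, t, audio, cond):
        h = self.motion_in(x_t) + self.time_in(t[:, None, None].float())
        context = torch.cat([self.audio_in(audio), self.cond_in(cond)], dim=1)
        for i, block in enumerate(self.blocks):
            mix = (i + 1) / len(self.blocks)   # progressively stronger cross-modal fusion
            h = block(h, context, mix)
        return self.out(h)

def flow_matching_loss(model, x1, audio, cond):
    """Conditional flow matching: regress the velocity (x1 - x0) along the
    straight path x_t = (1 - t) * x0 + t * x1 between noise and data."""
    x0 = torch.randn_like(x1)                          # noise sample
    t = torch.rand(x1.size(0), device=x1.device)       # per-sample time in [0, 1]
    x_t = (1 - t)[:, None, None] * x0 + t[:, None, None] * x1
    v_target = x1 - x0
    v_pred = model(x_t, t, audio, cond)
    return torch.mean((v_pred - v_target) ** 2)
```

The increasing `mix` weight across blocks is one simple way to realize the progressive, coarse-to-fine integration of modalities described above; the actual fusion mechanism in MoDA may differ.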
Gallery
Talking Head Generation in Complex Scenarios.
Fine-grained Emotion Control (Happy, Sad).
Long Video Generation.
Citation
@article{li2025moda,
title = {MoDA: Multi-modal Diffusion Architecture for Talking Head Generation},
author = {Li, Xinyang and Li, Gen and Lin, Zhihui and Qian, Yichen and
Yao, Gongxin and Jia, Weinan and Chen, Weihua and Wang, Fan},
journal = {arXiv preprint arXiv:2507.03256},
year = {2025}
}