Talking head generation with arbitrary identities and speech audio remains a crucial problem in the virtual metaverse. Recently, diffusion models have become a popular generative technique in this field owing to their strong generative capabilities. However, several challenges remain for diffusion-based methods: 1) inefficient inference and visual artifacts caused by the implicit latent space of Variational Auto-Encoders (VAEs), which complicates the diffusion process; and 2) a lack of authentic facial expressions and head movements due to inadequate multi-modal information fusion.
multi-modal information fusion. In this paper, MoDA handles these challenges by: 1) defining a joint parameter space that bridges motion generation and neural rendering, and leveraging flow matching to simplify diffusion learning; 2) introducing a multi-modal diffusion architecture to
model the interaction among noisy motion, audio, and auxiliary conditions, enhancing overall facial expressiveness. In addition, a coarse-to-fine fusion strategy is employed to progressively integrate different modalities, ensuring effective feature fusion. Experimental results
demonstrate that MoDA improves video diversity, realism, and efficiency, making it suitable for real-world applications.
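To make the description above more concrete, below is a minimal PyTorch sketch, not the authors' released implementation, of how a flow-matching objective over a joint motion-parameter space and a multi-modal transformer with progressively stronger cross-modal fusion could be wired together. All class names (MultiModalFusionBlock, MotionFlowModel), dimensions, and the per-block fusion schedule are illustrative assumptions.

```python
# Minimal sketch (not MoDA's released code) of flow matching over a joint
# motion-parameter space with multi-modal (audio + auxiliary) conditioning.
import torch
import torch.nn as nn

MOTION_DIM = 256   # assumed size of the joint motion/rendering parameter vector
AUDIO_DIM = 768    # assumed size of per-frame audio features from a speech encoder
COND_DIM = 64      # assumed size of auxiliary conditions (e.g., emotion, pose hints)

class MultiModalFusionBlock(nn.Module):
    """One transformer block that fuses noisy motion tokens with audio and
    auxiliary-condition tokens. `mix` in [0, 1] scales the cross-modal update,
    allowing a coarse-to-fine fusion schedule across blocks (an assumption)."""
    def __init__(self, dim: int, n_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, motion, context, mix: float):
        motion = motion + self.self_attn(motion, motion, motion)[0]
        fused, _ = self.cross_attn(motion, context, context)
        motion = motion + mix * fused          # weak fusion early, strong fusion late
        return motion + self.ff(motion)

class MotionFlowModel(nn.Module):
    """Predicts the flow-matching velocity field for motion parameters."""
    def __init__(self, dim: int = 512, n_blocks: int = 6):
        super().__init__()
        self.motion_in = nn.Linear(MOTION_DIM, dim)
        self.audio_in = nn.Linear(AUDIO_DIM, dim)
        self.cond_in = nn.Linear(COND_DIM, dim)
        self.time_in = nn.Linear(1, dim)
        self.blocks = nn.ModuleList(MultiModalFusionBlock(dim) for _ in range(n_blocks))
        self.out = nn.Linear(dim, MOTION_DIM)

    def forward(self, x_t, t, audio, cond):
        h = self.motion_in(x_t) + self.time_in(t[:, None, None].float())
        context = torch.cat([self.audio_in(audio), self.cond_in(cond)], dim=1)
        for i, block in enumerate(self.blocks):
            mix = (i + 1) / len(self.blocks)   # progressively stronger cross-modal fusion
            h = block(h, context, mix)
        return self.out(h)

def flow_matching_loss(model, x1, audio, cond):
    """Conditional flow matching: regress the velocity (x1 - x0) along the
    straight path x_t = (1 - t) * x0 + t * x1 between noise and data."""
    x0 = torch.randn_like(x1)                          # noise sample
    t = torch.rand(x1.size(0), device=x1.device)       # per-sample time in [0, 1]
    x_t = (1 - t)[:, None, None] * x0 + t[:, None, None] * x1
    v_target = x1 - x0
    v_pred = model(x_t, t, audio, cond)
    return torch.mean((v_pred - v_target) ** 2)
```

The increasing `mix` weight across blocks is one simple way to realize the progressive, coarse-to-fine integration of modalities described above; the actual fusion mechanism in MoDA may differ.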
Gallery
Talking Head Generation in Complex Scenarios.
Fine-grained Emotion Control (Happy, Sad).
Long Video Generation.
Citation
@article{li2025moda,
title = {MoDA: Multi-modal Diffusion Architecture for Talking Head Generation},
author = {Li, Xinyang and Li, Gen and Lin, Zhihui and Qian, Yichen and
Yao, Gongxin and Jia, Weinan and Chen, Weihua and Wang, Fan},
journal = {arXiv preprint arXiv:2507.03256},
year = {2025}
}