MoDA: Multi-modal Diffusion Architecture for Talking Head Generation

Authors

Xinyang Li1,2,  Gen Li2,  Zhihui Lin1,3,  Yichen Qian1,3 †,  Gongxin Yao2,  Weinan Jia1,  Weihua Chen1,3,  Fan Wang1,3

1Xunguang Team, DAMO Academy, Alibaba Group    2Zhejiang University    3Hupan Lab

Corresponding authors: yichen.qyc@alibaba-inc.com, l_xyang@zju.edu.cn

GitHub | Paper on arXiv

Abstract

Talking head generation from arbitrary identities and speech audio remains a crucial problem for the virtual metaverse. Recently, diffusion models have become a popular generative technique in this field owing to their strong generation capabilities. However, several challenges remain for diffusion-based methods: 1) inefficient inference and visual artifacts caused by the implicit latent space of Variational Auto-Encoders (VAE), which complicates the diffusion process; 2) a lack of authentic facial expressions and head movements due to inadequate multi-modal information fusion. MoDA addresses these challenges by: 1) defining a joint parameter space that bridges motion generation and neural rendering, and leveraging flow matching to simplify diffusion learning; 2) introducing a multi-modal diffusion architecture to model the interaction among noisy motion, audio, and auxiliary conditions, enhancing overall facial expressiveness. In addition, a coarse-to-fine fusion strategy is employed to progressively integrate different modalities, ensuring effective feature fusion. Experimental results demonstrate that MoDA improves video diversity, realism, and efficiency, making it suitable for real-world applications.
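To make the two core ideas in the abstract concrete, the sketch below illustrates (1) a flow-matching objective over a motion parameter space and (2) a coarse-to-fine fusion of audio and auxiliary conditions into the denoising network. This is a minimal illustration, not the authors' implementation: the module names, token shapes, fusion schedule, and dimensions are all assumptions made for readability.

```python
# Illustrative sketch only: architecture details, names, and shapes are assumptions,
# not MoDA's released code.
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    """One transformer block that fuses noisy motion tokens with condition tokens."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.n1, self.n2, self.n3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x, cond):
        h = self.n1(x)
        x = x + self.self_attn(h, h, h)[0]
        x = x + self.cross_attn(self.n2(x), cond, cond)[0]  # inject conditions via cross-attention
        return x + self.ff(self.n3(x))

class MotionFlowNet(nn.Module):
    """Predicts the flow-matching velocity for motion tokens.
    Coarse-to-fine fusion (assumed schedule): early blocks attend only to audio,
    later blocks attend to audio plus auxiliary conditions."""
    def __init__(self, dim: int = 256, depth: int = 4):
        super().__init__()
        self.time_mlp = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.blocks = nn.ModuleList(FusionBlock(dim) for _ in range(depth))
        self.out = nn.Linear(dim, dim)

    def forward(self, x_t, t, audio, aux):
        x = x_t + self.time_mlp(t[:, None, None].float())  # broadcast time embedding over tokens
        for i, blk in enumerate(self.blocks):
            cond = audio if i < len(self.blocks) // 2 else torch.cat([audio, aux], dim=1)
            x = blk(x, cond)
        return self.out(x)

def flow_matching_loss(model, x1, audio, aux):
    """Rectified-flow style objective: regress the straight-line velocity x1 - x0
    along the linear interpolation between noise x0 and clean motion x1."""
    x0 = torch.randn_like(x1)                                  # noise sample
    t = torch.rand(x1.size(0), device=x1.device)               # per-sample time in [0, 1]
    x_t = (1 - t)[:, None, None] * x0 + t[:, None, None] * x1  # point on the interpolation path
    v_target = x1 - x0                                         # constant target velocity
    v_pred = model(x_t, t, audio, aux)
    return torch.mean((v_pred - v_target) ** 2)

# Hypothetical usage: 32 motion tokens, 50 audio tokens, 4 auxiliary tokens, dim 256.
model = MotionFlowNet()
x1 = torch.randn(2, 32, 256)     # "clean" motion parameters from the joint space
audio = torch.randn(2, 50, 256)  # audio features
aux = torch.randn(2, 4, 256)     # auxiliary conditions (e.g. emotion, identity)
loss = flow_matching_loss(model, x1, audio, aux)
loss.backward()
```

Regressing the straight-line velocity keeps the training target simple and avoids the noise-schedule bookkeeping of standard denoising diffusion, which is one reason flow matching simplifies diffusion learning over the motion parameter space.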

Gallery

Talking Head Generation in Complex Scenarios.

Fine-grained Emotion Control.

Long Video Generation.