Submitted to ICASSP 2024
Generating realistic talking faces is a complex and widely studied task with numerous applications. In this paper, we present DiffTalker, a novel model designed to generate lifelike talking faces through audio and landmark co-driving. DiffTalker addresses the challenges of directly applying diffusion models, which are traditionally trained on text-image pairs, to audio control. DiffTalker consists of two agent networks: a transformer-based landmark completion network for geometric accuracy and a diffusion-based face generation network for texture details. Landmarks play a pivotal role in establishing a seamless connection between the audio and image domains, facilitating the incorporation of knowledge from pre-trained diffusion models. This approach efficiently produces articulate talking faces. Experimental results showcase DiffTalker's superior performance in producing clear and geometrically accurate talking faces, all without the need for additional alignment between audio and image features.
As illustrated in the above figure, we decompose DiffTalker into two agent networks: (1) an audio-guided landmark prediction network and (2) a landmark-guided face generation network. The landmark prediction network takes the upper half of the landmarks as input and generates the remaining key points with the support of audio features. It comprises three transformer-based modules, responsible for predicting the lower half of the face (LF-Trans), establishing the basic mouth shape (BM-Trans), and adjusting the mouth based on audio input (AM-Trans). We view the landmark results as supplementary information that assists the diffusion model in generating precise facial geometry, particularly for the mouth shape. The diffusion-based face prediction network takes both Gaussian noise and the predicted landmarks as input, producing the facial texture. Similar to most methods, the landmark features interact with image features through cross-attention. By employing landmarks, we bridge the domain gap between the audio and image spaces without the need for training an additional aligner network.
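To make the two-stage design concrete, the following is a minimal PyTorch sketch of the landmark completion stage. The module names mirror LF-Trans, BM-Trans, and AM-Trans, but the landmark split (36 upper / 32 lower of a 68-point layout), feature sizes, layer counts, and fusion scheme are illustrative assumptions rather than the released implementation.

```python
# Minimal sketch of the landmark completion stage described above (assumptions:
# 68-point landmarks split into 36 upper / 32 lower, DeepSpeech-style audio
# features of dimension 29; all sizes are illustrative).
import torch
import torch.nn as nn

class LandmarkCompletion(nn.Module):
    """Predicts lower-face key points from upper-face key points and audio."""
    def __init__(self, n_upper=36, n_lower=32, audio_dim=29, d_model=128):
        super().__init__()
        def encoder():
            layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            return nn.TransformerEncoder(layer, num_layers=2)
        self.embed_upper = nn.Linear(2, d_model)          # (x, y) per key point
        self.embed_audio = nn.Linear(audio_dim, d_model)  # e.g. DeepSpeech features
        self.lf_trans = encoder()   # predicts the lower half of the face
        self.bm_trans = encoder()   # establishes the basic mouth shape
        self.am_trans = encoder()   # adjusts the mouth based on the audio
        self.lower_queries = nn.Parameter(torch.randn(n_lower, d_model))
        self.to_xy = nn.Linear(d_model, 2)

    def forward(self, upper_lmk, audio_feat):
        # upper_lmk: (B, n_upper, 2); audio_feat: (B, T, audio_dim)
        B, n_upper = upper_lmk.shape[:2]
        ctx = self.embed_upper(upper_lmk)                          # (B, n_upper, d)
        q = self.lower_queries.unsqueeze(0).expand(B, -1, -1)      # (B, n_lower, d)
        h = self.lf_trans(torch.cat([ctx, q], dim=1))              # coarse lower face
        h = self.bm_trans(h)                                       # basic mouth shape
        a = self.embed_audio(audio_feat)                           # (B, T, d)
        h = self.am_trans(torch.cat([h, a], dim=1))                # audio-driven refinement
        lower = self.to_xy(h[:, n_upper:n_upper + q.size(1)])      # (B, n_lower, 2)
        return lower

net = LandmarkCompletion()
lower = net(torch.randn(2, 36, 2), torch.randn(2, 16, 29))  # -> (2, 32, 2)
```

The completed landmarks then condition the diffusion-based face generator through cross-attention, in the same way a text prompt would condition a pre-trained text-to-image diffusion model.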
The above figure presents a comparison between landmark completion and ground truth. In the left portion of the results, it is evident that the landmark completion network accurately predicts the coordinates of the lower key points, which provide geometry information for the subsequent face synthesis network. To assess the impact of audio features on the mouth key points, we use mismatched landmarks and speech features as inputs, for example pairing the upper landmarks of the 142nd frame with the audio features of the 224th frame. The third column of the results demonstrates that the mouth shape still aligns accurately with the facial image, confirming that our design succeeds by relying solely on the offset of the mouth key points.
To assess the performance of the generated faces, we use five metrics: LSE-D, LSE-C, LD (the landmark distance between predictions and ground truth), PSNR, and SSIM. The table below shows that our method (Ours2) outperforms the GAN-based method and is close to the current diffusion model. However, DAE requires additional training of an aligner between audio features and image features. When we remove the landmark information and directly use the DeepSpeech features as the controllable information (replacing the text prompt), the performance of our model (Ours1) deteriorates significantly. This underscores the effectiveness of landmarks as additional information. The above figure showcases the qualitative outcomes obtained from our face synthesis network. The illustration highlights that the generated faces exhibit a coherent geometric structure across various head poses, owing to the guidance from the upper face and the complete landmark information. The lower part of the figure displays the landmark prediction results, reflecting the geometric accuracy of the synthesized face.
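For reference, below is a short sketch of how the geometry and pixel metrics can be computed. LSE-D and LSE-C are obtained from a pretrained SyncNet and are omitted here; the LD definition (mean Euclidean distance over key points) and the skimage-based PSNR/SSIM calls follow the standard formulations and are our assumptions, not the paper's exact evaluation code.

```python
# Sketch of the geometry/pixel metrics reported in the table (LD, PSNR, SSIM).
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def landmark_distance(pred_lmk: np.ndarray, gt_lmk: np.ndarray) -> float:
    """Mean Euclidean distance between predicted and GT key points, (N, 2) each."""
    return float(np.linalg.norm(pred_lmk - gt_lmk, axis=1).mean())

def frame_metrics(pred_img: np.ndarray, gt_img: np.ndarray):
    """PSNR and SSIM between a predicted frame and its GT frame, (H, W, 3) uint8."""
    psnr = peak_signal_noise_ratio(gt_img, pred_img, data_range=255)
    ssim = structural_similarity(gt_img, pred_img, channel_axis=-1, data_range=255)
    return psnr, ssim

# Example with random data, just to show the expected shapes.
ld = landmark_distance(np.random.rand(68, 2) * 256, np.random.rand(68, 2) * 256)
psnr, ssim = frame_metrics(
    np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8),
    np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8),
)
```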
In this paper, we introduce DiffTalker, a co-driven diffusion method tailored for generating talking faces. DiffTalker consists of two sub-agents: a landmark completion network and a face synthesis network. By harnessing landmarks, we establish a seamless connection between the audio domain and images, thereby enhancing the network's capability to generate precise geometry and mouth shape results. Experimental results demonstrate DiffTalker's exceptional performance in producing clear and geometrically accurate talking faces, all without the requirement for additional alignment between audio and image features.
Our method makes full use of landmark information to generate images consistent with the speech, avoiding the unnecessary training of an aligner between audio features and image features. However, there is still room for improvement in the inter-frame continuity and consistency of our method. This is closely related to the accuracy and inter-frame continuity of the landmark predictions (using the results of FFmpeg as ground truth itself introduces a certain error). In future work, we will improve the inter-frame continuity of the predicted landmarks to further ensure the quality of the generated videos.
Video clip of the predicted images.
Video clip of the predicted landmarks.
Video clip of the ground truth.
@article{qi2023difftalker,
title={DiffTalker: Co-driven audio-image diffusion for talking faces via intermediate landmarks},
author={Qi, Zipeng and Zhang, Xulong and Cheng, Ning and Xiao, Jing and Wang, Jianzong},
journal={arXiv preprint arXiv:2309.07509},
year={2023}
}