NeRF-3DTalker: Neural Radiance Field with 3D Prior Aided Audio Disentanglement for Talking Head Synthesis

📅 2025-02-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing NeRF-based audio-driven talking-head methods suffer from limited view generalization and poor audio-visual alignment accuracy. To address these bottlenecks, this paper proposes a free-viewpoint 3D talking-head synthesis framework. Methodologically: (1) we introduce the first 3D facial prior-guided audio feature disentanglement module, explicitly decoupling lip-motion features driven by speech from speaker-specific identity features; (2) we design a local-global normalized spatial correction mechanism to mitigate geometric distortions in NeRF rendering under non-frontal viewpoints. The model integrates NeRF-based rendering, multi-view supervision, and joint audio-visual optimization. Quantitatively, it achieves state-of-the-art performance across key metrics—lower LPIPS and FID scores, and higher SyncNet score—enabling high-fidelity, lip-synchronized 360° free-viewpoint video generation.
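The audio disentanglement described above splits speech features into a per-frame lip-motion code and a time-invariant speaker-style code. A minimal NumPy sketch of that split, with all shapes, projection matrices, and names (`W_motion`, `W_style`) as illustrative placeholders rather than the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy audio embedding for T frames (e.g. from a pretrained speech encoder).
T, D = 8, 64  # frames, embedding dim (hypothetical sizes)
audio = rng.standard_normal((T, D))

# Two projections (random placeholders for learned layers) factor the
# embedding into complementary codes.
W_motion = rng.standard_normal((D, 32))
W_style = rng.standard_normal((D, 16))

# Per-frame motion code: drives lip movement, varies over time.
motion_code = audio @ W_motion            # shape (T, 32)

# Speaker style code: averaged over time, so it is identity-specific
# and (approximately) invariant to what is being said.
style_code = (audio @ W_style).mean(axis=0)  # shape (16,)

print(motion_code.shape, style_code.shape)
```

In the paper the decoupling is additionally guided by a 3D facial prior; here the time-averaging is only a stand-in for that supervision.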

📝 Abstract
Talking head synthesis aims to synthesize a lip-synchronized talking head video from audio. Recently, the capability of NeRF to enhance the realism and texture details of synthesized talking heads has attracted the attention of researchers. However, most current audio-driven NeRF methods are exclusively concerned with rendering frontal faces and cannot generate clear talking heads in novel views. Another prevalent challenge in current 3D talking head synthesis is the difficulty of aligning the acoustic and visual spaces, which often results in suboptimal lip-syncing of the generated talking heads. To address these issues, we propose Neural Radiance Field with 3D Prior Aided Audio Disentanglement for Talking Head Synthesis (NeRF-3DTalker). Specifically, the proposed method employs 3D prior information to synthesize clear talking heads with free views. Additionally, we propose a 3D Prior Aided Audio Disentanglement module, designed to disentangle the audio into two distinct categories: features related to 3D-aware speech movements and features related to speaking style. Moreover, to reposition generated frames that are distant from the speaker's motion space back into the real space, we devise a local-global Standardized Space, which normalizes the irregular positions in the generated frames from both global and local semantic perspectives. Through comprehensive qualitative and quantitative experiments, we demonstrate that NeRF-3DTalker outperforms state-of-the-art methods in synthesizing realistic talking head videos, exhibiting superior image quality and lip synchronization. Project page: https://nerf-3dtalker.github.io/NeRF-3Dtalker.
Problem

Research questions and friction points this paper is trying to address.

Limited realism and texture detail in synthesized talking heads
Suboptimal lip-syncing caused by misaligned acoustic and visual spaces
Inability of current audio-driven NeRF methods to render clear talking heads in novel views
Innovation

Methods, ideas, or system contributions that make the work stand out.

Neural Radiance Field with 3D Prior
3D Prior Aided Audio Disentanglement
Local-global Standardized Space
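The local-global Standardized Space listed above can be pictured as normalizing generated-frame features with both whole-frame (global) and patch-wise (local) statistics, then blending the two to pull outlying frames back toward the speaker's motion space. A minimal sketch under assumed shapes; the blend weight `alpha` and the patch-grid layout are illustrative placeholders, not the paper's formulation:

```python
import numpy as np

def standardize(x, axis, eps=1e-6):
    """Zero-mean, unit-variance normalization along the given axes."""
    mean = x.mean(axis=axis, keepdims=True)
    std = x.std(axis=axis, keepdims=True)
    return (x - mean) / (std + eps)

rng = np.random.default_rng(1)
# Hypothetical feature map: a 4x4 grid of face patches, 16 channels each.
feat = rng.standard_normal((4, 4, 16))

# Global view: statistics pooled over all patches, per channel.
global_norm = standardize(feat, axis=(0, 1))

# Local view: statistics computed within each patch, across channels.
local_norm = standardize(feat, axis=2)

# Blend the two normalized views; alpha is an illustrative placeholder.
alpha = 0.5
corrected = alpha * global_norm + (1 - alpha) * local_norm
print(corrected.shape)
```

The point of combining both views is that global statistics fix frame-level drift while local statistics correct distortions confined to individual regions (e.g. the mouth).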
Xiaoxing Liu
College of Intelligence and Computing, Tianjin University, Tianjin, China
Zhilei Liu
College of Intelligence and Computing, Tianjin University, Tianjin, China
Chongke Bi
Professor, Tianjin University
Visualization, Big data