Article Information
Abstract
This study proposes a novel model structure that extends the Korean speech synthesis model Korean-FastSpeech2 to implement 3D lip-sync. Existing Korean lip-sync technologies have struggled to accurately render the three-dimensional articulation of the Korean pronunciations "아" (/a/), "오" (/o/), and "우" (/u/), particularly lip rounding (the ARKit mouthPucker blendshape). To address this, we add a Lip Predictor to the Encoder-Variance Adaptor-Decoder architecture so that the model can learn from ARKit facial animation data. The Lip Predictor, built on a four-layer Transformer decoder with eight attention heads, processes phoneme features and temporal information. Because it shares the Variance Adaptor's output with the speech Decoder, it naturally resolves synchronization issues between speech and lip movements, which is the core contribution of this study. The proposed model supports specialized learning for the "아", "오", and "우" pronunciations and is expected to offer better precision, synchronization accuracy, and scalability than existing 3D lip-sync algorithms such as Audio2Face, VOCA, and FaceFormer. This work highlights the potential for advancing lip-sync technology for minority languages.
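The abstract describes the Lip Predictor's placement but gives no implementation details, so the following is only a minimal PyTorch sketch of how such a module could sit on top of the Variance Adaptor. The four decoder layers and eight attention heads follow the abstract; the hidden size (256), the 52-dimensional ARKit blendshape output, the use of `nn.TransformerDecoder`, and the sigmoid output range are assumptions made for illustration.

```python
# Illustrative sketch (not the authors' released code): a Lip Predictor that
# consumes the Variance Adaptor's frame-level hidden sequence, which is also
# fed to the mel Decoder, and predicts per-frame ARKit blendshape weights
# (e.g. mouthPucker). Hidden size 256 and 52 blendshapes are assumed values.

import torch
import torch.nn as nn


class LipPredictor(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 8,
                 n_layers: int = 4, n_blendshapes: int = 52):
        super().__init__()
        decoder_layer = nn.TransformerDecoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        # Four decoder layers with eight attention heads, as in the abstract.
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=n_layers)
        self.out_proj = nn.Linear(d_model, n_blendshapes)

    def forward(self, variance_out: torch.Tensor,
                phoneme_features: torch.Tensor) -> torch.Tensor:
        # variance_out: (batch, n_frames, d_model), shared with the speech
        #   Decoder so the lip trajectory stays frame-aligned with the audio.
        # phoneme_features: (batch, n_phonemes, d_model) from the Encoder.
        hidden = self.decoder(tgt=variance_out, memory=phoneme_features)
        return torch.sigmoid(self.out_proj(hidden))  # blendshape weights in [0, 1]


# Example: 80 mel frames expanded from 20 phonemes.
lip_predictor = LipPredictor()
weights = lip_predictor(torch.randn(1, 80, 256), torch.randn(1, 20, 256))
print(weights.shape)  # torch.Size([1, 80, 52])
```

Because the same length-regulated sequence drives both the mel Decoder and this predictor, each predicted blendshape frame corresponds to a mel frame, which is the mechanism the abstract credits for speech-lip synchronization.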
Table of Contents
1. Introduction
2. Related Research
2.1 Korean-FastSpeech2
2.2 ARKit Facial Animation
2.3 VOCA: Voice Operated Character Animation
2.4 Existing 3D Lip-Sync Algorithms
3. Research Methods
3.1 Data Preparation
3.2 Data Preprocessing
3.3 Model Architecture
3.4 Training Procedure
3.5 Proposed Evaluation Framework
4. Expected Model Superiority
4.1 Precise Implementation of "아", "오", "우" Pronunciations
4.2 Speech-Lip Synchronization Precision
4.3 Korean Phoneme-Specific Learning Capability
4.4 Scalability and Flexibility
4.5 Real-Time Application Potential
4.6 Theoretical Model Validation Framework
5. Discussion and Future Work
5.1 Speaker-Dependent Data Considerations
5.2 Dataset Scalability Strategy
5.3 Emotional Expression Integration
6. Conclusion
Acknowledgement
Reference
