Convergence of Internet, Broadcasting and Communication

Proposal of a Korean 3D Lip-Sync Model Structure through Extension of an Existing Korean Speech Synthesis Model

Abstract

This study proposes a novel model structure for implementing 3D lip-sync by extending the Korean speech synthesis model Korean-FastSpeech2. Existing Korean lip-sync technologies have struggled to accurately render the three-dimensional expression of the Korean pronunciations "아" (/a/), "오" (/o/), and "우" (/u/), particularly the lip rounding (mouthPucker). To address this, we add a Lip Predictor to the Encoder-Variance Adaptor-Decoder architecture, enabling the model to learn ARKit facial animation data. The Lip Predictor, built on a Transformer decoder with four layers and eight attention heads, processes phoneme features and temporal information. Because it shares the Variance Adaptor's output with the speech Decoder, synchronization between speech and lip movements is resolved naturally, which is the core contribution of this study. The proposed model enables specialized learning for the "아", "오", and "우" pronunciations and is expected to offer superior precision, synchronization accuracy, and scalability compared with existing 3D lip-sync algorithms such as Audio2Face, VOCA, and FaceFormer. This work highlights the potential for advancing lip-sync technology for minority languages.
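As a rough illustration of the architecture the abstract describes (not the authors' implementation), the following minimal PyTorch sketch shows how a Lip Predictor branch of this kind might consume the Variance Adaptor's length-regulated output and regress per-frame ARKit blendshape coefficients such as mouthPucker. The layer and head counts follow the abstract; the hidden size, the 52-blendshape output dimension, the use of encoder-style self-attention blocks in place of the non-autoregressive Transformer decoder, and all names are assumptions made purely for illustration.

    # Hypothetical sketch of the Lip Predictor branch; sizes and names are assumed.
    import torch
    import torch.nn as nn

    class LipPredictor(nn.Module):
        """4-layer, 8-head Transformer stack that regresses ARKit blendshapes."""

        def __init__(self, d_model=256, n_heads=8, n_layers=4, n_blendshapes=52):
            super().__init__()
            layer = nn.TransformerEncoderLayer(
                d_model=d_model, nhead=n_heads, dim_feedforward=1024, batch_first=True)
            # Encoder-style self-attention blocks stand in for the paper's
            # non-autoregressive Transformer decoder.
            self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
            # Per-frame regression head; mouthPucker is one of the ARKit coefficients.
            self.head = nn.Linear(d_model, n_blendshapes)

        def forward(self, variance_adaptor_out):
            # variance_adaptor_out: (batch, frames, d_model), already expanded to
            # mel-frame length by the Variance Adaptor, so lip frames and speech
            # frames share the same timeline and need no extra alignment step.
            h = self.backbone(variance_adaptor_out)
            return torch.sigmoid(self.head(h))  # blendshape weights in [0, 1]

    # Usage: the same hidden sequence feeds both the mel Decoder and this branch.
    hidden = torch.randn(2, 120, 256)       # (batch, frames, d_model)
    lip_frames = LipPredictor()(hidden)     # -> (2, 120, 52)

Because the branch reads the same hidden sequence as the speech Decoder, speech and lip animation are generated from a shared timeline, which is how the synchronization issue described in the abstract is avoided by construction.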

Table of Contents

Abstract
1. Introduction
2. Related research
2.1 Korean-FastSpeech2
2.2 ARKit Facial Animation
2.3 VOCA: Voice Operated Character Animation
2.4 Existing 3D Lip-Sync Algorithms
3. Research Methods
3.1 Data Preparation
3.2 Data Preprocessing
3.3 Model Architecture
3.4 Training Procedure
3.5 Proposed Evaluation Framework
4. Expected Model Superiority
4.1 Precise Implementation of "아", "오", "우" Pronunciations
4.2 Speech-Lip Synchronization Precision
4.3 Korean Phoneme-Specific Learning Capability
4.4 Scalability and Flexibility
4.5 Real-Time Application Potential
4.6 Theoretical Model Validation Framework
5. Discussion and Future Work
5.1 Speaker-Dependent Data Considerations
5.2 Dataset Scalability Strategy
5.3 Emotional Expression Integration
6. Conclusion
Acknowledgement
References

Author Information

  • Ki-Hong Kim, Professor, Department of Visual Animation, Dongseo University, Korea
