A Study of a Method for Synthesizing Two Audio Clips with Smooth Transitions Using Interpolation in the Latent Space of Implicit Generative Models

ChanYeong Heo; Jaehee Jung

Article Information

A study of two Audio Synthesis Method with Smooth Transition using Implicit Generative Model's Interpolation in Latent space

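Korean Institute of Next Generation Computing (한국차세대컴퓨팅학회), Journal of the Korean Institute of Next Generation Computing (한국차세대컴퓨팅학회 논문지), Vol. 19, No. 5, October 2023, pp. 7-19, KCI-listed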

Abstract

English

This paper proposes a smooth audio synthesis method that applies an image-generation approach to audio synthesis. In an Implicit Generative Model, an image generated from a vector interpolated between two latent vectors transitions smoothly between the images of those two vectors, and we use this property to synthesize two audio clips. To reduce information loss and improve generation performance, the audio data were rearranged into image form without preprocessing such as the Fourier transform. The models trained were DCGAN (Deep Convolutional Generative Adversarial Networks) and DDIM (Denoising Diffusion Implicit Models). By comparing the transitions produced between two audio clips with linear interpolation and spherical interpolation, we aim to identify an effective model and interpolation method for smooth audio transitions. In the experiments, DCGAN produced smooth audio transitions with the various interpolation methods but with lower audio quality. DDIM, by contrast, generated high-quality audio, but the transition succeeded only with spherical interpolation. The study concludes that the combination of DDIM and spherical interpolation gives the best audio synthesis result.
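The two interpolation methods named in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names `lerp` and `slerp`, the latent dimension of 128, and the use of random Gaussian vectors as stand-ins for real latent vectors are all assumptions made for illustration.

```python
import numpy as np

def lerp(z0, z1, t):
    """Linear interpolation: a straight line between two latent vectors."""
    return (1.0 - t) * z0 + t * z1

def slerp(z0, z1, t, eps=1e-7):
    """Spherical linear interpolation: follows the arc between z0 and z1."""
    # Angle between the two vectors, computed on their flattened unit versions.
    u0 = z0.ravel() / np.linalg.norm(z0)
    u1 = z1.ravel() / np.linalg.norm(z1)
    omega = np.arccos(np.clip(np.dot(u0, u1), -1.0, 1.0))
    sin_omega = np.sin(omega)
    if sin_omega < eps:
        # Nearly parallel vectors: the arc degenerates, fall back to lerp.
        return lerp(z0, z1, t)
    w0 = np.sin((1.0 - t) * omega) / sin_omega
    w1 = np.sin(t * omega) / sin_omega
    return w0 * z0 + w1 * z1

# Hypothetical latent vectors (assumption: 128-dimensional Gaussian noise).
rng = np.random.default_rng(0)
z_a = rng.standard_normal(128)
z_b = rng.standard_normal(128)

steps = np.linspace(0.0, 1.0, 5)
path_lerp = [lerp(z_a, z_b, t) for t in steps]
path_slerp = [slerp(z_a, z_b, t) for t in steps]

# The lerp midpoint shrinks toward the origin; the slerp midpoint keeps
# a norm comparable to the endpoints.
mid_norm_lerp = np.linalg.norm(path_lerp[2])
mid_norm_slerp = np.linalg.norm(path_slerp[2])
```

One plausible reading of the reported result, offered here as an assumption rather than a claim from the paper, is that linearly interpolated Gaussian noise has a smaller norm than the noise a diffusion sampler expects, whereas spherical interpolation preserves it.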

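Korean (translated)

This study proposes a smooth speech synthesis method by applying an image-generation approach to AI-based speech synthesis. In an Implicit Generative Model, an image generated from a vector interpolated between two latent vectors in the latent space transitions smoothly between the images of those two vectors, and we use this property to synthesize two audio clips. The audio data were rearranged into image form without preprocessing such as the Fourier transform, which reduced information loss and improved generation performance. Among Implicit Generative Models, we trained DCGAN (Deep Convolutional Generative Adversarial Networks) and DDIM (Denoising Diffusion Implicit Models), and we tested the smooth transitions generated between two audio clips with linear and spherical interpolation in order to find an effective model and interpolation method for smooth audio transitions. In the experiments, DCGAN produced smooth audio transitions with the various interpolation methods but fell short in audio quality. DDIM, by contrast, generated high-quality audio, but the transition succeeded only with spherical interpolation. The study shows that, for smooth audio transitions using images, the combination of DDIM and spherical interpolation gives the best synthesis result.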
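The idea of feeding audio to an image model without a Fourier transform can be sketched as a plain reshape of the raw waveform. The abstract does not specify the layout, so the square 128 x 128 shape, the zero padding, the 16 kHz sample rate, and the helper names `audio_to_image` and `image_to_audio` are all hypothetical.

```python
import numpy as np

def audio_to_image(wave, side):
    """Pack a mono waveform into a (side, side) array, zero-padding or truncating."""
    n = side * side
    if wave.shape[0] < n:
        wave = np.pad(wave, (0, n - wave.shape[0]))
    return wave[:n].reshape(side, side).astype(np.float32)

def image_to_audio(image):
    """Inverse of audio_to_image: flatten the rows back into one waveform."""
    return image.reshape(-1)

# Hypothetical input: one second of a 440 Hz tone at an assumed 16 kHz rate.
sr = 16000
t = np.arange(sr) / sr
wave = np.sin(2 * np.pi * 440 * t).astype(np.float32)

img = audio_to_image(wave, side=128)   # 16384 slots, last 384 are padding
restored = image_to_audio(img)
```

Because the reshape is lossless, the original samples are recovered exactly from the image, which is the sense in which this kind of representation avoids the information loss of a spectrogram pipeline.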

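Summary
Abstract
1. Introduction
2. Related Work
3. Experimental Setup
3.1 Data
3.2 Models
3.3 Interpolation Methods
4. Experimental Results
4.1 Evaluation Metrics
4.2 DCGAN Experimental Setup
4.3 DCGAN Experimental Results
4.4 DDIM Experimental Setup
4.5 DDIM Experimental Results
4.6 Comparison of Interpolation in the DDIM Latent Space and in Data Space
5. Conclusion
References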

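Author Information

ChanYeong Heo (허찬영). Division of Information and Communication Engineering, Myongji University
Jaehee Jung (정재희). Assistant Professor, Associate Professor, Department of Information and Communication Engineering, Myongji University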
