[Paper Review] DRSpeech: Degradation-Robust Text-to-Speech Synthesis with Frame-Level and Utterance-Level Acoustic Representation Learning

feVeRin 2024. 7. 4. 09:09

DRSpeech: Degradation-Robust Text-to-Speech Synthesis with Frame-Level and Utterance-Level Acoustic Representation Learning

Most text-to-speech systems rely on high-quality corpora collected in well-designed environments, so data collection is costly
DRSpeech
- A noise-robust text-to-speech model that can use noisy speech corpora as training data
- Represents time-variant additive noise with a frame-level encoder and jointly represents time-invariant environmental distortion with an utterance-level encoder
- Additionally introduces a regularization method to obtain a clean environmental embedding that is disentangled from utterance-dependent information
Paper (INTERSPEECH 2022) : Paper Link

1. Introduction

Text-to-Speech (TTS) models generally rely on high-quality speech corpora recorded in well-developed environments, so data collection is time-consuming and expensive
- Phone-recorded or downloaded data would reduce this cost, but such data contains a lot of noise and reverberation, which makes it hard to train a TTS model on it
- To address this, DenoiSpeech introduced a method that accounts for time-variant additive noise
  - BUT, real speech data also contains environmental distortion that exists independently of the speech signal

-> Hence the paper proposes DRSpeech, a TTS framework that can jointly represent these distortions

DRSpeech
- A degradation-robust TTS model that can be trained on degraded speech containing both time-variant additive noise and time-invariant environmental distortion
- Represents time-variant additive noise with a frame-level noise encoder and handles time-invariant distortion with an utterance-level environmental encoder
- To model environmental distortion, introduces a regularization method that disentangles the clean environmental embedding from utterance-dependent information

< Overview of DRSpeech >

A degradation-robust TTS model that jointly addresses additive noise and environmental distortion
Introduces a regularization method that yields a clean environmental embedding separated from speaker information and linguistic content
As a result, it outperforms the existing methods

2. Method

DRSpeech builds on FastSpeech2 and jointly uses a frame-level noise representation and an utterance-level environmental representation
- Structurally, it uses a phoneme encoder that encodes the input phoneme embeddings into a hidden sequence, a length regulator that predicts the length of the encoded representation and expands it accordingly, pitch/energy predictors, and a decoder
- An embedded speaker ID is added to the phoneme encoder output, which makes it a multi-speaker model
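
The FastSpeech2-style backbone described above can be sketched roughly as follows. This is a minimal illustration and not the paper's implementation: the function names `add_speaker_embedding` and `length_regulate`, the hidden size of 2, and all numeric values are assumptions made for the example, and the durations are given by hand instead of coming from a trained predictor.

```python
from typing import List

Vector = List[float]


def add_speaker_embedding(hidden: List[Vector], speaker: Vector) -> List[Vector]:
    """Add the same speaker vector to every phoneme-level hidden state."""
    return [[h + s for h, s in zip(state, speaker)] for state in hidden]


def length_regulate(hidden: List[Vector], durations: List[int]) -> List[Vector]:
    """Repeat each phoneme-level state by its duration (in frames)."""
    if len(hidden) != len(durations):
        raise ValueError("one duration per phoneme is required")
    frames: List[Vector] = []
    for state, duration in zip(hidden, durations):
        frames.extend(list(state) for _ in range(duration))
    return frames


# Toy example: 3 phonemes, hidden size 2 (all values are illustrative).
phoneme_hidden = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
speaker_vec = [1.0, -1.0]   # stands in for an embedded speaker ID
durations = [2, 1, 3]       # would come from the duration predictor

conditioned = add_speaker_embedding(phoneme_hidden, speaker_vec)
frame_hidden = length_regulate(conditioned, durations)
print(len(frame_hidden))    # number of frames = sum of durations
```

The expanded sequence has one state per mel-spectrogram frame, which is why frame-level conditioning signals can later be added to the length regulator output element by element.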

- Frame-Level Noise Representation Learning

First, to account for time-variant noise as in DenoiSpeech, a frame-level noise representation is obtained through a noise extractor
- Instead of the simple U-Net used previously, the paper applies Conv-TasNet so that the extractor generalizes to a variety of additive noises
- As a result, the noise extractor extracts the additive noise from degraded speech and the noise encoder outputs a hidden noise representation
  1. The target degraded waveform $\mathbf{x}_{deg}$ is fed to the noise extractor, which is pre-trained on a large dataset of noisy speech to output the noise waveform
  2. The output noise waveform $\mathbf{x}_{noise}$ is then converted to a mel-spectrogram $\mathbf{y}_{noise}$ and fed to the frame-level noise encoder
  3. The noise encoder outputs a noise representation $\mathbf{h}_{noise}$ that has the same number of frames as the target mel-spectrogram and is added to the output of the length regulator
- At inference, silence defined as $\mathbf{x}_{noise}(n)=0, \,\, \forall n$ is used so that the output speech is generated without frame-level additive noise
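
The data flow of these steps can be sketched as below. This is only a toy sketch under stated assumptions: `frame_features` is a per-frame log-energy stand-in for mel-spectrogram extraction, `noise_encoder` is a hypothetical frame-wise linear map (the real encoder is a learned network), and the extracted noise waveform is hard-coded instead of coming from a pre-trained Conv-TasNet.

```python
import math
from typing import List

Vector = List[float]


def frame_features(waveform: List[float], hop: int) -> List[float]:
    """Stand-in for mel extraction: one log-energy value per frame."""
    eps = 1e-5
    feats = []
    for start in range(0, len(waveform) - hop + 1, hop):
        frame = waveform[start:start + hop]
        energy = sum(s * s for s in frame) / hop
        feats.append(math.log(energy + eps))
    return feats


def noise_encoder(feats: List[float], weight: Vector, bias: Vector) -> List[Vector]:
    """Hypothetical frame-wise linear encoder producing h_noise."""
    return [[w * f + b for w, b in zip(weight, bias)] for f in feats]


def add_noise_representation(frame_hidden: List[Vector],
                             h_noise: List[Vector]) -> List[Vector]:
    """Add h_noise to the length-regulator output, frame by frame."""
    if len(frame_hidden) != len(h_noise):
        raise ValueError("h_noise must have one vector per target frame")
    return [[h + n for h, n in zip(hf, nf)]
            for hf, nf in zip(frame_hidden, h_noise)]


hop = 4
weight, bias = [0.5, -0.5], [0.0, 0.1]           # illustrative parameters
frame_hidden = [[0.0, 0.0] for _ in range(3)]    # length-regulator output

# Training: x_noise would be produced by the pre-trained noise extractor.
x_noise_train = [0.0, 0.2, -0.1, 0.3, 0.0, 0.0, 0.0, 0.0, 0.5, -0.5, 0.4, -0.4]
h_train = noise_encoder(frame_features(x_noise_train, hop), weight, bias)
decoder_in_train = add_noise_representation(frame_hidden, h_train)

# Inference: silence, x_noise(n) = 0 for all n.
x_noise_silence = [0.0] * len(x_noise_train)
h_silence = noise_encoder(frame_features(x_noise_silence, hop), weight, bias)
decoder_in_infer = add_noise_representation(frame_hidden, h_silence)
```

With silence as input, every frame receives the same "no noise" representation, whereas the training-time representation varies from frame to frame with the extracted noise.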

- Utterance-Level Environmental Representation Learning

In addition to the frame-level noise representation, DRSpeech also learns an utterance-level environmental representation
- A denoiser that removes additive noise is introduced so that only the environmental condition is extracted from speech containing both additive noise and environmental distortion
- Structurally, a pre-trained Conv-TasNet that outputs a denoised speech waveform from a noisy speech waveform is used
  1. The target degraded waveform $\mathbf{x}_{deg}$ is fed to the denoiser, which outputs the denoised speech waveform $\mathbf{x}_{denoised}$
  2. The mel-spectrogram $\mathbf{y}_{denoised}$ is computed from $\mathbf{x}_{denoised}$ and fed to the utterance-level environment encoder to obtain the environmental representation $\mathbf{h}_{env}$
    - A style-token layer is used in the utterance-level environment encoder, and the resulting environmental embedding $\mathbf{h}_{env}$ conditions the TTS model
- At inference, the TTS model is conditioned on the average clean embedding $\bar{\mathbf{h}}_{env,clean}$
  - $\bar{\mathbf{h}}_{env,clean}$ is obtained by averaging $\mathbf{h}_{env}$ over all training data recorded under clean environmental conditions
- BUT, simply conditioning on each utterance's $\mathbf{h}_{env}$ during training and using the average clean embedding $\bar{\mathbf{h}}_{env, clean}$ at inference does not by itself give the desired degradation-robust training
  1. This is because $\mathbf{h}_{env}$ is utterance-dependent and entangled with speaker characteristics
  2. That is, $\bar{\mathbf{h}}_{env, clean}$ does not necessarily represent the clean condition, so using this embedding can degrade synthesis quality
    - Therefore, an additional regularization method that yields a disentangled clean embedding is needed
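
The inference-time embedding described above is just a mean over the clean subset of the training data. A minimal sketch, assuming each training utterance comes with its $\mathbf{h}_{env}$ and a clean/degraded flag (the `training_embeddings` list, the flag, and all values are made up for illustration):

```python
from typing import List, Sequence


def mean_embedding(embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of equally sized embedding vectors."""
    if not embeddings:
        raise ValueError("at least one embedding is required")
    dim = len(embeddings[0])
    if any(len(e) != dim for e in embeddings):
        raise ValueError("all embeddings must have the same size")
    count = len(embeddings)
    return [sum(e[i] for e in embeddings) / count for i in range(dim)]


# (h_env, is_clean) pairs; the values are illustrative only.
training_embeddings = [
    ([0.2, 0.4], True),
    ([0.4, 0.0], True),
    ([3.0, -2.0], False),   # e.g. a reverberant recording, excluded below
]

clean_only = [h for h, is_clean in training_embeddings if is_clean]
h_env_clean_avg = mean_embedding(clean_only)
print(h_env_clean_avg)
```

The sketch only covers the averaging step. It does not address the entanglement issue raised above: if the individual embeddings still encode speaker or content information, their mean is not guaranteed to correspond to a clean condition.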

- Training Objective with Regularization Term

The training objective of DRSpeech includes the $L1$ loss on the mel-spectrogram as well as Mean Squared Error (MSE) losses on pitch and energy
- As in FastSpeech2, the sum of these loss functions is denoted $\mathcal{L}_{main}$
- In addition, a regularization is introduced during training to obtain the average clean environmental embedding
  1. The regularization uses subtask learning that takes only speech data recorded under clean environmental conditions
  2. In this subtask, the environmental embeddings $\mathbf{h}_{env, clean}$ estimated from the target speech are averaged within the batch
    - The averaged embedding $\bar{\mathbf{h}}_{env, clean}$ then conditions the TTS model, and the loss function is computed in the same way as before
- Consequently, with $\mathcal{L}_{average}$ denoting the regularization term from the subtask learning, the overall training objective is:
  (Eq. 1) $\mathcal{L}=\mathcal{L}_{main}+\alpha\mathcal{L}_{average}$
  - $\alpha$ : weighting term, set to $1.0$
- With this regularization, the utterance-level encoder extracts only acoustic aspects that are disentangled from utterance-dependent information such as linguistic content and speaker characteristics
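
A toy sketch of how (Eq. 1) could be assembled for one batch is shown below. Everything except the structure $\mathcal{L}=\mathcal{L}_{main}+\alpha\mathcal{L}_{average}$ is an assumption for illustration: `toy_tts` is a hypothetical stand-in for the TTS model, only the $L1$ mel term is computed (the pitch/energy MSE terms are omitted), and the batch fields `text_feat`, `h_env`, `target`, and `is_clean` are invented names.

```python
from typing import List, Sequence, Tuple


def l1_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean absolute error between two equally sized vectors."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)


def mean_embedding(embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of equally sized embedding vectors."""
    count = len(embeddings)
    return [sum(e[i] for e in embeddings) / count
            for i in range(len(embeddings[0]))]


def toy_tts(text_feat: float, h_env: Sequence[float]) -> List[float]:
    """Hypothetical TTS stand-in: one 'mel frame' from text and h_env."""
    return [text_feat + e for e in h_env]


def total_loss(batch: List[dict], alpha: float = 1.0) -> Tuple[float, float, float]:
    """Return (L, L_main, L_average) for one batch."""
    # L_main: every utterance is conditioned on its own h_env.
    main = sum(l1_loss(toy_tts(u["text_feat"], u["h_env"]), u["target"])
               for u in batch) / len(batch)
    # L_average: clean utterances only, conditioned on the batch-averaged h_env.
    clean = [u for u in batch if u["is_clean"]]
    average = 0.0
    if clean:
        h_avg = mean_embedding([u["h_env"] for u in clean])
        average = sum(l1_loss(toy_tts(u["text_feat"], h_avg), u["target"])
                      for u in clean) / len(clean)
    return main + alpha * average, main, average


batch = [
    {"text_feat": 1.0, "h_env": [0.0, 1.0], "target": [1.0, 2.0], "is_clean": True},
    {"text_feat": 2.0, "h_env": [1.0, 0.0], "target": [3.0, 2.0], "is_clean": True},
    {"text_feat": 0.0, "h_env": [4.0, 4.0], "target": [4.0, 5.0], "is_clean": False},
]
alpha = 1.0
loss, loss_main, loss_average = total_loss(batch, alpha)
print(loss, loss_main, loss_average)
```

In this toy batch the two clean utterances fit their targets perfectly with their own embeddings, yet they incur a non-zero $\mathcal{L}_{average}$ because those embeddings differ from each other. That is the intuition behind the regularization: the term penalizes clean embeddings that carry utterance-specific information and cannot be replaced by their average.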

3. Experiments

- Settings

Datasets : VCTK, PNL, LibriTTS
Comparisons : FastSpeech2+ConvTasNet (Enhancement TTS), DenoiSpeech (Noise-Robust TTS)

- Results

In degraded conditions such as reverb and noise+reverb, DRSpeech outperforms the existing methods

DRSpeech also achieves the best MOS among the compared methods

In the ablation study, DRSpeech with the regularization performs better than DRSpeech without it
