[Paper Review] NoreSpeech: Knowledge Distillation based Conditional Diffusion Model for Noise-Robust Expressive TTS

feVeRin 2024. 11. 10. 10:02

NoreSpeech: Knowledge Distillation based Conditional Diffusion Model for Noise-Robust Expressive TTS

Expressive text-to-speech faces the following difficulties
- When the reference audio contains background noise, it is hard to extract highly dynamic prosody information
- The model must generalize to unseen speaking styles
NoreSpeech
1. Adopts a DiffStyle module, built on a diffusion model, that learns noise-agnostic speaking style from a teacher model through knowledge distillation
2. Introduces a VQ-VAE block that maps style features into a controllable quantized latent space to improve the generalization of style transfer
3. Applies a parameter-free text-style alignment module that enables style transfer from a length-mismatched reference utterance to the textual input
Paper (ICASSP 2023) : Paper Link

1. Introduction

Expressive text-to-speech (TTS) aims to reflect the timbre, emotion, and speaking style of diverse speakers
- Notably, Meta-StyleSpeech uses a meta-learning training strategy for multi-speaker TTS
  1. GenerSpeech introduces a multi-level style adaptor that transfers speaking style
  2. BUT, most of these methods assume that the reference audio was recorded in an ideal environment without noise interference
    - As a result, such expressive TTS models may not be suitable for noisy real-world scenarios
- Meanwhile, the following approaches can be considered to reduce the effect of noise in the reference audio
  1. Removing the noise with a pre-trained speech enhancement model
    - This approach depends heavily on the performance of the speech enhancement (SE) model
  2. Introducing an information bottleneck, or decomposing the noise information through adversarial training, as in Styler
    - This approach has the drawback of requiring complicated parameter settings and training tricks
- In other words, most existing methods directly separate the noise information from the noisy reference and extract style information from the remaining part
  - Consequently, they ignore noise diversity and dynamic time-frequency information

-> The paper therefore proposes NoreSpeech, which does not extract style information directly from the noisy reference audio but instead learns the parameters of a distribution model and reconstructs the style from it

NoreSpeech
- Uses a knowledge distillation-based conditional diffusion model that directly generates a deep style representation in a latent space, conditioned on the noisy reference audio
  - This DiffStyle module is built on CDiffuSE, a diffusion-based speech enhancement model, and generates prosody-related style features under the supervision of a pre-trained teacher model
- Additionally, two techniques are introduced to improve the generalization of style transfer
  1. A parameter-free style-alignment module, for style transfer from a length-mismatched reference utterance to the textual input
  2. A VQ-VAE module that maps style features into a controllable latent space, for transferring unseen speaking styles

< Overview of NoreSpeech >

A noise-robust expressive TTS model based on a knowledge distillation-based conditional diffusion model
As a result, it achieves better synthesis performance than existing methods

2. Method

- Problem Formulation

Style transfer aims to extract an unseen style from a reference utterance and generate a speech sample in a similar style
- NoreSpeech considers the case where the reference utterance contains background noise
  - That is, as in Styler, it assumes that speaker identity information can be extracted from the noisy reference by a noise-robust speaker encoder
- BUT, since the style information is affected by the noise, the paper aims to make the style features obtained from a noisy reference close to those obtained from a clean reference

- Overview

NoreSpeech uses GenerSpeech as its backbone and consists of four main components
- Encoder : maps the phoneme sequence to a deep representation
- DiffStyle : generates style features from the noisy spectrogram
- Feature Fusion : combines the style and text features
- Decoder : maps the features to a mel-spectrogram

- DiffStyle

DiffStyle consists of a conditional diffusion model, a speaker encoder, and two VQ-VAE blocks
- The conditional diffusion model generates fine-grained style features that represent the speaker's style
- The speaker encoder generates a global speaker embedding that represents speaker identity
  - It takes the noisy reference utterance as input
Speaker Encoder
- The paper adopts the generalizable Wav2Vec 2.0 model to capture global speaker identity characteristics
- An average pooling layer and a fully-connected layer are added on top of the Wav2Vec 2.0 encoder, and the encoder is fine-tuned on a classification task
  - AMsoftmax loss is used for fine-tuning
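
The pooling and AMsoftmax steps above can be sketched roughly as follows. This is a minimal NumPy illustration, not the paper's implementation: the function names, the `scale` and `margin` values, and the random stand-in features are assumptions, and the real encoder output would come from Wav2Vec 2.0 rather than random numbers.

```python
import numpy as np

def pool_frames(frame_features):
    """Average pooling over time: (T, D) frame features -> (D,) utterance embedding."""
    return frame_features.mean(axis=0)

def am_softmax_loss(embeddings, class_weights, labels, scale=30.0, margin=0.35):
    """Additive-margin softmax loss averaged over a batch.

    embeddings:    (B, D) pooled utterance embeddings
    class_weights: (C, D) one weight vector per training class (speaker)
    labels:        (B,)   integer class ids
    scale, margin: illustrative values, not taken from the review
    """
    # Cosine similarity between L2-normalised embeddings and class weights.
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = class_weights / np.linalg.norm(class_weights, axis=1, keepdims=True)
    cos = emb @ w.T                                    # (B, C)

    # Subtract the margin from the target-class cosine only, then scale.
    rows = np.arange(len(labels))
    logits = scale * cos
    logits[rows, labels] = scale * (cos[rows, labels] - margin)

    # Numerically stable log-softmax, then pick the target class.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_prob[rows, labels].mean())

# Hypothetical usage with random stand-ins for encoder outputs.
rng = np.random.default_rng(0)
utterances = [rng.normal(size=(50, 8)) for _ in range(4)]   # 4 utterances, 50 frames each
batch = np.stack([pool_frames(u) for u in utterances])
loss = am_softmax_loss(batch, rng.normal(size=(3, 8)), np.array([0, 1, 2, 0]))
```

The margin makes the target class harder to satisfy during training, which is why the loss with `margin > 0` is never smaller than the plain softmax loss on the same inputs.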
Conditional Diffusion Model
- The goal is to train a conditional diffusion model that generates noise-agnostic style features from noisy audio
  - To this end, knowledge distillation is adopted: a style teacher model extracts style features from the clean audio, and these style features serve as the training target of the diffusion model
- Style Teacher Model
  1. The paper considers two kinds of style teacher
    - GenerSpeech, an expressive TTS model based on Supervised Learning (SL)
    - NANSY, a speech decomposition model based on Self-Supervised Learning (SSL)
  2. Accordingly, GenerSpeech and NANSY are pre-trained first, and NoreSpeech training is then guided by these style teacher models
- Diffusion Model
  1. A Diffusion Probabilistic Model trains a neural network that reverses a diffusion process
    - That is, given i.i.d. samples $\{\mathbf{x}_{0}\in\mathbb{R}^{D}\}$ from an unknown data distribution $p_{data}(\mathbf{x}_{0})$
    - The diffusion model approximates $p_{data}(\mathbf{x}_{0})$ with the marginal distribution $p_{\theta}(\mathbf{x}_{0})=\int p_{\theta}(\mathbf{x}_{0},...,\mathbf{x}_{T-1}|\mathbf{x}_{T}) \cdot p(\mathbf{x}_{T})d\mathbf{x}_{1:T}$
  2. To implement this conditional diffusion model, NoreSpeech builds on CDiffuSE
    - It has a shallow convolution layer $\tau_{\theta}(\cdot)$ that reshapes the noisy mel-spectrogram, and a WaveNet-structured diffusion model
  3. Taking $\mathbf{x}_{0}$ to be the style feature, the training loss function is:
    (Eq. 1) $\mathcal{L}_{diff}=\mathbb{E}_{\text{ST}(\mathbf{y}_{c}),\mathbf{y}_{n}, \epsilon\sim\mathcal{N}(0,I),t}\left[|| \epsilon-\epsilon _{\theta}(\mathbf{x}_{t},t,\tau_{\theta}(\mathbf{y}_{n}))||_{2}^{2}\right]$
    - $\text{ST}$ : style teacher model, $\mathbf{y}_{c}$ : clean mel-spectrogram, $\mathbf{y}_{n}$ : noisy mel-spectrogram
    - $t$ : time-step index, $\epsilon_{\theta}$ : noise-prediction network with learnable parameters $\theta$
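
One training step of (Eq. 1) can be sketched in NumPy as below. This is a hedged illustration only: it assumes the standard DDPM forward process $\mathbf{x}_{t}=\sqrt{\bar{\alpha}_{t}}\mathbf{x}_{0}+\sqrt{1-\bar{\alpha}_{t}}\epsilon$ with a linear noise schedule, which may differ from the process CDiffuSE actually uses. The names `make_alpha_bar`, `diffusion_loss`, and `eps_model`, as well as the schedule values and tensor shapes, are hypothetical.

```python
import numpy as np

def make_alpha_bar(num_steps=50, beta_start=1e-4, beta_end=0.05):
    """Cumulative product of (1 - beta_t) for a linear schedule (illustrative values)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def diffusion_loss(x0, cond, eps_model, alpha_bar, rng):
    """One Monte Carlo sample of the noise-prediction loss in Eq. 1.

    x0:        teacher style feature ST(y_c), shape (T, H)
    cond:      conditioner output tau_theta(y_n), any shape eps_model accepts
    eps_model: callable (x_t, t, cond) -> predicted noise with the shape of x0
    """
    t = int(rng.integers(len(alpha_bar)))              # random time-step index
    eps = rng.standard_normal(x0.shape)                # epsilon ~ N(0, I)
    # Assumed forward process: mix the clean target with Gaussian noise.
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    eps_hat = eps_model(x_t, t, cond)
    # Mean squared error; equals the squared L2 norm up to a constant factor.
    return float(np.mean((eps - eps_hat) ** 2))

# Hypothetical usage: a placeholder model that always predicts zero noise.
rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
style_target = rng.standard_normal((20, 256))          # stand-in for ST(y_c)
noisy_cond = rng.standard_normal((20, 80))             # stand-in for tau_theta(y_n)
loss = diffusion_loss(style_target, noisy_cond,
                      lambda x_t, t, c: np.zeros_like(x_t), alpha_bar, rng)
```

The key point for distillation is that `x0` comes from the teacher applied to clean audio, while the network only ever sees the noisy condition.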
Vector Quantization
- To account for the variability of the generated style features, a Vector Quantization block maps them into a controllable latent space
- First, let $\mathbf{e}\in\mathbb{R}^{K\times H}$ be the latent embedding space
  1. Here $K$ is the size of the discrete latent space and $H$ is the dimensionality of each latent embedding vector $\mathbf{e}_{i}$
    - The paper sets $K=H=256$
  2. A commitment loss is introduced so that the representation sequence commits to an embedding and its output does not grow:
    (Eq. 2) $\mathcal{L}_{c}=|| z_{e}(\mathbf{x})-\text{sg}[\mathbf{e}] ||_{2}^{2}$
    - $z_{e}(\mathbf{x})$ : Vector Quantization block output
    - $\text{sg}[\cdot]$ : stop gradient operator
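
As a rough illustration of the quantization step and (Eq. 2), the sketch below does a nearest-neighbour codebook lookup in NumPy. It assumes the usual VQ-VAE lookup by Euclidean distance, which the notes above do not spell out; `vector_quantize` is a hypothetical name, and since NumPy has no autograd the stop-gradient is only conceptual (the codebook entry is simply treated as a constant).

```python
import numpy as np

def vector_quantize(z_e, codebook):
    """Nearest-neighbour codebook lookup plus commitment loss.

    z_e:      (T, H) continuous style features, one row per frame
    codebook: (K, H) latent embedding space e
    Returns the quantized features (T, H), the chosen indices (T,),
    and the commitment loss (summed over H, averaged over frames).
    """
    # Squared Euclidean distance between every frame and every code: (T, K).
    dist = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dist.argmin(axis=1)
    z_q = codebook[indices]
    # Eq. 2: || z_e(x) - sg[e] ||^2, with the selected code held constant.
    commitment = float(((z_e - z_q) ** 2).sum(axis=-1).mean())
    return z_q, indices, commitment

# Hypothetical usage with K = H = 256 as in the paper.
rng = np.random.default_rng(0)
codebook = rng.standard_normal((256, 256))
style = rng.standard_normal((20, 256))
z_q, idx, loss_c = vector_quantize(style, codebook)
```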

- Feature Fusion

The Feature Fusion module fuses the phoneme representation with the style features
- The paper introduces a parameter-free style-align module to handle the length mismatch between the fine-grained style features and the text encoder output
- First, let $t_{style}$ and $t_{text}$ be the lengths (time dimension) of the style feature and the text feature, respectively
  1. If $t_{style}<t_{text}$, the style feature is upsampled with linear interpolation
  2. If $t_{style}>t_{text}$, the ratio between $t_{style}$ and $t_{text}$ is computed, and the style feature is downsampled by averaging consecutive frames according to that ratio
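
The two cases above can be sketched as a single resampling function. This is a minimal NumPy sketch under assumptions: the endpoint-aligned interpolation grid and the integer chunk boundaries used for averaging are choices made here for illustration, and `align_style_to_text` is a hypothetical name, not the paper's.

```python
import numpy as np

def align_style_to_text(style, t_text):
    """Resample style features of shape (t_style, H) to t_text frames, with no learned weights."""
    t_style = style.shape[0]
    if t_style == t_text:
        return style.copy()
    if t_style < t_text:
        # Upsample: linear interpolation along time, channel by channel.
        src = np.arange(t_style)
        dst = np.linspace(0, t_style - 1, t_text)
        return np.stack([np.interp(dst, src, style[:, h])
                         for h in range(style.shape[1])], axis=1)
    # Downsample: average consecutive frames; chunk sizes follow t_style / t_text.
    edges = (np.arange(t_text + 1) * t_style) // t_text
    return np.stack([style[a:b].mean(axis=0)
                     for a, b in zip(edges[:-1], edges[1:])])

# Hypothetical usage: 4 style frames with a single channel holding 0, 1, 2, 3.
style = np.arange(4, dtype=float).reshape(4, 1)
up = align_style_to_text(style, 7)      # 7 frames, linearly interpolated
down = align_style_to_text(style, 2)    # 2 frames, each the mean of 2 inputs
```

Because both branches are plain interpolation or averaging, the module adds no trainable parameters, which matches the "parameter-free" description.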

- Pre-training and Loss Function

Speaker Encoder Pre-Training
- The Wav2Vec 2.0 encoder is fine-tuned on the LibriTTS dataset
Pre-Training Style Teacher
- The GenerSpeech teacher is trained on the LibriTTS dataset without emotion embeddings
  - GenerSpeech's style adaptor is then used to obtain fine-grained prosodic features from clean speech
- For the NANSY teacher, NANSY2 is first trained on the LibriTTS dataset
  - The pre-trained model is then used to extract style features
Loss Function
- The final loss function consists of the following terms
  1. $\mathcal{L}_{dur}$ : Duration prediction loss
    - MSE between the predicted and ground-truth phoneme-level durations
  2. $\mathcal{L}_{mel}$ : Mel-reconstruction loss
  3. $\mathcal{L}_{post}$ : Negative log-likelihood of the postnet
  4. $\mathcal{L}_{c}$ : Commitment loss
  5. $\mathcal{L}_{diff}$ : Diffusion loss following (Eq. 1)

3. Experiments

- Settings

Dataset : LibriTTS
Comparisons : FastSpeech2, Styler, GenerSpeech

- Results

Overall, NoreSpeech achieves the best performance among the compared models

NoreSpeech is also the most preferred model in the AXY test
