[Paper Review] ProsodyFlow: High-Fidelity Text-to-Speech through Conditional Flow Matching and Prosody Modeling with Large Speech Language Models


feVeRin 2025. 2. 2. 10:31

ProsodyFlow: High-Fidelity Text-to-Speech through Conditional Flow Matching and Prosody Modeling with Large Speech Language Models

Text-to-Speech still struggles to reflect diverse, natural prosody
ProsodyFlow
- Models prosodic features by combining a large self-supervised speech model with conditional flow matching
- Extracts acoustic features with a speech LLM, maps them to a prosody latent space, then uses conditional flow matching to generate a prosodic vector conditioned on the input text
Paper (ACL 2025) : Paper Link

1. Introduction

Text-to-Speech (TTS) aims to synthesize high-quality, natural-sounding speech from input text
- BUT, generating speech that captures natural, diverse prosodic attributes remains difficult
  1. To address this, one can use explicit pitch/energy prediction as in FastSpeech2, or a reference prosody encoder as in StyleTTS2
  2. BUT, existing methods cannot fully extract rich prosodic information and tend to learn an average distribution that lacks diversity
- Meanwhile, self-supervised speech language models such as Wav2Vec 2.0, HuBERT, and WavLM are effective at speech content understanding, semantic information capturing, and prosodic feature extraction
  1. These models learn robust, comprehensive speech representations through large-scale pre-training on diverse, unlabeled speech data
  2. As a result, they can capture both local and global variations in pitch, rhythm, and intonation, the key components of prosody representation
- In addition, flow matching-based TTS models enable high-quality, high-speed synthesis
  1. Unlike complex probabilistic frameworks such as Grad-TTS and FastDiff, flow matching models learn to match distributions directly, which greatly improves training stability
  2. This yields higher efficiency and high-quality speech, as in VoiceBox, VoiceFlow, and Matcha-TTS

-> Hence the paper proposes ProsodyFlow, which combines a self-supervised pre-trained model with conditional flow matching for prosody modeling

ProsodyFlow
- Extracts acoustic features with the self-supervised WavLM model and maps them to a prosody latent space
- Learns the prosody distribution with conditional flow matching and samples a prosody vector conditioned on text

< Overall of ProsodyFlow >

A TTS model that combines a self-supervised model with conditional flow matching
As a result, it achieves better synthesis quality than existing models

2. Method

ProsodyFlow adopts a non-autoregressive end-to-end TTS architecture that uses a pre-trained WavLM model to extract prosody, obtaining the style information $s$ from the recording
- The prosody $s$ is integrated into the decoder and the duration and pitch predictors through Adaptive Instance Normalization (AdaIN)
- Conditional flow matching then generates the predicted prosody vector $s'$
  - This allows synthesizing high-quality speech with diverse prosodic styles
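
AdaIN normalizes each feature channel over time and then rescales and shifts it with style-dependent parameters. A minimal sketch in plain Python, assuming the per-channel scale `gamma` and shift `beta` have already been produced from $s$ by some learned projection (that projection and the function name `adain` are illustrative and not taken from the paper):

```python
import math


def adain(features, gamma, beta, eps=1e-5):
    """Adaptive Instance Normalization over the time axis.

    features: list of channels, each a list of values over time.
    gamma, beta: per-channel scale and shift, assumed to be derived
    from the prosody vector s (hypothetical inputs in this sketch).
    """
    out = []
    for channel, g, b in zip(features, gamma, beta):
        mean = sum(channel) / len(channel)
        var = sum((v - mean) ** 2 for v in channel) / len(channel)
        std = math.sqrt(var + eps)
        # Remove the channel's own statistics, then impose the style's.
        out.append([g * (v - mean) / std + b for v in channel])
    return out


styled = adain([[1.0, 2.0, 3.0]], gamma=[2.0], beta=[0.5])
```

After the call, the channel has mean `beta` and standard deviation close to `gamma`, which is how the same text features can be rendered with different prosody.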

- Overview

ProsodyFlow builds on the StyleTTS2 framework, which supports end-to-end training, and adopts two-stage training to ensure training stability and accelerate the process
- In the first stage, the encoder-decoder structure is trained with the loss $\mathcal{L}_{first}=\mathcal{L}_{mel}+\mathcal{L}_{GAN}$
  1. Let $t$ be the text input, $x$ the mel-spectrogram, and $w$ the waveform
  2. The text encoder processes phonemes into a hidden representation $h_{text}$, and a pre-trained ASR model provides the ground-truth alignment $\text{align}=\text{ASR}(t,x)$ and the aligned phoneme encoding $h_{align}=\text{align}\cdot h_{text}$
  3. Meanwhile, the WavLM prosody encoder extracts the prosody vector $s$ from the waveform, and a pre-trained pitch extractor retrieves the ground-truth pitch $F_{0}$ and energy $N$ from the mel-spectrogram
  4. The decoder then generates the waveform as $\text{Decoder}(s,F_{0},N,h_{align})$
    - Here the Multi-Period Discriminator (MPD) and Multi-Resolution Discriminator (MRD) of BigVGAN are used to improve speech quality
- In the second stage, the modules are jointly trained with the loss $\mathcal{L}_{jointly}=\mathcal{L}_{mel}+\mathcal{L}_{dur}+\mathcal{L}_{F_{0}}+\mathcal{L}_{N}+\mathcal{L}_{CFM}+\mathcal{L}_{GAN}$
  1. The pre-trained language model PLBert is used to extract rich semantic information from the text
    - This decouples the text encoder from the predictor, as in StyleTTS2
  2. Let the PLBert output be $h_{bert}=\text{PLBert}(t)$
    - Both $h_{bert}$ and $s$ are then used as inputs for training the predictor, which produces the predicted duration, pitch, and energy $d',F'_{0},N'$
    - $d',F'_{0},N'=\text{Predictor}(h_{bert},s)$
  3. The predicted aligned text embedding is computed as $h_{pred}=h_{text}\cdot d'$, and the synthesized speech is generated as $\text{waveform}=\text{Decoder}(s,F'_{0},N',h_{pred})$
  4. In addition, conditional flow matching learns the Ordinary Differential Equation (ODE) between the noise distribution and the target distribution in the latent prosody space, and is used to predict the prosody vector $s'$
- As a result, this two-stage process lets ProsodyFlow stabilize training and improve prosody capture
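
The alignment products $h_{align}=\text{align}\cdot h_{text}$ and $h_{pred}=h_{text}\cdot d'$ both stretch phoneme-level encodings to frame level. A rough sketch of that step, assuming a hard monotonic alignment built from integer durations (the paper's alignment may be soft; the helper names `duration_to_alignment` and `apply_alignment` are hypothetical):

```python
def duration_to_alignment(durations):
    """Build a hard monotonic alignment matrix of shape (frames, phonemes).

    durations: integer frame count per phoneme (assumed already rounded).
    """
    n = len(durations)
    align = []
    for idx, d in enumerate(durations):
        for _ in range(d):
            row = [0.0] * n
            row[idx] = 1.0  # this frame belongs to phoneme idx
            align.append(row)
    return align


def apply_alignment(align, h_text):
    """Matrix product align @ h_text: one phoneme encoding per frame."""
    dim = len(h_text[0])
    return [
        [sum(a * h[k] for a, h in zip(row, h_text)) for k in range(dim)]
        for row in align
    ]


h_text = [[1.0, 2.0], [3.0, 4.0]]       # two phonemes, 2-dim encodings
align = duration_to_alignment([2, 1])   # first phoneme lasts 2 frames
h_frames = apply_alignment(align, h_text)
# h_frames == [[1.0, 2.0], [1.0, 2.0], [3.0, 4.0]]
```

In the first stage the durations would come from the ASR alignment, and in the second stage from the predictor output $d'$.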

- WavLM Prosody Encoder

WavLM is a self-supervised speech model that learns rich representations from large-scale unlabeled data
- Structurally, WavLM has a 12-layer Transformer architecture
  1. The output of each layer is averaged along the sequence length dimension
    - The averaged outputs are processed by a self-attention module to produce a feature vector for the input speech
  2. A convolutional mapping layer then applies downsampling convolutional blocks to transform the attended features into a fixed-size prosody vector space
    - This captures complex patterns and relationships within the semantic information and enhances the representation
- As a result, the WavLM encoder extracts and condenses the most relevant information into a compact prosody representation that can be used for expressive speech synthesis
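
The pooling pipeline above can be sketched in a heavily simplified form. The assumptions here are loud ones: the self-attention module is reduced to a single-query attention pooling over layers, and the convolutional mapping layer is replaced by one linear map. The `query` and `weight` arguments are hypothetical learned parameters, and none of the function names come from the paper.

```python
import math


def mean_over_time(layer_outputs):
    """Average each layer's (frames x dim) output over the frame axis."""
    pooled = []
    for frames in layer_outputs:
        dim = len(frames[0])
        pooled.append([sum(f[k] for f in frames) / len(frames) for k in range(dim)])
    return pooled


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(vectors, query):
    """Weight per-layer vectors by scaled dot-product scores against a query."""
    dim = len(query)
    scores = [sum(q * v for q, v in zip(query, vec)) / math.sqrt(dim) for vec in vectors]
    weights = softmax(scores)
    return [sum(w * vec[k] for w, vec in zip(weights, vectors)) for k in range(dim)]


def project(vector, weight):
    """Linear map standing in for the convolutional mapping layer."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weight]


# Toy input: 2 layers, 2 frames each, 2-dim features.
layers = [[[1.0, 0.0], [3.0, 0.0]], [[0.0, 2.0], [0.0, 4.0]]]
per_layer = mean_over_time(layers)                   # one vector per layer
attended = attention_pool(per_layer, [0.0, 0.0])     # zero query: uniform weights
prosody = project(attended, [[1.0, 1.0]])            # fixed-size prosody vector
```

Whatever the utterance length, the result has a fixed size, which is what lets $s$ be used as a style vector.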

- Conditional Flow Matching

Conditional Flow Matching (CFM) extends the flow matching framework by incorporating the conditioning information $h_{bert}$ into the generative process
- That is, CFM learns a conditional vector field that maps the input condition to the characteristics of the target data
  1. Given a conditional vector field $\mathbf{v}(\mathbf{x},t|\mathbf{c})$, the flow is described by the following ODE:
    (Eq. 1) $\frac{d\mathbf{x}(t)}{dt}=\mathbf{v}(\mathbf{x}(t),t|\mathbf{c})$
    - $\mathbf{c}$ : condition (text/phonetic input), $\mathbf{x}(t)$ : sample state at time $t$
    - $\mathbf{v}$ : trained to minimize the transport cost between distributions
  2. Training a conditional flow model requires a loss function that ensures the learned vector field $\mathbf{v}(\mathbf{x},t|\mathbf{c})$ approximates the true conditional vector field $\mathbf{u}(\mathbf{x}|\mathbf{c})$ along the probability path:
    (Eq. 2) $\mathcal{L}_{CFM}(\theta)=\mathbb{E}_{t,q(\mathbf{x}_{1}),p_{t}(\mathbf{x}|\mathbf{x}_{1})} || \mathbf{u}(\mathbf{x}|\mathbf{x}_{1})-\mathbf{v}(\mathbf{x};\theta)||^{2}$
    - $t \sim \mathcal{U}[0,1]$ : sampled uniformly from the interval $[0,1]$
    - $q(\mathbf{x}_{1})$ : data distribution, $p_{t}(\mathbf{x}|\mathbf{x}_{1})$ : conditional probability density function at time $t$
    - $\mathbf{v}(\mathbf{x};\theta)$ : neural network parameterized by $\theta$
- The CFM loss replaces the intractable marginal probability density and vector field with the conditional probability density and conditional vector field, which makes the learning process tractable
  - The gradient of $\mathcal{L}_{CFM}$ with respect to $\theta$ is identical to that of the original Flow Matching loss $\mathcal{L}_{FM}(\theta)$
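
To make Eq. 2 concrete, here is a minimal sketch of how one training sample could be built. It assumes the straight-line (optimal-transport) probability path common in flow matching TTS, with $\mathbf{x}_{t}=(1-(1-\sigma_{min})t)\mathbf{x}_{0}+t\mathbf{x}_{1}$ and target $\mathbf{u}=\mathbf{x}_{1}-(1-\sigma_{min})\mathbf{x}_{0}$. The notes above do not state which path the paper uses, so the path, `sigma_min`, and the function names are all assumptions.

```python
import random


def cfm_training_pair(x1, sigma_min=1e-4, rng=random):
    """Sample (t, x_t, u_t) for one target prosody vector x1.

    Assumes a straight-line path from Gaussian noise x0 to x1;
    this choice of path is an assumption, not taken from the paper.
    """
    t = rng.random()                                  # t ~ U[0, 1)
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]            # noise sample
    xt = [(1 - (1 - sigma_min) * t) * a + t * b for a, b in zip(x0, x1)]
    ut = [b - (1 - sigma_min) * a for a, b in zip(x0, x1)]
    return t, xt, ut


def cfm_loss(v_pred, u_target):
    """Squared L2 error between predicted and target vector field (one sample of Eq. 2)."""
    return sum((v - u) ** 2 for v, u in zip(v_pred, u_target))


t, xt, ut = cfm_training_pair([0.3, -0.7], rng=random.Random(0))
# A network v(xt, t | h_bert; theta) would be evaluated here; a zero
# prediction is used only to show the call.
loss = cfm_loss([0.0, 0.0], ut)
```

In training, the loss would be averaged over many such samples and minimized with respect to $\theta$.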

3. Experiments

- Settings

Dataset : LJSpeech
Comparisons : DiffProsody, FastSpeech2, VITS, Grad-TTS, StyleTTS2

- Results

Overall, ProsodyFlow achieves the best synthesis quality

Prosody Flow Matching
- Comparing performance across NFE (number of function evaluations) settings, ProsodyFlow reaches synthesis quality on par with the baseline models even at $n=1$
- The paper uses $n=8$ to balance speed and quality
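
NFE counts how many times the vector field network is evaluated while solving Eq. 1. A minimal sketch with a fixed-step Euler solver, where each step costs one evaluation (the solver choice and the toy vector field are assumptions for illustration, not the paper's setup):

```python
def euler_sample(vector_field, x0, cond, n_steps=8):
    """Integrate dx/dt = v(x, t | c) from t=0 to t=1 with fixed-step Euler.

    n_steps equals the NFE: one vector-field call per step.
    """
    x = list(x0)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        v = vector_field(x, t, cond)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_field(x, t, cond):
    """Toy field whose trajectories are straight lines ending at cond."""
    return [(c - xi) / (1.0 - t) for xi, c in zip(x, cond)]


target = [2.0, -1.0]
s_one = euler_sample(toy_field, [0.0, 0.0], target, n_steps=1)
s_eight = euler_sample(toy_field, [0.0, 0.0], target, n_steps=8)
```

For this toy field both calls land on `target`, because Euler integration is exact on straight trajectories. A learned field is only approximately straight, which is one plausible reading of why $n=1$ is already usable while $n=8$ gives better quality.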

Ablation Study
- In the ablation study, removing each component degrades performance
