[Paper Review] ParaStyleTTS: Toward Efficient and Robust Paralinguistic Style Control for Expressive Text-to-Speech Generation

feVeRin 2026. 1. 14. 13:28

ParaStyleTTS: Toward Efficient and Robust Paralinguistic Style Control for Expressive Text-to-Speech Generation

Speaking style control in Text-to-Speech systems remains limited
ParaStyleTTS
- Introduces a 2-level style adaptation architecture that separates prosodic and paralinguistic speech style modeling
- Additionally supports low-resource deployment and maintains a consistent style across varied prompt formulations
Paper (CIKM 2025) : Paper Link

1. Introduction

A Text-to-Speech (TTS) system should generate not only intelligible, human-like speech but also expressive, controllable speech in a variety of speaking styles
- In particular, existing TTS models control only prosodic style variation, so they struggle to reflect paralinguistic styles such as gender, age, and emotion
- Meanwhile, Large Language Model (LLM)-based approaches such as CosyVoice can apply paralinguistic control through descriptive style prompts
  - BUT, these LLM-based approaches are computationally inefficient because of autoregressive modeling, and their explicit controllability is limited

-> The paper therefore proposes ParaStyleTTS, which supports lightweight, controllable paralinguistic modeling

ParaStyleTTS
- Explicitly disentangles prosodic and paralinguistic style to build a 2-level style adaptation model
- Reduces computational cost by building on an end-to-end architecture

< Overview of ParaStyleTTS >

A lightweight, expressive TTS model built on a 2-level style modeling architecture
As a result, the paper reports better performance than the compared baselines

2. Method

- Text Tokenization

The paper uses IPA-based text tokenization to convert English and Chinese text into phonemes together with prosody features such as stress and tone
- English words are converted to ARPAbet phonemes using the CMU Pronouncing Dictionary, and the stress markers are extracted separately to form the style token sequence
  - The ARPAbet phonemes are then mapped to IPA phonemes
- Chinese words are processed with $\texttt{pypinyin}$, which converts each character to Pinyin; the Pinyin is then split into an initial and a final
  - The tone is stripped from the final and used as a style token, and the remaining components are mapped to IPA phonemes
- The resulting IPA dictionary consists of 81 phonemes, 5 tone markers for Chinese, and 3 stress markers for English
  - In addition, 3 special tokens are used: $\texttt{[START]}$ for the beginning of a sequence, $\texttt{[END]}$ for the end of a sequence, and $\texttt{[|]}$ for word boundaries
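The split between phoneme tokens and style tokens can be sketched as follows. This is a minimal illustration, not the paper's tokenizer: `TOY_CMUDICT` is a two-word stand-in for the full CMU Pronouncing Dictionary, `NO_STYLE` is a hypothetical placeholder for tokens without stress or tone, the placement of $\texttt{[|]}$ between words is an assumption, the tone is stripped from a whole numbered Pinyin syllable rather than from the final only, and the ARPAbet-to-IPA mapping step is omitted.

```python
import re

# Toy lexicon holding two CMU Pronouncing Dictionary style entries; a real
# system would load the full dictionary (assumption for illustration).
TOY_CMUDICT = {
    "hello": ["HH", "AH0", "L", "OW1"],
    "world": ["W", "ER1", "L", "D"],
}

NO_STYLE = "-"  # hypothetical placeholder for tokens that carry no stress/tone


def split_arpabet(symbol):
    """Split an ARPAbet symbol such as 'OW1' into ('OW', '1')."""
    match = re.fullmatch(r"([A-Z]+)([0-2]?)", symbol)
    if match is None:
        raise ValueError(f"not an ARPAbet symbol: {symbol!r}")
    phoneme, stress = match.groups()
    return phoneme, stress or NO_STYLE


def split_pinyin_tone(syllable):
    """Split a numbered Pinyin syllable such as 'hao3' into ('hao', '3')."""
    if syllable and syllable[-1].isdigit():
        return syllable[:-1], syllable[-1]
    return syllable, NO_STYLE


def tokenize_english(words):
    """Return aligned phoneme and style token sequences with special tokens."""
    phonemes, styles = ["[START]"], [NO_STYLE]
    for i, word in enumerate(words):
        if i > 0:  # word boundary between consecutive words (assumed placement)
            phonemes.append("[|]")
            styles.append(NO_STYLE)
        for symbol in TOY_CMUDICT[word.lower()]:
            phoneme, stress = split_arpabet(symbol)
            phonemes.append(phoneme)
            styles.append(stress)
    phonemes.append("[END]")
    styles.append(NO_STYLE)
    return phonemes, styles


phonemes, styles = tokenize_english(["hello", "world"])
print(phonemes)
print(styles)
```

The two sequences have the same length, which is what allows each phoneme to be paired with exactly one prosody style token in the next stage.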

- Token Encoder

ParaStyleTTS projects the IPA tokens $X$ and the prosody style tokens $S$ into sequences of vector representations using 2 distinct embedding layers
- Sinusoidal position encoding is then added to preserve sequential information, and the IPA/prosody style embeddings are passed to Feed-Forward Transformer (FFT) blocks to model long-range dependencies
  - Here, to capture local contextual dependencies between adjacent tokens, the paper replaces the 2-layer fully-connected network of the feed-forward submodule with two 1D convolutional layers
- Let $\mathbf{X}=[x_{1},x_{2},...,x_{L}]$ be the phoneme embedding sequence obtained through text tokenization
- Each phoneme $x_{t}$ is then associated with its prosodic style embedding $\mathbf{s}_{t}^{pho}$, where $x_{t},\mathbf{s}_{t}^{pho}\in\mathbb{R}^{d_{1}}$, which gives the phoneme-level prosody sequence:
  (Eq. 1) $\mathbf{S}^{pho}=[\mathbf{s}_{1}^{pho},\mathbf{s}_{2}^{pho},..., \mathbf{s}_{L}^{pho}]$
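For reference, the sinusoidal position encoding added to the embeddings can be sketched as below. This assumes the standard Transformer formulation with base 10000, which the review does not state explicitly; the function names are illustrative only.

```python
import math


def sinusoidal_position_encoding(length, dim):
    """Return a length x dim table: sin on even channels, cos on odd channels."""
    table = []
    for pos in range(length):
        row = []
        for i in range(dim):
            # Channels 2k and 2k+1 share the same frequency (base 10000 assumed).
            angle = pos / (10000 ** (2 * (i // 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def add_position(embeddings):
    """Add the position encoding element-wise to a sequence of embeddings."""
    table = sinusoidal_position_encoding(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(emb, row)] for emb, row in zip(embeddings, table)]
```

Because the table depends only on position and channel index, the same encoding can be added to both the IPA embeddings and the prosody style embeddings.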

- Paralinguistic Encoder

The sentence-level paralinguistic style embedding $\mathbf{S}^{para}\in\mathbb{R}^{d_{2}}$ represents sentence-level paralinguistic characteristics such as emotion, age, gender, and accent
- This embedding is obtained by encoding a descriptive paralinguistic prompt into a $d_{2}$-dimensional embedding with a pre-trained MPNet model
- In particular, the paper builds a text prompt for each speech sample in the dataset using the following template:
  $\texttt{"A [Age][Gender] is speaking [Accent] with [Emotion] emotion."}$
- The prompt is passed to MPNet to produce the paralinguistic prompt embedding $\mathbf{S}^{para}$, which is used to condition the TTS model on the intended paralinguistic style
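Filling the template is simple string formatting, sketched below. The attribute values are hypothetical examples, and the space between the age and gender slots is an assumption since the template writes them back to back. Encoding the resulting string with MPNet is not shown here.

```python
def build_style_prompt(age, gender, accent, emotion):
    """Fill the sentence template with one value per paralinguistic attribute."""
    # A space between age and gender is assumed for readability.
    return f"A {age} {gender} is speaking {accent} with {emotion} emotion."


prompt = build_style_prompt("young", "female", "English", "happy")
print(prompt)
```

Keeping prompt construction in one function makes it straightforward to generate a prompt per training sample from its metadata labels.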

Text Prompts for Gender-Specific Style Control

- Style Adapter

The paper divides style into 2 categories with different levels of control
- Phoneme-level style captures fine-grained prosodic variation at the individual phoneme level, such as tone and stress, and affects how words are articulated
- Sentence-level style represents global characteristics of speech, such as emotion, age, and gender, and affects the overall impression of the speech
Prosodic Style Adapter
- Given the phoneme embedding sequence $\mathbf{X}=[x_{1},x_{2},...,x_{L}]$ and the phoneme-level prosody style sequence $\mathbf{S}^{pho}=[\mathbf{s}_{1}^{pho},\mathbf{s}_{2}^{pho}, ...,\mathbf{s}_{L}^{pho}]$,
  1. A lightweight adapter that injects prosodic features into the phoneme representations can be considered
  2. To this end, the paper introduces a Gated Tanh Unit (GTU) fusion mechanism:
    (Eq. 2) $ \tilde{\mathbf{x}}_{t}=\tanh(W_{1}x_{t}+b_{1})\odot \sigma(W_{2}\mathbf{s}_{t}^{pho}+b_{2}),\,\,\, \forall t\in\{1,...,L\}$
    - $W_{1},W_{2}$ : learnable projection weight, $b_{1},b_{2}$ : bias
    - $\odot$ : element-wise multiplication, $\sigma(\cdot)$ : sigmoid function
- (Eq. 2) lets the prosody style modulate the phoneme representations at a fine-grained level while preserving the phonetic structure
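(Eq. 2) can be written out per time step as in the sketch below. It uses plain Python lists instead of a tensor library, and the helper names (`linear`, `gtu_fuse`, `gtu_adapter`) are illustrative rather than taken from the paper.

```python
import math


def linear(weight, bias, vector):
    """Compute W v + b for a weight matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, vector)) + b for row, b in zip(weight, bias)]


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def gtu_fuse(x_t, s_t, w1, b1, w2, b2):
    """Eq. 2 for one time step: tanh(W1 x + b1) * sigmoid(W2 s + b2)."""
    content = [math.tanh(v) for v in linear(w1, b1, x_t)]  # phoneme branch
    gate = [sigmoid(v) for v in linear(w2, b2, s_t)]       # prosody gate in (0, 1)
    return [c * g for c, g in zip(content, gate)]


def gtu_adapter(xs, ss, w1, b1, w2, b2):
    """Apply the same GTU fusion independently at every position t = 1..L."""
    return [gtu_fuse(x_t, s_t, w1, b1, w2, b2) for x_t, s_t in zip(xs, ss)]
```

Since the gate lies in (0, 1), the prosody token can only rescale each channel of the phoneme branch, which matches the stated goal of modulating the representation without overwriting the phonetic content.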
Paralinguistic Style Adapter
- Paralinguistic style is consistent across the whole utterance, so it can affect both the phoneme level and the sentence level
- The paper therefore applies 2 distinct linear layers to project the paralinguistic prompt embedding $\mathbf{S}^{para}\in\mathbb{R}^{d_{2}}$ into a phoneme-level embedding $\mathbf{S}^{local}$ and a sentence-level embedding $\mathbf{S}^{global}$:
  (Eq. 3) $\mathbf{S}^{local}=W_{local}\mathbf{S}^{para}+b_{local},\,\,\, \mathbf{S}^{global}=W_{global}\mathbf{S}^{para}+b_{global}$
- Feature-wise Linear Modulation (FiLM) is then introduced to apply the phoneme-level paralinguistic style embedding to the phoneme embeddings through sequence-wise conditioning
  1. That is, given the projected style embedding $\mathbf{S}^{local}$, the scaling/bias vectors are computed as:
    (Eq. 4) $\gamma=W_{\gamma}\mathbf{S}^{local}+b_{\gamma},\,\,\, \beta=W_{\beta}\mathbf{S}^{local}+b_{\beta}$
    - $W_{\gamma},W_{\beta}\in\mathbb{R}^{d_{1}\times d_{2}}$ : learnable projection matrix
    - $b_{\gamma},b_{\beta}\in\mathbb{R}^{d_{1}}$ : bias term
  2. These parameters act as the modulation factors that transform the phoneme embeddings:
    (Eq. 5) $\hat{\mathbf{x}}_{t}=\gamma\odot \tilde{\mathbf{x}}_{t}+\beta,\,\,\, \forall t\in\{1,...,L\}$
    - $\odot$ : element-wise multiplication
  3. The FiLM-based adapter thus integrates the paralinguistic style into every phoneme of the utterance
- Meanwhile, the sentence-level paralinguistic style embedding $\mathbf{S}^{global}$ is used in both training and inference to guide sentence-level style adaptation inside the waveform decoder
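A minimal sketch of the FiLM step in (Eq. 4) and (Eq. 5), again with plain lists and illustrative helper names. The key point it shows is that a single $\gamma$ and $\beta$, computed once from the local style embedding, are shared by all $L$ positions.

```python
def linear(weight, bias, vector):
    """Compute W v + b for a weight matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, vector)) + b for row, b in zip(weight, bias)]


def film_parameters(s_local, w_gamma, b_gamma, w_beta, b_beta):
    """Eq. 4: one scale vector and one shift vector from the local style embedding."""
    gamma = linear(w_gamma, b_gamma, s_local)
    beta = linear(w_beta, b_beta, s_local)
    return gamma, beta


def film_modulate(xs, gamma, beta):
    """Eq. 5: apply the same gamma and beta to the embedding at every position."""
    return [[g * v + b for g, v, b in zip(gamma, x_t, beta)] for x_t in xs]


# Tiny example with hand-picked weights (values are arbitrary).
gamma, beta = film_parameters(
    [1.0, 2.0],
    [[1.0, 0.0], [0.0, 0.0]], [1.0, 0.0],   # W_gamma, b_gamma
    [[0.0, 0.0], [0.0, 0.0]], [1.0, 1.0],   # W_beta, b_beta
)
modulated = film_modulate([[1.0, 2.0], [3.0, 4.0]], gamma, beta)
print(gamma, beta, modulated)
```

Sharing the modulation across positions is what makes the conditioning utterance-level, in contrast to the per-position gate of the GTU adapter.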

- Latent Embedding Learning

ParaStyleTTS adopts a Variational AutoEncoder (VAE) framework for expressive latent representation modeling, fully end-to-end training, and non-autoregressive inference
- First, the decoder takes the style-integrated phoneme representation $\hat{\mathbf{X}}=[\hat{x}_{1},\hat{x}_{2},...,\hat{x}_{L}]$ as input
  - To ensure global consistency, the sentence-level embedding $\mathbf{S}^{global}$ is concatenated with the prior/posterior encodings before the normalizing flow layers
- The latent embedding $\mathbf{Z}\in\mathbb{R}^{N\times T}$ is sampled from a Gaussian posterior using the VAE
  1. The posterior distribution is conditioned on the ground-truth spectrogram $\mathbf{Y}$ and the sentence-level style embedding $\mathbf{S}^{global}$:
    (Eq. 6) $\mathbf{Z}\sim q\left(\mathbf{Z}|\mathbf{Y},\mathbf{S}^{global}\right)=\mathcal{N}(\mu_{post},\sigma_{post})$
    - $\mu_{post},\sigma_{post}$ are predicted by the posterior encoder from the spectrogram and the global style embedding
  2. For an expressive latent representation, a sequence of invertible normalizing flows is applied to $\mathbf{Z}$:
    (Eq. 7) $\mathbf{Z}_{flow}=f_{flow}(\mathbf{Z};\theta_{flow})$
  3. The prior distribution is defined over this transformed latent space and modeled as a Gaussian conditioned on the style-integrated phoneme sequence $\hat{\mathbf{X}}$:
    (Eq. 8) $p(\mathbf{Z}_{flow}|\hat{\mathbf{X}})=\mathcal{N}(\mu_{prior},\sigma_{prior})$
    - $\mu_{prior}, \sigma_{prior}$ are predicted by a prior encoder network that takes the phoneme embeddings $\hat{\mathbf{X}}$ and the local style embedding $\mathbf{S}^{local}$ as input
  4. During training, the KL-divergence between the transformed posterior sample and the prior is minimized:
    (Eq. 9) $\mathcal{L}_{KL}=D_{KL}\left(\mathbf{Z}_{flow}|| p(\mathbf{Z}_{flow}|\hat{\mathbf{X}})\right)$
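Since (Eq. 9) is written between a sample and a distribution, one way to read it is as a single-sample estimate of the KL term, as done in VITS-style training. The sketch below assumes that reading, a diagonal Gaussian posterior and prior parameterized by log standard deviations, and a volume-preserving flow (so no log-determinant term appears). None of these details are confirmed by the review for ParaStyleTTS, and `kl_estimate` is an illustrative name.

```python
import math


def kl_estimate(z_flow, mu_prior, log_sigma_prior, log_sigma_post):
    """Average per-element single-sample KL estimate between posterior and prior.

    z_flow is a posterior sample after the flow; all arguments are flat lists
    of equal length (assumed layout for illustration).
    """
    total = 0.0
    for z, mu_p, ls_p, ls_q in zip(z_flow, mu_prior, log_sigma_prior, log_sigma_post):
        # log q(z) contributes -log_sigma_post - 0.5 in expectation,
        # -log p(z_flow) contributes log_sigma_prior + squared error term.
        term = ls_p - ls_q - 0.5
        term += 0.5 * (z - mu_p) ** 2 * math.exp(-2.0 * ls_p)
        total += term
    return total / len(z_flow)
```

With matching unit-variance distributions and samples whose mean squared deviation from the prior mean is 1, the estimate averages to 0; shifting the prior mean away from the samples makes it positive.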

- Duration Alignment and Modeling

During training, the phoneme embeddings $\hat{\mathbf{X}}=[\hat{x}_{1},\hat{x}_{2},...,\hat{x}_{L}]$ with integrated paralinguistic style embedding must be aligned with the latent embedding $\mathbf{Z}$
- For this, the paper adopts Monotonic Alignment Search (MAS) to compute a soft alignment matrix $\mathbf{A}\in\mathbb{R}^{L\times T}$
  - $A_{t,j}$ : attention weight between the $t$-th phoneme and the $j$-th frame of $\mathbf{Z}$
- The duration $d_{t}$ of each phoneme is estimated by summing the attention weights along the time axis:
  (Eq. 10) $ d_{t}=\sum_{j=1}^{T}A_{t,j},\,\,\, \forall t\in\{1,...,L\}$
- A Stochastic Duration Predictor (SDP) is then introduced to predict the log-duration distribution conditioned on the phoneme and style features
  1. During training, the log-domain Mean Squared Error (MSE) between the predicted and reference durations is minimized:
    (Eq. 11) $\mathcal{L}_{dur}=\frac{1}{L}\sum_{t=1}^{L}\left(\log \left(d_{t}+\epsilon \right)-\log\left(\hat{d}_{t}+\epsilon \right)\right)^{2}$
    - $\epsilon$ : constant for numerical stability
  2. This lets the model learn phoneme durations intrinsically while capturing the duration variation induced by paralinguistic style
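(Eq. 10) and (Eq. 11) amount to a row sum followed by a log-domain MSE, as in the sketch below. The alignment matrix here is a hand-written toy example rather than the output of MAS, and the default `eps` value is an assumption.

```python
import math


def durations_from_alignment(alignment):
    """Eq. 10: sum each phoneme's row of the L x T alignment over the frame axis."""
    return [sum(row) for row in alignment]


def log_duration_mse(reference, predicted, eps=1e-6):
    """Eq. 11: mean squared error between log durations (eps is an assumed value)."""
    errors = [
        (math.log(d + eps) - math.log(d_hat + eps)) ** 2
        for d, d_hat in zip(reference, predicted)
    ]
    return sum(errors) / len(errors)


# Toy alignment: 3 phonemes over 4 frames, each frame assigned to one phoneme.
toy_alignment = [
    [1, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]
reference = durations_from_alignment(toy_alignment)
print(reference, log_duration_mse(reference, [2.0, 2.0, 1.0]))
```

Working in the log domain penalizes relative rather than absolute duration errors, so a one-frame error on a short phoneme costs more than the same error on a long one.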

- Training Objective

ParaStyleTTS is optimized by additionally combining the objectives of the VITS framework
- Reconstruction loss $\mathcal{L}_{recon}$ between the generated $\hat{\mathbf{Y}}$ and the ground-truth spectrogram:
  (Eq. 12) $\mathcal{L}_{recon}=|| \mathbf{Y}-\hat{\mathbf{Y}}||_{1}$
- Adversarial loss $\mathcal{L}_{adv}$ with respect to the multi-period discriminator $D$:
  (Eq. 13) $\mathcal{L}_{adv}=\mathbb{E}_{\hat{\mathbf{Y}}}\left[\left(D(\hat{\mathbf{Y}})-1\right)^{2}\right]$
- Feature matching loss $\mathcal{L}_{fm}$, which stabilizes adversarial training by aligning the internal features of the discriminator:
  (Eq. 14) $\mathcal{L}_{fm}=\sum_{l=1}^{L}\left|\left| D^{(l)}(\mathbf{Y})-D^{(l)}(\hat{\mathbf{Y}})\right|\right|_{1}$
- The resulting total training objective is:
  (Eq. 15) $\mathcal{L}_{total}=\mathcal{L}_{fm}+\mathcal{L}_{KL}+\mathcal{L}_{dur}+\mathcal{L}_{recon}+\mathcal{L}_{adv}$
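The generator-side terms in (Eq. 12) to (Eq. 15) can be sketched on flat lists of numbers as below. Treating spectrograms and discriminator features as flat lists, and averaging the adversarial term over discriminator outputs to stand in for the expectation, are simplifications for illustration; the function names are not from the paper.

```python
def l1_distance(a, b):
    """Sum of absolute differences between two equally sized flat lists."""
    return sum(abs(x - y) for x, y in zip(a, b))


def reconstruction_loss(y, y_hat):
    """Eq. 12: L1 distance between ground-truth and generated spectrograms."""
    return l1_distance(y, y_hat)


def adversarial_loss(fake_scores):
    """Eq. 13: least-squares generator loss, averaged over discriminator outputs."""
    return sum((s - 1.0) ** 2 for s in fake_scores) / len(fake_scores)


def feature_matching_loss(real_feats, fake_feats):
    """Eq. 14: L1 distance between discriminator features, summed over layers."""
    return sum(l1_distance(r, f) for r, f in zip(real_feats, fake_feats))


def total_loss(l_fm, l_kl, l_dur, l_recon, l_adv):
    """Eq. 15: unweighted sum of all generator-side terms."""
    return l_fm + l_kl + l_dur + l_recon + l_adv
```

The adversarial term is zero only when the discriminator scores every generated sample as 1, while the feature matching term pulls intermediate discriminator activations for generated audio toward those for real audio.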

- Time Complexity Analysis

For the computational complexity analysis, let $N$ be the text sequence length and $M$ the paralinguistic style prompt length
- In ParaStyleTTS, the phoneme and prosody style tokens are encoded separately by Transformer-based FFT blocks and then passed to the Gated Tanh Unit (GTU) style adapter, which gives a complexity of $\mathcal{O}(N^{2})$
  1. In addition, the paralinguistic style prompt is processed independently by the Transformer-based MPNet encoder, which adds $\mathcal{O}(M^{2})$ computation
  2. As a result, ParaStyleTTS has a combined time complexity of $\mathcal{O}(N^{2}+M^{2})$
- In contrast, an LLM-style fusion model concatenates the text and paralinguistic tokens into a single sequence of length $N+M$ and encodes it jointly with the LLM, which gives a complexity of $\mathcal{O}((N+M)^{2})$
  - Since $(N+M)^{2}=N^{2}+2NM+M^{2}$, LLM-style fusion incurs an additional cross-attention cost of $\mathcal{O}(NM)$ compared to ParaStyleTTS and is therefore computationally less efficient
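A quick numeric check of the two cost expressions, counting only the quadratic attention term and ignoring constants. The lengths $N=200$ and $M=20$ are hypothetical values chosen for illustration.

```python
def separate_cost(n, m):
    """Quadratic cost when text and prompt are encoded independently."""
    return n * n + m * m


def fused_cost(n, m):
    """Quadratic cost when both are concatenated into one sequence."""
    return (n + m) ** 2


n, m = 200, 20  # hypothetical text and prompt lengths
extra = fused_cost(n, m) - separate_cost(n, m)
print(separate_cost(n, m), fused_cost(n, m), extra)
```

The gap equals exactly $2NM$, so it grows with both the text length and the prompt length, which is why longer descriptive prompts are comparatively more expensive under joint encoding.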

3. Experiments

- Settings

Dataset : Baker, LJSpeech, ESD
Comparisons : VITS, CosyVoice, Spark-TTS, StyleSpeech, LanStyleTTS

- Results

Overall, ParaStyleTTS achieves the best performance

Speaking Style Controllability
- It also achieves strong performance in terms of style expressiveness

Applying $t$-SNE to the speaking style embeddings yields well-separated clusters

Resource Usage
- ParaStyleTTS uses 763MB of GPU memory

Robust Style Control
- ParaStyleTTS outperforms existing methods in terms of style accuracy and robustness

Per-Class Accuracy Comparison (top) Emotion (middle) Age (bottom) Gender

Distinct clusters also appear when a fixed prompt is used
