[Paper Review] BnTTS: Few-Shot Speaker Adaptation in Low-Resource Setting

feVeRin 2025. 2. 16. 08:29

Low-resource languages need text-to-speech models
BnTTS
- A speaker adaptation-based text-to-speech model built on the XTTS architecture
- Integrates the low-resource language into the multilingual pipeline so that its phonetic and linguistic characteristics are reflected
Paper (NAACL 2025) : Paper Link

-> The paper therefore proposes BnTTS, which can perform few-shot adaptation even for low-resource languages

BnTTS
- Integrates the low-resource language into the XTTS training pipeline
- Applies architectural modifications that accommodate the unique phonetic and linguistic features of the low-resource language

< Overview of BnTTS >

Preliminaries
- Given a sequence of $N$ tokens $\mathbf{T}=\{t_{1},t_{2},...,t_{N}\}$ and a speaker mel-spectrogram $\mathbf{S}=\{s_{1},s_{2},...,s_{L}\}$, BnTTS generates speech $\hat{\mathbf{Y}}$ that matches the speaker's characteristics
- With the ground-truth mel-spectrogram denoted $\mathbf{Y}=\{y_{1},y_{2},...,y_{M}\}$, the synthesis process is:
  (Eq. 1) $\hat{\mathbf{Y}}=\mathcal{F}(\mathbf{S},\mathbf{T})$
  - $\mathcal{F}$ : generates speech conditioned on the text and the speaker spectrogram

Audio Encoder
- A Vector Quantized-Variational AutoEncoder (VQ-VAE) encodes the mel-spectrogram frames $\mathbf{Y}$ into a sequence of $M$ discrete tokens drawn from $\mathcal{C}$
  - $\mathcal{C}$ : vocabulary / codebook
- An embedding layer then transforms these tokens into $d$-dimensional vectors $\mathbf{Y}_{e}\in\mathbb{R}^{M\times d}$
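
The quantize-then-embed step above can be sketched as a nearest-neighbour codebook lookup followed by an embedding lookup. This is only a minimal NumPy illustration, not the BnTTS or XTTS code: the codebook size, the dimensions, and the random values are all assumptions, and the encoder network that would normally precede quantization is left out.

```python
import numpy as np

rng = np.random.default_rng(0)

# All sizes below are assumed for illustration only.
M, feat_dim, d = 6, 4, 8   # number of frames, per-frame feature size, embedding size
codebook_size = 16         # |C|

frames = rng.normal(size=(M, feat_dim))                # stand-in for encoded mel frames Y
codebook = rng.normal(size=(codebook_size, feat_dim))  # stand-in for the VQ codebook C
embedding = rng.normal(size=(codebook_size, d))        # stand-in for the embedding table

# Squared Euclidean distance from every frame to every code: shape (M, |C|)
dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)

tokens = dists.argmin(axis=1)  # nearest code index per frame, shape (M,)
Y_e = embedding[tokens]        # embedded tokens, shape (M, d)

print(tokens.shape, Y_e.shape)
```

Each frame is replaced by the index of its closest codebook entry, and the embedding table maps those indices to the $M\times d$ matrix that the LLM consumes.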

Conditioning Encoder & Perceiver Resampler
- The XTTS conditioning encoder consists of $l$ layers of $k$-head Scaled Dot-Product Attention, followed by a Perceiver Resampler
  - Here the speaker spectrogram $\mathbf{S}$ is converted into an intermediate representation $\mathbf{S}_{z}\in\mathbb{R}^{L\times d}$, and each attention layer applies the Scaled Dot-Product Attention mechanism
- The Perceiver Resampler produces a fixed-size output $\mathbf{R}\in\mathbb{R}^{P\times d}$ regardless of the variable input length $L$
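
To see why the output size stays fixed, here is a rough sketch of the resampling idea: $P$ latent query vectors cross-attend to the $L$ input positions, so the result always has $P$ rows. It is a simplification under stated assumptions (single head, no learned projections, random latents instead of trained ones), and the names `resample` and `latents` are hypothetical rather than taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])  # (num_queries, num_keys)
    return softmax(scores, axis=-1) @ V      # (num_queries, d)

def resample(S_z, latents):
    # latents (P, d) act as queries; S_z (L, d) supplies keys and values
    return scaled_dot_product_attention(latents, S_z, S_z)

rng = np.random.default_rng(0)
d, P = 8, 4                         # assumed sizes
latents = rng.normal(size=(P, d))   # learned in a real model; random here

for L in (10, 37):                  # two different reference lengths
    S_z = rng.normal(size=(L, d))
    R = resample(S_z, latents)
    print(L, R.shape)               # the shape is (4, 8) for both lengths
```

Because the number of queries is fixed at $P$, a longer or shorter reference clip only changes how many keys and values are attended over, not the shape of $\mathbf{R}$.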

Text Encoder
- The text tokens $\mathbf{T}=\{t_{1},t_{2},...,t_{N}\}$ are projected into a continuous embedding space, producing $\mathbf{T}_{e}\in\mathbb{R}^{N\times d}$

Large Language Model (LLM)
- The paper uses the decoder portion of a transformer-based LLM
- The input is formed by concatenating the speaker embedding $\mathbf{S}_{p}$, the text embedding $\mathbf{T}_{e}$, and the ground-truth spectrogram embedding $\mathbf{Y}_{e}$:
  (Eq. 2) $\mathbf{X}=\mathbf{S}_{p}\oplus\mathbf{T}_{e}\oplus\mathbf{Y}_{e}\in\mathbb{R}^{(N+P+M)\times d}$
- Taking $\mathbf{X}$ as input, the LLM produces an output $\mathbf{H}$ containing hidden states for the text, speaker, and spectrogram embeddings
- At inference, only the text and speaker embeddings are concatenated, and the LLM outputs the spectrogram embeddings $\{h_{1}^{Y},h_{2}^{Y},...,h_{P}^{Y}\}$
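
The shapes in Eq. 2 are easy to check with a small sketch. The sizes below are assumptions, and `llm` is a hypothetical placeholder that simply returns an array of the same shape; it is there to show the bookkeeping, not the model.

```python
import numpy as np

P, N, M, d = 4, 5, 6, 8           # assumed lengths and width

S_p = np.zeros((P, d))            # speaker embedding
T_e = np.ones((N, d))             # text embedding
Y_e = np.full((M, d), 2.0)        # ground-truth spectrogram embedding

def llm(X):
    # Hypothetical placeholder: a real decoder returns one hidden state per input row.
    return X.copy()

# Training: speaker, text, and spectrogram embeddings are concatenated along the sequence axis.
X_train = np.concatenate([S_p, T_e, Y_e], axis=0)   # (P + N + M, d)
H = llm(X_train)
H_Y = H[P + N:]                   # hidden states at the spectrogram positions, (M, d)

# Inference: only speaker and text embeddings form the prompt.
X_infer = np.concatenate([S_p, T_e], axis=0)        # (P + N, d)

print(X_train.shape, H_Y.shape, X_infer.shape)      # (15, 8) (6, 8) (9, 8)
```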

HiFi-GAN Decoder
- The HiFi-GAN decoder converts the LLM output into realistic speech while preserving the speaker's characteristics
- It takes the LLM's speech head output $\mathbf{H}_{Y}=\{h_{1}^{Y},h_{2}^{Y},...,h_{P}^{Y}\}$ as input, and the speaker embedding $\mathbf{S}$ is resized to match $\mathbf{H}_{Y}$, giving $\mathbf{S}'\in\mathbb{R}^{P\times d}$
- The final audio waveform $\mathbf{W}$ is then:
  (Eq. 3) $\mathbf{W}=g_{\text{HiFi}}(\mathbf{H}_{Y}+\mathbf{S}')$
- This lets the HiFi-GAN decoder generate speech for the input text while maintaining the speaker's unique qualities

Reference-independent Evaluation
- BnTTS performs well on subjective evaluations such as SMOS, Naturalness, and Clarity
- In particular, the few-shot BnTTS-n consistently achieves higher speaker fidelity and intelligibility than the zero-shot BnTTS-0

Effect of Sampling and Prompt Length on Short Speech Generation
- With the default setting of temperature $T=0.85, \text{TopK}=50$, BnTTS performs poorly when generating short audio sequences for text under 30 characters
- In this case, using a short prompt and adjusting to $T=1.0, \text{TopK}=2$ improves performance
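
For readers less familiar with these two knobs, the following is a generic sketch of temperature scaling plus top-k filtering for one decoding step. It is not BnTTS's decoding code: `sample_token` is a hypothetical helper, and the random logits stand in for a real LLM output.

```python
import numpy as np

def sample_token(logits, temperature, top_k, rng):
    """Sample one token id after temperature scaling and top-k filtering."""
    logits = np.asarray(logits, dtype=float) / temperature
    top_k = min(top_k, logits.size)
    keep = np.argsort(logits)[-top_k:]         # indices of the k largest logits
    kept = logits[keep] - logits[keep].max()   # stabilise before exponentiating
    probs = np.exp(kept) / np.exp(kept).sum()  # softmax over the kept logits only
    return int(rng.choice(keep, p=probs))

rng = np.random.default_rng(0)
logits = rng.normal(size=100)   # stand-in for one step of LLM output logits

# Distinct tokens seen over 200 draws under each setting from the post.
default_draws = {sample_token(logits, 0.85, 50, rng) for _ in range(200)}
short_draws = {sample_token(logits, 1.0, 2, rng) for _ in range(200)}

print(len(default_draws), len(short_draws))    # the second count is at most 2
```

With $\text{TopK}=2$ only the two most likely tokens can ever be chosen at each step, so decoding is far more constrained than under the default $\text{TopK}=50$.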
