[Paper Review] NanoVoice: Efficient Speaker-Adaptive Text-to-Speech for Multiple Speakers

feVeRin 2025. 3. 26. 20:31

NanoVoice: Efficient Speaker-Adaptive Text-to-Speech for Multiple Speakers

Personalized text-to-speech models can be built by using adapters for multiple speakers
NanoVoice
- Uses batch-wise speaker adaptation, which fine-tunes on multiple references in parallel
- Additionally introduces parameter sharing to reduce the speaker-adaptation parameters, and incorporates a trainable scale matrix
Paper (ICASSP 2025) : Paper Link

1. Introduction

Speaker-adaptive Text-to-Speech (TTS) models such as VALL-E and VoiceBox aim to accurately mimic a target speaker's voice
- Zero-shot approaches need no additional training cost for adaptation, but building the TTS model requires a large dataset and unique Out-of-Domain (OOD) voices
- One-shot adaptation, on the other hand, must fine-tune a pre-trained multi-speaker TTS model, but it adapts effectively to the desired speaker's voice
  - This fine-tuning approach improves robustness to OOD data and reduces the data/model size requirements of the pre-training phase
- In particular, UnitSpeech performs personalization by fine-tuning a pre-trained diffusion-based model
  1. VoiceTailor uses parameter-efficient fine-tuning such as Low-Rank Adaptation (LoRA)
  2. BUT, existing naive fine-tuning methods fine-tune each task sequentially, which is computationally inefficient and memory-intensive

-> The paper therefore proposes NanoVoice, a parameter-efficient speaker-adaptive TTS model

NanoVoice
- Builds on VoiceTailor as the backbone and introduces a batch-wise fine-tuning scheme to accelerate speaker adaptation for multiple references
- Additionally shares adapters across all references for parameter efficiency

< Overview of NanoVoice >

A parameter-efficient speaker-adaptive TTS model that uses batch-wise fine-tuning and parameter sharing
As a result, it cuts parameters by 45% while maintaining the synthesis quality of existing methods

2. Method

NanoVoice aims to personalize multiple references simultaneously
- To this end, it integrates LoRA into a parameter-efficient one-shot TTS model

- UnitSpeech and VoiceTailor

UnitSpeech is a diffusion-based one-shot TTS model that uses a forward process to progressively transform a mel-spectrogram $X_{0}$ into a noise vector $X_{1}\sim\mathcal{N}(0,I)$
- That is, given a noise schedule $\beta_{t}$ and a random noise vector $\epsilon_{t}\sim \mathcal{N}(0,I)$, the corrupted mel-spectrogram $X_{t}$ at any timestep $t$ is obtained as:
  (Eq. 1) $X_{t}=\sqrt{\lambda_{t}}X_{0}+\sqrt{1-\lambda_{t}}\epsilon_{t},\,\,\,\lambda_{t}=e^{-\int_{0}^{t}\beta_{s}ds}$
- To synthesize speech with the voice of target speaker $S$ and transcript $c$, one needs the reverse trajectory of the pre-defined forward process and the score $\nabla_{X_{t}}\log p(X_{t}|c,S)$ of the corrupted mel-spectrogram $X_{t}$
  1. UnitSpeech therefore trains a network $s_{\theta}(X_{t}|c,S)$ to predict this score and then uses it to generate mel-spectrograms
  2. The training loss and the generation formulation for the score network $s_{\theta}$ are:
    (Eq. 2) $\mathcal{L}(\theta)=\mathbb{E}_{t,X_{0},\epsilon_{t}}\left[|| \sqrt{1-\lambda_{t}}s_{\theta}(X_{t}|c,S)+\epsilon_{t}||_{2}^{2}\right]$
    (Eq. 3) $X_{t-\Delta t}=X_{t}+\beta_{t}\left(\frac{1}{2}X_{t}+s_{\theta}(X_{t}|c,S)\right)\Delta t+\sqrt{\beta_{t}\Delta t}z_{t}$
    - $z_{t}$ : random vector following the standard normal distribution $\mathcal{N}(0,I)$
- The paper uses a UnitSpeech model pre-trained on the LibriTTS dataset with (Eq. 2)
  - During fine-tuning, the parameters are adapted using the target speaker's reference data
- VoiceTailor, in turn, injects low-rank adapters into the attention modules of the pre-trained TTS model to perform parameter-efficient speaker adaptation
  1. That is, for a linear layer with weight matrix $W_{0}\in\mathbb{R}^{d\times k}$, it injects $\Delta W=BA$ with $B\in\mathbb{R}^{d\times r}, A\in\mathbb{R}^{r\times k}$ and uses a scale factor $\alpha$ to obtain $W=W_{0}+\alpha\cdot BA$
  2. Setting $r$ smaller than $d,k$ leaves $B,A$ with fewer parameters than $W_{0}$ ($r(d+k)$ versus $dk$), so fine-tuning is possible with only a few parameters
    - NanoVoice therefore follows VoiceTailor and applies low-rank adaptation with $r=2$
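
The forward corruption in (Eq. 1) and the per-sample term of the loss in (Eq. 2) can be checked numerically with a small sketch. This is only an illustration under stated assumptions: the linear noise schedule, its `beta_min`/`beta_max` values, the helper names `lam`, `corrupt`, `score_loss`, and the 4-value toy input are all hypothetical and are not taken from the paper or from UnitSpeech's code.

```python
import math
import random

def lam(t, beta_min=0.05, beta_max=20.0):
    """lambda_t = exp(-integral_0^t beta_s ds) for a linear schedule (assumed)."""
    # beta_s = beta_min + (beta_max - beta_min) * s, integrated in closed form.
    return math.exp(-(beta_min * t + 0.5 * (beta_max - beta_min) * t * t))

def corrupt(x0, t, rng):
    """(Eq. 1): X_t = sqrt(lambda_t) * X_0 + sqrt(1 - lambda_t) * eps_t."""
    l = lam(t)
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(l) * x + math.sqrt(1.0 - l) * e for x, e in zip(x0, eps)]
    return xt, eps

def score_loss(score, eps, t):
    """Single-sample term of (Eq. 2): || sqrt(1 - lambda_t) * s + eps ||_2^2."""
    c = math.sqrt(1.0 - lam(t))
    return sum((c * s + e) ** 2 for s, e in zip(score, eps))

rng = random.Random(0)
x0 = [0.3, -1.2, 0.8, 0.0]   # toy stand-in for a flattened mel-spectrogram
t = 0.5
xt, eps = corrupt(x0, t, rng)

# A score equal to -eps / sqrt(1 - lambda_t) drives the loss term to zero.
ideal_score = [-e / math.sqrt(1.0 - lam(t)) for e in eps]
loss_ideal = score_loss(ideal_score, eps, t)
loss_zero = score_loss([0.0] * len(x0), eps, t)  # an all-zero prediction
```

The sketch makes the target of (Eq. 2) explicit: the loss term vanishes when the network output equals $-\epsilon_{t}/\sqrt{1-\lambda_{t}}$, while an all-zero prediction leaves a loss of $||\epsilon_{t}||_{2}^{2}$.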

- Batch-Wise Fine-Tuning Scheme with Parameter Sharing

NanoVoice extends VoiceTailor by building multiple adapters from multiple reference voices
- First, $N$ reference speech samples are batched into a batched reference sample $X'_{0}\in\mathbb{R}^{N\times L}$
  - $L$ : maximum mel-spectrogram length among the reference samples
- The $N$ low-rank matrices, one per reference, are then stacked into new matrices $B'\in \mathbb{R}^{N\times d\times r}, A'\in\mathbb{R}^{N\times r\times k}$
  1. During fine-tuning, batch-wise matrix multiplication ensures that the loss/gradient of each reference sample is calculated separately
  2. As a result, computation stays independent within the batch, which yields faster speaker adaptation
- To improve efficiency further, the paper shares the parameters that are less critical to personalization
  1. Specifically, speaker adaptation can use one of four low-rank adapter configurations:
    - $(B',A')$ : baseline setup in which every matrix is used batch-wise
    - $(B,A')$ : setup in which $B$ is shared across all references
    - $(B',A)$ : setup in which $A$ is shared across all references
    - $(B,A)$ : setup in which every matrix is shared
  2. NanoVoice adopts $(B,A')$, which uses $A'$ in a batch-wise manner and shares $B$ across all reference voices
    - Since $B'$ accounts for $2/3$ of the total trainable parameters, sharing it cuts the parameter count substantially
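
The four sharing configurations and the shared-$B$ weight update can be sketched as below. This is a minimal illustration, not the authors' implementation: the toy sizes (`N`, `d`, `k`, `r`), the value of `alpha`, and the helper names are hypothetical, and plain nested lists stand in for tensors to keep the sketch dependency-free. A real implementation would use a batched matrix multiply instead of the Python loop.

```python
import random

def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def adapted_weights(W0, B, A_stack, alpha):
    """Return W0 + alpha * B @ A_n for every reference n (B shared, A batched)."""
    out = []
    for A_n in A_stack:
        delta = matmul(B, A_n)  # d x k update for this reference only
        out.append([[w + alpha * dw for w, dw in zip(w_row, d_row)]
                    for w_row, d_row in zip(W0, delta)])
    return out

def adapter_param_count(n_refs, d, k, r, share_B, share_A):
    """Trainable LoRA parameters for one linear layer under a sharing setup."""
    b_params = d * r * (1 if share_B else n_refs)
    a_params = r * k * (1 if share_A else n_refs)
    return b_params + a_params

rng = random.Random(0)
N, d, k, r, alpha = 3, 8, 4, 2, 0.5            # toy, hypothetical sizes
W0 = rand_matrix(d, k, rng)                    # frozen pre-trained weight
B = rand_matrix(d, r, rng)                     # shared across all references
A_stack = [rand_matrix(r, k, rng) for _ in range(N)]   # A' in R^{N x r x k}
W = adapted_weights(W0, B, A_stack, alpha)     # one d x k weight per reference

counts = {
    "(B', A')": adapter_param_count(N, d, k, r, share_B=False, share_A=False),
    "(B, A')": adapter_param_count(N, d, k, r, share_B=True, share_A=False),
    "(B', A)": adapter_param_count(N, d, k, r, share_B=False, share_A=True),
    "(B, A)": adapter_param_count(N, d, k, r, share_B=True, share_A=True),
}
```

Each entry of `W` depends only on its own $A_{n}$ and the shared $B$, which is what keeps the per-reference computation independent inside a batch. The counts here are for the toy sizes only and are not meant to reproduce the figures reported in the paper.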

- Lightweight Scale Matrix

Combining the batch-wise matrix $A'$ with the shared matrix $B$ reduces the trainable parameters but causes performance degradation
- To boost the LoRA capacity, the paper therefore introduces a trainable scale matrix $m'\in\mathbb{R}^{N\times 1\times k}$ made of stacked scale vectors, one per reference
- The scale matrix $m'$ is first initialized from the column-wise norm of the pre-trained model weight $W_{0}$
  1. So that $W$ can be computed during fine-tuning/inference, $m'$ is not applied directly to $W_{0}+\alpha\cdot BA'$; instead, $W_{0}+\alpha\cdot BA'$ is first normalized by its column-wise norm $||W_{0}+\alpha\cdot BA'||_{c}$
  2. The normalized weight is then multiplied element-wise by the scale matrix $m'$
    - Unlike training a single scale vector, $m'$ is batched across multiple speaker references
- With this approach, speaker adaptation performance improves with only a few additional parameters
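
The normalize-then-scale step can be sketched for a toy layer as follows. This is an illustration under assumptions: the initialization of each scale vector to the column norms of $W_{0}$ is my reading of the description above, the `deltas` stand in for the per-reference products $BA_{n}$, and all names and values are hypothetical. The sketch also assumes no column of $W_{0}+\alpha\cdot BA_{n}$ is all zeros.

```python
import math

def column_norms(W):
    """L2 norm of each column of a d x k matrix (nested lists)."""
    return [math.sqrt(sum(v * v for v in col)) for col in zip(*W)]

def scaled_weight(W0, delta, m, alpha):
    """m * (W0 + alpha * delta) / ||W0 + alpha * delta||_c for one reference."""
    V = [[w + alpha * dw for w, dw in zip(w_row, d_row)]
         for w_row, d_row in zip(W0, delta)]
    norms = column_norms(V)  # one norm per column, shape 1 x k
    return [[m_j * v / n_j for v, m_j, n_j in zip(row, m, norms)] for row in V]

W0 = [[3.0, 0.0], [4.0, 2.0]]      # toy pre-trained weight, d = k = 2
deltas = [                         # stand-ins for B @ A_n, N = 2 references
    [[0.0, 0.0], [0.0, 0.0]],      # reference 0: no low-rank update yet
    [[1.0, -1.0], [0.5, 2.0]],     # reference 1: some update
]
# Assumed initialization: every scale vector starts at the column norms of W0.
m_stack = [column_norms(W0) for _ in deltas]   # m' in R^{N x 1 x k}
W = [scaled_weight(W0, dl, m, alpha=0.5) for dl, m in zip(deltas, m_stack)]
```

With a zero low-rank update the result reduces to $W_{0}$, and for any update the column norms of the resulting weight equal the corresponding scale vector, so $m'$ controls the per-column magnitude of each reference's weight while $BA'$ controls its direction.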

3. Experiments

- Settings

Dataset : LibriTTS
Comparisons : VoiceTailor, UnitSpeech, XTTS, CosyVoice

- Results

With only 21K parameters and a 585-hour dataset, NanoVoice reaches synthesis quality on par with CosyVoice

Ablation Study
- When $B$ is shared across all references, the model matches the SECS of the batch-wise adapter without any shared matrix while using only 37.1% of its trainable parameters
- When $A$ is shared instead, the trainable parameters increase by a factor of 1.75, yet performance is worse than with $B$ sharing

Removing the normalization or the scale matrix degrades performance

Analysis
- NanoVoice maintains consistent speaker similarity even as the number of batched adapters changes
