[Paper Review] PEFT-TTS: Parameter-Efficient Fine-Tuning for Low-Resource Text-to-Speech via Cross-Lingual Continual Learning

feVeRin 2025. 9. 5. 17:09

Low-resource Text-to-Speech requires a dedicated model
PEFT-TTS
- Introduces three adapters for Parameter-Efficient Fine-Tuning
- Uses a Condition Adapter to improve the text embedding, a Prompt Adapter to refine the input representation, and a DiT LoRA Adapter to improve generation efficiency
Paper (INTERSPEECH 2025) : Paper Link

Adapting a Text-to-Speech (TTS) model to a new language requires a substantial amount of data
- Fine-tuning also requires a large-scale multi-speaker dataset for generalization
- In particular, full fine-tuning of the entire TTS model has to learn a new weight set for each newly adapted language, which is computationally expensive and parameter-inefficient
  - Catastrophic forgetting can also occur, which makes full fine-tuning hard to use in multilingual applications

-> The paper therefore proposes PEFT-TTS, which supports effective language adaptation

PEFT-TTS
- Builds on a pre-trained F5-TTS and introduces three adapter modules: Conditioning Adapter, Prompt Adapter, and DiT LoRA Adapter
- Additionally incorporates DropPath regularization to improve generalization over speaker identity while maintaining linguistic consistency

< Overview of PEFT-TTS >

F5-TTS is a fully non-autoregressive (NAR) TTS model based on the Diffusion Transformer (DiT) and operates with flow matching
- It is trained on a text-guided speech-infilling task in which text is aligned with speech using the simple padding strategy of E2-TTS
- Structurally:
  1. Text Embedding
    - Converts the input into a dense representation with ConvNeXt-v2 blocks to capture local, hierarchical linguistic patterns
  2. Input Embedding
    - Concatenates the processed text embedding with the masked mel-spectrogram and the flow-matching latent variable to ensure proper conditioning
  3. DiT Block
    - Consists of a stack of Transformer-based diffusion layers to model long-range dependencies and natural prosody
- Pre-trained on a multilingual corpus, F5-TTS can capture cross-linguistic phoneme and prosodic patterns, which makes it suitable for cross-lingual adaptation
  - BUT, fine-tuning on a limited single-speaker dataset carries a risk of catastrophic forgetting
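
As a rough illustration of the Input Embedding step, the pure-Python sketch below pads a token sequence to the mel length (in the spirit of the E2-TTS padding strategy), forms a flow-matching training pair, and concatenates the three streams frame by frame. The function names, the filler token id `0`, the linear interpolation path, and the toy dimensions are all assumptions made for illustration; none of them are taken from the paper or from the F5-TTS code.

```python
import random

def pad_text(tokens, n_frames, filler=0):
    """Extend a token sequence with filler tokens up to the number of mel frames."""
    tokens = list(tokens)[:n_frames]
    return tokens + [filler] * (n_frames - len(tokens))

def flow_matching_pair(x1, t, rng):
    """Return (x_t, target) for the assumed linear path x_t = (1 - t) * x0 + t * x1."""
    x0 = [[rng.gauss(0.0, 1.0) for _ in frame] for frame in x1]  # Gaussian noise
    xt = [[(1.0 - t) * n + t * d for n, d in zip(nf, df)] for nf, df in zip(x0, x1)]
    target = [[d - n for n, d in zip(nf, df)] for nf, df in zip(x0, x1)]
    return xt, target

def build_input(xt, masked_mel, text_emb):
    """Concatenate the three per-frame feature vectors along the feature axis."""
    return [a + b + c for a, b, c in zip(xt, masked_mel, text_emb)]

rng = random.Random(0)
n_frames, mel_dim, text_dim = 6, 4, 3  # toy sizes, not the real model dimensions
mel = [[rng.random() for _ in range(mel_dim)] for _ in range(n_frames)]
# Mask the last three frames: these are the frames to infill from the text.
masked_mel = [f if i < 3 else [0.0] * mel_dim for i, f in enumerate(mel)]
tokens = pad_text([11, 12, 13, 14], n_frames)  # -> [11, 12, 13, 14, 0, 0]
# Stand-in for the ConvNeXt-v2 text embedding: one toy vector per token.
text_emb = [[float(tok)] * text_dim for tok in tokens]
xt, target = flow_matching_pair(mel, t=0.3, rng=rng)
model_input = build_input(xt, masked_mel, text_emb)
print(len(model_input), len(model_input[0]))  # 6 frames, 4 + 4 + 3 = 11 features
```

Because the text is padded to the frame count, every frame has exactly one text vector to concatenate with, so no separate aligner or duration predictor appears in this sketch.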

PEFT-TTS introduces the following three adapter modules for low-resource language adaptation
Conditioning Adapter
- A Conv-Adapter is attached to the depth-wise convolution layer of the ConvNeXt-v2-based text embedding module
- The Conv-Adapter consists of a depth-wise convolution followed by a point-wise convolution, and uses squeeze-and-excitation (SE) parameters to modulate the feature response
  - The compression factor $\gamma$ balances parameter efficiency against adaptation capability
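
To make the role of $\gamma$ concrete, the sketch below counts the weights of a hypothetical 1D Conv-Adapter. It assumes the depth-wise part behaves like a grouped convolution that compresses `channels` to `channels // gamma` (in PyTorch terms, `Conv1d(channels, channels // gamma, kernel_size, groups=channels // gamma)`), followed by a point-wise projection back to `channels` and one learnable scale per channel standing in for the SE parameters. The layout, the function name, and the values 512 and 7 are illustrative assumptions, not figures from the paper.

```python
def conv_adapter_params(channels, kernel_size, gamma):
    """Count weights of a hypothetical 1D Conv-Adapter (biases ignored)."""
    if channels % gamma != 0:
        raise ValueError("channels must be divisible by gamma")
    bottleneck = channels // gamma
    # Grouped conv: each bottleneck channel sees `gamma` input channels.
    depthwise = bottleneck * gamma * kernel_size
    # 1x1 projection from the bottleneck back to `channels`.
    pointwise = bottleneck * channels
    # One learnable scale per output channel (stand-in for the SE parameters).
    se_scale = channels
    return depthwise + pointwise + se_scale

for gamma in (1, 2, 4, 8):
    print(gamma, conv_adapter_params(512, 7, gamma))
```

Under these assumptions the point-wise projection dominates the count, so doubling $\gamma$ roughly halves the adapter size while narrowing the bottleneck available for adaptation.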
Prompt Adapter
- The paper applies LoRA to the linear projection layer located after the concatenation of text and audio features
  - The trainability of this layer affects pronunciation accuracy and speaker similarity
- To fine-tune this trade-off, a DropPath mechanism is introduced to support controlled adaptation to the new language
  - DropPath randomly drops the residual path during training, which prevents overfitting to new language or speaker features
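
The DropPath idea can be sketched in pure Python as follows. This is the common stochastic-depth formulation (drop the branch per sample, rescale the survivors by `1 / keep_prob`); where exactly the paper places DropPath and which drop rate it uses are not stated here, so the function names and the rate are assumptions.

```python
import random

def drop_path(branch, drop_prob, training, rng=random):
    """Zero out the branch output for randomly chosen samples during training.

    `branch` is a batch: a list of per-sample feature lists.
    """
    if not training or drop_prob == 0.0:
        return [list(sample) for sample in branch]
    keep_prob = 1.0 - drop_prob
    out = []
    for sample in branch:
        keep = rng.random() < keep_prob
        # Rescale kept samples so the expected output matches evaluation mode.
        scale = 1.0 / keep_prob if keep else 0.0
        out.append([v * scale for v in sample])
    return out

def residual_block(x, branch_fn, drop_prob, training, rng=random):
    """Compute x + DropPath(branch_fn(x)) for every sample in the batch."""
    branch = drop_path([branch_fn(s) for s in x], drop_prob, training, rng)
    return [[a + b for a, b in zip(s, br)] for s, br in zip(x, branch)]

batch = [[1.0, 2.0], [3.0, 4.0]]
adapter = lambda s: [10.0 * v for v in s]  # hypothetical adapter branch
train_out = residual_block(batch, adapter, drop_prob=0.5, training=True,
                           rng=random.Random(0))
eval_out = residual_block(batch, adapter, drop_prob=0.5, training=False)
print(train_out, eval_out)
```

When a sample's branch is dropped, only the identity path remains, so that training step sees the pre-trained behavior for that sample instead of the newly adapted one.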
DiT LoRA Adapter
- The paper applies LoRA to the DiT blocks, which are closely associated with speaker characteristics
  - This lets PEFT-TTS adapt to the training data while preserving its pre-trained capability
- The LoRA rank is set to $16$, which limits speaker-specific bias while maintaining model flexibility
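
A minimal pure-Python sketch of a LoRA-augmented linear layer is given below, using the standard formulation $W x + \frac{\alpha}{r} B A x$ with a frozen $W$ and a zero-initialized $B$. Which DiT matrices receive LoRA, the value of `alpha`, and the layer width 1024 used in the parameter count are assumptions for illustration; only the rank of 16 comes from the text above.

```python
import random

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def lora_param_count(d_in, d_out, rank):
    """Trainable weights added by LoRA: A is (rank x d_in), B is (d_out x rank)."""
    return rank * d_in + d_out * rank

class LoRALinear:
    def __init__(self, weight, rank=16, alpha=16.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight          # frozen pre-trained weight
        self.scale = alpha / rank
        # A starts small and random, B starts at zero, so the update is zero at init.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_params(self):
        d_out, d_in = len(self.weight), len(self.weight[0])
        return lora_param_count(d_in, d_out, len(self.A))

W = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
layer = LoRALinear(W, rank=2, alpha=2.0)
x = [1.0, 0.5, -1.0]
print(layer.forward(x) == matvec(W, x))     # True: B is zero at initialization
print(lora_param_count(1024, 1024, 16))     # 32768, versus 1024 * 1024 frozen weights
```

Since the layer reproduces the pre-trained output exactly at initialization, adaptation starts from the original model, and a small rank bounds how far the update can move it.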
