[Paper Review] ZCS-CDiff: A Zero-Shot Code-Switching TTS System with Conformer-Based Diffusion Model


feVeRin 2025. 5. 28. 17:41

ZCS-CDiff: A Zero-Shot Code-Switching TTS System with Conformer-Based Diffusion Model

Code-Switching Text-to-Speech has limited applicability in zero-shot scenarios
ZCS-CDiff
- Disentangles speech features and models the disentangled attributes with diffusion models
- Uses a Conformer-based WaveNet as the denoising network to improve attribute modeling
- Additionally improves speaker similarity through a speaker-assist module
Paper (ICASSP 2025) : Paper Link

1. Introduction

Code-Switching (CS) refers to alternating between languages within a conversation
- BUT, existing Code-Switching Text-to-Speech (CS TTS) models struggle to generate high-fidelity speech in Zero-Shot (ZS) scenarios with unseen speakers
- Meanwhile, ZS TTS models such as YourTTS synthesize only a single language per utterance, so they are not specialized for the CS task
  - In particular, ZS CS requires effectively disentangling highly entangled speech features

-> The paper therefore proposes ZCS-CDiff for effective ZS CS TTS

ZCS-CDiff
- Adopts a feature disentanglement approach that decomposes speech features into multiple attributes
  - This enables precise control over the individual attributes
- Uses diffusion models to model the disentangled attributes and introduces a Conformer-based WaveNet to support long-sequence feature modeling
- Additionally applies a speaker-assist module to improve speaker similarity between the generated CS speech and the reference speech

< Overview of ZCS-CDiff >

A zero-shot code-switching TTS model built on a Conformer-based diffusion model
As a result, it achieves better synthesis performance than existing methods

2. Method

- Conformer-based WaveNet

The paper uses a Conformer-based WaveNet as the denoising network of the diffusion models
- Starting from the DiffWave architecture, it modifies the convolutional layers outside the residual blocks and the internal structure of the residual blocks
  - This lets the diffusion models capture multiple attributes effectively
- First, an aggregation block following the Transformer's $\text{Attention}\rightarrow \text{Norm}\rightarrow \text{MLP}$ design replaces the convolution layers external to the residual blocks
  1. Given an input feature $F\in\mathbb{R}^{B\times C\times L}$, a Conformer $\text{Cfmr}$ and a LayerNorm $\text{LN}$ extract a global representation
    - $B$ : batch size, $C$ : number of channels, $L$ : sequence length
  2. A CNN and a SiLU activation are then applied to capture local detail
    - The resulting feature is $F'\in\mathbb{R}^{B\times C\times L}$
  3. That is, the process is formulated as:
    (Eq. 1) $ F'=\text{SiLU}(\text{CNN}(\text{LN}(\text{Cfmr}(F))))$
- Inside the residual block, Conformers further process the condition feature $c\in\mathbb{R}^{B\times C\times L}$ and the noisy feature $x_{t}\in\mathbb{R}^{B\times C\times L}$
  1. When the residual block processes data, $x_{t}$ is combined with the time step embedding $t\in \mathbb{R}^{B\times C\times 1}$ to produce $x'_{t}\in\mathbb{R}^{B\times C\times L}$
  2. Conformers then extract features from $x'_{t}$ and $c$ separately, and the two features are aggregated
  3. Finally, the aggregated feature passes through a convolution layer and is split evenly along the channel dimension into the residual out $\text{RO}\in\mathbb{R}^{B\times \frac{C}{2}\times L}$ and the skip out $\text{SO}\in\mathbb{R}^{B\times \frac{C}{2}\times L}$:
    (Eq. 2) $x'_{t}=x_{t}+t,\,\,\, \text{RO, SO}=\text{Split}(\text{CNN}(\text{Cfmr}(x'_{t})+\text{Cfmr}(c)))$
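
The data flow of (Eq. 2) can be traced with a small shape-level sketch. This is only an illustration for a single batch item stored as a `[C][L]` nested list: `cfmr` and `cnn` are shape-preserving placeholders (assumptions, not the paper's Conformer or convolution layers), and all values are toy numbers.

```python
# Shape-level sketch of the modified residual block (Eq. 2), single batch item.
# `cfmr` and `cnn` are placeholders that only preserve shape, NOT real layers.

def cfmr(x):
    """Placeholder for a Conformer block: returns a copy with shape [C][L]."""
    return [row[:] for row in x]

def cnn(x):
    """Placeholder for the convolution layer: returns a copy with shape [C][L]."""
    return [row[:] for row in x]

def add(a, b):
    """Element-wise sum of two [C][L] features."""
    return [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]

def residual_block(x_t, t_emb, c):
    # x'_t = x_t + t : the [C][1] step embedding is broadcast along length L
    x_prime = [[v + t_emb[ch][0] for v in row] for ch, row in enumerate(x_t)]
    # Aggregate the features of the noisy input and of the condition
    h = cnn(add(cfmr(x_prime), cfmr(c)))
    # Split evenly along the channel axis into residual out and skip out
    half = len(h) // 2
    return h[:half], h[half:]

C, L = 4, 3
x_t = [[float(ch)] * L for ch in range(C)]   # toy noisy feature
t_emb = [[0.5] for _ in range(C)]            # toy time step embedding
c = [[1.0] * L for _ in range(C)]            # toy condition feature
ro, so = residual_block(x_t, t_emb, c)       # each has shape [C/2][L]
```

With real layers in place of the placeholders, the broadcasting of $t$ and the even channel split would stay the same; only the feature values would change.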

- Speaker-Assist Module

The Speaker-Assist module consists of a multi-scale feature extraction module and a feature fusion module
- The multi-scale extraction module performs both local and global feature extraction
  1. Given an input speaker feature $F\in\mathbb{R}^{B\times C}$, $F$ is split evenly along the channel dimension into $F_{1}\in\mathbb{R}^{B\times \frac{C}{2}}$ and $F_{2}\in\mathbb{R}^{B\times \frac{C}{2}}$ to extract multi-scale features
  2. $F_{1}$ is passed to a CNN to capture local information, and $F_{2}$ is passed to a Conformer to capture long-range global information
    - The resulting local and global features are $F_{local}\in\mathbb{R}^{B\times \frac{C}{2}}$ and $F_{global}\in\mathbb{R}^{B\times \frac{C}{2}}$
  3. Finally, $F_{local}$ and $F_{global}$ are concatenated along the channel dimension to form the final multi-scale feature $F'\in\mathbb{R}^{B\times C}$:
    (Eq. 3) $ F_{1},F_{2}=\text{Split}(F),\,\,\, F'=\text{Cat}\left(\text{CNN}(F_{1}),\text{Cfmr}(F_{2})\right)$
    - Because the fused feature contains both local and global information, it provides a comprehensive representation of speaker characteristics
- Simply combining the multi-scale feature $F'$ with other features cannot fully utilize the information it contains
  - The paper therefore uses a non-causal WaveNet as the feature fusion module
- As a result, the extracted multi-scale speaker feature $F'$ is combined with the other features $F_{other}\in\mathbb{R}^{B\times C\times L}$ and fed into the feature fusion module to obtain the semantic information $F_{fusion}\in\mathbb{R}^{B\times C\times L}$:
  (Eq. 4) $F_{fusion}=\text{Proj}\left(\text{WaveNet}(\text{Pre}(F_{other}),F')\right)$
  - This multi-scale extraction and fusion process produces an expressive composite feature and achieves higher speaker similarity
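
The split-and-concatenate pattern of (Eq. 3) can be sketched for a single speaker vector as follows. `local_branch` and `global_branch` are hypothetical stand-ins for the CNN and the Conformer (simple arithmetic, chosen only so the result is easy to check); the sketch shows the channel bookkeeping, not the actual networks.

```python
# Sketch of the multi-scale extraction in Eq. 3 for one speaker vector of size C.
# Both branches are hypothetical stand-ins, not the paper's CNN or Conformer.

def local_branch(f1):
    """Stand-in for the CNN branch: a per-element transform."""
    return [2.0 * v for v in f1]

def global_branch(f2):
    """Stand-in for the Conformer branch: every output sees all inputs."""
    mean = sum(f2) / len(f2)
    return [v + mean for v in f2]

def multi_scale(f):
    half = len(f) // 2
    f1, f2 = f[:half], f[half:]                    # Split(F) along channels
    return local_branch(f1) + global_branch(f2)    # Cat(...) restores size C

speaker = [1.0, 2.0, 3.0, 5.0]   # toy speaker feature with C = 4
fused = multi_scale(speaker)     # [2.0, 4.0, 7.0, 9.0]
```

The output has the same size $C$ as the input, which is what allows $F'$ to be used as the conditioning input of the fusion WaveNet in (Eq. 4).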

- Speech Disentanglement

The paper disentangles speech into content, pitch, energy, speaker, and duration representations to control speech-related attributes and improve generation quality
- Content Representation
  - Soft HuBERT is used to extract a speaker-independent speech content representation
- Pitch/Energy Representation
  - Pitch and energy are extracted following FastSpeech2
- Speaker Representation
  - H/ASP is used as the speaker encoder to extract the speaker representation
- Duration Representation
  - Montreal Forced Aligner (MFA) is used to extract durations
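
As an illustration of the duration representation, the sketch below converts aligner-style phone intervals (in seconds) into per-phone frame counts. The review does not describe this step, so everything here is an assumption: the `(phone, start, end)` tuples, the sample rate of 22050 Hz, and the hop length of 256 are hypothetical values chosen for the example.

```python
# Sketch: turning phone intervals (seconds) into frame-level durations.
# The intervals, sample rate and hop length below are hypothetical.

def intervals_to_frames(intervals, sample_rate=22050, hop_length=256):
    """Map (phone, start_sec, end_sec) tuples to (phone, n_frames) tuples."""
    durations = []
    for phone, start, end in intervals:
        # Round both boundaries first so that consecutive durations
        # always sum to the total number of frames.
        start_frame = round(start * sample_rate / hop_length)
        end_frame = round(end * sample_rate / hop_length)
        durations.append((phone, end_frame - start_frame))
    return durations

alignment = [("HH", 0.00, 0.08), ("AH", 0.08, 0.20), ("L", 0.20, 0.29)]
frames = intervals_to_frames(alignment)
# frames == [("HH", 7), ("AH", 10), ("L", 8)]
```

Rounding the boundaries rather than the individual lengths keeps the total equal to the utterance length in frames, which matters when the durations are later used to expand phone-level features to the mel-spectrogram length.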

- Modeling Disentangled Attributes with the Diffusion Models

The paper uses DiffUnit, DiffPitch, DiffEnergy, DiffDuration, and DiffMel to model the content, pitch, energy, duration, and mel-spectrogram representations, respectively
- During training, each diffusion model adopts a strategy of directly predicting the clean data $x_{0}$:
  (Eq. 5) $x_{t}=\alpha_{t}x_{0}+\sigma_{t}\epsilon,\,\,\,\epsilon\sim\mathcal{N}(0,1)$
  - $\epsilon$ : noise sampled from the standard Gaussian distribution $\mathcal{N}(0,1)$
  - $\alpha_{t}=\prod_{i=1}^{t}\sqrt{1-\beta_{i}}$ : time-dependent coefficient that determines the influence of the original data $x_{0}$ on the current state $x_{t}$
  - $\sigma_{t}=\sqrt{1-\alpha_{t}^{2}}$ : specifies the amount of noise contained in $x_{t}$ at time step $t$
- Training
  1. The time step $t$ of the Markov chain is sampled uniformly from the range $[1,T]$
    - Based on this, the clean feature $x_{0}$ is diffused to $x_{t}$ through the reparameterization in (Eq. 5)
  2. The overall process is equivalent to directly transforming the clean data $x_{0}$ into the noisy data $x_{T}$ under a fixed noise schedule $\beta_{1},...,\beta_{T}$
    - Each element of the sequence $\{\beta_{t}\}_{t=1}^{T}$ lies in the range $[0,1]$
  3. The condition $c$, the time step $t$, and the noisy data $x_{t}$ are then fed into the Conformer-based denoiser $f_{\theta}$ to recover the clean data $x_{0}$, and the parameters $\theta$ are updated by gradient descent on:
    (Eq. 6) $\nabla_{\theta}\left|\left| x_{0}-f_{\theta}\left(\alpha_{t}x_{0} +\sigma_{t}\epsilon |c,t\right)\right|\right|_{2}^{2}$
  4. As a result, the training loss of the model is:
    (Eq. 7) $\mathcal{L}_{training}=\mathcal{L}_{p}+\mathcal{L}_{e}+\mathcal{L}_{d}+\mathcal{L}_{u}+\mathcal{L}_{m}$
    - $\mathcal{L}_{p}=||p-\hat{p}||_{2}^{2}, \mathcal{L}_{e}=||e-\hat{e}||_{2}^{2},\mathcal{L}_{d}=||d-\hat{d}||_{2}^{2}, \mathcal{L}_{u}=||u-\hat{u}||_{2}^{2},\mathcal{L}_{m}=||m-\hat{m}||_{2}^{2}$
    - $p,e,d,u,m$ : target pitch, energy, duration, soft unit, and mel-spectrogram, respectively
    - $\hat{p},\hat{e},\hat{d},\hat{u},\hat{m}$ : predicted pitch, energy, duration, soft unit, and mel-spectrogram, respectively
- Inference
  1. At inference, the model iteratively predicts the clean (undisturbed) $x_{0}$ and, at each step, adds back an appropriate perturbation through the posterior distribution, gradually generating increasingly detailed features
  2. That is, the denoising model $f_{\theta}(x_{t}|c,t)$ predicts $\hat{x}_{0}$, and $x_{t-1}$ is then sampled from the posterior distribution $q(x_{t-1}|x_{t},\hat{x}_{0})$ in (Eq. 8) based on $x_{t}$ and the predicted $\hat{x}_{0}$
  3. Finally, the clean data $x_{0}$ is obtained after $T$ iterations:
    (Eq. 8) $q(x_{t-1}|x_{t},\hat{x}_{0})=\mathcal{N}(x_{t-1};\tilde{\mu}_{t}(x_{t},\hat{x}_{0}),\tilde{\beta}_{t}I)$
    - $\tilde{\mu}_{t}(x_{t},\hat{x}_{0})=\frac{\alpha_{t-1}\beta_{t}}{\sigma_{t}^{2}}\hat{x}_{0}+ \frac{\sqrt{1-\beta_{t}}\,\sigma_{t-1}^{2}}{\sigma_{t}^{2}}x_{t},\,\tilde{\beta}_{t}=\frac{\sigma_{t-1}^{2}}{\sigma_{t}^{2}}\beta_{t}$
    - Here $\sigma_{t}^{2}=1-\alpha_{t}^{2}$, following the definitions in (Eq. 5)
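
The schedule, the forward diffusion of (Eq. 5), the $x_{0}$-prediction objective of (Eq. 6), and the posterior sampling of (Eq. 8) can be put together in a minimal standard-library sketch. Several things are assumptions made for illustration: the linear $\beta$ schedule and its values, the toy 3-element target, and the `oracle` denoiser that simply returns the target in place of a trained $f_{\theta}$ (conditioning on $c$ is omitted). The posterior coefficients are written in terms of $\sigma_{t}^{2}=1-\alpha_{t}^{2}$, which is the Gaussian posterior implied by the definitions in (Eq. 5).

```python
import math
import random

def make_schedule(betas):
    """Return lists alpha_t, sigma_t for t = 0..T (index 0 means no noise)."""
    alphas, sigmas = [1.0], [0.0]
    for beta in betas:
        alphas.append(alphas[-1] * math.sqrt(1.0 - beta))
        sigmas.append(math.sqrt(1.0 - alphas[-1] ** 2))
    return alphas, sigmas

def q_sample(x0, t, alphas, sigmas, rng):
    """Eq. 5: x_t = alpha_t * x0 + sigma_t * eps, with eps ~ N(0, 1)."""
    return [alphas[t] * v + sigmas[t] * rng.gauss(0.0, 1.0) for v in x0]

def x0_loss(x0, x0_hat):
    """Squared L2 distance between clean data and its prediction (Eq. 6)."""
    return sum((a - b) ** 2 for a, b in zip(x0, x0_hat))

def posterior_step(x_t, x0_hat, t, betas, alphas, sigmas, rng):
    """Sample x_{t-1} from q(x_{t-1} | x_t, x0_hat) as in Eq. 8."""
    beta = betas[t - 1]                       # beta_t (list is 0-indexed)
    var_t, var_prev = sigmas[t] ** 2, sigmas[t - 1] ** 2
    coef_x0 = alphas[t - 1] * beta / var_t
    coef_xt = math.sqrt(1.0 - beta) * var_prev / var_t
    std = math.sqrt(var_prev / var_t * beta)  # zero at t = 1
    return [coef_x0 * a + coef_xt * b + std * rng.gauss(0.0, 1.0)
            for a, b in zip(x0_hat, x_t)]

def sample(denoiser, length, betas, rng):
    """Start from Gaussian noise and run T posterior steps down to x_0."""
    alphas, sigmas = make_schedule(betas)
    x = [rng.gauss(0.0, 1.0) for _ in range(length)]
    for t in range(len(betas), 0, -1):
        x0_hat = denoiser(x, t)               # predict clean data directly
        x = posterior_step(x, x0_hat, t, betas, alphas, sigmas, rng)
    return x

T = 50
betas = [1e-4 + (0.05 - 1e-4) * i / (T - 1) for i in range(T)]  # assumed schedule
target = [0.3, -1.2, 0.8]                     # toy clean feature
oracle = lambda x_t, t: target                # stand-in for a trained f_theta

alphas, sigmas = make_schedule(betas)
noisy = q_sample(target, T, alphas, sigmas, random.Random(1))   # training input
result = sample(oracle, len(target), betas, random.Random(0))   # inference
```

Because the posterior variance vanishes at $t=1$ and the weight on $\hat{x}_{0}$ becomes 1, the last step returns the denoiser's prediction itself, so with the oracle denoiser `result` matches `target` up to floating-point error.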

3. Experiments

- Settings

Dataset : AISHELL-3, LibriTTS
Comparisons : YourTTS, FastSpeech2

- Results

Overall, ZCS-CDiff achieves the best performance among the compared models

ZCS-CDiff also performs best in cross-lingual TTS

It also achieves strong performance on the ZS CS TTS task

Ablation Study
- Removing any of the components degrades performance
