[Paper 리뷰] UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding

티스토리 뷰

Paper/TTS

[Paper 리뷰] UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding

feVeRin 2024. 4. 6. 11:54

UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding

Semantic token과 acoustic token으로 나누어진 discrete speech token을 활용하면 text-to-speech의 성능을 향상 가능
대표적으로 VALL-E와 SPEAR-TTS는 짧은 speech prompt에서 추출된 acoustic token에 대한 autoregressive continuation으로 zero-shot speaker adaptation이 가능함
- BUT, 해당 autoregressive 모델은 순차적으로 수행되므로 speaker editing에는 적합하지 않고, audio codec 모델로 인해 제한적인 품질을 가짐
UniCATS
- Speech continuation과 editing이 모두 가능한 unified context-aware text-to-speech framework
- Acoustic model로써 CTX-txt2vec을 사용하고 vocoder로써 CTX-vec2wav를 사용하여 구성됨
- CTX-txt2vec은 contextual VQ-diffusion을 사용하여 input text에서 semantic token을 예측
- CTX-vec2wav는 contextual vocoding으로 acoustic context를 반영하고, 해당 semantic token을 waveform으로 변환
논문 (AAAI 2024) : Paper Link

1. Introduction

최근 semantic token과 acoustic token과 같은 discrete speech token이 음성 작업에 자주 활용되고 있음
- 대표적으로 vq-wav2vec, wav2vec 2.0, HuBERT와 같은 semantic token은 discrimination이나 masking prediction을 위해 사용됨
  - 주로 제한된 acoustic detail을 제공하여 articulation information을 capture 함
- 한편으로 SoundStream, Encodec 등에서 도입된 acoustic token은 speech reconstruction을 위해 사용됨
  - 이러한 token은 speaker identity와 같은 acoustic detail을 capture 함
일반적으로 neural Text-to-Speech (TTS)는 mel-spectrogram 예측과 vocoding의 two-stage pipeline으로 구성됨
- 이때 VQTTS는 새로운 intermediate representation으로써 discrete speech token을 활용하여 기존 mel-spectrogram 보다 우수한 TTS 성능을 달성했음
  - 비슷하게 AudioLM은 language modeling에 wav2vec 2.0과 w2v-BERT를 사용하여 뛰어난 음성 합성 성능을 보임
  -> 이러한 모델들은 대부분 autoregressive 모델을 활용하기 때문에, target speaker에 대한 짧은 speech prompt 만으로도 input text를 기반으로 acoustic token을 생성하여 speech continuation을 수행할 수 있음
- 한편으로 context-aware TTS에는 이러한 speech continuation 외에도 speech editing task도 존재함
  - 이때 speech editing은 아래 그림과 같이 input text를 기반으로 음성을 합성하면서, surrounding context와의 smooth concatenation을 보장하는 것을 목표로 함
  -> 따라서 speech editing은 continuation과 달리 preceding context $A$와 following context $B$를 모두 고려해야 함
- BUT, discrete speech token을 활용하는 기존의 TTS 모델들은 아래의 한계점을 가지고 있음
  1. 대부분의 left-to-right로 진행되는 autoregressive (AR) 모델임
    - 결과적으로 preceding/following context가 모두 제공되는 speech editing에 사용할 수 없음
  2. Acoustic token 구성 시, residual vector quantization (RVQ)로 인해 각 frame에 대한 multiple index가 생성됨
    - 해당 방식은 TTS 변환 과정에서 prediction challenge와 complexity를 야기함
  3. 해당 TTS 모델의 품질은 audio codec model의 성능에 의해 크게 좌우됨

-> 그래서 위 한계점들을 해결하고, speech continuation과 editing 모두에 활용할 수 있는 context-aware TTS 모델인 UniCATS를 제안

UniCATS
- Acoustic model인 CTX-txt2vec과 vocoder인 CTX-vec2wav 두 가지 component로 구성됨
- CTX-txt2vec은 contextual VQ-diffusion을 활용하여 input text에서 semantic token을 예측함으로써, semantic context를 incorporate 하고 surrounding context와의 seamless concatenation을 유지함
- CTX-vec2wav는 contextual vocoding을 활용하여 앞선 semantic token을 waveform으로 변환하고 seaker identity와 같은 acoustic context를 반영함

< Overall of UniCATS >

Speech continuation, editing을 모두 지원하는 unified context-aware TTS 모델
CTX-txt2vec에 contextual VQ-diffusion을 도입해 surrounding context와 seamlessly concatenate 되는 sequence를 생성
CTX-vec2wav에 contextual vocoding을 도입하여 acoustic content를 waveform 변환에 반영
결과적으로 우수한 continuation과 editing 성능을 달성

2. UniCATS

UniCATS는 speech continuation과 editing, 2가지 task를 모두 처리하는 것을 목표로 하고, acoustic model인 CTX-txt2vec과 vocoder인 CTX-vec2wav로 구성됨

- CTX-txt2vec with Contextual VQ-Diffusion

CTX-txt2vec은 contextual VQ-diffusion을 사용하여 input text에서 semantic token을 예측해 surrounding context와의 seamless concatenation을 가능하게 함
- 이를 위해 vq-wav2vec token을 semantic token으로 활용하고, VQ-diffusion을 contextual VQ-diffusion으로 확장함
VQ-Diffusion
- VQ-diffusion은 discrete data에 대한 Markovian process를 활용함
- 먼저 discrete index sequence $x_{0}= [x_{0}^{(1)}, x_{0}^{(2)},...,x_{0}^{(l)}]$로 구성된 data sample이 있다고 하자 ($x_{0}^{(i)} \in \{1,2,...,K\}$)
  1. 이때 $t$ step의 corruption으로 얻어지는 sequence를 $x_{t}$라고 하고 simplicity를 위해 $i$를 생략해서 나타냈을때, forward process는:
    (Eq. 1) $q(x_{t}|x_{t-1})=v^{\top}(x_{t})Q_{t}v(x_{t-1})$
    - $v(x_{t}) \in \mathbb{R}^{(K+1)}$ : $x_{t}=k$인 one-hot vector (즉, $k$-th value만 1이고 나머지는 0)
    - $K+1$의 index value는 special token $\mathrm{[mask]}$에 해당함
    - $Q_{t}\in \mathbb{R}^{(K+1)\times(K+1)}$은 $t$-th step에 대한 transition matrix를 나타냄
  2. 결과적으로 multiple forward step을 integrating하면:
    (Eq. 2) $q(x_{t}|x_{0})=v^{\top}(x_{t})\bar{Q}_{t}v(x_{0})$
    - $\bar{Q}_{t}=Q_{t} \dots Q_{1}$
  3. 이제 Bayesian's Rule을 적용하면:
    (Eq. 3) $q(x_{t-1}|x_{t},x_{0})=\frac{q(x_{t}|x_{t-1},x_{0})q(x_{t-1}|x_{0})}{q(x_{t}|x_{0})} =\frac{\left( v^{\top}(x_{t})Q_{t}v(x_{t-1})\right) \left( v^{\top}(x_{t-1})\bar{Q}_{t-1}v(x_{0})\right)}{v^{\top}(x_{t})\bar{Q}_{t}v(x_{0})}$
- VQ-diffusion model은 transformer block의 stack을 사용하여 구성되고, $y$로 condition된 $x_{t}$로부터 $x_{0}$의 분포 $p_{\theta}(\tilde{x}_{0}|x_{t},y)$를 추정하도록 training됨
  - 결과적으로 reverse process 동안 아래의 (Eq. 4)에서 $x_{t}, y$가 주어졌을 때 $x_{t-1}$을 sampling할 수 있음:
  (Eq. 4) $p_{\theta}(x_{t-1}|x_{t},y)=\sum_{\tilde{x}_{0}}q(x_{t-1}|x_{t},\tilde{x}_{0})p_{\theta}(\tilde{x}_{0}|x_{t},y) $
Contextual VQ-Diffusion
- UniCATS는 input text가 condition $y$로 작용하고 생성되는 semantic token이 data $x_{0}$로 represent 되는 speech editing/continuation task를 수행하는 것을 목표로 함
  - 따라서, 앞선 standard VQ-diffusion과 달리 generation process는 data $x_{0}$와 관련된 additional context token $c^{A}, c^{B}$가 필요함
- 결과적으로 CTX-txt2vec은 다음의 probability를 모델링해야 함:
  (Eq. 5) $p_{\theta}(\tilde{x}_{0}|x_{t},y,c^{A},c^{B})$
- 이때 contextual VQ-diffusion을 위해, diffusion step $t$에서 corrupted semantic token $x_{t}$를 clean preceding/following context token $c^{A}, c^{B}$와 chronological order로 concatenate 함
  - 이후 combined sequence $[c^{A}, x_{t}, c^{B}]$는 transformer-based VQ-diffusion model에 제공되고,
  - 이를 통해 모델은 transformer-based block의 self-attention을 사용하여 contextual information을 효과적으로 integrate 할 수 있음
- 여기서 (Eq. 4)와 유사하게, 다음과 같이 posterior를 계산할 수 있음:
  (Eq. 6) $p_{\theta}(x_{t-1}|x_{t},y,c^{A},c^{B})=\sum_{\tilde{x}_{0}}q(x_{t-1}|x_{t},\tilde{x}_{0})p_{\theta}(\tilde{x}_{0}|x_{t},y,c^{A},c^{B})$
Model Architecture
- CTX-txt2vec은 text encoder, duration predictor, length regulator, VQ-diffusion decoder로 구성됨
  - Text/phoneme token의 sequence는 transformer block으로 구성된 text encoder에 의해 encoding 된 다음, duration prediction에 사용됨
  - 이후, text encoder의 output은 해당 duration 값을 기반으로 expand 되어 semantic token의 length와 match 하는 text encoding $h$를 생성함
- VQ-diffusion decoder의 input sequence $[c^{A},x_{t},c^{B}]$는 $t$ diffusion step으로 corrupt 된 data $x_{t}$를 preceding/following context $c^{A}, c^{B}$와 concatenate 하여 구성됨
  - 이때 data와 context를 distinguish 하기 위해 input과 동일한 length의 binary indicator sequence를 사용함
  - 이를 위해 embedding table을 사용하여 indicator sequence를 embedding으로 변환하여 input에 추가하고, positional encoding과 함께 project 하고 combine 함
- CTX-txt2vec의 VQ-diffusion block은 transformer를 기반으로 함
  1. 이때 cross-attention 대신 linear projection을 적용한 다음, self-attention layer에 text encoding $h$를 추가함
    - 이를 통해 $h$와 semantic token 간의 strict alignment를 accommodate 할 수 있음
  2. 이러한 $N$개의 block을 통과한 다음, output은 $p_{\theta}(\tilde{x}_{0}|x_{t},y,c^{A},c^{B})$의 분포를 예측하기 위해 softmax를 사용하여 layer-normalize 됨
    - Transformer-based VQ-diffusion decoder는 input과 동일한 length의 output sequence를 생성하므로 $x_{t}$에 해당하는 output segment만 $\tilde{x}_{0}$로 고려하고, 나머지 segment는 discard 함

Contextual VQ-Diffusion을 사용한 CTX-txt2vec

Training Scheme
- Training 중에 각 utterance는 아래 3가지 configuration을 가질 수 있고, 이 중에서 random 하게 선택하여 활용함 (이때 각 configuration의 비율은 각각 $0.6, 0.3, 0.1$로 설정)
  1. Context $A, B$가 모두 포함되는 경우
    - Utterance는 context $A$와 $x_{0}$, context $B$ 3가지 segement로 randomly divide 됨
    - 이를 위해 $x_{0}$의 length를 randomly determine 하고 $x_{0}$의 starting position을 randomly determine 함
    - 이때 $x_{0}$의 왼쪽, 오른쪽에 위치한 segment를 각각 context $A, B$로 사용함
  2. Context $A$만 포함되는 경우
    - Context $A$의 length를 2~3초로 randomly determine 함
    - 이렇게 결정된 initial segment를 context $A$로 보고 오른쪽에 있는 나머지 segment를 $x_{0}$으로 사용
  3. Context가 없는 경우
    - 전체 utterance가 context 없이 $x_{0}$로 사용됨
- Context와 생성할 data의 분할이 결정되면, (Eq. 2)를 사용하여 $x_{0}$를 corrupt 하여 $x_{t}$를 생성함
  - 이후 corrupted segment는 관련된 context와 concatenate 되어 VQ-Diffusion decode를 위한 input으로 사용됨
- CTX-txt2vec에 대한 training objective $\mathcal{L}_{\textrm{CTX-txt2vec}}$은 duration prediction loss $\mathcal{L}_{\textrm{duration}}$과 VQ-diffusion loss $\mathcal{L}_{\textrm{VQ-diffusion}}$의 weighted summation과 같음:
  (Eq. 7) $\mathcal{L}_{\textrm{CTX-txt2vec}}=\mathcal{L}_{\textrm{duration}}+\gamma\mathcal{L}_{\textrm{VQ-diffusion}} $
  - $\gamma$ : hyperparameter

Inference Algorithm
- Speech editing을 위한 inference process는 아래 [Algorithm 1]과 같음
- 먼저 생성할 음성의 phoneme $y^{D}$를 제공된 context phoneme과 concatenate 함
  - 해당 combined sequence는 duration prediction을 위해 text encoder에 제공됨
- $y^{D}$에 해당하는 예측된 duration $\tilde{d}^{D}$는 context와 유사한 speech speed를 유지하기 위해, factor $\alpha$를 사용하여 rescale 됨
- 이후 VQ-diffusion의 reverse process에 따라 context semantic token을 사용하여 fully corrupted $x_{T}$로부터 data를 iteratively refine 함
  - 이를 통해 최종적으로 edited semantic token $[c^{A},x_{0}, c^{B}]$를 얻음

- CTX-vec2wav with Contextual Vocoding

Semantic token을 waveform으로 변환할 때, speaker identity와 같은 acoustic context를 반영하기 위해 CTX-vec2wav에 contextual vocoding을 도입하여 speaker embedding과 acoustic token에 대한 의존성을 제거함
Model Architecture
- CTX-vec2wav는
  1. 먼저 semantic token은 2개의 semantic encoder를 통해 project 되고 encode 됨
  2. 이후 HiFi-GAN의 generator와 동일한 convolution & upsampling layer를 통과하여 waveform을 생성함
  3. 추가적인 auxiliary feature adaptor는 두 semantic encoder 사이에 위치함
    - 해당 module은 FastSpeech2의 variance adaptor와 유사하게, pitch/energy/probability of voice (POV)와 같은 3D auxiliary feature에 대한 conditioning을 가능하게 함
- Training 시 CTX-vec2wav는 ground-truth auxiliary feature를 semantic encoder의 output으로 예측하도록 학습됨
  - 추론 시에는 예측된 auxiliary feature는 condition으로 활용함
- 이때 semantic token은 speaker identity와 관련한 sufficient acoustic detail이 부족하고, 대신 주로 articulation information을 capture 하는 것으로 나타남
  1. 따라서 CTX-vec2wav는 기존 방식과는 달리 mel-spectrogram $m$을 활용하여 acoustic context를 prompt 하는 방식을 사용
  2. 여기서 CTX-vec2wav의 semantic encoder는 $M$개의 Conformer-based block으로 구성되고, 각 block은 cross-attention layer를 사용하여 mel-spectrogram의 acoustic context를 반영함
  3. 즉, 아래 그림과 같이 mel-spectrogram $m$은 cross-attention layer 이전 단계에서 consecutive frame을 integrate 하는 1D convolution 기반의 mel encoder에 의해 $m'$으로 encoding 되어 사용됨
- 결과적으로 mel-spectrogram에는 position encoding이 없으므로 $m'$은 unordered feature의 collection이 됨
  - 이러한 특성으로 인해 training 중에 mel-spectrogram의 short segment만 활용하더라도, 추론 시 cross-attention을 통해 acoustic context를 prompt 하는 다양한 length의 mel-spectrogram을 활용할 수 있음
Training Scheme
- CTX-vec2wav training 중에 speaker label이 불완전한 dataset를 효과적으로 활용하기 위해, 각 training utterance 내에서 speaker identity가 일정하게 remain 된다고 가정함
- 해당 가정을 기반으로, 아래 그림의 (a)와 같이 각 utterance를 2개의 segment로 나눔
  - 첫 번째 segment는 mel-spectrogram을 추출하고 acoustic context를 prompt 하는 데 사용되고
  - 두 번째 segment는 semantic token을 추출하고 vocoding을 수행하는 데 사용됨
- CTX-vec2wav의 training process는 HiFi-GAN과 동일하고, auxiliary feature 예측을 위해 추가적으로 $L1$ loss를 사용함
  - 이때 multi-task warmup technique을 채택해 training 함

- Unified Framework for Context-Aware TTS

UniCATS는 각 semantic token과 mel-spectrogram을 통해 semantic, acoustic context를 prompt 함
1. 먼저 앞선 [Algorithm 1]에 따라 edited semantic token $[c^{A}, x_{0}, c^{B}]$가 얻어짐
2. 이후 해당 token은 $[m^{A}, m^{B}]$ context의 mel-spectrogram으로 indicate 되는 speaker information과 함께 waveform으로 vocoding 됨
  - Speech continuation과 editing은 context $B$의 유무에 있기 때문에, [Algorithm 1]의 speech editing 과정에서 context $B$를 제외하여 speech continuation으로 generalize 할 수 있음
  - 결과적으로 UniCATS는 continuation과 editing, 2가지의 context-aware TTS task를 모두 처리할 수 있음

3. Experiments

- Settings

Dataset : LibriTTS
Comparisons : FastSpeech2, VALL-E

- Results

Speech Resynthesis from Semantic Tokens
- Semantic token을 vocoding 하는 HiFi-GAN과 AudioLM을 기준으로 CTX-vec2wav의 성능을 비교해 보면
- 제안된 CTX-vec2wav는 naturalness와 speaker similarity 측면에서 가장 우수한 resynthesis 성능을 보임
- 특히 acoustic token을 사용하는 Encodec과 비교하여도 CTX-vec2wav는 더 높은 성능을 달성함

Speech Continuation for Zero-Shot Speaker Adaptation
- FastSpeech2와 VALL-E에 대해 zero-shot speaker adaptation 성능을 비교해 보면
- 정량적/주관적 평가 모두에서 UniCATS가 가장 우수한 성능을 보임

Speaker Editing
- Utterance를 context $A$, 생성할 segment $x$, context $B$의 3가지 segment로 나누어 speech editing 성능을 비교해 보면
- UniCATS는 segment length에 상관없이 일관된 성능을 보이는 것으로 나타남

'Paper > TTS' 카테고리의 다른 글

[Paper 리뷰] PromptTTS++: Controlling Speaker Identity in Prompt-based Text-to-Speech using Natural Language Descriptions (0)	2024.04.11
[Paper 리뷰] PromptTTS: Controllable Text-to-Speech with Text Descriptions (0)	2024.04.08
[Paper 리뷰] Period VITS: Variational Inference with Explicit Pitch Modeling for End-to-End Emotional Speech Synthesis (0)	2024.04.03
[Paper 리뷰] Guided-TTS: A Diffusion Model for Text-to-Speech via Classifier Guidance (0)	2024.04.02
[Paper 리뷰] HierSpeech: Bridging the Gap between Text and Speech by Hierarchical Variational Inference using Self-Supervised Representations for Speech Synthesis (0)	2024.03.29

최근에 올라온 글

최근에 달린 댓글

« 2025/07 »
일	월	화	수	목	금	토
		1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30	31

Total

Today

Yesterday

Let IT Begin

티스토리 뷰

[Paper 리뷰] UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding

UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding

1. Introduction

2. UniCATS

- CTX-txt2vec with Contextual VQ-Diffusion

- CTX-vec2wav with Contextual Vocoding

- Unified Framework for Context-Aware TTS

3. Experiments

- Settings

- Results

'Paper > TTS' 카테고리의 다른 글

티스토리툴바