[Paper Review] Mels-TTS: Multi-Emotion Multi-Lingual Multi-Speaker Text-to-Speech System via Disentangled Style Tokens


feVeRin 2024. 4. 24. 10:21

Mels-TTS: Multi-Emotion Multi-Lingual Multi-Speaker Text-to-Speech System via Disentangled Style Tokens

Disentangled style tokens can be used for effective emotion transfer
Mels-TTS
- Inspired by global style tokens, uses separate style tokens that disentangle emotion, language, speaker, and residual information
- Applies an attention mechanism so that each style token learns the corresponding speech attribute of the target speech
Paper (ICASSP 2024) : Paper Link

1. Introduction

Text-to-Speech (TTS) is being extended to multi-emotion, multi-lingual settings, but synthesis is still limited in real-world scenarios
- In particular, emotion transfer and cross-lingual TTS aim to learn emotion/language information from a source speaker and apply it to a target speaker
  - BUT, speech attributes such as content, speaker identity, emotion, and language are inherently intertwined
- Therefore, transferring a desired speech attribute to the target speaker requires being able to separate that attribute from the others
  1. When suitable labels exist for the attribute, disentanglement is easier, but label-based approaches cannot handle unseen labels and are ill-suited to cross-lingual scenarios
  2. Emotion in particular is difficult to separate, since it exhibits considerable complexity and variability
- Reference-based TTS can be considered to overcome this disentanglement problem
  1. Unsupervised methods for extracting emotion have been proposed, but they still do not guarantee exclusive separation
  2. In particular, the intensity of speech attributes varies considerably within an utterance
    - i.e., if the intended information is not clearly conveyed in the utterance, a reference-based system cannot properly extract it from the reference speech
  3. As a result, when the intensities of various speech attributes fluctuate within an individual utterance, accurate extraction of the intended information is hindered

-> To address this disentanglement problem in reference-based TTS and to handle multi-emotion, multi-lingual, multi-speaker synthesis effectively, the paper proposes Mels-TTS

Mels-TTS
- Inspired by Global Style Token (GST), designs an emotion encoder for style control
  - Following the GST approach, learns the similarity between the target speaker's reference embedding and the style tokens
- Uses four kinds of style tokens (speaker, language, residual, emotion) to improve disentanglement ability
- In the training phase, an attention mechanism learns how each disentangled style token influences the target speech
  - This lets the model learn the influence of each speech attribute separately, achieving successful disentanglement
- At inference, only the desired emotion token set is selected from the disentangled style tokens
  - These tokens are used to extract an emotion embedding from the desired reference embedding via attention, enabling robust emotion transfer

< Overview of Mels-TTS >

Disentangles emotion, language, speaker, and residual information based on global style tokens
Applies an attention mechanism so that each style token learns the corresponding speech attribute of the target speech
As a result, achieves better emotion transfer than existing reference-based TTS in multi-lingual, multi-speaker settings

2. Method

- Overall Architecture

Mels-TTS is based on a Tacotron variant
- The text encoder converts the phoneme sequence into text embeddings, and the decoder generates acoustic features autoregressively
  1. At each decoder step, the pre-net processes the previous frame of the target acoustic features (the previously predicted frame during inference)
  2. The pre-net output is concatenated with the previous attention context to produce the current context vector
  3. This is passed to the decoder RNN stack, which predicts the target acoustic features and the stop token
- Mels-TTS aims to handle emotion, speaker, and language information
  1. To this end, an emotion embedding is extracted by the emotion encoder, which contains the disentangled style tokens, and concatenated to the text embedding
    - Here, the speaker and language IDs are passed through look-up tables to produce disentangled style tokens
  2. After a linear layer, the speaker and language embeddings are concatenated with the decoder's pre-net output, providing control over speaker and language
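
The conditioning path described above can be sketched at the shape level with NumPy. This is only an illustration under stated assumptions: every dimension, the random stand-in tensors, the use of separate linear layers (`W_spk`, `W_lang`) for speaker and language, and broadcasting the emotion embedding over the phoneme axis are assumptions, not details confirmed by the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# All sizes are illustrative assumptions, not values from the paper.
n_phonemes, d_text, d_emo = 12, 256, 128
n_speakers, n_languages, d_id, d_prenet = 10, 2, 64, 128

text_emb = rng.standard_normal((n_phonemes, d_text))  # stand-in for text encoder output
emotion_emb = rng.standard_normal((1, d_emo))         # stand-in for emotion encoder output

# Assumption: the single emotion embedding is repeated over the phoneme axis
# before being concatenated to the text embedding.
encoder_out = np.concatenate(
    [text_emb, np.repeat(emotion_emb, n_phonemes, axis=0)], axis=-1
)

# Speaker / language IDs go through look-up tables, then a linear layer.
speaker_table = rng.standard_normal((n_speakers, d_id))
language_table = rng.standard_normal((n_languages, d_id))
W_spk = rng.standard_normal((d_id, d_id))    # hypothetical linear layer (no bias)
W_lang = rng.standard_normal((d_id, d_id))   # hypothetical linear layer (no bias)

speaker_id, language_id = 3, 1
speaker_emb = speaker_table[speaker_id] @ W_spk
language_emb = language_table[language_id] @ W_lang

# One decoder step: pre-net output concatenated with speaker and language embeddings.
prenet_out = rng.standard_normal(d_prenet)   # stand-in for pre-net output
decoder_in = np.concatenate([prenet_out, speaker_emb, language_emb])

print(encoder_out.shape)  # (12, 384)
print(decoder_in.shape)   # (256,)
```

The point of the sketch is only where each embedding enters: emotion on the encoder side, speaker and language on the decoder side.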

- Emotion Encoder with Disentangled Style Tokens

Mels-TTS uses a reference encoder and style attention
- The reference encoder converts the mel-spectrogram of the target speech into a reference embedding, which serves as the query for style attention
- The keys and values for style attention come from disentangled style tokens representing four speech attributes: speaker, language, emotion, and residual
Disentangled Style Tokens
1. Emotion Token Sets
  - Emotion token sets are designed to learn only emotion information from the reference embedding
    - This addresses emotion imbalance in the dataset and enables balanced learning
  - Each emotion token set is a randomly initialized embedding bank, so that it can capture the various nuances within an emotion category
  - The emotion ID is then used to choose the token set matching the emotion of the target mel-spectrogram
2. Speaker Token
  - To support the emotion embedding, a speaker token segregates speaker information from the reference embedding
  - During training, the speaker token uses the output embedding of the speaker look-up table to learn the speaker information in the reference embedding
  - At inference, the speaker token is excluded from the disentangled style tokens, so the emotion embedding is extracted from the reference embedding without speaker information
3. Language Token
  - The language token aims to separate language details from the reference embedding
    - It works similarly to the speaker token
  - The language token uses the language look-up table during training but is not used at inference
4. Residual Token Set
  - Captures additional details in the non-emotion information other than the speaker and language attributes
  - The residual token set consists of randomly initialized embeddings
    - At inference, it is excluded in the same way as the speaker and language tokens
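
As a rough illustration of how the four token types could be assembled, the sketch below builds the key/value token matrix for training and for inference. The helper `build_tokens`, the token counts, the token dimension, and the use of a single speaker token and a single language token taken directly from the look-up tables are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

d_T = 64            # token dimension (assumed)
N_E, N_R = 4, 4     # tokens per emotion set / residual set (assumed)
emotions = ["neutral", "happy", "sad", "angry"]

# Randomly initialized embedding banks (learnable parameters in a real model).
emotion_token_sets = {e: rng.standard_normal((N_E, d_T)) for e in emotions}
residual_tokens = rng.standard_normal((N_R, d_T))

# Look-up tables whose output embeddings act as speaker / language tokens.
speaker_table = rng.standard_normal((10, d_T))
language_table = rng.standard_normal((2, d_T))


def build_tokens(emotion, speaker_id, language_id, training):
    """Hypothetical helper: return the tokens used as attention keys/values."""
    T_E = emotion_token_sets[emotion]           # chosen by emotion ID
    if not training:
        return T_E                              # speaker/language/residual excluded
    T_S = speaker_table[speaker_id][None, :]    # one speaker token (assumed N_S = 1)
    T_L = language_table[language_id][None, :]  # one language token (assumed N_L = 1)
    return np.concatenate([T_E, T_S, T_L, residual_tokens], axis=0)


T_train = build_tokens("happy", speaker_id=3, language_id=1, training=True)
T_infer = build_tokens("happy", speaker_id=3, language_id=1, training=False)
print(T_train.shape)  # (10, 64)
print(T_infer.shape)  # (4, 64)
```

Only the emotion token set survives at inference, which is what keeps speaker and language information out of the extracted emotion embedding.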
Style Attention
1. Training
  - To capture the various kinds of speech information in the reference embedding $R \in \mathbb{R}^{1 \times d_R}$, Mels-TTS uses all disentangled style tokens $T_A \in \mathbb{R}^{N_A \times d_T}$ as the keys and values of the style attention mechanism
  - These tokens consist of the selected target emotion token set $T_E \in \mathbb{R}^{N_E \times d_T}$, the speaker token $T_S \in \mathbb{R}^{N_S \times d_T}$, the language token $T_L \in \mathbb{R}^{N_L \times d_T}$, and the residual token set $T_R \in \mathbb{R}^{N_R \times d_T}$
    - $T_E$ is chosen from the emotion token sets, whose number can vary with the emotion diversity of the training data
    - The paper uses four emotion token sets $\{T_n, T_h, T_s, T_a\}$, holding the emotion tokens for neutral, happy, sad, and angry
  - Here $N_E, N_S, N_L, N_R$ are the numbers of tokens in the emotion token set, speaker token, language token, and residual token set, respectively, and they sum to $N_A$
  - Finally, style attention uses multi-head attention with $n_h$ heads:
    (Eq. 1) $\text{MultiHead}(R, T_A) = \text{concat}_{i \in [n_h]}\left[H^{(i)}(R, T_A)\right] W_O$
    (Eq. 2) $H^{(i)}(R, T_A) = \text{Attention}\left(R W_Q^{(i)}, T_A W_K^{(i)}, T_A W_V^{(i)}\right)$
    - $\text{Attention}(Q, K, V) = \text{softmax}\left(Q K^T / \sqrt{d_k}\right) V$ , $d_k$ : dimension of each head
    - The projection parameters $W_Q^{(i)}, W_K^{(i)}, W_V^{(i)}$ are learned for the $i$-th head, and $W_O$ projects the concatenation of the attention outputs of all heads to the emotion embedding
2. Inference
  - For emotion-specific salience, the speaker token, language token, and residual token set are deactivated
  - Instead of $T_A$, the selected emotion token set $T_E$ is used as the keys and values of the style attention in (Eq. 1), (Eq. 2)
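
A minimal NumPy sketch of (Eq. 1) and (Eq. 2) follows, run once with all tokens (training) and once with only the emotion token set (inference). The function name `style_attention`, all dimensions, and the random stand-in weights and tokens are assumptions for illustration; in a real model the projections and tokens would be learned.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def style_attention(R, T, W_Q, W_K, W_V, W_O):
    """Multi-head attention with R as query and tokens T as keys/values.

    R: (1, d_R), T: (N, d_T), W_Q: (n_h, d_R, d_k),
    W_K, W_V: (n_h, d_T, d_k), W_O: (n_h * d_k, d_out).
    """
    heads, weights = [], []
    for Wq, Wk, Wv in zip(W_Q, W_K, W_V):       # one iteration per head i
        Q, K, V = R @ Wq, T @ Wk, T @ Wv
        d_k = Q.shape[-1]
        A = softmax(Q @ K.T / np.sqrt(d_k))     # (1, N) weights over tokens
        heads.append(A @ V)                     # H^(i): (1, d_k)
        weights.append(A)
    return np.concatenate(heads, axis=-1) @ W_O, np.stack(weights)


rng = np.random.default_rng(0)
d_R, d_T, d_k, n_h, d_out = 128, 64, 32, 4, 128  # assumed sizes
N_E, N_A = 4, 10                                  # assumed token counts

R = rng.standard_normal((1, d_R))                 # reference embedding (query)
T_A = rng.standard_normal((N_A, d_T))             # all tokens; first N_E rows are T_E
T_E = T_A[:N_E]

W_Q = rng.standard_normal((n_h, d_R, d_k)) * 0.1
W_K = rng.standard_normal((n_h, d_T, d_k)) * 0.1
W_V = rng.standard_normal((n_h, d_T, d_k)) * 0.1
W_O = rng.standard_normal((n_h * d_k, d_out)) * 0.1

emo_train, attn_train = style_attention(R, T_A, W_Q, W_K, W_V, W_O)  # training
emo_infer, attn_infer = style_attention(R, T_E, W_Q, W_K, W_V, W_O)  # inference

print(emo_train.shape, attn_train.shape)  # (1, 128) (4, 1, 10)
print(emo_infer.shape, attn_infer.shape)  # (1, 128) (4, 1, 4)
```

Because the softmax is taken only over the emotion tokens at inference, each head's output is a convex combination of projected emotion tokens, with no weight left for speaker, language, or residual tokens.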

- Inference of Mels-TTS

To synthesize speech across emotions,
- First, a representative emotion embedding is determined for each emotion
  - For this, the desired emotion token set is selected from the emotion token sets
- Next, emotion embeddings are computed for all utterances in the training database, and the mean emotion embedding of each emotion is obtained
  - The emotion embedding of the utterance closest to that mean serves as the representative emotion embedding
- With these representative emotion embeddings, speech with the desired emotion can be synthesized stably at inference without any reference speech
  - The desired representative emotion embedding is concatenated with the text encoder output, and
  - The desired speaker and language IDs are processed by look-up tables and a linear layer before being fed to the decoder
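
The selection of representative emotion embeddings can be sketched as below. The function name `representative_embeddings`, the toy 2-D embeddings, and the use of Euclidean distance for "closest" are assumptions; the source only says that the utterance nearest to the per-emotion mean is chosen.

```python
import numpy as np


def representative_embeddings(embeddings, labels):
    """Pick, per emotion, the utterance embedding closest to the class mean.

    embeddings: (n_utts, d) emotion embeddings of training utterances.
    labels: (n_utts,) emotion label of each utterance.
    Returns {emotion: (utterance_index, embedding)}.
    """
    reps = {}
    for emo in np.unique(labels):
        idx = np.flatnonzero(labels == emo)       # utterances with this emotion
        group = embeddings[idx]
        mean = group.mean(axis=0)                 # mean emotion embedding
        dists = np.linalg.norm(group - mean, axis=1)  # Euclidean distance (assumed)
        nearest = int(idx[np.argmin(dists)])
        reps[str(emo)] = (nearest, embeddings[nearest])
    return reps


# Toy data: three "happy" and three "sad" utterances with 2-D embeddings.
emb = np.array([
    [0.0, 0.0], [1.0, 0.0], [5.0, 0.0],
    [10.0, 10.0], [12.0, 10.0], [11.5, 10.0],
])
labels = np.array(["happy", "happy", "happy", "sad", "sad", "sad"])

reps = representative_embeddings(emb, labels)
print(reps["happy"][0])  # 1
print(reps["sad"][0])    # 5
```

Using an actual utterance embedding rather than the mean itself keeps the representative inside the region the decoder has seen during training.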

3. Experiments

- Settings

Dataset : AI-Hub Korean database, VCTK
Comparisons
- LB : Label-based
- GST : Global Style Token-based
- GST-C : GST + emotion classifier

- Results

Subjective Evaluation
- For English -> Korean, Mels-TTS achieves the best MOS
- Likewise for Korean -> English, Mels-TTS achieves the best results in naturalness and emotion similarity

Objective Evaluation
- When the emotion accuracy of the synthesized speech is measured with a pre-trained emotion classifier, the results for Mels-TTS are the closest to the ground truth

Ablation Study
- Removing the SLR (speaker, language, residual) tokens degrades emotion classification performance
  - BUT, even without the SLR tokens, Mels-TTS still outperforms the other models
- Additionally, visualizing the emotion embeddings with t-SNE shows that
  - Without the SLR tokens, as in panel (a) of the figure, utterances are not clustered properly
  - With the SLR tokens, as in panel (b), the Mels-TTS emotion embeddings form clearly separated clusters
- That is, the SLR tokens disentangle the speech attributes and let the emotion embedding focus on emotion information
