[Paper Review] EmoSphere-TTS: Emotional Style and Intensity Modeling via Spherical Emotion Vector for Controllable Emotional Text-to-Speech

feVeRin 2024. 7. 22. 09:32

EmoSphere-TTS: Emotional Style and Intensity Modeling via Spherical Emotion Vector for Controllable Emotional Text-to-Speech

Emotional text-to-speech is restricted to pre-defined labels, so it cannot effectively reflect variations in emotion
EmoSphere-TTS
- Adopts a spherical emotion vector that controls emotional style and intensity
- Models emotion through a Cartesian-spherical transformation using arousal, valence, and dominance pseudo-labels, without human annotation
- Improves speech quality with a dual conditional adversarial network that reflects multi-aspect characteristics
Paper (INTERSPEECH 2024) : Paper Link

1. Introduction

Emotional text-to-speech (TTS) achieves strong synthesis quality in models such as PromptStyle and ZET-Speech, but high-level interpretable emotion control remains limited
- Emotional TTS typically controls emotional expression through emotion labels and reference audio
  1. Relative attribute methods reflect fine-grained emotional intensity using a learned ranking function or distance-based quantization
  2. Scaling factor methods control emotion intensity by multiplying the emotion embedding by a scaling factor
- BUT, because these methods rely on emotion labels or references, they reduce emotional expression to a uniform style, and mismatches make nuance hard to capture
- Alternatively, expression can be controlled through the emotional dimensions of arousal, valence, and dominance (AVD)
  - These dimensions provide a continuous, fine-grained description, enabling more detailed control than discrete emotions

-> The paper therefore proposes EmoSphere-TTS, an emotional TTS model built on a spherical emotion vector space

EmoSphere-TTS
- Introduces the AVD emotional dimensions obtained by pseudo-labeling with a speech emotion recognition (SER) model
- Builds a spherical emotion vector space through a Cartesian-spherical transformation, overcoming the limits of modeling emotion in Cartesian coordinates
- Additionally improves speech quality through dual conditional adversarial training

< Overview of EmoSphere-TTS >

An emotional TTS model that uses an emotion sphere and dual conditional adversarial training
As a result, it achieves better controllability and synthesis quality than existing methods

2. Method

- Emotional Style and Intensity Modeling

EmoSphere-TTS models diverse emotional expressions by building a spherical emotion vector space around the following components:
- AVD Encoder
- Cartesian-Spherical Transformation
AVD Encoder
- Instead of human-annotated emotional dimensions, a wav2vec 2.0-based SER model is adopted to extract consistently continuous, detailed representations from audio
- The model predicts $e_{ki} = (d_{a}, d_{v}, d_{d})$ in Cartesian coordinates, with each value in the $[0, 1]$ range
  - $d_{a}$ : arousal, $d_{v}$ : valence, $d_{d}$ : dominance
  - $e_{ki}$ : $i$-th coordinate of the $k$-th emotion
Cartesian-Spherical Transformation
- To model the complex nature of emotion, the paper introduces a spherical emotion vector space that represents the relative distance and angle vector from the neutral center
- Based on a coordinate transformation that controls emotion style and intensity with continuous scalars, every point of the AVD pseudo-labels is converted to spherical coordinates under the following assumptions
  1. Emotional intensity increases with distance from the neutral emotion center
  2. The angle relative to the neutral emotion center determines the emotional style
- First, the neutral emotion center $M$ is taken as the origin to obtain the transformed Cartesian coordinate $e'_{ki} = (d'_{a}, d'_{v}, d'_{d})$:
  (Eq. 1) $e'_{ki} = e_{ki} - M, \,\, \text{where} \,\, M = \frac{1}{N_{n}} \sum_{i=1}^{N_{n}} e_{ni}$
  - $N_{n}$ : total number of neutral coordinates $e_{ni}$
- The transformation from Cartesian coordinates to spherical coordinates $(r, \vartheta, \varphi)$ is then:
  (Eq. 2) $r = \sqrt{{d'_{a}}^{2} + {d'_{v}}^{2} + {d'_{d}}^{2}}$
  (Eq. 3) $\vartheta = \arccos\left(\frac{d'_{d}}{r}\right), \,\, \varphi = \arctan\left(\frac{d'_{v}}{d'_{a}}\right)$
- After the Cartesian-spherical transformation, the radial distance $r$ is scaled to the $[0, 1]$ range to normalize emotion intensity
  1. The min-max normalization here uses the interquartile range technique
  2. In addition, the directional angles $\vartheta, \varphi$ are segmented into 8 octants, defined by the positive/negative directions of the A, V, and D axes, to quantize emotion style
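
The steps above (Eq. 1-3, the intensity normalization, and the octant quantization) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' code: all function names are hypothetical, `atan2` is swapped in for the `arctan` of Eq. 3 to keep quadrant information and avoid division by zero, the octant index encoding is arbitrary, and the IQR-based min-max step assumes the common 1.5 x IQR fences, which the review does not specify.

```python
import math
import statistics


def neutral_center(neutral_points):
    """Eq. 1 (right part): M is the mean of the neutral AVD points."""
    n = len(neutral_points)
    return tuple(sum(p[i] for p in neutral_points) / n for i in range(3))


def to_spherical(point, center):
    """Eq. 1-3: shift by M, then convert (d_a, d_v, d_d) to (r, theta, phi)."""
    da, dv, dd = (p - m for p, m in zip(point, center))
    r = math.sqrt(da * da + dv * dv + dd * dd)
    # Clamp the ratio so rounding error can never push acos out of its domain.
    theta = math.acos(max(-1.0, min(1.0, dd / r))) if r > 0 else 0.0
    # Assumption: atan2 replaces arctan(d_v / d_a) from Eq. 3.
    phi = math.atan2(dv, da)
    return r, theta, phi


def octant(point, center):
    """Style index in 0..7 from the signs of the shifted A, V, D values.

    Assumption: the bit layout (A=4, V=2, D=1) is an arbitrary choice.
    """
    da, dv, dd = (p - m for p, m in zip(point, center))
    return (da >= 0) * 4 + (dv >= 0) * 2 + (dd >= 0) * 1


def iqr_minmax(radii):
    """Min-max scale radii to [0, 1], with bounds limited by IQR fences.

    Assumption: fences are Q1 - 1.5*IQR and Q3 + 1.5*IQR; values beyond
    them are clipped to 0 or 1. Needs at least two radii.
    """
    q1, _, q3 = statistics.quantiles(radii, n=4)
    iqr = q3 - q1
    low = max(min(radii), q1 - 1.5 * iqr)
    high = min(max(radii), q3 + 1.5 * iqr)
    span = high - low
    if span == 0:
        return [0.0 for _ in radii]
    return [min(1.0, max(0.0, (r - low) / span)) for r in radii]
```

Given AVD pseudo-labels from the SER model, `to_spherical` would yield the raw intensity `r` and the style angles per utterance, `iqr_minmax` would rescale the intensities over the dataset, and `octant` would give the quantized style.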

- Spherical Emotion Encoder

The spherical emotion encoder blends the spherical emotion vector space with the emotion ID to build the spherical emotion embedding
- First, projection layers align the dimensions of the emotion class embedding and the emotion style vector
- The projections are then concatenated, followed by softplus activation and layer normalization (LN)
- Finally, the spherical emotion embedding $\mathbf{h}_{emo}$ is obtained by merging the result with the projected emotion intensity vector:
  (Eq. 4) $\mathbf{h}_{emo} = \text{LN}(\text{softplus}(\text{concat}(\mathbf{h}_{sty}, \mathbf{h}_{cls}))) + \mathbf{h}_{int}$
  - $\mathbf{h}_{sty}, \mathbf{h}_{int}, \mathbf{h}_{cls}$ : projection layer outputs for the emotional style vector, emotional intensity vector, and emotional class embedding, respectively
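
Eq. 4 can be traced with a small pure-Python sketch. It assumes the three projections have already been computed, that $\mathbf{h}_{int}$ has the same length as the concatenation of $\mathbf{h}_{sty}$ and $\mathbf{h}_{cls}$, and that LN is applied without learnable gain and bias; none of these details are stated in the review, and the function names are hypothetical.

```python
import math


def softplus(x):
    """Numerically stable softplus: log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def layer_norm(values, eps=1e-5):
    """Normalize to zero mean and unit variance (no learnable gain/bias)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def spherical_emotion_embedding(h_sty, h_cls, h_int):
    """Eq. 4: h_emo = LN(softplus(concat(h_sty, h_cls))) + h_int."""
    fused = layer_norm([softplus(v) for v in list(h_sty) + list(h_cls)])
    if len(fused) != len(h_int):
        # Assumption: h_int is projected to the concatenated dimension.
        raise ValueError("h_int must match len(h_sty) + len(h_cls)")
    return [f + i for f, i in zip(fused, h_int)]
```

Because the intensity term is added after normalization, changing `h_int` shifts the embedding without altering the normalized style/class part.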

- Dual Conditional Adversarial Training

To improve the synthesis quality of EmoSphere-TTS, multiple CNN-based discriminators are introduced for adversarial training
- Each discriminator uses a Conv2D stack consisting of multiple stacked 2D-convolutional layers and a fully connected (FC) layer
  - The input is random mel-spectrogram clips cut with random windows of different lengths $t$
- Following GANSpeech, the paper uses emotion and speaker embeddings to capture multi-aspect characteristics
  1. One Conv2D stack receives only the mel-spectrogram clip, while the other stacks receive the combination of a condition embedding and the mel-spectrogram clip
    - For concatenation, the condition embedding is extended to match the length of the mel-spectrogram clip
  2. The resulting loss functions $\mathcal{L}$ for the discriminator $D$ and generator $G$ are:
    (Eq. 5) $\mathcal{L}_{D} = \sum_{c \in \{spk, emo\}} \sum_{t} \mathbb{E}\left[(1 - D_{t}(y_{t}, c))^{2} + D_{t}(\hat{y}_{t}, c)^{2}\right]$
    (Eq. 6) $\mathcal{L}_{G} = \sum_{c \in \{spk, emo\}} \sum_{t} \mathbb{E}\left[(1 - D_{t}(\hat{y}_{t}, c))^{2}\right]$
    - $y_{t}, \hat{y}_{t}$ : ground-truth and generated mel-spectrogram, respectively
    - $c$ : condition type
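
The least-squares form of Eq. 5 and Eq. 6 can be checked numerically with a short sketch. The discriminators themselves are not modeled here: the inputs are assumed to be already-computed scores $D_t(\cdot, c)$, stored in a hypothetical nested dict `scores[c][t]` holding one list of batch scores per condition `c` and window length `t`, and the expectation is taken as a batch mean.

```python
def _mean(values):
    return sum(values) / len(values)


def discriminator_loss(real_scores, fake_scores):
    """Eq. 5: sum over conditions c and windows t of
    E[(1 - D_t(y_t, c))^2 + D_t(y_hat_t, c)^2]."""
    total = 0.0
    for c in real_scores:
        for t in real_scores[c]:
            real_term = _mean([(1.0 - d) ** 2 for d in real_scores[c][t]])
            fake_term = _mean([d ** 2 for d in fake_scores[c][t]])
            total += real_term + fake_term
    return total


def generator_loss(fake_scores):
    """Eq. 6: sum over conditions c and windows t of
    E[(1 - D_t(y_hat_t, c))^2]."""
    total = 0.0
    for c in fake_scores:
        for t in fake_scores[c]:
            total += _mean([(1.0 - d) ** 2 for d in fake_scores[c][t]])
    return total
```

With a perfect discriminator (real clips scored 1, generated clips scored 0), `discriminator_loss` is 0 while `generator_loss` equals the number of (condition, window) pairs, which is the pressure that pushes the generator toward clips the discriminators score as real.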

- TTS Model

Apart from the emotion spherical vector, which provides emotion style and intensity information, the architecture follows FastSpeech2
- The speaker ID is mapped to an embedding $\mathbf{h}_{spk}$ to represent diverse speaker characteristics, and the speaker and emotion embeddings are concatenated and passed to the variance adaptor
- At inference, manual style and intensity vectors are used to control emotional expression
  - As a result, manipulating emotion style and intensity in the spherical emotion vector space allows diverse emotions to be expressed
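
One way to picture manual control at inference is to pick an octant (style) and an intensity in $[0, 1]$, then map them back to an offset from the neutral center. The sketch below only illustrates that geometry by inverting Eq. 2-3; how EmoSphere-TTS actually consumes manual style and intensity vectors is not described in the review, and the function names and the choice of the octant's diagonal direction are assumptions.

```python
import math


def octant_center_angles(sign_a, sign_v, sign_d):
    """Angles (theta, phi) pointing along the diagonal of an octant.

    Each sign is +1 or -1 for the A, V, D axis. Using the diagonal as the
    representative direction of an octant is an assumption.
    """
    theta = math.acos(sign_d / math.sqrt(3.0))
    phi = math.atan2(sign_v, sign_a)
    return theta, phi


def spherical_to_cartesian(r, theta, phi):
    """Inverse of Eq. 2-3: (r, theta, phi) -> shifted (d_a, d_v, d_d)."""
    return (
        r * math.sin(theta) * math.cos(phi),
        r * math.sin(theta) * math.sin(phi),
        r * math.cos(theta),
    )


def manual_emotion_offset(sign_a, sign_v, sign_d, intensity):
    """Offset from the neutral center for a chosen style and intensity."""
    if not 0.0 <= intensity <= 1.0:
        raise ValueError("intensity is expected in [0, 1]")
    theta, phi = octant_center_angles(sign_a, sign_v, sign_d)
    return spherical_to_cartesian(intensity, theta, phi)
```

Sweeping `intensity` from 0 to 1 with fixed signs moves the point outward from the neutral center along one direction, which matches the assumption that intensity grows with distance while the angle fixes the style.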

3. Experiments

- Settings

Dataset : Emotional Speech Dataset (ESD)
Comparisons : FastSpeech2

- Results

Model Performance
- In terms of overall performance, EmoSphere-TTS achieves the best results

Emotion Intensity Controllability
- Relative attribute is effective at controlling intensity, but pitch also rises as intensity increases
- Scaling factor performs well on the sad emotion but poorly on static emotions
- In comparison, EmoSphere-TTS achieves stable performance across emotions

Meanwhile, when relative attribute considers only the emotion label, it struggles to capture subtle emotional nuance and can reduce expression to a uniform style
- In contrast, EmoSphere-TTS models an appropriate pitch for the given intensity scale

Emotion Style Shift
- When the style vector is shifted, the emotion intensity pattern changes along the shifted axis
- That is, the spherical emotion vector reflects diverse emotional expressions and enables detailed manipulation
