Let IT Begin

[Paper Review] SPEAR: A Unified SSL Framework for Learning Speech and Audio Representations

feVeRin — Tue, 14 Jul 2026 14:11:04 +0900

SPEAR: A Unified SSL Framework for Learning Speech and Audio Representations

Self-Supervised Learning (SSL) models leave a gap between speech understanding and audio event understanding
SPEAR
- Applies multi-codebook vector quantization to continuous teacher representations to capture both semantic and acoustic information
- Additionally uses an asymmetric pre-training loss and token mixing to improve robustness
Paper (ICML 2026) : Paper Link

1. Introduction

Speech processing has to account for both linguistic and paralinguistic information
- Self-Supervised Learning (SSL) models such as Wav2Vec 2.0, HuBERT, and Data2Vec can be used for this
- BUT, real-world environments contain multiple, overlapping sound sources, so treating speech and audio as distinct domains can ignore their intrinsic synergy

-> SPEAR is therefore proposed as a unified SSL representation covering both the speech and audio domains

SPEAR
- Uses Multi-codebook Vector Quantization (MVQ) to reflect richer acoustic and temporal detail
- Additionally applies an asymmetric dual-domain objective and a token mixing mechanism for domain balance

< Overall of SPEAR >

An SSL representation that jointly learns speech and general audio representations using MVQ and dual-domain pre-training
As a result, it achieves better performance than prior SSL models

2. Method

- Multi-Codebook Vector Quantization

The paper uses Multi-codebook Vector Quantization (MVQ) to generate fine-grained discrete targets for the masked-prediction SSL objective
- MVQ uses $N$ parallel codebooks, each containing $K$ trainable code vectors
  1. Given an input feature vector $x\in\mathbb{R}^{d}$, MVQ encodes it into a tuple of $N$ discrete tokens, $z=\text{Encode}(x;\mathcal{Q})=(z_{1},...,z_{N})$
  2. Each token $z_{n}$ is an integer index in $[0,K-1]$ that specifies which code to select from the $n$-th codebook
    - The selected vectors are combined in a direct-sum scheme to approximate the original feature vector $x$
- MVQ partitions the feature space into $N$ distinct subspaces, each governed by a single codebook
  - In this multi-codebook design the number of representable states grows exponentially as $K^{N}$, which allows a fine-grained representation
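The direct-sum reconstruction and the $K^{N}$ state count can be illustrated with a small NumPy sketch. The decoder follows the description above; the encoder is only a greedy stand-in (an assumption for illustration, since these notes do not describe how the trained quantizer $\mathcal{Q}$ selects codes), and the names `mvq_decode` and `mvq_encode_greedy` are hypothetical.

```python
import numpy as np

def mvq_decode(tokens, codebooks):
    """Direct-sum reconstruction: add up the selected code vector of every codebook."""
    # codebooks: (N, K, d), tokens: (N,) integer indices in [0, K-1]
    n_idx = np.arange(codebooks.shape[0])
    return codebooks[n_idx, tokens].sum(axis=0)

def mvq_encode_greedy(x, codebooks):
    """Greedy stand-in for Encode(x; Q) (assumption, not the paper's encoder):
    codebook by codebook, pick the code closest to the remaining residual."""
    residual = x.copy()
    tokens = np.empty(codebooks.shape[0], dtype=np.int64)
    for n, cb in enumerate(codebooks):
        tokens[n] = np.argmin(((residual - cb) ** 2).sum(axis=1))
        residual = residual - cb[tokens[n]]
    return tokens

rng = np.random.default_rng(0)
N, K, d = 4, 8, 16                       # toy sizes chosen for the example
codebooks = rng.normal(size=(N, K, d))   # random, untrained codebooks
x = rng.normal(size=d)

z = mvq_encode_greedy(x, codebooks)      # tuple of N tokens
x_hat = mvq_decode(z, codebooks)         # approximation of x
num_states = K ** N                      # number of representable token tuples
```

With these toy sizes a single frame can take $8^{4}=4096$ distinct token tuples, even though each prediction head only has to classify over 8 codes.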

Overview

- Unified Speech and Audio Representation Learning

MVQ-based Masked Token Prediction
- In the single-teacher scenario, the pre-training objective trains a student encoder $\mathcal{S}$ to predict, via masked-token prediction, the fine-grained MVQ tokens extracted from a pre-trained SSL teacher $\mathcal{T}$
- The student encoder $\mathcal{S}$ consists of a front-end processor and a feature encoder $\mathcal{F}$
  1. The front-end processor converts the raw input waveform $w$ into a frame-level representation $\mathbf{X}=\{x_{1},...,x_{T}\}$ of length $T$
  2. Masking is applied to $\mathbf{X}$: a frame set $\mathcal{M}$ is randomly sampled and $\{x_{t}|t\in\mathcal{M}\}$ is replaced with a learnable mask embedding $m$, producing the masked input $\hat{\mathbf{X}}$
  3. The feature encoder $\mathcal{F}$ then takes $\hat{\mathbf{X}}$ and produces the contextualized representation $\mathbf{H}=\{h_{1},...,h_{T}\}$
    - $h_{t}\in\mathbb{R}^{d}$
- To build the prediction targets, $w$ is passed to the teacher $\mathcal{T}$ to obtain the frame-level representation sequence $\mathbf{E}=\mathcal{T}(w)=\{e_{1},...,e_{T}\}$
  - This representation is quantized frame by frame with a pre-trained MVQ quantizer $\mathcal{Q}$, yielding the fine-grained discrete tokens $\mathbf{Z}=\{z_{1},...,z_{T}\}$ used as pre-training targets
  - $z_{t}=\text{Encode}(e_{t};\mathcal{Q})$
- The student model is trained to predict the target tokens $\mathbf{Z}$ from the contextualized representation $\mathbf{H}$
  1. The multi-codebook masked prediction loss averages $N$ independent prediction losses, one per codebook of the MVQ quantizer
  2. Each loss is a cross-entropy objective over all frames, with an adjustable weight $\alpha$ between masked and unmasked frames:
    (Eq. 1) $\mathcal{L}_{single}(\mathbf{H},\mathbf{Z})=\frac{1}{N}\sum_{n=1}^{N}\left[\alpha\mathcal{L}_{n,m} + (1-\alpha)\mathcal{L}_{n,u}\right]$
    (Eq. 2) $\mathcal{L}_{n,m}=\sum_{t\in\mathcal{M}}-\log p_{n}(z_{t,n}|h_{t})$
    (Eq. 3) $\mathcal{L}_{n,u}=\sum_{t\notin\mathcal{M}}-\log p_{n}(z_{t,n}|h_{t})$
    - $\mathcal{L}_{n,m},\mathcal{L}_{n,u}$ : masked/unmasked frame loss of the $n$-th codebook
    - $p_{n}(z_{t,n}|h_{t})$ : predicted probability of the correct token $z_{t,n}$ of the $n$-th codebook at time $t$, produced by one of the $N$ independent linear prediction heads on top of the feature encoder
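Eq. 1-3 can be written out as a short NumPy sketch. It assumes the $N$ prediction heads have already produced logits of shape `(N, T, K)`; the function name `mvq_masked_loss` and the tensor layout are illustrative choices, not taken from the paper.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def mvq_masked_loss(logits, targets, mask, alpha):
    """Eq. 1: average over N codebooks of alpha * masked + (1 - alpha) * unmasked loss.

    logits : (N, T, K) outputs of the N linear prediction heads
    targets: (T, N)    MVQ tokens z_{t,n}
    mask   : (T,)      True where frame t belongs to the masked set M
    """
    n_books, n_frames, _ = logits.shape
    logp = log_softmax(logits)
    # nll[n, t] = -log p_n(z_{t,n} | h_t)
    nll = -logp[np.arange(n_books)[:, None], np.arange(n_frames)[None, :], targets.T]
    loss_masked = nll[:, mask].sum(axis=1)      # Eq. 2, one value per codebook
    loss_unmasked = nll[:, ~mask].sum(axis=1)   # Eq. 3, one value per codebook
    return float(np.mean(alpha * loss_masked + (1.0 - alpha) * loss_unmasked))
```

Setting `alpha=1.0` recovers a purely masked-frame objective, while smaller values also reward correct predictions on visible frames.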
Asymmetric Dual-Domain Pre-Training
- The paper performs dual-domain pre-training on a mixture of speech and general audio data to learn a unified representation
  1. For this, it uses two expert teacher models $\mathcal{T}_{speech},\mathcal{T}_{audio}$ and their pre-trained MVQ quantizers $\mathcal{Q}_{speech}, \mathcal{Q}_{audio}$
  2. For each input waveform, the teacher representations $\mathbf{E}_{speech},\mathbf{E}_{audio}$ are extracted and quantized into the MVQ token sets $\mathbf{Z}_{speech},\mathbf{Z}_{audio}$
- The paper considers the following three dual-domain pre-training strategies:
  - JOINT
    - Every sample $w$ computes both losses, on $\mathbf{Z}_{speech}$ and on $\mathbf{Z}_{audio}$, regardless of its domain
  - DISJOINT
    - Every sample $w$ computes a single loss, on the teacher target of its own domain
  - ASYMMETRICAL
    - Speech data computes the loss on $\mathbf{Z}_{speech}$ only, while audio data computes it on both $\mathbf{Z}_{speech}$ and $\mathbf{Z}_{audio}$
- The paper adopts the ASYMMETRICAL strategy: the speech tokens $\mathbf{Z}_{speech}$ act as a universal prediction target for all input data, and the audio tokens $\mathbf{Z}_{audio}$ act as a target for general audio only
- The resulting asymmetrical dual-domain pre-training objective is:
  (Eq. 4) $\mathcal{L}_{dual}=\mathcal{L}_{single}(\mathbf{H},\mathbf{Z}_{speech})+\mathbf{1}_{audio}\lambda \cdot \mathcal{L}_{single}(\mathbf{H},\mathbf{Z}_{audio})$
  - $\mathcal{L}_{single}$ : single-domain masked prediction loss of (Eq. 1), $\mathbf{1}_{audio}$ : indicator function that returns $1$ if the input is general audio and $0$ otherwise, $\lambda$ : hyperparameter
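The three strategies differ only in which of the two $\mathcal{L}_{single}$ terms a sample contributes. The sketch below takes the two per-sample losses as plain numbers; the ASYMMETRICAL branch follows Eq. 4, whereas applying $\lambda$ to the audio term in the JOINT and DISJOINT branches is an assumption made here for symmetry, and `dual_domain_loss` is a hypothetical name.

```python
def dual_domain_loss(loss_speech, loss_audio, is_audio, lam, strategy="asymmetrical"):
    """Combine the per-sample losses on Z_speech and Z_audio.

    loss_speech: L_single(H, Z_speech) for this sample
    loss_audio : L_single(H, Z_audio) for this sample
    is_audio   : True if the sample comes from the general-audio domain
    lam        : weight of the audio term (lambda in Eq. 4)
    """
    if strategy == "joint":
        # Both targets for every sample (lambda weighting assumed here).
        return loss_speech + lam * loss_audio
    if strategy == "disjoint":
        # Only the target of the sample's own domain (lambda weighting assumed here).
        return lam * loss_audio if is_audio else loss_speech
    if strategy == "asymmetrical":
        # Eq. 4: speech target always, audio target only for audio samples.
        return loss_speech + (lam * loss_audio if is_audio else 0.0)
    raise ValueError(f"unknown strategy: {strategy}")
```

For a speech sample the asymmetrical loss reduces to the speech term alone, so the audio teacher never supervises speech input.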

Token Mixing

Token Mixing
- Treating the secondary signal as mere noise can hurt generality on overlapped and noisy speech
- The paper therefore introduces token mixing, which stochastically combines the MVQ tokens of multiple sources according to the energy of each source in the mixed audio sample, dynamically constructing an augmented training target
  1. Let $w$ be the original signal and $w'$ a randomly sampled signal; the two signals are first mixed to create an augmented training sample
  2. The clean teacher targets $\mathbf{Z}, \mathbf{Z}'$ are then mixed according to signal power, producing the augmented pre-training target $\hat{\mathbf{Z}}$:
    (Eq. 5) $\hat{z}_{t,n}=\left\{\begin{matrix} z_{t,n} & \text{with probability}\,\,1-\beta, \\ z'_{t+\tau,n} & \text{with probability}\,\,\beta, \\ \end{matrix}\right.$
    - $\tau$ : mixing delay
  3. $\beta\in[0,1]$ is a scalar mixing coefficient derived from the signal power $\mathcal{P}$ of the two mixed signals:
    (Eq. 6) $\beta=\frac{\mathcal{P}(w')}{\mathcal{P}(w)+\mathcal{P}(w')}$
    - With this mixed target, the secondary signal is learned jointly, in proportion to its signal power
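Eq. 5 and Eq. 6 translate into a few lines of NumPy. Several details are assumptions made for this sketch rather than statements from the paper: power is taken as the mean squared amplitude, the coin flip is drawn independently for every `(t, n)` entry, $\tau$ is expressed in frames, and frames whose shifted index falls outside $\mathbf{Z}'$ keep the original token.

```python
import numpy as np

def mix_tokens(z, z_prime, w, w_prime, tau, rng):
    """Build the mixed target Z_hat from clean targets z, z_prime of shape (T, N)."""
    # Eq. 6: mixing coefficient from the power of the two waveforms.
    power, power_prime = np.mean(w ** 2), np.mean(w_prime ** 2)
    beta = power_prime / (power + power_prime)

    # Frame t of the mixture reads z'_{t + tau}; mark frames where that index exists.
    src = np.arange(z.shape[0]) + tau
    valid = (src >= 0) & (src < z_prime.shape[0])
    shifted = z_prime[np.clip(src, 0, z_prime.shape[0] - 1)]

    # Eq. 5: take the secondary token with probability beta, else keep the original.
    take_secondary = (rng.random(z.shape) < beta) & valid[:, None]
    return np.where(take_secondary, shifted, z), beta
```

A silent secondary signal gives $\beta=0$ and leaves the targets untouched, while two equally loud signals give $\beta=0.5$ and an even split of tokens on average.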

Pre-Training Configurations

3. Experiments

- Settings

Dataset : see the table below
Comparisons : WavLM, BEATs, EAT, USAD

Datasets

- Results

Overall, SPEAR achieves the best performance

Model Performance Comparison

It also achieves the best performance on the SUPERB benchmark

SUPERB Benchmark

SPEAR also performs best on the HEAR benchmark

HEAR Benchmark

Ablation Study
- Using token mixing yields better performance

Effect of Token Mixing

Asymmetric dual-domain training is also effective at improving performance

Dual-Domain Pre-Training

[Paper Review] DisCo-Speech: Controllable Zero-Shot Speech Generation with a Disentangled Speech Codec

feVeRin — Mon, 13 Jul 2026 14:32:39 +0900

DisCo-Speech: Controllable Zero-Shot Speech Generation with a Disentangled Speech Codec

Existing codecs make independent control difficult because timbre and prosody are entangled
DisCo-Speech
- Uses parallel encoders and a hybrid loss to disentangle speech into three factors: content, prosody, and timbre
- Builds a unified content-prosody token to balance the disentanglement-reconstruction trade-off
Paper (ACL 2026) : Paper Link

1. Introduction

Text-to-Speech (TTS) models built on codec-based Language Models (LM) show strong performance
- BUT, existing acoustic codecs couple timbre and prosody, which makes zero-shot controllable TTS difficult
- Codecs that disentangle speech attribute tokens, such as NaturalSpeech3, can address this, but they suffer from the disentanglement-reconstruction trade-off and from information leakage

-> DisCo-Speech is therefore proposed for effective codec-based zero-shot controllable LM TTS

DisCo-Speech
- Introduces a Tri-Factor Disentanglement architecture that disentangles speech into content, prosody, and timbre
- Fuses the disentangled content and prosody into a unified timbre-agnostic token to support LM prediction

< Overall of DisCo-Speech >

A controllable codec-based TTS model using tri-factor disentanglement and fusion-and-reconstruction
As a result, it achieves better performance than existing models

2. Method

DisCo-Speech consists of DisCodec and a Text-to-Codec LM
- DisCodec tokenizes speech into content-prosody tokens and global timbre tokens, and reconstructs the waveform from them
- The Text-to-Codec LM autoregressively generates content-prosody tokens from the text and the preceding content-prosody tokens
- At inference, the LM is prompted with a speech prompt carrying the desired prosody together with the desired text, and generates content-prosody tokens for the target text by prosodic continuation
  - The DisCodec decoder then produces the final speech

Overview

- DisCodec: Disentangled Speech Codec

DisCodec uses a 2-stage training paradigm
- Tri-Factor Disentanglement
  - Explicitly decouples speech into content, prosody, and timbre under hybrid decoupling constraints
- Fusion and Reconstruction
  - The DisCodec decoder fuses content and prosody to produce a unified token for a standard LM
Stage 1: Tri-Factor Disentanglement
- Stage 1 uses three parallel encoders to extract content $c$, timbre $t$, and prosody $p$, and then discretizes them with FSQ-based quantizers
  - A different decoupling constraint is imposed on each attribute branch to ensure clear disentanglement
- Given speech $x$, Stage 1 is formulated as:
  (Eq. 1) $h_{c}=E_{c}(x),\,\,h_{t}=E_{t}(x),\,\,h_{p}=E_{p}(x)$
  $\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\, g_{t}=\text{CrossAttention}(h_{t},f)$
  $\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\, q_{c}=Q_{c}(h_{c}),\,q_{p}=RQ_{p}(h_{p}),\,q_{t}=Q_{t}(g_{t})$
  $\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\, \hat{x}=D(q_{c},q_{p},q_{t})$
  - $E_{c},E_{t},E_{p}$ : content, timbre, prosody encoder, $Q$ : quantizer, $RQ$ : residual quantizer, $f$ : learnable query, $D$ : decoder
- Content Tokenizer
  1. Following DAC, the content encoder $E_{c}(\cdot)$ downsamples the waveform $x$ into a frame-level latent $h_{c}$ through convolutional blocks, and the FSQ $Q_{c}(\cdot)$ quantizes $h_{c}$ into the quantized embedding $q_{c}$
  2. To make $q_{c}$ encode content information exclusively, a fine-tuned Wav2Vec-based phone recognition model provides phonetic supervision
    - $q_{c}$ is passed through a classifier that learns phone prediction under the cross-entropy (CE) guidance $\mathcal{L}_{pho}$ from the recognition model
  3. Using a phone-/text-based model provides purer content supervision and lowers the decoupling complexity
- Prosody Tokenizer
  1. To capture the temporal variation of prosody, the prosody encoder $E_{p}(\cdot)$ uses dilated causal convolutions to produce the frame-level sequence $h_{p}$
    - A 2-layer residual FSQ $RQ_{p}$ quantizes $h_{p}$ into the residual-enhanced representation $q_{p}$ to support comprehensive prosody modeling
  2. To supervise prosody capturing, the first FSQ layer is updated with a frame-level $F0$ regression loss $\mathcal{L}_{f0}$ and the second FSQ layer with a correlation loss $\mathcal{L}_{cor}$:
    (Eq. 2) $ \mathcal{L}_{cor}=\left(\frac{1}{BL}\sum_{b=1}^{B}\sum_{l=1}^{L}\frac{q_{p1}^{(b,l)} \cdot q_{p2}^{(b,l)}}{|| q_{p1}^{(b,l)}||\cdot || q_{p2}^{(b,l)}||}-\alpha\right)^{2}$
    - $B$ : batch size, $L$ : sequence length, $q_{p1}, q_{p2}$ : quantized output of the first/second FSQ layer, $\alpha=0.2$ : target similarity value
  3. Additionally, a Gradient Reversal Layer (GRL) is applied to the first FSQ layer to remove speaker timbre, and a soft orthogonality constraint $\mathcal{L}_{soft}$ with an adjustable coefficient $\beta$ is introduced to ensure attribute decoupling
  4. This soft constraint is used for prosody-content decoupling $\mathcal{L}_{soft}^{p,c}$ and prosody-timbre decoupling $\mathcal{L}_{soft}^{p,t}$:
    (Eq. 3) $\mathcal{L}_{soft}^{p,c}=\left(\frac{1}{BL}\sum_{b=1}^{B}\sum_{l=1}^{L}\left| \cos\left( l_{p}^{(b,l)},l_{c}^{(b,l)}\right)\right|-\beta_{c}\right)^{2}$
    (Eq. 4) $\mathcal{L}_{soft}^{p,t}=\left(\frac{1}{BL}\sum_{b=1}^{B}\sum_{l=1}^{L}\left| \cos\left( l_{p}^{(b,l)},q_{t}^{(b)}\right)\right|-\beta_{t}\right)^{2}$
    - $l_{p}, l_{c}$ : linearly transformed quantized prosody and content
- Timbre Tokenizer
  1. The paper captures global speaker timbre with a fixed-length global token sequence
  2. The timbre encoder $E_{t}(\cdot)$ first follows ECAPA-TDNN to produce the frame-level representation $h_{t}$
    - Cross-attention with the learnable query $f$ then aggregates it into the fixed-length sequence $g_{t}$, adaptively focusing on globally consistent timbre information
  3. The FSQ layer $Q_{t}(\cdot)$ quantizes this into the global timbre representation $q_{t}$, implicitly creating an information bottleneck that discards non-timbre information
  4. Additionally, the timbre tokenizer is directly optimized with a speaker classification loss $\mathcal{L}_{spk}$ for speaker timbre modeling
    - The soft orthogonality constraint $\mathcal{L}_{soft}^{p,t}$ further eliminates prosodic variation from the timbre representation
- Decoder
  1. The decoder $D(\cdot)$ mirrors the content encoder and recombines the three-stream representation into a waveform
  2. It is guided by the multi-scale mel-spectrogram loss and the waveform reconstruction loss of DAC
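The prosody-branch constraints of Stage 1 (Eq. 2-4) all share the same shape: average a cosine similarity over batch and time, then penalize its squared distance from a target value. The NumPy sketch below assumes inputs of shape `(B, L, D)` and treats the timbre representation in Eq. 4 as one vector per utterance of shape `(B, D)`, which is an interpretation of the $q_{t}^{(b)}$ notation; the function names are hypothetical.

```python
import numpy as np

def cosine(a, b, eps=1e-8):
    """Cosine similarity along the last axis (broadcasts over leading axes)."""
    num = (a * b).sum(axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + eps
    return num / den

def correlation_loss(q_p1, q_p2, alpha=0.2):
    """Eq. 2: pull the mean similarity of the two FSQ layer outputs towards alpha."""
    return float((cosine(q_p1, q_p2).mean() - alpha) ** 2)

def soft_orthogonality_loss(l_p, other, beta):
    """Eq. 3 / Eq. 4: pull the mean absolute similarity towards beta.

    other is (B, L, D) for content, or (B, D) for a per-utterance timbre vector.
    """
    if other.ndim == 2:
        other = other[:, None, :]   # broadcast the global vector over time
    return float((np.abs(cosine(l_p, other)).mean() - beta) ** 2)
```

Because the target is a small non-zero value rather than exactly zero, the branches are discouraged from aligning without being forced to be perfectly orthogonal.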
Stage 2: Fusion and Reconstruction
- Because the Stage 1 decoder is limited in reconstruction quality by the disentanglement-reconstruction trade-off, the paper introduces a specialized decoder that optimizes reconstruction quality
- To improve downstream usability, the quantized content and prosody embeddings are summed, and the fused result is re-quantized into a unified token sequence $z_{cp}$
  - The decoder then reconstructs the waveform from the corresponding quantized embedding $q_{cp}$, conditioned on the global speaker timbre $q_{t}$, and is jointly optimized
- Structurally, it stacks a BigVGAN generator and Transformer blocks
  - At zero-shot controllable inference, it generates the waveform from the LM-predicted content-prosody tokens $z_{cp}^{sys}$, conditioned on the target speaker timbre $q_{t}^{trg}$

2-Stage Training

- Text-to-Codec Language Model

Building on the disentangled representation of DisCodec, DisCo-Speech learns the relationship between text and prosody and generates the timbre-agnostic content-prosody tokens $z_{cp}$
Training
- During training, the input sequence is organized as $ [Ⓢ, t_{c}, Ⓣ,z_{cp},Ⓔ]$
  - $t_{c}$ : Byte Pair Encoding (BPE) sequence of the text, $z_{cp}$ : unified content-prosody tokens from DisCodec, Ⓢ, Ⓣ, Ⓔ : special tokens for sequence start/turn/end
- The model is trained with next-token prediction through pre-training and Supervised Fine-Tuning (SFT)
Inference
- At inference, the input is organized as $ [Ⓢ, t_{c}^{prompt}, t_{c}^{sys}, Ⓣ,z_{cp}^{prompt}]$
  - $t_{c}^{prompt}, z_{cp}^{prompt}$ : prompts extracted from the prompt speech carrying the desired prosody, $t_{c}^{sys}$ : target text
- The model captures the prosodic pattern from the prompt $z_{cp}^{prompt}$ to generate $z_{cp}^{sys}$, and the DisCodec decoder synthesizes the final waveform from $z_{cp}^{sys}$ conditioned on the target speaker timbre $q_{t}^{trg}$
  - Since the LM handles prosody and content while the decoder handles timbre, flexible zero-shot control is possible

3. Experiments

- Settings

Dataset : Internal Dataset
Comparisons
- Codec : WavTokenizer, EnCodec, DAC, SpeechTokenizer, X-Codec, etc.
- TTS : F5-TTS, CosyVoice2, IndexTTS2, Vevo, FireRedTTS, Spark-TTS, Llasa

- Results

DisCodec shows strong disentanglement ability

Codec Performance Comparison

DisCodec can effectively disentangle prosody and content

Disentanglement

DisCodec also performs well on the zero-shot voice conversion task

Zero-Shot Voice Conversion

Controllability of DisCo-Speech
- DisCo-Speech is preferred for both timbre and prosody control

AB Test

DisCo-Speech is also the best in terms of intelligibility

Model Performance Comparison

[Paper Review] Towards Fine-Grained and Multi-Granular Contrastive Language-Speech Pre-Training

feVeRin — Tue, 7 Jul 2026 14:10:37 +0900

Towards Fine-Grained and Multi-Granular Contrastive Language-Speech Pre-Training

Modeling fine-grained speaking style is difficult
CLSP
- Builds the FCaps dataset with 47k hours of speech and 19M fine-grained captions
- Performs contrastive language-speech pre-training on FCaps that integrates global and fine-grained supervision
Paper (ACL 2026) : Paper Link

1. Introduction

Speaking style includes not only speaker-intrinsic characteristics such as gender and age but also temporally varying traits such as intonation and emotion
- BUT, existing speaking style representations such as Emotion2Vec rely on utterance-level, discrete labels and are therefore limited in diversity
- In particular, existing speech style-captioned datasets suffer from error propagation and semantic misalignment caused by cascaded annotation pipelines

Open-Source English Speech Style-Captioned Dataset

-> CLSP is therefore proposed as a fine-grained speech-text representation trained on an annotation dataset of improved quality

CLSP
- Builds the FCaps dataset through an end-to-end pipeline that ensures fine-grained style annotation
- Builds a Contrastive Language-Speech Pre-Trained model (CLSP) with a speech-text dual encoder on top of FCaps

< Overall of CLSP >

A fine-grained, multi-granular contrastive learning model that provides a unified representation across modalities
As a result, it achieves better performance than existing models

2. Method

- Model Architecture

CLSP follows the dual-encoder architecture of CLAP: speech and text are processed by separate encoders, and MLP projections map the two modalities into a shared embedding space
- The paper uses SPEAR-XLarge, a unified speech and audio encoder, and extracts representations from its final encoder layer
- RoBERTa-base is used as the text encoder, and the sentence-level representation is taken from the final-layer $\texttt{[CLS]}$ token

Overview

- Fine-Grained and Multi-Granular Contrastive Language-Speech Pre-Training

A 2-stage curriculum is introduced for fine-grained, multi-granular contrastive supervision
- Training progressively shifts from pure fine-grained alignment to cross-granularity generalization and robust fine-grained discrimination
- The first stage aligns speech and text with standard contrastive learning on large-scale data, and the second stage performs multi-positive contrastive learning
Stage 1
- Suppose a speech clip $\mathbf{x}$ and its paired tokenized fine-grained caption $\mathbf{y}_{F}$ are given
  1. The speech encoder produces frame-level representations, which are aggregated by mean pooling over time and then passed through an MLP projection and $\ell_{2}$ normalization to obtain the speech embedding $\mathbf{s}\in\mathbb{R}^{d}$
  2. For text, the final-layer $\texttt{[CLS]}$ hidden state of the text encoder is passed through an MLP projection and $\ell_{2}$ normalization to obtain the text embedding $\mathbf{t}_{F}\in\mathbb{R}^{d}$
    - $d$ : embedding space dimensionality
- The paper uses the InfoNCE loss, treating each paired speech and text as a positive and non-matching pairs as negatives:
  (Eq. 1) $\mathcal{L}=-\frac{1}{2N}\sum_{i=1}^{N}\left( \log \frac{\exp(\mathbf{s}_{i}\cdot \mathbf{t}_{F_{i}}/\tau)}{ \sum_{j=1}^{N}\exp(\mathbf{s}_{i}\cdot\mathbf{t}_{F_{j}}/\tau)} + \log\frac{\exp(\mathbf{t}_{F_{i}}\cdot \mathbf{s}_{i}/\tau)}{\sum_{j=1}^{N} \exp(\mathbf{t}_{F_{i}}\cdot \mathbf{s}_{j}/\tau)}\right)$
  - $N$ : batch size, $\tau$ : learnable temperature
  - Since all embeddings are $\ell_{2}$-normalized, the dot product is equivalent to cosine similarity
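The symmetric InfoNCE loss of Eq. 1 amounts to two row-wise cross-entropies over one similarity matrix. The sketch below is a minimal NumPy version that assumes the embeddings are already $\ell_{2}$-normalized and passes the temperature in as a plain number instead of learning it; `symmetric_info_nce` is an illustrative name.

```python
import numpy as np

def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_info_nce(s, t, tau):
    """Eq. 1 for speech embeddings s and text embeddings t, both of shape (N, d)."""
    logits = s @ t.T / tau                    # logits[i, j] = s_i . t_j / tau
    diag = np.arange(s.shape[0])              # matching pairs sit on the diagonal
    speech_to_text = -log_softmax(logits, axis=1)[diag, diag].mean()
    text_to_speech = -log_softmax(logits, axis=0)[diag, diag].mean()
    return float(0.5 * (speech_to_text + text_to_speech))
```

When every embedding in the batch is identical the model cannot tell pairs apart and the loss equals $\log N$, which is a convenient sanity check.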
Stage 2
- The paper introduces a symmetric multi-positive InfoNCE loss, i.e., a cross-entropy loss with soft targets
- Suppose a batch of $N$ speech samples $\{\mathbf{x}_{i}\}_{i=1}^{N}$ is given, each paired with two tokenized captions $\{\mathbf{y}_{i},\hat{\mathbf{y}}_{i}\}$
  1. The speech embedding $\mathbf{s}_{i}\in\mathbb{R}^{d}$ and the two text embeddings $\mathbf{t}_{i},\hat{\mathbf{t}}_{i}\in\mathbb{R}^{d}$ are obtained in the same way as in Stage 1
  2. The speech embeddings are stacked as $\mathbf{S}=[\mathbf{s}_{1},...,\mathbf{s}_{N}]\in\mathbb{R}^{N\times d}$ and the text embeddings as $\mathbf{T}=[\mathbf{t}_{1},...,\mathbf{t}_{N},\hat{\mathbf{t}}_{1},...,\hat{\mathbf{t}}_{N}]\in\mathbb{R}^{2N\times d}$, and the similarity logits $\mathbf{L}=\mathbf{ST}^{\top}\in\mathbb{R}^{N\times 2N}$ are computed
- For the audio-to-text direction, a soft target distribution $\mathbf{D}\in\mathbb{R}^{N\times 2N}$ assigns probability mass $\lambda, 1-\lambda$ to the two paired texts and $0$ to the rest:
  (Eq. 2) $D_{i,j}=\left\{\begin{matrix} \lambda, & \text{if}\,\, j=i \\ 1-\lambda, & \text{if}\,\,j=i+N, \\ 0, & \text{otherwise} \\ \end{matrix}\right.$
  - $\lambda=0.5$
- In the text-to-audio direction each text embedding has a single matching speech sample, so the target distribution $\mathbf{D}'\in\mathbb{R}^{2N\times N}$ is:
  (Eq. 3) $D'_{j,i}=\left\{\begin{matrix} 1, & \text{if}\,\,j=i\,\,\text{or}\,\,j=i+N \\ 0, & \text{otherwise} \\ \end{matrix}\right.$
- The loss is defined as the average of the two directions:
  (Eq. 4) $\mathcal{L}=\frac{1}{2}\left(\text{CE}(\mathbf{L}/\tau,\mathbf{D})+\text{CE}(\mathbf{L}^{\top}/\tau,\mathbf{D}')\right)$
  - $\text{CE}(\cdot, \cdot)$ : cross-entropy, $\tau$ : learnable temperature
- At each training step, the model samples one of two tasks according to a task scheduler:
  1. Task 1
    - Each speech sample is paired with a global caption and a fine-grained caption, supporting cross-granularity generalization
  2. Task 2
    - Each speech sample is paired with two distinct fine-grained captions, improving fine-grained discrimination
- At training step $t$, Task 1 is sampled with probability $p_{t}$ and Task 2 with probability $1-p_{t}$
  1. With the static scheduler $p_{t}$ is fixed at $0$, and with the dynamic scheduler $p_{t}$ decreases linearly from $p_{0}$ to $p_{\min}$ over $T$ training steps:
    (Eq. 5) $p_{t}=\max\left(p_{\min},p_{0}-\frac{t}{T}(p_{0}-p_{\min})\right)$
  2. The best results are obtained with the dynamic scheduler using $p_{0}=0.95, p_{\min}=0.50, T=10000$
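Stage 2 combines a soft-target cross-entropy (Eq. 2-4) with a task scheduler (Eq. 5). The NumPy sketch below builds the two target matrices explicitly; averaging the cross-entropy over rows is an assumption about the reduction, the temperature is passed as a fixed number, and the function names are hypothetical.

```python
import numpy as np

def log_softmax_rows(x):
    """Numerically stable log-softmax over each row."""
    shifted = x - x.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def multi_positive_loss(S, T_emb, tau, lam=0.5):
    """Eq. 4 with S of shape (N, d) and T_emb = [t_1..t_N, t_hat_1..t_hat_N] of shape (2N, d)."""
    n = S.shape[0]
    idx = np.arange(n)
    L = S @ T_emb.T                                   # (N, 2N) similarity logits

    D = np.zeros((n, 2 * n))                          # Eq. 2: audio-to-text targets
    D[idx, idx] = lam
    D[idx, idx + n] = 1.0 - lam

    D_prime = np.zeros((2 * n, n))                    # Eq. 3: text-to-audio targets
    D_prime[idx, idx] = 1.0
    D_prime[idx + n, idx] = 1.0

    ce_a2t = -(D * log_softmax_rows(L / tau)).sum(axis=1).mean()
    ce_t2a = -(D_prime * log_softmax_rows(L.T / tau)).sum(axis=1).mean()
    return float(0.5 * (ce_a2t + ce_t2a))

def task1_probability(t, p0=0.95, p_min=0.50, T=10000):
    """Eq. 5: linear decay of the Task 1 sampling probability, floored at p_min."""
    return max(p_min, p0 - (t / T) * (p0 - p_min))
```

With the reported settings, Task 1 is drawn 95% of the time at step 0, 72.5% at step 5000, and half of the time from step 10000 onwards.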

Multi-Granular Speech Style Caption Similarity Scoring

3. Experiments

- Settings

Dataset : FCaps
Comparisons : LAION-AI CLAP, GLAP, ParaCLAP

Data Annotation Pipeline

- Results

CLSP achieves the best performance in the end-to-end setting

Pipeline Setting

CLSP shows strong performance overall

Model Performance Comparison

CLSP is also the best in zero-shot classification

Zero-Shot Classification

CLSP is also preferred in subjective evaluation

Subjective Evaluation

The model-derived similarity of CLSP correlates highly with subjective MOS

Correlation

[Paper Review] ImmersiveTTS: Environment-Aware Text-to-Speech with Multimodal Diffusion Transformer and Domain-Specific Representation Alignment

feVeRin — Thu, 2 Jul 2026 14:37:07 +0900

ImmersiveTTS: Environment-Aware Text-to-Speech with Multimodal Diffusion Transformer and Domain-Specific Representation Alignment

Jointly generating speech together with environmental audio is difficult
ImmersiveTTS
- Builds on a Multimodal Diffusion Transformer that fuses transcript-aligned speech latents and text-conditioned environmental context through joint attention
- Introduces a domain-specific representation alignment objective to improve semantic consistency
Paper (ACL 2026) : Paper Link

1. Introduction

Text-guided audio generation is broadly divided into Text-to-Speech (TTS) and Text-to-Audio (TTA)
- TTA generates non-speech audio from a natural language description, while TTS generates natural-sounding human speech waveforms from text input
- BUT, TTA struggles to generate human speech with linguistic content, and TTS struggles to reflect the acoustic environment
  - In other words, heterogeneous audio sub-modalities cannot be synthesized by a single model
- Environment-aware TTS models such as VoiceLDM can be considered, but they are still limited in naturalness

-> ImmersiveTTS is therefore proposed to explicitly model both speech and environmental audio

ImmersiveTTS
- Builds a dual-stream backbone on the Multimodal Diffusion Transformer (MM-DiT), assigning one stream to transcript-aligned speech features and the other to text-conditioned environmental context
- Additionally uses a Representation Alignment (REPA) strategy to improve cross-modal learning

< Overall of ImmersiveTTS >

An environment-aware TTS model that uses a dual-stream architecture to model speech and environmental audio
As a result, it achieves better performance than existing models

2. Method

- Preliminaries on Flow Matching

Flow matching learns a transformation in $\mathbb{R}^{d_{z}}$ between a simple prior $\pi_{0}$ and the data distribution $\pi_{1}$
- The transformation is defined as an Ordinary Differential Equation (ODE) over time $t\in[0,1]$:
  (Eq. 1) $ \frac{d}{dt}Z_{t}=v(Z_{t},t),\,\,\,Z_{0}\sim\pi_{0},\,\,\,Z_{1}\sim\pi_{1}$
  - $v:\mathbb{R}^{d_{z}}\times[0,1]\rightarrow \mathbb{R}^{d_{z}}$ : time-dependent velocity field, $\pi_{0}$ : standard Gaussian distribution $\mathcal{N}(0,I)$
- The field is parameterized by a neural network $v_{\theta}$ and trained by minimizing the Mean Squared Error between the network velocity and the velocity of the straight path connecting a random pair $(Z_{0},Z_{1})$:
  (Eq. 2) $\mathcal{L}_{Flow}(\theta)=\mathbb{E}_{t,Z_{0},Z_{1}}\left[\left|\left| (Z_{1}-Z_{0})-v_{\theta}(Z_{t},t)\right|\right|^{2}\right]$
  - $Z_{t}=(1-t)Z_{0}+tZ_{1}$ : linear interpolation between $Z_{0}\sim\pi_{0}$ and $Z_{1}\sim \pi_{1}$, $t\in[0,1]$ : time step
- Given the learned velocity field $v_{\theta}$, the flow-based model transports samples along straight trajectories from the prior $\pi_{0}$ to the target distribution $\pi_{1}$
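The training objective of Eq. 2 is easy to state in code: interpolate, compute the straight-path velocity $Z_{1}-Z_{0}$, and regress onto it. The sketch below is a NumPy illustration in which `v_theta` is any callable standing in for the network (a hypothetical interface) and the expectation is replaced by a batch mean.

```python
import numpy as np

def flow_matching_loss(v_theta, z0, z1, t):
    """Monte Carlo estimate of Eq. 2.

    v_theta: callable (z_t, t) -> predicted velocity, same shape as z_t
    z0     : (B, d) samples from the prior pi_0
    z1     : (B, d) samples from the data distribution pi_1
    t      : (B,)   time steps in [0, 1]
    """
    t_col = t[:, None]
    z_t = (1.0 - t_col) * z0 + t_col * z1        # linear interpolation Z_t
    target = z1 - z0                             # velocity of the straight path
    sq_err = ((target - v_theta(z_t, t)) ** 2).sum(axis=1)
    return float(sq_err.mean())
```

A predictor that always outputs zero scores the mean squared distance between the paired samples, which gives a simple baseline for the loss value.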

Overview

- Audio Compression

The pre-trained Variational AutoEncoder (VAE) of AudioLDM2 is used to capture a unified latent space for speech and general audio characteristics
- Let $X_{wav}\in\mathbb{R}^{d\cdot f_{s}}$ be a raw waveform with duration $d$ and sampling rate $f_{s}$
- With $F$ mel bins and mel-spectrogram length $L$, $X_{wav}$ is converted into a log-mel spectrogram $X_{mel}\in\mathbb{R}^{F\times L}$
  1. The VAE encoder downsamples both the time and frequency axes by $\times 4$, compressing $X_{mel}$ into the latent representation $Z\in\mathbb{R}^{8\times F/4\times L/4}$
  2. The VAE decoder then reconstructs $\hat{X}_{mel}$ from $Z$, and a pre-trained vocoder converts $\hat{X}_{mel}$ into the waveform $\hat{X}_{wav}$

- Multimodal Diffusion Transformer for Environment-Aware Text-to-Speech

The paper aims to generate speech that preserves the linguistic content while aligning with the environmental context
- The model is conditioned on two textual inputs: a content prompt $y_{cont}$ and an environment prompt $y_{env}$
- An MM-DiT backbone is used to model the interplay between speech latents and environmental cues, and the Flux architecture is adopted for high-fidelity synthesis
  1. The double-stream stage consists of an environmental context stream, which encodes fine-grained environment context tokens derived from $y_{env}$, and a speech stream, which processes the noisy audio latent $Z_{t}$ conditioned on $y_{cont}$
    - The parallel streams exchange information through joint attention
  2. The speech stream representation is then forwarded to single-stream blocks and further refined by self-attention layers

Environmental Context Stream
- For audio generation, two encoders are considered: CLAP as a coarse, global sound semantic encoder and the T5 encoder as a fine-grained detail encoder
  - The CLAP embedding of $y_{env}$ is first projected with an MLP and combined with the diffusion timestep embedding to condition the AdaLN modules
    - Modulating the AdaLN scale and shift parameters $(\gamma, \beta)$ of the Transformer blocks conditions the generation process globally
  - Additionally, a linear projection is applied to the token-level T5 embedding of $y_{env}$, which is used as the input sequence of the environment context
    - This lets the double-stream layers selectively attend to local environmental detail through joint attention
Environment-Aware Speech Stream
- To generate intelligible speech that follows the content prompt $y_{cont}$, the paper introduces explicit temporal alignment that injects linguistic features directly into the speech stream
- Following Glow-TTS, the text encoder converts $y_{cont}$ into the hidden representation $\tilde{\mu}_{1:L}$ and Monotonic Alignment Search (MAS) estimates the phoneme-level durations $d'_{1:L}$
  1. The hidden vectors are then expanded according to $d'$ to produce the frame-level prior mel representation $\mu$
  2. The text encoder and duration predictor are optimized with the prior loss $\mathcal{L}_{Prior}$ and the MAS-based duration loss $\mathcal{L}_{Dur}$
- To align the prior representation $\mu$ with the audio latent space, $\mu$ is processed by a convolution network
  1. The resulting feature is concatenated with the noisy latent $Z_{t}$ along the channel dimension and passed to the environment-aware speech stream
  2. Within the MM-DiT layers, the speech stream exchanges information with the environment context stream sequence through joint attention

Double-Stream DiT Block

- Domain-Specific Representation Alignment

The paper introduces the REPA strategy to improve training stability
Domain-Specific SSL Encoders
- For domain-specific REPA, a dual-teacher strategy is applied with WavLM and ATST-Frame as target encoders
- WavLM is a speech-specialized SSL model used for precise phonetic and linguistic fidelity
- ATST-Frame is an audio-specialized SSL model that captures rich environmental acoustic events
Alignment Objective
- Let $\{E_{k}\}_{k=1}^{K}$ be the set of $K$ pre-trained SSL encoders
  1. For target audio $X\sim p_{data}$, the $k$-th encoder produces the target representation $r_{k}=E_{k}(X)\in \mathbb{R}^{B\times L_{k}\times D_{k}}$
    - $B$ : batch size, $L_{k}$ : sequence length, $D_{k}$ : feature dimensionality
  2. To align with this target, the model extracts the hidden feature $h_{k}\in\mathbb{R}^{B\times L_{h}\times D_{h}}$ from an intermediate layer of the speech stream
    - An MLP projector then maps it to $h'_{k}=\text{MLP}_{k}(h_{k})$, taking the Transformer feature into the encoder representation space
  3. Next, the temporal resolution of the projected feature $h'_{k}$ and the target feature $r_{k}$ is matched to a common temporal length $\tilde{L}$ by interpolation/pooling, giving the synchronized sequences $\tilde{h}'_{k},\tilde{r}_{k}$
- The REPA loss is defined with the cosine similarity $\text{CosSim}(\cdot, \cdot)$:
  (Eq. 3) $\mathcal{L}_{SSL_{k}}=-\mathbb{E}_{X}\left[\text{CosSim}\left(\tilde{r}_{k}, \tilde{h}'_{k}\right)\right]$
- The total objective is the weighted sum of the domain-specific alignment losses:
  (Eq. 4) $\mathcal{L}_{REPA}=\sum_{k=1}^{K}\lambda_{k}\mathcal{L}_{SSL_{k}}$
  - $\lambda_{k}=1$ : hyperparameter
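The alignment objective (Eq. 3 and Eq. 4) can be sketched in NumPy as follows. The MLP projector is left out, so `h_proj` is assumed to be already projected to the teacher dimension $D_{k}$; linear interpolation to the shorter of the two lengths is one possible reading of "interpolation/pooling", not the paper's stated choice; and the function names are hypothetical.

```python
import numpy as np

def resample_time(x, new_len):
    """Linearly interpolate a (B, L, D) sequence along time to length new_len."""
    length = x.shape[1]
    pos = np.linspace(0.0, length - 1, new_len)
    lo = np.floor(pos).astype(int)
    hi = np.minimum(lo + 1, length - 1)
    frac = (pos - lo)[None, :, None]
    return (1.0 - frac) * x[:, lo] + frac * x[:, hi]

def ssl_alignment_loss(h_proj, r, eps=1e-8):
    """Eq. 3: negative mean cosine similarity between projected features and targets."""
    common = min(h_proj.shape[1], r.shape[1])       # common length (assumption)
    h, target = resample_time(h_proj, common), resample_time(r, common)
    cos = (h * target).sum(axis=-1) / (
        np.linalg.norm(h, axis=-1) * np.linalg.norm(target, axis=-1) + eps
    )
    return float(-cos.mean())

def repa_loss(per_teacher_losses, lambdas):
    """Eq. 4: weighted sum over the K teachers (e.g. WavLM and ATST-Frame)."""
    return sum(lam * loss for lam, loss in zip(lambdas, per_teacher_losses))
```

The loss reaches its minimum of $-1$ per teacher when the projected features point in the same direction as the teacher features at every frame.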

- Training and Inference

Training
- ImmersiveTTS is trained with four losses
  1. The velocity predictor and the convolutional mapper are optimized with the flow matching objective $\mathcal{L}_{Flow}$ and the alignment objective $\mathcal{L}_{REPA}$
  2. The text encoder and duration predictor receive gradients backpropagated through the conditioning path and are supervised by the MAS-based prior loss $\mathcal{L}_{Prior}$ and the duration loss $\mathcal{L}_{Dur}$
- The final objective is:
  (Eq. 5) $\mathcal{L}=\lambda_{P}\mathcal{L}_{Prior}+ \lambda_{D}\mathcal{L}_{Dur}+\lambda_{F}\mathcal{L}_{Flow}+\lambda_{R}\mathcal{L}_{REPA}$
  - During training the CLAP and T5 encoders are frozen, and the time step $t\in(0,1)$ is drawn from a logit-Normal distribution with mean $0$ and variance $1$
- Additionally, for flexible control the paper adopts Classifier-Free Guidance (CFG), independently masking the content and environment prompt sequences with probability $0.1$
Inference
- Sampling starts from random noise $Z_{0}\sim \mathcal{N}(0,I)$
- The velocity field is adjusted with dual-CFG:
  (Eq. 6) $\tilde{v}_{\theta}(Z_{t},t,y_{env},y_{cont})=v_{\theta}(Z_{t},t,y_{env},y_{cont}) + \omega_{env}\left(v_{\theta}(Z_{t},t,y_{env},\emptyset_{cont})-v_{\theta}(Z_{t},t,\emptyset_{env},\emptyset_{cont})\right) +\omega_{cont}\left(v_{\theta}(Z_{t},t,\emptyset_{env},y_{cont})-v_{\theta}(Z_{t},t,\emptyset_{env},\emptyset_{cont})\right)$
  - $\omega_{env},\omega_{cont}$ : guidance scale of each modality, $\emptyset_{env}, \emptyset_{cont}$ : null condition of each modality
- The ODE of (Eq. 1) is then solved with the Euler method:
  (Eq. 7) $Z_{t+\tau}=Z_{t}+\tau\cdot \tilde{v}_{\theta}(Z_{t},t,y_{env},y_{cont})$
  - The generated latent is decoded by the VAE decoder and synthesized into a waveform by the pre-trained vocoder
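The inference loop of Eq. 6 and Eq. 7 can be sketched as below. `v` stands in for the trained velocity network with the hypothetical signature `v(z, t, y_env, y_cont)`, a dropped condition is represented by `None`, and a uniform step size $\tau = 1/\text{steps}$ is assumed. The guidance formula follows Eq. 6 exactly as written in these notes.

```python
import numpy as np

def guided_velocity(v, z, t, y_env, y_cont, w_env, w_cont, null=None):
    """Eq. 6: dual classifier-free guidance over environment and content prompts."""
    v_full = v(z, t, y_env, y_cont)
    v_env = v(z, t, y_env, null)      # content prompt dropped
    v_cont = v(z, t, null, y_cont)    # environment prompt dropped
    v_null = v(z, t, null, null)      # both prompts dropped
    return v_full + w_env * (v_env - v_null) + w_cont * (v_cont - v_null)

def euler_sample(v, z0, y_env, y_cont, w_env, w_cont, steps):
    """Eq. 7: integrate the guided ODE from t = 0 to t = 1 with fixed-size Euler steps."""
    z, tau = z0.copy(), 1.0 / steps
    for i in range(steps):
        z = z + tau * guided_velocity(v, z, i * tau, y_env, y_cont, w_env, w_cont)
    return z
```

Each Euler step costs four network evaluations under this guidance scheme, which is why the number of sampling steps matters for the quality/latency trade-off.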

3. Experiments

- Settings

Dataset : LibriTTS
Comparisons : VoiceLDM, VoiceDiT

- Results

Overall, ImmersiveTTS achieves the best performance

Model Performance Comparison

It also performs well on the Seed-TTS dataset

Performance on the Seed-TTS Dataset

ImmersiveTTS also achieves the best performance in objective evaluation

Objective Evaluation

Representation Alignment
- Among the different teacher alignment strategies, the combination of WavLM and ATST achieves the best performance

Alignment Strategy Comparison

Sampling Step
- Using $9$ sampling steps gives the best trade-off

Performance by NFE

[Paper Review] SAC: Neural Speech Codec with Semantic-Acoustic Dual-Stream Quantization

feVeRin — Wed, 1 Jul 2026 13:00:59 +0900

SAC: Neural Speech Codec with Semantic-Acoustic Dual-Stream Quantization

Speech codecs face a trade-off between semantically rich representation and high-quality reconstruction
SAC
- Uses semantic-acoustic dual-stream quantization to disentangle semantic and acoustic modeling
- Optimizes the two dedicated streams for their respective roles
Paper (ACL 2026) : Paper Link

1. Introduction

A speech tokenizer discretizes a continuous speech waveform into a token sequence
- Neural audio codecs such as EnCodec can capture fine-grained acoustic detail
  - BUT, such acoustic tokens lack semantic content, which leads to weaker alignment with text-based Language Models
- To address this, SpeechTokenizer, X-Codec, and others incorporate semantic supervision to improve the semantic representation
  - BUT, they are still limited in semantic alignment

Speech Codec Comparison

-> SAC is therefore proposed to improve both the semantic and acoustic modeling of neural codecs

SAC
- Introduces a dual-stream architecture consisting of a semantic stream and an acoustic stream
- Explicitly disentangles speech encoding through the two pathways to resolve the semantic-acoustic trade-off

< Overall of SAC >

A dual-stream neural codec using a semantic stream and an acoustic stream
As a result, it achieves better performance than existing codecs

2. Method

SAC은 2개의 discrete encoding stream을 활용함:
- Semantic stream은 pre-trained semantic tokenizer를 사용해 linguistic content를 modeling 함
- Acoustic stream은 speech codec을 통해 semantic token이 missing 하는 acoustic information을 제공함

- Model Architecture

SAC은 VQ-VAE framework를 기반으로 dual-stream encoder-quantizer와 unified codec decoder로 구성됨
Semantic Stream
- 논문은 pre-trained semantic tokenizer를 채택해 semantic stream이 semantic consistency를 보장하도록 함
- 해당 semantic tokenizer는 input speech를 12.5Hz frame rate의 discrete token으로 tokenize 함
  1. Input waveform $x$가 주어지면 tokenizer는 50Hz에서 fine-grained continuous representation $\mathbf{S}_{c}$를 추출함
    - 해당 $\mathbf{S}_{c}$는 auxiliary semantic supervision의 target으로도 사용됨
  2. 다음으로 temporal pooling layer는 $\mathbf{S}_{c}$를 12.5Hz의 $\mathbf{S}$로 downsampling 하고, 해당 feature는 vector quantization을 통해 quantize 되어 quantized embedding $\mathbf{S}_{q}$를 생성함
- Training 시 논문은 semantic tokenizer를 freeze 하여 semantic stream이 acoustic detail에 bias 되지 않고 linguistic content에만 exclusively focus 하도록 함
- 추가적으로 acoustic detail에 대한 temporal alignment를 위해 $\mathbf{S}_{q}$는 ConvNeXt-based adapter로 upsample 되어 semantic feature $\mathbf{S}'_{q}$를 생성함
Acoustic Stream
- Acoustic token은 semantic token에서 missing 된 acoustic detail을 complement 하기 위해 사용됨
- To this end, the paper builds on the EnCodec architecture and uses stacked convolutional and temporal downsampling layers with stride $\tau$ to extract the frame-level acoustic representation $\mathbf{A}$
  1. 추가적으로 DAC를 따라 $\mathbf{A}$를 lower-dimensional embedding space로 mapping 하는 factorized code projection을 적용하고, $L_{2}$ distance 기반의 single-codebook quantization을 수행함
  2. 이때 codebook under-utilization을 방지하기 위해 inactive entry는 randomly sampled embedding으로 re-initialize 됨
- Acoustic representation은 low-bitrate setting에서 25Hz, high-bitrate setting에서 50Hz를 사용하고, 이때 stride $\tau$는 각각 $(2,2,4,5,8), (2,4,5,8)$로 설정됨
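As a quick consistency check, the frame rate is the sample rate divided by the product of the strides. The sketch below assumes 16 kHz input audio, which this summary does not state; under that assumption the two stride settings give exactly the 25Hz and 50Hz quoted above.

```python
from math import prod
from typing import Sequence

SAMPLE_RATE_HZ = 16_000  # assumption: the sample rate is not given in this summary


def frame_rate(sample_rate_hz: int, strides: Sequence[int]) -> float:
    """Frames per second after downsampling by every stride in turn."""
    return sample_rate_hz / prod(strides)


low_bitrate_hz = frame_rate(SAMPLE_RATE_HZ, (2, 2, 4, 5, 8))  # total stride 640
high_bitrate_hz = frame_rate(SAMPLE_RATE_HZ, (2, 4, 5, 8))    # total stride 320
```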
Decoder
- Quantized acoustic embedding $\mathbf{A}_{q}$는 feature dimension을 따라 semantic embedding $\mathbf{S}_{q}^{'}$과 concatenate 되어 unified representation $\mathbf{U}$를 생성함
- 이후 해당 joint representation은 ConvNeXt-based pre-net으로 process 되어 fused feature sequence $\mathbf{F}$를 생성함
  - Fused representation $\mathbf{F}$는 semantic stream의 linguistic information과 acoustic stream의 timbre, acoustic detail을 integrate 함
- 결과적으로 $\mathbf{F}$는 mirrored decoder를 통해 waveform $\tilde{x}$를 reconstruct 하는 데 사용되고, 이때 deconvolution stride $\tau=(8,5,4,2)$로 설정됨
  1. 여기서 X-Codec을 따라 key linguistic information을 preserve 하는 auxiliary semantic reconstruction objective를 도입할 수 있음
  2. $\mathbf{F}$는 CNN-based semantic decoder에 input 되어 reconstructed semantic feature $\tilde{\mathbf{S}}_{c}$를 predict 하고, 이때 Mean Squared Error (MSE) loss는:
    (Eq. 1) $\mathcal{L}_{sem}=||\tilde{\mathbf{S}}_{c}-\mathbf{S}_{c}||_{2}^{2}$
    - $\mathbf{S}_{c}$ : ground-truth semantic feature

Overview

- Auxiliary Speaker Feature Supervision

Timbre reconstruction을 향상하기 위해 논문은 explicit speaker feature supervision을 도입함
- 먼저 ERes2Net을 사용해 speaker embedding $\mathbf{S}_{p}$를 supervision target으로 extract 함
- For prediction, the temporal mean and standard deviation of the fused representation $\mathbf{F}$ are computed and concatenated into a global feature $\mathbf{f}$
  - 해당 vector는 2-layer MLP projector를 통해 predicted speaker embedding $\tilde{\mathbf{S}}_{p}$를 생성함
- 최종적으로 $\tilde{\mathbf{S}}_{p}, \mathbf{S}_{p}$ 간에 MSE loss를 적용하여 timbre information을 modeling 함:
  (Eq. 2) $ \mathbf{f}=[\text{Mean}_{t}(\mathbf{F});\text{Std}_{t}(\mathbf{F})]$
  (Eq. 3) $ \mathcal{L}_{spk}=\left|\left| \tilde{\mathbf{S}}_{p}-\mathbf{S}_{p}\right|\right|_{2}^{2} =\left|\left| \text{Proj}(\mathbf{f})-\mathbf{S}_{p}\right|\right|_{2}^{2}$
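A minimal sketch of (Eq. 2) and (Eq. 3) on plain Python lists. The population standard deviation, the ReLU between the two projector layers, and all weight values are assumptions made for illustration; the target embedding would come from ERes2Net in the paper.

```python
from math import sqrt
from typing import List, Sequence

Vector = List[float]


def stats_pool(frames: Sequence[Sequence[float]]) -> Vector:
    """Eq. 2: concatenate the per-dimension temporal mean and std of F (T x D)."""
    num_frames, dim = len(frames), len(frames[0])
    mean = [sum(f[d] for f in frames) / num_frames for d in range(dim)]
    std = [sqrt(sum((f[d] - mean[d]) ** 2 for f in frames) / num_frames)
           for d in range(dim)]
    return mean + std


def linear(x: Vector, weight: Sequence[Sequence[float]], bias: Vector) -> Vector:
    """Plain affine layer: weight @ x + bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def mlp_projector(f: Vector, w1, b1, w2, b2) -> Vector:
    """2-layer MLP; the ReLU activation is an assumption."""
    hidden = [max(0.0, h) for h in linear(f, w1, b1)]
    return linear(hidden, w2, b2)


def speaker_loss(pred: Vector, target: Vector) -> float:
    """Eq. 3: squared L2 distance between predicted and target embeddings."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))
```

Pooling over time removes the frame axis, so the supervision only constrains utterance-level (timbre-like) statistics of $\mathbf{F}$.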

- Training Objective

SAC은 VQ-GAN framework 하에서 optimize 됨
- Reconstruction Loss
  - Reconstruction loss $\mathcal{L}_{recon}$은 DAC를 따라 multiple scale의 reconstructed, ground-truth audio에 대한 $L_{1}$ distance로 얻어짐
- VQ Loss
  1. Acoustic stream에서 codebook은 encoder output과 quantized embedding 간의 $L_{2}$ distance를 minimize 하도록 optimize 되고, gradient는 Straight-Through Estimator (STE)를 통해 propagate 됨
  2. VQ loss $\mathcal{L}_{vq}$는 commitment term을 포함함
- Discriminative Loss
  1. 논문은 Multi-Period Discriminator (MPD), Multi-Scale STFT-based Discriminator (MS-STFT)를 사용함
  2. Discriminator는 least-squares GAN objective로 optimize 되고, generator는 adversarial loss $\mathcal{L}_{adv}$와 feature matching loss $\mathcal{L}_{feat}$를 사용함
- 결과적으로 overall generator loss는:
  (Eq. 4) $\mathcal{L}_{G}=\lambda_{recon}\mathcal{L}_{recon}+\lambda_{vq}\mathcal{L}_{vq}+\lambda_{adv}\mathcal{L}_{adv}+\lambda_{feat}\mathcal{L}_{feat}+\lambda_{sem}\mathcal{L}_{sem}+\lambda_{spk}\mathcal{L}_{spk}$
  - $\lambda$ : hyperparameter

3. Experiments

- Settings

Dataset : Emilia, WenetSpeech4TTS, LibriSpeech, LibriHeavy, MLS
Comparisons : DAC, EnCodec, SpeechTokenizer, SemantiCodec, X-Codec, WavTokenizer 등

- Results

전체적으로 SAC은 우수한 성능을 보임

Model 성능 비교

Low-bitrate에서도 뛰어난 성능을 달성함

Low-Bitrate에서의 성능

Semantic representation 측면에서도 SAC이 가장 뛰어남

Semantic Representation Evaluation

Ablation Study
- Auxiliary feature supervision은 성능 향상에 유효함

Ablation Study

Speech Decoupling
- SAC achieves effective information disentanglement

Speech Information Disentanglement

실제로 semantic-only, acoustic-only reconstruction은 full reconstruction에 비해 detail을 반영하지 못함

Reconstruction Pattern 별 Spectrogram

[Paper 리뷰] TED-TTS: Training-Free Intra-Utterance Emotion and Duration Control for Text-to-Speech Synthesis

feVeRin — Tue, 30 Jun 2026 12:58:00 +0900

TED-TTS: Training-Free Intra-Utterance Emotion and Duration Control for Text-to-Speech Synthesis

Controllable Text-to-Speech는 여전히 fine-grained intra-utterance expression 측면에서 한계가 있음
TED-TTS
- Introduces segment-aware emotion conditioning that combines causal masking with monotonic stream alignment
- 추가적으로 local duration embedding steering과 global EOS logit modulation을 combine 한 segment-aware duration steering strategy를 적용
논문 (ACL 2026) : Paper Link

1. Introduction

기존 Text-to-Speech (TTS) model의 controllability는 utterance-level로 제한되어 있으므로 human speech의 dynamic expression을 반영하기 어려움
- To address this, one can directly predict phoneme-/frame-level affective attributes from text, as in NaturalSpeech, or use reference speech that guides the expressive pattern
- BUT, 해당 방식은 large-scale time-aligned annotated speech dataset이나 multi-stage pipeline을 요구함

Training-Free Framework

-> 그래서 model re-training 없이 stable segment-level emotion transition을 지원할 수 있는 TED-TTS를 제안

TED-TTS
- Multi-emotion control을 위해 causal masking과 Monotonic Stream Alignment를 combine한 segment-aware emotion conditioning을 도입
- Local duration embedding steering과 global EOS logit modulation에 기반한 segment-aware duration steering strategy를 적용
- 추가적으로 Multi-Emotion Duration-annotated text dataset (MED-TTS)를 구축

< Overall of TED-TTS >

Segment-aware emotion conditioning, segment-aware duration steering을 활용한 training-free controllable TTS model
결과적으로 기존보다 우수한 성능을 달성

2. Method

TED-TTS builds on the IndexTTS2 architecture without re-training the model, and uses MED-TTS, a Multi-Emotion and Duration-annotated text dataset, for its experiments

Overview

- Segment-Aware Emotion Conditioning

Text-to-Semantic (T2S) module은 text, control embedding에 condition된 autoregressive semantic token prediction task로 formulate 됨
- Given an input text, it is first decomposed into $M$ user-defined segments $\mathbf{X}=\{X_{1},X_{2},...,X_{M}\}$, and each segment $X_{m}$ is assigned a condition embedding $\mathbf{C}_{m}=\{\mathbf{I},\mathbf{E}_{m}\}$
  - $\mathbf{I}$ : fixed speaker identity embedding, $\mathbf{E}_{m}$ : segment-specific emotion condition
- Autoregressive T2S에서 semantic token은 explicit segment boundary 없이 continuous stream으로 생성되므로, text segment의 semantic continuity를 preserve 하면서 segment-level condition을 적용하기 어려움
- 이를 해결하기 위해 논문은 2D causal attention mask와 Monotonic Stream Alignment를 combine 하여 smooth intra-utterance emotion transition을 수행함
2D Causal Attention Mask
- 2D causal attention mask는 condition visibility를 semantic context와 disentangle 하여 segment-level condition의 misalignment 문제를 해결함
- 해당 mask는 segment boundary에 대해 text, semantic token의 standard causal attention을 preserve 하면서 condition embedding에 대한 access는 strictly restricting 하여 globally coherent semantic generation을 지원함
  1. $m$-th segment에 속하는 임의의 token에 대해, token은 해당 segment의 conditioning embedding $\mathbf{C}_{m}$에만 attend 하고 다른 condition embedding $\{\mathbf{C}_{j}|j\neq m\}$은 모두 mask out 함
    - 추가적으로 각 condition embedding $\mathbf{C}_{m}$이 다른 condition embedding에 attend 하는 것을 prevent 하여 cross-condition information leakage를 방지함
  2. 즉, emotional style은 locally active condition에 의해 결정되는 반면 semantic content는 standard causal context를 통해 globally visible 하게 유지됨
- BUT, 2D causal attention mask를 적용하기 위해서는 generated semantic token과 source text token 간의 alignment가 필요함
  - 따라서 논문은 Bayesian-style alignment tracking method인 Monotonic Stream Alignment (MSA)를 도입함
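The masking rule can be sketched as a boolean matrix. The layout below is an assumption for illustration: the $M$ condition embeddings occupy the first $M$ positions (one position each), followed by the text/semantic tokens, each tagged with its segment index. During generation the segment index of a semantic token would come from the alignment tracker.

```python
from typing import List, Sequence


def build_2d_causal_mask(num_segments: int,
                         token_segments: Sequence[int]) -> List[List[bool]]:
    """mask[q][k] is True when query position q may attend to key position k.

    Positions 0..M-1 hold C_1..C_M; the remaining positions hold tokens.
    """
    m = num_segments
    size = m + len(token_segments)
    mask = [[False] * size for _ in range(size)]

    # Condition embeddings see only themselves (no cross-condition leakage).
    for c in range(m):
        mask[c][c] = True

    for qi, segment in enumerate(token_segments):
        q = m + qi
        # A token sees the condition of its own segment only.
        mask[q][segment] = True
        # Standard causal attention over all earlier tokens, across segments.
        for ki in range(qi + 1):
            mask[q][m + ki] = True
    return mask


mask = build_2d_causal_mask(2, [0, 0, 1, 1])
```

In this example the last token (segment 2) still attends to every earlier token, so content stays globally visible, but it never sees the first segment's condition.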
Monotonic Stream Alignment (MSA)
- $\mathbf{A}_{i}\in\mathbb{R}^{L\times H\times T}$는 current semantic token $\mathbf{s}_{i}$에서 $L$ layer, $H$ head의 $T$ text token에 대한 raw attention map이고, 이때 $\mathbf{A}^{(l,h)}_{i}$는 head $(l,h)$에 대한 attention vector와 같음
- Online autoregressive decoding 시 MSA는 $\mathbf{s}_{i}$의 alignment를 tracking 하기 위해 text position에 대한 belief distribution을 maintain 함
- $T$ text token에 대한 prior distribution $\hat{\pi}_{i}$, posterior distribution $\pi_{i}$이 주어진다고 하자
  1. 각 decoding step $i$에서 MSA는 previous step의 posterior $\pi_{i-1}$을 monotonic transition operator $\mathcal{P}$를 사용해 text sequence를 따라 propagate 하는 Predict step을 수행함
    - 해당 propagation은 strong temporal monotonicity를 encode 하는 prior distribution $\hat{\pi}_{i}$를 생성함
  2. Monotonic prior $\hat{\pi}_{i}$를 얻은 다음, MSA는 Select step에서 각 head의 attention distribution이 predicted alignment와 일치하는지를 계산하여 most reliable attention head를 select 함:
    (Eq. 1) $(l^{*},h^{*})=\arg\max_{l,h}\hat{\pi}_{i}^{\top}\log \mathbf{A}_{i}^{(l,h)}$
    - $\mathbf{A}^{(l,h)}_{i}$ : head $(l,h)$에 대한 attention vector
    - Resulting head $(l^{*},h^{*})$는 subsequent update에서 most reliable attention observation을 제공함
  3. Update step에서 MSA는 selected attention observation $\mathbf{A}^{(l^{*}, h^{*})}_{i}$와 monotonic prior $\hat{\pi}_{i}$를 combine해 posterior alignment belief를 얻음:
    (Eq. 2) $ \pi_{i}=\frac{\hat{\pi}_{i}\odot \mathcal{G}_{\sigma}\left(\mathbf{A}_{i}^{(l^{*}, h^{*})}\right)}{Z}$
    - $\odot$ : element-wise multiplication, $\mathcal{G}_{\sigma}(\cdot)$ : Gaussian smoothing operator, $Z$ : normalization factor
    - 해당 update는 monotonicity를 보장하면서 real-time attention evidence를 incorporate 하여 stable alignment trajectory $\pi_{i}$를 생성함
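The Predict / Select / Update loop can be sketched in plain Python. Several details are assumptions, since this summary does not specify them: the transition operator $\mathcal{P}$ is modelled as "stay or advance by one text token" with probability `p_advance`, `eps` guards $\log 0$ in (Eq. 1), and $\mathcal{G}_{\sigma}$ is a truncated 1-D Gaussian kernel.

```python
from math import exp, log
from typing import Dict, List, Tuple

Dist = List[float]


def predict(posterior: Dist, p_advance: float = 0.5) -> Dist:
    """Predict step: push belief forward along the text, never backward."""
    size = len(posterior)
    prior = [0.0] * size
    for j, mass in enumerate(posterior):
        if j + 1 < size:
            prior[j] += (1.0 - p_advance) * mass
            prior[j + 1] += p_advance * mass
        else:
            prior[j] += mass  # the last text token absorbs the remaining mass
    return prior


def select_head(prior: Dist, attn: Dict[Tuple[int, int], Dist],
                eps: float = 1e-8) -> Tuple[int, int]:
    """Select step (Eq. 1): the head maximizing prior^T log A."""
    def score(a: Dist) -> float:
        return sum(p * log(x + eps) for p, x in zip(prior, a))
    return max(attn, key=lambda head: score(attn[head]))


def gaussian_smooth(a: Dist, sigma: float = 1.0, radius: int = 2) -> Dist:
    """Truncated Gaussian smoothing along the text axis."""
    offsets = range(-radius, radius + 1)
    kernel = [exp(-0.5 * (k / sigma) ** 2) for k in offsets]
    return [sum(w * a[j + k] for k, w in zip(offsets, kernel)
                if 0 <= j + k < len(a))
            for j in range(len(a))]


def update(prior: Dist, observation: Dist, sigma: float = 1.0) -> Dist:
    """Update step (Eq. 2): normalized product of prior and smoothed attention."""
    smoothed = gaussian_smooth(observation, sigma)
    unnorm = [p * s for p, s in zip(prior, smoothed)]
    z = sum(unnorm)
    return [u / z for u in unnorm] if z > 0 else list(prior)
```

Because the prior is zero on text positions that cannot be reached monotonically, the product in the Update step discards attention mass that jumps ahead or backward.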

Monotonic Stream Alignment

- Segment-Aware Duration Steering

논문은 fully training-free autoregressive setting에서 multi-segment duration control을 지원하기 위해 emotion control framework를 추가적으로 extend 함
Local Duration Embedding Steering
- IndexTTS2를 따라 duration control을 semantic token length로 index 된 dedicated duration embedding에 condition 하고 embedding table $\mathbf{W}_{dur}$를 semantic positional embedding table $\mathbf{W}_{sem}$과 tie 함
- Given an utterance with $M$ segments and desired durations $\mathbf{d}=\{d_{1},d_{2},...,d_{M}\}$, each segment duration is converted into a number of semantic tokens $\hat{\mathbf{d}}=\{\hat{d}_{1},\hat{d}_{2},...,\hat{d}_{M}\}$ according to the codec token rate
  1. 이후 segment-level target을 cumulative token length $\hat{D}_{i}=\sum_{k=1}^{i}\hat{d}_{k}$로 accumulate 하고 segment-wise initial duration embedding을 $\mathbf{D}_{i}=\mathbf{W}_{dur}[\hat{D}_{i}]$와 같이 retrieve 함
  2. 그런 다음 segment-level conditioning input $\mathbf{C}_{m}$으로 concatenate 하여 subsequent generation을 guiding 함
- During autoregressive decoding, the actual semantic token generation speed can deviate from the user-specified target
  - 해당 deviation을 correct 하기 위해, 논문은 adaptive duration table lookup을 기반으로 duration embedding을 dynamically update 하는 local duration embedding steering mechanism을 도입함
- 각 decoding step $i$에서 MSA를 활용해 current aligned text position을 estimate 하고 active segment 내에서 normalized progress indicator (text progress $r_{text}$, semantic progress $r_{sem}$)를 compute 함
  1. 해당 discrepancy $\Delta r=r_{text}-r_{sem}$가 positive value이면 semantic generation이 lagging 하다는 것을 의미하고, 이때 semantic token length는 proportional controller를 통해 adjust 됨:
    (Eq. 3) $ \Delta\hat{D}_{i}=\text{clip}\left(\lfloor k\cdot \Delta r\rceil, -\Delta_{\max}, \Delta_{\max}\right)$
    - $k$ : correction strength, $\lfloor \cdot \rceil$ : nearest integer로의 rounding, $\Delta_{\max}$ : maximum adjustment
  2. Effective segment-wise target은 $\hat{D}_{i}+\Delta\hat{D}_{i}\rightarrow \hat{D}_{i}^{'}$과 같이 update 되고, duration table $\mathbf{W}_{dur}$는 active segment에 대해서만 re-query 되어 updated duration embedding $\mathbf{D}'_{i}$를 얻음
    - The duration embeddings of the other segments are left unchanged
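The proportional controller of (Eq. 3) and the per-segment target update can be sketched as below. The token rate of 25 tokens per second and the values of $k$ and $\Delta_{\max}$ are placeholders, not values from the paper. Note that Python's built-in `round` rounds halves to even, so nearest-integer rounding is written out explicitly.

```python
from itertools import accumulate
from math import floor
from typing import List, Sequence


def cumulative_targets(durations_sec: Sequence[float],
                       token_rate_hz: float = 25.0) -> List[int]:
    """Convert per-segment durations to cumulative semantic-token targets."""
    tokens = [floor(d * token_rate_hz + 0.5) for d in durations_sec]
    return list(accumulate(tokens))


def duration_correction(r_text: float, r_sem: float,
                        k: float, delta_max: int) -> int:
    """Eq. 3: clip(round(k * (r_text - r_sem)), -delta_max, delta_max)."""
    delta = floor(k * (r_text - r_sem) + 0.5)
    return max(-delta_max, min(delta_max, delta))


def steer_targets(targets: Sequence[int], active: int, r_text: float,
                  r_sem: float, k: float, delta_max: int) -> List[int]:
    """Shift only the active segment's target; other segments are untouched."""
    updated = list(targets)
    updated[active] += duration_correction(r_text, r_sem, k, delta_max)
    return updated
```

The updated target of the active segment is then used to re-query $\mathbf{W}_{dur}$, while the clipping bound keeps a single noisy alignment estimate from causing a large jump.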
Global EOS Steering
- In autoregressive decoding, the End-of-Sequence (EOS) token determines sequence termination and therefore the duration
- Local duration embedding steering은 local generation pace를 regulate 할 수 있지만, decoding end를 explicitly control하지는 않으므로, 논문은 global EOS strategy를 도입하여 sequence termination을 결정함
- 이를 위해 EOS logit에 adaptive bias를 적용함
  1. Pre-mature termination을 방지하기 위해 모든 non-final segment에서는 EOS generation을 suppress 하고
  2. Final segment에서는 remaining semantic budget에 따라 EOS logit을 progressively adjust 하여 EOS emission을 smoothly encourage 함
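One possible shape for the adaptive EOS bias is sketched below. The summary only states that EOS is suppressed in non-final segments and progressively encouraged in the final one; the linear ramp from `-gain` to `+gain` and the `gain` value itself are assumptions chosen for illustration, not the paper's schedule.

```python
import math


def eos_logit_bias(segment: int, num_segments: int, generated: int,
                   budget: int, gain: float = 5.0) -> float:
    """Additive bias for the EOS logit at the current decoding step.

    segment:   index of the active segment (0-based)
    generated: semantic tokens already emitted in the final segment
    budget:    semantic-token budget of the final segment (must be > 0)
    """
    if segment < num_segments - 1:
        return -math.inf  # never terminate before the final segment
    remaining = max(budget - generated, 0)
    # Hypothetical linear ramp: discourage EOS early, encourage it near the end.
    return gain * (1.0 - 2.0 * remaining / budget)
```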

3. Experiments

- Settings

Dataset : MED-TTS
Comparisons : MaskGCT, F5-TTS, CosyVoice2, IndexTTS2, Spark-TTS

MED-TTS Dataset

- Results

전체적으로 TED-TTS의 성능이 가장 우수함

Model 성능 비교

다양한 duration scaling에 대해서도 우수한 TTS 성능을 보임

Duration Scaling 별 성능

Ablation Study
- 각 component는 성능 향상에 유효함

Ablation Study

Monotonic Stream Alignment를 활용하면 precise emotion transition을 얻을 수 있음

Alignment Path

TED-TTS는 낮은 token error rate를 보임

Token Error Rate

Segment-aware emotion conditioning (S2), segment-aware duration steering (S3)은 efficiency trade-off를 가짐

Efficiency

[이달슈] 이달의 슈게이즈 6회 - 26년 6월

feVeRin — Sat, 27 Jun 2026 11:35:40 +0900

이달의 슈게이즈 6회 - 26년 6월

* 업로드 당일 기준 작성자 레이더망에 걸린 것들만 올리니 놓치는게 있을 수도 있습니다.

1. 흰천장은 무너졌냐

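Let's open with the new single from 파란노을. Released on the 11th, '상처' delivers a serviceable 파란노을-style sound with a touch of nostalgia. It is one of the better ones compared with earlier tracks, but the dismal lyrics, which manage neither metaphor nor directness, and the aimless aggression that feels inserted to fill a hardcore quota keep me from offering any more praise than that.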

파란노을 - '상처'

2. NME의 선택

5일 샌프란시스코에서는 올해 초 'NME 100'에 이름을 올렸던 신예 Midrift의 데뷔앨범 <Silhouette>이 발매되었습니다. 전반적으로 앨범은 포스트-하드코어에 기반한 20년대 영미권 슈게이즈의 대세를 그대로 따르면서 괜찮은 불안증과 폭발력을 보여줍니다. 한편 하이프를 넣었던 NME는 앨범에 3.5를 주며 적당한 호응을 보냈는데, 뻔한 사운드보다는 고등학생들로 이루어진 이 밴드의 성장 가능성에 좀 더 기대를 거는 듯합니다.

Midrift - 'Over Anything'

3. 전통의 계승자

Velveteen의 EP <My Dreams are Changing>은 런던 언더그라운드다운 클래식한 슈게이즈를 선보입니다. 앨범을 지배하는 윙윙거리는 기타와 떠도는 보컬은 분명한 레퍼런스를 가지고 있지만, 최근에는 이모(Emo)나 포스트-하드코어 섞인 감정과부하 슈게이즈들이 난립하는 탓에 오히려 이런 정석적인 구성이 더 반갑게 느껴지네요.

Velveteen - 'Shoot Me Down'

4. 앨범명이 중요한가

밴쿠버의 Cherry Pick과 시애틀의 44go가 발매한 신보는, 기존의 지루하고 현학적인 앨범명과는 거리가 먼, <:3>와 <D1>이라는 성의 없는(?) 타이틀을 가지고 있습니다. 그중에서도 <:3>는 삐걱이는 소음과 나른한 보컬이 만들어내는 부조화가 특징입니다. 그리고 <D1>의 경우 브릿팝의 영향이 두드러지는데, 특히 오프닝 'Revolver'가 Oasis를 강하게 연상시켜서 여러모로 듣는 재미가 있습니다.

Cherry Pick - 'Aster'

44go - 'Revolver'

5. 이탈리안 게임-포스트록 바리에이션

이탈리아에서는 4월 이달슈에서도 소개했었던 Klimt 1918의 신보 <Àmor>가 공개되었습니다. 전체적으로 포스트록 활용이 눈에 띄는 앨범인데, 이 밴드의 진정한 매력은 그 웅장한 사운드스케이프 속에서도 서정적인 선율을 잃지 않는다는 것에 있습니다. 여담으로 전/후기 스타일이 꽤 차이나는 밴드이기도 하니 좀 더 메탈릭 하고 고딕적인 색채를 느끼고 싶다면 이들의 1, 2집을 들어보는 것도 추천합니다.

Klimt 1918 - 'Un Été Invincible'

6. 싱글을 따라서

신보가 예고된 두 선공개 싱글로 넘어가봅시다. 먼저 Wishy의 'Lovesick'은 Pains of Being Pure at Heart를 연상시키는 쟁글팝 사운드가 특징으로 10월 <Nature’s Pill>에 수록될 예정입니다. 추가로 퍼지한 기타를 앞세운 The Otals의 '空想するタルトタタン (Kuso suru Tarte Tatin)'는 바로 다음 달에 나올 <Hallucination Club>에 수록될 예정이니 두 앨범 모두 때에 맞춰 찾아보시길 바랍니다.

Wishy - 'Lovesick'

The Otals - '空想するタルトタタン'

7. 저마다의 푸른 채도

이상할 정도로 여름을 좋아하는 일본에서는 5월말부터 여름 테마의 곡들이 쏟아지고 있습니다. 그중에서도 데뷔 EP <私の夢を見ていてね (Watashi no Yume wo Miteitene)>를 공개한 Kurayamisaka 커버밴드 출신의 海風邪 (Umikaze)는, 굵직하고 날카로운 기타와 멜로디컬한 탄산감이 조화된 새초롬한 사운드를 선보입니다. 반면 Beachside Talks는 싱글 'Whale Net'에서 쨍쨍하고 밝은 채도를 채택했는데, 작년 The Otals와의 합작앨범에 들어갔던 'Teenage Summer Lovers'도 그렇고 이런 청춘미 하나는 정말 기막히게 그려내는 것 같습니다.

海風邪 - '雪解け'

Beachside Talks - 'Whale Net'

8. 이젠 정말 따라가기 힘든

마지막으로 22일에 발매된 Project Zia의 신보 <Somnithesia>를 살펴봅시다. 이번 달은 정말 소개하고 싶은 신보가 많았지만, 굳이 이 아티스트를 선정한 이유에는 Virtual Singer라는 기상천외한 컨셉에 있습니다. 일단 (이번 달에 또 앨범을 낸!) 보카게이즈 공장장 路傍の石가 작곡에 참여했고 앨범 자체도 꽤 괜찮습니다만, 작품 외적으로 상당히 기묘합니다. 홈페이지를 가보면 무슨 중성의 안드로이드 이러는데, 路傍の石랑 엮인 것 치고는 남녀보컬이 보컬로이드는 아닌 것 같고 Virtual Singer면 버튜버라는건지.. 요즘 트렌드 따라가기 참 어렵네요.

Project Zia - 'Never Let Me Go'

9. 미처 말하지 못한 앨범들

- Album
Computers for the Military - <Metastability>
LULU Suicide - <Fiber Optic Lovers>
路傍の石 - <例えケーキを切れなくても>
The Hanging Gardens - <Noise>
The Dharma Chain - <Some Kind of Pure State>
Widowspeak - <Roses>

- EP
Glitterspitter - <You Don't Know the Dark>
Seasurfer - <Angels>
Suns - <World Eater>
Thistle - <Backflip>
Found Space - <Cloud Study>
Rainsong - <In Tatters>

- Single

Hardenbergia - 'Down to Blue / Toumei na Ori'

Dog Days - 'Petals'
Deepstale - 'Song for You'
Priyanshu - 'S.O.S.'
Hiromi Yuu - 'Tus Memorias en mi Pecho'
Reverie - 'Never Did'
Laterno - 'Drift'
Chitin - 'Graveller'
Greenhouse - 'Abrasion'

이달의 슈게이즈 6회: 26년 6월

[Paper 리뷰] Emo-LiPO: Listwise Preference Optimization for Fine-Grained Emotion Intensity Control in LLM-based Text-to-Speech

feVeRin — Fri, 26 Jun 2026 10:37:21 +0900

Emo-LiPO: Listwise Preference Optimization for Fine-Grained Emotion Intensity Control in LLM-based Text-to-Speech

Large Language Model-based Text-to-Speech는 fine-grained emotion intensity control의 한계가 있음
Emo-LiPO
- Fixed transcript 하에서 각 emotion에 대한 global intensity ordering을 modeling
- 추가적으로 multi-speaker dataset인 ESD-Plus를 구축
논문 (IJCAI 2026) : Paper Link

1. Introduction

Large Language Model (LLM)-based Text-to-Speech (TTS)는 여전히 emotion intensity에 대한 fine-grained control의 한계가 존재함
- 이는 textual emotion description과 realization 간의 semantic-acoustic gap 때문
  - 기존 model은 emotion intensity를 implicitly realize 하므로 high-intensity emotional prompt에서는 intensity expression이 unstable 해짐
- 이를 해결하기 위해 최근에는 preference optimization을 통해 subjective human judgement를 반영함
  - BUT, 대부분의 preference-based TTS는 categorical level의 emotion을 control 하므로 emotion intensity의 ordinal structure를 explicitly modeling 하기 어려움

-> 그래서 fine-grained emotion intensity control을 위한 Emo-LiPO를 제안

Emo-LiPO
- Listwise Preference Optimization (LiPO)를 도입하여 emotion intensity와 prompt-conditioned speech를 explicitly align
- 특히 LLM-based TTS를 Learning-to-Rank (LTR) problem으로 formulate 하여 fine-grained modeling을 지원

< Overall of Emo-LiPO >

LiPO를 활용한 emotion intensity controllable LLM-based TTS model
결과적으로 기존보다 우수한 성능을 달성

2. Problem Formulation

- Task Definition

먼저 text transcript와 emotion specification이 주어지면 LLM-based TTS model은 linguistic content, emotional attribute에 condition 된 speech를 생성함
- 이때 natural language prompt는 emotion category와 non-neutral emotion에 대한 relative intensity level을 specify 함
  1. $\mathcal{C}=\mathcal{C}_{emo}\cup\{neutral\}$을 emotion category라고 하자
    - $\mathcal{C}_{emo}=\{happy, sad, angry\}$이고 $neutral$은 emotionally flat speech에 해당함
  2. 각 $c\in\mathcal{C}_{emo}$에 대해 intensity level에 대한 ordered set $\mathcal{L}=\{l_{1},l_{2},...,l_{K}\}$는 $l_{i}<l_{j}$에 대해 emotional intensity가 strictly increase 한다는 것을 의미함
    - 이때 weakest level $l_{1}$은 neutral speech와 distinct 되어 emotional, non-emotional output 간의 clear bound를 preserve 함
- Text transcript $t$, emotion prompt $P_{c,l}$에 대해 model input은 $x=(t, P_{c,l})$과 같이 구성됨
  - $c$ : emotion category, $l$ : intensity level
- 그러면 TTS model은 $S=\pi_{\theta}(x)$와 같이 speech를 생성하고, 해당 speech는 다음을 만족해야 함:
  1. Content Fidelity
    - Speech는 transcript $t$의 linguistic content를 faithfully convey 해야 함
  2. Category Correctness
    - Speech는 emotion $c$를 정확히 따라야 함
  3. Intensity Ordering
    - For $c\in\mathcal{C}_{emo}$, a stronger intensity level in the prompt must yield speech with higher emotion intensity

- Learning-to-Rank Formulation via LiPO (Emo-LiPO)

논문은 Listwise Preference Optimization (LiPO) framework를 활용하여 fine-grained emotion intensity control을 Learning-to-Rank (LTR) problem으로 formulate 함
- LiPO는 same text transcript로 생성된 multiple candidate로 정의된 listwise preference data로 학습되므로 emotion category correctness와 global intensity ordering을 modeling 하는데 적합함
- 특히 Emo-LiPO는 listwise formulation $\mathcal{D}_{LiPO}=\{(x=(t,P_{c,l}), y=(\mathcal{T}_{c,l},\psi_{c,l})) \}$을 사용함
  - $\mathcal{T}_{c,l}$ : TTS model이 생성한 speech sample list, $\psi_{c,l}\in [0,1]^{|\mathcal{T}_{c,l}|}$ : 해당 sample과 associate 된 real-valued preference score로써 large value일수록 stronger preference를 가짐

Prompt-Conditioned Emotion Intensity Control

3. Method

- Rule-based Preference Construction

For each prompt $P_{c,l}$, a rule-based ranking strategy builds a listwise preference set $\mathcal{T}_{c,l}$ of $K+2$ speech candidates that share the same text transcript $t$
- 이때 list는 다음과 같이 구성됨:
  1. Exact emotion category $c$, intensity level $l$을 가진 target sample $S_{c,l}$
  2. Same category $c$의 서로 다른 intensity level $l'\in\mathcal{L}\backslash \{l\}$을 가진 $K-1$ same emotion sample
  3. 1개의 neutral sample
  4. 1개의 randomly selected non-target emotion category에 대한 negative sample
- 해당 rule-based strategy 하에서 $K-1$ same-emotion sample은 absolute intensity distance $|l'-l|$에 따라 order 되고 target intensity에 close 한 sample이 prefer 됨
- 결과적으로 preference list는:
  (Eq. 1) $ \mathcal{T}_{c,l}=[S_{c,l}\succ S_{c,l_{closest}}\succ ...\succ S_{c,l_{farthest}}\succ S_{neu}\succ S_{\bar{c}}]$
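  - $l_{closest}, l_{farthest}$ : the same-emotion intensity levels closest to and farthest from the target level $l$; the levels in between are ordered by increasing distance $|l'-l|$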
- Rule-based ranking에 기반하여 real-valued preference label vector $\psi_{c,l}\in[0,1]^{|\mathcal{T}_{c,l}|}$을 $\mathcal{T}_{c,l}$ 내의 speech sample에 assign 함
  1. 이를 위해 논문은 index-based preference scoring scheme을 채택하여, 각 sample의 preference label을 ordered list $\mathcal{T}_{c,l}$의 position에 따라 결정함
    - 즉, $i$를 list 내의 sample index라고 할 때 index가 작을수록 stronger preference를 가짐
  2. 이때 $i$-th sample의 preference label은:
    (Eq. 2) $ \psi_{c,l}(i)=1-\frac{i-1}{K+2},\,\,\, i=1,...,K+2$
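The list construction of (Eq. 1) and the index-based labels of (Eq. 2) can be sketched as follows. Representing candidates as `(category, level)` tuples and breaking distance ties in favour of the lower level are assumptions; the summary does not say how ties are resolved.

```python
from typing import List, Optional, Sequence, Tuple

Candidate = Tuple[str, Optional[int]]


def build_preference_list(emotion: str, target_level: int,
                          levels: Sequence[int],
                          negative_emotion: str) -> List[Candidate]:
    """Eq. 1: target, same-emotion levels by distance, neutral, negative."""
    others = sorted((lv for lv in levels if lv != target_level),
                    key=lambda lv: (abs(lv - target_level), lv))
    ranked: List[Candidate] = [(emotion, target_level)]
    ranked += [(emotion, lv) for lv in others]
    ranked += [("neutral", None), (negative_emotion, None)]
    return ranked


def preference_labels(num_levels: int) -> List[float]:
    """Eq. 2: psi(i) = 1 - (i - 1) / (K + 2) for i = 1..K+2."""
    size = num_levels + 2
    return [1.0 - (i - 1) / size for i in range(1, size + 1)]
```

With $K=3$ the list has five entries and the labels step down from 1.0 to 0.2 in increments of 0.2, so the negative sample still receives a small positive label rather than zero.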

Overview

- Multi-Stage Optimization

논문은 Emo-DPO를 따라 multi-stage manner로 TTS model을 optimize 함
- The LLM-based TTS model is formulated as a conditional autoregressive generation task over speech tokens, conditioned on a speaker embedding and the joint input $x=(t,P_{c,l})$
- SFT stage에서는 paired prompt-speech dataset $\mathcal{D}_{SFT}$로 supervised fine-tuning을 수행하여 backbone TTS model $\pi_{base}$를 initialize 함:
  (Eq. 3) $\mathcal{L}_{SFT}(\pi_{base})=\mathbb{E}_{(x,S)\sim \mathcal{D}_{SFT}}[-\log \pi_{base}(S|x)]$
  - $\log \pi_{base}(S|x)$ : autoregressively generated speech token에 대해 token-level cross-entropy를 사용한 teacher forcing으로 compute 됨
  - Resulting model은 reference TTS policy $\pi_{ref}$와 같음
- LiPO stage에서는 $\pi_{ref}$로 initialize 한 다음 listwise preference dataset $\mathcal{D}_{LiPO}$를 사용해 model을 optimize 함
  1. Preference list $\mathcal{T}_{c,l}$와 associated preference label vector $\psi_{c,l}$이 주어지면 listwise preference optimization을 사용해 desired emotion intensity ordering과 align 되는 speech sample을 생성하도록 유도함
  2. 이때 optimization objective는:
    (Eq. 4) $ \mathcal{L}_{LiPO}(\pi_{\theta};\pi_{ref},\beta)=\mathbb{E}_{(x,\mathcal{T}_{c,l},\psi_{c,l})\sim \mathcal{D}_{LiPO}}[r(\psi_{c,l},s)]$
    - $r(\cdot)$ : LiPO framework 하에서 listwise learning-to-rank loss, $s$ : reference policy $\pi_{ref}$에 대한 current policy $\pi_{\theta}$의 ranking score
  3. 각 candidate $S_{i}\in\mathcal{T}_{c,l}$에 대한 score는:
    (Eq. 5) $ s_{i}=\beta\log\frac{\pi_{\theta}(S_{i}|x)}{\pi_{ref}(S_{i}|x)}$
  4. Following the LiPO formulation, $r(\cdot)$ is defined as:
    (Eq. 6) $ r(\psi_{c,l},s)=-\sum_{(i,j)\in\psi_{c,l}}\lambda_{i,j}(s_{i}-s_{j})$
    - $\psi_{c,l}=\{(i,j)|\psi_{c,l}(i)>\psi_{c,l}(j)\}$ : candidate pair $(S_{i},S_{j})$에 대한 relative order, $\lambda_{i,j}$ : weighting term
- 이후 position index $i$를 기반으로 gain $G(\cdot)$, discount function $D(\cdot)$을 정의함:
  (Eq. 7) $ G(i)=2^{\psi_{c,l}(i)-1},\,\,\, D(i)=\frac{1}{\log(1+i)}$
- $\lambda_{i,j}$는:
  (Eq. 8) $ \lambda_{i,j}=|G(i)-G(j)|\cdot\left|\frac{1}{D(i)}-\frac{1}{D(j)}\right|$
  - $\lambda_{i,j}$ grows with the gap between $S_{i}$ and $S_{j}$ in the ranked list, so pairs that are far apart in intensity contribute more to the loss
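The score, weight, and loss of (Eq. 5) to (Eq. 8) can be evaluated directly as written in this review. The sketch below takes sequence log-likelihoods as given numbers and implements the formulas exactly in the form shown above (including the gain $2^{\psi-1}$ and the linear pairwise term); it is an illustration of the summary, not a verified reproduction of the paper's code.

```python
from math import log
from typing import List, Sequence


def ranking_score(logp_theta: float, logp_ref: float, beta: float) -> float:
    """Eq. 5: s_i = beta * log(pi_theta / pi_ref), from log-likelihoods."""
    return beta * (logp_theta - logp_ref)


def gain(psi_i: float) -> float:
    """Eq. 7, as written in this review: G(i) = 2^(psi(i) - 1)."""
    return 2.0 ** (psi_i - 1.0)


def discount(position: int) -> float:
    """Eq. 7: D(i) = 1 / log(1 + i), with 1-based list position i."""
    return 1.0 / log(1.0 + position)


def pair_weight(psi: Sequence[float], i: int, j: int) -> float:
    """Eq. 8: |G(i) - G(j)| * |1/D(i) - 1/D(j)| for 1-based positions i, j."""
    return (abs(gain(psi[i - 1]) - gain(psi[j - 1]))
            * abs(1.0 / discount(i) - 1.0 / discount(j)))


def listwise_loss(psi: Sequence[float], scores: Sequence[float]) -> float:
    """Eq. 6: -sum over pairs with psi(i) > psi(j) of lambda_ij * (s_i - s_j)."""
    total = 0.0
    size = len(psi)
    for i in range(1, size + 1):
        for j in range(1, size + 1):
            if psi[i - 1] > psi[j - 1]:
                total -= pair_weight(psi, i, j) * (scores[i - 1] - scores[j - 1])
    return total
```

The loss decreases when preferred candidates receive higher scores than less preferred ones, and each pair's contribution is scaled by how far apart the two candidates sit in the list.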

4. Experiments

- Settings

Dataset : ESD-Plus
Comparisons : CosyVoice, EmoVoice, Emo-DPO

- Results

전체적으로 Emo-LiPO의 성능이 가장 우수함

Model 성능 비교

Subjective evaluation 측면에서도 Emo-LiPO가 가장 선호됨

Arena Win Rate

Emo-LiPO는 높은 emotion recognition accuracy를 가짐

Emotion Recognition Accuracy

Score Distance Visualization
- Closest same emotion sample 간의 margin $\Delta s_{closest}$, neutral sample 간의 margin $\Delta s_{neu}$, other emotion sample 간의 margin $\Delta s_{\bar{c}}$에 대해, 각 margin은 stable hierarchy를 가짐
- 즉, Emo-LiPO는 same emotion 내에서 fine-grained intensity hierarchy를 preserve 할 수 있음

Score Distance Visualization

Ablation Study
- 각 component는 성능 향상에 유효함

Ablation Study

[Paper 리뷰] CoCoEmo: Composable and Controllable Human-Like Emotional TTS via Activation Steering

feVeRin — Tue, 9 Jun 2026 13:06:28 +0900

CoCoEmo: Composable and Controllable Human-Like Emotional TTS via Activation Steering

대부분의 text-to-speech system은 single utterance-level emotion을 enforce 함
CoCoEmo
- Activation steering에 대한 multi-rater evaluation protocol을 도입
- Human-like emotional speech를 위한 lightweight steering approach를 적용
논문 (ICML 2026) : Paper Link

1. Introduction

Natural speech는 inherently complex 하고 multiple concurrent, conflicting affective signal이 combine 되는 경우가 많음
- 특히 대부분의 expressive Text-to-Speech (TTS) model은 emotion을 single, globally coherent state로 취급함
  - As a result, mixed emotions are averaged into a single dominant tone
- To mitigate this, one can increase the label granularity or retrain with richer emotion annotations, but neither addresses the root cause
  1. 한편 steering vector를 활용하면 pre-trained TTS system의 latent representation space에서 controlled directional bias를 반영할 수 있음
    - 특히 mixed emotion은 multiple emotion-specific steering direction으로 나타나고 text-emotion misalignment는 textual content와 independent 하게 acoustic feature를 modulate 하여 express 됨
  2. BUT, steering vector를 Speech Language Model (SLM)에서 효과적으로 적용하기 위해서는 steering 위치, steering 방법, steering evaluation 등에 대한 gap을 해결해야 함

-> 그래서 SLM에서 steering vector의 동작을 분석하여 controllability를 개선한 CoCoEmo를 제안

CoCoEmo
- Modular emotional TTS architecture에 대한 in-depth analysis를 수행하고 evaluation을 위한 multi-rater protocol을 도입
- 추가적으로 optimal SLM layer에 steering vector를 inject 하여 reliable mixed-emotion synthesis를 지원

< Overall of CoCoEmo >

SLM과 같은 hybrid TTS system을 bridge하는 steering vector mechanism
결과적으로 기존보다 우수한 성능을 달성

2. Disentangling Emotion in SLM and Flow-Matching

- Model Overview

Hybrid TTS system은 일반적으로 2-stage architecture를 사용함
- $\mathbf{x}_{i}$를 $i$-th input text sequence, $\mathbf{c}_{ref}$를 target emotion에 대한 reference signal이라고 하자
- First stage에서 TTS language model $f_{SLM}$은 해당 input을 discrete speech token sequence $\mathbf{z}$로 mapping 함:
  (Eq. 1) $ \mathbf{z}_{i}=f_{SLM}(\mathbf{x}_{i},\mathbf{c}_{ref})$
  - $\mathbf{z}_{i}=(z_{i}^{1},...,z_{i}^{T})$ : token sequence
- Second stage에서 flow-matching acoustic model $f_{Flow}$는 speech token sequence를 mel-spectrogram으로 transform 하고 pre-trained vocoder $g_{voc}$를 통해 waveform으로 convert 함:
  (Eq. 2) $\mathbf{m}_{i}=f_{Flow}(\mathbf{z}_{i},\mathbf{c}_{ref}),\,\,\, \mathbf{y}_{i}=g_{voc}(\mathbf{m}_{i})$

- Where to Steer 1: Modular Analysis

Cross-Conditioning Diagnostic
- Emotional expression에 대한 SLM과 Flow-Matching module의 contribution을 disentangle 하기 위해 논문은 Cross-Conditioning Diagnostic을 도입함
- $\mathbf{c}^{e},\mathbf{c}^{n}$을 각각 emotional, neutral conditioning signal이라고 하자
  1. SLM-Driven
    - Emotion reference는 speech token $\mathbf{z}_{i}$를 modify 하기 위해 SLM에만 적용되고, flow-matching module은 neutral condition에서 동작함:
    (Eq. 3) $\mathbf{z}_{i}^{e}=f_{SLM}(\mathbf{x}_{i},\mathbf{c}^{e}),\,\,\, \mathbf{m}_{SLM}=f_{Flow}(\mathbf{z}_{i}^{e},\mathbf{c}^{n})$
  2. Flow-Driven
    - SLM은 neutral이고 emotion reference는 flow-matching을 통해서만 도입됨:
    (Eq. 4) $\mathbf{z}_{i}^{n}=f_{SLM}(\mathbf{x}_{i},\mathbf{c}^{n}),\,\,\, \mathbf{m}_{Flow}=f_{Flow}(\mathbf{z}_{i}^{n},\mathbf{c}^{e})$
  3. Emotion이 SLM에서 encode 된다면 SLM-Driven은 stronger emotional expressiveness를 생성해야 함
    - 그렇지 않으면 Flow-Driven이 dominate 함

Cross-Conditioning Diagnostic

Findings and Design Implications
- Energy contour 측면에서 SLM-Driven condition은 emotion 별로 distinct prosodic pattern이 나타나고, Flow-Driven condition은 largely overlapped contour가 나타남
  - 즉, flow-matching module은 prosody를 alter 하지 않고 acoustic rendering에만 관여함
- 위 표의 cross-conditioning diagnostic에서 SLM-Driven은 lower CCC, higher SR STD를 가짐
  - 즉, SLM은 synthesized emotional feature의 variability를 govern 하고 flow-matching은 local rendering을 refine 함
- 결과적으로 SLM이 emotional prosody의 primary driver이므로 emotion steering은 SLM에 적용되어야 함

Energy Contour

- Where to Steer 2: Layer and Operator Selection

Why Linear Separability
- Mixed emotion에서 steering vector는 complex expression을 생성하기 위해 서로 다른 direction을 가리키도록 combine 되므로, linear separability는 steerability의 proxy로 사용될 수 있음
- That is, the higher the separability, the more reliably steering vectors can be extracted and combined
Layer- and Operation-Level Probing for SLM Steering
- $\mathcal{D}=\{(\mathbf{x}_{i},\mathbf{a}_{i},y_{i})\}_{i=1}^{N}$을 $N$ sample로 구성된 dataset이라고 하자
  - $\mathbf{x}_{i}$ : input text, $\mathbf{a}_{i}$ : reference emotional speech, $y_{i}\in\{0,1,...,E\}$ : emotion label
- SLM은 multiple operation $\mathcal{O}^{(l)}$을 가진 $L$ Transformer layer를 가지고, 이때 layer-/operational-wise activation은:
  (Eq. 5) $\mathbf{h}_{i}^{(l,o)}=\left\{\begin{matrix} \text{Op}^{(l,o)}(\mathbf{x}_{i},\mathbf{a}_{i}), & l=1 \\ \text{Op}^{(l,o)}(\mathbf{h}_{i}^{(l-1)}), & l=2,...,L \\ \end{matrix}\right.,\,\,\, o\in\mathcal{O}^{(l)}$
  - $\text{Op}^{(l,o)}$ : attention, feed-forward network와 같은 operation
- To identify where emotion is most distinctly represented, the paper trains a linear probe on $\mathbf{h}_{i}^{(l,o)}$ to predict $y_{i}$ and measures linear separability by the probe accuracy
  - Highest discriminability를 가지는 Top-$K$ layer, operation은 steering vector를 추출하고 inject 하는 데 사용됨
Findings and Design Implications
- 아래 그림과 같이 CosyVoice2에서는 10-17 layer가 strong linear separability를 가지고, operation 중에서는 $\texttt{attn\_output}$이 highest discriminability를 보임
  - IndexTTS2의 경우 5-10 layer
- 결과적으로 mid-to-late layer와 attention output은 emotion representation에 대한 highest linear separability를 가짐

Emotion Discriminability

3. Method

위 결과를 바탕으로 논문은 identified model layer에서 각 individual emotion에 대한 steering vector를 추출함
- Mixed-emotion vector는 single-emotion vector의 weighted combination으로 구성되고, emotion proportion에 대한 quantitative control을 지원함
- Steering vector는 linguistic representation과는 independent 하게 emotional acoustic variation으로부터 추출되고 text-emotion mismatch를 handling 함

Overview

- Steering Vector Construction

Single Emotion Steering
- 논문은 mean-difference approach를 활용하여 emotion steering vector를 compute 하고 mean neutral representation에서 mean target emotion representation으로 이동함
  - 이때 acoustic emotion information을 isolate 하기 위해 same speaker, transcript를 가지는 sample만 compare 함
- Emotion label $y_{i}\in\{0,...,E\}$, neutral speech $y_{i}=0$에 대해, dataset $\mathcal{D}=\{(\mathbf{x}_{i},\mathbf{a}_{i},y_{i})\}_{i=1}^{N}$이 주어진다고 하자
  1. 먼저 speaker, linguistic content를 control 하기 위해 speaker-matched neutral-emotion pair를 구성함
  2. Specifically, for each target emotion $e\in\mathcal{Y}$, emotion-$e$ utterances are paired with neutral utterances of the same speaker to form two subsets $\mathcal{D}^{(e)}, \mathcal{D}_{0}^{(e)}$
  3. Let $\mathbf{h}_{i}^{(l,o)}$ be the last-token activation of sample $i$ at the selected layer $l$ and operation $o$; the steering vector of emotion $e$ is then the difference between the mean representation of the emotion-$e$ samples and that of their paired neutral counterparts:
    (Eq. 6) $\mathbf{v}_{e}^{(l,o)}=\frac{1}{|\mathcal{D}^{(e)}|}\sum_{i\in\mathcal{D}^{(e)}} \mathbf{h}_{i}^{(l,o)}-\frac{1}{|\mathcal{D}_{0}^{(e)}|}\sum_{j\in\mathcal{D}_{0}^{(e)}} \mathbf{h}_{j}^{(l,o)}$
- 결과적으로 vector $\mathbf{v}_{e}^{(l,o)}$는 latent space에서 emotion $e$에 대한 direction을 capture하고 추론 시 inject되어 target emotion expression을 induce함
  - Mismatch scenario에서 steering vector는 text-implied emotion을 override 하고 internal bias로 동작함
Mixed Emotion Steering
- Mixed emotion은 single emotion vector $\mathbf{v}_{e}^{(l,o)}$를 combine 하여 steering vector를 compute 함
- Target emotion에 대한 weight를 $\{p_{e}\}^{E}_{e=1}$이라 하고 $\sum_{e=1}^{E}p_{e}=1$이라고 할 때, mixed emotion steering vector는:
  (Eq. 7) $\mathbf{v}_{mix}^{(l,o)}=\sum_{e=1}^{E}p_{e}\mathbf{v}_{e}^{(l,o)}$
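As a rough sketch of (Eq. 7), the mixed vector is a convex combination of the single-emotion vectors. The dictionary layout, the function name, and the 2-dimensional toy vectors are assumptions made for illustration only.

```python
def mix_steering_vectors(vectors, weights):
    """Eq. 7: weighted sum of single-emotion steering vectors."""
    if abs(sum(weights.values()) - 1.0) > 1e-6:
        raise ValueError("mixing weights must sum to 1")
    dim = len(next(iter(vectors.values())))
    mixed = [0.0] * dim
    for emotion, p in weights.items():
        for k, value in enumerate(vectors[emotion]):
            mixed[k] += p * value
    return mixed


# Hypothetical single-emotion steering vectors (d = 2).
vectors = {"happy": [1.0, 0.0], "sad": [0.0, 2.0]}
v_mix = mix_steering_vectors(vectors, {"happy": 0.75, "sad": 0.25})
print(v_mix)
```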

- Inference-Time Steering

At inference, the single-emotion steering vector $\mathbf{v}_{e}^{(l,o)}$ or the mixed-emotion vector $\mathbf{v}_{mix}^{(l,o)}$ is injected into the selected Top-$K$ layers and operations
- At each selected layer and operation, the activation $\mathbf{h}$ is modulated by steering:
  (Eq. 8) $\tilde{\mathbf{h}}_{i}^{(l,o)}=\mathbf{h}_{i}^{(l,o)}+\alpha\cdot \mathbf{v}^{(l,o)}$
  - $\alpha$ : steering intensity, $\mathbf{v}^{(l,o)}$ : single-emotion $\mathbf{v}^{(l,o)}_{e}$ or mixed-emotion $\mathbf{v}_{mix}^{(l,o)}$
- In addition, to preserve the original activation scale and maintain semantic coherence, the paper renormalizes as $\tilde{\mathbf{h}}_{i}^{(l,o)}\leftarrow\frac{|| \mathbf{h}_{i}^{(l,o)}||}{||\tilde{\mathbf{h}}_{i}^{(l,o)}||}\cdot \tilde{\mathbf{h}}_{i}^{(l,o)}$
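The steering step in (Eq. 8) and the renormalization can be sketched together. This is an illustrative version on a single toy activation vector; the function names and the zero-norm guard are assumptions, and in practice the update would be applied inside the model at each selected layer and operation.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def steer(h, v, alpha):
    """Eq. 8 followed by rescaling to the original activation norm."""
    shifted = [hi + alpha * vi for hi, vi in zip(h, v)]
    norm_shifted = l2_norm(shifted)
    if norm_shifted == 0.0:
        # Guard added for this sketch: nothing to rescale.
        return shifted
    scale = l2_norm(h) / norm_shifted
    return [scale * x for x in shifted]


h = [3.0, 4.0]          # toy activation with norm 5
v = [1.0, 0.0]          # toy steering vector
h_tilde = steer(h, v, alpha=2.0)
print(h_tilde, l2_norm(h_tilde))
```

After rescaling, only the direction of the activation changes; its norm stays equal to that of the original $\mathbf{h}$, which is the property the renormalization is meant to preserve.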

- Mixed-Emotion Evaluation

Evaluating mixed-emotion synthesis requires a soft ground truth
- For this, the paper uses multi-rater annotations
  1. Each speech recording $\mathbf{a}_{i}$ is labeled by $M$ raters with one-hot vectors $y_{i,m}\in\{0,1\}^{|E|}$
  2. The consensus distribution is then:
    (Eq. 9) $\mathbf{p}_{i}=\frac{1}{M}\sum_{m=1}^{M}y_{i,m}$
    - e.g., for $E=\{\texttt{happy, sad, angry}\}$, if two raters label $\texttt{happy}$ and one rater labels $\texttt{sad}$, then $\mathbf{p}_{i}=[\frac{2}{3},\frac{1}{3},0]$
- This consensus distribution serves as the mixing weights of the steering vector $\mathbf{v}_{mix}^{(l,o)}$ in (Eq. 7), and the synthesized speech is compared against the ground-truth target speech $\mathbf{a}_{i}$ from which $\mathbf{p}_{i}$ was derived
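The consensus distribution in (Eq. 9) is just the average of the raters' one-hot votes. The sketch below reproduces the happy/sad/angry example from the text; the function name and the string-label input format are assumptions for illustration.

```python
def consensus_distribution(rater_labels, emotions):
    """Eq. 9: fraction of raters that chose each emotion."""
    counts = {e: 0 for e in emotions}
    for label in rater_labels:
        counts[label] += 1
    m = len(rater_labels)
    return [counts[e] / m for e in emotions]


emotions = ["happy", "sad", "angry"]
p = consensus_distribution(["happy", "happy", "sad"], emotions)
print(p)
```

The resulting list can be used directly as the weights $p_{e}$ when mixing single-emotion steering vectors.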

4. Experiments

- Settings

Dataset : ESD, RAVDESS, CREMA-D
Comparisons : CosyVoice2, IndexTTS2

- Results

Applying CoCoEmo yields better mixed-emotion synthesis

Model Performance Comparison

It also performs well in terms of Emotion2Vec Similarity (E-SIM), Target Emotion Probability (TEP), and Spearman Correlation

Mixed-Emotion Synthesis

Text-Emotion Mismatch Speech Synthesis
- It also achieves robust performance on the mismatched set

Performance on the Mismatched Set

Activation steering consistently improves E-SIM

Effect of Steering Strength $\alpha$ in Mismatch Synthesis

Single Emotion Steering
- Compared with $\alpha=0$ (no steering), TEP increases as $\alpha$ grows
- That is, the steering vector introduces the correct directional bias

Single Emotion Steering

Layer-wise Steering Analysis
- In CosyVoice2, layers 17 and 14 have the highest separability

Layer-wise TEP

[Paper Review] FC-TTS: Style and Timbre Control in Zero-Shot Text-to-Speech with Disentangled Speech Representation

feVeRin — Mon, 8 Jun 2026 10:57:42 +0900

FC-TTS: Style and Timbre Control in Zero-Shot Text-to-Speech with Disentangled Speech Representations

Zero-shot Text-to-Speech is still limited in independent, precise control
FC-TTS
- Introduces a 2-stage spectrogram generation pipeline and a VQ-VAE-based style encoder
- Additionally introduces a conditioning-aware consistency loss to improve attribute separation and the reliability of dual-reference control
Paper (ACL 2026) : Paper Link

1. Introduction

Zero-shot Text-to-Speech (TTS) offers style flexibility by conditioning on an example utterance
- BUT, in most zero-shot TTS models, style and timbre are entangled, which makes independent control difficult
- Disentangled approaches such as NANSY++ and NaturalSpeech3 can be considered, but they still suffer from imperfect disentanglement and limited robustness to unseen style-timbre combinations

-> The paper therefore proposes FC-TTS, which improves controllability through better modeling of disentangled representations

FC-TTS
- Introduces a 2-stage spectrogram generation pipeline built on disentangled representations
- Introduces a VQ-VAE-based style encoder that captures fine-grained, intra-utterance style variability, and a Conditioning-aware Consistency Loss that supports disentangled control

< Overall of FC-TTS >

A zero-shot controllable TTS model built on a 2-stage pipeline and disentangled representations
As a result, it achieves better performance than prior methods

2. Preliminary

- Factorized Speech Codec

FACodec can be considered for timbre- and style-controllable TTS
- The codec factorizes the speech signal into multiple disentangled streams of discrete tokens, each capturing a distinct speech attribute
  - Prosody token $\mathbf{c}_{p}$, content token $\mathbf{c}_{c}$, acoustic detail token $\mathbf{c}_{d}$
- For $T$ time steps and residual quantization levels $N_{p}=1, N_{c}=2, N_{d}=3$, each stream is represented as $\mathbb{Z}^{N_{*}\times T}$, and speaker timbre is captured by a continuous global embedding $z_{spk}\in\mathbb{R}^{D}$
- The paper uses only $z_{spk}, \mathbf{c}_{p}$ as conditions; the content token $\mathbf{c}_{c}$ and acoustic detail token $\mathbf{c}_{d}$ are excluded to prevent information leakage

- Flow-Matching TTS

Conditional TTS generates a target speech representation $\mathbf{x}\in\mathbb{R}^{F\times T}$, such as a log mel-spectrogram, given a phoneme sequence $\mathbf{y}\in\mathbb{Z}^{L}$ and conditioning information $\mathbf{c}$ such as speaker timbre
- Conditional Flow-Matching (CFM) can be adopted for this conditional generation
  1. CFM defines a continuous-time transformation from a simple prior $p_{1}(\mathbf{x})$, such as an isotropic Gaussian $\mathcal{N}(0,I)$, to the target conditional distribution $p_{0}=p(\mathbf{x}|\mathbf{y},\mathbf{c})$
  2. To describe the progression from $p_{1}$ to $p_{0}$, CFM introduces a time-dependent flow $\phi_{t}:[0,1]\times \mathbb{R}^{d}\rightarrow \mathbb{R}^{d}$ that transports samples along time $t\in[0,1]$
    - Each step has a marginal distribution $p_{t}(\mathbf{x})$
  3. The flow is driven by a velocity field $v_{t}(\mathbf{x}):[0,1]\times \mathbb{R}^{d}\rightarrow \mathbb{R}^{d}$ that specifies the instantaneous direction at each point, and the relationship is governed by an Ordinary Differential Equation (ODE):
    (Eq. 1) $ \frac{d}{dt}\phi_{t}(\mathbf{x})=v_{t}(\phi_{t}(\mathbf{x})), \,\,\, \phi_{1}(\mathbf{x})=\mathbf{x}_{1}$
- Since the true $v$ is unavailable, CFM approximates it by training $u_{\theta}(\mathbf{x},t,\mathbf{y},\mathbf{c})$ against the conditional vector field $v_{t}(\mathbf{x}|\mathbf{x}_{0})$
  1. The straight-line Optimal Transport (OT) trajectory is the most efficient, and its ground-truth velocity is:
    (Eq. 2) $v_{t}^{OT}(\mathbf{x}_{t}|\mathbf{x}_{0})=\mathbf{x}_{0}-\mathbf{x}_{1}$
    - $\mathbf{x}_{t}=(1-t)\mathbf{x}_{1}+t\mathbf{x}_{0}$
  2. The model then minimizes the loss in (Eq. 3) to align the predicted velocity $u_{\theta}$ with the OT velocity:
    (Eq. 3) $\mathcal{L}_{CFM}=\mathbb{E}_{t,\mathbf{x}_{0},\mathbf{x}_{1}}\left[\left|\left| u_{\theta}(\mathbf{x}_{t},t,\mathbf{y},\mathbf{c}) - (\mathbf{x}_{0}-\mathbf{x}_{1})\right|\right|^{2}\right]$
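To make (Eq. 2) and (Eq. 3) concrete, the sketch below builds one point on the straight-line path and evaluates the per-sample squared error. It is a toy illustration only: the function names are hypothetical, the vectors are 2-dimensional, and the "prediction" is a plain list standing in for the output of a network $u_{\theta}$.

```python
def ot_sample(x0, x1, t):
    """Return x_t = (1 - t) * x1 + t * x0 and the target velocity x0 - x1."""
    x_t = [(1 - t) * b + t * a for a, b in zip(x0, x1)]
    target = [a - b for a, b in zip(x0, x1)]
    return x_t, target


def cfm_loss(predicted, target):
    """Single-sample term of Eq. 3: squared L2 distance to the OT velocity."""
    return sum((p - q) ** 2 for p, q in zip(predicted, target))


x0 = [1.0, -2.0]   # toy data sample
x1 = [0.0, 0.0]    # toy prior sample (would be drawn from N(0, I))
x_t, target = ot_sample(x0, x1, t=0.5)

# A perfect prediction gives zero loss; an all-zero prediction does not.
print(cfm_loss(target, target), cfm_loss([0.0, 0.0], target))
```

Note that the target velocity does not depend on $t$, which is what makes the straight-line path convenient: the network only has to regress a constant direction along each trajectory.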

3. Method

In FC-TTS, the timbre condition $z_{spk}$ and the style condition $\mathbf{c}_{p}$ are extracted from the same target during training, while references from different utterances are used at inference
- Structurally, FC-TTS processes the two conditions sequentially:
  1. The timbre stage anchors timbre characteristics via $z_{spk}$ and generates a blurry spectrogram
  2. The style stage refines it by imprinting prosodic characteristics via $\mathbf{c}_{p}$
- This 2-stage framework ensures that each reference condition is reflected only at its intended step

- Hierarchical Spectrogram Generation

Simply reusing the FACodec decoder, as in NaturalSpeech3, cannot guarantee robust generation for unseen combinations, which makes independent timbre-prosody control difficult
- To address this, the paper incorporates a jointly trained CFM speech decoder and performs hierarchical log mel-spectrogram generation
- A blurry spectrogram $\mathbf{h}$ is first generated from timbre information, and the CFM decoder then refines it into the complete spectrogram $\mathbf{x}_{0}$ using style information
  1. The two steps are trained jointly with a Mean Absolute Error (MAE) loss on the blurry spectrogram and the CFM loss on the final output
    - In particular, the MAE objective $\mathcal{L}_{blur}=\mathbb{E}[||\mathbf{h}-\mathbf{x}_{0}||]$ encourages over-smoothed outputs
  2. In addition, to prevent information leakage, $z_{spk}$ is randomly replaced with that of another utterance from the same long audio file
- As a result, the timbre adapter injects $z_{spk}$ in the first stage to anchor timbre characteristics, and the style adapter subsequently applies $\mathbf{c}_{p}$, ensuring that each reference influences only its dedicated pathway

Overview

- VQ-VAE Style Encoding

Zero-shot TTS models mimic voice characteristics through In-Context Learning (ICL)
- BUT, ICL assumes that timbre and style are consistent, so it cannot reflect a speaking style that varies within a single utterance
- The paper therefore provides the style representation extracted from the target speech as a condition during training
  - Here the model may take a shortcut by copying surface-level acoustic features of the style reference instead of capturing higher-level prosodic patterns
- Thus, a TCF style encoder combining a Transformer encoder, Cross-attention, and a Finite Scalar Quantization (FSQ) layer is introduced to model style representations hierarchically at the phoneme and frame levels
  1. Prosody-only Representation
    - The paper uses only the prosody token $\mathbf{c}_{p}$ as TCF input and excludes the content token $\mathbf{c}_{c}$ and acoustic detail token $\mathbf{c}_{d}$
    - This lets the style encoder capture only rhythmic and intonational patterns without encoding unintended information
  2. Q-Former Bottleneck
    - A fixed set of learned query tokens attends to the variable-length encoder output through cross-attention, compressing it into a fixed number of latent tokens
    - The bottleneck forces the model to discard frame-level temporal detail and retain only high-level stylistic structure, which prevents overfitting to specific acoustic realizations
  3. Vector Quantization
    - The continuous latent tokens produced by the Q-Former are further discretized with FSQ
    - FSQ acts as an information bottleneck that suppresses low-level acoustic residuals and commits to semantically meaningful style codes
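For intuition about the FSQ bottleneck, here is a minimal sketch of scalar quantization in its common form: each latent dimension is bounded with tanh and rounded to a small set of integer levels. The level counts, the restriction to odd level counts, and the function name are assumptions for illustration; the notes above do not state the configuration used in FC-TTS.

```python
import math


def fsq_quantize(latent, levels):
    """Bound each dimension with tanh, then round to the nearest integer level."""
    codes = []
    for z, num_levels in zip(latent, levels):
        half = (num_levels - 1) / 2       # odd level counts assumed in this sketch
        bounded = half * math.tanh(z)     # value lies in (-half, half)
        codes.append(int(round(bounded)))
    return codes


# Toy 3-dimensional latent with 5, 5, and 3 levels per dimension.
codes = fsq_quantize([0.0, 10.0, -10.0], levels=[5, 5, 3])
print(codes)
```

With these level counts the latent can take at most 5 x 5 x 3 = 75 distinct values, which illustrates how the quantizer limits how much low-level acoustic detail can pass through.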

- Conditional Consistency Loss

To improve condition consistency in disentangled TTS, the paper introduces a Conditional Consistency Loss (CCL)
- First, the CFM objective is reparameterized so that the FC-TTS decoder directly generates the log mel-spectrogram instead of a vector field
  1. Two attribute predictors are then trained on this spectrogram to predict the conditioning prosody token $\mathbf{c}_{p}$ and the speaker embedding $z_{spk}$
  2. Each predictor also receives the non-target conditioning signal: $z_{spk}$ is fed to the prosody predictor and $\mathbf{c}_{p}$ to the timbre predictor
- CCL is then the weighted sum of a cross-entropy loss for prosody prediction and a negative cosine similarity for speaker embedding consistency:
  (Eq. 4) $\mathcal{L}_{CCL}=\lambda_{ccl\text{-}pro}\cdot\mathbb{E}\left[\text{CE}\left(\mathbf{c}_{p}, f\left(\hat{\mathbf{x}},z_{spk}\right)\right)\right]-\lambda_{ccl\text{-}spk}\cdot\mathbb{E}\left[\cos\left( z_{spk},g\left(\hat{\mathbf{x}},\mathbf{c}_{p}\right)\right)\right]$
  - $\text{CE}(\cdot, \cdot)$ : cross-entropy loss, $f(\cdot)$ : prosody predictor, $g(\cdot)$ : speaker embedding predictor, $\hat{\mathbf{x}}$ : log mel-spectrogram generated by the decoder
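The two terms of (Eq. 4) can be sketched on toy values. In this illustration the predictor outputs are passed in directly as lists (the predictors $f$ and $g$ themselves are not modeled), and the function names and default weights of 1.0 are assumptions rather than values from the paper.

```python
import math


def cross_entropy(logits, target_index):
    """Cross-entropy of one frame, using a numerically stable log-sum-exp."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_index]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def ccl_loss(prosody_logits, prosody_targets, spk_true, spk_pred,
             lam_pro=1.0, lam_spk=1.0):
    """Eq. 4: weighted prosody cross-entropy minus weighted speaker cosine."""
    ce = sum(cross_entropy(l, t) for l, t in zip(prosody_logits, prosody_targets))
    ce /= len(prosody_targets)
    return lam_pro * ce - lam_spk * cosine(spk_true, spk_pred)


# One frame with uniform logits over 2 prosody codes, and a perfectly
# predicted speaker embedding (cosine similarity of 1).
loss = ccl_loss([[0.0, 0.0]], [0], spk_true=[1.0, 2.0], spk_pred=[1.0, 2.0])
print(loss)
```

The speaker term enters with a minus sign, so minimizing the loss pushes the predicted speaker embedding toward the conditioning embedding while the cross-entropy term pushes the predicted prosody tokens toward $\mathbf{c}_{p}$.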

CCL Gradient

4. Experiments

- Settings

Dataset : LibriHeavy
Comparisons : NaturalSpeech3, F5-TTS, DiTTo-TTS, CLaM-TTS

LibriHeavy Dataset

- Results

Overall, FC-TTS achieves the best performance

Model Performance Comparison

FC-TTS is also better in terms of timbre controllability

Timbre Controllability

It also performs better in prosody control

Prosody Controllability

FC-TTS is also preferred under AudioLLM-as-a-Judge evaluation

AudioLLM-as-a-Judge

Ablation Study
- Each component contributes to the performance improvement

Ablation Study

In particular, the 2-stage design improves robustness to timbre combinations

Mel-Spectrogram Comparison
