[Paper Review] Cross-Lingual F5-TTS: Towards Language-Agnostic Voice Cloning and Speech Synthesis

feVeRin 2026. 3. 25. 12:54

Cross-Lingual F5-TTS: Towards Language-Agnostic Voice Cloning and Speech Synthesis

Flow-matching-based Text-to-Speech models are difficult to apply to cross-lingual tasks
Cross-Lingual F5-TTS
- Uses forced alignment to pre-process the audio prompt and obtain word boundaries, so that synthesis is performed directly from the audio prompt
- Introduces speaking rate predictors at different linguistic granularities for duration modeling
Paper (ICASSP 2026) : Paper Link

1. Introduction

Zero-shot Text-to-Speech (TTS) generates speech that resembles a given reference speech, conditioned on the input text
- In particular, flow-matching-based models such as VoiceBox and E2-TTS enable fast, high-quality synthesis
- BUT, most flow-matching TTS models depend on the transcript of the audio prompt, so they are hard to use when the reference transcript is unavailable
- Additionally, when the duration is estimated from the transcript, as in F5-TTS, the speech duration ratio mismatches across languages, so the model cannot be extended to cross-lingual tasks

-> Therefore, the paper proposes Cross-Lingual F5-TTS, which improves the cross-lingual ability of flow-matching-based TTS

Cross-Lingual F5-TTS
- Uses Massively Multilingual Speech (MMS) forced alignment to incorporate word boundaries
- Introduces three dedicated Speaking Rate Predictors at the phoneme, syllable, and word levels to improve duration prediction

< Overview of Cross-Lingual F5-TTS >

A cross-lingual flow-matching-based TTS model built on MMS forced alignment and Speaking Rate Predictors
As a result, it achieves better performance than the existing approach

2. Method

- Preliminary on Flow-Matching-based TTS

E2-TTS, VoiceBox, and similar models perform high-quality synthesis based on flow matching
- Flow matching learns a time-dependent vector field $v_{t}$ that matches the probability path $p_{t}$ between a simple noise distribution $p_{0}$ and the data distribution $q$, and generates the flow $\psi_{t}$ for a sampled flow step $t\in[0,1]$
  1. The training objective is formulated as the Conditional Flow Matching (CFM) loss:
    (Eq. 1) $\mathcal{L}_{CFM}=\mathbb{E}_{t,q(x_{1}),p(x_{0})}\left\| v_{t}\left(\psi_{t}(x_{0})\right)-\frac{d}{dt}\psi_{t}(x_{0})\right\|^{2}$
    - The probability path connects a sample $x_{0}\sim p(x_{0})$ drawn from Gaussian noise with a sample $x_{1}\sim q(x_{1})$ from the training data
  2. With the Optimal Transport (OT) formulation, the flow $\psi_{t}$ is defined as a straight-line trajectory:
    (Eq. 2) $\psi_{t}(x_{0})=(1-t)x_{0}+tx_{1}$
  3. Differentiating Eq. 2 with respect to $t$ gives the constant velocity $(x_{1}-x_{0})$, so the OT-CFM loss is:
    (Eq. 3) $\mathcal{L}_{CFM}=\mathbb{E}_{t,q(x_{1}),p(x_{0})}\left\| v_{t}\left( (1-t)x_{0}+tx_{1}\right)-(x_{1}-x_{0})\right\|^{2}$
- The paper adopts F5-TTS as the baseline
  1. F5-TTS is a fully flow-matching-based TTS model built on a Diffusion Transformer (DiT), trained on a text-guided speech-infilling task with OT-CFM
  2. That is, given the surrounding speech $(1-m)\odot x_{1}$, the noisy speech $(1-t)x_{0}+tx_{1}$, and the extended character sequence $z$, it predicts the masked speech $m\odot x_{1}$
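
The OT-CFM quantities above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the F5-TTS implementation: the function names `ot_cfm_example` and `masked_cfm_loss` are hypothetical, a flat list of floats stands in for mel-spectrogram frames, and restricting the loss to the masked frames is an assumption of this sketch rather than something stated in the notes above.

```python
import random


def ot_cfm_example(x1, mask, t, rng):
    """Build the inputs and regression target of Eq. 2 and Eq. 3 for one sample.

    x1   : list of floats standing in for mel-spectrogram frames (data sample)
    mask : list of 0/1 flags, 1 marks frames to be infilled (m in the text)
    t    : flow step in [0, 1]
    rng  : random.Random instance used to draw the Gaussian noise x0
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]                  # x0 ~ N(0, 1)
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]   # Eq. 2: noisy speech
    target = [b - a for a, b in zip(x0, x1)]                # velocity x1 - x0
    context = [(1 - m) * b for m, b in zip(mask, x1)]       # (1 - m) * x1
    return x_t, target, context


def masked_cfm_loss(v_pred, target, mask):
    """Squared error of Eq. 3, averaged over masked frames only (assumption)."""
    n_masked = sum(mask)
    err = sum(m * (p - g) ** 2 for p, g, m in zip(v_pred, target, mask))
    return err / max(n_masked, 1)


rng = random.Random(0)
x_t, target, context = ot_cfm_example([1.0, 2.0, 3.0, 4.0], [0, 0, 1, 1], 0.5, rng)
# A real model would predict the velocity from (x_t, context, text, t);
# here a zero prediction is used only to exercise the loss function.
loss = masked_cfm_loss([0.0] * 4, target, [0, 0, 1, 1])
```

In an actual model `v_pred` would come from the DiT conditioned on the noisy speech, the surrounding speech, and the character sequence $z$; the zero prediction here is a placeholder.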

- MMS Forced Alignment

For transcript-free voice cloning, Massively Multilingual Speech (MMS) forced alignment tooling is used to obtain precise word boundary information
- MMS forced alignment uses a Wav2Vec2-based acoustic model trained with Connectionist Temporal Classification (CTC) to produce posterior probabilities for the audio frames
  - To process long audio recordings, MMS chunks the audio file into 15s segments, produces posterior probabilities for each, and concatenates them into a unified alignment matrix
- The paper applies MMS forced alignment to the Emilia dataset and, during training, additionally inputs the resulting word boundaries along with the original speech and text
  1. At each training step, a word boundary is randomly selected and the audio sample is partitioned at it
  2. The audio segment before the selected boundary is used as the audio prompt, and its transcription is completely discarded
    - The remaining audio segment is masked and used as the synthesis target
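
The random-boundary partitioning described above might look roughly like the following. This is a sketch under stated assumptions: `split_at_word_boundary` is a hypothetical helper, the word boundaries are taken as `(start_sec, end_sec)` pairs already produced by a forced aligner, the cut is placed at the end time of the last prompt word, and `frame_rate` (mel frames per second) is an illustrative parameter. It also assumes the utterance has at least two words.

```python
import random


def split_at_word_boundary(words, boundaries, num_frames, frame_rate, rng):
    """Split one training utterance into a transcript-free prompt and a target.

    words      : list of words in the transcript
    boundaries : list of (start_sec, end_sec) per word from forced alignment
    num_frames : number of mel frames in the utterance
    frame_rate : mel frames per second (hypothetical value supplied by caller)
    rng        : random.Random instance
    """
    # Pick a boundary between word k-1 and word k so both sides are non-empty.
    k = rng.randrange(1, len(words))
    cut_sec = boundaries[k - 1][1]                     # end of last prompt word
    cut_frame = min(num_frames, round(cut_sec * frame_rate))
    # Frames before the cut form the audio prompt (mask 0);
    # frames after it are masked and become the synthesis target (mask 1).
    mask = [0] * cut_frame + [1] * (num_frames - cut_frame)
    # Only the text of the target part is kept; the prompt transcript is dropped.
    target_text = " ".join(words[k:])
    return cut_frame, mask, target_text


words = ["hello", "world", "again"]
boundaries = [(0.0, 0.5), (0.6, 1.0), (1.1, 1.5)]
cut, mask, text = split_at_word_boundary(words, boundaries, 150, 100, random.Random(0))
```

Because the prompt words are never returned, the model only ever sees text for the masked region, which mirrors the transcript-free setup described in the notes.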

- Speaking Rate Predictor

The transcript-free training approach eliminates the audio prompt transcript, so the duration estimation of the original F5-TTS can no longer be used
- Cross-Lingual F5-TTS therefore introduces dedicated Speaking Rate Predictors that estimate duration directly from the acoustic characteristics of the audio prompt
- Speaking rate prediction is formulated as discrete classification, and three models are trained for different linguistic granularities: phonemes-per-second, syllables-per-second, and words-per-second
  1. First, the category set $C$ is defined with a uniform interval of $\Delta=0.25$
    - The phoneme-level model uses $C=\{0.25, 0.5,...,17.75, 18.0\}$ with $N=72$ classes, while the syllable-level and word-level models use $C=\{0.25, 0.5,...,7.75, 8.0\}$ with $N=32$ classes
  2. For a speaking rate $v$, the ground-truth category is determined by minimum-distance mapping: $c_{gt}=\arg\min_{x\in C}|v-x|$
  3. Each model then takes audio features as input and independently predicts the speaking rate category for its linguistic unit
- The speaking rate predictor uses a Transformer-based architecture that processes mel-spectrogram input
  1. Structurally, it consists of a mel-projection layer that projects the input mel-spectrogram to the hidden dimension and two 1D convolution layers
  2. Multiple Transformer encoder layers process the sequence, and an attention-based sequence pooling mechanism aggregates temporal information
  3. Finally, a classifier outputs class probabilities over the speaking rate categories
- Standard Cross-Entropy loss is sub-optimal for speaking rate categories, so the paper uses a Gaussian Cross-Entropy (GCE) loss that reflects the ordinal nature of speaking rate:
  (Eq. 4) $\mathcal{L}_{GCE}=-\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C}y_{c}^{soft}\log (\hat{y}_{c})$
  - This formulation assigns higher weight to categories close to the ground truth, providing tolerance for minor prediction errors
  - Here $N$ and $C$ are read as the number of samples and the number of categories, which differs from their use in the category definition above
- The soft labels are computed with a Gaussian kernel:
  (Eq. 5) $y_{c}^{soft}=e^{-\frac{(c-c_{gt})^{2}}{2\sigma^{2}}}$
  - $c_{gt}$ : ground-truth category index, $c$ : current category index, $\sigma$ : Gaussian kernel smoothness
- At inference, the audio prompt is passed to the speaking rate predictor to estimate the characteristic pace in phonemes/syllables/words per second, and the target text is processed to count the corresponding linguistic units
  - The target audio duration is then obtained as the ratio of the linguistic unit count to the predicted speaking rate
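
The category mapping, the Gaussian soft labels, the per-sample GCE term, and the inference-time duration rule can be put together in a small sketch. All function names are hypothetical, `sigma=1.0` is an illustrative value, the soft labels are left unnormalised exactly as Eq. 5 is written (whether the paper normalises them is not stated in these notes), and the predicted probabilities are passed in as plain lists rather than produced by a real predictor.

```python
import math


def make_categories(max_rate, delta=0.25):
    """Category set C = {delta, 2*delta, ..., max_rate}."""
    return [delta * (i + 1) for i in range(round(max_rate / delta))]


def nearest_category(rate, categories):
    """Index of c_gt = argmin over x in C of |rate - x| (first index on ties)."""
    return min(range(len(categories)), key=lambda i: abs(rate - categories[i]))


def gaussian_soft_labels(gt_index, num_classes, sigma=1.0):
    """Eq. 5 evaluated over category indices (unnormalised)."""
    return [math.exp(-((c - gt_index) ** 2) / (2.0 * sigma ** 2))
            for c in range(num_classes)]


def gce_loss(probs, soft_labels):
    """Inner sum of Eq. 4 for one sample: -sum_c y_soft[c] * log(probs[c])."""
    return -sum(y * math.log(max(p, 1e-12)) for y, p in zip(soft_labels, probs))


def estimate_duration(unit_count, predicted_rate):
    """Target duration in seconds = linguistic unit count / units per second."""
    return unit_count / predicted_rate


syllable_cats = make_categories(8.0)            # 32 syllable-level classes
gt = nearest_category(4.3, syllable_cats)       # maps to the 4.25 category
soft = gaussian_soft_labels(gt, len(syllable_cats))
uniform = [1.0 / len(syllable_cats)] * len(syllable_cats)
loss = gce_loss(uniform, soft)
seconds = estimate_duration(10, syllable_cats[gt])  # 10 syllables at 4.25 per second
```

With these soft labels, a prediction that lands one category away from $c_{gt}$ is penalised less than one that lands several categories away, which is the tolerance for minor errors mentioned above.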

3. Experiments

- Settings

Dataset : Emilia
Comparisons : F5-TTS

- Results

On the intra-lingual task, Cross-Lingual F5-TTS performs better than the existing model

It also achieves strong performance on the cross-lingual task

Speaking Rate Predictor
- The M1 predictor performs well on the English dataset, and the M2 predictor on the Chinese dataset
