[Paper Review] Make-A-Voice: Revisiting Voice Large Language Models as Scalable Multilingual and Multitask Learners

feVeRin 2025. 5. 1. 09:52

Make-A-Voice: Revisiting Voice Large Language Models as Scalable Multilingual and Multitask Learners

Voice Large Language Models are mostly restricted to a single task and a single language
Make-A-Voice
- Builds a scalable learner with an end-to-end local/global multiscale Transformer
- Shares common knowledge and generalizes to unseen tasks, improving in-context learning
- Supports a multilingual learner that addresses data scarcity in low-resource languages
Paper (ACL 2024) : Paper Link

1. Introduction

Voice Large Language Models (LLMs) apply the language modeling task in a discrete representation space
- For example, VALL-E performs language modeling for text-to-speech on top of audio codec codes
  - BUT, most voice LLMs are restricted to a specific purpose: single-task and monolingual
- A voice LLM is therefore needed that supports multiple tasks such as speech and singing voice, and multiple languages with rich-/low-resource data
  - This requires considering cross-task knowledge sharing, generalization ability, and alleviation of data scarcity

-> Hence the paper proposes Make-A-Voice, a multitask, multilingual voice LLM

Make-A-Voice
- Uses self-supervised tokens to build a unified voice generation pipeline
  1. Semantic tokens determine the semantic meaning of the given text/speech
  2. Acoustic tokens provide acoustic information for the various control conditions
- Uses 200K hours of multilingual data to support multiple tasks: Text-To-Speech (TTS), Voice Conversion (VC), Singing Voice Synthesis (SVS), and Singing Voice Conversion (SVC)

< Overview of Make-A-Voice >

A voice LLM that supports multiple tasks and languages
As a result, it outperforms existing methods across various tasks

2. Method

- Voice Representation

Semantic Tokens
- To extract rich linguistic information from the speech signal, the paper uses XLSR-53, a Wav2Vec 2.0-based model pre-trained on 53 languages
- The $k$-means algorithm is then applied to the learned representations of unlabeled speech, which are computed every 20 ms, to obtain $K_{1}$ cluster centroids
- As a result, a speech utterance $y$ is represented by the semantic tokens $[s_{1},s_{2},...,s_{T}];s_{i}\in\{0,1,...,K_{1}-1\},\forall 1\leq i\leq T$
  - $T$ : number of frames
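The tokenization step above can be sketched as a nearest-centroid lookup. This is a minimal illustration in pure Python, assuming the $k$-means centroids have already been fitted; the toy 2-dimensional features, the centroid values, and the function names are hypothetical and stand in for real XLSR-53 features.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize_frames(
    frames: Sequence[Sequence[float]],
    centroids: Sequence[Sequence[float]],
) -> List[int]:
    """Map each frame-level feature to the index of its nearest centroid."""
    tokens = []
    for frame in frames:
        distances = [squared_distance(frame, c) for c in centroids]
        tokens.append(distances.index(min(distances)))
    return tokens


# Toy stand-ins (assumptions): K1 = 3 centroids, T = 4 frames, 2-dim features.
centroids = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
frames = [[0.1, -0.2], [0.9, 1.2], [3.5, 0.1], [1.1, 0.8]]

semantic_tokens = quantize_frames(frames, centroids)
print(semantic_tokens)  # [0, 1, 2, 1]
```

Each resulting index lies in $\{0,...,K_{1}-1\}$ and there is one token per frame, matching the definition of $s_{i}$ above.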
Acoustic Tokens
- The audio encoder $E$ of a codec model such as SoundStream or DAC consists of convolution blocks with a downsampling rate of 320, producing a continuous representation every 20 ms at 16 kHz
- The Residual Vector Quantizer $Q$ uses vector quantization layers to produce the discrete representation $a_{q}$ with codebook size $K_{2}$
- Finally, all codebooks are flattened to obtain the acoustic tokens $[a_{1},a_{2},...,a_{T}];a_{i}\in\{0,1,...,K_{2}-1\},\forall 1\leq i\leq T$ from the speech utterance $y$
  - $T$ : number of frames
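As a rough sketch of the acoustic tokenization above, the following pure-Python example checks the frame rate arithmetic (320 samples per frame at 16 kHz) and runs a toy residual vector quantizer. The codebook values, the two-level depth, and the frame-major flattening order are assumptions made for illustration; they are not taken from SoundStream, DAC, or the paper.

```python
from typing import List, Sequence

SAMPLE_RATE = 16_000  # Hz, as stated in the text
HOP = 320             # encoder downsampling rate, as stated in the text
frame_ms = 1000 * HOP / SAMPLE_RATE  # duration of one frame in milliseconds


def nearest(vec: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    dists = [sum((v - c) ** 2 for v, c in zip(vec, entry)) for entry in codebook]
    return dists.index(min(dists))


def rvq_encode(
    frame: Sequence[float],
    codebooks: Sequence[Sequence[Sequence[float]]],
) -> List[int]:
    """Quantize one frame level by level, each level coding the remaining residual."""
    residual = list(frame)
    codes = []
    for codebook in codebooks:
        idx = nearest(residual, codebook)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


# Toy codebooks (assumptions): 2 quantizer levels, K2 = 2 entries each.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.25, -0.25]],
]
frames = [[1.2, 0.8], [0.1, 0.0]]

codes_per_frame = [rvq_encode(f, codebooks) for f in frames]
# Flatten frame by frame (assumed ordering).
acoustic_tokens = [c for codes in codes_per_frame for c in codes]
print(frame_ms, codes_per_frame, acoustic_tokens)
```

With more quantizer levels the flattened sequence grows proportionally, which is the sequence-length problem the Scalable Architecture section addresses.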

- Make-A-Voice: Controllable Voice LLM

Make-A-Voice casts voice synthesis as a language modeling task over self-supervised tokens
- Voice synthesis is split into semantic modeling and acoustic modeling, which are learned jointly by a decoder-only language model
- In particular, the paper considers the following conditioning mechanisms:
  1. Semantic Modeling
    - The semantic tokens $s$ determine the semantic meaning of the given text/speech
  2. Conditional Acoustic Modeling
    - The acoustic tokens $a$ are guided by control conditions (speaker, emotion, prosody, style) and are learned from self-supervised audio-only data, conditioned on the semantic meaning
- Finally, a unit-based vocoder synthesizes a high-fidelity waveform from the compressed acoustic representation
Zero-Shot TTS/VC
- Given a target text $\mathbf{y}$, the TTS model first determines the semantic tokens $s$, then performs in-context learning using an acoustic prompt $\mathbf{a}_{p}$ derived from a reference utterance
  - During training, two non-overlapping speech windows are randomly selected; one is treated as the prompt and the other as the target
- For VC, the semantic tokens are extracted from HuBERT with a $k$-means model
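The prompt/target selection used during training can be sketched as follows. This is one simple way to draw two non-overlapping windows, not necessarily the sampling scheme the paper uses; the fixed window length, the coin flip that decides which window becomes the prompt, and the function name are assumptions.

```python
import random
from typing import List, Sequence, Tuple


def sample_prompt_and_target(
    tokens: Sequence[int],
    window: int,
    rng: random.Random,
) -> Tuple[List[int], List[int]]:
    """Pick two non-overlapping windows; return them as (prompt, target)."""
    n = len(tokens)
    if window <= 0 or n < 2 * window:
        raise ValueError("need at least two full windows of tokens")
    # The first window starts early enough to leave room for a second one.
    first_start = rng.randrange(0, n - 2 * window + 1)
    # The second window starts at or after the end of the first.
    second_start = rng.randrange(first_start + window, n - window + 1)
    earlier = list(tokens[first_start:first_start + window])
    later = list(tokens[second_start:second_start + window])
    # Assumption: either window may serve as the prompt.
    if rng.random() < 0.5:
        return earlier, later
    return later, earlier


rng = random.Random(0)
acoustic_tokens = list(range(100))  # toy token ids, unique so overlap is easy to check
prompt, target = sample_prompt_and_target(acoustic_tokens, window=30, rng=rng)
print(prompt[0], target[0])
```

Because both windows come from the same utterance, the prompt carries the same speaker and recording conditions as the target, which is what the model learns to copy at inference time from a reference utterance.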
Zero-Shot SVS/SVC
- Singing voice requires accurate rhythm and pitch control guided by a MIDI representation
  - To this end, the fundamental frequency $\mathbf{F}_{0}$ and the phone-level duration are provided to semantic and acoustic modeling, respectively
- Since $\mathbf{F}_{0}$ can be predicted from the MIDI score by a separately-trained neural network, the paper directly takes $\mathbf{F}_{0}$ as the condition signal

- Multitask Learner

As a unified voice synthesis framework, Make-A-Voice is trained on combinations of semantic and acoustic modeling for speech/singing by adjusting the prompt
- The task is communicated to the model by prefixing the input with a tag that specifies it
- As a multitask learner, Make-A-Voice therefore has the following advantages:
  1. A general-purpose interface for various voice generation tasks
  2. Cross-task knowledge sharing
    - Multi-quantization codec modeling shares common knowledge across tasks, improving overall performance
  3. Generalization to new tasks
    - Through in-context learning, it can support tasks it was not explicitly trained on, such as timbre transfer and noise continuation

- Multilingual Learner

A speech model can generate high-quality samples given large-scale training data and an unseen style from an acoustic reference
- BUT, for low-resource languages, synthesis quality is still limited by data scarcity
- Make-A-Voice therefore uses low-resource language data to connect text to semantic meaning, while rich-resource language data covers diverse recording conditions, accents, etc.
  - This lets acoustic diversity carry over effectively to low-resource languages as well
- Structurally, tags that specify the task and the language are prefixed to the given input
  1. e.g.) For text-to-semantic translation of an English utterance, $\text{[EN]}\text{[T2S]}$ is placed in front of the tokenized input
  2. To support cross-lingual use, semantic/acoustic discrete representations are extracted with a multilingual HuBERT and a codec model pre-trained on multiple languages
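The tag prefixing described above can be illustrated with a small helper. Only the `[EN]` and `[T2S]` tags appear in the text; the `[ZH]` and `[S2A]` entries, the dictionary layout, and the function name are hypothetical placeholders.

```python
from typing import Dict, List, Sequence

# [EN] and [T2S] come from the example in the text; other entries are hypothetical.
LANGUAGE_TAGS: Dict[str, str] = {"en": "[EN]", "zh": "[ZH]"}
TASK_TAGS: Dict[str, str] = {"t2s": "[T2S]", "s2a": "[S2A]"}


def build_input(language: str, task: str, tokens: Sequence[str]) -> List[str]:
    """Prefix the tokenized input with a language tag followed by a task tag."""
    return [LANGUAGE_TAGS[language], TASK_TAGS[task], *tokens]


# Toy phoneme-like tokens standing in for a tokenized English utterance.
sequence = build_input("en", "t2s", ["HH", "AH", "L", "OW"])
print(sequence)  # ['[EN]', '[T2S]', 'HH', 'AH', 'L', 'OW']
```

Since the tags are ordinary tokens at the start of the sequence, a single decoder-only model can switch task and language without any architectural change.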

- Scalable Architecture

For scalability, Transformer-based models such as AudioGen and MusicLM represent the audio signal as $n_{q}$ parallel streams of discrete tokens, then flatten the codes into a single sequence of length $T\times n_{q}$ for $T$ frames
- BUT, for extremely long sequences, the quadratic cost of self-attention and the large feed-forward networks make this computationally expensive
- Make-A-Voice $\theta_{AR}$ therefore adopts an end-to-end differentiable multiscale Transformer, as in UniAudio, to predict long sequences
  - This architecture supports sub-quadratic self-attention, reducing the training/generation cost
- Structurally, the token embedding matrix $E_{G}$ maps the integer-valued tokens $x_{0..T}$ to $m$-dimensional embeddings, which are concatenated with the continuous speech representation along the time axis
  1. The concatenated representation is then chunked into $K=\frac{T}{P}$ patches of size $P$,
  2. A large global Transformer module $\theta_{AR}^{\text{global}}$ outputs the patch representations $\mathbf{G}_{o}^{1:K}=\theta_{AR}^{\text{global}}(\mathbf{G}_{i}^{0:K-1})$
  3. A small local Transformer module operates on a single patch of $P$ elements and predicts the next patch as $\mathbf{L}_{o}^{1:K}=\theta_{AR}^{\text{local}}(\mathbf{L}_{i}^{0:K-1}+\mathbf{G}_{o}^{1:K})$
    - Each input element is the sum of the global model output and the previous token embedding
- As a result, Make-A-Voice provides scalable models with 160M (base), 520M (medium), and 1.2B (large) parameters
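To see why chunking into patches helps, the sketch below splits a toy sequence into $K=T/P$ patches and compares a rough count of attention pairs: $T^{2}$ for a flat Transformer versus $K^{2}$ for the global module plus $K\cdot P^{2}$ for the local module. This back-of-the-envelope count ignores feed-forward cost, layer counts, and model widths, and the values $T=1024$, $P=8$ are arbitrary assumptions rather than the paper's settings.

```python
from typing import List, Sequence


def chunk_into_patches(tokens: Sequence[int], patch_size: int) -> List[List[int]]:
    """Split a sequence of length T into K = T / P patches of size P."""
    if patch_size <= 0 or len(tokens) % patch_size != 0:
        raise ValueError("sequence length must be a multiple of the patch size")
    return [list(tokens[i:i + patch_size]) for i in range(0, len(tokens), patch_size)]


def flat_attention_pairs(seq_len: int) -> int:
    """Query-key pairs when attending over the whole flattened sequence."""
    return seq_len * seq_len


def multiscale_attention_pairs(seq_len: int, patch_size: int) -> int:
    """Pairs for global attention over K patches plus local attention inside each patch."""
    num_patches = seq_len // patch_size
    return num_patches ** 2 + num_patches * patch_size ** 2


T, P = 1024, 8  # assumed toy values
patches = chunk_into_patches(list(range(T)), P)
flat_cost = flat_attention_pairs(T)
multiscale_cost = multiscale_attention_pairs(T, P)
print(len(patches), flat_cost, multiscale_cost)  # 128 1048576 24576
```

In this toy setting the multiscale layout needs roughly 43 times fewer attention pairs, which is the intuition behind the sub-quadratic claim above.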

- Reconstructing High-Fidelity Waveforms

The paper uses a unit-based neural vocoder to convert acoustic units into a waveform
- Structurally, it builds on BigVGAN and adopts a multi-resolution discriminator (MRD)
- The generator consists of a look-up table (LUT) that embeds the discrete representation, followed by residual blocks with transposed convolutions and dilated layers
  - The transposed convolutions upsample the encoded representation to match the input sample rate
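The upsampling arithmetic can be checked with the standard output-length formula for a 1-D transposed convolution, $L_{out}=(L_{in}-1)\cdot\text{stride}-2\cdot\text{padding}+\text{kernel}$ (dilation 1, no output padding). The stride list `[10, 8, 4]` and the kernel/padding choices below are hypothetical; they are picked only so that the total upsampling factor equals the 320-sample frame hop mentioned earlier, and they are not the vocoder's actual configuration.

```python
from typing import Sequence


def transposed_conv_output_length(
    length: int, stride: int, kernel_size: int, padding: int
) -> int:
    """Output length of a 1-D transposed convolution (dilation 1, no output padding)."""
    return (length - 1) * stride - 2 * padding + kernel_size


def upsampled_length(num_frames: int, strides: Sequence[int]) -> int:
    """Apply a stack of transposed convolutions, each scaling the length by its stride."""
    length = num_frames
    for stride in strides:
        # kernel = 2 * stride and padding = stride // 2 give exactly length * stride
        # for even strides (an assumed, commonly used configuration).
        length = transposed_conv_output_length(
            length, stride, kernel_size=2 * stride, padding=stride // 2
        )
    return length


UPSAMPLE_STRIDES = [10, 8, 4]  # hypothetical; product is 320
num_frames = 50                # one second of tokens at 20 ms per frame
num_samples = upsampled_length(num_frames, UPSAMPLE_STRIDES)
print(num_samples)  # 16000
```

Fifty frames map back to 16000 samples, i.e. one second of audio at 16 kHz, so the generator output lines up with the original waveform length.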

3. Experiments

- Settings

Dataset : see the table below
Comparisons :
- TTS : YourTTS, GenerSpeech, VALL-E
- VC : NANSY, PPG-VC
- SVS : DiffSinger, FFT-Singer

- Results

Make-A-Voice achieves the best performance on zero-shot TTS

In particular, Make-A-Voice improves as the model size grows

Singing Voice Synthesis
- Make-A-Voice likewise achieves the best performance on the SVS task

Voice Conversion
- Make-A-Voice also performs strongly on the VC task

Multilingual Learner
- Make-A-Voice also performs well on cross-lingual zero-shot TTS

Analysis
- Make-A-Voice performs well even in a low-resource setting with 16 hours of data
- In particular, common knowledge sharing lets it maintain stable performance across different tasks
