[Paper Review] VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild

feVeRin 2024. 7. 20. 11:12

VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild

A token-infilling neural codec language model can be built for speech editing and zero-shot text-to-speech
VoiceCraft
- Introduces a token rearrangement procedure that combines a Transformer decoder architecture with causal masking and delayed stacking, so that generation can take place within an existing sequence
- Additionally provides the RealEdit dataset for evaluating speech editing
Paper (ACL 2024) : Paper Link

1. Introduction

By tokenizing speech signals into learnable sequences and training a language model on the resulting unit sequences, NLP tasks can be performed directly on spoken utterances without converting speech to text
- In particular, residual vector quantization (RVQ)-based neural codec language models (NCLMs) such as AudioLM show excellent generation quality on long-term coherent speech continuation
- Zero-shot text-to-speech (TTS) aims to synthesize an unseen voice from a short reference recording of the target voice and a target transcript
  - Notably, VALL-E achieves strong performance with an NCLM that frames zero-shot TTS as transcript-conditioned speech continuation
- Speech editing, on the other hand, aims to modify words or phrases of an utterance so that it matches a target transcript
  1. Earlier approaches combined a single-speaker TTS model with a voice conversion model to generate the desired speech segment, then concatenated it with the unedited part
    - BUT, prosody mismatch and boundary artifacts make the result sound unnatural
  2. More recent models such as UniCATS and VoiceBox improve speech editing by conditioning generation on the surrounding speech context

-> This motivates VoiceCraft, a unified NCLM that can be used for both zero-shot TTS and speech editing

VoiceCraft
- Uses a 2-step token rearrangement procedure consisting of a causal masking step and a delayed stacking step
  1. Causal masking enables autoregressive generation with bidirectional context over the speech codec sequence
  2. Delayed stacking supports efficient multi-codebook modeling
- Additionally, the RealEdit dataset is built to evaluate speech editing

< Overall of VoiceCraft >

A neural codec language model that can perform both speech editing and zero-shot TTS
As a result, it achieves high synthesis quality on each task

2. Method

VoiceCraft rearranges the output tokens of a neural codec so that sequence infilling for speech editing and continuation for zero-shot TTS are both cast as left-to-right language modeling
- The rearrangement consists of 2 steps
  1. Causal Masking : supports autoregressive continuation/infilling with bidirectional context
  2. Delayed Stacking : supports efficient multi-codebook modeling
- VoiceCraft uses a decoder-only transformer and is trained with autoregressive sequence prediction

- Rearrange Step 1: Causal Masking

Given a continuous speech waveform as input, as shown in the figure below, VoiceCraft first quantizes it with EnCodec into a $T \times K$ codec matrix $X$
- With $T$ the number of temporal frames and $K$ the number of RVQ codebooks, $X$ can be written as $(X_1, \ldots, X_T)$
  - Here $X_t$ is a length-$K$ vector holding the codes from the different codebooks at timestep $t$
  - The code from codebook $k$ is assumed to model the residual of codebook $k - 1$
- During training, a token span $(X_{t_0}, \ldots, X_{t_1})$ is randomly masked, and the goal is to autoregressively predict the masked tokens conditioned on all unmasked tokens
  1. When $t_1 < T$, however, the tokens after the span are future outputs, which a standard autoregressive model cannot condition on while generating
  2. The masking of $X$ is therefore made causal by moving the span to be masked to the end of the sequence
    - This lets the model condition on both past and future unmasked tokens when infilling the span
- The procedure extends to multiple masked spans by moving all of them to the end of the sequence
  1. First, the number of spans to mask $n$ is sampled from $\text{Poisson}(\lambda)$, and for each span a length $l \sim \text{Uniform}(1, L)$ is sampled
  2. Span locations in $X$ are then selected at random, under the constraint that the spans do not overlap
  3. The $n$ selected spans are replaced with mask tokens $\langle M_1 \rangle, \ldots, \langle M_n \rangle$
  4. As a result, the original tokens inside the masked spans are moved to the end of sequence $X$, and each span is preceded by its corresponding mask token
- e.g.) Let $X = (X_1, \ldots, X_6)$ and suppose a single span from $X_2$ to $X_4$ is masked
  1. The original sequence $X$ is then rearranged into $Y = (Y_1; \langle M_1 \rangle; Y_2; \langle M_1 \rangle; Y_3)$
    - $Y_1 = (X_1), \ Y_2 = (X_5, X_6), \ Y_3 = (X_2, X_3, X_4)$
  2. Calling $Y_1, Y_2$ the unmasked spans and $Y_3$ the masked span, an end-of-span token $\text{EOS}$ is appended to the end of the masked span $Y_3$, and an end-of-utterance token $\text{EOU}$ is appended to the end of the utterance, i.e. $Y_2$
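
The span sampling and rearrangement described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: frames are treated as opaque items (in the real model each $X_t$ is a length-$K$ code vector), the token strings `<M1>`, `<EOS>`, `<EOU>` and the function names are made up for the example, and rejection sampling for non-overlapping spans is an assumption of this sketch.

```python
import math
import random


def sample_spans(T, lam, L, rng):
    """Sample n ~ Poisson(lam) non-overlapping spans, each of length ~ Uniform(1, L).

    Returns sorted (start, end) pairs, 0-indexed and end-exclusive. Assumes T >= 1.
    Rejection sampling is a choice of this sketch, not taken from the paper.
    """
    # Knuth's method for drawing a Poisson-distributed count.
    n, p, threshold = 0, rng.random(), math.exp(-lam)
    while p > threshold:
        n += 1
        p *= rng.random()
    spans = []
    for _ in range(n):
        for _attempt in range(100):  # give up on a span if no free slot is found
            length = rng.randint(1, min(L, T))
            start = rng.randint(0, T - length)
            end = start + length
            if all(end <= s or start >= e for s, e in spans):
                spans.append((start, end))
                break
    return sorted(spans)


def causal_mask_rearrange(X, spans):
    """Move each masked span to the end, leaving a mask token at its original place."""
    Y, moved, prev = [], [], 0
    for i, (start, end) in enumerate(spans, start=1):
        Y.extend(X[prev:start])   # unmasked frames before this span
        Y.append(f"<M{i}>")       # placeholder at the original location
        moved.append((i, X[start:end]))
        prev = end
    Y.extend(X[prev:])            # trailing unmasked frames
    Y.append("<EOU>")             # end of utterance, after the last unmasked span
    for i, span in moved:
        Y.append(f"<M{i}>")       # the mask token precedes its moved span
        Y.extend(span)
        Y.append("<EOS>")         # end of span
    return Y


X = [f"X{t}" for t in range(1, 7)]
print(causal_mask_rearrange(X, [(1, 4)]))  # masks X2..X4 as in the example above
```

With the single span `(1, 4)` the output places `X1`, the mask token, `X5`, `X6` and the end-of-utterance token first, followed by the mask token again, the moved frames `X2`, `X3`, `X4`, and the end-of-span token.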

- Rearrange Step 2: Delayed Stacking

After the causal-masking token rearrangement, each timestep of the rearranged matrix $Y$ is a vector of $K$ tokens
- MusicGen introduced a delay pattern for autoregressive generation over stacked RVQ tokens, so that the prediction of codebook $k$ at time $t$ can be conditioned on the prediction of codebook $k - 1$ at the same timestep
- Similarly, VoiceCraft assumes a span $Y_s$ has shape $L_s \times K$ and applies the delay pattern to rearrange it into $Z_s = (Z_{s,0}, Z_{s,1}, \ldots, Z_{s,L_s+K-1})$
  1. For $t \in [L_s + K - 1]$, $Z_{s,t}$ is:
    (Eq. 1) $Z_{s,t} = (Y_{s,t,1}, Y_{s,t-1,2}, \ldots, Y_{s,t-K+1,K})$
    - $[N]$ : the integer set $\{0, 1, \ldots, N\}$
  2. $Y_{s,t-k+1,k}$ : the token at coordinate $(t - k + 1, k)$ of matrix $Y_s$
    - i.e., the $k$-th codebook entry at the $(t - k + 1)$-th timestep
- So that $Z_{s,t}$ contains $K$ valid tokens for all $t \in [L_s + K - 1]$, a special learnable $[\text{empty}]$ token is introduced, defining $Y_{s,t-k+1,k} \triangleq [\text{empty}], \ \forall t \in \{s : s < k \cup s - k + 1 > L_s\}$
- Mask tokens are not part of any span and are left unchanged by delayed stacking
  - The resulting matrix after delayed stacking is defined as $Z = (Z_1, \langle M_1 \rangle, Z_2, \langle M_2 \rangle, \ldots, \langle M_{\frac{S-1}{2}} \rangle, Z_S)$
  - $Y$ is assumed to consist of $S$ spans
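
As a rough illustration of the delay pattern, the sketch below shifts codebook $k$ of a single span by $k$ steps and pads the gaps with an empty token. It is a sketch under stated assumptions: indices are 0-based (so entry `Z_s[t][k]` reads `Y_s[t - k][k]`), the output has $L_s + K - 1$ rows, and the names `delayed_stack`, `undo_delayed_stack` and the `"[empty]"` string are hypothetical.

```python
EMPTY = "[empty]"  # stand-in for the learnable [empty] token


def delayed_stack(Y_s, K):
    """Apply the delay pattern to one span.

    Y_s is a list of L_s frames, each a list of K codebook tokens.
    Row t of the result holds codebook k taken from frame t - k (0-based),
    or EMPTY when that frame index falls outside the span.
    """
    L_s = len(Y_s)
    Z_s = []
    for t in range(L_s + K - 1):
        row = []
        for k in range(K):
            src = t - k
            row.append(Y_s[src][k] if 0 <= src < L_s else EMPTY)
        Z_s.append(row)
    return Z_s


def undo_delayed_stack(Z_s, K):
    """Invert delayed_stack, recovering the original L_s x K span."""
    L_s = len(Z_s) - K + 1
    return [[Z_s[t + k][k] for k in range(K)] for t in range(L_s)]


Y_s = [["a1", "a2"], ["b1", "b2"], ["c1", "c2"]]  # L_s = 3 frames, K = 2 codebooks
for row in delayed_stack(Y_s, K=2):
    print(row)
```

Because the second codebook of a frame appears one row later than the first, a model that emits one row per step has already produced codebook 1 of that frame by the time it predicts codebook 2.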

- Modeling

VoiceCraft uses a transformer decoder to model $Z$ autoregressively, conditioned on the transcript $W$ of the speech
- With $;$ denoting the concatenation operator, let the decoder input be $[W; Z]$
  1. At timestep $t$ of span $s$ in the codec matrix $Z$, the model predicts all $K$ tokens of $Z_{s,t}$ simultaneously
  2. That is, $K$ MLP heads project the transformer's final hidden state into $K$ sets of logits, one per codebook
    - The prediction is conditioned on the transcript $W$ and on $H_{s,t}$, all tokens in $Z$ that precede $Z_{s,t}$
- The transformer decoder thus models the factorized conditional distribution of $Z$:
  (Eq. 2) $\mathbb{P}_\theta(Z \mid W) = \prod_s \prod_t \mathbb{P}_\theta(Z_{s,t} \mid W, H_{s,t})$
  (Eq. 3) $= \prod_s \prod_t \prod_{k=1}^{K} \mathbb{P}_\theta(Z_{s,t,k} \mid W, H_{s,t})$
  - $\theta$ : model parameters
- (Eq. 2) is the autoregressive factorization over time, and (Eq. 3) is the factorization over codebooks under an independence assumption
  - The paper assumes that, given $W, H_{s,t}$, the $K$ RVQ codes in $Z_{s,t}$ are independent of each other
- Using the token-level probability formulation of (Eq. 3), the training loss is obtained as the negative log-likelihood:
  (Eq. 4) $\mathcal{L}(\theta) = -\log \mathbb{P}_\theta(Z \mid W) = -\sum_{k=1}^{K} \mathcal{L}_k(\theta)$
  - $\mathcal{L}_k(\theta)$ : the log-likelihood term of codebook $k$, summed over all spans $s$ and timesteps $t$
- Empirically, weighting the first residual codebooks more heavily than the later ones further improves performance, so the final loss is:
  (Eq. 5) $\mathcal{L}(\theta) = -\sum_{k=1}^{K} \alpha_k \mathcal{L}_k(\theta)$
  - $(\alpha_k)_{k=1}^{K}$ : tunable hyperparameters; setting every $\alpha_k = 1$ recovers (Eq. 4)
  - The prediction loss is computed over all tokens except mask tokens and $[\text{empty}]$ tokens
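
The codebook-weighted negative log-likelihood can be illustrated with plain Python. This is only a sketch: it takes the probabilities the model assigned to the correct tokens as given (rather than computing them from logits), the function name and argument layout are hypothetical, and summing instead of averaging over positions is an assumption of the example.

```python
import math


def weighted_codebook_nll(target_probs, valid, alphas):
    """Weighted negative log-likelihood over K codebooks.

    target_probs[i][k]: probability assigned to the correct token of codebook k
                        at sequence position i.
    valid[i][k]:        False where the target is a mask or [empty] token,
                        so that position contributes nothing to the loss.
    alphas[k]:          weight of codebook k.
    """
    K = len(alphas)
    log_lik = [0.0] * K  # per-codebook log-likelihood
    for probs_i, valid_i in zip(target_probs, valid):
        for k in range(K):
            if valid_i[k]:
                log_lik[k] += math.log(probs_i[k])
    return -sum(a * ll for a, ll in zip(alphas, log_lik))


probs = [[0.5, 0.25], [0.5, 0.25]]
valid = [[True, False], [True, True]]  # first position: codebook 2 is [empty]
print(weighted_codebook_nll(probs, valid, alphas=[1.0, 1.0]))
print(weighted_codebook_nll(probs, valid, alphas=[5.0, 1.0]))
```

Raising the first weight makes errors on the first codebook dominate the loss, which is the effect the weighting is meant to have.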

- Inference

Speech Editing
- Given a speech recording $R$ and its transcript $W$, VoiceCraft modifies only the relevant spans of $R$ so that it matches the target transcript $W'$
  - $W'$ is assumed to be a version of $W$ in which some words have been inserted, substituted, or deleted
- The task is similar to training, with the following differences:
  1. During training the input transcript is simply the transcript $W$ of the original recording, whereas at inference the modified transcript $W'$ is used
  2. During training the spans to mask are chosen at random, whereas at inference the words to mask are identified by comparing the original and target transcripts
    - Word-level forced alignment of the original transcript is then used to identify the codec token spans corresponding to the masked words
- To ensure a smooth transition between edited and unedited speech, the words neighboring the span also need to be modified so that co-articulation effects are modeled
  - For this, a small margin hyperparameter $\epsilon$ is used to extend the mask span by $\epsilon$ on both the left and right sides
- During autoregressive generation, VoiceCraft is given the target transcript together with the codec sequence in which mask tokens are inserted at the editing locations, and it fills the masked spans by autoregressively continuing that sequence
  - The generated codec tokens are then spliced back into their correct locations in the utterance, and the EnCodec decoder network maps the full codec token sequence to a waveform
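
One way to picture the span-identification step is to diff the two transcripts at the word level and map the changed words to codec frames. The sketch below uses Python's standard `difflib.SequenceMatcher` for the diff; the `word_frames` alignment input, the function name, the choice to measure the margin `eps` in frames, and the merging of overlapping spans are all assumptions of this example rather than details from the paper.

```python
from difflib import SequenceMatcher


def find_mask_spans(orig_words, target_words, word_frames, eps, T):
    """Return codec-frame spans (start, end), end-exclusive, that need regenerating.

    word_frames[i] is the (start_frame, end_frame) of orig_words[i], as a
    forced aligner might provide. T is the total number of frames.
    """
    spans = []
    matcher = SequenceMatcher(None, orig_words, target_words, autojunk=False)
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        if i1 < i2:  # substitution or deletion: cover the affected original words
            start, end = word_frames[i1][0], word_frames[i2 - 1][1]
        else:        # pure insertion: zero-length span at the word boundary
            start = end = word_frames[i1][0] if i1 < len(word_frames) else T
        # Extend by the margin on both sides, clamped to the utterance.
        spans.append((max(0, start - eps), min(T, end + eps)))
    merged = []
    for s, e in spans:  # merge spans that touch or overlap after extension
        if merged and s <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged


orig = "the quick brown fox".split()
frames = [(0, 10), (10, 25), (25, 40), (40, 55)]
print(find_mask_spans(orig, "the quick red fox".split(), frames, eps=2, T=55))
print(find_mask_spans(orig, "the quick brown fox jumps".split(), frames, eps=2, T=55))
```

The second call shows an insertion at the very end of the utterance, where the span to fill sits at the final frames of the recording.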
Zero-Shot TTS
- Zero-shot TTS in VoiceCraft corresponds to performing an insertion edit at the end of the original utterance
- The model is given the target transcript and a voice prompt together with its transcription
  - These inputs are concatenated, and the codec sequence for the target transcript is generated autoregressively

3. Experiments

- Settings

Dataset : GigaSpeech, LibriTTS, RealEdit (see the table above)
Comparisons : FluentSpeech, VALL-E, XTTS, YourTTS

- Results

Ablation Study
- VoiceCraft's performance improves as model size increases
- Weighting the earlier codebooks more heavily improves intelligibility-related metrics such as WER and MCD, but degrades prosody-related metrics such as $F_0$ and energy
  - The paper selects the 830M model with weights $(5, 1, 0.5, 0.1)$ as the best configuration

Speech Editing Results
- VoiceCraft achieves better speech editing performance than FluentSpeech

VoiceCraft is also preferred in side-by-side comparisons

It also shows no large difference when compared against the original speech

Zero-Shot TTS Results
- VoiceCraft also achieves the best performance in the zero-shot TTS setting
