[Paper Review] DualVC2: Dynamic Masked Convolution for Unified Streaming and Non-Streaming Voice Conversion

feVeRin 2024. 9. 18. 09:43

DualVC2: Dynamic Masked Convolution for Unified Streaming and Non-Streaming Voice Conversion

The original DualVC relies on a streaming architecture, intra-model knowledge distillation, and hybrid predictive coding for streaming inference
BUT, its autoregressive decoder suffers from error accumulation and limits inference speed
- Causal convolution cannot effectively use the future information available within a chunk
- Noise in unvoiced segments is not handled effectively, which degrades speech quality
DualVC2
- Uses a Conformer-based architecture to support parallel inference
- Adopts non-causal convolution with a dynamic chunk mask so that within-chunk future information is reflected
- Introduces quiet attention to improve noise robustness
Paper (ICASSP 2024) : Paper Link

1. Introduction

Voice Conversion (VC) aims to convert speech into the voice of another speaker while preserving the linguistic content
- VC models are now used in real-time communication (RTC) scenarios such as dubbing and live streaming, so streaming capability is required
  1. Existing VC models such as ACE-VC and VQMIVC operate at the utterance level and convert a whole utterance to the target speaker timbre
    - BUT, despite their high naturalness, these non-streaming models cannot be used in real-time applications
  2. Streaming VC models can process real-time input frame-by-frame or chunk-by-chunk
    - BUT, since future information is unavailable during streaming inference, they show lower VC quality than non-streaming models
- Meanwhile, the original DualVC improves streaming VC performance by combining intra-model distillation with Hybrid Predictive Coding (HPC)
  1. Every convolutional layer is replaced with a dual-mode convolution block, each consisting of 2 parallel basic convolutional layers
    - Causal for streaming mode, non-causal for non-streaming mode
  2. A knowledge distillation loss is then computed between the encoder outputs of the two modes, pulling the hidden representation of the streaming mode toward that of the non-streaming mode
    - This improves the synthesis quality of the streaming mode
  3. In addition, DualVC introduces HPC, which combines contrastive predictive coding and autoregressive predictive coding
    - The HPC module captures common feature structure, which makes it possible to infer future information
- BUT, DualVC has the following limitations:
  1. The autoregressive decoder is limited to frame-by-frame decoding and cannot be parallelized, so latency is high
    - In particular, error accumulation during autoregressive spectrogram generation can degrade VC performance
  2. In chunk-based streaming inference, pure causal convolution cannot fully exploit the future information within the current chunk
  3. Background noise in unvoiced frames is not removed and can leak into the output

-> DualVC2 is therefore proposed, improving on DualVC to provide better inference speed and stability

DualVC2
- Builds on DualVC and adopts a Conformer-based backbone to capture context information effectively and enable parallel inference
- Uses Dynamic Chunk Training to handle chunk sizes with different latencies
  - In particular, introduces dynamic masked convolution to use within-chunk future context effectively and to resolve potential feature discontinuity
- Adopts a quiet attention mechanism to improve noise robustness
  - Additionally applies data augmentation to further improve robustness and intelligibility

< Overview of DualVC2 >

Builds on DualVC and introduces Dynamic Masked Convolution and Quiet Attention
As a result, achieves better conversion performance than the original DualVC and a 70% improvement in inference speed

2. Method

DualVC2 consists of an encoder, a decoder, and an HPC module
- First, the encoder of a pre-trained streaming Automatic Speech Recognition (ASR) model extracts bottleneck features (BNF) from the input mel-spectrogram
  - The BNF is then passed to the DualVC2 encoder, which extracts further context information
- The HPC module helps the encoder learn effective latent representations in an unsupervised manner
- Next, the target speaker embedding extracted by a pre-trained speaker encoder is concatenated to the latent representation and used as the decoder input
- Finally, the decoder generates the spectrogram converted to the target speaker timbre

- Streamable Architecture

DualVC2 is a dual-mode model: it performs VC with full context in non-streaming mode and with limited context in streaming mode
- To this end, the paper uses Dynamic Chunk Training (DCT)
  1. DCT applies a dynamic chunk mask to the attention score matrix of each self-attention layer so that the chunk size varies dynamically
  2. During training, the full sequence is used with probability 50%; otherwise the chunk size is randomized between $1$ (=12.5ms) and $20$ (=250ms)
- Meanwhile, dual-mode convolution consists of 2 parallel basic convolution layers, one for streaming mode and one for non-streaming mode
  - BUT, unlike the original DualVC, the causal convolution layer is replaced with a non-causal convolution with a dynamic mask
- Consequently, following DCT, dual-mode convolution is set to non-streaming mode for full-sequence input and to streaming mode for random-chunk input
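
The chunk sampling and attention masking described above can be sketched in NumPy as follows. This is only an illustrative sketch, not the authors' code: the function names are hypothetical, and it assumes each frame may attend to all past chunks plus its whole current chunk (the review does not state how much history is kept).

```python
import numpy as np

def dynamic_chunk_mask(num_frames, chunk_size):
    """Boolean (T, T) mask; entry [t, i] is True if frame t may attend to frame i."""
    # Index of the chunk each frame belongs to.
    chunk_idx = np.arange(num_frames) // chunk_size
    # Assumption: frame t sees every frame whose chunk is not later than its own,
    # i.e. all past chunks plus the whole current chunk.
    return chunk_idx[None, :] <= chunk_idx[:, None]

def sample_chunk_size(num_frames, rng, p_full=0.5, min_chunk=1, max_chunk=20):
    """Full sequence with probability p_full, else a chunk size in [min_chunk, max_chunk]."""
    if rng.random() < p_full:
        return num_frames  # non-streaming mode: one chunk covers everything
    return int(rng.integers(min_chunk, max_chunk + 1))  # streaming mode

rng = np.random.default_rng(0)
chunk_size = sample_chunk_size(num_frames=6, rng=rng)
print(chunk_size)
print(dynamic_chunk_mask(num_frames=6, chunk_size=2).astype(int))
```

With `chunk_size=2`, frames 0 and 1 see each other but not frame 2, while the last frame sees the whole sequence. Passing `chunk_size=num_frames` reproduces the full-context mask of the non-streaming mode.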

- Dynamic Masked Convolution

Streaming models use causal convolution, whose kernel is shifted to the left, to prevent the receptive field from accessing future frames
- BUT, causal convolution cannot fully exploit within-chunk future context, which degrades performance, so Dynamic Chunk Convolution can be considered instead
  - The model is trained on chunked input, and a non-causal convolution that does not cross the right boundary of the current chunk is used to avoid a training-inference mismatch
- At inference, future information is absent from the final convolutional receptive field of the current chunk but present in the initial convolutional receptive field of the subsequent chunk
  1. This change in the neighboring convolutional input features causes an abrupt discontinuity in the output features between the two chunks
  2. In VC, this feature discontinuity can cause audible clicking sounds
    - Therefore, simply using Dynamic Chunk Convolution in streaming VC is infeasible
- The paper therefore introduces Dynamic Masked Convolution (DMC), a dynamic masking strategy applied to the convolutional input
  1. DMC aims to make non-causal convolution robust to varying amounts of future information, which mitigates clicking sounds between chunks and improves output feature continuity
  2. During training, the last $n$ frames of each convolution operation, which represent future information, are masked to $0$:
    (Eq. 1) $n = \text{rand}(0, \text{kernel}/2)$
- In a standard convolution, the kernel slides along the input sequence to compute the complete output sequence, so a different input mask cannot be applied to each single convolution operation
  1. The paper therefore uses an equivalent 2D convolution that replicates the 1D convolution process
    - The input sequence is expanded by adding an extra axis
  2. The masking procedure is then applied along the additional axis
    - With this dynamic masked convolution, the streaming model can effectively reflect future information within successive chunks
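
The "extra axis" idea can be illustrated with a small NumPy sketch. It is a simplified single-channel stand-in, not the paper's 2D-convolution implementation: `dynamic_masked_conv1d` is a hypothetical name, and drawing a separate $n$ for every window from the inclusive range $[0, \lfloor \text{kernel}/2 \rfloor]$ is an assumption about how Eq. 1 is applied.

```python
import numpy as np

def dynamic_masked_conv1d(x, kernel, rng=None):
    """x: (T,) float input, kernel: (K,) with odd K. Returns the (T,) output."""
    k = len(kernel)
    half = k // 2
    # Zero-pad so every frame gets a centered (non-causal) window.
    padded = np.pad(x, (half, half))
    # Extra axis: row t holds the K input frames used to compute output frame t.
    windows = np.lib.stride_tricks.sliding_window_view(padded, k).copy()  # (T, K)
    if rng is not None:
        # Eq. 1 (assumed per window): number of future frames to hide.
        n = rng.integers(0, half + 1, size=len(x))
        # Zero the last n columns of each row, i.e. that window's future frames.
        future = np.arange(k)[None, :] >= (k - n)[:, None]
        windows[future] = 0.0
    # One dot product per row is one convolution operation.
    return windows @ kernel

x = np.arange(8, dtype=float)
kernel = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
print(dynamic_masked_conv1d(x, kernel))                                # unmasked
print(dynamic_masked_conv1d(x, kernel, rng=np.random.default_rng(0)))  # training-style masking
```

Because the windows are materialized along their own axis, each row can receive its own mask, which a kernel sliding over the raw sequence does not allow. Without `rng` the function reduces to an ordinary zero-padded non-causal convolution.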

- Quiet Attention

In the standard self-attention mechanism,
- The attention score matrix $W^{T \times T}$ is computed with the softmax function:
  (Eq. 2) $\hat{w}_{ti} = \text{Softmax}(w_{ti}) = \frac{\exp(w_{ti})}{\sum_{n=1}^{T} \exp(w_{tn})}$
  - Here the weights at time step $t$ are $\hat{w}_{ti} \in W$, normalized so that they sum to 1
- The pre-trained ASR model removes most of the noise, but some may remain, so DualVC2 aims to improve robustness to the remaining noise interference
  1. For a noisy unvoiced frame at time $t$, the attention calculation should contribute no information
  2. However, even if every $w_{ti}$ tends to negative infinity at the same rate, the output probability $\hat{w}_{ti}$ is still computed as $\frac{1}{T}$
    - To address this, the paper adopts quiet attention
- Quiet attention amounts to adding an escape mechanism in the negative orthant:
  (Eq. 3) $\text{Softmax}_1(w_{ti}) = \frac{\exp(w_{ti})}{1 + \sum_{n=1}^{T} \exp(w_{tn})}$
  - With quiet attention, information arising from unvoiced frames can be ignored, so residual noise can be removed
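
The difference between Eq. 2 and Eq. 3 is easy to check numerically. The following is a minimal NumPy sketch (the function names and the stabilizing shift are my own choices, not taken from the paper) that compares both on a row of uniformly very negative scores.

```python
import numpy as np

def softmax(w):
    """Eq. 2: standard softmax over the last axis."""
    e = np.exp(w - w.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def softmax1(w):
    """Eq. 3: softmax with an extra 1 in the denominator."""
    # Shift by m = max(max(w), 0) for numerical stability; the "+1" becomes exp(-m).
    m = np.maximum(w.max(axis=-1, keepdims=True), 0.0)
    e = np.exp(w - m)
    return e / (np.exp(-m) + e.sum(axis=-1, keepdims=True))

quiet_scores = np.full(4, -20.0)  # a frame that should attend to nothing
print(softmax(quiet_scores))      # each weight is 1/T = 0.25
print(softmax1(quiet_scores))     # all weights are close to 0
```

With the standard softmax the row is forced to sum to 1, so an uninformative frame still mixes in its context with weight $\frac{1}{T}$ each. With $\text{Softmax}_1$ the row sum can fall toward 0, which is the escape mechanism described above.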

- Data Augmentation

Compared with the training dataset, real recordings contain background noise, reverberation, and so on
- In particular, the speaking style in the training data is scripted reading, whereas real conversation differs substantially from it, which makes it hard to maintain model intelligibility
- The paper therefore introduces data augmentation to address this
  - Noise from the MUSAN noise dataset is added, and random reverberation and tempo augmentation are applied
  - The augmentation is performed with WavAugment

3. Experiments

- Settings

Dataset : AISHELL-3
Comparisons : DualVC, IBF-VC

- Results

Overall, DualVC2 shows the best synthesis performance

Ablation Study
- Removing each component leads to a degradation in MOS and CER
- In particular, comparing the mel-spectrograms of the ablated models shows that noise and vertical lines appear

Computational Efficiency Evaluation
- System latency is obtained as:
  (Eq. 4) $\text{Latency} = \text{chunksize} \times (1 + \text{RTF})$
- With a 160ms chunk size and 26.4ms of inference time per chunk (RTF = 26.4 / 160 = 0.165), the total latency of DualVC2 is 186.4ms
  - This is far more efficient than the original DualVC, which has 0.58 RTF, 41M parameters, and 252.8ms latency
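
The latency figures can be reproduced directly from Eq. 4. In the short Python sketch below, `system_latency_ms` is a hypothetical helper name, and the 160ms chunk for DualVC is an assumption that is consistent with its reported 0.58 RTF and 252.8ms latency.

```python
def system_latency_ms(chunk_ms, rtf):
    """Eq. 4: wait for one chunk of audio, then spend chunk_ms * rtf processing it."""
    return chunk_ms * (1.0 + rtf)

chunk_ms = 160.0
dualvc2_rtf = 26.4 / chunk_ms  # inference time per chunk divided by chunk length
dualvc_rtf = 0.58

print(round(system_latency_ms(chunk_ms, dualvc2_rtf), 1))  # DualVC2
print(round(system_latency_ms(chunk_ms, dualvc_rtf), 1))   # DualVC
```

The chunk length itself dominates the total latency in both systems, so the RTF reduction only shrinks the processing term on top of the fixed 160ms buffering delay.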
