[Paper Review] iSTFTNet2: Faster and More Lightweight iSTFT-based Neural Vocoder Using 1D-2D CNN

feVeRin 2024. 6. 26. 09:23

iSTFTNet2: Faster and More Lightweight iSTFT-based Neural Vocoder Using 1D-2D CNN

iSTFTNet uses a 1D CNN as its backbone and replaces part of the network with iSTFT, enabling fast, high-quality speech synthesis
- BUT, a 1D CNN struggles to model high-dimensional spectrograms, and the temporal upsampling still leaves room for further speed-up
iSTFTNet2
- Improves iSTFTNet with a 1D-2D CNN that models the temporal and spectral structures separately
- Introduces a 2D CNN that performs the conversion in a few-frequency space and then applies frequency upsampling, so high-dimensional spectrograms are modeled effectively without slowing inference
Paper (INTERSPEECH 2023) : Paper Link

1. Introduction

Text-to-Speech (TTS) is generally built as a two-stage pipeline
- The acoustic model predicts a mel-spectrogram, the intermediate representation, from the input text, and the neural vocoder synthesizes the speech waveform from that representation
  - An autoregressive neural vocoder can synthesize high-fidelity speech, but its sample-by-sample processing makes inference very slow
- Non-autoregressive models were therefore introduced to improve inference speed and parallelization
  1. Representative families are flow-based, diffusion-based, and Generative Adversarial Network (GAN)-based models
  2. GAN-based models in particular offer architectural flexibility and fast inference
- Among GAN-based neural vocoders, iSTFTNet offers the fastest speed and the most lightweight model
  1. Structurally, it uses a lightweight 1D CNN such as HiFi-GAN as the backbone and replaces the output-side neural process with inverse STFT (iSTFT)
  2. Because high-dimensional spectrograms are hard for a 1D CNN to handle, iSTFTNet applies large temporal upsampling to reduce the frequency dimension and then applies iSTFT
    - This gives fast synthesis, but the temporal upsampling still leaves room for further speed-up
- One option is to perform the spectrogram conversion with a 2D CNN
  - BUT, naively applying a 2D CNN makes the computational cost increase linearly with the frequency dimension, so a combined 1D-2D CNN, as in Fre-GAN, is needed

-> Hence iSTFTNet2, which improves the speed of iSTFTNet with a 1D-2D CNN, is proposed
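
The linear growth in cost mentioned above can be checked with a rough multiply-accumulate (MAC) count. This is only a sketch: `conv2d_macs` is a hypothetical helper, it assumes a stride-1 convolution with "same" padding, and the channel, kernel, and frame sizes are illustrative values that do not come from the paper.

```python
def conv2d_macs(c_in, c_out, kernel_f, kernel_t, n_freq, n_frames):
    """Approximate MACs of one stride-1, 'same'-padded 2D convolution.

    Every output position (c_out x n_freq x n_frames) needs
    c_in * kernel_f * kernel_t multiply-accumulates.
    """
    return c_in * c_out * kernel_f * kernel_t * n_freq * n_frames


# Hypothetical layer sizes, chosen only for illustration.
C_IN, C_OUT, K_F, K_T, N_FRAMES = 32, 32, 3, 3, 100

macs_by_freq = {
    n_freq: conv2d_macs(C_IN, C_OUT, K_F, K_T, n_freq, N_FRAMES)
    for n_freq in (8, 16, 32, 64)
}

for n_freq, macs in macs_by_freq.items():
    print(f"{n_freq:>3} frequency bins -> {macs:,} MACs per layer")
```

Under these assumptions, doubling the number of frequency bins doubles the cost of each 2D layer, which is why running the 2D CNN on the full-resolution spectrogram is expensive.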

iSTFTNet2
- A variant of iSTFTNet built on a 1D-2D CNN, where the 1D and 2D CNNs model the global temporal structure and the spectrogram structure, respectively
- The conversion is performed in a few-frequency space using the 1D CNN and a few-frequency 2D CNN, after which frequency upsampling is applied
  - This improves on the high-dimensional spectrogram modeling of the 1D CNN-based iSTFTNet without slowing it down

< Overview of iSTFTNet2 >

Improves iSTFTNet with a 1D-2D CNN that models the temporal and spectral structures separately
As a result, it achieves fewer parameters and faster inference than before without degrading performance

2. Preliminary: Conventional iSTFTNet

iSTFTNet supports fast synthesis by replacing the output-side layers of a fully neural vocoder with a lightweight iSTFT
- It is built on HiFi-GAN, a lightweight 1D CNN vocoder, as the backbone
  - BUT, a 1D CNN has difficulty capturing local structure along the frequency direction, so it struggles to model high-dimensional spectrograms
- iSTFTNet therefore reduces the frequency dimension with the following temporal upsampling:
  (Eq. 1) $\text{iSTFT}(f_{s}, h_{s}, w_{s}) = \text{iSTFT}\left(\frac{f_{1}}{s}, \frac{h_{1}}{s}, \frac{w_{1}}{s}\right)$
  - $f_{s}, h_{s}, w_{s}$ : the FFT size, hop length, and window length required by iSTFT after $\times s$ temporal upsampling
  - (Eq. 1) is based on the time-frequency tradeoff $f_{1} \cdot 1 = f_{s} \cdot s = \text{constant}$, and means that $\times s$ temporal upsampling reduces the frequency dimension by a factor of $s$
- The overall architecture of iSTFTNet is shown in (a) of the figure below
  1. The best tradeoff between speech quality and speed is obtained with iSTFTNet-$\text{C8C8I4}$
    - $\text{C}x$ : a 1D CNN block with $\times x$ temporal upsampling, $\text{I}y$ : iSTFT with $\times y$ upsampling
  2. When speed is the priority, iSTFTNet-$\text{C8C1I32}$, which performs less temporal upsampling, can be used
    - BUT, because of the 1D CNN, this model is limited in modeling high-dimensional spectrograms, so the synthesis quality drops
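
As a worked example of (Eq. 1), the sketch below reads a configuration name such as C8C8I4 and derives the iSTFT parameters after temporal upsampling. The helper names `parse_config` and `istft_params` are hypothetical, and the base values $f_{1} = 1024$, $h_{1} = 256$, $w_{1} = 1024$ are assumed, common mel-spectrogram settings rather than values stated in this post. The check that the total upsampling equals $h_{1}$ is likewise an assumption about how the factors fit together.

```python
import math
import re


def parse_config(name):
    """Split a name such as 'C8C8I4' into 1D CNN factors and the iSTFT factor."""
    blocks = re.findall(r"([CI])(\d+)", name)
    cnn_factors = [int(n) for kind, n in blocks if kind == "C"]
    istft_factors = [int(n) for kind, n in blocks if kind == "I"]
    if len(istft_factors) != 1:
        raise ValueError(f"expected exactly one I-block in {name!r}")
    return cnn_factors, istft_factors[0]


def istft_params(name, f1=1024, h1=256, w1=1024):
    """Apply (Eq. 1): divide FFT size, hop length, and window length by s.

    f1, h1, w1 are assumed base STFT settings (not taken from the post).
    """
    cnn_factors, istft_factor = parse_config(name)
    s = math.prod(cnn_factors)  # total temporal upsampling done by the 1D CNN
    if s * istft_factor != h1:
        # Assumption: CNN upsampling times iSTFT upsampling covers one mel frame.
        raise ValueError("total upsampling must equal the original hop length h1")
    fft_size = f1 // s
    return {
        "s": s,
        "fft_size": fft_size,
        "hop_length": h1 // s,
        "win_length": w1 // s,
        "freq_bins": fft_size // 2 + 1,  # one-sided spectrum
    }


for config in ("C8C8I4", "C8C1I32"):
    print(config, istft_params(config))
```

With these assumed base values, C8C8I4 gives $s = 64$ (FFT size 16, hop length 4, 9 frequency bins), while C8C1I32 gives $s = 8$ (FFT size 128, hop length 32, 65 frequency bins). The second configuration does less temporal upsampling but leaves the 1D CNN with a much higher-dimensional spectrogram to model.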

3. iSTFTNet2

iSTFTNet2 aims to speed up iSTFTNet by using less temporal upsampling while maintaining speech quality
- A fully 2D CNN that captures the local structure of the spectrogram could be used for this, but its computational cost increases linearly with the frequency dimension
- Therefore, as in (b) of the figure below, iSTFTNet2 is built with a 1D-2D CNN instead of a plain 2D CNN
  1. Structurally, the first three modules use a 1D CNN, as in the original iSTFTNet
    - To pass more information to the subsequent 2D CNN, the 1D ResBlock integrates the multi-receptive fusion outputs by channel concatenation instead of addition
  2. A 1D-to-2D conversion is then performed and a 2D CNN is applied to capture the local structure of the spectrogram
- To avoid the extra computational cost of the 2D CNN, the main conversion is performed in a few-frequency space, and frequency upsampling is done in the last phase with a transposed convolution
  - That is, the 2D blocks operate in a space where the frequency dimension is downsampled by a factor of 8
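
The data flow described above can be traced at the level of tensor shapes. This is a rough sketch under several assumptions that are not confirmed by the post: the 1D-to-2D conversion is treated as a plain reshape of channels into (channels, frequency), the 2D blocks keep the shape unchanged, and the channel count, kernel size, and stride are hypothetical. `conv_transpose_out` uses the standard output-length formula of a transposed convolution with dilation 1.

```python
def conv_transpose_out(n_in, kernel, stride, padding=0, output_padding=0):
    """Output length of a transposed convolution along one axis (dilation 1)."""
    return (n_in - 1) * stride - 2 * padding + kernel + output_padding


def trace_shapes(n_frames, c_1d=256, f_few=8, freq_up=8, kernel=8):
    """Trace (channels, [frequency,] time) shapes through the 1D-2D pipeline.

    c_1d, f_few, freq_up and kernel are hypothetical values for illustration.
    """
    if c_1d % f_few != 0:
        raise ValueError("c_1d must be divisible by f_few for the reshape")
    shapes = [("1D CNN output", (c_1d, n_frames))]

    # Assumed 1D-to-2D conversion: fold channels into a small frequency axis.
    c_2d = c_1d // f_few
    shapes.append(("after 1D-to-2D conversion", (c_2d, f_few, n_frames)))

    # The 2D blocks work in the few-frequency space and keep the shape.
    shapes.append(("after 2D blocks", (c_2d, f_few, n_frames)))

    # Frequency upsampling in the last phase via a transposed convolution.
    f_full = conv_transpose_out(f_few, kernel=kernel, stride=freq_up)
    shapes.append(("after frequency upsampling", (c_2d, f_full, n_frames)))
    return shapes


for stage, shape in trace_shapes(n_frames=100):
    print(f"{stage:<28} {shape}")
```

In this sketch the expensive 2D blocks only ever see 8 frequency bins, and the frequency axis grows to 64 bins in a single transposed convolution at the end.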

The 2D blocks are structured as in the figure below
- First, the 2D ResBlock uses a residual connection to propagate information
  - The kernel size and number of channels are adjusted so that the model is faster and more lightweight than the original iSTFTNet-$\text{C8C8I4}$
- In addition, a 2D ShuffleBlock based on ShuffleNet is introduced for a more efficient design
  1. In this block, the number of parameters used in the 2D convolutional layers is adjusted to be half that of the 2D ResBlock
    - Here, unlike a residual connection, half of the channels are propagated directly to preserve model capacity
  2. Channel shuffle is used to provide interaction between the skip and non-skip branches
    - Channel shuffle, channel split, and channel concatenation are weight-free operations, so the block can run faster than the 2D ResBlock
- Accordingly, the paper calls the model with 2D ResBlocks iSTFTNet2-Base and the model with 2D ShuffleBlocks iSTFTNet2-Small
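
The three weight-free operations of the 2D ShuffleBlock can be illustrated on a list of channel labels. This is a minimal sketch of the split / concatenate / shuffle pattern from ShuffleNet, not the paper's implementation: `shuffle_block` is a hypothetical helper, and `transform` merely stands in for the 2D convolutional layers of the non-skip branch.

```python
def channel_split(channels):
    """Split the channels into a skip half and a non-skip half."""
    half = len(channels) // 2
    return channels[:half], channels[half:]


def channel_shuffle(channels, groups=2):
    """Interleave channel groups: view as (groups, n), transpose, flatten."""
    if len(channels) % groups != 0:
        raise ValueError("number of channels must be divisible by groups")
    per_group = len(channels) // groups
    return [
        channels[g * per_group + i]
        for i in range(per_group)
        for g in range(groups)
    ]


def shuffle_block(channels, transform):
    """Skip half passes through untouched, the other half is transformed."""
    skip, non_skip = channel_split(channels)
    processed = [transform(c) for c in non_skip]  # stand-in for 2D conv layers
    merged = skip + processed                      # channel concatenation
    return channel_shuffle(merged, groups=2)


# Upper-casing marks which channels went through the (stand-in) conv branch.
print(shuffle_block(["a", "b", "c", "d"], transform=str.upper))
# ['a', 'C', 'b', 'D']
```

After the shuffle, untouched and transformed channels alternate, so the next block's split sends a mix of both into each branch. That mixing is the interaction between the skip and non-skip branches noted above.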

4. Experiments

- Settings

Dataset : LJSpeech, VCTK
Comparisons : HiFi-GAN, iSTFTNet

- Results

Result on Single Speaker Dataset
- On the LJSpeech dataset, iSTFTNet2 achieves the best performance
- iSTFTNet2 is also the most efficient in terms of RTF and number of parameters

Result on Multiple Speaker Dataset
- On the VCTK dataset as well, iSTFTNet2 provides fast inference and reasonable quality

Application to Multi-band Modeling
- When extended to multi-band modeling, iSTFTNet2 still outperforms the original iSTFTNet
