[Paper Review] DenoiSpeech: Denoising Text to Speech with Frame-level Noise Modeling

feVeRin 2024. 6. 7. 09:28

DenoiSpeech: Denoising Text to Speech with Frame-level Noise Modeling

Training a Text-to-Speech model requires high-quality speech data, but most available data contains noisy speech
DenoiSpeech
- A Text-to-Speech model that can synthesize clean speech from noisy speech data
- Handles real-world noisy speech by modeling fine-grained frame-level noise with a noise condition module that is trained jointly with the model
Paper (ICASSP 2021) : Paper Link

1. Introduction

Text-to-Speech (TTS) aims to synthesize natural, intelligible voice from text
- Training a neural TTS model in particular requires a large amount of clean speech data, but collecting it is limited by cost and other factors
- Daily conversations and public talks, on the other hand, are easy to collect but contain a lot of noisy speech
  1. To address this, the straightforward existing approach denoises the data with a pre-trained speech enhancement model and then trains the TTS model on the result
    - This works well for simple noise, but performance degrades when the noise distribution differs or is complex
  2. For more effective training, it is better to train the TTS model directly on noisy speech data
    - In this case, a noise embedding can be used as a condition so that denoising happens at inference time
    - However, such methods generally use an utterance-level vector, so they cannot capture complex noise patterns that change significantly along the time dimension

-> Therefore, the paper proposes DenoiSpeech, which applies fine-grained frame-level noise modeling to handle noisy speech effectively

DenoiSpeech
- Introduces a noise condition module: a noise extractor extracts frame-level noise information from the target speaker's noisy speech, and the extracted noise information is used as an input to the TTS decoder
- The noise extractor is trained jointly with the TTS loss and an adversarial CTC loss so that it extracts only noise information

< Overview of DenoiSpeech >

A noise condition module trained jointly with the TTS model captures fine-grained frame-level noise, so the model can learn effectively from real-world noisy speech
As a result, it outperforms the existing utterance-level and speech enhancement approaches

2. Method

- Model Overview

DenoiSpeech is built on FastSpeech2
- The phoneme encoder converts the phoneme embeddings into a hidden sequence, and the length regulator extends that sequence to the same length as the mel-spectrogram sequence
- Next, the noise condition module extracts a noise condition from the noisy speech and adds it to the hidden sequence, and pitch is added to the hidden sequence through the pitch predictor
- The mel-spectrogram decoder converts the hidden sequence into the mel-spectrogram sequence in parallel
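
The length regulator step above can be sketched as follows. This is a minimal illustration in plain Python, not the paper's implementation: the function name `length_regulate`, the toy hidden vectors, and the duration values are all hypothetical, and it assumes that each phoneme comes with an integer duration measured in mel frames.

```python
from typing import List, Sequence


def length_regulate(hidden: Sequence[Sequence[float]],
                    durations: Sequence[int]) -> List[List[float]]:
    """Repeat each phoneme-level hidden vector `durations[i]` times.

    The output has sum(durations) rows, i.e. one row per mel frame.
    """
    if len(hidden) != len(durations):
        raise ValueError("one duration per phoneme is required")
    frames: List[List[float]] = []
    for vec, dur in zip(hidden, durations):
        if dur < 0:
            raise ValueError("durations must be non-negative")
        # Copy the vector so that frames do not share the same list object.
        frames.extend(list(vec) for _ in range(dur))
    return frames


# Hypothetical toy input: 3 phonemes with 2-dimensional hidden states.
phoneme_hidden = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
durations = [2, 1, 3]  # assumed frame counts per phoneme
frame_hidden = length_regulate(phoneme_hidden, durations)
print(len(frame_hidden))
```

After this expansion the hidden sequence and the mel-spectrogram share one time axis, which is what allows a per-frame noise condition to be added to it.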

- Noise Condition Module

The noise condition module captures noise information from noisy speech and feeds it to the TTS model as an input
- First, the noise extractor extracts the background noise audio, and the noise encoder converts it into a noise condition
- The noise condition is then added to the hidden sequence so that the TTS model is aware of the noise information
- In addition, an adversarial CTC sub-module is introduced so that only noise information, without text, is extracted
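
The "added to the hidden sequence" step can be sketched as a per-frame element-wise sum. This is only an illustration under assumptions: the helper names `add_condition` and `broadcast_utterance_level` and the toy values are hypothetical, and the utterance-level variant is included solely to contrast it with the frame-level condition discussed in the Introduction.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def add_condition(hidden: Matrix, condition: Matrix) -> Matrix:
    """Add a noise condition to the hidden sequence, frame by frame."""
    if len(hidden) != len(condition):
        raise ValueError("condition must have one vector per frame")
    out: Matrix = []
    for h, c in zip(hidden, condition):
        if len(h) != len(c):
            raise ValueError("hidden and condition dimensions differ")
        out.append([a + b for a, b in zip(h, c)])
    return out


def broadcast_utterance_level(vector: Sequence[float], num_frames: int) -> Matrix:
    """Repeat one utterance-level vector for every frame (no time variation)."""
    return [list(vector) for _ in range(num_frames)]


# Hypothetical toy data: 4 frames, 2-dimensional hidden states.
hidden = [[1.0, 2.0] for _ in range(4)]
# Frame-level condition: the noise is present only in the second frame.
frame_cond = [[0.0, 0.0], [0.5, 0.5], [0.0, 0.0], [0.0, 0.0]]
# Utterance-level condition: the time average of the same noise.
utt_vec = [sum(f[d] for f in frame_cond) / len(frame_cond) for d in range(2)]

with_frame_cond = add_condition(hidden, frame_cond)
with_utt_cond = add_condition(hidden, broadcast_utterance_level(utt_vec, 4))
print(with_frame_cond)
print(with_utt_cond)
```

In the frame-level result only the noisy frame changes, whereas the utterance-level result shifts every frame by the same amount and no longer indicates where the noise occurred.
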
Noise Extractor
- The noise extractor aims to extract the noise audio $y'$ from the noisy speech $y$
- It uses the U-Net architecture that is commonly used in speech enhancement
  1. Structurally, the noise extractor has 4 DownConv blocks and 4 UpConv blocks
  2. Each DownConv/UpConv block consists of a downsampling/upsampling layer followed by two $3\times 3$ 2D convolution layers
    - Each layer is followed by a ReLU activation and a batch normalization layer
- The noise extractor is trained on the following two kinds of data
  1. Paired noisy data $(y_{p}, y'_{p})$
    - $y'_{p}$ : noise sequence, $y_{p}$ : noisy speech sequence (obtained by mixing clean speech with the noise $y'_{p}$)
  2. Unpaired noisy data $y_{u}$
    - Noisy speech without a paired noise signal (it can be artificial or real-world noisy data)
- As in [Algorithm 1], the noise extractor is trained jointly with the TTS model so that the unpaired noisy data can also be used for training
  - This provides an end-to-end gradient that optimizes noise extraction
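
How a paired sample $(y_{p}, y'_{p})$ could be built from clean speech and a noise recording is sketched below. The notes only say that the two are mixed, so scaling the noise to a target signal-to-noise ratio is an assumption of this example, as are the function name `mix_at_snr`, the `snr_db` parameter, and the toy waveforms.

```python
import math
from typing import List, Sequence, Tuple


def mix_at_snr(clean: Sequence[float], noise: Sequence[float],
               snr_db: float) -> Tuple[List[float], List[float]]:
    """Return (noisy speech y_p, scaled noise y'_p) for a target SNR in dB."""
    if len(clean) != len(noise):
        raise ValueError("clean speech and noise must have the same length")
    clean_power = sum(x * x for x in clean) / len(clean)
    noise_power = sum(x * x for x in noise) / len(noise)
    if clean_power == 0.0 or noise_power == 0.0:
        raise ValueError("both signals must have non-zero power")
    # SNR(dB) = 10 * log10(clean_power / noise_power) after scaling.
    target_noise_power = clean_power / (10.0 ** (snr_db / 10.0))
    gain = math.sqrt(target_noise_power / noise_power)
    scaled_noise = [gain * n for n in noise]
    noisy = [c + n for c, n in zip(clean, scaled_noise)]
    return noisy, scaled_noise


# Hypothetical toy waveforms (not real audio).
clean_speech = [1.0, -1.0, 1.0, -1.0]
raw_noise = [0.5, 0.5, -0.5, -0.5]
y_p, y_noise_p = mix_at_snr(clean_speech, raw_noise, snr_db=10.0)
print(y_p)
```

The scaled noise is returned together with the mixture because it, rather than the raw recording, is the ground truth the noise extractor would be asked to recover from `y_p`.
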
Noise Encoder
- The noise encoder converts the extracted noise audio into a fine-grained frame-level noise condition
- The noise audio is encoded into a noise condition, which is added to the hidden sequence and passed on to the pitch predictor
  - The fine-grained frame-level noise condition has the same length as the output mel-spectrogram, so it can describe the noise information at every timestep
  - As a result, DenoiSpeech can handle noise that varies strongly over time
- For paired noisy speech, the ground-truth noise audio $y'_{p}$ is used as the noise encoder input; for unpaired noisy data, the noise audio extracted by the noise extractor is used
  - In addition, clean speech is used to train the TTS model
  - For clean speech, silence audio of the same length as the clean speech is used as the noise encoder input
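
The three input cases above can be summarized in a small routing function. This is a sketch under assumptions: the labels `"paired"`, `"unpaired"`, and `"clean"`, the function name `noise_encoder_input`, and the stand-in `fake_extractor` (which is not a U-Net, only a placeholder callable) are hypothetical and chosen for illustration.

```python
from typing import Callable, List, Optional

Audio = List[float]


def noise_encoder_input(kind: str,
                        speech: Audio,
                        gt_noise: Optional[Audio],
                        extractor: Callable[[Audio], Audio]) -> Audio:
    """Pick the audio that is fed to the noise encoder for one training sample."""
    if kind == "paired":
        # Paired noisy data: the ground-truth noise y'_p is available.
        if gt_noise is None:
            raise ValueError("paired data requires ground-truth noise")
        return gt_noise
    if kind == "unpaired":
        # Unpaired noisy data: rely on the jointly trained noise extractor.
        return extractor(speech)
    if kind == "clean":
        # Clean speech: silence with the same length as the speech.
        return [0.0] * len(speech)
    raise ValueError(f"unknown data kind: {kind}")


def fake_extractor(y: Audio) -> Audio:
    """Placeholder for the noise extractor (assumption, not a real model)."""
    return [0.5 * s for s in y]


speech = [0.5, -0.25, 1.0]
paired_in = noise_encoder_input("paired", speech, [0.1, 0.0, -0.1], fake_extractor)
unpaired_in = noise_encoder_input("unpaired", speech, None, fake_extractor)
clean_in = noise_encoder_input("clean", speech, None, fake_extractor)
print(paired_in, unpaired_in, clean_in)
```

Only the unpaired branch calls the extractor, so it is the branch through which the TTS loss can reach the extractor during joint training.
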
Adversarial CTC Module
- If the noise obtained from the noise extractor contains text information, information leakage occurs and training the TTS model becomes difficult
- DenoiSpeech therefore adds an adversarial CTC module so that the noise extractor produces only noise information and no text information
  - The noise audio extracted from unpaired noisy speech data is passed through a Gradient Reverse Layer (GRL) and a CTC encoder to compute the adversarial CTC loss
  - This forces the extracted noise audio to contain no speech content information
- The CTC encoder is a Transformer encoder with an additional linear-softmax layer that projects the hidden states to a character-level output distribution
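
The behaviour of a gradient reversal layer can be sketched without any deep learning framework. This is a minimal illustration, not the paper's implementation: the class name `GradientReversal`, the `scale` parameter, and the explicit `forward`/`backward` methods are assumptions made for this example.

```python
from typing import List, Sequence


class GradientReversal:
    """Identity in the forward pass, negated (and scaled) gradient backward."""

    def __init__(self, scale: float = 1.0) -> None:
        self.scale = scale  # assumed weighting of the adversarial signal

    def forward(self, x: Sequence[float]) -> List[float]:
        # The CTC encoder sees the extracted noise unchanged.
        return list(x)

    def backward(self, grad_output: Sequence[float]) -> List[float]:
        # The extractor receives the opposite of the CTC gradient, so a step
        # that lowers the CTC loss for the encoder raises it for the extractor.
        return [-self.scale * g for g in grad_output]


grl = GradientReversal(scale=0.5)
features = grl.forward([1.0, -2.0])
grad_to_extractor = grl.backward([1.0, -2.0])
print(features, grad_to_extractor)
```

Under this scheme the CTC encoder still tries to recognize text from the extracted noise, while the reversed gradient pushes the extractor toward outputs from which text cannot be recognized.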

- Training and Inference

The training/inference procedure of DenoiSpeech follows [Algorithm 1]

3. Experiments

- Settings

Dataset : VCTK + Noisy real-world dataset (internal)
Comparisons : FastSpeech2
- Augmented-Adversarial FastSpeech2
- Enhancement-based FastSpeech2

- Results

In terms of MOS, DenoiSpeech outperforms the other FastSpeech2-based models

The frame-level noise condition achieves a better CMOS than the utterance-level one
- That is, a fine-grained noise condition describes the noise information better

CMOS comparison by noise condition granularity

Fixing the noise extractor during training, on the other hand, degrades synthesis quality
- That is, jointly training the noise extractor and the TTS model gives better generalization

In addition, using the adversarial CTC module improves quality
