[Paper Review] Singing Voice Synthesis based on a Musical Note Position-aware Attention Mechanism

feVeRin 2024. 2. 29. 11:00

Singing Voice Synthesis based on a Musical Note Position-aware Attention Mechanism

A sequence-to-sequence model that performs acoustic and temporal modeling simultaneously can be used for Singing Voice Synthesis
Musical Note Position-aware Attention
- Estimates attention weights by considering the rhythm given by the musical score
- Uses the proposed attention mechanism to perform simultaneous modeling in a sequence-to-sequence model and to improve the robustness of temporal modeling
Paper (ICASSP 2023) : Paper Link

1. Introduction

Singing Voice Synthesis (SVS) generates an acoustic feature sequence from a score feature sequence obtained from a musical score
- A singing voice must be strictly synchronized with the given musical score, so modeling the temporal structure is important
  - Unlike Text-to-Speech (TTS), the phoneme duration in a singing voice varies with the note duration even for the same phoneme, which makes temporal modeling particularly difficult
- Typical DNN-based SVS models use acoustic, duration, and time-lag models to model acoustic features and temporal structure
  - Here, the acoustic model maps a time-aligned score feature sequence to an acoustic feature sequence
  - BUT, this approach suffers from alignment-related issues
  1. The modeling performance for acoustic features is affected by alignment errors
  2. It cannot properly model the correlation between acoustic features and temporal structure
    -> As a result, the naturalness and expressiveness of the synthesized singing voice degrade
- When TTS models are adopted, an encoder-decoder structure with an explicit length regulator is mainly used
  - This approach is sensitive to external duration information
- Sequence-to-sequence (Seq2Seq) approaches with an attention mechanism are also frequently used
  - They can model acoustic features and temporal structure simultaneously, but suffer from a timing mismatch between the target musical score and the singing voice
  - This matters because manually correcting the timing in an attention-based model is difficult and impractical

-> Therefore, a note position-aware attention mechanism is introduced to improve attention-based Seq2Seq SVS models

Note Position-aware Attention
- Attention weights are computed based on the note position informed by the musical score
- Additional techniques are introduced to obtain a robust alignment and improve naturalness
  - Auxiliary note feature embedding
  - Guided attention loss
  - Pitch normalization

< Overview of This Paper >

Improves the Seq2Seq SVS model by estimating attention weights that consider the rhythm given by the musical score
As a result, a singing voice with appropriate vocal timing is obtained without additional temporal modeling

2. Method

The proposed method is based on an encoder-decoder model with an attention mechanism
- It generates a frame-level acoustic feature sequence directly from a phoneme-level score feature sequence
- The score feature consists of various musical contexts such as phone, note pitch, note length, key, and tempo

- Musical Note Position-aware Attention Mechanism

The attention mechanism performs a soft selection over the encoder hidden states $x=[x_{1},x_{2},...,x_{N}]$ by computing attention weights at each decoder time-step
- The context vector $c_{t}$ is obtained as $c_{t}=\sum_{n=1}^{N}\alpha_{t}(n)x_{n}$
  - $\alpha_{t}(n)$ : attention weight for the $n$-th hidden state at the $t$-th decoder time-step
- The output vector $o_{t}$ is predicted by the decoder conditioned on the context vector $c_{t}$
To synthesize a singing voice that follows the given musical score, the alignment obtained by the attention mechanism must be monotonic and continuous, without skipping any encoder state
- To this end, the current attention weight $\alpha_{t}(n)$ is computed recursively from the previous alignment:
  (Eq. 1) $\alpha'_{t}(n)=\left( (1-u_{t-1}(n))\alpha_{t-1}(n)+u_{t-1}(n-1)\alpha_{t-1}(n-1)\right)\cdot y_{t}(n)$
  (Eq. 2) $\alpha_{t}(n)=\frac{\alpha'_{t}(n)}{\sum_{m=1}^{N}\alpha'_{t}(m)}$
  - $y_{t}(n)$ : output probability, $u_{t}(n)$ : transition probability
  - $u_{t}(n)$ in (Eq. 1) is a phoneme-dependent, time-variant transition probability
- (Eq. 1) can be viewed as a generalized forward attention with a transition agent
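
The recursion in (Eq. 1) and (Eq. 2) can be sketched for a single decoder step as follows. This is only a minimal illustration in plain Python, not the authors' implementation: the function name `forward_attention_step` and the toy values are assumptions, and in the actual model $y_{t}(n)$ and $u_{t}(n)$ come from learned layers.

```python
def forward_attention_step(prev_alpha, prev_u, y):
    """One decoder step of Eq. 1-2.

    prev_alpha: alpha_{t-1}(n), previous attention weights over N encoder states
    prev_u:     u_{t-1}(n), transition probabilities from the previous step
    y:          y_t(n), output probabilities at the current step
    """
    unnormalized = []
    for n in range(len(prev_alpha)):
        # Probability mass that stays on state n.
        stay = (1.0 - prev_u[n]) * prev_alpha[n]
        # Probability mass that moves in from state n-1 (none for the first state).
        move = prev_u[n - 1] * prev_alpha[n - 1] if n > 0 else 0.0
        unnormalized.append((stay + move) * y[n])
    # Eq. 2: renormalize so the weights sum to one (assumes a non-zero total).
    total = sum(unnormalized)
    return [a / total for a in unnormalized]


# Toy example (hypothetical values): the alignment starts on the first state.
alpha = [1.0, 0.0, 0.0]
u = [0.5, 0.5, 0.5]
y = [1 / 3, 1 / 3, 1 / 3]
alpha = forward_attention_step(alpha, u, y)
```

Because each state only receives mass from itself and its left neighbour, the alignment can advance by at most one encoder state per step, which is what keeps it monotonic and free of skips.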

The alignment in SVS must consider the temporal structure of musical notes, such as note duration and tempo
- Therefore, note positional features are introduced into the computation of $y_{t}(n)$ and $u_{t}(n)$
- The output probability $y_{t}(n)$ is computed with an extended content-based attention that uses the additional musical note position-aware term $U^{(\cdot)}p_{t,n}$:
  (Eq. 3) $e_{t}(n)=v^{(e)\top}\tanh \left(W^{(e)}q_{t}+V^{(e)}x_{n}+U^{(e)}p_{t,n}+b^{(e)}\right)$
  (Eq. 4) $y_{t}(n)=\frac{\exp\left(e_{t}(n)\right)}{\sum_{m=1}^{N}\exp\left(e_{t}(m)\right)}$
  - $q_{t}$ : decoder hidden state at the $t$-th time-step
  - $W^{(\cdot)}q_{t}, V^{(\cdot)}x_{n}$ : query and key terms compared by the attention
  - $b^{(\cdot)}$ : bias term
- The note position embedded feature $p_{t,n}$ is obtained by feeding the note position representation $[p_{t,n}^{1},p_{t,n}^{2},p_{t,n}^{3}]$ into a single tanh hidden layer
- Each note position representation is computed from the note lengths of the given musical score as follows:
  (Eq. 5) $p^{1}_{t,n}=t-s_{n}$
  (Eq. 6) $p^{2}_{t,n}=e_{n}-t$
  (Eq. 7) $p^{3}_{t,n}= \left\{\begin{matrix}
  s_{n}-t, & (t<s_{n}) \\
  0, & (s_{n}\leq t \leq e_{n}) \\
  t-e_{n}, & (e_{n}<t) \\
  \end{matrix}\right.$
  - $s_{n}, e_{n}$ : start and end positions of the $n$-th musical note, respectively
- The transition probability should be derived from the past alignment, so location-sensitive attention is adopted:
  (Eq. 8) $u_{t}(n)=\sigma\left( v^{(u)\top}\tanh (W^{(u)}q_{t}+V^{(u)}x_{n}+U^{(u)}p_{t,n}+T^{(u)}f_{t,n}+b^{(u)})\right)$
  - $\sigma(\cdot)$ : logistic sigmoid function
  - $T^{(u)}f_{t,n}$ : location-sensitive term that uses convolutional features computed from the previous cumulative alignment
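
The note position representation of (Eq. 5) to (Eq. 7) and the normalization of (Eq. 4) can be illustrated with a small plain-Python sketch. The note spans and the stand-in energies below are hypothetical: in the model, the energies $e_{t}(n)$ come from the learned tanh layer of (Eq. 3), which is replaced here by $-p^{3}_{t,n}$ only to show the effect of the position term.

```python
import math


def note_position_features(t, start, end):
    """Return [p1, p2, p3] from Eq. 5-7 for frame t and a note spanning [start, end]."""
    p1 = t - start  # frames elapsed since the note started (negative before it)
    p2 = end - t    # frames left until the note ends (negative after it)
    if t < start:
        p3 = start - t  # distance to the note when t is before it
    elif t > end:
        p3 = t - end    # distance from the note when t is after it
    else:
        p3 = 0          # t lies inside the note
    return [p1, p2, p3]


def softmax(energies):
    """Eq. 4: turn energies e_t(n) into output probabilities y_t(n)."""
    peak = max(energies)  # subtracting the max avoids overflow in exp
    exps = [math.exp(e - peak) for e in energies]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical score: three notes given as (start_frame, end_frame).
notes = [(0, 10), (10, 25), (25, 40)]
t = 12
features = [note_position_features(t, s, e) for s, e in notes]

# Stand-in energies (assumption): -p3 replaces the learned layer of Eq. 3,
# purely to show that notes far from frame t receive low probability.
y = softmax([-p3 for _, _, p3 in features])
```

With these toy values, frame 12 falls inside the second note, so its $p^{3}$ is zero and it receives the largest output probability.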

- Auxiliary Note Feature Embedding

The temporal structure of a singing voice depends on the note context of the input score, so the alignment of the singing voice should be predicted based on that context
- To this end, the note context associated with the current note is embedded into the attention query as an auxiliary note feature
- The auxiliary note feature
  - contains the note-related context obtained by removing the phoneme/mora-related context from the score feature, and
  - is expanded to a frame-level sequence using the note length
  - The upsampled feature is then passed through a single dense layer and concatenated with the Pre-Net output to form the decoder input
- As a result, the current frame position within the note and the note context are conveyed through the attention query, so the attention mechanism can adjust the alignment to the rhythm provided by the musical score
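
The expansion and concatenation steps above can be sketched as follows. This is a simplified illustration under stated assumptions: the helper names and the two-dimensional note features are hypothetical, and the learned dense layer between the two steps is omitted.

```python
def expand_to_frames(note_features, note_lengths):
    """Repeat each note-level feature vector for as many frames as the note lasts."""
    frames = []
    for feature, length in zip(note_features, note_lengths):
        frames.extend([list(feature) for _ in range(length)])
    return frames


def concat_decoder_input(prenet_out, note_embedding):
    """Concatenate the Pre-Net output and the note feature for one decoder step."""
    return list(prenet_out) + list(note_embedding)


# Hypothetical note features: [MIDI pitch, note length in frames].
note_features = [[60, 4], [62, 2]]
note_lengths = [4, 2]
frame_features = expand_to_frames(note_features, note_lengths)

# In the model, a dense layer would embed frame_features[0] first (omitted here).
decoder_input = concat_decoder_input([0.1, 0.2], frame_features[0])
```

After expansion, every decoder step has a note feature of its own, so the query already carries the note context before the attention weights are computed.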

- Guided Attention Loss for SVS

Singing generally follows the rhythm of the musical score, so the alignment should lie close to the path determined by the note timing of the score
- Based on this assumption, a guided attention loss is designed for the SVS model
- Unlike in TTS, the penalty matrix is constructed to suit the SVS model
  - Instead of a matrix whose diagonal elements are zero, as in TTS, the penalty matrix is generated from note boundaries
- To allow robust alignment estimation even when a single note contains multiple morae, the penalty matrix $G \in \mathbb{R}^{N\times T}$ is designed from pseudo-determined mora boundaries
  1. The pseudo-boundaries are obtained by equally dividing the note duration according to the number of morae in each note
  2. The soft matrix is generated by decaying linearly over a width of 60 frames
  3. The start position of each boundary used to construct the penalty matrix is shifted 15 frames forward to account for vocal timing deviation
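
One possible reading of these three steps is sketched below in plain Python. Several details are assumptions rather than facts from the paper: the penalty is taken to be 0 inside each pseudo mora span and to rise linearly to 1 over `width` frames outside it, and the 15-frame shift is interpreted as moving each start position earlier in time. The function names and the toy note spans are hypothetical.

```python
def pseudo_mora_boundaries(note_spans, morae_per_note):
    """Split each note span (start, end) equally among its morae."""
    boundaries = []
    for (start, end), count in zip(note_spans, morae_per_note):
        step = (end - start) / count
        for k in range(count):
            boundaries.append((start + k * step, start + (k + 1) * step))
    return boundaries


def penalty_matrix(boundaries, num_frames, width=60, shift=15):
    """Build G (N x T): 0 inside each shifted span, rising linearly to 1 outside."""
    matrix = []
    for start, end in boundaries:
        start = start - shift  # assumption: the shift moves the start earlier
        row = []
        for t in range(num_frames):
            if t < start:
                distance = start - t
            elif t > end:
                distance = t - end
            else:
                distance = 0.0
            # Linear ramp over `width` frames, capped at the maximum penalty 1.
            row.append(min(distance / width, 1.0))
        matrix.append(row)
    return matrix


# Hypothetical score: two notes of 100 frames; the first one carries two morae.
boundaries = pseudo_mora_boundaries([(0, 100), (100, 200)], [2, 1])
G = penalty_matrix(boundaries, num_frames=200)
```

In this sketch the rows of `G` correspond to morae rather than encoder states in general, which matches the case where each encoder state is one mora-level unit.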

Let $A \in \mathbb{R}^{N \times T}$ be the alignment matrix whose $(i,j)$ element is $\alpha_{j}(i)$, where $T$ is the total number of frames
- The guided attention loss is then:
  (Eq. 9) $\mathcal{L}_{att}(G,A)=\frac{1}{NT}||G\odot A||_{1}$
  - $\odot$ : element-wise product
- Accordingly, the final loss $\mathcal{L}$ of the proposed method is:
  (Eq. 10) $\mathcal{L}=\mathcal{L}_{feat}(o,\hat{o})+\mathcal{L}_{feat}(o,\hat{o}')+\lambda \mathcal{L}_{att}(G,A)$
  (Eq. 11) $\mathcal{L}_{feat}(o,\hat{o})=\frac{1}{TD}\sum_{t=1}^{T}||o_{t}-\hat{o}_{t}||_{2}^{2}$
  - $\hat{o}, \hat{o}'$ : acoustic features predicted by the decoder and the Post-Net, respectively
  - $D$ : number of dimensions of the acoustic feature
  - $\lambda$ in (Eq. 10) is an adjustment parameter for the guided attention loss
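
A worked example of (Eq. 9) to (Eq. 11) on tiny matrices may help to check the normalization factors. The sketch below uses plain Python lists; the function names, the toy values, and the weight `lam=2.0` are illustrative assumptions, since the value of $\lambda$ is not given here.

```python
def guided_attention_loss(penalty, alignment):
    """Eq. 9: L1 norm of the element-wise product G * A, divided by N * T."""
    n_states, n_frames = len(penalty), len(penalty[0])
    total = sum(abs(g * a)
                for g_row, a_row in zip(penalty, alignment)
                for g, a in zip(g_row, a_row))
    return total / (n_states * n_frames)


def feature_loss(target, predicted):
    """Eq. 11: squared error averaged over T frames and D dimensions."""
    n_frames, dim = len(target), len(target[0])
    total = sum((o - o_hat) ** 2
                for o_row, p_row in zip(target, predicted)
                for o, o_hat in zip(o_row, p_row))
    return total / (n_frames * dim)


def total_loss(target, decoder_out, postnet_out, penalty, alignment, lam):
    """Eq. 10: decoder loss + Post-Net loss + weighted guided attention loss."""
    return (feature_loss(target, decoder_out)
            + feature_loss(target, postnet_out)
            + lam * guided_attention_loss(penalty, alignment))


# Toy values (hypothetical): T = 2 frames, D = 2 dimensions, N = 2 states.
target = [[1.0, 2.0], [3.0, 4.0]]
decoder_out = [[1.0, 2.0], [3.0, 2.0]]   # one element is off by 2
postnet_out = [[1.0, 2.0], [3.0, 4.0]]   # perfect prediction
penalty = [[0.0, 1.0], [1.0, 0.0]]
alignment = [[0.9, 0.2], [0.1, 0.8]]
loss = total_loss(target, decoder_out, postnet_out, penalty, alignment, lam=2.0)
```

Here the decoder term is $4/4=1.0$, the Post-Net term is $0$, and the attention term is $(0.2+0.1)/4=0.075$, so the total is $1.0+0+2.0\cdot 0.075=1.15$.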

- Pitch Normalization

The pitch of the synthesized singing voice should follow the note pitch of the musical score
- Pitch normalization is introduced, in which the log fundamental frequency $F_{0}$ is modeled as the difference from the log $F_{0}$ determined by the note pitch
- To process the generated $F_{0}$ sequence frame by frame, pitch normalization requires a time-aligned frame-level note pitch sequence
  - This is obtained by weighting the phone-level input note pitch sequence with the attention weights at each decoder time-step
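
The attention-weighted note pitch described above can be sketched as follows. This is a minimal illustration, assuming the model predicts a residual in the log-$F_{0}$ domain that is added back to the score-derived value; the function names and the example pitches (440 Hz and 494 Hz) are hypothetical.

```python
import math


def frame_level_log_f0(attention_weights, note_log_f0):
    """Attention-weighted sum of the phone-level note log-F0 at one decoder step."""
    return sum(a * p for a, p in zip(attention_weights, note_log_f0))


def denormalize_log_f0(predicted_residual, attention_weights, note_log_f0):
    """Add the predicted residual back onto the score-derived log-F0."""
    return predicted_residual + frame_level_log_f0(attention_weights, note_log_f0)


# Hypothetical input: two phones sung on notes at 440 Hz and 494 Hz.
note_log_f0 = [math.log(440.0), math.log(494.0)]

# Attention fully on the first phone and zero residual reproduces the note pitch.
on_note = math.exp(denormalize_log_f0(0.0, [1.0, 0.0], note_log_f0))

# During a transition the weights are split, giving an intermediate pitch.
between = math.exp(denormalize_log_f0(0.0, [0.5, 0.5], note_log_f0))
```

Because the weighting reuses the attention weights that are already computed at each step, no separate time-aligned pitch sequence has to be prepared in advance.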

3. Experiments

- Settings

Dataset : Japanese Children's Songs

- Results

Experiment 1
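- Compares the effect of temporal structure modeling in terms of MOS
- The proposed method shows the best synthesis quality

In terms of alignment,
- Base and NF fail to obtain a proper alignment
  - This indicates that the proposed note position-aware attention is effective for improving SVS performance
- Prop, which additionally applies the guided attention loss, achieves the best result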

Experiment 2
- Compares the effect of the attention mechanism
- The best synthesis quality is achieved when the proposed position-aware attention is used

In terms of alignment,
- P-Trans yields a monotonic alignment, but its alignment estimation depends heavily on the output probability
- In T-Trans, transitions occur too easily, which causes consonant skipping
- The proposed Prop has more appropriate transition probabilities, so it obtains a more accurate alignment, which in turn improves synthesis quality
