[Paper Review] SSAST: Self-Supervised Audio Spectrogram Transformer
Transformers can be applied to audio tasks.
SSAST
Improves the Audio Spectrogram Transformer through self-supervised learning.
Applies pre-training based on joint discriminative and generative masked spectrogram patch modeling.
Paper (AAAI 2022): Paper Link
1. Introduction
Compared with existing CNN-based models, for pure self-attention-based models such as the Audio Spectrogram Transformer (AST), a large amount of training data..
[Paper Review] EmoVoice: LLM-based Emotional Text-to-Speech Model with Freestyle Text Prompting
Text-to-Speech models are still limited in emotional expression.
EmoVoice
Uses a Large Language Model to support fine-grained, freestyle natural-language emotion control.
Outputs phoneme tokens and audio tokens in parallel to improve content consistency.
Paper (MM 2025): Paper Link
1. Introduction
Emotion-controllable Text-to-Speech (TTS) models, emotion..
[Paper Review] HierSpeech++: Bridging the Gap Between Semantic and Acoustic Representation of Speech by Hierarchical Variational Inference for Zero-Shot Speech Synthesis
Zero-shot speech synthesis is limited in inference speed and robustness.
HierSpeech++
Uses a hierarchical synthesis framework to improve naturalness.
Introduces a Text-to-Vec framework that generates self-supervised/$F0$ representations from a text representation and a prosody prompt, and 16k..
[Paper Review] BridgeVoC: Neural Vocoder with Schrodinger Bridge
Diffusion-based neural vocoders neglect the linear degradation of the mel-spectrogram.
BridgeVoC
Connects a time-frequency domain-based neural vocoder with the Schrodinger Bridge.
Projects the mel-spectrogram onto the target linear-scale domain and treats it as a degraded spectral representation.
Paper (IJCAI 2025): Paper Link
1. Introduction
Neural vocoders, from acoustic features, high-quality waveforms..
[Paper Review] PALLE: Pseudo-Autoregressive Neural Codec Language Models for Efficient Zero-Shot Text-to-Speech Synthesis
In zero-shot Text-to-Speech, autoregressive models are limited in generation speed, and non-autoregressive models are limited in temporal modeling.
PALLE
Introduces a pseudo-autoregressive approach that combines the explicit temporal modeling of autoregressive models with the parallel generation of non-autoregressive models.
Based on a two-stage framework, in the first stage ..
[Paper Review] RNDVoC: Learning Neural Vocoder from Range-Null Space Decomposition
Neural vocoders face a parameter-performance trade-off.
RNDVoC
Bridges Range-Null Decomposition and the vocoder task, decomposing target spectrogram reconstruction into a superimposition of the range-space and the null-space.
Additionally builds a dual-path framework that uses cross-/narrow-band modules for sub-band and sequential modeling.
Paper (IJCAI 2025): Paper Link
1. Introduct..
