Everyone-Can-Sing: Zero-Shot Singing Voice Synthesis and Conversion with Speech Reference
A unified framework is needed to support cross-domain singing voice synthesis.
Everyone-Can-Sing
- Supports control over multiple aspects: language content from the lyrics, performance attributes from the musical score, singing style, and vocal technique
- Leverages pre-trained content embeddings and a diffusion-based generator
Paper (ICASSP 2025): Paper Link
DPI-TTS: Directional Patch Interaction for Fast-Converging and Style Temporal Modeling in Text-to-Speech
Diffusion Transformer-based speech models treat the mel-spectrogram as a generic image.
DPI-TTS
- Improves naturalness by applying a low-to-high-frequency, frame-by-frame progressive inference approach on top of the Diffusion Transformer
- Introduces fine-grained style temporal modeling to improve speaker style similarity
Paper (ICASSP 2025): Paper Link
LightHuBERT: Lightweight and Configurable Speech Representation Learning with Once-for-All Hidden-Unit BERT
Self-supervised representation learning is hard to deploy in low-resource settings because of storage-intensive Transformers.
LightHuBERT
- Prunes structured parameters using a Once-for-All Transformer compression framework
- Transfers HuBERT's contextualized latent representations through two-stage distillation
Paper (INTERSPEECH 2022): Paper Link
Factorized-VITS: Decoding Prosody and Text in End-to-End Speech Synthesis without External or Secondary Aligner
Incorporating explicit text-side prosody modeling can improve end-to-end text-to-speech performance.
Factorized-VITS
- Cleanly factorizes the audio prior hidden space into text and prosody subspaces
- Performs on-the-fly alignment in the factorized text subspace without extra parameters
Paper (ICASSP 2025): Paper Link
VALL-T: Decoder-Only Generative Transducer for Robust and Decoding-Controllable Text-to-Speech
Decoder-only text-to-speech models lack a monotonic alignment constraint, leading to mispronunciation, word skipping, and word repetition.
VALL-T
- Keeps the decoder-only Transformer while introducing relative position embeddings over the input phoneme sequence
- Explicitly indicates the monotonic generation process, improving robustness in zero-shot text-to-speech
WavTokenizer: An Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling
Language models rely on tokenizers that compress high-dimensional natural signals into lower-dimensional discrete tokens.
WavTokenizer
- Compresses the quantizer layers and the temporal dimension of the discrete codec
- Achieves better reconstruction quality and richer semantics through a broader VQ space, an extended contextual window, and an inverse Fourier transform structure
SpeechTokenizer: Unified Speech Tokenizer for Speech Language Models
Speech language models are built on discrete speech representations such as semantic and acoustic tokens.
SpeechTokenizer
- Introduces SLMTokBench to evaluate whether speech tokens are suitable for speech language models
- Builds a unified speech tokenizer with an encoder-decoder architecture based on Residual Vector Quantization
Paper (ICLR 2024): Paper Link
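The Residual Vector Quantization (RVQ) mechanism underlying SpeechTokenizer can be illustrated with a minimal sketch: each quantizer stage codes the residual left over by the previous stage, so codes are ordered from coarse to fine. The codebook sizes, dimensions, and random initialization below are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Toy residual vector quantization: each stage quantizes the
    residual of the previous stage with a nearest-neighbor lookup."""
    residual = x.copy()
    quantized = np.zeros_like(x)
    codes = []
    for cb in codebooks:
        # distance from each frame's residual to every codeword: (T, K)
        d = np.linalg.norm(residual[:, None, :] - cb[None, :, :], axis=-1)
        idx = d.argmin(axis=1)      # nearest codeword index per frame
        q = cb[idx]                 # selected codewords: (T, D)
        codes.append(idx)
        quantized += q              # running sum over stages
        residual -= q               # next stage sees what is left
    return codes, quantized

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                                # 4 frames, 8-dim
codebooks = [rng.normal(size=(16, 8)) for _ in range(3)]   # 3-stage RVQ
codes, xq = rvq_encode(x, codebooks)
```

In SpeechTokenizer the stages are additionally specialized (the first quantizer is distilled toward semantic content, the rest capture acoustic detail); this sketch only shows the residual coding structure itself.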
AdaptVC: High Quality Voice Conversion with Adaptive Learning
Voice conversion requires extracting disentangled linguistic content from the source and voice style from the reference.
AdaptVC
- Tunes self-supervised speech features with adapters to effectively disentangle content and speaker
- Improves synthesis quality with cross-attention speaker conditioning and conditional flow matching
Paper (ICASSP 2025): Paper Link
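Conditional flow matching, named in the second bullet, trains a network to regress a target velocity field along a path from noise to data. A minimal sketch of the common straight-line (optimal-transport) variant is below; the shapes and the linear path are generic illustrations, not AdaptVC's exact formulation.

```python
import numpy as np

def cfm_pair(x0, x1, t):
    """Straight-line conditional flow matching targets:
    x_t = (1 - t) * x0 + t * x1, and the regression target for the
    learned vector field is the constant velocity u = x1 - x0."""
    xt = (1.0 - t) * x0 + t * x1
    u = x1 - x0
    return xt, u

rng = np.random.default_rng(0)
x0 = rng.normal(size=(2, 4))   # noise sample
x1 = rng.normal(size=(2, 4))   # data sample (e.g. spectrogram frames)
xt, u = cfm_pair(x0, x1, 0.5)  # training pair at t = 0.5
```

During training, a network v(xt, t, condition) is fit to u with an L2 loss; at inference, integrating the learned field from noise produces the converted speech features.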
FitHuBERT: Going Thinner and Deeper for Knowledge Distillation of Speech Self-Supervised Learning
Self-supervised learning is limited by its computational cost.
FitHuBERT
- Improves inference time with a Time-Reduction layer
- Prevents performance degradation through hint-based distillation
Paper (INTERSPEECH 2022): Paper Link
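Hint-based distillation matches a thin student's intermediate representations to the teacher's, with a small projection bridging the dimension gap. The sketch below shows the core loss under assumed dimensions (256-dim student, 768-dim HuBERT-style teacher); the names and the plain L2 objective are illustrative, not FitHuBERT's exact recipe.

```python
import numpy as np

def hint_loss(student_hidden, teacher_hidden, proj):
    """Hint loss: project the thin student's hidden states into the
    teacher's dimension, then penalize the L2 gap frame by frame."""
    mapped = student_hidden @ proj   # (T, d_s) @ (d_s, d_t) -> (T, d_t)
    return float(np.mean((mapped - teacher_hidden) ** 2))

rng = np.random.default_rng(0)
student = rng.normal(size=(50, 256))       # thin student layer output
teacher = rng.normal(size=(50, 768))       # teacher layer output
proj = rng.normal(size=(256, 768)) * 0.01  # learnable projection
loss = hint_loss(student, teacher, proj)
```

In practice such a loss is summed over several layer pairs and combined with the prediction-level distillation objective; the projection is trained jointly with the student.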
