ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN based Speaker Verification
Speaker verification relies on a neural network that extracts speaker representations.
ECAPA-TDNN
- Reconstructs the initial frame layers as 1-dimensional Res2Net modules and introduces Squeeze-and-Excitation blocks to explicitly model channel interdependencies
- Aggregates and propagates features from different hierarchical levels, and channe..
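The Squeeze-and-Excitation mechanism mentioned above can be sketched in a few lines; this is a minimal NumPy illustration of the generic SE recipe (toy dimensions, randomly initialized weights), not ECAPA-TDNN's actual implementation.

```python
import numpy as np

def se_block(x, w1, b1, w2, b2):
    """Squeeze-and-Excitation over a (channels, frames) feature map.

    Squeeze: global average pooling over time. Excitation: a two-layer
    bottleneck producing per-channel gates in (0, 1). Finally, rescale x
    channel-wise with those gates."""
    s = x.mean(axis=1)                         # squeeze: (C,)
    h = np.maximum(0.0, w1 @ s + b1)           # bottleneck + ReLU: (C // r,)
    g = 1.0 / (1.0 + np.exp(-(w2 @ h + b2)))   # sigmoid gates: (C,)
    return x * g[:, None]                      # channel-wise rescaling

# Toy dimensions: C = 4 channels, reduction r = 2, T = 5 frames.
rng = np.random.default_rng(0)
C, r, T = 4, 2, 5
x = rng.standard_normal((C, T))
w1, b1 = rng.standard_normal((C // r, C)), np.zeros(C // r)
w2, b2 = rng.standard_normal((C, C // r)), np.zeros(C)
y = se_block(x, w1, b1, w2, b2)
```

Because the gates lie in (0, 1), the block can only attenuate channels; the network learns which channels to keep near full strength.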
ProsodyFM: Unsupervised Phrasing and Intonation Control for Intelligible Speech Synthesis
Existing text-to-speech models have limitations in phrasing and intonation.
ProsodyFM
- Uses a Flow Matching backbone and introduces a Phrase Break Encoder, Duration Predictor, and Terminal Intonation Encoder to improve phrasing and intonation on the prosody side
- Is trained without any explicit prosodic labels, so that a broad spectrum of break durations and intonation patterns can be uncove..
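The conditional flow matching objective behind a backbone of this kind can be sketched as below; this is the generic loss on toy vectors with a stand-in "model", under the usual linear-interpolation path assumption, not ProsodyFM's network.

```python
import numpy as np

def cfm_loss(model, x0, x1, t):
    """Conditional flow matching loss for a linear interpolation path.

    x_t = (1 - t) * x0 + t * x1 has constant target velocity u = x1 - x0;
    the model's prediction v(x_t, t) is regressed onto u with squared error."""
    xt = (1.0 - t)[:, None] * x0 + t[:, None] * x1
    u = x1 - x0
    v = model(xt, t)
    return np.mean(np.sum((v - u) ** 2, axis=1))

rng = np.random.default_rng(1)
B, D = 8, 3
x0 = rng.standard_normal((B, D))   # noise samples
x1 = rng.standard_normal((B, D))   # data samples
t = rng.uniform(size=B)            # random times in [0, 1]

# An oracle that returns the true velocity for this batch drives the loss to 0.
oracle = lambda xt, t: x1 - x0
```

At sampling time, integrating the learned velocity field from t = 0 to t = 1 (e.g. with an ODE solver) transports noise to data.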
UniWav: Towards Unified Pre-Training for Speech Representation Learning and Generation
Pre-training and representation learning rely on different foundation models.
UniWav
- A unified encoder-decoder framework for pre-training and representation learning
- Jointly learns a representation encoder and a generative decoder
Paper (ICLR 2025): Paper Link
1. Introduction
Speech representations are used to excel at specific tasks. In particular, HuBE..
ExpressiveSinger: Multilingual and Multi-Style Score-based Singing Voice Synthesis with Expressive Performance Control
Singing Voice Synthesis lacks controllability over timing, dynamics, and pitch.
ExpressiveSinger
- Generates expressive performance control signals comprising phoneme timing, the $F0$ curve, and the amplitude envelope
- Generates mel-spectrograms from the performance control signals using style guidance and singer timbre embeddings
Paper ..
Balanced-Wav2Vec: Enhancing Stability and Robustness of Representation Learning through Sample Reweighting Techniques
Self-Supervised Learning models lose expressiveness due to mode collapse and dimension collapse.
Balanced-Wav2Vec
- Introduces a balanced-infoNCE loss that suppresses the emergence of over-represented modes
- Prevents the highly skewed codebook distribution of Wav2Vec 2.0 and supports stable convergence
Paper (INTERSPEECH 2024): Pape..
CCC-Wav2Vec 2.0: Clustering Aided Cross Contrastive Self-Supervised Learning of Speech Representations
Self-Supervised Learning can exploit unlabeled data effectively.
CCC-Wav2Vec 2.0
- Uses clustering and an augmentation-based Cross-Contrastive loss as the self-supervised objective
- Thereby improves the robustness of pre-training
Paper (SLT 2023): Paper Link
1. Introduction
Self-Supervised Learning (SSL) can, from unlabeled data, learn high-level repres..
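The contrastive objectives in the two wav2vec-style entries above build on InfoNCE; a minimal NumPy sketch of the plain InfoNCE loss (toy vectors, cosine similarity, a hypothetical `temperature` setting) looks like this — the cross-contrastive and balanced variants modify how positives and distractors are formed or weighted.

```python
import numpy as np

def info_nce(c, pos, negs, temperature=0.1):
    """InfoNCE loss for one context vector c: cross-entropy of the positive
    against distractors under temperature-scaled cosine similarity."""
    cand = np.vstack([pos[None, :], negs])    # positive sits at index 0
    sims = cand @ c / (np.linalg.norm(cand, axis=1) * np.linalg.norm(c))
    logits = sims / temperature
    logits -= logits.max()                    # numerical stability
    p = np.exp(logits) / np.exp(logits).sum()
    return -np.log(p[0])                      # -log softmax prob of positive

rng = np.random.default_rng(2)
c = rng.standard_normal(16)                   # context vector
pos = c + 0.05 * rng.standard_normal(16)      # positive: near-copy of context
negs = rng.standard_normal((8, 16))           # 8 random distractors
loss = info_nce(c, pos, negs)
```

A well-aligned positive yields a small loss; swapping in a random vector as the "positive" makes it large, which is exactly the pressure that shapes the quantized targets.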
FACTSpeech: Speaking a Foreign Language Pronunciation Using Only Your Native Characters
Most text-to-speech models do not account for transliterated text.
FACTSpeech
- Introduces a language shift embedding that switches the pronunciation of the input text between native and literal pronunciations
- Applies conditional instance normalization to improve pronunciation while preserving speaker identity
Paper (INTERSPEECH 2023): Paper Link
1. Introduction
Text-to-Speec..
E1-TTS: Simple and Fast Non-Autoregressive TTS
An efficient non-autoregressive zero-shot text-to-speech model is needed.
E1-TTS
- Leverages denoising diffusion pre-training and distribution matching distillation
- Removes the explicit monotonic alignment between text-audio pairs
Paper (ICASSP 2025): Paper Link
1. Introduction
Non-Autoregressive (NAR) Text-to-Speech (TTS) models generate speech from text in parallel, unlike the one-unit-at-a-time synthesis of Autoregres..
SyllableLM: Learning Coarse Semantic Units for Speech Language Models
Tokenization of continuous data such as audio relies on fixed-size convolutions or discrete clustering, so the resulting units do not align with the data's semantic structure.
SyllableLM
- Extracts noisy boundaries by analyzing correlations in a pre-trained encoder's loss
- Iteratively improves the model representations through a distillation technique
Paper (ICLR 2025): Paper Link
1. Introduction
Spoken..
