
ATP-TTS: Adaptive Thresholding Pseudo-Labeling for Low-Resource Multi-Speaker Text-to-Speech
Text-to-Speech is difficult to apply in low-resource scenarios.
ATP-TTS
- Selects suitable pseudo-labels through Adaptive Thresholding
- Then predicts latent representations using an Automatic Speech Recognition model enhanced with contrastive learning perturbation
Paper (ICASSP 2025): Paper Link
1. Introduction: Su.. such as Glow-TTS, VITS, and NaturalSpeech
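The adaptive-thresholding idea can be sketched generically; the paper's exact selection rule is not reproduced here. A minimal sketch, assuming a FlexMatch-style per-class EMA threshold (the function name, `momentum`, and `base` values are illustrative, not from the paper):

```python
def select_pseudo_labels(confidences, labels, thresholds, momentum=0.9, base=0.95):
    """Keep pseudo-labels whose confidence exceeds a per-class adaptive threshold.

    confidences: model max-probabilities for unlabeled samples
    labels: argmax predictions for the same samples
    thresholds: dict class -> running EMA of confidence (updated in place)
    """
    selected = []
    for i, (c, y) in enumerate(zip(confidences, labels)):
        # update the running estimate of how confident the model is on class y
        t = thresholds.get(y, base)
        thresholds[y] = momentum * t + (1 - momentum) * c
        # scale the fixed base threshold by the class's current learning status
        if c >= base * thresholds[y]:
            selected.append(i)
    return selected
```

Classes the model is still unsure about get a lower effective threshold, so their pseudo-labels are not starved out early in training.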

SSR-Speech: Towards Stable, Safe and Robust Zero-Shot Text-based Speech Editing and Synthesis
A stable, safe, and robust zero-shot text-to-speech model is needed.
SSR-Speech
- Incorporates classifier-free guidance on top of a Transformer decoder
- Embeds a frame-level watermark for the edited region via Watermark EnCodec
Paper (ICASSP 2025): Paper Link
1. Introduction: Zero-shot text-based speech generation models such as YourTTS Speech..

Adapting Whisper for Code-Switching through Encoding Refining and Language-Aware Decoding
Code-Switching Automatic Speech Recognition still falls short on seamless language switching.
CS-Whisper
- Builds on Whisper, introducing an Encoder Refiner to improve the encoder's handling of intra-sentence switching
- Uses a Language-Aware Adapter with different language prompts to obtain language-specific decoding information at each decoder layer
Paper (ICASSP 2025): Pap..

SpeechFlow: Generative Pre-Training for Speech with Flow Matching
A single pre-trained generative model can serve a variety of downstream tasks.
SpeechFlow
- Pre-trains on untranscribed speech using Flow Matching and a masked condition
- Fine-tunes the pre-trained generative model on task-specific data to apply it to various tasks
Paper (ICLR 2024): Paper Link
1. Introduction: Discriminative models are used for speech recognition, enhancement, separat..
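The masked-condition flow matching objective can be sketched with NumPy. This is a generic sketch, not SpeechFlow's exact configuration: the linear (optimal-transport) path, the frame-level masking, and the `mask_ratio` value are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_masked_target(x1, mask_ratio=0.7):
    """Build one training example for masked-condition flow matching.

    x1: clean speech feature frames, shape (T, D)
    Returns (x_t, t, cond, target): the interpolated point, the sampled time,
    the partially masked condition, and the velocity the model should predict.
    """
    x0 = rng.standard_normal(x1.shape)          # noise sample
    t = rng.uniform()                           # random time in [0, 1]
    x_t = (1 - t) * x0 + t * x1                 # linear interpolation path
    target = x1 - x0                            # constant velocity along that path
    mask = rng.uniform(size=x1.shape[0]) < mask_ratio
    cond = x1.copy()
    cond[mask] = 0.0                            # masked frames are hidden from the model
    return x_t, t, cond, target
```

A model trained to regress `target` from `(x_t, t, cond)` learns to complete speech from partial context, which is what makes the single pre-trained model adaptable to infilling-style downstream tasks.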

From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion
Diffusion can be used to synthesize audio waveforms conditioned on highly compressed representations.
MBD
- Generates any type of audio modality from low-bitrate discrete representations
- Uses a multi-band diffusion-based framework to do so
Paper (NeurIPS 2023): Paper Link
1. Introduction: Neural-based vocoders such as MelGAN can synthesize high-quality samples. In particular, Self.. such as HuBERT
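The band decomposition underlying a multi-band framework can be illustrated as follows. This is only a sketch of the splitting step: the band edges are hypothetical, and the per-band diffusion models that MBD would attach to each band are omitted.

```python
import numpy as np

def split_bands(wav, sr, edges=(0, 1000, 4000, 8000)):
    """Split a waveform into frequency bands via FFT masking (sketch).

    wav: 1-D waveform array; sr: sample rate in Hz.
    Each band keeps only the spectrum between consecutive edges.
    Summing the bands reconstructs the original signal exactly.
    """
    spec = np.fft.rfft(wav)
    freqs = np.fft.rfftfreq(len(wav), d=1.0 / sr)
    bands = []
    cuts = list(edges) + [sr / 2 + 1]           # final band runs up to Nyquist
    for lo, hi in zip(cuts[:-1], cuts[1:]):
        mask = (freqs >= lo) & (freqs < hi)     # disjoint masks partition the spectrum
        bands.append(np.fft.irfft(spec * mask, n=len(wav)))
    return bands
```

Because the bands partition the spectrum, each one can be denoised by its own model without the error in one frequency range contaminating the others, and the outputs are simply summed back together.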

VQ-Wav2Vec: Self-Supervised Learning of Discrete Speech Representations
Discrete representations of audio segments can be learned through Wav2Vec-style self-supervised context prediction.
VQ-Wav2Vec
- Quantizes dense representations using Gumbel-Softmax or online $k$-means clustering
- This discretization allows BERT pre-training to be applied directly
Paper (ICLR 2020): Paper Link
1. Introduction: Learning discrete speech representations ..
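The Gumbel-Softmax quantization step can be sketched in NumPy. This is a generic sketch, not VQ-Wav2Vec's exact layer: the straight-through gradient trick that makes the hard choice differentiable is only noted in comments, since NumPy has no autograd.

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax_quantize(logits, codebook, tau=1.0, hard=True):
    """Pick a codebook entry via Gumbel-Softmax (sketch).

    logits: (V,) unnormalized scores over V codebook entries
    codebook: (V, D) code vectors
    With hard=True the forward pass uses the argmax one-hot; in a real
    framework the soft probabilities would carry the gradient.
    """
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))  # Gumbel(0, 1) noise
    y = np.exp((logits + g) / tau)
    y /= y.sum()                                          # soft one-hot over codes
    if hard:
        one_hot = np.zeros_like(y)
        one_hot[np.argmax(y)] = 1.0                       # discrete forward choice
        y = one_hot
    return y @ codebook                                   # selected code vector
```

Sampling with Gumbel noise lets the model explore different codes during training while the temperature `tau` controls how close the soft assignment is to a hard one.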