
SECodec: Structural Entropy-based Compressive Speech Representation Codec for Speech Language Models
Existing speech representation discretization methods for Large Language Models rely on Euclidean distance-based quantization or a pre-defined codebook
SECodec
- Models speech as a graph, clusters the speech feature nodes within the graph, and then extracts the codebook by minimizing 2D Structural Entropy (sketched below)
- 2D SE minimization principle ..
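The 2D Structural Entropy minimized here has a standard closed form over a partitioned graph, so a minimal NumPy sketch of that objective is given below. The graph construction (a generic symmetric affinity matrix over speech feature frames) and the idea of reading the codebook off the lowest-entropy partition are illustrative assumptions, not SECodec's exact procedure.

```python
import numpy as np

def structural_entropy_2d(adj: np.ndarray, labels: np.ndarray) -> float:
    """2D structural entropy of a weighted graph under a node partition.

    adj    : symmetric (N, N) affinity matrix over speech feature frames
    labels : (N,) cluster id per node
    Assumes every node has positive degree.
    """
    degrees = adj.sum(axis=1)                    # weighted node degrees d_v
    vol_g = degrees.sum()                        # total volume 2m
    entropy = 0.0
    for c in np.unique(labels):
        members = labels == c
        vol_c = degrees[members].sum()           # module volume V_j
        cut_c = adj[members][:, ~members].sum()  # edge weight leaving the module g_j
        # inter-module term: -(g_j / 2m) * log2(V_j / 2m)
        entropy -= (cut_c / vol_g) * np.log2(vol_c / vol_g)
        # intra-module term: -sum_{v in X_j} (d_v / 2m) * log2(d_v / V_j)
        d = degrees[members]
        entropy -= np.sum((d / vol_g) * np.log2(d / vol_c))
    return entropy
```

Under these assumptions, one would compare candidate clusterings of the feature graph with this function and take, for example, the centroid of each cluster in the lowest-entropy partition as a codebook entry.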

SEVC: Voice Conversion via Structural Entropy
Existing voice conversion methods suffer from prosody leakage and speech representation blurring
SEVC
- Extracts self-supervised representations from the source and reference speech, and builds a graph from the reference speech representations
- Then clusters semantically similar representations using 2D Structural Entropy
- During voice conversion, each frame of the source representation is treated as a new node and, via SE, each nod..
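The truncated last item appears to describe assigning each source frame to a reference cluster through SE. Below is a hedged sketch of one plausible version of that step: the frame is attached to whichever existing cluster gives the lowest 2D structural entropy after a tentative insertion. It reuses `structural_entropy_2d` from the SECodec sketch above; the cosine-similarity edge weights are an assumption.

```python
import numpy as np
# reuses structural_entropy_2d from the SECodec sketch above

def assign_frame(frame: np.ndarray, ref_feats: np.ndarray,
                 ref_adj: np.ndarray, ref_labels: np.ndarray) -> int:
    """Treat one source frame as a new node of the reference graph and pick
    the cluster whose tentative insertion yields the lowest 2D SE."""
    # hypothetical edge weights: cosine similarity to every reference node
    sims = ref_feats @ frame / (
        np.linalg.norm(ref_feats, axis=1) * np.linalg.norm(frame) + 1e-8)
    sims = np.clip(sims, 1e-8, None)           # keep weights positive (avoids log(0))
    n = ref_adj.shape[0]
    adj_new = np.zeros((n + 1, n + 1))         # adjacency with the new node appended
    adj_new[:n, :n] = ref_adj
    adj_new[n, :n] = adj_new[:n, n] = sims
    best_cluster, best_se = None, np.inf
    for c in np.unique(ref_labels):
        labels_new = np.append(ref_labels, c)  # tentatively place the node in cluster c
        se = structural_entropy_2d(adj_new, labels_new)
        if se < best_se:
            best_cluster, best_se = c, se
    return best_cluster
```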

LiveSpeech: Low-Latency Zero-Shot Text-to-Speech via Autoregressive Modeling of Audio Discrete Codes
Neural audio codecs enable zero-shot text-to-speech, but they are difficult to use in low-latency scenarios
LiveSpeech
- Introduces an adaptive codebook loss that accounts for each frame's codebook contribution (sketched below)
- Groups the codebooks and processes each group in parallel
Paper (INTERSPEECH 2024) : Paper Link
1. Introduction
Zero-shot Text.. such as NaturalSpeech2
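As a rough illustration of an adaptive codebook loss, the sketch below re-weights a per-codebook cross-entropy per frame by a codebook-contribution score. The softmax weighting, the source of the contribution score, and the tensor shapes are assumptions, not LiveSpeech's exact formulation.

```python
import torch
import torch.nn.functional as F

def adaptive_codebook_loss(logits: torch.Tensor,
                           targets: torch.Tensor,
                           contribution: torch.Tensor) -> torch.Tensor:
    """Cross-entropy over Q codebooks, re-weighted per frame.

    logits       : (B, T, Q, V) predictions over V code indices per codebook
    targets      : (B, T, Q)    ground-truth code indices (long)
    contribution : (B, T, Q)    per-frame importance of each codebook
                   (e.g. derived from per-codebook residual energy; assumption)
    """
    B, T, Q, V = logits.shape
    ce = F.cross_entropy(logits.reshape(-1, V), targets.reshape(-1),
                         reduction="none").view(B, T, Q)
    weights = torch.softmax(contribution, dim=-1)  # normalize across codebooks per frame
    return (weights * ce).sum(dim=-1).mean()       # sum over codebooks, mean over B, T
```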

ZCS-CDiff: A Zero-Shot Code-Switching TTS System with Conformer-Based Diffusion Model
Code-Switching Text-to-Speech has limitations when applied in zero-shot scenarios
ZCS-CDiff
- Disentangles the speech features and models the disentangled attributes with a diffusion model (a generic sketch follows below)
- Uses a Conformer-based WaveNet as the denoising network to improve attribute modeling
- Additionally improves speaker similarity through a speaker-assist module
Paper (ICASSP 2025) : Paper Li..
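To make "models the disentangled attributes with a diffusion model" concrete, here is a generic DDPM-style noise-prediction loss with conditioning. The `denoiser` callable stands in for the Conformer-based WaveNet, `cond` for the remaining conditioning (text, speaker), and the linear beta schedule is an assumption rather than the paper's configuration.

```python
import torch
import torch.nn.functional as F

def diffusion_attribute_loss(denoiser, attr0: torch.Tensor,
                             cond: torch.Tensor, num_steps: int = 1000) -> torch.Tensor:
    """Generic noise-prediction loss for a disentangled attribute sequence.

    attr0 : (B, T, D) clean attribute sequence (e.g. prosody or content features)
    cond  : (B, T, C) conditioning features
    """
    B = attr0.shape[0]
    t = torch.randint(0, num_steps, (B,), device=attr0.device)      # random diffusion step
    betas = torch.linspace(1e-4, 0.02, num_steps, device=attr0.device)  # linear schedule (assumption)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)[t].view(B, 1, 1)
    noise = torch.randn_like(attr0)
    noisy = alpha_bar.sqrt() * attr0 + (1.0 - alpha_bar).sqrt() * noise  # forward diffusion
    return F.mse_loss(denoiser(noisy, t, cond), noise)               # predict the injected noise
```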

MB-iSTFT-VITS: Lightweight and High-Fidelity End-to-End Text-to-Speech with Multi-Band Generation and Inverse Short-Time Fourier Transform
A lightweight end-to-end text-to-speech model is needed
MB-iSTFT-VITS
- Replaces the computationally expensive components with a simple inverse Short-Time Fourier Transform (sketched below)
- Generates the waveform via multi-band generation with fixed/trainable synthesis filters
Paper (ICASSP 2023) : Paper Link
1. I..
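The decoder replacement amounts to predicting magnitude and phase frames and inverting them with an inverse STFT, so a minimal PyTorch sketch of such a head is shown below. The FFT/hop sizes are placeholders, and the multi-band recombination is only noted in a comment, not implemented as in the paper.

```python
import torch

def istft_head(mag: torch.Tensor, phase: torch.Tensor,
               n_fft: int = 16, hop_length: int = 4) -> torch.Tensor:
    """Turn predicted magnitude/phase frames into a waveform with inverse STFT.

    mag, phase : (B, n_fft // 2 + 1, T) outputs of a lightweight decoder
    returns    : (B, samples) time-domain signal for one (sub-)band
    """
    spec = torch.polar(mag, phase)                 # complex spectrogram mag * e^{j*phase}
    window = torch.hann_window(n_fft, device=mag.device)
    return torch.istft(spec, n_fft=n_fft, hop_length=hop_length,
                       win_length=n_fft, window=window)

# In the multi-band variant, each sub-band waveform produced this way is
# upsampled and merged by a fixed (pseudo-QMF) or trainable synthesis filter
# bank; that recombination step is omitted from this sketch.
```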

W2V-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training
Masked Language Modeling can be applied to self-supervised speech representation learning
W2V-BERT
- Combines contrastive learning with masked language modeling (sketched below)
- Optimizes the two self-supervised tasks in an end-to-end fashion
Paper (ASRU 2021) : Paper Link
1. Introduction
Using large-scale unannotated speech, Automatic..
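A compact sketch of how the two objectives can be combined into one end-to-end loss: a contrastive term that picks the true quantized target among candidates (simplified here to use all other masked frames in the batch as distractors, which is an assumption), plus a masked-prediction cross-entropy over the same discrete target ids, weighted by a hypothetical `beta`.

```python
import torch
import torch.nn.functional as F

def w2v_bert_objective(contrast_out: torch.Tensor,  # (N, D) lower-stack outputs at masked frames
                       quantized: torch.Tensor,     # (N, D) quantized target vectors for those frames
                       target_ids: torch.Tensor,    # (N,)   codebook ids of the targets (long)
                       mlm_logits: torch.Tensor,    # (N, V) upper-stack predictions at masked frames
                       temperature: float = 0.1,
                       beta: float = 1.0) -> torch.Tensor:
    """Contrastive + masked-prediction objective, optimized jointly end-to-end."""
    # contrastive term: identify the true quantized vector among all candidates
    sims = F.cosine_similarity(contrast_out.unsqueeze(1),
                               quantized.unsqueeze(0), dim=-1) / temperature  # (N, N)
    contrastive = F.cross_entropy(sims, torch.arange(len(sims), device=sims.device))
    # MLM term: predict the same discrete target ids from the upper stack
    mlm = F.cross_entropy(mlm_logits, target_ids)
    return contrastive + beta * mlm
```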