BlockDecoder: Boosting ASR Decoders with Context and Merger Modules
In attention-based encoder-decoder models, the decoder generates the Automatic Speech Recognition output autoregressively
- In particular, the initial layers build textual context while the later layers merge acoustic and textual information
BlockDecoder
Introduces a merger that combines information with a purely text-based text encoder
Reuses the encoder representations and the text encod..
Shallow Flow Matching for Coarse-to-Fine Text-to-Speech Synthesis
Flow Matching-based Text-to-Speech models can be improved
Shallow Flow Matching (SFM)
Constructs an intermediate state along the Flow Matching path from a coarse representation
Introduces an orthogonal projection to adaptively determine the temporal position of that state
Paper (NeurIPS 2025): Paper Link
1. Introduction
VoiceBox, ReFlow-TTS, VoiceFlow, and other Flow Matching (F..
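As a rough illustration of the SFM idea above: on the common linear Flow Matching path, a coarse representation can stand in for the endpoint, and sampling can start from an intermediate state whose temporal position $t$ is chosen by a projection. A minimal sketch, assuming the linear conditional-OT path; the function names, shapes, and the projection heuristic below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def fm_intermediate_state(x0, coarse, t):
    """Linear Flow Matching path x_t = (1 - t) * x0 + t * x1.
    Here the coarse representation plays the role of x1, so sampling
    can start from an intermediate state x_t instead of pure noise x0."""
    return (1.0 - t) * x0 + t * coarse

def adaptive_t(coarse, target):
    """Illustrative stand-in for SFM's orthogonal projection: take the
    projection coefficient of the coarse state onto the target direction,
    clipped to [0, 1], as the temporal position t."""
    denom = np.dot(target, target)
    t = np.dot(coarse, target) / denom if denom > 0 else 0.0
    return float(np.clip(t, 0.0, 1.0))

rng = np.random.default_rng(0)
x0 = rng.standard_normal(8)                            # Gaussian prior sample
target = rng.standard_normal(8)                        # fine (target) representation
coarse = 0.6 * target + 0.1 * rng.standard_normal(8)   # coarse approximation of it

t = adaptive_t(coarse, target)
x_t = fm_intermediate_state(x0, coarse, t)
print(t, x_t.shape)
```

With this linear path, $t = 0$ recovers the prior sample and $t = 1$ recovers the coarse state, so a better coarse representation pushes the starting point further along the path.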
FocalCodec: Low-Bitrate Speech Coding via Focal Modulation Networks
Existing neural codecs suffer from high bitrates and semantic/acoustic information loss
FocalCodec
Compresses speech with a single binary codebook, building on focal modulation
Preserves semantic/acoustic information, achieving strong performance on various downstream tasks
Paper (NeurIPS 2025): Paper Link
1. Introduction
Speech language models such as AudioLM and AudioGen ... token-based sp..
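To make the single-binary-codebook idea concrete, here is a toy sign-based binarizer: each latent dimension carries one bit, so the bitrate is just the frame rate times the latent width. This is only a sketch of binary quantization in general; FocalCodec's actual quantizer and dimensions differ, and every name and shape below is an assumption for illustration.

```python
import numpy as np

def binary_quantize(z):
    """Quantize each latent dimension to a single bit (its sign), so a
    D-dim latent frame becomes a D-bit code drawn from one shared binary
    codebook with values {-1, +1}, instead of a large VQ codebook."""
    codes = (z >= 0).astype(np.int8)           # bits in {0, 1}
    dequant = np.where(codes == 1, 1.0, -1.0)  # values the decoder sees
    return codes, dequant

rng = np.random.default_rng(0)
z = rng.standard_normal((4, 13))               # (frames, latent dim): 13 bits/frame
codes, zq = binary_quantize(z)
print(codes.shape)                             # bitrate = frame_rate * 13 bits/s
```

The appeal of a single binary codebook is that the token space stays tiny (2 values per dimension) while the bit budget per frame is explicit, which is what enables very low bitrates.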
SSAMBA: Self-Supervised Audio Representation Learning with Mamba State Space Model
Transformer architectures for audio representation learning have quadratic complexity in memory and inference time
SSAMBA
Introduces Mamba, a State Space Model, to self-supervised audio representation learning
Uses bidirectional Mamba to capture complex audio patterns and learn robust audio representations from unlabeled datasets
Paper (SLT 20..
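The bidirectional idea can be sketched with a toy diagonal state-space recurrence: run the sequence forward and reversed, then concatenate the features. A real Mamba block adds input-dependent parameters, gating, and hardware-aware scans; everything below is a simplified stand-in with assumed names and constants.

```python
import numpy as np

def ssm_scan(x, a=0.9, b=1.0):
    """Toy diagonal state-space recurrence h_t = a * h_{t-1} + b * x_t,
    applied independently per channel over a (time, channels) array."""
    h = np.zeros_like(x[0])
    out = []
    for xt in x:
        h = a * h + b * xt
        out.append(h)
    return np.stack(out)

def bidirectional_ssm(x):
    """Bidirectional processing as in SSAMBA's Mamba blocks: scan the
    sequence forward and backward, then concatenate channel-wise."""
    fwd = ssm_scan(x)
    bwd = ssm_scan(x[::-1])[::-1]
    return np.concatenate([fwd, bwd], axis=-1)

x = np.random.default_rng(0).standard_normal((10, 4))  # (time, channels)
y = bidirectional_ssm(x)
print(y.shape)
```

Each scan is linear in sequence length, which is the complexity advantage over self-attention that motivates the model.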
SSAST: Self-Supervised Audio Spectrogram Transformer
Transformers can be applied to audio tasks
SSAST
Improves the Audio Spectrogram Transformer via Self-Supervised Learning
Applies pre-training based on joint discriminative and generative masked spectrogram patch modeling
Paper (AAAI 2022): Paper Link
1. Introduction
Pure self-attention-based models such as the Audio Spectrogram Transformer (AST) require more training data than conventional CNN-based models..
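A minimal sketch of joint discriminative and generative masked spectrogram patch modeling, assuming 16x16 patches (as in AST) and an InfoNCE-style matching loss; the stand-in predictor and all shapes are assumptions, not SSAST's actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy log-mel spectrogram split into 16x16 patches, as in AST/SSAST.
spec = rng.standard_normal((128, 96))                 # (freq bins, frames)
patches = (spec.reshape(8, 16, 6, 16)                 # (freq blk, 16, time blk, 16)
               .transpose(0, 2, 1, 3)
               .reshape(48, 256))                     # 48 patches of 256 values

# Mask a random subset of patches (SSAST masks 2-D patches, not frames).
num_masked = 12
masked = rng.choice(len(patches), size=num_masked, replace=False)
inputs = patches.copy()
inputs[masked] = 0.0                                  # a learnable mask token in practice

# Stand-in predictor (SSAST uses a Transformer encoder with two heads):
# the true patches plus small noise, so both losses are computable.
pred = patches + 0.01 * rng.standard_normal(patches.shape)

# Generative objective: reconstruct the masked patches (MSE).
gen_loss = np.mean((pred[masked] - patches[masked]) ** 2)

# Discriminative objective: InfoNCE-style matching of each masked
# prediction to its own patch among the masked set.
logits = pred[masked] @ patches[masked].T
logits -= logits.max(axis=1, keepdims=True)           # numerical stability
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
disc_loss = -np.mean(np.log(probs[np.arange(num_masked), np.arange(num_masked)]))

loss = gen_loss + disc_loss
```

The two objectives are complementary: the generative term forces the model to recover fine spectral detail, while the discriminative term only requires telling masked patches apart, and SSAST pre-trains on their sum.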
EmoVoice: LLM-based Emotional Text-to-Speech Model with Freestyle Text Prompting
Text-to-Speech models still fall short in emotional expression
EmoVoice
Leverages a Large Language Model to support fine-grained, freestyle natural language emotion control
Outputs phoneme tokens and audio tokens in parallel to improve content consistency
Paper (MM 2025): Paper Link
1. Introduction
Emotion-controllable Text-to-Speech (TTS) models ... emotion..
HierSpeech++: Bridging the Gap Between Semantic and Acoustic Representation of Speech by Hierarchical Variational Inference for Zero-Shot Speech Synthesis
Zero-shot speech synthesis is limited in inference speed and robustness
HierSpeech++
Improves naturalness through a hierarchical synthesis framework
Introduces a Text-to-Vec framework that generates self-supervised/$F0$ representations from the text representation and a prosody prompt, and 16k..
BridgeVoC: Neural Vocoder with Schrodinger Bridge
Diffusion-based neural vocoders neglect the linear degradation of the mel-spectrogram
BridgeVoC
Connects a Time-Frequency domain-based neural vocoder with the Schrodinger Bridge
Projects the mel-spectrogram into the target linear-scale domain and treats it as a degraded spectral representation
Paper (IJCAI 2025): Paper Link
1. Introduction
Neural vocoders generate high-quality waveforms from acoustic features..
PALLE: Pseudo-Autoregressive Neural Codec Language Models for Efficient Zero-Shot Text-to-Speech Synthesis
In Zero-Shot Text-to-Speech, autoregressive models are limited in generation speed and non-autoregressive models in temporal modeling
PALLE
Introduces a pseudo-autoregressive approach that combines the explicit temporal modeling of autoregressive models with the parallel generation of non-autoregressive models
Built on a two-stage framework; in the first stage, ..
