ImmersiveTTS: Environment-Aware Text-to-Speech with Multimodal Diffusion Transformer and Domain-Specific Representation Alignment
Jointly generating speech together with environmental audio is difficult.
ImmersiveTTS
Built on a multimodal diffusion Transformer, it fuses transcript-aligned speech latents with text-conditioned environmental context through joint attention.
To improve semantic consistency, domain-specific representation alig..
Paper/TTS
2026. 7. 2. 14:37
