<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Let IT Begin</title>
    <link>https://randomsampling.tistory.com/</link>
    <description>Voice Engineer | I post album reviews when I'm bored</description>
    <language>ko</language>
    <pubDate>Sat, 13 Jun 2026 06:24:20 +0900</pubDate>
    <generator>TISTORY</generator>
    <ttl>100</ttl>
    <managingEditor>feVeRin</managingEditor>
    <image>
      <title>Let IT Begin</title>
      <url>https://tistory1.daumcdn.net/tistory/5934179/attach/f8b44e78d4d24d1cbf9d8c49e61cd0e3</url>
      <link>https://randomsampling.tistory.com</link>
    </image>
    <item>
      <title>[Paper Review] CoCoEmo: Composable and Controllable Human-Like Emotional TTS via Activation Steering</title>
      <link>https://randomsampling.tistory.com/657</link>
      <description>&lt;h2 style=&quot;text-align: justify;&quot; data-ke-size=&quot;size26&quot;&gt;&lt;b&gt;
&lt;script src=&quot;https://cdn.jsdelivr.net/npm/katex@0.16.22/dist/katex.min.js&quot;&gt;&lt;/script&gt;
&lt;script src=&quot;https://cdn.jsdelivr.net/npm/katex@0.16.22/dist/contrib/auto-render.min.js&quot;&gt;&lt;/script&gt;
&lt;script&gt;document.addEventListener(&quot;DOMContentLoaded&quot;, function() {  renderMathInElement(document.body, {    delimiters: [      {left: &quot;$$&quot;, right: &quot;$$&quot;, display: true},      {left: &quot;$&quot;, right: &quot;$&quot;, display: false}    ]  });});&lt;/script&gt;
&lt;/b&gt;&lt;/h2&gt;
&lt;h2 style=&quot;text-align: justify;&quot; data-ke-size=&quot;size26&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;CoCoEmo: Composable and Controllable Human-Like Emotional TTS via Activation Steering&lt;/span&gt;&lt;/b&gt;&lt;/h2&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style6&quot; /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Most text-to-speech systems enforce a single utterance-level emotion&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;CoCoEmo&lt;/span&gt;&lt;/b&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;Introduces a multi-rater evaluation protocol for activation steering&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;Applies a lightweight steering approach for human-like emotional speech&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;Paper (ICML 2026) : &lt;a href=&quot;https://arxiv.org/pdf/2602.03420&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;Paper Link&lt;/b&gt;&lt;/a&gt;&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style6&quot; /&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;background-color: #9feec3; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;1. Introduction&lt;/b&gt;&lt;/span&gt;&lt;/h3&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;Natural speech is inherently complex and often combines multiple concurrent, sometimes conflicting, affective signals&lt;/span&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #006dd7;&quot;&gt;In particular, most expressive Text-to-Speech (TTS) models treat emotion as a single, globally coherent state&lt;/span&gt;&lt;br /&gt;- As a result, mixed emotions are averaged into a single dominant tone&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;One could increase label granularity or retrain with richer emotion annotations, but neither addresses the root cause&lt;/span&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #006dd7;&quot;&gt;Steering vectors, by contrast, can introduce a controlled directional bias in the latent representation space of a pre-trained TTS system&lt;/span&gt;&lt;br /&gt;- In particular, mixed emotions can be represented by multiple emotion-specific steering directions, and text-emotion misalignment can be expressed by modulating acoustic features independently of the textual content&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #ee2323; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;BUT, applying steering vectors effectively in a Speech Language Model (SLM) requires closing gaps around where to steer, how to steer, and how to evaluate steering&lt;/span&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #ffc1c8; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;-&amp;gt; The paper therefore proposes CoCoEmo, which analyzes how steering vectors behave in SLMs to improve controllability&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #006dd7; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;CoCoEmo&lt;/span&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Performs an &lt;b&gt;in-depth analysis&lt;/b&gt; of modular emotional TTS architectures and introduces a &lt;b&gt;multi-rater protocol&lt;/b&gt; for evaluation&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;Additionally &lt;b&gt;injects steering vectors into the optimal SLM layers&lt;/b&gt; to support reliable mixed-emotion synthesis&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;&amp;lt; Overview of CoCoEmo &amp;gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;A steering vector mechanism that bridges hybrid, SLM-based TTS systems&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;As a result, it achieves better performance than prior approaches&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;background-color: #9feec3; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;2. Disentangling Emotion in SLM and Flow-Matching&lt;/b&gt;&lt;/span&gt;&lt;/h3&gt;
&lt;h4 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;- Model Overview&lt;/b&gt;&lt;/span&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Hybrid TTS systems typically use a 2-stage architecture&lt;/span&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Let $\mathbf{x}_{i}$ be the $i$-th input text sequence and $\mathbf{c}_{ref}$ the reference signal for the target emotion&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #006dd7;&quot;&gt;In the first stage, the TTS language model $f_{SLM}$ maps these inputs to a discrete speech token sequence $\mathbf{z}_{i}$:&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;(Eq. 1)&lt;/b&gt;&lt;/span&gt; $ \mathbf{z}_{i}=f_{SLM}(\mathbf{x}_{i},\mathbf{c}_{ref})$&lt;br /&gt;- $\mathbf{z}_{i}=(z_{i}^{1},...,z_{i}^{T})$ : token sequence of length $T$&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #006dd7;&quot;&gt;In the second stage, the flow-matching acoustic model $f_{Flow}$ transforms the speech token sequence into a mel-spectrogram $\mathbf{m}_{i}$, which a pre-trained vocoder $g_{voc}$ converts into a waveform $\mathbf{y}_{i}$:&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;(Eq. 2)&lt;/b&gt;&lt;/span&gt; $\mathbf{m}_{i}=f_{Flow}(\mathbf{z}_{i},\mathbf{c}_{ref}),\,\,\, \mathbf{y}_{i}=g_{voc}(\mathbf{m}_{i})$&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;- Where to Steer 1: Modular Analysis&lt;/b&gt;&lt;/span&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Cross-Conditioning Diagnostic&lt;/span&gt;&lt;/b&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #ee2323;&quot;&gt;To disentangle the contributions of the SLM and the flow-matching module to emotional expression, the paper introduces a Cross-Conditioning Diagnostic&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Let $\mathbf{c}^{e}$ and $\mathbf{c}^{n}$ be the emotional and neutral conditioning signals, respectively&lt;/span&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;b&gt;SLM-Driven&lt;/b&gt;&lt;br /&gt;- &lt;span style=&quot;color: #006dd7;&quot;&gt;The emotion reference is applied only to the SLM, where it modifies the speech tokens $\mathbf{z}_{i}$, while the flow-matching module runs under the neutral condition&lt;/span&gt;:&lt;br /&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;(Eq. 3)&lt;/b&gt;&lt;/span&gt; $\mathbf{z}_{i}^{e}=f_{SLM}(\mathbf{x}_{i},\mathbf{c}^{e}),\,\,\, \mathbf{m}_{SLM}=f_{Flow}(\mathbf{z}_{i}^{e},\mathbf{c}^{n})$&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;b&gt;Flow-Driven&lt;/b&gt;&lt;br /&gt;- &lt;span style=&quot;color: #006dd7;&quot;&gt;The SLM is conditioned on the neutral signal, and the emotion reference enters only through flow-matching&lt;/span&gt;:&lt;br /&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;(Eq. 4)&lt;/b&gt;&lt;/span&gt; $\mathbf{z}_{i}^{n}=f_{SLM}(\mathbf{x}_{i},\mathbf{c}^{n}),\,\,\, \mathbf{m}_{Flow}=f_{Flow}(\mathbf{z}_{i}^{n},\mathbf{c}^{e})$&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;If emotion is encoded in the SLM, the SLM-Driven condition should produce stronger emotional expressiveness&lt;br /&gt;- Otherwise, the Flow-Driven condition should dominate&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
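&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;The routing in Eq. 3 and Eq. 4 can be sketched as below. This is only an illustration of which stage sees which condition, not the paper's code: cross_condition, toy_slm and toy_flow are hypothetical names, and the toy stages are assumed stand-ins that just record the condition they received.&lt;/span&gt;&lt;/p&gt;

```python
from typing import Callable, Dict, List

Tokens = List[int]
Mel = List[float]


def cross_condition(
    f_slm: Callable[[str, str], Tokens],
    f_flow: Callable[[Tokens, str], Mel],
    text: str,
    c_emotional: str,
    c_neutral: str,
) -> Dict[str, Mel]:
    """Route the emotional reference to exactly one stage at a time."""
    z_e = f_slm(text, c_emotional)  # Eq. 3: tokens from the emotional condition
    z_n = f_slm(text, c_neutral)    # Eq. 4: tokens from the neutral condition
    return {
        "slm_driven": f_flow(z_e, c_neutral),     # m_SLM
        "flow_driven": f_flow(z_n, c_emotional),  # m_Flow
    }


# Hypothetical toy stages: each one only marks whether it saw "happy".
def toy_slm(text: str, cond: str) -> Tokens:
    return [len(text), 1 if cond == "happy" else 0]


def toy_flow(tokens: Tokens, cond: str) -> Mel:
    return [float(t) for t in tokens] + [1.0 if cond == "happy" else 0.0]


out = cross_condition(toy_slm, toy_flow, "hello", "happy", "neutral")
print(out)
```

&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;With real models, the two returned mel-spectrograms would be compared on emotional expressiveness to see which stage carries the emotion.&lt;/span&gt;&lt;/p&gt;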
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2026-06-09 084219.png&quot; data-origin-width=&quot;542&quot; data-origin-height=&quot;98&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bKUAo8/dJMcad3jI7o/wHCw76GyJA1CC8iDqTeb70/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bKUAo8/dJMcad3jI7o/wHCw76GyJA1CC8iDqTeb70/img.png&quot; data-alt=&quot;Cross-Conditioning Diagnostic&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bKUAo8/dJMcad3jI7o/wHCw76GyJA1CC8iDqTeb70/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbKUAo8%2FdJMcad3jI7o%2FwHCw76GyJA1CC8iDqTeb70%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;542&quot; height=&quot;98&quot; data-filename=&quot;스크린샷 2026-06-09 084219.png&quot; data-origin-width=&quot;542&quot; data-origin-height=&quot;98&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;Cross-Conditioning Diagnostic&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;Findings and Design Implications&lt;/span&gt;&lt;/b&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #006dd7; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;In terms of energy contours, the SLM-Driven condition shows distinct prosodic patterns per emotion, whereas the Flow-Driven condition yields largely overlapping contours&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;- That is, the flow-matching module does not alter prosody and is involved only in acoustic rendering&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #006dd7; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;In the cross-conditioning diagnostic table above, SLM-Driven shows a lower CCC and a higher SR STD&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;- That is, the SLM governs the variability of the synthesized emotional features, while flow-matching refines the local rendering&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;&lt;span style=&quot;color: #ee2323;&quot;&gt;Consequently, since the SLM is the primary driver of emotional prosody, emotion steering should be applied to the SLM&lt;/span&gt;&amp;nbsp;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2026-06-09 084142.png&quot; data-origin-width=&quot;592&quot; data-origin-height=&quot;380&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/biS6Ju/dJMcabYO98N/c7NpUyhU5JjwTNs9JxSefk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/biS6Ju/dJMcabYO98N/c7NpUyhU5JjwTNs9JxSefk/img.png&quot; data-alt=&quot;Energy Contour&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/biS6Ju/dJMcabYO98N/c7NpUyhU5JjwTNs9JxSefk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbiS6Ju%2FdJMcabYO98N%2Fc7NpUyhU5JjwTNs9JxSefk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;592&quot; height=&quot;380&quot; data-filename=&quot;스크린샷 2026-06-09 084142.png&quot; data-origin-width=&quot;592&quot; data-origin-height=&quot;380&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;Energy Contour&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;h4 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;- Where to Steer 2: Layer and Operator Selection&lt;/b&gt;&lt;/span&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;Why Linear Separability&lt;/b&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #006dd7;&quot;&gt;For mixed emotions, steering vectors pointing in different directions are combined to produce a complex expression, so linear separability can serve as a proxy for steerability&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;That is, the higher the separability, the more reliably steering vectors can be extracted and combined&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Layer- and Operation-Level Probing for SLM Steering&lt;/span&gt;&lt;/b&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Let $\mathcal{D}=\{(\mathbf{x}_{i},\mathbf{a}_{i},y_{i})\}_{i=1}^{N}$ be a dataset of $N$ samples&lt;br /&gt;- $\mathbf{x}_{i}$ : input text, $\mathbf{a}_{i}$ : reference emotional speech, $y_{i}\in\{0,1,...,E\}$ : emotion label&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #006dd7;&quot;&gt;The SLM has $L$ Transformer layers, each with a set of operations $\mathcal{O}^{(l)}$, and the layer-/operation-wise activation is&lt;/span&gt;:&lt;br /&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;(Eq. 5)&lt;/b&gt;&lt;/span&gt; $\mathbf{h}_{i}^{(l,o)}=\left\{\begin{matrix} \text{Op}^{(l,o)}(\mathbf{x}_{i},\mathbf{a}_{i}), &amp;amp; l=1 \\ \text{Op}^{(l,o)}(\mathbf{h}_{i}^{(l-1)}), &amp;amp; l=2,...,L \\ \end{matrix}\right.,\,\,\, o\in\mathcal{O}^{(l)}$&lt;br /&gt;- $\text{Op}^{(l,o)}$ : an operation such as attention or the feed-forward network&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;To identify where emotion is most distinctly represented, the paper trains a linear probe that predicts $y_{i}$ from $\mathbf{h}_{i}^{(l,o)}$ and measures linear separability by its accuracy&lt;br /&gt;- The Top-$K$ layers and operations with the highest discriminability are used to extract and inject the steering vectors&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Findings and Design Implications&lt;/span&gt;&lt;/b&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;As shown in the figure below, in &lt;a href=&quot;https://randomsampling.tistory.com/520&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;CosyVoice2&lt;/b&gt;&lt;/a&gt; layers 10-17 show strong linear separability, and among the operations $\texttt{attn\_output}$ shows the highest discriminability&lt;br /&gt;- For &lt;a href=&quot;https://randomsampling.tistory.com/627&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;IndexTTS2&lt;/b&gt;&lt;/a&gt;, it is layers 5-10&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #ee2323;&quot;&gt;Consequently, mid-to-late layers and the attention output exhibit the highest linear separability for emotion representations&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
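&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;The probing idea above can be sketched roughly as follows. This is a minimal pure-Python illustration under stated assumptions, not the paper's setup: the activations are synthetic Gaussian vectors, the layer names layer_early and layer_mid are hypothetical, and the probe is a binary logistic regression trained by gradient descent, whereas the paper probes over all emotion labels.&lt;/span&gt;&lt;/p&gt;

```python
import math
import random


def train_probe(feats, labels, epochs=200, lr=0.1):
    """Fit a binary logistic-regression probe with plain gradient descent."""
    dim = len(feats[0])
    w, b, n = [0.0] * dim, 0.0, len(feats)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for h, y in zip(feats, labels):
            logit = sum(wi * hi for wi, hi in zip(w, h)) + b
            p = 1.0 / (1.0 + math.exp(-logit))
            for d in range(dim):
                grad_w[d] += (p - y) * h[d]
            grad_b += p - y
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_accuracy(feats, labels, w, b):
    """Fraction of samples whose predicted class matches the label."""
    hits = 0
    for h, y in zip(feats, labels):
        logit = sum(wi * hi for wi, hi in zip(w, h)) + b
        hits += int((logit > 0) == (y == 1))
    return hits / len(feats)


def synthetic_layer(rng, gap, n=200, dim=4):
    """Synthetic activations whose two classes sit at +gap / -gap on every axis."""
    feats, labels = [], []
    for i in range(n):
        y = i % 2
        centre = gap if y == 1 else -gap
        feats.append([rng.gauss(centre, 1.0) for _ in range(dim)])
        labels.append(y)
    return feats, labels


rng = random.Random(0)
scores = {}
for name, gap in {"layer_early": 0.0, "layer_mid": 1.5}.items():
    feats, labels = synthetic_layer(rng, gap)
    w, b = train_probe(feats[:150], labels[:150])                   # fit on 150 samples
    scores[name] = probe_accuracy(feats[150:], labels[150:], w, b)  # score on 50 held out

# Rank candidate sites by held-out probe accuracy (Top-K with K=1 here).
best_layer = max(scores, key=scores.get)
print(scores, best_layer)
```

&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;In practice the same loop would run over every (layer, operation) pair of the SLM, with activations captured from real utterances instead of sampled.&lt;/span&gt;&lt;/p&gt;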
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2026-06-09 084058.png&quot; data-origin-width=&quot;618&quot; data-origin-height=&quot;347&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/FmNqD/dJMcaf7UJ7W/YVWrKCSnAOWoFy5i2sFTXK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/FmNqD/dJMcaf7UJ7W/YVWrKCSnAOWoFy5i2sFTXK/img.png&quot; data-alt=&quot;Emotion Discriminability&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/FmNqD/dJMcaf7UJ7W/YVWrKCSnAOWoFy5i2sFTXK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FFmNqD%2FdJMcaf7UJ7W%2FYVWrKCSnAOWoFy5i2sFTXK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;618&quot; height=&quot;347&quot; data-filename=&quot;스크린샷 2026-06-09 084058.png&quot; data-origin-width=&quot;618&quot; data-origin-height=&quot;347&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;Emotion Discriminability&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;h3 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;background-color: #9feec3; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;3. Method&lt;/b&gt;&lt;/span&gt;&lt;/h3&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #ee2323;&quot;&gt;Building on the findings above, the paper extracts a steering vector for each individual emotion at the identified model layers&lt;/span&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;A mixed-emotion vector is built as a weighted combination of single-emotion vectors, which supports quantitative control over emotion proportions&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Steering vectors are extracted from emotional acoustic variation, independently of the linguistic representation, which lets them handle text-emotion mismatch&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2026-06-09 084305.png&quot; data-origin-width=&quot;1228&quot; data-origin-height=&quot;746&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/rLpaE/dJMcadPL4S6/iNl78sy8CU5QKgMTLTA560/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/rLpaE/dJMcadPL4S6/iNl78sy8CU5QKgMTLTA560/img.png&quot; data-alt=&quot;Overview&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/rLpaE/dJMcadPL4S6/iNl78sy8CU5QKgMTLTA560/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FrLpaE%2FdJMcadPL4S6%2FiNl78sy8CU5QKgMTLTA560%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1228&quot; height=&quot;746&quot; data-filename=&quot;스크린샷 2026-06-09 084305.png&quot; data-origin-width=&quot;1228&quot; data-origin-height=&quot;746&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;Overview&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;- Steering Vector Construction&lt;/b&gt;&lt;/span&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;Single Emotion Steering&lt;/b&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #ee2323;&quot;&gt;The paper computes emotion steering vectors with a mean-difference approach: each vector points from the mean neutral representation to the mean target-emotion representation&lt;/span&gt;&lt;br /&gt;- To isolate acoustic emotion information, only samples with the same speaker and transcript are compared&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Suppose we are given a dataset $\mathcal{D}=\{(\mathbf{x}_{i},\mathbf{a}_{i},y_{i})\}_{i=1}^{N}$ with emotion labels $y_{i}\in\{0,...,E\}$, where $y_{i}=0$ denotes neutral speech&lt;/span&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;First, speaker-matched neutral-emotion pairs are constructed to control for speaker and linguistic content&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Specifically, for each target emotion $e\in\mathcal{Y}$, emotion-$e$ utterances are paired with neutral utterances from the same speaker, giving two subsets $\mathcal{D}^{(e)}$ and $\mathcal{D}_{0}^{(e)}$&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #006dd7;&quot;&gt;Let $\mathbf{h}_{i}^{(l,o)}$ be the last-token activation of sample $i$ at the selected layer $l$ and operation $o$; the steering vector for emotion $e$ is then the difference between the mean representation of the emotion-$e$ samples and that of their paired neutral counterparts&lt;/span&gt;:&lt;br /&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;(Eq. 6)&lt;/b&gt;&lt;/span&gt; $\mathbf{v}_{e}^{(l,o)}=\frac{1}{|\mathcal{D}^{(e)}|}\sum_{i\in\mathcal{D}^{(e)}} \mathbf{h}_{i}^{(l,o)}-\frac{1}{|\mathcal{D}_{0}^{(e)}|}\sum_{j\in\mathcal{D}_{0}^{(e)}} \mathbf{h}_{j}^{(l,o)}$&lt;/span&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #006dd7;&quot;&gt;As a result, the vector $\mathbf{v}_{e}^{(l,o)}$ captures the direction of emotion $e$ in the latent space and is injected at inference time to induce the target emotional expression&lt;/span&gt;&lt;br /&gt;- In mismatch scenarios, the steering vector acts as an internal bias that overrides the text-implied emotion&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Mixed Emotion Steering&lt;/span&gt;&lt;/b&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;For mixed emotions, the steering vector is computed by combining the single-emotion vectors $\mathbf{v}_{e}^{(l,o)}$&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #006dd7;&quot;&gt;Given weights $\{p_{e}\}^{E}_{e=1}$ over the target emotions with $\sum_{e=1}^{E}p_{e}=1$, the mixed-emotion steering vector is&lt;/span&gt;:&lt;br /&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;(Eq. 7)&lt;/b&gt;&lt;/span&gt; $\mathbf{v}_{mix}^{(l,o)}=\sum_{e=1}^{E}p_{e}\mathbf{v}_{e}^{(l,o)}$&amp;nbsp;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
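&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Eq. 6 and Eq. 7 amount to a difference of means followed by a weighted sum. A minimal sketch, assuming the last-token activations have already been collected as plain lists of floats; the function names and the 2-dimensional toy values are made up for illustration and do not come from the paper.&lt;/span&gt;&lt;/p&gt;

```python
def mean_vector(rows):
    """Element-wise mean of equally sized activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(emotion_acts, neutral_acts):
    """Eq. 6: mean emotional activation minus mean paired-neutral activation."""
    mu_e = mean_vector(emotion_acts)
    mu_0 = mean_vector(neutral_acts)
    return [a - b for a, b in zip(mu_e, mu_0)]


def mix_vectors(vectors, weights):
    """Eq. 7: weighted sum of single-emotion vectors, weights summing to 1."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("emotion weights must sum to 1")
    dim = len(next(iter(vectors.values())))
    mixed = [0.0] * dim
    for emotion, p in weights.items():
        for d in range(dim):
            mixed[d] += p * vectors[emotion][d]
    return mixed


# Toy activations for one (layer, operation) site; values are illustrative only.
neutral = [[0.0, 0.0], [2.0, 0.0]]
happy = [[3.0, 1.0], [5.0, 1.0]]
sad = [[1.0, -2.0], [1.0, -4.0]]

vectors = {
    "happy": steering_vector(happy, neutral),
    "sad": steering_vector(sad, neutral),
}
v_mix = mix_vectors(vectors, {"happy": 0.75, "sad": 0.25})
print(vectors, v_mix)
```

&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Changing the weights (e.g., 0.5 / 0.5) moves the mixed vector along the line between the two single-emotion directions, which is what gives proportional control.&lt;/span&gt;&lt;/p&gt;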
&lt;h4 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;- Inference-Time Steering&lt;/b&gt;&lt;/span&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #ee2323;&quot;&gt;At inference time, either the single-emotion steering vector $\mathbf{v}_{e}^{(l,o)}$ or the mixed-emotion vector $\mathbf{v}_{mix}^{(l,o)}$ is injected into the selected Top-$K$ layers and operations&lt;/span&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #006dd7;&quot;&gt;At each selected layer and operation, the activation $\mathbf{h}_{i}^{(l,o)}$ is modulated by steering:&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;(Eq. 8)&lt;/b&gt;&lt;/span&gt; $\tilde{\mathbf{h}}_{i}^{(l,o)}=\mathbf{h}_{i}^{(l,o)}+\alpha\cdot \mathbf{v}^{(l,o)}$&lt;br /&gt;- $\alpha$ : steering intensity, $\mathbf{v}^{(l,o)}$ : either the single-emotion vector $\mathbf{v}^{(l,o)}_{e}$ or the mixed-emotion vector $\mathbf{v}_{mix}^{(l,o)}$&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;In addition, to preserve the original activation scale and maintain semantic coherence, the paper renormalizes the steered activation as $\tilde{\mathbf{h}}_{i}^{(l,o)}\leftarrow\frac{|| \mathbf{h}_{i}^{(l,o)}||}{||\tilde{\mathbf{h}}_{i}^{(l,o)}||}\cdot \tilde{\mathbf{h}}_{i}^{(l,o)}$&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
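&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;The steering step in &lt;b&gt;(Eq. 8)&lt;/b&gt; and the renormalization that follows it can be sketched for a single activation vector as below. This is a minimal sketch under assumptions: the helper names &lt;code&gt;l2_norm&lt;/code&gt; and &lt;code&gt;steer_activation&lt;/code&gt;, the use of the L2 norm, and the toy 2-dimensional values are illustrative choices, and a real model would apply this through hooks on the selected layers.&lt;/span&gt;&lt;/p&gt;

```python
import math


def l2_norm(vec):
    """Euclidean norm of a plain Python list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def steer_activation(h, v, alpha):
    """Eq. 8 (h + alpha * v), then rescale back to the norm of the original h."""
    steered = [h_k + alpha * v_k for h_k, v_k in zip(h, v)]
    steered_norm = l2_norm(steered)
    if steered_norm == 0.0:
        return steered  # degenerate case: nothing to rescale
    scale = l2_norm(h) / steered_norm
    return [scale * x for x in steered]


h = [3.0, 4.0]   # toy activation with norm 5 (made-up values)
v = [1.0, 0.0]   # toy steering vector (made-up values)
h_tilde = steer_activation(h, v, alpha=2.0)
```

&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;After rescaling, only the direction of the activation changes while its norm stays at the original value, which is the scale-preserving behavior the renormalization is meant to provide.&lt;/span&gt;&lt;/p&gt;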
&lt;h4 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;- Mixed-Emotion Evaluation&lt;/b&gt;&lt;/span&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Evaluating mixed-emotion synthesis requires a soft ground truth&lt;/span&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #ee2323;&quot;&gt;To obtain it, the paper uses multi-rater annotations&lt;/span&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Each speech recording $\mathbf{a}_{i}$ is labeled by $M$ raters, each giving a one-hot vector $y_{i,m}\in\{0,1\}^{|E|}$&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #006dd7;&quot;&gt;The consensus distribution is then&lt;/span&gt;:&lt;br /&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;(Eq. 9)&lt;/b&gt;&lt;/span&gt; $\mathbf{p}_{i}=\frac{1}{M}\sum_{m=1}^{M}y_{i,m}$&lt;br /&gt;- e.g., for $E=\{\texttt{happy, sad, angry}\}$, if two raters label $\texttt{happy}$ and one rater labels $\texttt{sad}$, then $\mathbf{p}_{i}=[\frac{2}{3},\frac{1}{3},0]$&lt;/span&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;This consensus distribution serves as the mixing weights for the steering vector $\mathbf{v}_{mix}^{(l,o)}$ in &lt;b&gt;(Eq. 7)&lt;/b&gt;, and the synthesized speech is compared against the ground-truth target speech $\mathbf{a}_{i}$ from which $\mathbf{p}_{i}$ was derived&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
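&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;The consensus distribution in &lt;b&gt;(Eq. 9)&lt;/b&gt; is just the average of the raters' one-hot votes, which the sketch below reproduces for the happy/sad/angry example. The function name &lt;code&gt;consensus_distribution&lt;/code&gt; and the string-label input format are assumptions for illustration; exact fractions are used only to keep the arithmetic easy to check.&lt;/span&gt;&lt;/p&gt;

```python
from collections import Counter
from fractions import Fraction


def consensus_distribution(rater_labels, emotions):
    """Eq. 9 sketch: average the raters' one-hot votes over the emotion set."""
    if not rater_labels:
        raise ValueError("at least one rater label is required")
    counts = Counter(rater_labels)
    unknown = set(counts) - set(emotions)
    if unknown:
        raise ValueError(f"labels outside the emotion set: {sorted(unknown)}")
    num_raters = len(rater_labels)
    return [Fraction(counts[e], num_raters) for e in emotions]


emotions = ["happy", "sad", "angry"]
# Two raters choose "happy" and one chooses "sad", as in the example above.
p_i = consensus_distribution(["happy", "happy", "sad"], emotions)
```

&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Here &lt;code&gt;p_i&lt;/code&gt; holds the fractions 2/3, 1/3, and 0, and because each rater casts exactly one vote the entries always sum to 1, so the result can be used directly as the weights $p_{e}$ in &lt;b&gt;(Eq. 7)&lt;/b&gt;.&lt;/span&gt;&lt;/p&gt;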
&lt;h3 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;background-color: #9feec3; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;4. Experiments&lt;/b&gt;&lt;/span&gt;&lt;/h3&gt;
&lt;h4 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;- Settings&lt;/b&gt;&lt;/span&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Dataset : ESD, RAVDESS, CREMA-D&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Comparisons : &lt;a href=&quot;https://randomsampling.tistory.com/520&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;CosyVoice2&lt;/b&gt;&lt;/a&gt;, &lt;a href=&quot;https://randomsampling.tistory.com/627&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;IndexTTS2&lt;/b&gt;&lt;/a&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;- Results&lt;/b&gt;&lt;/span&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Applying CoCoEmo enables better mixed-emotion synthesis&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2026-06-09 081608.png&quot; data-origin-width=&quot;670&quot; data-origin-height=&quot;612&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bYuarW/dJMcabEtl9u/kb0GJVRzC1COi0Du0KuV61/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bYuarW/dJMcabEtl9u/kb0GJVRzC1COi0Du0KuV61/img.png&quot; data-alt=&quot;Model Performance Comparison&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bYuarW/dJMcabEtl9u/kb0GJVRzC1COi0Du0KuV61/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbYuarW%2FdJMcabEtl9u%2Fkb0GJVRzC1COi0Du0KuV61%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;670&quot; height=&quot;612&quot; data-filename=&quot;스크린샷 2026-06-09 081608.png&quot; data-origin-width=&quot;670&quot; data-origin-height=&quot;612&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;Model Performance Comparison&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;It also performs well in terms of &lt;a href=&quot;https://randomsampling.tistory.com/463&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;Emotion2Vec&lt;/b&gt;&lt;/a&gt; Similarity, Target Emotion Probability (TEP), and Spearman Correlation&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2026-06-09 081705.png&quot; data-origin-width=&quot;1131&quot; data-origin-height=&quot;411&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/mhq4Z/dJMcabLj6kx/wIdkwM8ZP6TZLKMhXyZEx0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/mhq4Z/dJMcabLj6kx/wIdkwM8ZP6TZLKMhXyZEx0/img.png&quot; data-alt=&quot;Mixed-Emotion Synthesis&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/mhq4Z/dJMcabLj6kx/wIdkwM8ZP6TZLKMhXyZEx0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fmhq4Z%2FdJMcabLj6kx%2FwIdkwM8ZP6TZLKMhXyZEx0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1131&quot; height=&quot;411&quot; data-filename=&quot;스크린샷 2026-06-09 081705.png&quot; data-origin-width=&quot;1131&quot; data-origin-height=&quot;411&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;Mixed-Emotion Synthesis&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;Text-Emotion Mismatch Speech Synthesis&lt;/span&gt;&lt;/b&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;It also achieves robust performance on the mismatched set&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2026-06-09 081952.png&quot; data-origin-width=&quot;583&quot; data-origin-height=&quot;341&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/ET8h6/dJMcaip4sKi/JsXTF5YPY1Ptw4IQMCDMp1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/ET8h6/dJMcaip4sKi/JsXTF5YPY1Ptw4IQMCDMp1/img.png&quot; data-alt=&quot;Performance on the Mismatched Set&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/ET8h6/dJMcaip4sKi/JsXTF5YPY1Ptw4IQMCDMp1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FET8h6%2FdJMcaip4sKi%2FJsXTF5YPY1Ptw4IQMCDMp1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;583&quot; height=&quot;341&quot; data-filename=&quot;스크린샷 2026-06-09 081952.png&quot; data-origin-width=&quot;583&quot; data-origin-height=&quot;341&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;Performance on the Mismatched Set&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;Activation steering consistently improves E-SIM&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2026-06-09 082638.png&quot; data-origin-width=&quot;998&quot; data-origin-height=&quot;548&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/cRcn37/dJMcaftla9p/QN7JWoZwZspJK8t3KnKuIk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/cRcn37/dJMcaftla9p/QN7JWoZwZspJK8t3KnKuIk/img.png&quot; data-alt=&quot;Effect of Steering Strength $\alpha$ in Mismatch Synthesis&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/cRcn37/dJMcaftla9p/QN7JWoZwZspJK8t3KnKuIk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FcRcn37%2FdJMcaftla9p%2FQN7JWoZwZspJK8t3KnKuIk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;998&quot; height=&quot;548&quot; data-filename=&quot;스크린샷 2026-06-09 082638.png&quot; data-origin-width=&quot;998&quot; data-origin-height=&quot;548&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;Effect of Steering Strength $\alpha$ in Mismatch Synthesis&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;Single Emotion Steering&lt;/span&gt;&lt;/b&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;Compared with $\alpha=0$ (no steering), TEP increases as $\alpha$ grows&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;That is, the steering vector imparts the correct directional bias&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2026-06-09 082238.png&quot; data-origin-width=&quot;788&quot; data-origin-height=&quot;417&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/vdhu4/dJMcagFQsqu/Wdu8zMydCjkf7sqqBobNGK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/vdhu4/dJMcagFQsqu/Wdu8zMydCjkf7sqqBobNGK/img.png&quot; data-alt=&quot;Single Emotion Steering&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/vdhu4/dJMcagFQsqu/Wdu8zMydCjkf7sqqBobNGK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fvdhu4%2FdJMcagFQsqu%2FWdu8zMydCjkf7sqqBobNGK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;788&quot; height=&quot;417&quot; data-filename=&quot;스크린샷 2026-06-09 082238.png&quot; data-origin-width=&quot;788&quot; data-origin-height=&quot;417&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;Single Emotion Steering&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;Layer-wise Steering Analysis&lt;/span&gt;&lt;/b&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;In &lt;a href=&quot;https://randomsampling.tistory.com/520&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;CosyVoice2&lt;/b&gt;&lt;/a&gt;, layers 17 and 14 show the highest separability&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2026-06-09 082507.png&quot; data-origin-width=&quot;670&quot; data-origin-height=&quot;372&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/wTsuY/dJMcaip4sN0/AIDsTrcQRnJ4L1lJfc4cZ0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/wTsuY/dJMcaip4sN0/AIDsTrcQRnJ4L1lJfc4cZ0/img.png&quot; data-alt=&quot;Layer-wise TEP&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/wTsuY/dJMcaip4sN0/AIDsTrcQRnJ4L1lJfc4cZ0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FwTsuY%2FdJMcaip4sN0%2FAIDsTrcQRnJ4L1lJfc4cZ0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;670&quot; height=&quot;372&quot; data-filename=&quot;스크린샷 2026-06-09 082507.png&quot; data-origin-width=&quot;670&quot; data-origin-height=&quot;372&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;Layer-wise TEP&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;</description>
      <category>Paper/Language Model</category>
      <category>CoCoEmo</category>
      <category>language model</category>
      <author>feVeRin</author>
      <guid isPermaLink="true">https://randomsampling.tistory.com/657</guid>
      <comments>https://randomsampling.tistory.com/657#entry657comment</comments>
      <pubDate>Tue, 9 Jun 2026 13:06:28 +0900</pubDate>
    </item>
    <item>
      <title>[Paper Review] FC-TTS: Style and Timbre Control in Zero-Shot Text-to-Speech with Disentangled Speech Representation</title>
      <link>https://randomsampling.tistory.com/656</link>
      <description>&lt;h2 style=&quot;text-align: justify;&quot; data-ke-size=&quot;size26&quot;&gt;&lt;b&gt;
&lt;script src=&quot;https://cdn.jsdelivr.net/npm/katex@0.16.22/dist/katex.min.js&quot;&gt;&lt;/script&gt;
&lt;script src=&quot;https://cdn.jsdelivr.net/npm/katex@0.16.22/dist/contrib/auto-render.min.js&quot;&gt;&lt;/script&gt;
&lt;script&gt;document.addEventListener(&quot;DOMContentLoaded&quot;, function() {  renderMathInElement(document.body, {    delimiters: [      {left: &quot;$$&quot;, right: &quot;$$&quot;, display: true},      {left: &quot;$&quot;, right: &quot;$&quot;, display: false}    ]  });});&lt;/script&gt;
&lt;/b&gt;&lt;/h2&gt;
&lt;h2 style=&quot;text-align: justify;&quot; data-ke-size=&quot;size26&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;FC-TTS: Style and Timbre Control in Zero-Shot Text-to-Speech with Disentangled Speech Representations&lt;/span&gt;&lt;/b&gt;&lt;/h2&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style6&quot; /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Zero-shot Text-to-Speech still falls short on independent, precise control&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;FC-TTS&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;Introduces a 2-stage spectrogram generation pipeline and a VQ-VAE-based style encoder&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;Additionally introduces a conditioning-aware consistency loss to improve attribute separation and the reliability of dual-reference control&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;Paper (ACL 2026) : &lt;a href=&quot;https://arxiv.org/pdf/2605.24618&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;Paper Link&lt;/b&gt;&lt;/a&gt;&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style6&quot; /&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;background-color: #9feec3; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;1. Introduction&lt;/b&gt;&lt;/span&gt;&lt;/h3&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #006dd7; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;Zero-shot Text-to-Speech (TTS) offers style flexibility by conditioning on an example utterance&lt;/span&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #ee2323; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;BUT, in most zero-shot TTS models, style and timbre are entangled, which makes independent control difficult&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Disentangled approaches such as &lt;a href=&quot;https://randomsampling.tistory.com/445&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;NANSY++&lt;/b&gt;&lt;/a&gt; and &lt;a href=&quot;https://randomsampling.tistory.com/443&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;NaturalSpeech3&lt;/b&gt;&lt;/a&gt; address this, but they still suffer from imperfect disentanglement and limited robustness to unseen style-timbre combinations&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #ffc1c8; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;-&amp;gt; The paper therefore proposes FC-TTS, which improves controllability through better modeling of disentangled representations&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #006dd7; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;FC-TTS&lt;/span&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;Introduces a &lt;b&gt;2-stage spectrogram generation pipeline&lt;/b&gt; built on disentangled representations&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;Introduces a &lt;b&gt;VQ-VAE-based style encoder&lt;/b&gt; that captures fine-grained, intra-utterance style variability, and a &lt;b&gt;Conditioning-aware Consistency Loss&lt;/b&gt; that supports disentangled control&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;&amp;lt; Overview of FC-TTS &amp;gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;A zero-shot controllable TTS model built on a 2-stage pipeline and disentangled representations&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;As a result, it outperforms existing methods&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;background-color: #9feec3; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;2. Preliminary&lt;/b&gt;&lt;/span&gt;&lt;/h3&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;- Factorized Speech Codec&lt;/b&gt;&lt;/span&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;For timbre- and style-controllable TTS, &lt;a href=&quot;https://randomsampling.tistory.com/443&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;FACodec&lt;/b&gt;&lt;/a&gt; can be used&lt;/span&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #ee2323;&quot;&gt;The codec factorizes the speech signal into multiple disentangled streams of discrete tokens, each capturing a distinct speech attribute&lt;/span&gt;&lt;br /&gt;- Prosody token $\mathbf{c}_{p}$, content token $\mathbf{c}_{c}$, acoustic detail token $\mathbf{c}_{d}$&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Each stream is represented as $\mathbb{Z}^{N_{*}\times T}$ for $T$ time steps and residual quantization levels $N_{p}=1, N_{c}=2, N_{d}=3$, while speaker timbre is captured by a continuous global embedding $z_{spk}\in\mathbb{R}^{D}$&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #006dd7;&quot;&gt;The paper conditions only on $z_{spk}$ and $\mathbf{c}_{p}$; the content tokens $\mathbf{c}_{c}$ and acoustic detail tokens $\mathbf{c}_{d}$ are excluded to prevent information leakage&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;- Flow-Matching TTS&lt;/b&gt;&lt;/span&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #ee2323;&quot;&gt;Conditional TTS generates a target speech representation $\mathbf{x}\in\mathbb{R}^{F\times T}$, such as a log mel-spectrogram, given a phoneme sequence $\mathbf{y}\in\mathbb{Z}^{L}$ and conditioning information $\mathbf{c}$ such as speaker timbre&lt;/span&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Conditional Flow-Matching (CFM) can be adopted for this conditional generation&lt;/span&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;CFM defines a continuous-time transformation from a simple prior $p_{1}(\mathbf{x})$, such as an isotropic Gaussian $\mathcal{N}(0,I)$, to the target conditional distribution $p_{0}=p(\mathbf{x}|\mathbf{y},\mathbf{c})$&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;To describe the progression from $p_{1}$ to $p_{0}$, CFM introduces a time-dependent flow $\phi_{t}:[0,1]\times \mathbb{R}^{d}\rightarrow \mathbb{R}^{d}$ that transports samples along time $t\in[0,1]$&lt;br /&gt;- Each time step has a marginal distribution $p_{t}(\mathbf{x})$&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #006dd7;&quot;&gt;The flow is driven by a velocity field $v_{t}(\mathbf{x}):[0,1]\times \mathbb{R}^{d}\rightarrow \mathbb{R}^{d}$ that specifies the instantaneous direction at each point, and the two are related by an Ordinary Differential Equation (ODE):&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;(Eq. 1)&lt;/b&gt;&lt;/span&gt; $ \frac{d}{dt}\phi_{t}(\mathbf{x})=v_{t}(\phi_{t}(\mathbf{x})),\quad \phi_{1}(\mathbf{x})=\mathbf{x}_{1}$&lt;/span&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Since the true $v$ is unavailable, CFM approximates it by training $u_{\theta}(\mathbf{x},t,\mathbf{y},\mathbf{c})$ against the conditional vector field $v_{t}(\mathbf{x}|\mathbf{x}_{0})$&lt;/span&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #006dd7;&quot;&gt;The straight-line Optimal Transport (OT) trajectory is the most efficient choice, and its ground-truth velocity is&lt;/span&gt;:&lt;br /&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;(Eq. 2)&lt;/b&gt;&lt;/span&gt; $v_{t}^{OT}(\mathbf{x}_{t}|\mathbf{x}_{0})=\mathbf{x}_{0}-\mathbf{x}_{1}$&lt;br /&gt;- $\mathbf{x}_{t}=(1-t)\mathbf{x}_{1}+t\mathbf{x}_{0}$&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #006dd7;&quot;&gt;The model then minimizes the loss in &lt;b&gt;(Eq. 3)&lt;/b&gt; to align the predicted velocity $u_{\theta}$ with the OT velocity:&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;(Eq. 3)&lt;/b&gt;&lt;/span&gt; $\mathcal{L}_{CFM}=\mathbb{E}_{t,\mathbf{x}_{0},\mathbf{x}_{1}}\left[\left|\left| u_{\theta}(\mathbf{x}_{t},t,\mathbf{y},\mathbf{c}) - (\mathbf{x}_{0}-\mathbf{x}_{1})\right|\right|^{2}\right]$&lt;/span&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
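&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;To make &lt;b&gt;(Eq. 2)&lt;/b&gt; and &lt;b&gt;(Eq. 3)&lt;/b&gt; concrete, the toy sketch below builds one training pair on the straight-line path, evaluates the single-sample squared error, and integrates the velocity with Euler steps. It follows the interpolation written under &lt;b&gt;(Eq. 2)&lt;/b&gt;, where $t=0$ gives the prior sample $\mathbf{x}_{1}$ and $t=1$ gives the data sample $\mathbf{x}_{0}$. Everything here is an assumption for illustration: the function names, the 3-dimensional vectors, the omission of the conditions $\mathbf{y},\mathbf{c}$, and the &lt;code&gt;oracle&lt;/code&gt; stand-in that replaces a trained $u_{\theta}$.&lt;/span&gt;&lt;/p&gt;

```python
import random


def ot_training_pair(x0, x1, t):
    """Return x_t = (1 - t) * x1 + t * x0 and the OT target velocity x0 - x1."""
    x_t = [(1.0 - t) * n + t * d for d, n in zip(x0, x1)]
    target = [d - n for d, n in zip(x0, x1)]
    return x_t, target


def cfm_loss(predict_velocity, x0, x1, t):
    """Single-sample term of Eq. 3: squared error against x0 - x1."""
    x_t, target = ot_training_pair(x0, x1, t)
    predicted = predict_velocity(x_t, t)
    return sum((p - g) ** 2 for p, g in zip(predicted, target))


def euler_sample(predict_velocity, x1, num_steps):
    """Integrate dx/dt = u(x, t) from t = 0 (noise x1) to t = 1 with Euler steps."""
    x = list(x1)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        velocity = predict_velocity(x, step * dt)
        x = [x_k + dt * v_k for x_k, v_k in zip(x, velocity)]
    return x


random.seed(0)
x0 = [0.5, -1.0, 2.0]                       # toy "data" sample (made-up values)
x1 = [random.gauss(0.0, 1.0) for _ in x0]   # sample from the N(0, I) prior
t = random.random()


def oracle(x_t, t):
    """Hypothetical stand-in for u_theta that already knows the OT velocity."""
    return [d - n for d, n in zip(x0, x1)]


loss_at_optimum = cfm_loss(oracle, x0, x1, t)          # zero for the oracle
x_generated = euler_sample(oracle, x1, num_steps=10)   # lands close to x0
```

&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Because the OT velocity is constant along the straight path, even a few Euler steps recover the data sample when the velocity is exact, which is one reason the straight-line trajectory is considered efficient.&lt;/span&gt;&lt;/p&gt;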
&lt;h3 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;background-color: #9feec3; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;3. Method&lt;/b&gt;&lt;/span&gt;&lt;/h3&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;In FC-TTS, the timbre condition $z_{spk}$ and the style condition $\mathbf{c}_{p}$ are extracted from the same target utterance during training, whereas at inference they come from references taken from different utterances&lt;/span&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #ee2323;&quot;&gt;Architecturally, FC-TTS processes the two conditions sequentially:&lt;/span&gt;&lt;/span&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;The &lt;b&gt;Timbre stage&lt;/b&gt; anchors the timbre characteristics via $z_{spk}$ and generates a blurry spectrogram&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;The &lt;b&gt;Style stage&lt;/b&gt; refines it by imprinting prosodic characteristics via $\mathbf{c}_{p}$&lt;/span&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;This 2-stage framework ensures that each reference condition takes effect only at its intended stage&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;- Hierarchical Spectrogram Generation&lt;/b&gt;&lt;/span&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;Simply reusing the FACodec decoder, as in &lt;a href=&quot;https://randomsampling.tistory.com/443&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;NaturalSpeech3&lt;/b&gt;&lt;/a&gt;, cannot guarantee robust generation for unseen combinations, which makes independent timbre-prosody control difficult&lt;/span&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;/span&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;color: #ee2323;&quot;&gt;To address this, the paper incorporates a jointly trained CFM speech decoder and performs hierarchical log mel-spectrogram generation&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;color: #006dd7;&quot;&gt;A blurry spectrogram $\mathbf{h}$ is first generated from the timbre information, and the CFM decoder then uses the style information to refine it into the complete spectrogram $\mathbf{x}_{0}$&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;This step is trained jointly with a Mean Absolute Error (MAE) loss on the blurry spectrogram and a CFM loss on the final output&lt;br /&gt;&lt;/span&gt;&lt;span style=&quot;color: #000000;&quot;&gt;- In particular, the MAE objective $\mathcal{L}_{blur}=\mathbb{E}[\|\mathbf{h}-\mathbf{x}_{0}\|_{1}]$ encourages an over-smoothed output&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;color: #006dd7;&quot;&gt;Additionally, to prevent information leakage, $z_{spk}$ is randomly replaced with one taken from a different utterance of the same long audio file&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #ee2323;&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;As a result, the timbre adapter injects $z_{spk}$ in the first stage to anchor the timbre characteristics, and the style adapter subsequently applies $\mathbf{c}_{p}$, ensuring that each reference influences only its dedicated pathway&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
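&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;The joint objective above can be sketched roughly as follows. This is a minimal NumPy illustration, not the paper's implementation: the linear noise-to-data path, the loss weights w_blur and w_cfm, the function names, and the 80 x 120 spectrogram shape are all assumptions made for the example.&lt;/span&gt;&lt;/p&gt;

```python
import numpy as np

def blur_loss(h, x0):
    # MAE between the stage-1 blurry spectrogram h and the target spectrogram x0.
    return np.abs(h - x0).mean()

def cfm_pair(x0, noise, t):
    # Assumed linear path from noise (t=0) to data x0 (t=1).
    # Its velocity is constant along the path: x0 - noise.
    x_t = (1.0 - t) * noise + t * x0
    target_v = x0 - noise
    return x_t, target_v

def cfm_loss(v_pred, target_v):
    # Mean squared error between predicted and target vector fields.
    return ((v_pred - target_v) ** 2).mean()

def joint_loss(h, x0, v_pred, target_v, w_blur=1.0, w_cfm=1.0):
    # Hypothetical weights; the source does not state how the two terms are balanced.
    return w_blur * blur_loss(h, x0) + w_cfm * cfm_loss(v_pred, target_v)

rng = np.random.default_rng(0)
x0 = rng.normal(size=(80, 120))          # hypothetical log mel: 80 bins, 120 frames
noise = rng.normal(size=x0.shape)
x_t, target_v = cfm_pair(x0, noise, t=0.3)

# Over-smoothed stand-in for the stage-1 output: per-bin mean repeated over time.
h = x0.mean(axis=1, keepdims=True) * np.ones_like(x0)

# An untrained decoder is mimicked here by a zero vector field.
print(joint_loss(h, x0, v_pred=np.zeros_like(x0), target_v=target_v))
```

&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;In a real model, h would come from the timbre stage conditioned on $z_{spk}$ and v_pred from the CFM decoder conditioned on $\mathbf{c}_{p}$; both are replaced by stand-ins here.&lt;/span&gt;&lt;/p&gt;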
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2026-06-08 085654.png&quot; data-origin-width=&quot;1083&quot; data-origin-height=&quot;580&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/dFvGKI/dJMcaiQ7aze/QgxLJAyxnBk2Gb5mzm74UK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/dFvGKI/dJMcaiQ7aze/QgxLJAyxnBk2Gb5mzm74UK/img.png&quot; data-alt=&quot;Overview&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/dFvGKI/dJMcaiQ7aze/QgxLJAyxnBk2Gb5mzm74UK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FdFvGKI%2FdJMcaiQ7aze%2FQgxLJAyxnBk2Gb5mzm74UK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1083&quot; height=&quot;580&quot; data-filename=&quot;스크린샷 2026-06-08 085654.png&quot; data-origin-width=&quot;1083&quot; data-origin-height=&quot;580&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;Overview&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;h4 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;- VQ-VAE Style Encoding&lt;/b&gt;&lt;/span&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Zero-shot TTS models mimic voice characteristics through In-Context Learning (ICL)&lt;/span&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #006dd7;&quot;&gt;BUT, ICL assumes that timbre and style stay consistent, so it cannot reflect a speaking style that varies within a single utterance&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;To address this, the paper provides a style representation extracted from the target speech as a condition during training&lt;br /&gt;- In this setup, the model may take a shortcut by copying surface-level acoustic features of the style reference instead of capturing higher-level prosodic patterns&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #ee2323;&quot;&gt;Therefore, a TCF style encoder combining a Transformer encoder, Cross-attention, and a Finite scalar quantization (FSQ) layer is introduced to model the style representation hierarchically at the phoneme and frame levels&lt;/span&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;Prosody-only Representation&lt;/b&gt;&lt;br /&gt;- &lt;span style=&quot;color: #006dd7;&quot;&gt;The paper feeds only the prosody token $\mathbf{c}_{p}$ into the TCF and excludes the content token $\mathbf{c}_{c}$ and the acoustic detail token $\mathbf{c}_{d}$&lt;/span&gt;&lt;br /&gt;- This keeps the style encoder from encoding unintended information, so that it captures only rhythmic and intonational patterns&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;Q-Former Bottleneck&lt;/b&gt;&lt;br /&gt;- &lt;span style=&quot;color: #006dd7;&quot;&gt;A fixed set of learned query tokens attends to the variable-length encoder output via cross-attention, compressing it into a fixed number of latent tokens&lt;/span&gt;&lt;br /&gt;- This bottleneck forces the model to discard frame-level temporal detail and retain only high-level stylistic structure, which prevents overfitting to a specific acoustic realization&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;Vector Quantization&lt;/b&gt;&lt;br /&gt;- &lt;span style=&quot;color: #006dd7;&quot;&gt;The continuous latent tokens produced by the Q-Former are further discretized by FSQ&lt;/span&gt;&lt;br /&gt;- FSQ acts as an information bottleneck that suppresses low-level acoustic residuals and commits to semantically meaningful style codes&lt;/span&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
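&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;The Q-Former bottleneck and the FSQ step described above can be illustrated with a small NumPy sketch. It is a simplification under stated assumptions: a single attention head without learned projections, an odd number of FSQ levels (5) shared across all dimensions, and made-up sizes (4 queries, 16 dimensions). None of these values come from the paper.&lt;/span&gt;&lt;/p&gt;

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def qformer_pool(queries, encoder_out):
    # queries: (K, D) fixed set of learned query tokens.
    # encoder_out: (T, D) encoder output, where T varies per utterance.
    scores = queries @ encoder_out.T / np.sqrt(queries.shape[-1])
    attn = softmax(scores, axis=-1)      # (K, T): each query attends over all frames
    return attn @ encoder_out            # (K, D): always K tokens, regardless of T

def fsq(z, levels=5):
    # Finite scalar quantization with an odd level count (an assumption):
    # bound each scalar with tanh, scale to the level range, then round.
    half = (levels - 1) / 2
    return np.round(np.tanh(z) * half)

rng = np.random.default_rng(0)
queries = rng.normal(size=(4, 16))                        # hypothetical K=4, D=16
short = qformer_pool(queries, rng.normal(size=(50, 16)))  # 50-frame input
long = qformer_pool(queries, rng.normal(size=(300, 16)))  # 300-frame input
codes = fsq(short, levels=5)

print(short.shape, long.shape)   # both inputs are compressed to the same token count
print(np.unique(codes))          # only a small set of integer-valued codes remains
```

&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;The point of the sketch is the two bottlenecks: the output length no longer depends on the number of input frames, and each latent scalar can take only a handful of values. Training details such as a straight-through gradient for the rounding step are omitted.&lt;/span&gt;&lt;/p&gt;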
&lt;h4 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;- Conditional Consistency Loss&lt;/b&gt;&lt;/span&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #ee2323;&quot;&gt;To improve condition consistency in disentangled TTS, the paper introduces the Conditional Consistency Loss (CCL)&lt;/span&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;First, the CFM objective is reparameterized so that the FC-TTS decoder directly generates the log mel-spectrogram instead of a vector field&lt;/span&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #006dd7;&quot;&gt;Two attribute predictors are then trained on this spectrogram to predict the conditioning prosody token $\mathbf{c}_{p}$ and the speaker embedding $z_{spk}$&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Each predictor also receives the non-target conditioning signal: $z_{spk}$ is fed to the prosody predictor and $\mathbf{c}_{p}$ to the timbre predictor&lt;/span&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #006dd7;&quot;&gt;CCL is then obtained as a weighted sum of a cross-entropy loss for prosody prediction and a negative cosine similarity for speaker embedding consistency&lt;/span&gt;:&lt;br /&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;(Eq. 4)&lt;/b&gt;&lt;/span&gt; $\mathcal{L}_{CCL}=\lambda_{ccl\text{-}pro}\cdot\mathbb{E}\left[\text{CE}\left(\mathbf{c}_{p}, f\left(\hat{\mathbf{x}},z_{spk}\right)\right)\right]-\lambda_{ccl\text{-}spk}\cdot\mathbb{E}\left[\cos\left( z_{spk},g\left(\hat{\mathbf{x}},\mathbf{c}_{p}\right)\right)\right]$&lt;br /&gt;- $\text{CE}(\cdot, \cdot)$ : cross-entropy loss, $f(\cdot)$ : prosody predictor, $g(\cdot)$ : speaker embedding predictor, $\hat{\mathbf{x}}$ : log mel-spectrogram generated by the decoder&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
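&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;For a single utterance, Eq. 4 can be evaluated as in the NumPy sketch below. The predictors $f$ and $g$ are replaced by random stand-in outputs, and the prosody vocabulary size (8), embedding size (32), and the weights lam_pro and lam_spk are hypothetical values chosen only for illustration.&lt;/span&gt;&lt;/p&gt;

```python
import numpy as np

def cross_entropy(logits, targets):
    # logits: (N, V) predictor outputs, targets: (N,) integer prosody token ids.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def ccl_loss(prosody_logits, prosody_targets, spk_pred, spk_true,
             lam_pro=1.0, lam_spk=1.0):
    # Weighted cross-entropy term minus weighted cosine similarity term, as in Eq. 4.
    ce = cross_entropy(prosody_logits, prosody_targets)
    return lam_pro * ce - lam_spk * cosine(spk_true, spk_pred)

rng = np.random.default_rng(0)
targets = rng.integers(0, 8, size=10)          # hypothetical prosody token ids c_p
logits = rng.normal(size=(10, 8))              # stand-in for f(x_hat, z_spk)
z_spk = rng.normal(size=32)                    # conditioning speaker embedding
spk_pred = z_spk + 0.1 * rng.normal(size=32)   # stand-in for g(x_hat, c_p)

print(ccl_loss(logits, targets, spk_pred, z_spk))
```

&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;With these weights the loss is bounded below by -1, which is approached when the prosody tokens are predicted with near certainty and the predicted speaker embedding is aligned with $z_{spk}$.&lt;/span&gt;&lt;/p&gt;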
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2026-06-08 085544.png&quot; data-origin-width=&quot;612&quot; data-origin-height=&quot;282&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/KOlrY/dJMcaa6FO4Y/Vmx2qxYjWPbe5Ippb56FJ0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/KOlrY/dJMcaa6FO4Y/Vmx2qxYjWPbe5Ippb56FJ0/img.png&quot; data-alt=&quot;CCL Gradient&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/KOlrY/dJMcaa6FO4Y/Vmx2qxYjWPbe5Ippb56FJ0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FKOlrY%2FdJMcaa6FO4Y%2FVmx2qxYjWPbe5Ippb56FJ0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;612&quot; height=&quot;282&quot; data-filename=&quot;스크린샷 2026-06-08 085544.png&quot; data-origin-width=&quot;612&quot; data-origin-height=&quot;282&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;CCL Gradient&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;h3 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;background-color: #9feec3; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;4. Experiments&lt;/b&gt;&lt;/span&gt;&lt;/h3&gt;
&lt;h4 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;- Settings&lt;/b&gt;&lt;/span&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Dataset : LibriHeavy&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Comparisons : &lt;a href=&quot;https://randomsampling.tistory.com/443&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;NaturalSpeech3&lt;/b&gt;&lt;/a&gt;, &lt;a href=&quot;https://randomsampling.tistory.com/494&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;F5-TTS&lt;/b&gt;&lt;/a&gt;, &lt;a href=&quot;https://randomsampling.tistory.com/390&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;DiTTo-TTS&lt;/b&gt;&lt;/a&gt;, &lt;a href=&quot;https://randomsampling.tistory.com/238&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;CLaM-TTS&lt;/b&gt;&lt;/a&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2026-06-08 085604.png&quot; data-origin-width=&quot;611&quot; data-origin-height=&quot;306&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/c1cCA6/dJMcahSbsgc/oQKf1x7C5TkXXMkFdmDR60/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/c1cCA6/dJMcahSbsgc/oQKf1x7C5TkXXMkFdmDR60/img.png&quot; data-alt=&quot;LibriHeavy Dataset&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/c1cCA6/dJMcahSbsgc/oQKf1x7C5TkXXMkFdmDR60/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fc1cCA6%2FdJMcahSbsgc%2FoQKf1x7C5TkXXMkFdmDR60%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;611&quot; height=&quot;306&quot; data-filename=&quot;스크린샷 2026-06-08 085604.png&quot; data-origin-width=&quot;611&quot; data-origin-height=&quot;306&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;LibriHeavy Dataset&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;h4 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;- Results&lt;/b&gt;&lt;/span&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Overall, FC-TTS achieves the best performance&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2026-06-08 082756.png&quot; data-origin-width=&quot;715&quot; data-origin-height=&quot;377&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/VZV3K/dJMcacQS3dX/P4qRKrzpfjK5P0r0Gy1k3K/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/VZV3K/dJMcacQS3dX/P4qRKrzpfjK5P0r0Gy1k3K/img.png&quot; data-alt=&quot;Model Performance Comparison&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/VZV3K/dJMcacQS3dX/P4qRKrzpfjK5P0r0Gy1k3K/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FVZV3K%2FdJMcacQS3dX%2FP4qRKrzpfjK5P0r0Gy1k3K%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;715&quot; height=&quot;377&quot; data-filename=&quot;스크린샷 2026-06-08 082756.png&quot; data-origin-width=&quot;715&quot; data-origin-height=&quot;377&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;Model Performance Comparison&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;FC-TTS is also superior in terms of timbre controllability&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2026-06-08 082845.png&quot; data-origin-width=&quot;523&quot; data-origin-height=&quot;123&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/cd0LzH/dJMcaip3GxC/5fFs7kxQfkopIcIWcJ9xGk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/cd0LzH/dJMcaip3GxC/5fFs7kxQfkopIcIWcJ9xGk/img.png&quot; data-alt=&quot;Timbre Controllability&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/cd0LzH/dJMcaip3GxC/5fFs7kxQfkopIcIWcJ9xGk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fcd0LzH%2FdJMcaip3GxC%2F5fFs7kxQfkopIcIWcJ9xGk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;523&quot; height=&quot;123&quot; data-filename=&quot;스크린샷 2026-06-08 082845.png&quot; data-origin-width=&quot;523&quot; data-origin-height=&quot;123&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;Timbre Controllability&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;FC-TTS also performs better in prosody control&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2026-06-08 083006.png&quot; data-origin-width=&quot;547&quot; data-origin-height=&quot;127&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/xC2UN/dJMcagZ2uHH/xaKTyqWDkOA8lZZUsG1s81/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/xC2UN/dJMcagZ2uHH/xaKTyqWDkOA8lZZUsG1s81/img.png&quot; data-alt=&quot;Prosody Controllability&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/xC2UN/dJMcagZ2uHH/xaKTyqWDkOA8lZZUsG1s81/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FxC2UN%2FdJMcagZ2uHH%2FxaKTyqWDkOA8lZZUsG1s81%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;547&quot; height=&quot;127&quot; data-filename=&quot;스크린샷 2026-06-08 083006.png&quot; data-origin-width=&quot;547&quot; data-origin-height=&quot;127&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;Prosody Controllability&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;FC-TTS is also preferred under the AudioLLM-as-a-Judge evaluation&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2026-06-08 083117.png&quot; data-origin-width=&quot;402&quot; data-origin-height=&quot;118&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/canORI/dJMcahrb0qe/qkaZ7unpOaFPCJ4lhAurj1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/canORI/dJMcahrb0qe/qkaZ7unpOaFPCJ4lhAurj1/img.png&quot; data-alt=&quot;AudioLLM-as-a-Judge&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/canORI/dJMcahrb0qe/qkaZ7unpOaFPCJ4lhAurj1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FcanORI%2FdJMcahrb0qe%2FqkaZ7unpOaFPCJ4lhAurj1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;402&quot; height=&quot;118&quot; data-filename=&quot;스크린샷 2026-06-08 083117.png&quot; data-origin-width=&quot;402&quot; data-origin-height=&quot;118&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;AudioLLM-as-a-Judge&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;Ablation Study&lt;/span&gt;&lt;/b&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;Each component contributes to the performance improvement&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2026-06-08 083201.png&quot; data-origin-width=&quot;1152&quot; data-origin-height=&quot;232&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/Nt2d2/dJMcaf08Psj/PoHpNDBYeAXyqCVzcTtglK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/Nt2d2/dJMcaf08Psj/PoHpNDBYeAXyqCVzcTtglK/img.png&quot; data-alt=&quot;Ablation Study&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/Nt2d2/dJMcaf08Psj/PoHpNDBYeAXyqCVzcTtglK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FNt2d2%2FdJMcaf08Psj%2FPoHpNDBYeAXyqCVzcTtglK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1152&quot; height=&quot;232&quot; data-filename=&quot;스크린샷 2026-06-08 083201.png&quot; data-origin-width=&quot;1152&quot; data-origin-height=&quot;232&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;Ablation Study&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;In particular, the 2-stage design improves robustness to timbre combinations&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2026-06-08 083235.png&quot; data-origin-width=&quot;681&quot; data-origin-height=&quot;461&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/crHkh3/dJMcai4A8u5/nDdJdnvjK35vtjcr2aXefK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/crHkh3/dJMcai4A8u5/nDdJdnvjK35vtjcr2aXefK/img.png&quot; data-alt=&quot;Mel-Spectrogram Comparison&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/crHkh3/dJMcai4A8u5/nDdJdnvjK35vtjcr2aXefK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FcrHkh3%2FdJMcai4A8u5%2FnDdJdnvjK35vtjcr2aXefK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;681&quot; height=&quot;461&quot; data-filename=&quot;스크린샷 2026-06-08 083235.png&quot; data-origin-width=&quot;681&quot; data-origin-height=&quot;461&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;Mel-Spectrogram Comparison&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;</description>
      <category>Paper/TTS</category>
      <category>FC-TTS</category>
      <category>text-to-speech</category>
      <category>TTS</category>
      <author>feVeRin</author>
      <guid isPermaLink="true">https://randomsampling.tistory.com/656</guid>
      <comments>https://randomsampling.tistory.com/656#entry656comment</comments>
      <pubDate>Mon, 8 Jun 2026 10:57:42 +0900</pubDate>
    </item>
    <item>
      <title>[Notice] Polyfill Supply-Chain Attack Issue (Resolved)</title>
      <link>https://randomsampling.tistory.com/655</link>
      <description>&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&amp;nbsp;This blog had been using MathJax for LaTeX rendering, and security threats have been detected in some posts as a result of the China-based Polyfill supply-chain attack.&lt;/span&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #ee2323;&quot;&gt;&lt;b&gt;&lt;/b&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size18&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #ee2323;&quot;&gt;&lt;b&gt;As of 2026.06.03. 17:30, all MathJax-related code has been modified and every security threat has been resolved.&lt;/b&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&amp;lt;Time Line&amp;gt;&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;- 2026.06.03. 11:00 : Attack confirmed&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;- 2026.06.03. 12:30 : Finished replacing all MathJax markup in &lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000; text-align: justify;&quot;&gt;Algorithm posts&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;- 2026.06.03. 17:30 : Finished replacing all MathJax markup in Paper posts&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #ee2323;&quot;&gt;&lt;b&gt;* (Warning) This&lt;/b&gt;&lt;/span&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #ee2323;&quot;&gt;&lt;b&gt; blog never asks for user information such as that shown in the image below.&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2026-06-03 105908.png&quot; data-origin-width=&quot;1919&quot; data-origin-height=&quot;489&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/diV57G/dJMcaaS31w3/Lc3dh7zbhWXRKv9HPSjHOk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/diV57G/dJMcaaS31w3/Lc3dh7zbhWXRKv9HPSjHOk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/diV57G/dJMcaaS31w3/Lc3dh7zbhWXRKv9HPSjHOk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FdiV57G%2FdJMcaaS31w3%2FLc3dh7zbhWXRKv9HPSjHOk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1919&quot; height=&quot;489&quot; data-filename=&quot;스크린샷 2026-06-03 105908.png&quot; data-origin-width=&quot;1919&quot; data-origin-height=&quot;489&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;</description>
      <category>Notice</category>
      <author>feVeRin</author>
      <guid isPermaLink="true">https://randomsampling.tistory.com/655</guid>
      <comments>https://randomsampling.tistory.com/655#entry655comment</comments>
      <pubDate>Wed, 3 Jun 2026 11:04:56 +0900</pubDate>
    </item>
    <item>
      <title>[Shoegaze of the Month] Shoegaze of the Month #5 - May 2026</title>
      <link>https://randomsampling.tistory.com/653</link>
      <description>&lt;h2 style=&quot;text-align: justify;&quot; data-ke-size=&quot;size26&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Shoegaze of the Month #5 - May 2026&lt;/span&gt;&lt;/b&gt;&lt;/h2&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;* Only releases that were on the author's radar as of the upload date are covered, so some may be missing.&lt;/span&gt;&lt;/p&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style6&quot; /&gt;
&lt;h3 style=&quot;text-align: justify;&quot; data-ke-size=&quot;size23&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;1. The Bittersweet Taste of Early Summer&lt;/span&gt;&lt;/b&gt;&lt;/h3&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt; &amp;nbsp;Kyoto shoegaze band MoritaSaki in the Pool is back with a new album, &amp;lt;Kidcore Sculpture&amp;gt;. Compared with its predecessor, the new album leads with more colorful melodies and lively rhythms, producing the most sensuous sound tone one can expect from Japanese shoegaze. Just like the album art: a pool in the middle of the city, filled with bright yellow rubber ducks. Don't miss this album, where sleek urban sensibility and hazy nostalgia intersect exquisitely.&lt;/span&gt;&lt;/p&gt;
&lt;figure data-ke-type=&quot;video&quot; data-ke-style=&quot;alignCenter&quot; data-video-host=&quot;youtube&quot; data-video-url=&quot;https://www.youtube.com/watch?v=e6luRpDrpt8&quot; data-video-thumbnail=&quot;https://scrap.kakaocdn.net/dn/crAbdg/dJMb9cBOxGV/gmsCzR8VgnteyIxVGKipAk/img.jpg?width=1280&amp;amp;height=720&amp;amp;face=0_0_1280_720,https://scrap.kakaocdn.net/dn/j5Wgy/dJMb9eTV38G/F01VEGOz6h9jT70Tl48OgK/img.jpg?width=1280&amp;amp;height=720&amp;amp;face=0_0_1280_720,https://scrap.kakaocdn.net/dn/pBLmd/dJMb9jgDPPv/ipfSVBZrc6aqyPebBNRxYK/img.jpg?width=1280&amp;amp;height=720&amp;amp;face=0_0_1280_720&quot; data-video-width=&quot;860&quot; data-video-height=&quot;484&quot; data-video-origin-width=&quot;860&quot; data-video-origin-height=&quot;484&quot; data-ke-mobilestyle=&quot;widthContent&quot; data-video-title=&quot;SLOWDIVE&quot; data-original-url=&quot;&quot;&gt;&lt;iframe src=&quot;https://www.youtube.com/embed/e6luRpDrpt8&quot; width=&quot;860&quot; height=&quot;484&quot; frameborder=&quot;&quot; allowfullscreen=&quot;true&quot;&gt;&lt;/iframe&gt;
&lt;figcaption&gt;MoritaSaki in the Pool - 'Slowdive'&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;h3 style=&quot;text-align: justify;&quot; data-ke-size=&quot;size23&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;2. Between Covers and Originals&lt;/span&gt;&lt;/b&gt;&lt;/h3&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt; &amp;nbsp;On the 20th, Kurayamisaka released an unexpected cover single, 'Sagittarius'. The song was originally released last year by the shoegaze-idol project RAY, and because the original already had a lot in common with the pop-leaning direction Kurayamisaka has taken since their second album, the cover suits them well without feeling out of place. Then on the 27th, The Otals released a new single, 'さよならマクガフィン (Sayonara MacGuffin)', which evokes a pastoral nostalgia built around fast-tempo, fuzzy guitars.&lt;/span&gt;&lt;/p&gt;
&lt;figure data-ke-type=&quot;video&quot; data-ke-style=&quot;alignCenter&quot; data-video-host=&quot;youtube&quot; data-video-url=&quot;https://www.youtube.com/watch?v=VydmC1hvuyU&quot; data-video-thumbnail=&quot;https://scrap.kakaocdn.net/dn/62v08/dJMb896acWf/tM55Fb6iduuckWkKm4tZXk/img.jpg?width=1280&amp;amp;height=720&amp;amp;face=0_0_1280_720,https://scrap.kakaocdn.net/dn/bo7U3d/dJMb9fZBPdZ/ruFdqWeoCakDhEVJKp76P1/img.jpg?width=1280&amp;amp;height=720&amp;amp;face=0_0_1280_720,https://scrap.kakaocdn.net/dn/cZ31eH/dJMb82MJF4Z/Ik6Yxl4XkW6GZeFaO418v0/img.jpg?width=1280&amp;amp;height=720&amp;amp;face=0_0_1280_720&quot; data-video-width=&quot;860&quot; data-video-height=&quot;484&quot; data-video-origin-width=&quot;860&quot; data-video-origin-height=&quot;484&quot; data-ke-mobilestyle=&quot;widthContent&quot; data-video-title=&quot;sagittarius&quot; data-original-url=&quot;&quot;&gt;&lt;iframe src=&quot;https://www.youtube.com/embed/VydmC1hvuyU&quot; width=&quot;860&quot; height=&quot;484&quot; frameborder=&quot;&quot; allowfullscreen=&quot;true&quot;&gt;&lt;/iframe&gt;
&lt;figcaption&gt;Kurayamisaka - 'Sagittarius'&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;figure data-ke-type=&quot;video&quot; data-ke-style=&quot;alignCenter&quot; data-video-host=&quot;youtube&quot; data-video-url=&quot;https://www.youtube.com/watch?v=-N5Yw7teHJ0&quot; data-video-thumbnail=&quot;https://scrap.kakaocdn.net/dn/cyIvVH/dJMb9frLVt3/9QtETk0B5eHb9lU8gFUN2K/img.jpg?width=1280&amp;amp;height=720&amp;amp;face=0_0_1280_720,https://scrap.kakaocdn.net/dn/puci4/dJMb9bv8LuU/qKtai7xUIOq4Es1k9N2y91/img.jpg?width=1280&amp;amp;height=720&amp;amp;face=0_0_1280_720,https://scrap.kakaocdn.net/dn/Gylt8/dJMb9b3Yw2q/GxoF9WQqV3GC00jnVJAxfk/img.jpg?width=1280&amp;amp;height=720&amp;amp;face=0_0_1280_720&quot; data-video-width=&quot;860&quot; data-video-height=&quot;484&quot; data-video-origin-width=&quot;860&quot; data-video-origin-height=&quot;484&quot; data-ke-mobilestyle=&quot;widthContent&quot; data-video-title=&quot;さよならマクガフィン&quot; data-original-url=&quot;&quot;&gt;&lt;iframe src=&quot;https://www.youtube.com/embed/-N5Yw7teHJ0&quot; width=&quot;860&quot; height=&quot;484&quot; frameborder=&quot;&quot; allowfullscreen=&quot;true&quot;&gt;&lt;/iframe&gt;
&lt;figcaption&gt;The Otals - 'さよならマクガフィン'&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;h3 style=&quot;text-align: justify;&quot; data-ke-size=&quot;size23&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;3. Year N of Reboiling the Same Vocaloid Broth&lt;/span&gt;&lt;/b&gt;&lt;/h3&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt; &amp;nbsp;Meanwhile, the Vocagaze factory 路傍の石 (Robounoishi) is already churning out its third full-length of the year. The trademark of 路傍の石 is that it has released lookalikes under the &amp;lt;Mikgazer&amp;gt; banner every year since 2024, and apparently plenty of people fall for it, since this album also proudly(?) carries the title &amp;lt;Mikgazer 2026&amp;gt;. Of course, it is a bit silly to argue about who the original is when &amp;lt;Mikgazer&amp;gt; itself began as a doujin music compilation, but personally I don't much like seeing a worn-out franchise milked while the quality of each release keeps dropping.&lt;/span&gt;&lt;/p&gt;
&lt;figure data-ke-type=&quot;video&quot; data-ke-style=&quot;alignCenter&quot; data-video-host=&quot;youtube&quot; data-video-url=&quot;https://www.youtube.com/watch?v=hTr3zhGT1zQ&quot; data-video-thumbnail=&quot;https://scrap.kakaocdn.net/dn/4I7G1/dJMb8Z3xXLQ/lk4QTIEb6w7IlVcE4XbseK/img.jpg?width=1280&amp;amp;height=720&amp;amp;face=0_0_1280_720,https://scrap.kakaocdn.net/dn/zwKEA/dJMb8VNB7zw/CV33xcTr0OJtUQWYKn1lrK/img.jpg?width=1280&amp;amp;height=720&amp;amp;face=0_0_1280_720,https://scrap.kakaocdn.net/dn/c6aCbT/dJMb8UHVLTV/NakhDhSdDuxdDMHkkVsm1k/img.jpg?width=1280&amp;amp;height=720&amp;amp;face=0_0_1280_720&quot; data-video-width=&quot;860&quot; data-video-height=&quot;484&quot; data-video-origin-width=&quot;860&quot; data-video-origin-height=&quot;484&quot; data-ke-mobilestyle=&quot;widthContent&quot; data-video-title=&quot;Did say something&quot; data-original-url=&quot;&quot;&gt;&lt;iframe src=&quot;https://www.youtube.com/embed/hTr3zhGT1zQ&quot; width=&quot;860&quot; height=&quot;484&quot; frameborder=&quot;&quot; allowfullscreen=&quot;true&quot;&gt;&lt;/iframe&gt;
&lt;figcaption&gt;路傍の石 - 'Did Say Something'&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;h3 style=&quot;text-align: justify;&quot; data-ke-size=&quot;size23&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;4. Take Me Back to Heaviness&lt;/span&gt;&lt;/b&gt;&lt;/h3&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt; &amp;nbsp;With summer approaching, melodic Japanese shoegaze stands out more, but the deeply rooted Irish scene can't be overlooked. Of its releases, Silk's debut EP &amp;lt;Auralux&amp;gt;, released on the 7th, deserves the most attention. Interestingly, the band is led by Michael Smyth, guitarist of the (now disbanded) Virgins, whom I briefly covered in the &lt;a href=&quot;https://randomsampling.tistory.com/597&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;January Shoegaze of the Month&lt;/b&gt;&lt;/a&gt;. In the new band he adopts a more straightforward, heavier sound and focuses on delivering dense shoegaze.&lt;/span&gt;&lt;/p&gt;
&lt;figure data-ke-type=&quot;video&quot; data-ke-style=&quot;alignCenter&quot; data-video-host=&quot;youtube&quot; data-video-url=&quot;https://www.youtube.com/watch?v=UMKJNQjxP0c&quot; data-video-thumbnail=&quot;https://scrap.kakaocdn.net/dn/lxnBe/dJMb8VNB7z0/hRZpEZQdGvpywAEA2WQ2ck/img.jpg?width=1280&amp;amp;height=720&amp;amp;face=0_0_1280_720,https://scrap.kakaocdn.net/dn/NHnL7/dJMb8YXScMZ/8uVvgi84jB02dkIn31cI3K/img.jpg?width=1280&amp;amp;height=720&amp;amp;face=0_0_1280_720,https://scrap.kakaocdn.net/dn/cn4USL/dJMb85WZUej/CHvVivjXvxJAWm6PKgs0fk/img.jpg?width=1280&amp;amp;height=720&amp;amp;face=0_0_1280_720&quot; data-video-width=&quot;860&quot; data-video-height=&quot;484&quot; data-video-origin-width=&quot;860&quot; data-video-origin-height=&quot;484&quot; data-ke-mobilestyle=&quot;widthContent&quot; data-video-title=&quot;Auralux&quot; data-original-url=&quot;&quot;&gt;&lt;iframe src=&quot;https://www.youtube.com/embed/UMKJNQjxP0c&quot; width=&quot;860&quot; height=&quot;484&quot; frameborder=&quot;&quot; allowfullscreen=&quot;true&quot;&gt;&lt;/iframe&gt;
&lt;figcaption&gt;Silk - 'Auralux'&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;h3 style=&quot;text-align: justify;&quot; data-ke-size=&quot;size23&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;5. A Two-Hour Spacewalk&lt;/span&gt;&lt;/b&gt;&lt;/h3&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt; &amp;nbsp;Turning to the US, a sprawling new album with a two-hour running time awaits. Gabriel Moon's &amp;lt;Moonland&amp;gt;, released on the 13th, is as chaotic as it is long: unruly noise and psychedelic space rock relentlessly assault the ears. As you might expect from an Indiana native, the emo-style lyrics also stand out, though overall they are a little too childish to carry the emotion.&lt;/span&gt;&lt;/p&gt;
&lt;figure data-ke-type=&quot;video&quot; data-ke-style=&quot;alignCenter&quot; data-video-host=&quot;youtube&quot; data-video-url=&quot;https://www.youtube.com/watch?v=wg6ERmUZyiY&quot; data-video-thumbnail=&quot;https://scrap.kakaocdn.net/dn/cA3TDk/dJMb82eTxdA/4ha7oL4SJZKcFf8h8Pa6b0/img.jpg?width=1280&amp;amp;height=720&amp;amp;face=0_0_1280_720,https://scrap.kakaocdn.net/dn/hw5PT/dJMb88Gbxpf/kYAfXXkbGic0n8sJYvaszK/img.jpg?width=1280&amp;amp;height=720&amp;amp;face=0_0_1280_720,https://scrap.kakaocdn.net/dn/yz5Ax/dJMb82MJF5B/eqnJDwKjiv5hDCR7WFfaeK/img.jpg?width=1280&amp;amp;height=720&amp;amp;face=0_0_1280_720&quot; data-video-width=&quot;860&quot; data-video-height=&quot;484&quot; data-video-origin-width=&quot;860&quot; data-video-origin-height=&quot;484&quot; data-ke-mobilestyle=&quot;widthContent&quot; data-video-title=&quot;Smile&quot; data-original-url=&quot;&quot;&gt;&lt;iframe src=&quot;https://www.youtube.com/embed/wg6ERmUZyiY&quot; width=&quot;860&quot; height=&quot;484&quot; frameborder=&quot;&quot; allowfullscreen=&quot;true&quot;&gt;&lt;/iframe&gt;
&lt;figcaption&gt;Gabriel Moon - 'Smile'&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;h3 style=&quot;text-align: justify;&quot; data-ke-size=&quot;size23&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;6. Wounded Distortion&lt;/span&gt;&lt;/b&gt;&lt;/h3&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt; &amp;nbsp;Finally, let's look at Bad Light's new EP &amp;lt;Mortal Wounds&amp;gt;. Narrow Head's influence is strongly felt throughout, but instead of aggressive hardcore it adopts dramatic post-rock structures for a grander sound. In particular, the distortion placed at the key highlights further amplifies the record's precarious emotional arc.&lt;/span&gt;&lt;/p&gt;
&lt;figure data-ke-type=&quot;video&quot; data-ke-style=&quot;alignCenter&quot; data-video-host=&quot;youtube&quot; data-video-url=&quot;https://www.youtube.com/watch?v=LOCQz7sh0xQ&quot; data-video-thumbnail=&quot;https://scrap.kakaocdn.net/dn/GeVDH/dJMb85vVyZ5/nHvwiJeimB3eBGOa3oGeJ1/img.jpg?width=1280&amp;amp;height=720&amp;amp;face=0_0_1280_720,https://scrap.kakaocdn.net/dn/p2tCh/dJMb84X5h8n/dUUkkKbEjxnKkrLluBcdT1/img.jpg?width=1280&amp;amp;height=720&amp;amp;face=0_0_1280_720,https://scrap.kakaocdn.net/dn/q3CeM/dJMb81G3WXI/wgKhRH95kwPCtMuGswku5k/img.jpg?width=1280&amp;amp;height=720&amp;amp;face=0_0_1280_720&quot; data-video-width=&quot;860&quot; data-video-height=&quot;484&quot; data-video-origin-width=&quot;860&quot; data-video-origin-height=&quot;484&quot; data-ke-mobilestyle=&quot;widthContent&quot; data-video-title=&quot;Bad Light - Stuck on Letting Go (Official Audio)&quot; data-original-url=&quot;&quot;&gt;&lt;iframe src=&quot;https://www.youtube.com/embed/LOCQz7sh0xQ&quot; width=&quot;860&quot; height=&quot;484&quot; frameborder=&quot;&quot; allowfullscreen=&quot;true&quot;&gt;&lt;/iframe&gt;
&lt;figcaption&gt;Bad Light - 'Stuck on Letting Go'&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;h3 style=&quot;text-align: justify;&quot; data-ke-size=&quot;size23&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;7. Albums I Didn't Get to Mention&lt;/span&gt;&lt;/b&gt;&lt;/h3&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;- Album&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Sungaze - &amp;lt;I'm No Longer Afraid of Heights&amp;gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Gulu&amp;nbsp;Gulu&amp;nbsp;-&amp;nbsp;&amp;lt;I&amp;nbsp;Get&amp;nbsp;Anxious&amp;nbsp;When&amp;nbsp;the&amp;nbsp;Sun&amp;nbsp;Hits&amp;nbsp;My&amp;nbsp;Eyes&amp;gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Interstates&amp;nbsp;Reasoning&amp;nbsp;-&amp;nbsp;&amp;lt;River&amp;nbsp;That&amp;nbsp;Sleeps&amp;nbsp;People&amp;gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Burnt&amp;nbsp;Log&amp;nbsp;-&amp;nbsp;&amp;lt;Feed&amp;gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Above&amp;nbsp;Me&amp;nbsp;-&amp;nbsp;&amp;lt;Soften&amp;nbsp;the&amp;nbsp;Blows&amp;gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Blue&amp;nbsp;Diner.&amp;nbsp;-&amp;nbsp;&amp;lt;Disc&amp;nbsp;2&amp;gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Desario&amp;nbsp;-&amp;nbsp;&amp;lt;Long&amp;nbsp;Lost&amp;gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;The&amp;nbsp;Asteroid&amp;nbsp;No.4&amp;nbsp;-&amp;nbsp;&amp;lt;In&amp;nbsp;Praise&amp;nbsp;of&amp;nbsp;Shadows&amp;gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Lauren&amp;nbsp;Lakis&amp;nbsp;-&amp;nbsp;&amp;lt;Deadlights&amp;gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;White&amp;nbsp;Flowers&amp;nbsp;-&amp;nbsp;&amp;lt;Dreams&amp;nbsp;for&amp;nbsp;Somebody&amp;nbsp;Else&amp;gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;The&amp;nbsp;Haunted&amp;nbsp;Youth&amp;nbsp;-&amp;nbsp;&amp;lt;Boys&amp;nbsp;Cry&amp;nbsp;Too&amp;gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;- EP&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Midscale&amp;nbsp;-&amp;nbsp;&amp;lt;Dread,&amp;nbsp;This&amp;nbsp;Could&amp;nbsp;Save&amp;nbsp;You&amp;gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Brightmoon&amp;nbsp;-&amp;nbsp;&amp;lt;First&amp;nbsp;Light&amp;gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;- Single&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Lightbreather - 'Forever'&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Plaster - 'Plaster'&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Tuffie - 'Eraser'&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Red Red Sun - 'Designated Driver'&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Angel Egg - 'Blue Light'&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Ursamenor - 'Meia Volta'&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;chart (1).png&quot; data-origin-width=&quot;1971&quot; data-origin-height=&quot;1140&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/cOwvSa/dJMcahEwUg3/tZ5immpkmzeMiWKp9LXPb0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/cOwvSa/dJMcahEwUg3/tZ5immpkmzeMiWKp9LXPb0/img.png&quot; data-alt=&quot;Shoegaze of the Month #5: May 2026&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/cOwvSa/dJMcahEwUg3/tZ5immpkmzeMiWKp9LXPb0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FcOwvSa%2FdJMcahEwUg3%2FtZ5immpkmzeMiWKp9LXPb0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1971&quot; height=&quot;1140&quot; data-filename=&quot;chart (1).png&quot; data-origin-width=&quot;1971&quot; data-origin-height=&quot;1140&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;Shoegaze of the Month #5: May 2026&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;</description>
      <category>Music/이달슈</category>
      <category>슈게이즈</category>
      <category>이달슈</category>
      <category>이달의 슈게이즈</category>
      <author>feVeRin</author>
      <guid isPermaLink="true">https://randomsampling.tistory.com/653</guid>
      <comments>https://randomsampling.tistory.com/653#entry653comment</comments>
      <pubDate>Wed, 27 May 2026 20:10:10 +0900</pubDate>
    </item>
    <item>
      <title>[Paper Review] EuleroDec: A Complex-Valued RVQ-VAE for Efficient and Robust Audio Coding</title>
      <link>https://randomsampling.tistory.com/652</link>
      <description>&lt;h2 style=&quot;text-align: justify;&quot; data-ke-size=&quot;size26&quot;&gt;&lt;b&gt;
&lt;script src=&quot;https://cdn.jsdelivr.net/npm/katex@0.16.22/dist/katex.min.js&quot;&gt;&lt;/script&gt;
&lt;script src=&quot;https://cdn.jsdelivr.net/npm/katex@0.16.22/dist/contrib/auto-render.min.js&quot;&gt;&lt;/script&gt;
&lt;script&gt;document.addEventListener(&quot;DOMContentLoaded&quot;, function() {  renderMathInElement(document.body, {    delimiters: [      {left: &quot;$$&quot;, right: &quot;$$&quot;, display: true},      {left: &quot;$&quot;, right: &quot;$&quot;, display: false}    ]  });});&lt;/script&gt;
&lt;/b&gt;&lt;/h2&gt;
&lt;h2 style=&quot;text-align: justify;&quot; data-ke-size=&quot;size26&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;EuleroDec: A Complex-Valued RVQ-VAE for Efficient and Robust Audio Coding&lt;/span&gt;&lt;/b&gt;&lt;/h2&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style6&quot; /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Spectrogram-domain codecs are limited in how well they model complex-valued phase&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;EuleroDec&lt;/span&gt;&lt;/b&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Preserves magnitude-phase coupling throughout the Analysis-Quantization-Synthesis pipeline&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;In particular, removes the adversarial discriminator and the diffusion post-filter to support end-to-end processing&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Paper (ICASSP 2026) : &lt;a href=&quot;https://arxiv.org/pdf/2601.17517&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;Paper Link&lt;/b&gt;&lt;/a&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style6&quot; /&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;background-color: #9feec3; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;1. Introduction&lt;/b&gt;&lt;/span&gt;&lt;/h3&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;Spectral-domain audio codecs decompose the signal into the time-frequency domain via the STFT&lt;/span&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #006dd7;&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;color: #006dd7;&quot;&gt;The magnitude spectrum alone captures most of the perceptual content, but an improper phase spectrum causes audible artifacts in the decoded signal&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;To address this, codecs such as &lt;a href=&quot;https://randomsampling.tistory.com/210&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;EnCodec&lt;/b&gt;&lt;/a&gt; and &lt;a href=&quot;https://randomsampling.tistory.com/258&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;DAC&lt;/b&gt;&lt;/a&gt; use multi-scale adversarial discriminators, while &lt;a href=&quot;https://randomsampling.tistory.com/279&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;ScoreDec&lt;/b&gt;&lt;/a&gt; uses a score-based diffusion post-filter&lt;br /&gt;&lt;/span&gt;&lt;span style=&quot;color: #000000;&quot;&gt;- &lt;span style=&quot;color: #ee2323;&quot;&gt;BUT, these approaches suffer from slow convergence and adversarial instability&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #006dd7;&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;color: #006dd7;&quot;&gt;Here, Complex-Valued Neural Networks (CVNNs) can be used to improve speech modeling&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #ffc1c8; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;-&amp;gt; Hence, the paper proposes EuleroDec, which integrates CVNNs into an end-to-end neural codec&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #006dd7; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;EuleroDec&lt;/span&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;&lt;b&gt;Uses a complex-valued RVQ-VAE&lt;/b&gt; to reflect the algebraic structure of the STFT and its amplitude-phase coupling&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;Additionally, removes adversarial training and the diffusion-based post-filter to &lt;b&gt;support end-to-end modeling&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;&amp;lt; Overall of EuleroDec &amp;gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;An end-to-end neural codec built on a complex-valued RVQ-VAE&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;As a result, it achieves better performance than existing codecs&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;background-color: #9feec3; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;2. Background&lt;/b&gt;&lt;/span&gt;&lt;/h3&gt;
&lt;h4 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;- Residual Vector Quantization&lt;/b&gt;&lt;/span&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;In an RVQ-VAE-based codec, the encoder projects a complex STFT frame onto a latent vector $\mathbf{z}\in\mathbb{R}^{H}$&lt;/span&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;color: #006dd7;&quot;&gt;A stack of $M$ residual codebooks $\{\mathcal{E}^{(m)}\}_{m=1}^{M}$ iteratively approximates $\mathbf{z}$:&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;(Eq. 1)&lt;/b&gt;&lt;/span&gt; $\mathbf{r}^{(m)}=\mathbf{z}-\sum_{j&amp;lt;m}\mathbf{e}_{k_{j}}^{(j)},\,\,\, k_{m}=\arg\min_{k}|| \mathbf{r}^{(m)}-\mathbf{e}_{k}^{(m)}||_{2}$&lt;br /&gt;- $\mathbf{e}_{k}^{(m)}$ : the $k$-th centroid of the stage-$m$ codebook, with $\mathbf{e}_{k_{m}}^{(m)}$ being the selected one&lt;br /&gt;&lt;/span&gt;&lt;span style=&quot;color: #000000;&quot;&gt;- Only the index sequence $(k_{1},...,k_{M})$ is transmitted, giving a bitrate of $R_{f}M\log_{2}K$ for frame rate $R_{f}$ and codebook size $K$&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;color: #006dd7;&quot;&gt;Existing spectral codecs must split the complex STFT into two real signals before quantization&lt;/span&gt;:&lt;/span&gt;&lt;/span&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;One option is to split the spectrum into the modulus $|X|$ and the unwrapped phase $\angle X$ and train two independent RVQ pipelines&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;Alternatively, the spectrum can be separated into its real/imaginary components $\mathfrak{R}\{X\}, \mathfrak{I}\{X\}$ and fed to an RVQ cascade optimized with the Euclidean Mean-Squared Error&lt;br /&gt;- The outputs are then recombined as $\hat{X}=\hat{R}+j\hat{I}$ and passed to the iSTFT&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #ee2323;&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;color: #ee2323;&quot;&gt;BUT, 앞선 두 방식 모두 magnitude, phase 간의 intrinsic correlation을 neglect 한다는 단점이 있음&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
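&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;The two real-valued splits described above can be sketched in plain Python. This is only an illustration under stated assumptions: the uniform scalar quantizer &lt;code&gt;quantize&lt;/code&gt; is a hypothetical stand-in for each independent RVQ pipeline, &lt;code&gt;cmath.phase&lt;/code&gt; returns the wrapped (not unwrapped) phase, and the function names and toy values are invented for this note rather than taken from any codec.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
```python
import cmath

def quantize(value, step):
    """Uniform scalar quantizer; a hypothetical stand-in for one RVQ pipeline."""
    return step * round(value / step)

def code_polar(X, step):
    """Split into modulus and phase, quantize each independently, recombine."""
    mags = [quantize(abs(v), step) for v in X]
    phases = [quantize(cmath.phase(v), step) for v in X]  # wrapped phase (assumption)
    return [cmath.rect(m, p) for m, p in zip(mags, phases)]

def code_cartesian(X, step):
    """Split into real/imaginary parts, quantize each independently, recombine."""
    reals = [quantize(v.real, step) for v in X]
    imags = [quantize(v.imag, step) for v in X]
    return [complex(r, i) for r, i in zip(reals, imags)]  # X_hat = R_hat + j * I_hat

X = [1.0 + 2.0j, -0.5 + 0.25j, 0.1 - 3.0j]  # toy STFT bins
for name, coder in (("polar", code_polar), ("cartesian", code_cartesian)):
    X_hat = coder(X, step=0.1)
    err = max(abs(a - b) for a, b in zip(X, X_hat))
    print(name, "max reconstruction error:", round(err, 4))
```
&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Because each part is quantized on its own, neither coder sees how the two parts co-vary, which is the limitation noted in the last bullet.&lt;/span&gt;&lt;/p&gt;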
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2026-05-20 083804.png&quot; data-origin-width=&quot;732&quot; data-origin-height=&quot;597&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/ciuSrD/dJMcagZP6Kz/86cB9zIAyL66hgidA67hM1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/ciuSrD/dJMcagZP6Kz/86cB9zIAyL66hgidA67hM1/img.png&quot; data-alt=&quot;Overview&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/ciuSrD/dJMcagZP6Kz/86cB9zIAyL66hgidA67hM1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FciuSrD%2FdJMcagZP6Kz%2F86cB9zIAyL66hgidA67hM1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;732&quot; height=&quot;597&quot; data-filename=&quot;스크린샷 2026-05-20 083804.png&quot; data-origin-width=&quot;732&quot; data-origin-height=&quot;597&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;Overview&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;h4 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;- Complex-Valued Neural Networks&lt;/b&gt;&lt;/span&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #ee2323;&quot;&gt;Complex-valued neural networks represent inputs, weights, and activations as $z=x+iy$ and compute with true complex algebra&lt;/span&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #006dd7;&quot;&gt;Complex convolution is linear over $\mathbb{C}$ and couples the real and imaginary parts:&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;(Eq. 2)&lt;/b&gt;&lt;/span&gt; $ (w*z)[n]=\sum_{k}w_{k}z_{n-k},\,\,\, w_{k}=a_{k}+ib_{k}, \,\,\,z_{n-k}=x_{n-k}+iy_{n-k}$&lt;br /&gt;- Each product expands to $w_{k}z_{n-k}=(a_{k}x_{n-k}-b_{k}y_{n-k})+i(a_{k}y_{n-k}+b_{k}x_{n-k})$, so both output parts mix $x$ and $y$&lt;br /&gt;- This coupling lets the model learn amplitude-phase interactions instead of treating $x,y$ as independent channels&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #006dd7;&quot;&gt;Phase equivariance means that, for any $\phi\in\mathbb{R}$&lt;/span&gt;:&lt;br /&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;(Eq. 3)&lt;/b&gt;&lt;/span&gt; $f\left(e^{i\phi}z\right)=e^{i\phi}f(z)$&lt;br /&gt;- This preserves the geometry induced by $U(1)$ rotations&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #006dd7;&quot;&gt;The $\text{modReLU}$ activation leaves the phase intact and applies a threshold to the modulus&lt;/span&gt;:&lt;br /&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;(Eq. 4)&lt;/b&gt;&lt;/span&gt; $\text{modReLU}(z)=\text{ReLU}(|z|+b)\frac{z}{|z|}$&lt;br /&gt;- $b$ : bias that sets the threshold on $|z|$&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Normalization whitens $(x,y)$ jointly with a $2\times 2$ covariance rather than normalizing each part separately, which models the cross-channel dependence&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
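&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Eq. 2 to Eq. 4 can be checked numerically with Python's built-in &lt;code&gt;complex&lt;/code&gt; type. The sketch below is a toy, not the paper's layer: it assumes a full 1-D convolution with implicit zero padding, returns 0 for $|z|=0$ in $\text{modReLU}$ as a convention of this note, and uses made-up weights, inputs, and function names.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
```python
import cmath

def complex_conv1d(w, z):
    """Full 1-D convolution (w*z)[n] = sum_k w[k] * z[n-k] over complex numbers."""
    out = [0j] * (len(w) + len(z) - 1)
    for k, w_k in enumerate(w):
        for n, z_n in enumerate(z):
            out[k + n] += w_k * z_n  # complex product couples real and imaginary parts
    return out

def mod_relu(z, b):
    """modReLU(z) = ReLU(|z| + b) * z / |z|; 0 when |z| = 0 (assumed convention)."""
    mag = abs(z)
    if mag == 0.0:
        return 0j
    return max(mag + b, 0.0) * (z / mag)

def layer(w, z, b):
    """Toy layer: complex convolution followed by modReLU."""
    return [mod_relu(v, b) for v in complex_conv1d(w, z)]

w = [0.5 + 0.2j, -0.3 + 0.1j]            # toy complex kernel
z = [1 + 2j, -0.5 + 0.5j, 0.25 - 1j]     # toy complex input
rot = cmath.exp(1j * 0.7)                # global phase rotation e^{i phi}

lhs = layer(w, [rot * v for v in z], b=-0.5)    # f(e^{i phi} z)
rhs = [rot * v for v in layer(w, z, b=-0.5)]    # e^{i phi} f(z)
equivariant = all(cmath.isclose(a, c, abs_tol=1e-12) for a, c in zip(lhs, rhs))
print("phase equivariant on this example:", equivariant)
```
&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;The check passes because the convolution is linear over $\mathbb{C}$ and $\text{modReLU}$ only rescales the modulus, so a global rotation commutes with both steps.&lt;/span&gt;&lt;/p&gt;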
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2026-05-20 083717.png&quot; data-origin-width=&quot;450&quot; data-origin-height=&quot;442&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/c3qz38/dJMcadhNby7/qMhCXz0p8tKzLXWp2pSU20/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/c3qz38/dJMcadhNby7/qMhCXz0p8tKzLXWp2pSU20/img.png&quot; data-alt=&quot;$\text{modReLU}$ Activation&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/c3qz38/dJMcadhNby7/qMhCXz0p8tKzLXWp2pSU20/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fc3qz38%2FdJMcadhNby7%2FqMhCXz0p8tKzLXWp2pSU20%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;450&quot; height=&quot;442&quot; data-filename=&quot;스크린샷 2026-05-20 083717.png&quot; data-origin-width=&quot;450&quot; data-origin-height=&quot;442&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;$\text{modReLU}$ Activation&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;h3 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;background-color: #9feec3; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;3. Method&lt;/b&gt;&lt;/span&gt;&lt;/h3&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #006dd7;&quot;&gt;The paper operates in the $\texttt{complex64}$ domain, with $x\in\mathbb{C}^{B\times C\times F\times T}$&lt;/span&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;At 24 kHz, a complex spectrogram over 256 frames is computed with $N_{FFT}=512, \text{win}=512, \text{hop}=64, \texttt{Hann window}$, and RVQ with 2048-entry codebooks is used&lt;/span&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;At 6 kbps, a temporal stride of 8 reduces the 256 frames to 32 latent frames; with fixed-length coding and 12 codebooks this gives $\approx$ 6.2 kbps&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;At 12 kbps, a temporal stride of 4 doubles the token rate while keeping the same number of codebooks, giving $\approx$ 12.4 kbps&lt;/span&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #ee2323;&quot;&gt;Architecturally, a complex-valued VQ-VAE is used&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
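&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;The two bitrates follow from the $R_{f}M\log_{2}K$ formula. The sketch below assumes the latent frame rate is the sample rate divided by the hop and then by the temporal stride; that reading is an assumption of this note, chosen because it reproduces the quoted figures. The function name is hypothetical.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
```python
import math

def rvq_bitrate_kbps(sample_rate, hop, temporal_stride, num_codebooks, codebook_size):
    """Fixed-length-coding bitrate R_f * M * log2(K), in kbps."""
    stft_frames_per_sec = sample_rate / hop                        # 24000 / 64 = 375
    latent_frames_per_sec = stft_frames_per_sec / temporal_stride  # R_f
    bits_per_frame = num_codebooks * math.log2(codebook_size)      # M * log2(K)
    return latent_frames_per_sec * bits_per_frame / 1000.0

low = rvq_bitrate_kbps(24000, 64, temporal_stride=8, num_codebooks=12, codebook_size=2048)
high = rvq_bitrate_kbps(24000, 64, temporal_stride=4, num_codebooks=12, codebook_size=2048)
print(low, high)  # prints 6.1875 12.375
```
&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Each 2048-entry codebook costs $\log_{2}2048=11$ bits, so 12 codebooks cost 132 bits per latent frame; halving the stride doubles the frame rate and therefore the bitrate.&lt;/span&gt;&lt;/p&gt;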
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;- Encoder and Decoder&lt;/b&gt;&lt;/span&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #ee2323;&quot;&gt;The 4 downsampling stages follow an anisotropic $\text{freq}\times \text{time}$ schedule, and the decoder mirrors it with transposed convolutions&lt;/span&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #006dd7;&quot;&gt;The encoder has 5 complex residual layers with dilations $((1,1), (3,3), (3,5), (3,7), (1,1))$, enlarging the receptive field while maintaining stable complex statistics&lt;/span&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;A complex $3\times 7$ convolution and 4 downsampling stages are then applied for hierarchical compression&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;At each stage, a gated skip branch computes adaptive complex average pooling on the input and applies a $1\times 1$ complex projection&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;This branch is summed with the main path, which consists of complex downsampling, normalization, a $3\times 3$ complex convolution, complex axial self-attention, and a $1\times 1$ complex projection&lt;br /&gt;- The strided branch is summed with drop-path probability $p=0.05$&lt;/span&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;The encoder keeps the 2D spectrogram structure to retain spatial relations&lt;br /&gt;- &lt;span style=&quot;color: #006dd7;&quot;&gt;The decoder mirrors this mechanism without the pooling branch and applies 4 upsampling stages with frequency-axis attention and complex feed-forward blocks to restore the full-resolution complex spectrogram&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2026-05-20 083630.png&quot; data-origin-width=&quot;401&quot; data-origin-height=&quot;155&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/Bqduw/dJMcah5AdJw/Tmg9IUST7KnpkoTJD8MqWk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/Bqduw/dJMcah5AdJw/Tmg9IUST7KnpkoTJD8MqWk/img.png&quot; data-alt=&quot;Encoder Settings&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/Bqduw/dJMcah5AdJw/Tmg9IUST7KnpkoTJD8MqWk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FBqduw%2FdJMcah5AdJw%2FTmg9IUST7KnpkoTJD8MqWk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;401&quot; height=&quot;155&quot; data-filename=&quot;스크린샷 2026-05-20 083630.png&quot; data-origin-width=&quot;401&quot; data-origin-height=&quot;155&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;Encoder Settings&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;h4 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;- Vector Quantizer&lt;/b&gt;&lt;/span&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Quantization is also performed in the complex domain&lt;/span&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;The encoder output $z_{e}\in\mathbb{C}^{B\times C\times F\times T}$ is reshaped by collapsing frequency into channels, yielding $z_{e}^{\flat}\in\mathbb{C}^{B\times (C\cdot F)\times T}$&lt;br /&gt;- A complex linear projection $W_{in}\in\mathbb{C}^{D\times (C\cdot F)}$ maps the merged representation to the code dimension, after which an $S$-stage Residual Vector Quantizer is applied&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Codebooks are initialized after 30 warm-up optimization steps by sampling centroid seeds from the current continuous encoder embeddings and adding small complex Gaussian noise&lt;/span&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #ee2323;&quot;&gt;At each stage and for every time index, vector quantization selects the nearest complex centroid under the Euclidean metric induced by the Hermitian inner product&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #006dd7;&quot;&gt;That is, for $\mathcal{E}=\{e_{k}\}_{k=1}^{K}\subset \mathbb{C}^{D}$ and $x\in\mathbb{C}^{D}$&lt;/span&gt;:&lt;br /&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;(Eq. 5)&lt;/b&gt;&lt;/span&gt; $ d_{k}(x)=||x||_{2}^{2}+||e_{k}||_{2}^{2}-2\text{Re}\left(x^{H}e_{k}\right)$&lt;br /&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;(Eq. 6)&lt;/b&gt;&lt;/span&gt; $k^{*}(x)=\arg\min_{k}d_{k}(x)$&lt;br /&gt;- $x^{H}$ : conjugate transpose of $x$, so $d_{k}(x)=||x-e_{k}||_{2}^{2}$&lt;/span&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Each stage output accumulates the quantized reconstruction and updates the residual for the next stage&lt;/span&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #006dd7;&quot;&gt;Encoder stability is promoted by a commitment loss that pulls $z_{e}$ toward its assigned centroid:&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;(Eq. 7)&lt;/b&gt;&lt;/span&gt; $\mathcal{L}_{commit}=\beta\frac{1}{N}\sum_{n=1}^{N}\left|\left| z_{e,n}-\text{sg}\left( e_{k^{*}(n)}\right)\right|\right|_{2}^{2}$&lt;br /&gt;- $\text{sg}$ : stop-gradient operation&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Codebooks are updated with exponential moving averages of the assignment counts and feature sums&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;After the last stage, a complex linear map $W_{out}\in\mathbb{C}^{(C\cdot F)\times D}$ projects back and frequency is unmerged to recover $z_{q}\in\mathbb{C}^{B\times C\times F\times T}$ for decoding&lt;/span&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #006dd7;&quot;&gt;In addition, the paper tracks per-code usage $u_{k}$ and flags codes with $u_{k}\leq \tau$ as dead&lt;/span&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Each dead code is re-seeded with probability $p_{refresh}=0.015$ from a feature $x_{i}$ sampled at random from the current mini-batch, plus small complex Gaussian noise $\epsilon\sim\mathcal{CN}(0,\sigma^{2}I)$&lt;br /&gt;- $\sigma=0.001$&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Then $e_{k}\leftarrow x_{i}+\epsilon$ is set, the EMA buffer is synchronized as $\bar{e}_{k}\leftarrow e_{k}$, and $u_{k}\leftarrow \tau+1$ is set to prevent immediate re-pruning&lt;/span&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
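&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;The nearest-centroid rule (Eq. 5, Eq. 6), the residual stages, and the dead-code refresh can be sketched for a single time step in plain Python. This is a toy under stated assumptions, not the paper's implementation: vectors are lists of &lt;code&gt;complex&lt;/code&gt; values, codebooks are random, the sizes are tiny, the demo uses $p_{refresh}=1.0$ instead of $0.015$ so the effect is visible, and every function name is hypothetical. The projections, EMA update, and commitment loss are left out.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
```python
import random

def sq_dist(x, e):
    """Eq. 5: ||x||^2 + ||e||^2 - 2 Re(x^H e) for complex vectors x, e."""
    xx = sum(abs(v) ** 2 for v in x)
    ee = sum(abs(v) ** 2 for v in e)
    cross = sum((xv.conjugate() * ev).real for xv, ev in zip(x, e))
    return xx + ee - 2.0 * cross

def nearest(x, codebook):
    """Eq. 6: index of the centroid with the smallest distance."""
    return min(range(len(codebook)), key=lambda k: sq_dist(x, codebook[k]))

def rvq_encode(x, codebooks):
    """Run S stages; each stage quantizes the residual left by the previous ones."""
    residual, quantized, indices = list(x), [0j] * len(x), []
    for codebook in codebooks:
        k = nearest(residual, codebook)
        indices.append(k)
        quantized = [q + e for q, e in zip(quantized, codebook[k])]
        residual = [r - e for r, e in zip(residual, codebook[k])]
    return indices, quantized, residual

def refresh_dead_codes(codebook, ema, usage, batch, tau, p_refresh, sigma, rng):
    """Re-seed codes whose usage is at or below tau from random batch features."""
    s = sigma / 2 ** 0.5  # CN(0, sigma^2 I): each real/imag part has variance sigma^2 / 2
    refreshed = []
    for k in range(len(codebook)):
        if usage[k] > tau or rng.random() >= p_refresh:
            continue  # code is alive, or not selected for refresh this step
        x_i = rng.choice(batch)
        codebook[k] = [v + complex(rng.gauss(0.0, s), rng.gauss(0.0, s)) for v in x_i]
        ema[k] = list(codebook[k])  # synchronize the EMA buffer
        usage[k] = tau + 1          # avoid immediate re-pruning
        refreshed.append(k)
    return refreshed

rng = random.Random(0)

def rand_vec(d):
    return [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(d)]

D, K, S = 4, 8, 3  # toy sizes, not the paper's
codebooks = [[rand_vec(D) for _ in range(K)] for _ in range(S)]
x = rand_vec(D)
indices, x_q, residual = rvq_encode(x, codebooks)
print("indices:", indices)

usage = [0, 5, 0, 7, 1, 9, 0, 3]  # toy usage counts for stage 0
ema = [list(c) for c in codebooks[0]]
refreshed = refresh_dead_codes(codebooks[0], ema, usage, [x], tau=1,
                               p_refresh=1.0, sigma=0.001, rng=rng)
print("refreshed codes:", refreshed)
```
&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;By construction, the input equals the accumulated quantized vector plus the final residual, and only the per-stage indices would need to be transmitted.&lt;/span&gt;&lt;/p&gt;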
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2026-05-20 083605.png&quot; data-origin-width=&quot;521&quot; data-origin-height=&quot;510&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/WYMpn/dJMcafNtVlb/2zVfYmoK3G1BvokTFCXq1K/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/WYMpn/dJMcafNtVlb/2zVfYmoK3G1BvokTFCXq1K/img.png&quot; data-alt=&quot;RVQ Embedding&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/WYMpn/dJMcafNtVlb/2zVfYmoK3G1BvokTFCXq1K/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FWYMpn%2FdJMcafNtVlb%2F2zVfYmoK3G1BvokTFCXq1K%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;521&quot; height=&quot;510&quot; data-filename=&quot;스크린샷 2026-05-20 083605.png&quot; data-origin-width=&quot;521&quot; data-origin-height=&quot;510&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;RVQ Embedding&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;h3 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;background-color: #9feec3; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;4. Experiments&lt;/b&gt;&lt;/span&gt;&lt;/h3&gt;
&lt;h4 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;- Settings&lt;/b&gt;&lt;/span&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Dataset : LibriTTS&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Comparisons : &lt;a href=&quot;https://randomsampling.tistory.com/201&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;AudioDec&lt;/b&gt;&lt;/a&gt;, &lt;a href=&quot;https://randomsampling.tistory.com/210&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;EnCodec&lt;/b&gt;&lt;/a&gt;, APCodec&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;- Results&lt;/b&gt;&lt;/span&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Overall, EuleroDec achieves the best performance&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2026-05-20 083159.png&quot; data-origin-width=&quot;412&quot; data-origin-height=&quot;490&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/vKt3N/dJMcacb6nTc/Q3fvJ1Lmve81ZRywD7AlT0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/vKt3N/dJMcacb6nTc/Q3fvJ1Lmve81ZRywD7AlT0/img.png&quot; data-alt=&quot;Model performance comparison&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/vKt3N/dJMcacb6nTc/Q3fvJ1Lmve81ZRywD7AlT0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FvKt3N%2FdJMcacb6nTc%2FQ3fvJ1Lmve81ZRywD7AlT0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;412&quot; height=&quot;490&quot; data-filename=&quot;스크린샷 2026-05-20 083159.png&quot; data-origin-width=&quot;412&quot; data-origin-height=&quot;490&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;Model performance comparison&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;Ablation Study&lt;/span&gt;&lt;/b&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;Each component contributes to the performance improvement&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2026-05-20 083345.png&quot; data-origin-width=&quot;511&quot; data-origin-height=&quot;97&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/OMBMp/dJMcabYAKAl/mw6gbfoaszryIMJkk4ilB0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/OMBMp/dJMcabYAKAl/mw6gbfoaszryIMJkk4ilB0/img.png&quot; data-alt=&quot;Ablation Study&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/OMBMp/dJMcabYAKAl/mw6gbfoaszryIMJkk4ilB0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FOMBMp%2FdJMcabYAKAl%2Fmw6gbfoaszryIMJkk4ilB0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;511&quot; height=&quot;97&quot; data-filename=&quot;스크린샷 2026-05-20 083345.png&quot; data-origin-width=&quot;511&quot; data-origin-height=&quot;97&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;Ablation Study&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;The complex-valued AE yields the best results&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2026-05-20 083529.png&quot; data-origin-width=&quot;557&quot; data-origin-height=&quot;221&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/deNSz1/dJMcajvwe5l/zKCgf4uxQJFUe3RAYeitIk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/deNSz1/dJMcajvwe5l/zKCgf4uxQJFUe3RAYeitIk/img.png&quot; data-alt=&quot;AutoEncoder Design&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/deNSz1/dJMcajvwe5l/zKCgf4uxQJFUe3RAYeitIk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FdeNSz1%2FdJMcajvwe5l%2FzKCgf4uxQJFUe3RAYeitIk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;557&quot; height=&quot;221&quot; data-filename=&quot;스크린샷 2026-05-20 083529.png&quot; data-origin-width=&quot;557&quot; data-origin-height=&quot;221&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;AutoEncoder Design&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;</description>
      <category>Paper/Neural Codec</category>
      <category>EuleroDec</category>
      <category>Neural Codec</category>
      <author>feVeRin</author>
      <guid isPermaLink="true">https://randomsampling.tistory.com/652</guid>
      <comments>https://randomsampling.tistory.com/652#entry652comment</comments>
      <pubDate>Wed, 20 May 2026 12:57:56 +0900</pubDate>
    </item>
    <item>
      <title>[Paper 리뷰] SyncSpeech: Efficient and Low-Latency Text-to-Speech based on Temporal Masked Transformer</title>
      <link>https://randomsampling.tistory.com/651</link>
      <description>&lt;script src=&quot;https://cdn.jsdelivr.net/npm/katex@0.16.22/dist/katex.min.js&quot;&gt;
&lt;/script&gt;
&lt;script src=&quot;https://cdn.jsdelivr.net/npm/katex@0.16.22/dist/contrib/auto-render.min.js&quot;&gt;
&lt;/script&gt;
&lt;script&gt;
document.addEventListener(&quot;DOMContentLoaded&quot;, function() {
  renderMathInElement(document.body, {
    delimiters: [
      {left: &quot;$$&quot;, right: &quot;$$&quot;, display: true},
      {left: &quot;$&quot;, right: &quot;$&quot;, display: false}
    ]
  });
});
&lt;/script&gt;
&lt;h2 style=&quot;text-align: justify;&quot; data-ke-size=&quot;size26&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;SyncSpeech: Efficient and Low-Latency Text-to-Speech based on Temporal Masked Transformer&lt;/span&gt;&lt;/b&gt;&lt;/h2&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style6&quot; /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Text-to-Speech models are still limited by latency&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;SyncSpeech&lt;/span&gt;&lt;/b&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Built on a Temporal Masked Transformer, it unifies the temporally ordered generation of autoregressive models with the parallel decoding of non-autoregressive models&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Additionally improves training efficiency through High-Probability Masking&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Paper (ICASSP 2026) : &lt;a href=&quot;https://ieeexplore.ieee.org/abstract/document/11460607/&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;Paper Link&lt;/b&gt;&lt;/a&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style6&quot; /&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;background-color: #9feec3; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;1. Introduction&lt;/b&gt;&lt;/span&gt;&lt;/h3&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;Text-to-Speech (TTS) is divided into autoregressive (AR) and non-autoregressive (NAR) paradigms&lt;/span&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #006dd7; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;AR TTS models are formulated as a conditional language modeling task and generate speech tokens left to right in temporal order&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #006dd7;&quot;&gt;NAR TTS models, in contrast, achieve higher generation efficiency through parallel prediction&lt;/span&gt;&lt;br /&gt;- &lt;span style=&quot;color: #ee2323;&quot;&gt;BUT, NAR models struggle with incremental generation and therefore suffer from high first-packet latency&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #ffc1c8; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;-&amp;gt; Hence SyncSpeech is proposed, combining the strengths of the AR and NAR paradigms&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #006dd7; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;SyncSpeech&lt;/span&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;Integrates the AR and NAR paradigms through the &lt;b&gt;Temporal Masked Transformer (TMT)&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Additionally &lt;b&gt;introduces High-Probability Masking&lt;/b&gt; to improve training efficiency&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;&amp;lt; Overview of SyncSpeech &amp;gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;A low-latency TTS model built on TMT and High-Probability Masking&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;As a result, it outperforms existing methods&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;background-color: #9feec3; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;2. Method&lt;/b&gt;&lt;/span&gt;&lt;/h3&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #ee2323;&quot;&gt;SyncSpeech consists of a text-to-token model and a token-to-speech model&lt;/span&gt;&lt;br /&gt;- The Temporal Masked Transformer (TMT) serves as the backbone of the text-to-token model, and the off-the-shelf chunk-aware speech decoder of &lt;a href=&quot;https://randomsampling.tistory.com/520&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;CosyVoice2&lt;/b&gt;&lt;/a&gt; serves as the token-to-speech module&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;- Temporal Masked Generative Transformer&lt;/b&gt;&lt;/span&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Suppose a transcribed speech dataset $(\tilde{\mathbf{x}},\tilde{\mathbf{y}})$ is given, where $\tilde{\mathbf{x}}$ is an audio sample and $\tilde{\mathbf{y}}$ is its transcript&lt;/span&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #006dd7;&quot;&gt;The transcript $\tilde{\mathbf{y}}$ is tokenized by a text tokenizer into a BPE token sequence $\mathbf{y}=[y_{1},y_{2},y_{3},...,y_{L}]$&lt;br /&gt;&lt;span style=&quot;color: #000000;&quot;&gt;- $L$ : number of BPE tokens&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #006dd7;&quot;&gt;The paper then uses an off-the-shelf semantic speech tokenizer to encode the speech sample $\tilde{\mathbf{x}}$ into $T$ frames of discrete speech tokens $\mathbf{s}=[s_{1},s_{2},s_{3},...,s_{T}]$&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;The duration tokens $\mathbf{a}=[a_{1},a_{2},a_{3},...,a_{L}]$ indicate the end time of each BPE token within the speech token sequence, where $\mathbf{a}$ is obtained by applying an alignment tool to the $(\tilde{\mathbf{x}},\tilde{\mathbf{y}})$ pair&lt;br /&gt;- $a_{L}=T$&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
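&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;The relation between the duration tokens and the speech tokens can be illustrated with a minimal Python sketch. The per-token frame counts and the token ids below are made-up toy values, and the helper names are hypothetical; in the paper the end times come from an alignment tool.&lt;/span&gt;&lt;/p&gt;

```python
from itertools import accumulate


def end_times_from_lengths(lengths):
    """Turn per-BPE-token frame counts into cumulative end times a_1..a_L."""
    return list(accumulate(lengths))


def split_by_duration(speech_tokens, end_times):
    """Slice the speech tokens into one segment per BPE token."""
    starts = [0] + end_times[:-1]
    return [speech_tokens[b:e] for b, e in zip(starts, end_times)]


# Toy data (assumption): 3 BPE tokens aligned to 2, 3 and 1 speech frames.
lengths = [2, 3, 1]
speech_tokens = [11, 12, 21, 22, 23, 31]  # T = 6 frames

a = end_times_from_lengths(lengths)
segments = split_by_duration(speech_tokens, a)
print(a)         # [2, 5, 6]
print(segments)  # [[11, 12], [21, 22, 23], [31]]
```

&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;The last end time equals the number of frames, matching $a_{L}=T$.&lt;/span&gt;&lt;/p&gt;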
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2026-05-19 081317.png&quot; data-origin-width=&quot;1083&quot; data-origin-height=&quot;428&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/OGq1A/dJMcadosidy/MOmdVgNhn06RKQbckuNCi1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/OGq1A/dJMcadosidy/MOmdVgNhn06RKQbckuNCi1/img.png&quot; data-alt=&quot;Overview&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/OGq1A/dJMcadosidy/MOmdVgNhn06RKQbckuNCi1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FOGq1A%2FdJMcadosidy%2FMOmdVgNhn06RKQbckuNCi1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1083&quot; height=&quot;428&quot; data-filename=&quot;스크린샷 2026-05-19 081317.png&quot; data-origin-width=&quot;1083&quot; data-origin-height=&quot;428&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;Overview&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;h4 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;- Sequence Design&lt;/b&gt;&lt;/span&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #ee2323;&quot;&gt;During sequence construction, a random truncation strategy is used to stay consistent with the inference process&lt;/span&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;To this end, a random $n\in[1,L]$ is selected&lt;br /&gt;- This indicates that, when receiving streaming text input, TMT must generate the speech tokens corresponding to the $n$-th BPE token&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;color: #006dd7;&quot;&gt;To avoid unnatural pauses, TMT looks ahead $q$ text tokens, which yields the truncated text token sequence $\mathbf{y}'=[y_{1},y_{2},y_{3},...,y_{L'}]$&lt;/span&gt;&lt;br /&gt;- $L'=\min(L,n+q)$&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;color: #006dd7;&quot;&gt;Based on the duration tokens $\mathbf{a}$, the truncated speech token sequence $\mathbf{s}_{1:a_{n}}=[s_{1},s_{2},...,s_{a_{n}}]$ is obtained, and a binary mask $\mathbf{m}$ and the masked speech token sequence $\mathbf{s}'$ are defined as&lt;/span&gt;:&lt;br /&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;(Eq. 1)&lt;/b&gt;&lt;/span&gt; $\mathbf{s}'=\mathbf{s}_{1:a_{n}}\odot\mathbf{m}$&lt;br /&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;(Eq. 2)&lt;/b&gt;&lt;/span&gt; $\mathbf{m}=[m_{i}]_{i=1}^{a_{n}},\,\,\mathbf{m}_{1:a_{n-1}}=0,\,\, \mathbf{m}_{a_{n-1}:a_{n}}=1$&lt;br /&gt;- $s_{i}$ is replaced by the special $\text{&amp;lt;MASK&amp;gt;}$ token when $m_{i}=1$ and kept unchanged when $m_{i}=0$&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;color: #006dd7;&quot;&gt;The input sequence is then built from the truncated text token sequence $\mathbf{y}'$, the masked speech token sequence $\mathbf{s}'$, and the duration tokens $\mathbf{a}$&lt;/span&gt;:&lt;br /&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;(Eq. 3)&lt;/b&gt;&lt;/span&gt; $\mathbf{f}=[\mathbf{y}',E,D,\mathbf{s}'_{1:a_{1}},...,D,\mathbf{s}'_{a_{n-1}:a_{n}},D]$&lt;br /&gt;- $E$ : end-of-text token, $D$ : duration prediction placeholder&lt;br /&gt;- Here $D$ separates the segments of the masked speech token sequence $\mathbf{s}'$ that belong to different BPE tokens, with the segment boundaries given by the duration tokens $\mathbf{a}$&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
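&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;The construction of $\mathbf{f}$ in Eq. 1-3 can be sketched in Python as follows. This is only an illustration under stated assumptions: the strings used for the special tokens, the function name, and the toy data are hypothetical, and the mask is taken to cover exactly the frames of the $n$-th BPE token (frames after $a_{n-1}$ up to $a_{n}$).&lt;/span&gt;&lt;/p&gt;

```python
import random

# Hypothetical stand-ins for the special tokens of the paper.
MASK, E, D = "MASK", "E", "D"


def build_training_sequence(y, s, a, n, q):
    """Build f = [y', E, D, s'_1, ..., D, s'_n, D] for a 1-indexed n."""
    y_trunc = y[:min(len(y), n + q)]      # look ahead q text tokens
    bounds = [0] + a                      # a_0 = 0
    f = list(y_trunc) + [E]
    for k in range(1, n + 1):
        segment = s[bounds[k - 1]:bounds[k]]
        if k == n:                        # only the n-th segment is masked
            segment = [MASK] * len(segment)
        f += [D] + segment
    f.append(D)                           # placeholder for the next duration
    return f


# Toy data (assumption): 3 BPE tokens, 6 speech frames.
y = ["y1", "y2", "y3"]
s = [11, 12, 21, 22, 23, 31]
a = [2, 5, 6]

n = random.randint(1, len(y))             # random truncation point
print(n, build_training_sequence(y, s, a, n, q=1))
```

&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;With $n=2$ and $q=1$ on this toy input, the first segment stays visible, the second segment becomes three mask tokens, and the sequence ends with a placeholder for the next duration.&lt;/span&gt;&lt;/p&gt;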
&lt;h4 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;- Loss Function&lt;/b&gt;&lt;/span&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;The sequence $\mathbf{f}$ is the input to TMT, which is trained with mask prediction and duration prediction as its objectives&lt;/span&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #ee2323;&quot;&gt;The sequence $\mathbf{f}$ is fed to TMT to obtain hidden states, which two linear layers then use to predict the speech tokens corresponding to text token $y_{n}$ and the duration of the next text token $y_{n+1}$&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;This allows duration prediction and mask prediction to be merged into a single decoding step at inference, except for the duration prediction of the first text token&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #006dd7;&quot;&gt;Consequently, the paper minimizes the following negative log-likelihoods for masked generative training and duration training:&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;(Eq. 4)&lt;/b&gt;&lt;/span&gt; $\mathcal{L}_{mask}=-\log p(\mathbf{s}_{a_{n-1}:a_{n}}|\mathbf{f};\theta)$&lt;br /&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;(Eq. 5)&lt;/b&gt;&lt;/span&gt; $\mathcal{L}_{duration}=-\log p(l_{n+1}|\mathbf{f};\theta)$&lt;br /&gt;- $\theta$ : neural network parameters of TMT, $l_{n+1}=a_{n+1}-a_{n}$, $a_{0}=0$&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
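&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;As a small numeric illustration of Eq. 4 and Eq. 5, the sketch below computes the duration target $l_{n+1}$ and both negative log-likelihoods from toy probabilities. It assumes the masked tokens are predicted independently given $\mathbf{f}$, so that the joint log-probability is a sum over positions; the source does not spell out this factorization, and all function names and numbers are hypothetical.&lt;/span&gt;&lt;/p&gt;

```python
import math


def duration_target(a, n):
    """l_{n+1} = a_{n+1} - a_n with a_0 = 0 (n = 0 gives l_1)."""
    bounds = [0] + a
    return bounds[n + 1] - bounds[n]


def mask_loss(target_token_probs):
    """Eq. 4, assuming independent per-position predictions given f."""
    return -sum(math.log(p) for p in target_token_probs)


def duration_loss(target_duration_prob):
    """Eq. 5 for the probability assigned to the true l_{n+1}."""
    return -math.log(target_duration_prob)


a = [2, 5, 6]                        # toy duration tokens (assumption)
print(duration_target(a, 1))         # frames of the 2nd BPE token
print(mask_loss([0.5, 0.25]))        # toy probabilities of the true tokens
print(duration_loss(0.5))
```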
&lt;h4 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;- Hybrid Attention Mask&lt;/b&gt;&lt;/span&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #ee2323;&quot;&gt;TMT uses a hybrid attention mask that combines causal and bidirectional patterns&lt;/span&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #006dd7;&quot;&gt;Causal attention is applied to the input text tokens and special tokens, while bidirectional attention is applied to the mask and speech tokens, which attend to all preceding tokens and to every mask or speech token aligned to the same text token&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;This lets each speech token perceive the total duration of its text token&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
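&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;One way to picture the hybrid attention mask is the Python sketch below. It is an illustration rather than the paper's implementation: the segment-id encoding is a hypothetical convention, and the duration placeholders $D$ are assumed to be purely causal special tokens.&lt;/span&gt;&lt;/p&gt;

```python
def hybrid_attention_mask(segment_ids):
    """Return allowed[i][j], True when position i may attend to position j.

    segment_ids[i] is None for text and special tokens (causal only) and
    the index of the aligned text token for mask and speech tokens.
    """
    size = len(segment_ids)
    allowed = [[False] * size for _ in range(size)]
    for i in range(size):
        for j in range(i + 1):            # causal part: all preceding tokens
            allowed[i][j] = True
        if segment_ids[i] is None:
            continue
        for j in range(size):             # bidirectional inside one segment
            if segment_ids[j] == segment_ids[i]:
                allowed[i][j] = True
    return allowed


# Toy layout (assumption): f = [y1, y2, E, D, s, s, D, M, M, M, D]
segment_ids = [None, None, None, None, 1, 1, None, 2, 2, 2, None]
mask = hybrid_attention_mask(segment_ids)
for row in mask:
    print("".join("1" if ok else "0" for ok in row))
```

&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;In the printed matrix, the rows of the three mask tokens share the same visible span, so each of them can see how long its segment is.&lt;/span&gt;&lt;/p&gt;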
&lt;h4 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;- Inference&lt;/b&gt;&lt;/span&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #ee2323;&quot;&gt;SyncSpeech processes text in a streaming fashion&lt;/span&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #006dd7;&quot;&gt;Once the number of input BPE text tokens $\mathbf{y}$ exceeds the look-ahead threshold $q$, the input sequence $\mathbf{f}=[\mathbf{y},D]$ is built and fed to TMT to predict the duration of $y_{1}$&lt;/span&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Based on the predicted duration, the sequence is padded by inserting mask tokens and a duration prediction placeholder&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;The updated sequence is fed to TMT again to predict the speech tokens for $y_{1}$ and the duration of $y_{2}$, after which the input sequence $\mathbf{f}$ is updated and padded further&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #006dd7;&quot;&gt;Thereafter, each time a new text token is received, the prediction, update, and padding steps are repeated, which enables synchronous generation of speech tokens for each text token&lt;/span&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Once the number of generated speech tokens exceeds the chunk size of the speech decoder, those tokens and the speaker prompt can be used to synthesize the speech waveform&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Additionally, separate positional embeddings are used for text and speech tokens&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
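&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;The predict, update, and pad loop described above can be sketched as a Python generator. Everything model-related here is a hypothetical stand-in: ToyModel, first_duration, and decode are not part of SyncSpeech, and the sketch assumes that remaining text tokens are flushed without look-ahead once the text stream ends. The chunk-wise speech decoder is left out.&lt;/span&gt;&lt;/p&gt;

```python
MASK, D = "MASK", "D"   # hypothetical special-token symbols


class ToyModel:
    """Hypothetical stand-in for TMT with fixed, made-up predictions."""

    def first_duration(self, f):
        return 2                              # duration of y_1 from [y, D]

    def decode(self, f):
        tokens = [7] * f.count(MASK)          # fill every mask position
        return tokens, 2                      # plus the next token's duration


def stream_synthesize(text_stream, model, q=1):
    """Yield (text_token, speech_tokens) pairs while text is still arriving."""
    text, speech_part = [], []
    done, duration = 0, None

    def step():
        nonlocal done, duration
        if duration is None:                  # extra pass for the first token only
            duration = model.first_duration(text + [D])
        padded = text + speech_part + [D] + [MASK] * duration + [D]
        tokens, next_duration = model.decode(padded)
        speech_part.extend([D] + tokens)      # update the sequence
        result = (text[done], tokens)
        done += 1
        duration = next_duration
        return result

    for token in text_stream:
        text.append(token)
        # speak a token only once q look-ahead tokens have arrived
        for _ in range(max(0, len(text) - q - done)):
            yield step()
    for _ in range(len(text) - done):         # text ended: flush the rest
        yield step()


outputs = list(stream_synthesize(iter(["a", "b", "c"]), ToyModel(), q=1))
print(outputs)
```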
&lt;h4 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;- High-Probability Masked Pre-Training&lt;/b&gt;&lt;/span&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Training SyncSpeech from scratch with the strategy above is inefficient, because at each step the gradient is backpropagated for only a single text token&lt;/span&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #ee2323;&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;color: #ee2323;&quot;&gt;To address this, the paper introduces High-Probability Masked Pre-Training&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;First, let the masked speech tokens be $\hat{\mathbf{s}}=\mathbf{s}\odot \hat{\mathbf{m}}$&lt;br /&gt;- $\hat{\mathbf{m}}=[\hat{m}_{i}]_{i=1}^{a_{L}}$ : binary mask over the speech tokens&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #006dd7;&quot;&gt;The masking rule aims to ensure both a high masking probability and consistency with the inference process&lt;/span&gt;&lt;br /&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;For the text-token binary mask $\hat{\mathbf{m}}_{bpe}$, the first value is sampled from a Bernoulli distribution with $p=0.5$, and each subsequent value must differ from the adjacent one before it, so the mask alternates&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Based on the duration tokens $\mathbf{a}$, the text-token mask $\hat{\mathbf{m}}_{bpe}$ is converted into the speech-token mask $\hat{\mathbf{m}}$&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #006dd7;&quot;&gt;The input sequence is then built as follows, and TMT is optimized to minimize the negative log-likelihoods of masked generative training and duration training:&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;(Eq. 6)&lt;/b&gt;&lt;/span&gt; $\hat{\mathbf{f}}=[\mathbf{y},E,D,\hat{\mathbf{s}}_{1:a_{1}},...,D,\hat{\mathbf{s}}_{a_{L-1}:a_{L}}]$&lt;/span&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #006dd7;&quot;&gt;In summary, the paper first performs high-probability masked pre-training to align text and speech tokens, and then fine-tunes the model with the training strategy that is consistent with the prediction process&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
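&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;The masking rule above can be sketched in a few lines of Python. The function names and toy durations are hypothetical; the sketch only shows how an alternating text-token mask is drawn and then expanded to speech frames with the duration tokens.&lt;/span&gt;&lt;/p&gt;

```python
import random


def sample_bpe_mask(num_text_tokens, rng=random):
    """First value is a fair coin flip; every later value flips the previous one."""
    first = rng.randint(0, 1)                 # Bernoulli with p = 0.5
    return [(first + k) % 2 for k in range(num_text_tokens)]


def expand_to_speech_mask(bpe_mask, a):
    """Repeat each text-token mask bit over the frames of that token."""
    bounds = [0] + a                          # a_0 = 0
    speech_mask = []
    for k, bit in enumerate(bpe_mask):
        speech_mask += [bit] * (bounds[k + 1] - bounds[k])
    return speech_mask


a = [2, 5, 6]                                 # toy duration tokens (assumption)
bpe_mask = sample_bpe_mask(len(a), random.Random(0))
print(bpe_mask)
print(expand_to_speech_mask(bpe_mask, a))
print(expand_to_speech_mask([1, 0, 1], a))    # [1, 1, 0, 0, 0, 1]
```

&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Because the mask alternates, about half of the text tokens are masked in every training step instead of only one.&lt;/span&gt;&lt;/p&gt;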
&lt;h4 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;- Other Modules&lt;/b&gt;&lt;/span&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;The remaining modules of SyncSpeech are built on &lt;a href=&quot;https://randomsampling.tistory.com/520&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;CosyVoice2&lt;/b&gt;&lt;/a&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=&quot;color: #000000;&quot;&gt;- The Supervised Semantic Speech (S3) tokenizer is used as the speech tokenizer, and a conditional flow matching decoder and a &lt;a href=&quot;https://randomsampling.tistory.com/51&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;HiFi-GAN&lt;/b&gt;&lt;/a&gt; vocoder are used to generate the waveform from chunk-sized semantic tokens&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;background-color: #9feec3; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;3. Experiments&lt;/b&gt;&lt;/span&gt;&lt;/h3&gt;
&lt;h4 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;- Settings&lt;/b&gt;&lt;/span&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Dataset : LibriTTS&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Comparisons : &lt;a href=&quot;https://randomsampling.tistory.com/394&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;CosyVoice&lt;/b&gt;&lt;/a&gt;, &lt;a href=&quot;https://randomsampling.tistory.com/520&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;CosyVoice2&lt;/b&gt;&lt;/a&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;- Results&lt;/b&gt;&lt;/span&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Overall, SyncSpeech achieves the best performance&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2026-05-19 080815.png&quot; data-origin-width=&quot;958&quot; data-origin-height=&quot;332&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/WOmOA/dJMcaiQSIGw/OIGGtFgzXMb63kopZqhl7k/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/WOmOA/dJMcaiQSIGw/OIGGtFgzXMb63kopZqhl7k/img.png&quot; data-alt=&quot;Model Performance Comparison&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/WOmOA/dJMcaiQSIGw/OIGGtFgzXMb63kopZqhl7k/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FWOmOA%2FdJMcaiQSIGw%2FOIGGtFgzXMb63kopZqhl7k%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;958&quot; height=&quot;332&quot; data-filename=&quot;스크린샷 2026-05-19 080815.png&quot; data-origin-width=&quot;958&quot; data-origin-height=&quot;332&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;Model Performance Comparison&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;Ablation Study&lt;/span&gt;&lt;/b&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;Each component contributes to the performance improvement&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2026-05-19 080907.png&quot; data-origin-width=&quot;538&quot; data-origin-height=&quot;137&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/buvyac/dJMcabK4zGw/9N7YfWCdaz7qu70z0QXMmk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/buvyac/dJMcabK4zGw/9N7YfWCdaz7qu70z0QXMmk/img.png&quot; data-alt=&quot;Ablation Study&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/buvyac/dJMcabK4zGw/9N7YfWCdaz7qu70z0QXMmk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fbuvyac%2FdJMcabK4zGw%2F9N7YfWCdaz7qu70z0QXMmk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;538&quot; height=&quot;137&quot; data-filename=&quot;스크린샷 2026-05-19 080907.png&quot; data-origin-width=&quot;538&quot; data-origin-height=&quot;137&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;Ablation Study&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;Analysis&lt;/span&gt;&lt;/b&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;For duration prediction, sampling with $\text{Top-k}=3$ gives the best results&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;For token prediction, greedy search is effective&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
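&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;The two decoding choices in this analysis can be illustrated with a generic Python sketch of top-k sampling and greedy search. This is a textbook formulation, not code from the paper; the function names and the toy distribution are hypothetical.&lt;/span&gt;&lt;/p&gt;

```python
import random


def top_k_sample(probs, k, rng=random):
    """Sample an index among the k most probable entries, weighted by probability."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    top = ranked[:k]
    weights = [probs[i] for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


def greedy(probs):
    """Pick the single most probable index."""
    return max(range(len(probs)), key=lambda i: probs[i])


# Toy distribution over candidate durations 0..4 (assumption).
duration_probs = [0.05, 0.40, 0.30, 0.20, 0.05]
rng = random.Random(0)
print(top_k_sample(duration_probs, k=3, rng=rng))   # one of 1, 2, 3
print(greedy(duration_probs))                       # 1
```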
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2026-05-19 081012.png&quot; data-origin-width=&quot;528&quot; data-origin-height=&quot;332&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/cXZlY2/dJMcahEqKWg/uA82HkvToNec5CHBq0HWb0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/cXZlY2/dJMcahEqKWg/uA82HkvToNec5CHBq0HWb0/img.png&quot; data-alt=&quot;Performance by Top-$k$ Threshold&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/cXZlY2/dJMcahEqKWg/uA82HkvToNec5CHBq0HWb0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FcXZlY2%2FdJMcahEqKWg%2FuA82HkvToNec5CHBq0HWb0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;528&quot; height=&quot;332&quot; data-filename=&quot;스크린샷 2026-05-19 081012.png&quot; data-origin-width=&quot;528&quot; data-origin-height=&quot;332&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;Performance by Top-$k$ Threshold&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;The lowest WER is obtained with look-ahead $q=1$&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2026-05-19 081128.png&quot; data-origin-width=&quot;522&quot; data-origin-height=&quot;168&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/nQSNd/dJMcajhXdMf/yKrvLqk0femUWOM5K2YOd0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/nQSNd/dJMcajhXdMf/yKrvLqk0femUWOM5K2YOd0/img.png&quot; data-alt=&quot;Performance by Look-Ahead&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/nQSNd/dJMcajhXdMf/yKrvLqk0femUWOM5K2YOd0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FnQSNd%2FdJMcajhXdMf%2FyKrvLqk0femUWOM5K2YOd0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;522&quot; height=&quot;168&quot; data-filename=&quot;스크린샷 2026-05-19 081128.png&quot; data-origin-width=&quot;522&quot; data-origin-height=&quot;168&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;Performance by Look-Ahead&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;</description>
      <category>Paper/TTS</category>
      <category>SyncSpeech</category>
      <category>text-to-speech</category>
      <category>TTS</category>
      <author>feVeRin</author>
      <guid isPermaLink="true">https://randomsampling.tistory.com/651</guid>
      <comments>https://randomsampling.tistory.com/651#entry651comment</comments>
      <pubDate>Tue, 19 May 2026 15:18:33 +0900</pubDate>
    </item>
    <item>
      <title>[Paper Review] VoCodec: An Efficient Lightweight Low-Bitrate Speech Codec</title>
      <link>https://randomsampling.tistory.com/650</link>
      <description>&lt;h2 style=&quot;text-align: justify;&quot; data-ke-size=&quot;size26&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;
&lt;script src=&quot;https://cdn.jsdelivr.net/npm/katex@0.16.22/dist/katex.min.js&quot;&gt;&lt;/script&gt;
&lt;script src=&quot;https://cdn.jsdelivr.net/npm/katex@0.16.22/dist/contrib/auto-render.min.js&quot;&gt;&lt;/script&gt;
&lt;script&gt;document.addEventListener(&quot;DOMContentLoaded&quot;, function() {  renderMathInElement(document.body, {    delimiters: [      {left: &quot;$$&quot;, right: &quot;$$&quot;, display: true},      {left: &quot;$&quot;, right: &quot;$&quot;, display: false}    ]  });});&lt;/script&gt;
&lt;/span&gt;&lt;/b&gt;&lt;/h2&gt;
&lt;h2 style=&quot;text-align: justify;&quot; data-ke-size=&quot;size26&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;VoCodec: An Efficient Lightweight Low-Bitrate Speech Codec&lt;/span&gt;&lt;/b&gt;&lt;/h2&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style6&quot; /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;A low-complexity, low-latency neural codec is needed&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;VoCodec&lt;/span&gt;&lt;/b&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Uses the &lt;a href=&quot;https://randomsampling.tistory.com/245&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;Vocos&lt;/b&gt;&lt;/a&gt; vocoder as the backbone to reduce complexity&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Cascades a lightweight neural network at the front end to add speech enhancement capability&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Paper (ICASSP 2026) : &lt;a href=&quot;https://arxiv.org/pdf/2601.13055&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;Paper Link&lt;/b&gt;&lt;/a&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style6&quot; /&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;background-color: #9feec3; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;1. Introduction&lt;/b&gt;&lt;/span&gt;&lt;/h3&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;A neural codec consists of encoder, decoder, and quantizer modules&lt;/span&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #006dd7;&quot;&gt;The encoder compresses speech into a latent representation, the decoder reconstructs the waveform from the quantized vectors, and the quantizer is trained end-to-end together with the encoder and decoder&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;In particular, neural codecs use discrete tokens for compression and reconstruction&lt;br /&gt;- Architecturally, they build on the VQ-GAN architecture and add discriminators to improve perceptual quality&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #ee2323; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;BUT, existing neural codecs are limited for real-time communication because of their high complexity and non-causality&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #ffc1c8; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;-&amp;gt; Therefore, the paper proposes VoCodec, a neural codec with low computational complexity&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #006dd7; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;VoCodec&lt;/span&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;Based on the &lt;a href=&quot;https://randomsampling.tistory.com/245&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;Vocos&lt;/b&gt;&lt;/a&gt; architecture, &lt;b&gt;the speech codec operates directly in the time-frequency domain&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;&lt;b&gt;A lightweight speech enhancement model is cascaded at the front end&lt;/b&gt; to perform dereverberation&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;&amp;lt; Overview of VoCodec &amp;gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;A low-complexity, low-latency neural codec based on the Vocos architecture&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;As a result, it achieves better performance than existing methods&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;background-color: #9feec3; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;2. Method&lt;/b&gt;&lt;/span&gt;&lt;/h3&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;- Generator&lt;/b&gt;&lt;/span&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;VoCodec operates in the time-frequency domain and adopts &lt;a href=&quot;https://randomsampling.tistory.com/245&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;Vocos&lt;/b&gt;&lt;/a&gt; as its encoder-decoder backbone&lt;/span&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;This is because speech has a highly pronounced harmonic structure in the time-frequency domain, and STFT/iSTFT can perform downsampling/upsampling in a single step&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #ee2323;&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;color: #ee2323;&quot;&gt;First, given a speech signal $x\in\mathbb{R}^{L}$, the STFT transforms it into a complex spectrum $X\in \mathbb{C}^{F\times T}$, where $F$ and $T$ denote the frequency and frame axes&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;color: #006dd7;&quot;&gt;Next, the logarithmic magnitude and phase of the complex spectrum $X$ are extracted and concatenated along the frequency axis to form the input feature $Z_{in}$&lt;/span&gt;:&lt;br /&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;(Eq. 1)&lt;/b&gt;&lt;/span&gt; $ Z_{in}=\text{Concat}\left(\log&amp;nbsp;(|X|),&amp;nbsp;\text{angle}\left(X_{i},X_{r}\right)\right)\in&amp;nbsp;\mathbb{R}^{2F\times&amp;nbsp;T}$&lt;br /&gt;- $X_{r},X_{i}$ : real/imaginary parts, $|\cdot|$ : norm of a complex value, $\text{Concat}(\cdot)$ : concatenation operation&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;To reduce complexity, $Z_{in}$ is projected into a low-dimensional space by a fully-connected layer&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;Following &lt;a href=&quot;https://randomsampling.tistory.com/450&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;WavTokenizer&lt;/b&gt;&lt;/a&gt;, the encoder consists of $M$ stacked ConvNeXt blocks and an attention module&lt;/span&gt;&lt;/span&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;Each ConvNeXt block applies a depthwise convolution followed by an inverted bottleneck, in which two pointwise convolutions project the features to a higher dimensionality and then back&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;To improve VoCodec's sequence modeling, the attention module incorporates $N$ basic ResNet blocks and adds a self-attention block&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;color: #006dd7;&quot;&gt;The quantizer uses a Residual Vector Quantizer (RVQ)&lt;/span&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=&quot;color: #000000;&quot;&gt;- In particular, following &lt;a href=&quot;https://randomsampling.tistory.com/258&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;DAC&lt;/b&gt;&lt;/a&gt;, factorized codes and $L2$-normalization are applied&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;color: #006dd7;&quot;&gt;The decoder mirrors the encoder; to reduce computational complexity, it removes the inverted design of ConvNeXt and applies group convolution in the ResNet blocks&lt;/span&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=&quot;color: #000000;&quot;&gt;- The network produces complex spectral coefficients, and speech is reconstructed through the iSTFT&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
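&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;As a rough illustration of Eq. 1, the input feature can be sketched in NumPy as below. This is a minimal sketch, not the paper's implementation: the Hann window, the FFT size of 512, the hop of 128, the 16 kHz test signal, and the small eps inside the logarithm are all assumptions made for the example.&lt;/span&gt;&lt;/p&gt;

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    """Hann-windowed STFT returning a complex array of shape (F, T)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * window for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1).T  # F = n_fft // 2 + 1

def codec_input_features(x, n_fft=512, hop=128, eps=1e-7):
    """Eq. 1: concatenate log-magnitude and phase along the frequency axis."""
    X = stft(x, n_fft, hop)
    log_mag = np.log(np.abs(X) + eps)    # eps is an assumption to avoid log(0)
    phase = np.arctan2(X.imag, X.real)   # angle(X_i, X_r)
    return np.concatenate([log_mag, phase], axis=0)  # shape (2F, T)

# Stand-in signal: 1 s of noise at an assumed 16 kHz sampling rate.
x = np.random.default_rng(0).standard_normal(16000)
Z_in = codec_input_features(x)
print(Z_in.shape)
```

&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;The first $F$ rows of the result hold the log-magnitude and the last $F$ rows hold the phase, so the fully-connected projection described above would act on a $2F$-dimensional vector per frame.&lt;/span&gt;&lt;/p&gt;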
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2026-05-18 081603.png&quot; data-origin-width=&quot;1571&quot; data-origin-height=&quot;300&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/ldhnY/dJMcabjYGE0/GX66QJriGB2cekUiWE9f61/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/ldhnY/dJMcabjYGE0/GX66QJriGB2cekUiWE9f61/img.png&quot; data-alt=&quot;Overview&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/ldhnY/dJMcabjYGE0/GX66QJriGB2cekUiWE9f61/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FldhnY%2FdJMcabjYGE0%2FGX66QJriGB2cekUiWE9f61%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1571&quot; height=&quot;300&quot; data-filename=&quot;스크린샷 2026-05-18 081603.png&quot; data-origin-width=&quot;1571&quot; data-origin-height=&quot;300&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;Overview&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;h4 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;- Discriminator&lt;/b&gt;&lt;/span&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #006dd7;&quot;&gt;Since VoCodec operates in the time-frequency domain, a multi-scale STFT discriminator can be applied&lt;/span&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;The window lengths are $[128, 256, 512, 1024, 2048]$, and the hop size is fixed to $\texttt{window length}/4$&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Multi-scale discriminators, multi-period discriminators, and the like are not used&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
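&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;To make the resolutions above concrete, the following sketch computes one complex spectrogram per window length with a hop of a quarter window. Only the window lengths and the hop rule come from the text; the Hann window, the 16 kHz test signal, and the helper names are assumptions, and the discriminator networks themselves are omitted.&lt;/span&gt;&lt;/p&gt;

```python
import numpy as np

# Window lengths quoted in the text; the hop size is window length / 4.
WINDOW_LENGTHS = [128, 256, 512, 1024, 2048]

def stft(x, win_length):
    """Hann-windowed complex STFT of shape (F, T) with hop = win_length // 4."""
    hop = win_length // 4
    window = np.hanning(win_length)
    n_frames = 1 + (len(x) - win_length) // hop
    frames = np.stack([x[i * hop : i * hop + win_length] * window for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1).T

def multi_scale_stft(x):
    """One spectrogram per resolution; each would feed one sub-discriminator."""
    return [stft(x, w) for w in WINDOW_LENGTHS]

# Stand-in signal: 1 s of noise at an assumed 16 kHz sampling rate.
x = np.random.default_rng(0).standard_normal(16000)
specs = multi_scale_stft(x)
for w, spec in zip(WINDOW_LENGTHS, specs):
    print(w, w // 4, spec.shape)
```

&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Short windows give many frames with coarse frequency resolution, and long windows give few frames with fine frequency resolution, which is the trade-off a multi-scale design covers.&lt;/span&gt;&lt;/p&gt;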
&lt;h4 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;- Combined Enhancement and Compression&lt;/b&gt;&lt;/span&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #ee2323;&quot;&gt;A lightweight speech enhancement model based on time-frequency domain masking can be integrated into the codec's front end to reduce noise interference and reverberation&lt;/span&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;To this end, the paper cascades the UL-UNAS model with VoCodec&lt;br /&gt;- &lt;span style=&quot;color: #006dd7;&quot;&gt;The speech signal $x$ first passes through UL-UNAS to produce the enhanced spectrum $X_{enh}$, which is then preprocessed and fed to VoCodec&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Each model is first trained independently; VoCodec is then fine-tuned with the UL-UNAS parameters fixed&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;- Loss Functions&lt;/b&gt;&lt;/span&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;First, the loss function for training UL-UNAS&lt;/span&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #006dd7;&quot;&gt;It consists of the negative Scale-Invariant SNR (SI-SNR) loss $\mathcal{L}_{SI\text{-}SNR}$ and the power-compressed spectrum losses $\mathcal{L}_{mag}$ and $\mathcal{L}_{real/imag}$&lt;/span&gt;:&lt;br /&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;(Eq. 2)&lt;/b&gt;&lt;/span&gt; $ \mathcal{L}_{SI\text{-}SNR}(\hat{x},x)=-\log_{10}\left(&amp;nbsp;\frac{||\hat{x}_{t}||_{2}^{2}}{||\hat{x}-\hat{x}_{t}||_{2}^{2}}\right);&amp;nbsp;\,\,\,&amp;nbsp;\hat{x}_{t}=\frac{\langle&amp;nbsp;\hat{x},x\rangle&amp;nbsp;x}{||x||_{2}^{2}}$&lt;br /&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;(Eq. 3)&lt;/b&gt;&lt;/span&gt; $\mathcal{L}_{mag}(\hat{X},X)=\left|\left| |\hat{X}|^{0.3}-|X|^{0.3}\right|\right|_{2}^{2}$&lt;br /&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;(Eq. 4)&lt;/b&gt;&lt;/span&gt; $\mathcal{L}_{real/imag}(\hat{X},X)=\left|\left| \frac{\hat{X}_{r/i}}{|\hat{X}|^{0.7}} -\frac{X_{r/i}}{|X|^{0.7}}\right|\right|_{2}^{2}$&lt;br /&gt;- $x,\hat{x}$ : clean/enhanced speech, $X,\hat{X}$ : clean/enhanced spectrogram&lt;br /&gt;- $r, i$ : real/imaginary part of the spectrogram, $\langle\cdot,\cdot\rangle$ : inner product operator&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #006dd7;&quot;&gt;The overall loss function $\mathcal{L}_{sc}$ is then&lt;/span&gt;:&lt;br /&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;(Eq. 5)&lt;/b&gt;&lt;/span&gt; $\mathcal{L}_{sc}=\lambda_{1}\mathcal{L}_{SI\text{-}SNR}(\hat{x},x)+\lambda_{2}\mathcal{L}_{mag}(\hat{X},X)+\lambda_{3}\left(\mathcal{L}_{real}(\hat{X},X)+\mathcal{L}_{imag}(\hat{X},X)\right)$&lt;br /&gt;- $\lambda_{1},\lambda_{2},\lambda_{3}$ : weight&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #ee2323;&quot;&gt;For VoCodec, the generator loss $\mathcal{L}_{generator}$ is composed as follows&lt;/span&gt;&lt;/span&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #006dd7;&quot;&gt;A multi-scale mel-spectrogram loss as the reconstruction loss $\mathcal{L}_{rec}$&lt;/span&gt;:&lt;br /&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;(Eq. 6)&lt;/b&gt;&lt;/span&gt; $\mathcal{L}_{rec}=\left|\left| \log\left( \mathcal{M}(x)\right)-\log \left(\mathcal{M}(\hat{x})\right)\right|\right|_{1}$&lt;br /&gt;- $x,\hat{x}$ : target/reconstructed speech, $\mathcal{M}(\cdot)$ : mel-spectrogram transform&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #006dd7;&quot;&gt;Adversarial loss $\mathcal{L}_{g}$&lt;span style=&quot;color: #000000;&quot;&gt;:&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;(Eq. 7)&lt;/b&gt;&lt;/span&gt; $\mathcal{L}_{g}=||1-D(\hat{x})||_{2}^{2}$&lt;br /&gt;- $D(\cdot)$&amp;nbsp;:&amp;nbsp;discriminator&amp;nbsp;output&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #006dd7;&quot;&gt;Feature matching loss $\mathcal{L}_{feat}$&lt;/span&gt;:&lt;br /&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;(Eq. 8)&lt;/b&gt;&lt;/span&gt; $\mathcal{L}_{feat}=2\sum_{l}\left|\left|D^{l}(x)-D^{l}(\hat{x})\right|\right|_{1}$&lt;br /&gt;- $D^{l}(\cdot)$ : feature map of the $l$-th discriminator layer&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #006dd7;&quot;&gt;Finally, the final generator loss $\mathcal{L}_{generator}$, including the codebook loss $\mathcal{L}_{code}$ and the commitment loss $\mathcal{L}_{c}$, is&lt;/span&gt;:&lt;br /&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;(Eq. 9)&lt;/b&gt;&lt;/span&gt; $\mathcal{L}_{generator}=\lambda_{rec}\mathcal{L}_{rec}+\lambda_{g}\mathcal{L}_{g} +\lambda_{feat}\mathcal{L}_{feat} + \lambda_{code}\underset{\mathcal{L}_{code}}{\underbrace{\left|\left| \text{sg}[\mathbf{z}_{e}]-\mathbf{e}_{k}\right|\right|_{2}^{2}}}+\lambda_{c}\underset{\mathcal{L}_{c}}{\underbrace{\left|\left| \mathbf{z}_{e}-\text{sg}[\mathbf{e}_{k}]\right|\right|_{2}^{2}}}$&lt;br /&gt;- $\text{sg}[\cdot]$ : stop-gradient operation, $\mathbf{z}_{e}$ : encoder output, $\mathbf{e}_{k}$ : codebook vector&lt;br /&gt;- $\lambda_{rec},\lambda_{g},\lambda_{feat}, \lambda_{code},\lambda_{c}$ : weight&lt;/span&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #006dd7;&quot;&gt;The discriminator is trained separately with the adversarial loss $\mathcal{L}_{d}$:&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;(Eq. 10)&lt;/b&gt;&lt;/span&gt; $\mathcal{L}_{d}=||1-D(x)||_{2}^{2}+||D(\hat{x})||_{2}^{2}$&lt;br /&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;During training, the mel-spectrograms are computed with multiple window lengths of $[32, 64, 128, 256, 512, 1024, 2048]$, and the hop size is fixed to $\texttt{window length}/4$&lt;br /&gt;- The mel bin sizes are $[5, 10, 20, 40, 80, 160, 320]$&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
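&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;The enhancement losses (Eq. 2-5) and the least-squares adversarial losses (Eq. 7, 10) can be sketched in NumPy as follows. This is a hedged illustration rather than the authors' code: the eps terms, the unit default weights, the sum reduction, and the use of a plain FFT as a stand-in spectrogram are assumptions, and the function names are hypothetical.&lt;/span&gt;&lt;/p&gt;

```python
import numpy as np

def si_snr_loss(x_hat, x, eps=1e-8):
    """Eq. 2 as written above: -log10(||x_t||^2 / ||x_hat - x_t||^2)."""
    x_t = (np.dot(x_hat, x) / (np.dot(x, x) + eps)) * x  # projection of x_hat onto x
    noise = x_hat - x_t
    return -np.log10((np.sum(x_t ** 2) + eps) / (np.sum(noise ** 2) + eps))

def mag_loss(X_hat, X):
    """Eq. 3: squared L2 distance between 0.3-power-compressed magnitudes."""
    return np.sum((np.abs(X_hat) ** 0.3 - np.abs(X) ** 0.3) ** 2)

def real_imag_loss(X_hat, X, eps=1e-8):
    """Eq. 4, real and imaginary terms summed (the bracket in Eq. 5)."""
    c_hat = X_hat / (np.abs(X_hat) ** 0.7 + eps)
    c = X / (np.abs(X) ** 0.7 + eps)
    return np.sum((c_hat.real - c.real) ** 2) + np.sum((c_hat.imag - c.imag) ** 2)

def enhancement_loss(x_hat, x, X_hat, X, lambdas=(1.0, 1.0, 1.0)):
    """Eq. 5 with assumed unit weights (the actual weights are not given here)."""
    l1, l2, l3 = lambdas
    return (l1 * si_snr_loss(x_hat, x)
            + l2 * mag_loss(X_hat, X)
            + l3 * real_imag_loss(X_hat, X))

def generator_adv_loss(d_fake):
    """Eq. 7: push discriminator outputs on reconstructed speech toward 1."""
    return np.sum((1.0 - d_fake) ** 2)

def discriminator_loss(d_real, d_fake):
    """Eq. 10: real outputs toward 1, reconstructed outputs toward 0."""
    return np.sum((1.0 - d_real) ** 2) + np.sum(d_fake ** 2)

rng = np.random.default_rng(0)
x = rng.standard_normal(1024)                  # stand-in clean speech
x_hat = x + 0.1 * rng.standard_normal(1024)    # stand-in enhanced speech
X, X_hat = np.fft.rfft(x), np.fft.rfft(x_hat)  # stand-in spectra
print(enhancement_loss(x_hat, x, X_hat, X))
```

&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Because $\hat{x}_{t}$ is the projection of $\hat{x}$ onto $x$, rescaling $\hat{x}$ leaves the SI-SNR term unchanged, which is what makes it scale-invariant.&lt;/span&gt;&lt;/p&gt;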
&lt;h3 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;background-color: #9feec3; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;3. Experiments&lt;/b&gt;&lt;/span&gt;&lt;/h3&gt;
&lt;h4 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;- Settings&lt;/b&gt;&lt;/span&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Dataset : LRAC&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Comparisons : &lt;a href=&quot;https://randomsampling.tistory.com/450&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;WavTokenizer&lt;/b&gt;&lt;/a&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;- Results&lt;/b&gt;&lt;/span&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Overall, VoCodec achieves the best performance&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;그림01.png&quot; data-origin-width=&quot;1028&quot; data-origin-height=&quot;238&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bSrPDj/dJMcaffB51x/AvGYk7tZnNn5HmH70sdXiK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bSrPDj/dJMcaffB51x/AvGYk7tZnNn5HmH70sdXiK/img.png&quot; data-alt=&quot;Model performance comparison&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bSrPDj/dJMcaffB51x/AvGYk7tZnNn5HmH70sdXiK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbSrPDj%2FdJMcaffB51x%2FAvGYk7tZnNn5HmH70sdXiK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1028&quot; height=&quot;238&quot; data-filename=&quot;그림01.png&quot; data-origin-width=&quot;1028&quot; data-origin-height=&quot;238&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;Model performance comparison&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;It also performs well in the subjective evaluation&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;그림02.png&quot; data-origin-width=&quot;1033&quot; data-origin-height=&quot;136&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/Al4sC/dJMcadvdpjw/RDUkrUXU3tiPEuL78vcUy0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/Al4sC/dJMcadvdpjw/RDUkrUXU3tiPEuL78vcUy0/img.png&quot; data-alt=&quot;Subjective Evaluation&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/Al4sC/dJMcadvdpjw/RDUkrUXU3tiPEuL78vcUy0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FAl4sC%2FdJMcadvdpjw%2FRDUkrUXU3tiPEuL78vcUy0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1033&quot; height=&quot;136&quot; data-filename=&quot;그림02.png&quot; data-origin-width=&quot;1033&quot; data-origin-height=&quot;136&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;Subjective Evaluation&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;</description>
      <category>Paper/Neural Codec</category>
      <category>Neural Codec</category>
      <category>VoCodec</category>
      <author>feVeRin</author>
      <guid isPermaLink="true">https://randomsampling.tistory.com/650</guid>
      <comments>https://randomsampling.tistory.com/650#entry650comment</comments>
      <pubDate>Mon, 18 May 2026 12:54:58 +0900</pubDate>
    </item>
    <item>
      <title>[Paper Review] Int-MeanFlow: Few-Step Speech Generation with Integral Velocity Distillation</title>
      <link>https://randomsampling.tistory.com/649</link>
      <description>&lt;h2 style=&quot;text-align: justify;&quot; data-ke-size=&quot;size26&quot;&gt;&lt;b&gt;
&lt;script src=&quot;https://cdn.jsdelivr.net/npm/katex@0.16.22/dist/katex.min.js&quot;&gt;&lt;/script&gt;
&lt;script src=&quot;https://cdn.jsdelivr.net/npm/katex@0.16.22/dist/contrib/auto-render.min.js&quot;&gt;&lt;/script&gt;
&lt;script&gt;document.addEventListener(&quot;DOMContentLoaded&quot;, function() {  renderMathInElement(document.body, {    delimiters: [      {left: &quot;$$&quot;, right: &quot;$$&quot;, display: true},      {left: &quot;$&quot;, right: &quot;$&quot;, display: false}    ]  });});&lt;/script&gt;
&lt;/b&gt;&lt;/h2&gt;
&lt;h2 style=&quot;text-align: justify;&quot; data-ke-size=&quot;size26&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Int-MeanFlow: Few-Step Speech Generation with Integral Velocity Distillation&lt;/span&gt;&lt;/b&gt;&lt;/h2&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style6&quot; /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Flow-based models are limited in inference speed due to iterative sampling&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Int-MeanFlow&lt;/span&gt;&lt;/b&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Approximates the average velocity from the teacher's instantaneous velocity over a temporal interval&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Additionally introduces Optimal Step Sampling Search to identify model-specific optimal sampling steps&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Paper (ICASSP 2026) : &lt;a href=&quot;https://arxiv.org/pdf/2510.07979&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;Paper Link&lt;/b&gt;&lt;/a&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style6&quot; /&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;background-color: #9feec3; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;1. Introduction&lt;/b&gt;&lt;/span&gt;&lt;/h3&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;In Text-to-Speech (TTS), flow-based models are limited in inference speed due to iterative sampling&lt;/span&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #006dd7; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;MeanFlow can reduce the Number of Function Evaluations (NFE) while improving sampling quality&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;&lt;span style=&quot;color: #ee2323;&quot;&gt;BUT, applying MeanFlow to TTS requires the following considerations:&lt;/span&gt;&lt;/span&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;MeanFlow's training process relies on a self-bootstrap mechanism and requires mixing in instantaneous velocity guidance, similar to flow matching&lt;br /&gt;- In particular, the guidance strength strongly affects model performance&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;MeanFlow uses Jacobian-Vector Products, which consume substantial GPU memory&lt;br /&gt;- That is, memory limits make it difficult to train large-scale TTS models&lt;/span&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #ffc1c8; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;-&amp;gt; Hence, the paper proposes Int-MeanFlow, which addresses the limitations of MeanFlow for the TTS task&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #006dd7; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;Int-MeanFlow&lt;/span&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Based on the MeanFlow framework, &lt;b&gt;it learns the averaged velocity instead of the instantaneous velocity and introduces Optimal Step Sampling Search (OS3) to identify model-specific near-optimal sampling steps&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Additionally, it &lt;b&gt;designs an initialization strategy&lt;/b&gt; that can be applied to pre-trained flow matching models&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;&amp;lt; Overview of Int-MeanFlow &amp;gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;A MeanFlow-based TTS framework built on averaged velocity learning and the OS3 algorithm&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;As a result, it achieves better performance than existing methods&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;background-color: #9feec3; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;2. Method&lt;/b&gt;&lt;/span&gt;&lt;/h3&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;- Int-MeanFlow: MeanFlow Distillation via Integral Velocity&lt;/b&gt;&lt;/span&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #ee2323;&quot;&gt;Int-MeanFlow aims to learn the averaged velocity over a time interval instead of the instantaneous velocity at an individual time step&lt;/span&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;To capture fine-grained detail during training while retaining MeanFlow's coarse-to-fine nature, smaller intervals are emphasized and broader temporal dynamics are learned gradually&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #006dd7;&quot;&gt;In the distillation process, the student model is guided by a flow matching teacher model&lt;/span&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;The teacher model transforms the initial distribution $p_{0}$ into the target distribution $p_{1}$ through a time-dependent vector field $v(z_{t},t;\theta)$&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #006dd7;&quot;&gt;The evolution of the state $z_{t}$ is governed by an Ordinary Differential Equation (ODE):&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;(Eq. 1)&lt;/b&gt;&lt;/span&gt; $ \frac{d}{dt}z_{t}=v(z_{t},t;\theta),\,\,z_{0}\sim&amp;nbsp;p_{0},&amp;nbsp;\,\,&amp;nbsp;z_{1}\sim&amp;nbsp;p_{1},\,\,&amp;nbsp;t\in[0,1]$&lt;br /&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #006dd7;&quot;&gt;The teacher's loss function is then&lt;/span&gt;:&lt;br /&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;(Eq. 2)&lt;/b&gt;&lt;/span&gt; $\mathcal{L}_{CFM}=\mathbb{E}_{t,p_{0}(z_{0}),q(z_{1})}\left[\left|\left| v(z_{t},t;\theta)-(z_{1}-z_{0})\right|\right|^{2}\right]$&lt;/span&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #006dd7;&quot;&gt;While the teacher learns to model the instantaneous velocity $v(z_{t},t;\theta)$, the student model learns the averaged velocity over the time interval $[t,r]$&lt;/span&gt;: &lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;(Eq. 3)&lt;/b&gt;&lt;/span&gt; $\bar{v}(z_{t},t,r)=\frac{z_{r}-z_{t}}{r-t}$&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;- $z_{t},z_{r}$ : states at times $t, r$; $z_{r}$ is computed iteratively at inference&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;The student is trained to approximate the averaged velocity using the instantaneous velocity of the teacher&lt;/span&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;To this end, the paper performs iterative sampling during distillation&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;First, the interval $[t,r]$ is discretized into $n$ sub-intervals with time steps $t_{0}=t,t_{1},...,t_{n}=r$&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #006dd7;&quot;&gt;At each step, the teacher evolves the state following a discrete approximation of the ODE:&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;(Eq. 4)&lt;/b&gt;&lt;/span&gt; $z_{t_{k+1}}=z_{t_{k}}+(t_{k+1}-t_{k})\cdot v(z_{t_{k}},t_{k};\theta)$&lt;br /&gt;- $t_{0}=t, t_{n}=r$, $t_{1},t_{2},...,t_{n-1}$ : intermediate time steps&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #006dd7;&quot;&gt;The total displacement over the interval $[t,r]$ is&lt;/span&gt;:&lt;br /&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;(Eq. 5)&lt;/b&gt;&lt;/span&gt; $\Delta z^{teacher}=\sum_{k=0}^{n-1}(z_{t_{k+1}}-z_{t_{k}}) =\sum_{k=0}^{n-1}(t_{k+1}-t_{k})\cdot v(z_{t_{k}},t_{k};\theta)$&lt;/span&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;This discrete displacement approximates the integral of the instantaneous velocity $v(z_{t},t;\theta)$ over $[t,r]$, which is the continuous process that the student models&lt;/span&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #006dd7;&quot;&gt;To approximate the averaged velocity, the displacement is normalized by the interval length:&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;(Eq. 6)&lt;/b&gt;&lt;/span&gt; $\bar{v}_{teacher}(z_{t},t,r)=\frac{\Delta z^{teacher}}{r-t}$&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;color: #006dd7;&quot;&gt;The continuous form of the averaged velocity is&lt;/span&gt;:&lt;br /&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;(Eq. 7)&lt;/b&gt;&lt;/span&gt; $\bar{v}(z_{t},t,r)=\frac{1}{r-t}\int_{t}^{r}v(z_{\tau},\tau;\theta)d\tau$&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;color: #006dd7;&quot;&gt;As a result, the discrete displacement of the teacher serves as a numerical approximation of this integral, and the student model is trained to minimize the distillation loss:&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;(Eq. 8)&lt;/b&gt;&lt;/span&gt; $\mathcal{L}_{distill}=\mathbb{E}_{t,r}\left[\left|\left| u_{student}(z_{t},t,r)-\bar{v}_{teacher}(z_{t},t,r)\right|\right|^{2}\right]$&lt;br /&gt;- $u_{student}(z_{t},t,r)$ : velocity predicted by the student model, $\bar{v}_{teacher}(z_{t},t,r)$ : target velocity from the teacher&lt;br /&gt;- The student model predicts the averaged velocity following the teacher guidance, and approximates the integral of the instantaneous velocity through iterative sampling&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
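&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;The teacher rollout in (Eq. 4)-(Eq. 6) and the per-sample loss in (Eq. 8) can be sketched in plain Python. This is only a minimal illustration, not the implementation from the paper: the uniform time grid, the function names, and the toy velocity field $v(z,t)=2t$ that stands in for a trained teacher are all assumptions.&lt;/span&gt;&lt;/p&gt;

```python
from typing import Callable, List

Vector = List[float]
VelocityFn = Callable[[Vector, float], Vector]


def teacher_average_velocity(v: VelocityFn, z_t: Vector, t: float, r: float, n: int) -> Vector:
    """Euler-integrate the teacher over [t, r] in n sub-intervals (Eq. 4),
    then divide the total displacement (Eq. 5) by the interval length (Eq. 6)."""
    if not (n >= 1) or r == t:
        raise ValueError("need n >= 1 and r != t")
    # Uniform grid t_0 = t, ..., t_n = r (uniform spacing is an assumption).
    times = [t + (r - t) * k / n for k in range(n + 1)]
    z = list(z_t)
    for k in range(n):
        dt = times[k + 1] - times[k]
        vel = v(z, times[k])
        z = [zi + dt * vi for zi, vi in zip(z, vel)]    # Eq. 4
    displacement = [zr - z0 for zr, z0 in zip(z, z_t)]  # Eq. 5 (telescoped sum)
    return [d / (r - t) for d in displacement]          # Eq. 6


def distill_loss(u_student: Vector, v_bar_teacher: Vector) -> float:
    """Eq. 8 for a single sample: squared L2 distance."""
    return sum((u - v) ** 2 for u, v in zip(u_student, v_bar_teacher))


def toy_teacher(z: Vector, t: float) -> Vector:
    # Hypothetical stand-in for a trained teacher: v(z, t) = 2t in every dimension.
    return [2.0 * t for _ in z]


# The exact averaged velocity of v = 2t over [0, 1] is 1; the Euler target gets close.
target = teacher_average_velocity(toy_teacher, [0.0, 1.0], t=0.0, r=1.0, n=1000)
print(target)
print(distill_loss([1.0, 1.0], target))
```

&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;A larger $n$ gives a tighter approximation of the integral in (Eq. 7) at the cost of more teacher evaluations per training sample.&lt;/span&gt;&lt;/p&gt;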
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2026-05-15 081954.png&quot; data-origin-width=&quot;652&quot; data-origin-height=&quot;270&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/v2pCA/dJMcahdmFfe/rVNI3FWKyWKlb6QrtkUPH0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/v2pCA/dJMcahdmFfe/rVNI3FWKyWKlb6QrtkUPH0/img.png&quot; data-alt=&quot;Overview&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/v2pCA/dJMcahdmFfe/rVNI3FWKyWKlb6QrtkUPH0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fv2pCA%2FdJMcahdmFfe%2FrVNI3FWKyWKlb6QrtkUPH0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;652&quot; height=&quot;270&quot; data-filename=&quot;스크린샷 2026-05-15 081954.png&quot; data-origin-width=&quot;652&quot; data-origin-height=&quot;270&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;Overview&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;h4 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;- Optimal Step Sampling Searching (OS3)&lt;/b&gt;&lt;/span&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Existing flow-based TTS uses a continuous function or a hard-coded discrete step schedule to meet the NFE requirement&lt;/span&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;In contrast, the paper optimizes the sampling steps to fit the inference process of the model&lt;br /&gt;- This is because speech quality exhibits near-convex behavior as a function of the sampling step position&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #ee2323;&quot;&gt;As a result, the Optimal Sampling Step Search (OS3) algorithm of the paper optimizes the distribution of a fixed number of sampling steps over the entire inference interval $[0,1]$&lt;/span&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #006dd7;&quot;&gt;OS3 uses ternary search to optimize the placement of each sampling step&lt;/span&gt;&lt;br /&gt;- That is, all sampling steps except one are fixed, and ternary search is applied to optimize the remaining step&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #006dd7;&quot;&gt;This process is repeated for each step and continues until there is no further improvement on the development set&lt;/span&gt;&lt;br /&gt;- In this way, OS3 can identify the optimal distribution of sampling steps&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;In &lt;b&gt;[Algorithm 1]&lt;/b&gt;, the metric function $\mathcal{L}$ computes a pre-defined metric on the generated audio, given the sampling step set $T$ and the development set&lt;br /&gt;- The paper adopts speaker similarity as the metric&lt;/span&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
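&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;The coordinate-wise ternary search described above can be sketched as follows. This is a hedged illustration rather than a reproduction of [Algorithm 1]: the function names, the tolerances, and the toy metric are assumptions. In the paper the metric is speaker similarity measured on audio generated for the development set, whereas here a hypothetical unimodal score with a known optimum is used so that the search can be checked.&lt;/span&gt;&lt;/p&gt;

```python
from typing import Callable, List

Metric = Callable[[List[float]], float]


def ternary_search_max(f: Callable[[float], float], lo: float, hi: float, tol: float = 1e-4) -> float:
    """Maximize a unimodal 1-D function on [lo, hi]."""
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(m2) > f(m1):
            lo = m1   # the maximum cannot lie left of m1
        else:
            hi = m2   # the maximum cannot lie right of m2
    return (lo + hi) / 2.0


def step_search(metric: Metric, steps: List[float], eps: float = 1e-6, max_rounds: int = 50) -> List[float]:
    """Move one interior step at a time while the other steps stay fixed,
    and stop once a full pass brings no further improvement."""
    steps = list(steps)
    best = metric(steps)
    for _ in range(max_rounds):
        improved = False
        for i in range(1, len(steps) - 1):  # endpoints 0 and 1 stay fixed
            def score(x: float, i: int = i) -> float:
                return metric(steps[:i] + [x] + steps[i + 1:])
            # Search between the neighbouring steps so the schedule stays ordered.
            x_new = ternary_search_max(score, steps[i - 1], steps[i + 1])
            new = score(x_new)
            if new > best + eps:
                steps[i], best, improved = x_new, new, True
        if not improved:
            break
    return steps


# Hypothetical metric: higher is better, with its peak at IDEAL (toy values, not from the paper).
IDEAL = [0.0, 0.1, 0.3, 0.6, 1.0]


def toy_metric(steps: List[float]) -> float:
    return -sum((s - g) ** 2 for s, g in zip(steps, IDEAL))


found = step_search(toy_metric, [0.0, 0.25, 0.5, 0.75, 1.0])
print([round(s, 3) for s in found])
```

&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Ternary search is only reliable when the score is unimodal in each step position, which is why the near-convex behavior noted above matters.&lt;/span&gt;&lt;/p&gt;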
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2026-05-15 082014.png&quot; data-origin-width=&quot;657&quot; data-origin-height=&quot;480&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/n3IFz/dJMcabjWPXt/EI74omq364XdntmhO7tIMk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/n3IFz/dJMcabjWPXt/EI74omq364XdntmhO7tIMk/img.png&quot; data-alt=&quot;Optimal Step Sampling Searching&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/n3IFz/dJMcabjWPXt/EI74omq364XdntmhO7tIMk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fn3IFz%2FdJMcabjWPXt%2FEI74omq364XdntmhO7tIMk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;657&quot; height=&quot;480&quot; data-filename=&quot;스크린샷 2026-05-15 082014.png&quot; data-origin-width=&quot;657&quot; data-origin-height=&quot;480&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;Optimal Step Sampling Searching&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;h4 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;- Initialization Strategy for Int-MeanFlow&lt;/b&gt;&lt;/span&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #ee2323;&quot;&gt;To adapt the flow matching model to Int-MeanFlow, an additional parameter $r$ is introduced&lt;/span&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;$t, r$ pass through the same embedding layer, are then concatenated, and are projected back to the feature space using a linear mapping $\mathbf{W}$&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Let the embeddings of $t, r$ be $\mathbf{e}_{t}=\mathcal{E}(t), \mathbf{e}_{r}=\mathcal{E}(r)$, respectively&lt;/span&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #006dd7;&quot;&gt;The concatenated and mapped embeddings $\mathbf{e}_{t,r},\mathbf{e}'_{t,r}$ are&lt;/span&gt;:&lt;br /&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;(Eq. 9)&lt;/b&gt;&lt;/span&gt; $\mathbf{e}_{t,r}=[\mathbf{e}_{t},\mathbf{e}_{r}],\,\,\,\mathbf{e}'_{t,r}=\mathbf{W}\mathbf{e}_{t,r}$&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #006dd7;&quot;&gt;To preserve the original model behavior, $\mathbf{W}$ is initialized as follows:&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;(Eq. 10)&lt;/b&gt;&lt;/span&gt; $\mathbf{W}=[D_{diag}\,\, 0]$&lt;br /&gt;- $D_{diag}$ : diagonal matrix&lt;/span&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
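&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;A small sketch of why the initialization in (Eq. 10) preserves the original behavior: with $\mathbf{W}=[D_{diag}\,\,0]$, the projected embedding depends only on $\mathbf{e}_{t}$ and ignores $\mathbf{e}_{r}$ at the start of training. The source only states that $D_{diag}$ is a diagonal matrix, so taking it to be the identity here is an assumption, as are the function names and the toy embedding values.&lt;/span&gt;&lt;/p&gt;

```python
from typing import List

Matrix = List[List[float]]


def init_projection(dim: int, diag: float = 1.0) -> Matrix:
    """Build W = [D_diag 0] with shape (dim, 2 * dim).
    diag = 1.0 (identity block) is an assumption; the source only says 'diagonal'."""
    return [[diag if col == row else 0.0 for col in range(2 * dim)] for row in range(dim)]


def project(W: Matrix, e_t: List[float], e_r: List[float]) -> List[float]:
    """Eq. 9: concatenate [e_t, e_r] and map it back to the feature space with W."""
    e_tr = e_t + e_r
    return [sum(w * x for w, x in zip(row, e_tr)) for row in W]


W = init_projection(dim=3)
e_t = [0.2, -1.0, 0.5]                      # toy embedding of t
out_a = project(W, e_t, [9.0, 9.0, 9.0])    # two very different toy embeddings of r
out_b = project(W, e_t, [-4.0, 0.0, 7.0])
print(out_a, out_b)                         # both equal e_t at initialization
```

&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Because the zero block is a trainable part of $\mathbf{W}$, the dependence on $r$ can be learned during distillation while the starting point matches the pre-trained flow matching model.&lt;/span&gt;&lt;/p&gt;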
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2026-05-15 082041.png&quot; data-origin-width=&quot;637&quot; data-origin-height=&quot;263&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bpNTAu/dJMb990Jrla/xkJ7H5Hb2qZkmJP5fzZGP1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bpNTAu/dJMb990Jrla/xkJ7H5Hb2qZkmJP5fzZGP1/img.png&quot; data-alt=&quot;Speaker Similarity by Sampling Step&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bpNTAu/dJMb990Jrla/xkJ7H5Hb2qZkmJP5fzZGP1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbpNTAu%2FdJMb990Jrla%2FxkJ7H5Hb2qZkmJP5fzZGP1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;637&quot; height=&quot;263&quot; data-filename=&quot;스크린샷 2026-05-15 082041.png&quot; data-origin-width=&quot;637&quot; data-origin-height=&quot;263&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;Speaker Similarity by Sampling Step&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;h3 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;background-color: #9feec3; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;3. Experiments&lt;/b&gt;&lt;/span&gt;&lt;/h3&gt;
&lt;h4 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;- Settings&lt;/b&gt;&lt;/span&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Dataset : Emilia&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Comparisons : &lt;a href=&quot;https://randomsampling.tistory.com/494&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;F5-TTS&lt;/b&gt;&lt;/a&gt;, &lt;a href=&quot;https://randomsampling.tistory.com/520&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;CosyVoice2&lt;/b&gt;&lt;/a&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;- Results&lt;/b&gt;&lt;/span&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Overall, Int-MeanFlow achieves the best performance&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2026-05-15 081327.png&quot; data-origin-width=&quot;1328&quot; data-origin-height=&quot;631&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/F5GC9/dJMcahLcM2n/Qwm2JZ0rxkX2zUv6Ws0qgK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/F5GC9/dJMcahLcM2n/Qwm2JZ0rxkX2zUv6Ws0qgK/img.png&quot; data-alt=&quot;Model Performance Comparison&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/F5GC9/dJMcahLcM2n/Qwm2JZ0rxkX2zUv6Ws0qgK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FF5GC9%2FdJMcahLcM2n%2FQwm2JZ0rxkX2zUv6Ws0qgK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1328&quot; height=&quot;631&quot; data-filename=&quot;스크린샷 2026-05-15 081327.png&quot; data-origin-width=&quot;1328&quot; data-origin-height=&quot;631&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;Model Performance Comparison&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;Int-MeanFlow also achieves the best performance on Token-to-Mel&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2026-05-15 081411.png&quot; data-origin-width=&quot;1322&quot; data-origin-height=&quot;153&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/cmsmkm/dJMcafzRkVy/Aa2uFlu1P5YwoXUFLvAwV1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/cmsmkm/dJMcafzRkVy/Aa2uFlu1P5YwoXUFLvAwV1/img.png&quot; data-alt=&quot;Token-to-Mel Performance&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/cmsmkm/dJMcafzRkVy/Aa2uFlu1P5YwoXUFLvAwV1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fcmsmkm%2FdJMcafzRkVy%2FAa2uFlu1P5YwoXUFLvAwV1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1322&quot; height=&quot;153&quot; data-filename=&quot;스크린샷 2026-05-15 081411.png&quot; data-origin-width=&quot;1322&quot; data-origin-height=&quot;153&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;Token-to-Mel Performance&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;Int-MeanFlow performs well even at small NFE&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2026-05-15 081619.png&quot; data-origin-width=&quot;650&quot; data-origin-height=&quot;233&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/Sb2Zn/dJMcacb20bu/SIgdRw7C9GdvR5RhUYulTK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/Sb2Zn/dJMcacb20bu/SIgdRw7C9GdvR5RhUYulTK/img.png&quot; data-alt=&quot;Performance by NFE&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/Sb2Zn/dJMcacb20bu/SIgdRw7C9GdvR5RhUYulTK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FSb2Zn%2FdJMcacb20bu%2FSIgdRw7C9GdvR5RhUYulTK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;650&quot; height=&quot;233&quot; data-filename=&quot;스크린샷 2026-05-15 081619.png&quot; data-origin-width=&quot;650&quot; data-origin-height=&quot;233&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;Performance by NFE&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;Training time increases as the teacher NFE increases&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2026-05-15 081813.png&quot; data-origin-width=&quot;651&quot; data-origin-height=&quot;87&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/5Em0w/dJMb99TU9iP/H7DaVcVjLQDogWFGINLslK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/5Em0w/dJMb99TU9iP/H7DaVcVjLQDogWFGINLslK/img.png&quot; data-alt=&quot;Training Time by Teacher NFE&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/5Em0w/dJMb99TU9iP/H7DaVcVjLQDogWFGINLslK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2F5Em0w%2FdJMb99TU9iP%2FH7DaVcVjLQDogWFGINLslK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;651&quot; height=&quot;87&quot; data-filename=&quot;스크린샷 2026-05-15 081813.png&quot; data-origin-width=&quot;651&quot; data-origin-height=&quot;87&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;Training Time by Teacher NFE&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;</description>
      <category>Paper/TTS</category>
      <category>Int-MeanFlow</category>
      <category>text-to-speech</category>
      <category>TTS</category>
      <author>feVeRin</author>
      <guid isPermaLink="true">https://randomsampling.tistory.com/649</guid>
      <comments>https://randomsampling.tistory.com/649#entry649comment</comments>
      <pubDate>Fri, 15 May 2026 12:50:07 +0900</pubDate>
    </item>
    <item>
      <title>[Paper Review] IPACue-TTS: Integrating Prosody and Articulatory Cues in Conditional Flow Matching for Multilingual Zero-Shot TTS</title>
      <link>https://randomsampling.tistory.com/648</link>
      <description>&lt;h2 style=&quot;text-align: justify;&quot; data-ke-size=&quot;size26&quot;&gt;&lt;b&gt;
&lt;script src=&quot;https://cdn.jsdelivr.net/npm/katex@0.16.22/dist/katex.min.js&quot;&gt;&lt;/script&gt;
&lt;script src=&quot;https://cdn.jsdelivr.net/npm/katex@0.16.22/dist/contrib/auto-render.min.js&quot;&gt;&lt;/script&gt;
&lt;script&gt;document.addEventListener(&quot;DOMContentLoaded&quot;, function() {  renderMathInElement(document.body, {    delimiters: [      {left: &quot;$$&quot;, right: &quot;$$&quot;, display: true},      {left: &quot;$&quot;, right: &quot;$&quot;, display: false}    ]  });});&lt;/script&gt;
&lt;/b&gt;&lt;/h2&gt;
&lt;h2 style=&quot;text-align: justify;&quot; data-ke-size=&quot;size26&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;IPACue-TTS: Integrating Prosody and Articulatory Cues in Conditional Flow Matching for Multilingual Zero-Shot TTS&lt;/span&gt;&lt;/b&gt;&lt;/h2&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style6&quot; /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;A native-sounding cross-lingual, code-mixed Text-to-Speech model is needed&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;IPACue-TTS&lt;/span&gt;&lt;/b&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Incorporates articulatory phoneme refinement to improve pronunciation and prosodic accuracy&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Explicitly models fine-grained acoustic and prosodic features through a flow-based framework&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Paper (ICASSP 2026) : &lt;a href=&quot;https://ieeexplore.ieee.org/abstract/document/11462369/&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;Paper Link&lt;/b&gt;&lt;/a&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style6&quot; /&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;background-color: #9feec3; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;1. Introduction&lt;/b&gt;&lt;/span&gt;&lt;/h3&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;Existing Text-to-Speech (TTS) models suffer from accent leakage in cross-lingual and code-switching settings&lt;/span&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #ee2323; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;Meanwhile, articulatory-informed TTS that incorporates speech-production cues can improve clarity&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;In addition, most TTS models model style with a coarse global embedding or learned style tokens, which makes it difficult to disentangle linguistic content&lt;br /&gt;- &lt;span style=&quot;color: #006dd7;&quot;&gt;Introducing temporal acoustic parameters such as jitter, formant frequency, and shimmer can improve the low-level description of acoustic characteristics&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #ffc1c8; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;-&amp;gt; Therefore, IPACue-TTS, which leverages articulatory cues and temporal acoustic features, is proposed&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #006dd7; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;IPACue-TTS&lt;/span&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;Introduces phonological rules based on articulatory cues&lt;/b&gt; to improve naturalness, and conditions the Text Encoder on a language embedding to capture language-specific articulatory variation&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;&lt;b&gt;Incorporates temporal acoustic and prosodic features into the Conditional Flow&lt;/b&gt; to model fine-grained acoustic information&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;&amp;lt; Overview of IPACue-TTS &amp;gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;A Conditional Flow Matching-based multilingual code-switching TTS model that leverages articulatory cues&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;As a result, it achieves better performance than existing methods&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;background-color: #9feec3; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;2. Integrating Articulation and Prosody&lt;/b&gt;&lt;/span&gt;&lt;/h3&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Since incorporating articulatory information and prosodic patterns can improve naturalness, the paper constructs phonological rules derived from speech articulation&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;- Closure Phoneme Modeling for Stop Consonants&lt;/b&gt;&lt;/span&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #006dd7;&quot;&gt;Aspirated and unaspirated stop consonants ($\texttt{/p/}$, $\texttt{/t/}$, $\texttt{/}\texttt{k}^\texttt{h}\texttt{/}$) have a 2-phase articulation in which a closure is followed by a release burst&lt;/span&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Conventional grapheme-to-phoneme conversion captures only the release phase, so it can produce under-articulated, short stop segments&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #ee2323;&quot;&gt;Therefore, the paper introduces an explicit closure phoneme for each stop consonant to model the closure phase&lt;/span&gt;&lt;br /&gt;- This improves voicing contrast perception and provides natural temporal patterns&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;- Explicit Representation of Geminated Sounds&lt;/b&gt;&lt;/span&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Gemination is articulated as a longer closure for stops and as a sustained constriction for fricatives and nasals, and it shortens the surrounding vowels&lt;/span&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #006dd7;&quot;&gt;That is, when the phoneme is simply repeated twice, the model may treat the single articulatory gesture of a prolonged closure/frication as two independent phonemes&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #ee2323;&quot;&gt;To address this, the paper introduces dedicated geminate phones that encode the temporal and spectral continuity of the lengthened sound&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
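&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;A minimal sketch of how the closure and geminate rules above could be applied to a phoneme sequence. The token conventions (the _cl suffix and the ː marker), the stop and vowel sets, and the choice to also place a closure token before a geminate stop are illustrative assumptions, not the paper's actual phone inventory.&lt;/span&gt;&lt;/p&gt;

```python
from itertools import groupby

# Hypothetical token conventions (not taken from the paper):
#   "X_cl" marks the closure phase inserted before stop X
#   "Xː"   is a dedicated geminate phone for a doubled consonant X
STOPS = {"p", "t", "k", "b", "d", "g", "pʰ", "tʰ", "kʰ"}
VOWELS = {"a", "e", "i", "o", "u"}


def apply_articulatory_rules(phones):
    """Merge repeated consonants into one geminate phone, then add closure tokens."""
    merged = []
    for phone, run in groupby(phones):
        count = len(list(run))
        if phone in VOWELS or count == 1:
            merged.extend([phone] * count)
        else:
            merged.append(phone + "ː")  # one prolonged gesture, not two phones

    out = []
    for phone in merged:
        base = phone.rstrip("ː")
        if base in STOPS:
            out.append(base + "_cl")  # explicit closure phase before the release
        out.append(phone)
    return out


print(apply_articulatory_rules(["a", "p", "a"]))
print(apply_articulatory_rules(["a", "t", "t", "a"]))
```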
&lt;h4 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;- Temporal Acoustic and Prosody Modeling&lt;/b&gt;&lt;/span&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #ee2323;&quot;&gt;Beyond articulatory cues, the paper adds low-level temporal acoustic descriptors such as formant details, energy, shimmer, Hammarberg index, and spectral tilt as explicit conditioning variables&lt;/span&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #006dd7;&quot;&gt;With these descriptors, the model can capture fine-grained temporal and spectral variation in intonation, rhythm, and stress, improving naturalness and expressiveness&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;In particular, explicit conditioning on the prosody descriptors preserves finer-level speaker-specific characteristics, which supports speaker generalization&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2026-05-14 080603.png&quot; data-origin-width=&quot;660&quot; data-origin-height=&quot;478&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/djROhe/dJMcahdlCVA/DfNMKVJ8BB99644gXkLSJk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/djROhe/dJMcahdlCVA/DfNMKVJ8BB99644gXkLSJk/img.png&quot; data-alt=&quot;Overview&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/djROhe/dJMcahdlCVA/DfNMKVJ8BB99644gXkLSJk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FdjROhe%2FdJMcahdlCVA%2FDfNMKVJ8BB99644gXkLSJk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;660&quot; height=&quot;478&quot; data-filename=&quot;스크린샷 2026-05-14 080603.png&quot; data-origin-width=&quot;660&quot; data-origin-height=&quot;478&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;Overview&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;h3 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;background-color: #9feec3; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;3. Architecture&lt;/b&gt;&lt;/span&gt;&lt;/h3&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;In IPACue-TTS, text is converted to the International Phonetic Alphabet (IPA) with $\texttt{phonemizer}$&lt;/span&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #ee2323;&quot;&gt;Articulatory cues are applied to the IPA sequence following the phonological rules above, and each IPA symbol is converted to a 128-dimensional non-trainable embedding&lt;/span&gt;&lt;br /&gt;- Phoneme positional information is incorporated through Rotary Position Embedding (RoPE)&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #006dd7;&quot;&gt;Because the phoneset is shared across languages, a language-conditioned Text Encoder is used to capture cross-language articulatory variation&lt;/span&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;To condition on the language, a 16-dimensional fixed embedding is passed to the Text Encoder together with the IPA embedding&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;The Text Encoder consists of ConvNeXt modules, and the resulting linguistic representation is concatenated with filler token embeddings to match the number of frames in the target mel-spectrogram&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;The mel-spectrogram is computed from the target speech with a 40ms frame length, 20ms frame shift, and 1024 FFT length, and the temporal acoustic and prosodic parameters are extracted with $\texttt{openSMILE}$&lt;/span&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #006dd7;&quot;&gt;IPACue-TTS is trained with conditional flow matching on an infilling task that predicts a mel-spectrogram segment conditioned on its neighboring segments&lt;/span&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;The acoustic reference is formed by concatenating the temporal acoustic and prosodic parameters with the mel-spectrogram, and it is then modeled by the Conditional Flow module&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;The acoustic reference segment to be predicted is randomly masked, and, as in &lt;a href=&quot;https://randomsampling.tistory.com/494&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;F5-TTS&lt;/b&gt;&lt;/a&gt;, a noise input with the same dimension as the acoustic reference is initialized at the 0-th step of the flow&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #006dd7;&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;color: #006dd7;&quot;&gt;The filled linguistic representation, the masked acoustic reference, and the noise input are concatenated and passed to the Diffusion Transformer (DiT) module, which predicts the acoustic target, i.e., the masked segment&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;The inverse mask is then applied to the predicted output so that the loss is computed over the masked region&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;At inference, the reference speech and the text are concatenated and passed to the model&lt;/span&gt;&lt;/span&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #006dd7;&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;color: #006dd7;&quot;&gt;The acoustic reference extracted from the reference speech serves as the unmasked segment, and a mask whose approximate length is computed from the average phone duration is used for the filling task&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;The acoustic target is predicted in 16 steps, and the generated mel-spectrogram is reconstructed into a waveform by a pre-trained &lt;a href=&quot;https://randomsampling.tistory.com/245&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;Vocos&lt;/b&gt;&lt;/a&gt; vocoder&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
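&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;The masked infilling objective described above can be sketched roughly as follows. This assumes the straight-line noise-to-data path commonly used with conditional flow matching; the function names, the mask fraction, the feature sizes, and the zero-valued stand-in for the DiT output are all hypothetical, and the real model additionally receives the filled linguistic representation.&lt;/span&gt;&lt;/p&gt;

```python
import numpy as np


def sample_infilling_pair(x1, rng, mask_frac=0.5):
    """Build one training example from an acoustic reference x1 of shape (frames, dim)."""
    n_frames = x1.shape[0]
    span = max(1, int(n_frames * mask_frac))
    start = int(rng.integers(0, n_frames - span + 1))
    mask = np.zeros((n_frames, 1))
    mask[start:start + span] = 1.0        # 1 = frames the model must predict
    x0 = rng.standard_normal(x1.shape)    # noise input at flow step 0
    t = rng.uniform()                     # random flow time
    xt = (1.0 - t) * x0 + t * x1          # point on the straight noise-to-data path
    target = x1 - x0                      # velocity the model should regress
    context = (1.0 - mask) * x1           # masked acoustic reference
    return xt, context, target, mask, t


def masked_mse(pred, target, mask):
    """Mean squared error restricted to the masked frames."""
    sq_err = ((pred - target) ** 2) * mask
    return float(sq_err.sum() / (mask.sum() * pred.shape[1]))


rng = np.random.default_rng(0)
x1 = rng.standard_normal((20, 8))         # stand-in for mel + prosody features
xt, context, target, mask, t = sample_infilling_pair(x1, rng)
pred = np.zeros_like(target)              # stand-in for the DiT output
print(masked_mse(pred, target, mask))
```

&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Because the squared error is multiplied by the mask, errors on the unmasked context frames do not contribute to the loss.&lt;/span&gt;&lt;/p&gt;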
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2026-05-14 080623.png&quot; data-origin-width=&quot;660&quot; data-origin-height=&quot;487&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/cgGBVx/dJMcagMiSJE/vlaIW55J0JpBeIsM3an010/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/cgGBVx/dJMcagMiSJE/vlaIW55J0JpBeIsM3an010/img.png&quot; data-alt=&quot;Inference&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/cgGBVx/dJMcagMiSJE/vlaIW55J0JpBeIsM3an010/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FcgGBVx%2FdJMcagMiSJE%2FvlaIW55J0JpBeIsM3an010%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;660&quot; height=&quot;487&quot; data-filename=&quot;스크린샷 2026-05-14 080623.png&quot; data-origin-width=&quot;660&quot; data-origin-height=&quot;487&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;Inference&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;h3 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;background-color: #9feec3; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;4. Experiments&lt;/b&gt;&lt;/span&gt;&lt;/h3&gt;
&lt;h4 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;- Settings&lt;/b&gt;&lt;/span&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Dataset : IndicTTS, SYSPIN, RASA&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Comparisons : &lt;a href=&quot;https://randomsampling.tistory.com/102&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;YourTTS&lt;/b&gt;&lt;/a&gt;, &lt;a href=&quot;https://randomsampling.tistory.com/38&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;FastSpeech2&lt;/b&gt;&lt;/a&gt;, &lt;a href=&quot;https://randomsampling.tistory.com/217&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;Matcha-TTS&lt;/b&gt;&lt;/a&gt;, &lt;a href=&quot;https://randomsampling.tistory.com/496&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;E2-TTS&lt;/b&gt;&lt;/a&gt;, &lt;a href=&quot;https://randomsampling.tistory.com/494&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;F5-TTS&lt;/b&gt;&lt;/a&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;- Results&lt;/b&gt;&lt;/span&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Overall, IPACue-TTS achieves the best performance among the compared models&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2026-05-14 080342.png&quot; data-origin-width=&quot;1191&quot; data-origin-height=&quot;270&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/ch9K62/dJMb99TUanx/OZoTMdyCzazfGacR2OtKsk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/ch9K62/dJMb99TUanx/OZoTMdyCzazfGacR2OtKsk/img.png&quot; data-alt=&quot;Model performance comparison&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/ch9K62/dJMb99TUanx/OZoTMdyCzazfGacR2OtKsk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fch9K62%2FdJMb99TUanx%2FOZoTMdyCzazfGacR2OtKsk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1191&quot; height=&quot;270&quot; data-filename=&quot;스크린샷 2026-05-14 080342.png&quot; data-origin-width=&quot;1191&quot; data-origin-height=&quot;270&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;Model performance comparison&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;</description>
      <category>Paper/TTS</category>
      <category>IPACue-TTS</category>
      <category>text-to-speech</category>
      <category>TTS</category>
      <author>feVeRin</author>
      <guid isPermaLink="true">https://randomsampling.tistory.com/648</guid>
      <comments>https://randomsampling.tistory.com/648#entry648comment</comments>
      <pubDate>Thu, 14 May 2026 14:03:00 +0900</pubDate>
    </item>
    <item>
      <title>[Paper 리뷰] IBPCodec: A Low-Bitrate Lightweight Speech Codec with Inter-Band Prediction</title>
      <link>https://randomsampling.tistory.com/647</link>
      <description>&lt;h2 style=&quot;text-align: justify;&quot; data-ke-size=&quot;size26&quot;&gt;&lt;b&gt;
&lt;script src=&quot;https://cdn.jsdelivr.net/npm/katex@0.16.22/dist/katex.min.js&quot;&gt;&lt;/script&gt;
&lt;script src=&quot;https://cdn.jsdelivr.net/npm/katex@0.16.22/dist/contrib/auto-render.min.js&quot;&gt;&lt;/script&gt;
&lt;script&gt;document.addEventListener(&quot;DOMContentLoaded&quot;, function() {  renderMathInElement(document.body, {    delimiters: [      {left: &quot;$$&quot;, right: &quot;$$&quot;, display: true},      {left: &quot;$&quot;, right: &quot;$&quot;, display: false}    ]  });});&lt;/script&gt;
&lt;/b&gt;&lt;/h2&gt;
&lt;h2 style=&quot;text-align: justify;&quot; data-ke-size=&quot;size26&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;IBPCodec: A Low-Bitrate Lightweight Speech Codec with Inter-Band Prediction&lt;/span&gt;&lt;/b&gt;&lt;/h2&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style6&quot; /&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Neural codecs are limited by their high computational complexity&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;IBPCodec&lt;/span&gt;&lt;/b&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Models only the low-frequency information, relying on Inter-Band Prediction for the rest&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;At decoding, exploits the correlation between the high- and low-frequency bands to reconstruct the full speech signal&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Paper (ICASSP 2026) : &lt;a href=&quot;https://ieeexplore.ieee.org/abstract/document/11462198/&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;Paper Link&lt;/b&gt;&lt;/a&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style6&quot; /&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;background-color: #9feec3; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;1. Introduction&lt;/b&gt;&lt;/span&gt;&lt;/h3&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;A speech codec compresses a continuous waveform into a discrete representation&lt;/span&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;In particular, neural codecs such as &lt;a href=&quot;https://randomsampling.tistory.com/211&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;SoundStream&lt;/b&gt;&lt;/a&gt;, &lt;a href=&quot;https://randomsampling.tistory.com/210&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;EnCodec&lt;/b&gt;&lt;/a&gt;, and &lt;a href=&quot;https://randomsampling.tistory.com/258&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;DAC&lt;/b&gt;&lt;/a&gt; use an encoder-decoder architecture with vector quantization to produce discrete sequences&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #ee2323;&quot;&gt;However, most neural codecs have high computational complexity&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;- &lt;span style=&quot;color: #006dd7;&quot;&gt;This limits their use on edge devices and in downstream tasks&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #ffc1c8; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;-&amp;gt; The paper therefore proposes IBPCodec, which reduces the complexity of neural codecs&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #006dd7; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;IBPCodec&lt;/span&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;Performs &lt;b&gt;feature extraction on the low-frequency component&lt;/b&gt;, followed by quantization&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;When decoding the quantized features, &lt;b&gt;uses the Inter-Band Prediction Module&lt;/b&gt; to predict the high-frequency component from the low-frequency component&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;&amp;lt; Overview of IBPCodec &amp;gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;A lightweight, low-bitrate neural codec built on Inter-Band Prediction&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;As a result, it achieves better performance than existing codecs&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;background-color: #9feec3; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;2. Method&lt;/b&gt;&lt;/span&gt;&lt;/h3&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;- Overview&lt;/b&gt;&lt;/span&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #ee2323;&quot;&gt;Low-frequency components dominate both subjective and objective evaluation, so the paper aims to preserve the low-frequency spectral components&lt;/span&gt;&lt;br /&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #006dd7;&quot;&gt;Structurally, IBPCodec follows an encoder-quantizer-decoder architecture&lt;/span&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;The input signal $x\in\mathbb{R}^{1\times T}$ is converted by the STFT into a time-frequency spectrum $f\in\mathbb{R}^{F\times N}$&lt;br /&gt;- $T$ : number of time-domain samples, $N$ : number of temporal frames, $F$ : number of frequency bins&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Low-frequency clipping is then applied, and the amplitude, real, and imaginary parts $f_{low}\in\mathbb{R}^{3\times F'\times N}$ are used as the network input&lt;br /&gt;- $F'$ : number of low-frequency bins&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;The proportion of the low-frequency band in the total bandwidth is $P=\frac{F'}{F}$&lt;/span&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;The Encoder extracts a latent feature $z$ from the low-frequency input component $f_{low}$, and the quantizer maps it to a discrete latent representation $z_{q}$&lt;br /&gt;- The decoder then recovers the low-frequency component from the quantized latent representation&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #006dd7;&quot;&gt;Given the decoded low-frequency component $f'_{low}$, the Inter-Band Prediction Module (IBPM) predicts the high-frequency band from the low-frequency band:&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;(Eq. 1)&lt;/b&gt;&lt;/span&gt; $f'_{high}=\text{IBPM}(f'_{low})$&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #006dd7;&quot;&gt;$f'_{low}$ and $f'_{high}$ are concatenated into the complete spectrum $f'$, and the iSTFT yields the complete speech signal $x'$&lt;/span&gt;:&lt;br /&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;(Eq. 2)&lt;/b&gt;&lt;/span&gt; $x'=\text{iSTFT}(\text{concat}(f'_{low},f'_{high}))$&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2026-05-13 081540.png&quot; data-origin-width=&quot;1151&quot; data-origin-height=&quot;480&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/zVyH5/dJMcaciKcR7/dQoQ52jxidMma2IVmEiifk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/zVyH5/dJMcaciKcR7/dQoQ52jxidMma2IVmEiifk/img.png&quot; data-alt=&quot;Overview&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/zVyH5/dJMcaciKcR7/dQoQ52jxidMma2IVmEiifk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FzVyH5%2FdJMcaciKcR7%2FdQoQ52jxidMma2IVmEiifk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1151&quot; height=&quot;480&quot; data-filename=&quot;스크린샷 2026-05-13 081540.png&quot; data-origin-width=&quot;1151&quot; data-origin-height=&quot;480&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;Overview&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
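&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;The shape bookkeeping of the low-frequency clipping and the band concatenation in Eq. 2 can be sketched as below. The helper names, the toy sizes, and the way the complex low band is rebuilt from the real and imaginary channels are assumptions for illustration; the STFT/iSTFT, the codec itself, and the IBPM are left out, with the true high band standing in for the IBPM output.&lt;/span&gt;&lt;/p&gt;

```python
import numpy as np


def clip_low_band(spec, proportion):
    """Keep the lowest F' = round(P * F) bins of a complex (F, N) spectrum.

    Returns a real array of shape (3, F', N): amplitude, real part, imaginary part.
    """
    n_low = int(round(spec.shape[0] * proportion))
    low = spec[:n_low]
    return np.stack([np.abs(low), low.real, low.imag])


def to_complex(features):
    """Rebuild a complex band from the real/imaginary channels (an assumed convention)."""
    return features[1] + 1j * features[2]


def merge_bands(low, high):
    """Concatenate low- and high-band spectra along the frequency axis."""
    return np.concatenate([low, high], axis=0)


rng = np.random.default_rng(0)
spec = rng.standard_normal((8, 5)) + 1j * rng.standard_normal((8, 5))  # F=8, N=5
f_low = clip_low_band(spec, proportion=0.5)                            # P = F'/F = 0.5
full = merge_bands(to_complex(f_low), spec[4:])  # true high band stands in for IBPM
print(f_low.shape, full.shape)
```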
&lt;h4 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;- Encoder and Decoder&lt;/b&gt;&lt;/span&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #ee2323;&quot;&gt;The Encoder consists of a ConvEncoder and a Temporal Aggregation Module (TAM)&lt;/span&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #006dd7;&quot;&gt;The ConvEncoder is a stack of Encoder blocks&lt;/span&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Each block consists of a 2D downsampling convolution with $C$ channels, kernel size $(k,1)$, and stride $(s,1)$, and a 2D convolution with $C$ channels, kernel size $(1,1)$, and stride $(1,1)$&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;ReLU activations connect the convolutions and the Encoder blocks&lt;/span&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #006dd7;&quot;&gt;Because the downsampling stride and kernel size are set to $1$ along the time axis, all feature extraction in the ConvEncoder happens within each frame and inter-frame correlation is not considered&lt;/span&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;To address this, the paper adds a TAM on each side of the quantizer; structurally, following&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;&lt;a href=&quot;https://randomsampling.tistory.com/572&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;FocalCodec&lt;/b&gt;&lt;/a&gt;, the TAM adopts a causal variant of the FocalBlock&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #006dd7;&quot;&gt;The FocalBlock uses depthwise convolutions of different sizes to capture dependencies, and gated aggregation condenses contextual features at multiple granularities into a single feature vector&lt;/span&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;The Decoder mirrors the Encoder, replacing the downsampling convolutions with upsampling convolutions&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;- Inter-Band Prediction Module&lt;/b&gt;&lt;/span&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #ee2323;&quot;&gt;The paper embeds the Inter-Band Prediction Module (IBPM) directly into the codec&lt;/span&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #006dd7;&quot;&gt;The IBPM exploits the correlation between the high- and low-frequency bands to predict the high-frequency component from the low-frequency component&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;First, the channel and frequency dimensions of the input tensor are merged so that the temporal resolution is preserved&lt;/span&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #006dd7;&quot;&gt;Three layers of pointwise 1D convolution then project the merged dimension from the low-frequency size to the high-frequency size while preserving the temporal dimension&lt;/span&gt;&lt;br /&gt;- Each 1D convolution has a PReLU activation&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;Finally, a reshape operation restores the original tensor layout&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
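&lt;p data-ke-size=&quot;size16&quot;&gt;The merge, project, and reshape steps of the IBPM can be sketched roughly as follows. This is a minimal NumPy illustration and not the paper's implementation: the tensor sizes, the hidden width of 32, the random weights, and the fixed PReLU slope of 0.25 are all assumptions made for the example (in a trained model the weights and slope would be learned). A pointwise convolution is written as a matrix product over the merged channel axis, which is what a kernel-size-1 Conv1d computes.&lt;/p&gt;

```python
import numpy as np

def prelu(x, slope=0.25):
    # PReLU: identity for positive inputs, slope * x for negative inputs.
    # The slope is fixed here for illustration; normally it is a learned parameter.
    return np.maximum(x, 0.0) + slope * np.minimum(x, 0.0)

def pointwise_conv1d(x, weight, bias):
    # A kernel-size-1 Conv1d mixes channels independently at every time step.
    # x: (batch, in_ch, time), weight: (out_ch, in_ch), bias: (out_ch,)
    return np.matmul(weight, x) + bias[None, :, None]

def ibpm_forward(low, params, f_high):
    # low: (batch, channels, f_low, time) low-band feature map (assumed layout).
    batch, channels, f_low, time = low.shape
    # 1) Merge the channel and frequency axes; the time axis is left untouched.
    x = low.reshape(batch, channels * f_low, time)
    # 2) Three pointwise convolutions, each followed by PReLU.
    for weight, bias in params:
        x = prelu(pointwise_conv1d(x, weight, bias))
    # 3) Restore the (batch, channels, frequency, time) layout for the high band.
    return x.reshape(batch, channels, f_high, time)

def init_params(rng, dims):
    # dims lists the merged-channel width before and after each layer.
    return [(rng.standard_normal((o, i)) * 0.1, np.zeros(o))
            for i, o in zip(dims[:-1], dims[1:])]

rng = np.random.default_rng(0)
channels, f_low, f_high, time = 4, 6, 2, 10  # illustrative sizes only
dims = [channels * f_low, 32, 32, channels * f_high]
params = init_params(rng, dims)
low_band = rng.standard_normal((1, channels, f_low, time))
high_pred = ibpm_forward(low_band, params, f_high)
print(high_pred.shape)  # (1, 4, 2, 10)
```

&lt;p data-ke-size=&quot;size16&quot;&gt;Because every layer has kernel size 1, each output frame depends only on the input frame at the same time index, so the number of frames is unchanged from input to output.&lt;/p&gt;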
&lt;h4 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;- Quantizer and Discriminator&lt;/b&gt;&lt;/span&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;Following &lt;a href=&quot;https://randomsampling.tistory.com/243&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;HiFi-Codec&lt;/b&gt;&lt;/a&gt;, the paper adopts GRVQ as the quantizer&lt;/span&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #006dd7;&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;color: #006dd7;&quot;&gt;The number of groups is set to $G=2$, and the bitrate can be adjusted through the number of GRVQ layers&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Following &lt;a href=&quot;https://randomsampling.tistory.com/210&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;EnCodec&lt;/b&gt;&lt;/a&gt;, a Multi-Scale STFT (MS-STFT) discriminator is used&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
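&lt;p data-ke-size=&quot;size16&quot;&gt;To see why the number of GRVQ layers controls the bitrate, a rough calculation may help. The sketch below assumes that every frame emits one codebook index per group and per residual layer. Only $G=2$ comes from the text above; the frame rate of 50 Hz and the codebook size of 1024 are hypothetical values chosen for the example, and the function name is made up as well.&lt;/p&gt;

```python
import math

def grvq_bitrate_bps(frame_rate_hz, num_groups, num_layers, codebook_size):
    # Each frame emits one codebook index per group and per residual layer,
    # and each index costs log2(codebook_size) bits.
    bits_per_index = math.log2(codebook_size)
    return frame_rate_hz * num_groups * num_layers * bits_per_index

# G = 2 comes from the text; frame rate and codebook size are hypothetical.
for layers in (1, 2, 4):
    bps = grvq_bitrate_bps(frame_rate_hz=50, num_groups=2,
                           num_layers=layers, codebook_size=1024)
    print(layers, "layer(s):", bps / 1000, "kbps")
```

&lt;p data-ke-size=&quot;size16&quot;&gt;Under these made-up values each additional layer adds 1 kbps, so the bitrate scales linearly with the layer count.&lt;/p&gt;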
&lt;h4 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;- Loss Function&lt;/b&gt;&lt;/span&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Given the reconstruction loss $\mathcal{L}_{rec}$, adversarial loss $\mathcal{L}_{adv}$, feature matching loss $\mathcal{L}_{feat}$, and commitment loss $\mathcal{L}_{cmt}$:&lt;/span&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #006dd7;&quot;&gt;The loss function of IBPCodec is&lt;/span&gt;:&lt;br /&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;(Eq. 3)&lt;/b&gt;&lt;/span&gt; $\mathcal{L}=\lambda_{rec}\mathcal{L}_{rec}+\lambda_{adv}\mathcal{L}_{adv}+\lambda_{feat}\mathcal{L}_{feat}+\lambda_{cmt}\mathcal{L}_{cmt}$&lt;br /&gt;- $\lambda_{rec},\lambda_{adv},\lambda_{feat},\lambda_{cmt}$ : weight coefficients&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;span style=&quot;color: #006dd7;&quot;&gt;Here, the reconstruction loss $\mathcal{L}_{rec}$ is&lt;/span&gt;:&lt;br /&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;(Eq. 4)&lt;/b&gt;&lt;/span&gt; $\mathcal{L}_{rec}=\lambda_{wav}\mathcal{L}_{wav}+\lambda_{mel}\mathcal{L}_{mel}$&lt;br /&gt;- $\lambda_{wav},\lambda_{mel}$ : weight coefficients, $\mathcal{L}_{wav}$ : waveform loss, $\mathcal{L}_{mel}$ : mel-spectrogram loss&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
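&lt;p data-ke-size=&quot;size16&quot;&gt;As a small worked example, Eq. 3 and Eq. 4 can be combined as below. The function names, the loss values, and the weights are all made up for illustration; the weights used in the paper are not stated in this review.&lt;/p&gt;

```python
def reconstruction_loss(l_wav, l_mel, lam_wav, lam_mel):
    # Eq. 4: weighted sum of the waveform and mel-spectrogram losses.
    return lam_wav * l_wav + lam_mel * l_mel

def total_loss(terms, weights):
    # Eq. 3: weighted sum over the rec, adv, feat, and cmt terms.
    return sum(weights[name] * terms[name]
               for name in ("rec", "adv", "feat", "cmt"))

# Hypothetical per-batch loss values and weights, for illustration only.
l_rec = reconstruction_loss(l_wav=0.5, l_mel=2.0, lam_wav=1.0, lam_mel=1.0)
terms = {"rec": l_rec, "adv": 1.0, "feat": 4.0, "cmt": 0.5}
weights = {"rec": 1.0, "adv": 1.0, "feat": 2.0, "cmt": 0.25}
loss = total_loss(terms, weights)
print(l_rec, loss)  # 2.5 11.625
```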
&lt;h3 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;background-color: #9feec3; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;3. Experiments&lt;/b&gt;&lt;/span&gt;&lt;/h3&gt;
&lt;h4 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;- Settings&lt;/b&gt;&lt;/span&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Dataset : VCTK&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Comparisons : &lt;a href=&quot;https://randomsampling.tistory.com/258&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;DAC&lt;/b&gt;&lt;/a&gt;, &lt;a href=&quot;https://randomsampling.tistory.com/449&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;SpeechTokenizer&lt;/b&gt;&lt;/a&gt;, &lt;a href=&quot;https://randomsampling.tistory.com/417&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;&lt;b&gt;FunCodec&lt;/b&gt;&lt;/a&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;&lt;b&gt;- Results&lt;/b&gt;&lt;/span&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif; color: #000000;&quot;&gt;Overall, IBPCodec achieves the best performance among the compared models&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2026-05-13 081114.png&quot; data-origin-width=&quot;1461&quot; data-origin-height=&quot;478&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/zho2E/dJMcahYGKV8/SyYnQGbk2j4guEYJc3B9z0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/zho2E/dJMcahYGKV8/SyYnQGbk2j4guEYJc3B9z0/img.png&quot; data-alt=&quot;Model Performance Comparison&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/zho2E/dJMcahYGKV8/SyYnQGbk2j4guEYJc3B9z0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fzho2E%2FdJMcahYGKV8%2FSyYnQGbk2j4guEYJc3B9z0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1461&quot; height=&quot;478&quot; data-filename=&quot;스크린샷 2026-05-13 081114.png&quot; data-origin-width=&quot;1461&quot; data-origin-height=&quot;478&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;Model Performance Comparison&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;IBPCodec also performs well in the subjective evaluation&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2026-05-13 081150.png&quot; data-origin-width=&quot;731&quot; data-origin-height=&quot;627&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/c5StUn/dJMcaiXyxTV/Ua389VPiRLYFmkTNpgunY0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/c5StUn/dJMcaiXyxTV/Ua389VPiRLYFmkTNpgunY0/img.png&quot; data-alt=&quot;Subjective Evaluation&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/c5StUn/dJMcaiXyxTV/Ua389VPiRLYFmkTNpgunY0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fc5StUn%2FdJMcaiXyxTV%2FUa389VPiRLYFmkTNpgunY0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;731&quot; height=&quot;627&quot; data-filename=&quot;스크린샷 2026-05-13 081150.png&quot; data-origin-width=&quot;731&quot; data-origin-height=&quot;627&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;Subjective Evaluation&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;The best performance is achieved at a low-band proportion of $P=0.75$&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2026-05-13 081256.png&quot; data-origin-width=&quot;673&quot; data-origin-height=&quot;197&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/dhPLSl/dJMcaaywCmH/5G8zjGuuLiS4GwB95HZmt1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/dhPLSl/dJMcaaywCmH/5G8zjGuuLiS4GwB95HZmt1/img.png&quot; data-alt=&quot;Performance by Low-band Proportion&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/dhPLSl/dJMcaaywCmH/5G8zjGuuLiS4GwB95HZmt1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FdhPLSl%2FdJMcaaywCmH%2F5G8zjGuuLiS4GwB95HZmt1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;673&quot; height=&quot;197&quot; data-filename=&quot;스크린샷 2026-05-13 081256.png&quot; data-origin-width=&quot;673&quot; data-origin-height=&quot;197&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;Performance by Low-band Proportion&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;b&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;Ablation Study&lt;/span&gt;&lt;/b&gt;
&lt;ul style=&quot;list-style-type: circle;&quot; data-ke-list-type=&quot;circle&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000; font-family: AppleSDGothicNeo-Regular, 'Malgun Gothic', '맑은 고딕', dotum, 돋움, sans-serif;&quot;&gt;Each component contributes to the performance improvement&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;스크린샷 2026-05-13 081324.png&quot; data-origin-width=&quot;678&quot; data-origin-height=&quot;242&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bcogML/dJMcaiwv8ZD/sY2KsQpg0NKsjKZsDrqeFK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bcogML/dJMcaiwv8ZD/sY2KsQpg0NKsjKZsDrqeFK/img.png&quot; data-alt=&quot;Ablation Study&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bcogML/dJMcaiwv8ZD/sY2KsQpg0NKsjKZsDrqeFK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbcogML%2FdJMcaiwv8ZD%2FsY2KsQpg0NKsjKZsDrqeFK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;678&quot; height=&quot;242&quot; data-filename=&quot;스크린샷 2026-05-13 081324.png&quot; data-origin-width=&quot;678&quot; data-origin-height=&quot;242&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;Ablation Study&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;</description>
      <category>Paper/Neural Codec</category>
      <category>IBPCodec</category>
      <category>Neural Codec</category>
      <author>feVeRin</author>
      <guid isPermaLink="true">https://randomsampling.tistory.com/647</guid>
      <comments>https://randomsampling.tistory.com/647#entry647comment</comments>
      <pubDate>Wed, 13 May 2026 13:04:07 +0900</pubDate>
    </item>
  </channel>
</rss>