Deepdub vs Top Real-Time TTS | Top-tier Emotive TTS in English

Methodology

Blind-tested against every real-time leader

To see how we stack up against the industry's top real-time engines, we ran a rigorous blind listening study in English, comparing Phantom X 3.2 with Inworld, Hume, Async, and ElevenLabs. Linguistic experts conducted thousands of blind pairwise comparisons. The results show Phantom X 3.2 sets a new bar for expressivity and quality at exceptionally low latency.

Result

Phantom X 3.2 ranked at the top

Expressive performance was our primary metric, judged on sound quality, prosody, intonation, and absence of artifacts.

ELO Scores — English

Inworld TTS 1.5-Max · 1549
Deepdub Phantom X 3.2 · 1545 · tied #1
Hume Octave · 1498
Async Flash v1.0 · 1493
ElevenLabs Turbo v2.5 · 1416

Average ELO rating · higher is better
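
For readers who want to relate the ratings to listener preference, here is a minimal sketch. It assumes the ratings follow the conventional Elo logistic model with a 400-point scale; the page does not say how the ratings were fitted, so treat the formula and the `expected_preference` helper as illustrative assumptions rather than the study's actual method.

```python
# Assumption: conventional Elo logistic model with a 400-point scale.
# The page does not state how its ratings were computed.

def expected_preference(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected share of pairwise wins for A over B under the Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


# Ratings copied from the chart above.
ratings = {
    "Inworld TTS 1.5-Max": 1549,
    "Deepdub Phantom X 3.2": 1545,
    "Hume Octave": 1498,
    "Async Flash v1.0": 1493,
    "ElevenLabs Turbo v2.5": 1416,
}

reference = "Deepdub Phantom X 3.2"
for name, rating in ratings.items():
    if name == reference:
        continue
    share = expected_preference(ratings[reference], rating)
    print(f"{reference} vs {name}: expected preference {share:.1%}")
```

Under this assumption, a gap of only a few rating points corresponds to an expected split very close to 50/50, while a gap of more than 100 points corresponds to a clear majority preference.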

Real-time latency

The fastest model among real-time TTS

In real-time conversational AI, latency is the ultimate barrier to immersion. Our Time-To-First-Audio (TTFA) optimization ensures Phantom X 3.2 responds before the human listener can perceive a delay — at ~125 ms.

Time-To-First-Audio · ms · lower is better

Deepdub Phantom X 3.2 · 125 ms
Async Flash v1.0 · 166 ms
Hume Octave · 200 ms
Inworld TTS 1.5-Max · 250 ms
ElevenLabs Turbo v2.5 · 300 ms

Time-to-first-audio (ms) · measured under identical conditions
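
As a rough illustration of what a TTFA measurement involves, the sketch below times how long a streaming response takes to deliver its first non-empty audio chunk. The page does not describe the benchmark harness, so this is an assumption about the general approach; `fake_tts_stream` is a hypothetical stand-in for a streaming TTS client and is not any vendor's API.

```python
import time
from typing import Iterable, Iterator


def measure_ttfa_ms(audio_chunks: Iterable[bytes]) -> float:
    """Milliseconds from the start of consumption to the first non-empty chunk."""
    start = time.perf_counter()
    for chunk in audio_chunks:
        if chunk:  # skip empty keep-alive chunks
            return (time.perf_counter() - start) * 1000.0
    raise RuntimeError("stream ended before any audio arrived")


def fake_tts_stream(first_chunk_delay_s: float = 0.05) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS client (not a real API)."""
    time.sleep(first_chunk_delay_s)  # simulated network and synthesis delay
    yield b"\x00" * 320              # first audio chunk
    yield b"\x00" * 320              # remaining audio


if __name__ == "__main__":
    print(f"TTFA: {measure_ttfa_ms(fake_tts_stream()):.0f} ms")
```

With a real client, the timer should start at the moment the request is sent, and each engine should be measured from the same network location with the same input text so the numbers are comparable.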

Head-to-head matchups

Listeners prefer Phantom X 3.2 across the field

Our model finished in a statistical tie for first place and decisively outperformed every other competitor tested. The result: Deepdub Phantom X 3.2 sits at the very top of the industry on the only benchmark that matters — what real people actually prefer to hear.

Phantom X 3.2 65% · ElevenLabs Turbo v2.5 35%
Phantom X 3.2 57.9% · Async Flash v1.0 42.1%
Phantom X 3.2 57.8% · Hume Octave 42.2%
Phantom X 3.2 49.4% · Inworld TTS 1.5-Max 50.6%

Blind pairwise listener preference · English
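
To make the idea of a "statistical tie" concrete, the sketch below computes a normal-approximation 95% confidence interval for a preference share. The page does not report how many comparisons were run per matchup, so `n = 1000` is a placeholder assumption chosen only for illustration, and the interval method is likewise an assumption, not the study's stated analysis.

```python
import math
from typing import Tuple


def preference_interval(share: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Normal-approximation confidence interval for a preference share."""
    half_width = z * math.sqrt(share * (1.0 - share) / n)
    return share - half_width, share + half_width


# Assumption: 1000 comparisons per matchup (the page does not give this number).
n = 1000

# Phantom X 3.2 preference shares copied from the matchups above.
matchups = {
    "ElevenLabs Turbo v2.5": 0.65,
    "Async Flash v1.0": 0.579,
    "Hume Octave": 0.578,
    "Inworld TTS 1.5-Max": 0.494,
}

for opponent, share in matchups.items():
    low, high = preference_interval(share, n)
    verdict = "tie (interval contains 50%)" if low <= 0.5 <= high else "clear preference"
    print(f"vs {opponent}: {share:.1%} [{low:.1%}, {high:.1%}] -> {verdict}")
```

An interval that contains 50% means the data cannot distinguish the two engines at that sample size; the actual conclusion depends on the real number of comparisons.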

Expressivity

What makes Phantom X 3.2 actually emote

Two systems work together: a wide emotion library that you control inline in the script, and a paralinguistic layer that adds the small unconscious signals listeners associate with real speech.

Emotional Layering

80+ emotion styles, from supportive to malicious. Inline tags let a single line shift tone mid-sentence naturally.

Paralinguistic Cues

Natural breath, pauses, and micro-shifts in tone. The small, unconscious signals that make speech sound spoken, not generated.

User

Hey — my flight just got cancelled and I have a meeting in Berlin tomorrow morning. I'm freaking out.

Agent (Phantom X 3.2)

supportive

Hi, I understand. We'll figure it out together.

focused

I'm pulling alternates for you now.

reassuring

There's a 6:40 to Frankfurt with a connection that lands you at 09:15, with time to spare. Would you like to book it?

~125 ms TTFA · 3 emotion shifts in one turn

Years of AI dubbing for the world's top streaming platforms set our standard for what 'human' sounds like. Phantom X 3.2 brings that gold standard to the agentic world: tied for #1 on expressivity among the world's top real-time TTS models, at ~125 ms latency, imperceptible to the listener.

Moshe Michelashvili

VP Research · Deepdub

Top-tier

expressive voice

100+

languages and dialects

125 ms

real-time latency

Ready when you are

Ready to hear Phantom X 3.2?

Drop into the playground — type a line, pick a voice, hear it in any of 100+ languages.

Try it in Playground
