Phantom X 3.2 · Top-ranked expressive TTS among real-time models

Phantom X 3.2 Benchmark — English

In a blind English-language test against the top real-time TTS models, Phantom X 3.2 landed in the top tier — tied for #1 on expressivity, at ~125 ms latency.

Methodology

Blind-tested against every real-time leader

To see how we stack up against the industry's top real-time engines, we ran a rigorous blind listening study in English, comparing Phantom X 3.2 with Inworld, Hume, Async, and ElevenLabs. Linguistic experts conducted thousands of blind pairwise comparisons. The results show Phantom X 3.2 sets a new bar for expressivity and quality at exceptionally low latency.

Result

Phantom X 3.2 ranked at the top

Expressive performance was our primary metric, judged on sound quality, prosody, intonation, and absence of artifacts.

ELO Scores · English
Inworld · TTS 1.5-Max · 1549 · #1
Deepdub · Phantom X 3.2 · 1545 · tied #1
Hume · Octave · 1498
Async · Flash v1.0 · 1493
ElevenLabs · Turbo v2.5 · 1416
Average ELO rating · higher is better
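
For readers who want to translate the ratings above into head-to-head expectations, here is a minimal sketch. It assumes the standard logistic Elo formula with a 400-point scale. This page does not state how the ratings were fitted, so the formula, the `expected_score` helper, and the `RATINGS` layout are illustrative assumptions, not the study's actual method.

```python
# Sketch only: assumes the standard logistic Elo model (400-point scale).
# The helper names and the RATINGS dictionary are hypothetical; only the
# rating values themselves come from the chart above.

def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected share of pairwise wins for A over B under logistic Elo."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


RATINGS = {
    "Inworld TTS 1.5-Max": 1549,
    "Deepdub Phantom X 3.2": 1545,
    "Hume Octave": 1498,
    "Async Flash v1.0": 1493,
    "ElevenLabs Turbo v2.5": 1416,
}


def implied_preferences(model: str, ratings: dict[str, float] = RATINGS) -> dict[str, float]:
    """Expected preference share of `model` against every other rated model."""
    return {
        other: expected_score(ratings[model], rating)
        for other, rating in ratings.items()
        if other != model
    }


if __name__ == "__main__":
    for opponent, share in implied_preferences("Deepdub Phantom X 3.2").items():
        print(f"Phantom X 3.2 vs {opponent}: {share:.1%}")
```

Under this assumption, the ratings imply that Phantom X 3.2 would be preferred roughly 49% of the time against Inworld, 57% against Hume and Async, and 68% against ElevenLabs, broadly in line with the head-to-head preference shares reported on this page.
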
Real-time latency

The fastest of the real-time TTS models tested

In real-time conversational AI, latency is the biggest barrier to immersion. With a Time-To-First-Audio (TTFA) of ~125 ms, Phantom X 3.2 starts speaking before a listener is likely to notice a delay.

Time-To-First-Audio · ms · lower is better
1 · Deepdub Phantom X 3.2 · 125 ms
2 · Async Flash v1.0 · 166 ms
3 · Hume Octave · 200 ms
4 · Inworld TTS 1.5-Max · 250 ms
5 · ElevenLabs Turbo v2.5 · 300 ms
Time-to-first-audio (ms) · measured under identical conditions
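
As a rough illustration of what a TTFA number captures, the sketch below times how long a streaming response takes to deliver its first non-empty audio chunk. It is not the harness used for the figures above: `time_to_first_audio` and `fake_stream` are hypothetical names, and the fake stream simply sleeps to stand in for a real TTS service. In a real measurement the timer would start when the request is sent over the network.

```python
import time
from typing import Callable, Iterable, Iterator, Optional


def time_to_first_audio(
    chunks: Iterable[bytes],
    clock: Callable[[], float] = time.perf_counter,
) -> Optional[float]:
    """Milliseconds from the start of iteration to the first non-empty chunk.

    Returns None if the stream ends without delivering any audio.
    """
    start = clock()
    for chunk in chunks:
        if chunk:  # skip empty keep-alive chunks
            return (clock() - start) * 1000.0
    return None


def fake_stream(delay_s: float, n_chunks: int = 3) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS response (assumption)."""
    time.sleep(delay_s)  # simulated synthesis and network delay
    for _ in range(n_chunks):
        yield b"\x00" * 320  # placeholder audio bytes


if __name__ == "__main__":
    ttfa_ms = time_to_first_audio(fake_stream(0.05))
    print(f"TTFA: {ttfa_ms:.0f} ms")
```

Comparing engines fairly with a helper like this would mean holding the input text, network path, and audio format constant, which is presumably what "identical conditions" refers to above.
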
Head-to-head matchups

Listeners prefer Phantom X 3.2 in three of four matchups

Our model finished in a statistical tie for first place and decisively outperformed every other competitor tested. The result: Deepdub Phantom X 3.2 sits at the very top of the industry on the only benchmark that matters — what real people actually prefer to hear.

Phantom X 3.2 65% · ElevenLabs Turbo v2.5 35%
Phantom X 3.2 57.9% · Async Flash v1.0 42.1%
Phantom X 3.2 57.8% · Hume Octave 42.2%
Phantom X 3.2 49.4% · Inworld TTS 1.5-Max 50.6%
Blind pairwise listener preference · English
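
To show what a "statistical tie" can mean for preference shares like these, here is a small sketch using a normal-approximation 95% confidence interval. The number of comparisons per matchup is not published on this page, so any value of `n` passed in is a hypothetical assumption, as are the helper names `preference_interval` and `is_statistical_tie`.

```python
import math


def preference_interval(share: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation confidence interval for an observed preference share.

    `n` is the number of pairwise comparisons. It is NOT given in the source,
    so callers must supply their own (hypothetical) value.
    """
    if n <= 0 or not 0.0 <= share <= 1.0:
        raise ValueError("share must be in [0, 1] and n must be positive")
    margin = z * math.sqrt(share * (1.0 - share) / n)
    return max(0.0, share - margin), min(1.0, share + margin)


def is_statistical_tie(share: float, n: int, z: float = 1.96) -> bool:
    """True when a 50/50 split lies inside the interval."""
    low, high = preference_interval(share, n, z)
    return low <= 0.5 <= high


if __name__ == "__main__":
    hypothetical_n = 1000  # assumption for illustration only
    for share in (0.65, 0.579, 0.578, 0.494):
        low, high = preference_interval(share, hypothetical_n)
        verdict = "tie" if is_statistical_tie(share, hypothetical_n) else "clear preference"
        print(f"{share:.1%}: [{low:.1%}, {high:.1%}] -> {verdict}")
```

With a hypothetical 1,000 comparisons per matchup, a 49.4% share cannot be distinguished from an even split, while shares of 57.8% and above can. The conclusion depends on the real sample size, so it should be rechecked against the actual study data.
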
Expressivity

What makes Phantom X 3.2 actually emote

Two systems work together: a wide emotion library that you control inline in the script, and a paralinguistic layer that adds the small unconscious signals listeners associate with real speech.

Emotional Layering
80+ emotion styles, from supportive to malicious. Inline tags let a single line shift tone mid-sentence naturally.
Paralinguistic Cues
Natural breath, pauses, and micro-shifts in tone. The small, unconscious signals that make speech sound spoken, not generated.
User
Hey — my flight just got cancelled and I have a meeting in Berlin tomorrow morning. I'm freaking out.
Agent (Phantom X 3.2)
supportive · Hi, I understand. We'll figure it out together.
focused · I'm pulling alternates for you now.
reassuring · There's a 6:40 to Frankfurt with a connection that lands you at 09:15, with time to spare. Would you like to book it?
~125 ms TTFA · 3 emotion shifts in one turn
"Years of AI dubbing for the world's top streaming platforms set our standard for what 'human' sounds like. Phantom X 3.2 brings that gold standard to the agentic world: tied for #1 on expressivity among the world's top real-time TTS models, at ~125 ms latency, imperceptible to the listener."
Moshe Michelashvili
VP Research · Deepdub
Top-tier expressive voice
100+ languages and dialects
125 ms real-time latency
Ready when you are

Ready to hear Phantom X 3.2?

Drop into the playground — type a line, pick a voice, hear it in any of 100+ languages.

© Deepdub. Phantom X 3.2 benchmark · English. Methodology details available on request.