In a blind English-language test against the top real-time TTS models, Phantom X 3.2 landed in the top tier.

To see how we stack up against the industry's top real-time engines, we ran a rigorous blind listening study in English, comparing Phantom X 3.2 with Inworld, Hume, Async, and ElevenLabs. Linguistic experts conducted thousands of blind pairwise comparisons. The results show Phantom X 3.2 sets a new bar for expressivity and quality at exceptionally low latency.
We benchmarked English specifically, the most saturated language in TTS, where every major model has been heavily optimized and quality gaps between leaders are vanishingly small. Expressive performance was our primary metric, judged on sound quality, prosody, intonation, and absence of artifacts.
Phantom X 3.2 achieved an Elo rating of 1545, a statistical dead heat for #1.
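For readers unfamiliar with how blind pairwise votes become a single rating, here is a minimal sketch of the standard Elo update. It is an illustration only: the study's actual scoring procedure is not described here, and the K-factor of 32, the starting rating of 1500, and the model names are assumptions chosen for the example.

```python
from collections import defaultdict


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_from_votes(votes, k: float = 32.0, start: float = 1500.0) -> dict:
    """Turn a sequence of (winner, loser) votes into Elo ratings.

    k and start are conventional defaults, not values from the study.
    """
    ratings = defaultdict(lambda: start)
    for winner, loser in votes:
        # How likely the winner was to win, given the current ratings.
        p_win = expected_score(ratings[winner], ratings[loser])
        # An unexpected win moves more points than an expected one.
        delta = k * (1.0 - p_win)
        ratings[winner] += delta
        ratings[loser] -= delta
    return dict(ratings)


# Hypothetical votes from a blind listening test (made-up data).
example_votes = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
]
example_ratings = elo_from_votes(example_votes)
```

With these defaults, a single vote between two equally rated models moves each by k/2 = 16 points. Because sequential Elo depends on the order of votes, larger studies often average over shuffled orderings or fit all votes jointly; that is a design choice outside this sketch.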

Phantom X 3.2 returns first audio in ~125 ms, versus ~250 ms for Inworld TTS 1.5-max.
In real-time conversational AI, latency is the ultimate barrier to immersion. Our Time-To-First-Audio (TTFA) optimization ensures that Phantom X 3.2 responds before the human listener can perceive a delay.
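TTFA is straightforward to measure on any streaming TTS client: start a timer when the request is issued and stop it when the first non-empty audio chunk arrives. The sketch below assumes the client exposes the response as an iterable of byte chunks. `fake_stream` is a hypothetical stand-in that simulates a delay, not a Deepdub API.

```python
import time
from typing import Callable, Iterable, Iterator


def time_to_first_audio(start_stream: Callable[[], Iterable[bytes]]) -> float:
    """Return milliseconds from issuing the request to the first audio chunk."""
    t0 = time.perf_counter()
    for chunk in start_stream():
        if chunk:  # ignore empty keep-alive chunks
            return (time.perf_counter() - t0) * 1000.0
    raise RuntimeError("stream ended without producing audio")


def fake_stream(delay_s: float = 0.05) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS response."""
    time.sleep(delay_s)   # simulated network and synthesis delay
    yield b"\x00" * 320   # first audio chunk
    yield b"\x00" * 320   # remaining audio


ttfa_ms = time_to_first_audio(lambda: fake_stream(0.05))
```

In a real measurement, `start_stream` would wrap the actual request, and the figure would be reported over many runs (for example the median and 95th percentile), since a single sample is dominated by network jitter.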
Our model finished in a statistical tie for first place and decisively outperformed every other competitor tested.
Key head-to-head matchups:
The result: Deepdub Phantom X 3.2 sits at the very top of the industry on the only benchmark that matters — what real people actually prefer to hear.
Phantom X 3.2 carries the emotional range of premium drama into every real-time line.
Emotional Layering
80+ emotion styles, from supportive to malicious. Inline tags let a single line shift tone mid-sentence, without sounding stitched together.
Paralinguistic Cues
Natural breath, pauses, and micro-shifts in tone. The small, unconscious signals that make speech sound spoken, not generated.

“Years of AI dubbing for the world's top streaming platforms set our standard for what 'human' sounds like. Phantom X 3.2 brings that gold standard to the agentic world: tied for #1 on expressivity among the world's top real-time TTS models, at ~125 ms latency, imperceptible to the listener.”
Moshe Michelashvili, VP Research @Deepdub
Deepdub is the gold standard in AI dubbing for premium media. For years, our eTTS has powered Hollywood-grade localization for the world's top streaming platforms, across thousands of drama series, feature films, and documentaries.
Phantom X 3.2 brings that same gold standard to the agentic world, giving developers a foundation to deploy expressive, localized speech at scale. All in 100+ languages, at real-time latency.
Check our voices in 100+ languages and dialects for free.
Try the Playground →
