In a blind English-language test against the top real-time TTS models, Phantom X 3.2 landed in the top tier: a statistical tie for #1 on expressivity, at ~125 ms latency.

To see how we stack up against the industry's top real-time engines, we ran a rigorous blind listening study in English, comparing Phantom X 3.2 with Inworld, Hume, Async, and ElevenLabs. Linguistic experts conducted thousands of blind pairwise comparisons. The results show Phantom X 3.2 sets a new bar for expressivity and quality at exceptionally low latency.
Expressive performance was our primary metric, judged on sound quality, prosody, intonation, and absence of artifacts.
In real-time conversational AI, latency is the ultimate barrier to immersion. Our Time-To-First-Audio (TTFA) optimization means Phantom X 3.2 starts producing audio in ~125 ms, before a human listener can perceive a delay.
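For readers who want to check TTFA on their own setup, here is a minimal sketch of how such a measurement could be taken. It is not Deepdub's measurement harness: `time_to_first_audio` and `fake_stream` are hypothetical names, and the stand-in stream only simulates a delay. The idea is simply to time the gap between issuing a request and receiving the first non-empty audio chunk.

```python
import time
from typing import Callable, Iterable, Iterator


def time_to_first_audio(start_stream: Callable[[], Iterable[bytes]]) -> float:
    """Return seconds from request start until the first non-empty audio chunk."""
    t0 = time.perf_counter()
    for chunk in start_stream():
        if chunk:  # skip empty keep-alive chunks, if any
            return time.perf_counter() - t0
    raise RuntimeError("stream ended without producing audio")


def fake_stream() -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS client (assumption, not a real API)."""
    time.sleep(0.05)      # simulated synthesis and network delay
    yield b"\x00" * 320   # first audio chunk
    time.sleep(0.05)
    yield b"\x00" * 320   # later chunks do not affect TTFA


ttfa = time_to_first_audio(fake_stream)
print(f"TTFA: {ttfa * 1000:.0f} ms")
```

To measure a real engine, replace `fake_stream` with a function that opens the engine's streaming response and yields its audio chunks. Averaging over many requests gives a steadier figure than a single run.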
Our model finished in a statistical tie for first place and decisively outperformed the remaining competitors tested. The result: Deepdub Phantom X 3.2 sits at the very top of the industry on the only benchmark that matters, which is what real people actually prefer to hear.
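To illustrate what a "statistical tie" can mean for pairwise preference data, here is a minimal sketch. It assumes a Wilson score interval on the head-to-head win rate, with votes of "no preference" excluded. The source does not state which statistical method the study used, and the counts below are made up for illustration, not taken from the benchmark.

```python
import math
from typing import Tuple


def wilson_interval(wins: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a win rate (z = 1.96 gives roughly 95% coverage)."""
    if total <= 0 or not 0 <= wins <= total:
        raise ValueError("need 0 <= wins <= total and total > 0")
    p = wins / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


def is_statistical_tie(wins: int, total: int, z: float = 1.96) -> bool:
    """Treat two systems as tied when the interval on the win rate contains 0.5."""
    low, high = wilson_interval(wins, total, z)
    return low <= 0.5 <= high


# Hypothetical counts: system A preferred in `wins` of `total` head-to-head trials.
for wins, total in [(510, 1000), (620, 1000)]:
    low, high = wilson_interval(wins, total)
    verdict = "tie" if is_statistical_tie(wins, total) else "clear preference"
    print(f"{wins}/{total}: win rate interval [{low:.3f}, {high:.3f}] -> {verdict}")
```

Under these assumptions, a win rate close to 50% over a thousand trials cannot be separated from a coin flip, while a win rate above 60% can. That is the distinction between a tie for first place and a decisive win.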
Two systems work together: a wide emotion library that you control inline in the script, and a paralinguistic layer that adds the small unconscious signals listeners associate with real speech.

© Deepdub. Phantom X 3.2 benchmark · English. Methodology details available on request.