A new AI translation system for headphones clones multiple voices simultaneously

Spatial Speech Translation consists of two AI models, the first of which divides the space surrounding the person wearing the headphones into small regions and uses a neural network to search for potential speakers and pinpoint their direction. 

The second model then translates the speakers’ words from French, German, or Spanish into English text using publicly available data sets. The same model extracts the unique characteristics and emotional tone of each speaker’s voice, such as the pitch and the amplitude, and applies those properties to the text, essentially creating a “cloned” voice. This means that when the translated version of a speaker’s words is relayed to the headphone wearer a few seconds later, it sounds as if it’s coming from the speaker’s direction and the voice sounds a lot like the speaker’s own, not a robotic-sounding computer.
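To make "pitch" and "amplitude" concrete, here is a minimal sketch of how those two properties can be measured from raw audio samples. It is not the researchers' model, which is a neural network. The function names, the 60 to 400 Hz search range, and the synthetic test tone are all assumptions chosen for illustration.

```python
import math

def rms_amplitude(samples):
    """Root-mean-square amplitude: a simple measure of loudness."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))

def estimate_pitch_hz(samples, sample_rate, fmin=60.0, fmax=400.0):
    """Estimate pitch by finding the lag at which the signal best matches itself.

    fmin and fmax are assumed bounds for speaking pitch, not values from the article.
    """
    mean = sum(samples) / len(samples)
    centered = [x - mean for x in samples]
    lag_min = int(sample_rate / fmax)
    lag_max = int(sample_rate / fmin)
    best_lag, best_score = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        # Autocorrelation at this lag: the signal multiplied by a shifted copy of itself.
        score = sum(a * b for a, b in zip(centered, centered[lag:]))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag

# Synthetic stand-in for a voice: a 200 Hz tone at half scale, 0.1 seconds long.
SAMPLE_RATE = 16000
tone = [0.5 * math.sin(2 * math.pi * 200.0 * n / SAMPLE_RATE) for n in range(1600)]

pitch = estimate_pitch_hz(tone, SAMPLE_RATE)
loudness = rms_amplitude(tone)
```

Real speech is far messier than a pure tone, which is one reason systems like this rely on learned models rather than hand-written measurements.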

Given that separating out human voices is hard enough for AI systems, being able to incorporate that ability into a real-time translation system, map the distance between the wearer and the speaker, and achieve decent latency on a real device is impressive, says Samuele Cornell, a postdoc researcher at Carnegie Mellon University’s Language Technologies Institute, who did not work on the project.

“Real-time speech-to-speech translation is incredibly hard,” he says. “Their results are very good in the limited testing settings. But for a real product, one would need much more training data—possibly with noise and real-world recordings from the headset, rather than purely relying on synthetic data.”

The team behind the system, led by Shyam Gollakota, a professor at the University of Washington, is now focusing on reducing the time it takes for the AI translation to kick in after a speaker says something, which would allow more natural-sounding conversations between people speaking different languages. “We want to really get down that latency significantly to less than a second, so that you can still have the conversational vibe,” Gollakota says.

This remains a major challenge, because the speed at which an AI system can translate one language into another depends on the languages’ structure. Of the three languages Spatial Speech Translation was trained on, the system was quickest to translate French into English, followed by Spanish and then German—reflecting how German, unlike the other languages, places a sentence’s verbs and much of its meaning at the end and not at the beginning, says Claudio Fantinuoli, a researcher at the Johannes Gutenberg University of Mainz in Germany, who did not work on the project. 

Reducing the latency could make the translations less accurate, he warns: “The longer you wait [before translating], the more context you have, and the better the translation will be. It’s a balancing act.”
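The trade-off Fantinuoli describes can be illustrated with a toy "wait-k" schedule, a common idea in simultaneous translation research in which the translator stays k words behind the speaker. This is a sketch under stated assumptions, not the method used in Spatial Speech Translation. The function names and the hand-picked word positions (for "Ich habe den Hund gesehen" and "J' ai vu le chien", both meaning "I have seen the dog") are illustrative only.

```python
def words_heard(k, target_position, source_length):
    """Source words available when emitting the target word at target_position (1-indexed)."""
    return min(k + target_position - 1, source_length)

def smallest_safe_k(needed_source_position, target_position, source_length):
    """Smallest delay k at which the needed source word has been heard in time."""
    for k in range(1, source_length + 1):
        if words_heard(k, target_position, source_length) >= needed_source_position:
            return k
    return source_length

# English "seen" is the 3rd output word.
# German puts the matching verb "gesehen" 5th of 5 words (assumed alignment).
german_k = smallest_safe_k(needed_source_position=5, target_position=3, source_length=5)

# French puts the matching verb "vu" 3rd of 5 tokens (assumed alignment).
french_k = smallest_safe_k(needed_source_position=3, target_position=3, source_length=5)

print(german_k, french_k)  # prints "3 1"
```

In this toy case the German sentence forces the translator to wait for three words before it can begin, while the French one lets it start after one, which mirrors the ordering of latencies reported above.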
