Joint speech and text machine translation for up to 100 languages
[SEAMLESS Communication Team]; Nature; 15 January 2025
『Abstract』
Creating the Babel Fish, a tool that helps individuals translate speech between any two languages, requires advanced technological innovation and linguistic expertise. Although conventional speech-to-speech translation systems composed of multiple subsystems performing translation in a cascaded fashion exist, scalable and high-performing unified systems remain underexplored. To address this gap, here we introduce SEAMLESSM4T (Massively Multilingual and Multimodal Machine Translation), a single model that supports speech-to-speech translation (from 101 to 36 languages), speech-to-text translation (from 101 to 96 languages), text-to-speech translation (from 96 to 36 languages), text-to-text translation (96 languages) and automatic speech recognition (96 languages). Built using a new multimodal corpus of automatically aligned speech translations and other publicly available data, SEAMLESSM4T is one of the first multilingual systems that can translate from and into English for both speech and text. Moreover, it outperforms the existing state-of-the-art cascaded systems, achieving up to 8% and 23% higher BLEU (Bilingual Evaluation Understudy) scores in speech-to-text and speech-to-speech tasks, respectively. Beyond quality, when tested for robustness, our system is, on average, approximately 50% more resilient against background noise and speaker variations in speech-to-text tasks than the previous state-of-the-art systems. We evaluated SEAMLESSM4T on added toxicity and gender bias to assess translation safety. For the former, we included two strategies for added toxicity mitigation working at either training or inference time. Finally, all contributions in this work are publicly available for non-commercial use to propel further research on inclusive speech translation technologies.
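The abstract reports its quality gains in BLEU. As a rough illustration of what that metric measures, here is a minimal sketch of unsmoothed sentence-level BLEU against a single reference. It is not the paper's evaluation setup: the function names `ngrams` and `sentence_bleu` and the whitespace tokenization are assumptions made for illustration, and published scores are normally computed at corpus level with standardized tokenization.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(hypothesis, reference, max_n=4):
    """Unsmoothed sentence-level BLEU with one reference, in [0, 1].

    Illustrative only: tokens are split on whitespace (an assumption),
    and any n-gram order with zero matches makes the score 0.
    """
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        # Clipped matches: a hypothesis n-gram is credited at most as many
        # times as it occurs in the reference.
        matches = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if matches == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(matches / total))

    # Brevity penalty: penalize hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))

    # Geometric mean of the n-gram precisions, scaled by the brevity penalty.
    return bp * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    reference = "the cat sat on a mat"
    print(sentence_bleu("the cat sat on a mat", reference))
    print(sentence_bleu("the cat sat on the mat", reference))
```

An exact match scores 1, and the score falls as fewer n-grams overlap with the reference. BLEU is often reported multiplied by 100.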
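『Summary』
The study introduces SEAMLESSM4T, a versatile machine translation model that supports translation across many languages and modalities, outperforms existing systems, is more robust, and has been released publicly for non-commercial use.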
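【Musings】
AI still has real difficulty with complex, systematic tasks, but it does have a clear advantage in natural language processing. A bold prediction: learning foreign languages may become much less important in the future.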