Powering Automated Translation in Time of Corona Crisis

Machine translation is an important technology in a crisis. When integrated into rapid-response communication plans, it increases both the speed at which information is passed on and the number of languages covered. The one precondition is that enough data is available on the topic at hand.

Data Contributors

TAUS Corona Crisis Corpora

These corpora are the result of a collective industry charity effort in which participants contributed their own translation memories covering this domain, so that together we were able to expand both the volume of good data and the language spread. TAUS also generated corpora by applying Matching Data selection to DataCloud and ParaCrawl data. The query corpus was crawled from the web, drawing on the latest coronavirus-related articles and news. The selected data relates to virology, epidemics, medicine, and healthcare.

Each file contains two tab-separated columns: the first column is the source text and the second is the target text. Anyone training their own MT engines can download these corpora and use them to improve their translation services and systems. ModelFront helped filter the corpora further, removing misaligned or bad translations.
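As an illustration of the two-column layout described above, here is a minimal Python sketch of how such a file might be read into source/target pairs. The function name read_parallel_tsv and the sample sentences are hypothetical, and the sketch assumes UTF-8 text with no header row, which the description above does not state.

```python
import csv
import io
from typing import Iterator, TextIO, Tuple


def read_parallel_tsv(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) pairs from a two-column, tab-separated stream."""
    # QUOTE_NONE keeps quotation marks inside segments as literal text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip blank or malformed lines rather than guessing at their content.
        if len(row) != 2:
            continue
        source, target = row[0].strip(), row[1].strip()
        if source and target:
            yield source, target


# Illustrative in-memory sample (assumption: not taken from the corpora).
# A real file would be opened with open(path, encoding="utf-8", newline="").
sample = io.StringIO(
    "Wash your hands often.\tLavez-vous souvent les mains.\n"
    "\n"
    "Stay at home.\tRestez chez vous.\n"
)
pairs = list(read_parallel_tsv(sample))
```

Dropping rows that do not have exactly two columns is a conservative choice for this sketch; a training pipeline might prefer to log them instead.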

Language Pair       Segment Count
English-French      885,606
English-German      613,318
English-Italian     381,710
English-Spanish     879,926

The Chinese corpus is a selection from TAUS DataCloud and UN Parallel Corpus.

English-Chinese     450,507

The Russian corpus is a selection from TAUS DataCloud and data provided by Neotech and The Russian Archives of Internal Medicine.

English-Russian     192,614

Corona Crisis Translation models by SYSTRAN

SYSTRAN has contributed to this initiative by producing Corona Crisis Translation Models in 12 languages, based on quality parallel data provided by TAUS. The models are publicly available at no cost. Together, we ensure that people and communities in need have access to accurate coronavirus-related information in their local language.

Try the models for free on SYSTRAN Translate

The corpus is licensed under Creative Commons Attribution-NonCommercial 4.0. You are free to Share (copy and redistribute the material in any medium or format) and Adapt (remix, transform, and build upon the material). The licensor cannot revoke these freedoms as long as you follow the license terms.

If you would like to contribute data, please contact the TAUS Data team at data@taus.net