Powering Automated Translation in Time of Corona Crisis

Machine translation is an important technology in a crisis. When integrated into rapid-response communication plans, it increases both the speed at which information is passed on and the number of languages covered. The one precondition is that enough data is available on the topic at hand.

TAUS Corona Crisis Corpus

This corpus was generated by applying Matching Data selection to TAUS DataCloud and ParaCrawl data. The query corpus was crawled from the web and consists of recent coronavirus-related articles and news. The selected data relates to virology, epidemics, medicine and healthcare.

Each file contains two tab-separated columns: the first column is the source text and the second is the target text. Anyone who is training their own MT engines can download these corpora and use them to improve their translation services and systems.
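As a minimal sketch of how the two-column layout described above could be read, the Python snippet below parses tab-separated lines into (source, target) pairs. The function name read_parallel_tsv and the sample sentences are illustrative assumptions, not part of the TAUS distribution, and the sketch assumes the files are UTF-8 text with exactly one tab per line.

```python
import io
from typing import Iterable, Iterator, Tuple


def read_parallel_tsv(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) pairs from tab-separated lines.

    Lines that do not have exactly two columns, or that have an
    empty source or target, are skipped rather than raising an error.
    """
    for line in lines:
        parts = line.rstrip("\r\n").split("\t")
        if len(parts) != 2:
            continue  # not a two-column row
        source, target = parts[0].strip(), parts[1].strip()
        if source and target:
            yield source, target


# Hypothetical in-memory sample standing in for a downloaded corpus file.
sample = io.StringIO(
    "Wash your hands.\tLavez-vous les mains.\n"
    "broken line without a tab\n"
    "\tempty source column\n"
)
pairs = list(read_parallel_tsv(sample))
print(pairs)
```

For a file on disk, the same function can be given an open file handle, for example open(path, encoding="utf-8"), where path is wherever the corpus was saved. Plain string splitting is used here instead of a CSV parser so that quotation marks inside segments are left untouched.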

ModelFront helped filter the corpora further. The following files have been updated to remove misaligned or bad translations.

Language Pair      Segment Count
English-French     885,606
English-German     613,318
English-Italian    381,710
English-Spanish    879,926

The Chinese corpus is a selection from TAUS DataCloud and the UN Parallel Corpus.

English-Chinese    450,507

The Russian corpus is a selection from TAUS DataCloud and data provided by Neotech and The Russian Archives of Internal Medicine.

English-Russian    192,614

Call for Language Industry Collaboration

In support of people and communities in need, TAUS is launching the Corona Crisis Corpus project.

We will collect language data specific to virus outbreaks, health conditions and cures, symptoms and medicines, hospitals and treatments, and everything that citizens and patients around the world want and need to know about the coronavirus and our joint effort to defeat it. We will clean, cluster and organize the data and make them available on the TAUS website. We will kick this off with a ‘Corona Starter Kit Corpus’ containing all relevant matches from the existing TAUS Data Cloud in a number of languages.

How Can You Help?

We invite everyone to contribute their own translation memories covering this domain, so that together we can expand both the volume of good data and the language spread.

Get in touch with our Data Team and contribute your data.

The Rules

  • Data are shared for free. This is a charity project with no financial gain for those involved: TAUS is contributing the labor and infrastructure at no cost, and we ask all language data contributors to share their data for free.
  • Anyone who has domain-specific data may contribute data to create the specific Corona Crisis Corpora.
  • Data can be provided in monolingual or bilingual format.
  • TAUS offers the available Corona Crisis Corpora at no cost.
  • The project runs until April 1, 2021.