U.S. State Department Crawled Corpus

Initiator: TAUS
Domain: Crawled / News / Low-Resource Languages
Language(s): English - Hindi, English - Urdu

International news and global public affairs shift focus frequently, and their language and terminology evolve constantly. U.S. Department of State press releases closely mirror these developments and sometimes originate them. This corpus makes it possible to incorporate these shifts in language and topic into translations of news, diplomatic and other current affairs content.

The press releases consist of news articles, diplomatic statements and transcribed press conferences, and cover the period from April 21, 2017 to June 21, 2019. The translations were automatically segmented and aligned, deduplicated, shuffled and cleaned using common-sense cleaning criteria. Individual translations are in the public domain. We thank the U.S. Department of State and the Office of Language Services for making these translations available.
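As a rough illustration of the deduplication and shuffling steps mentioned above, the following Python sketch processes a list of aligned segment pairs. It is not the pipeline actually used for this corpus: the function name `dedupe_and_shuffle`, the rule of dropping pairs with an empty side, and the fixed random seed are all assumptions made for the example.

```python
import random


def dedupe_and_shuffle(pairs, seed=0):
    """Drop exact duplicate (source, target) pairs, then shuffle them.

    Hypothetical helper for illustration only; the corpus's real
    cleaning criteria are not documented on this page.
    """
    seen = set()
    unique = []
    for src, tgt in pairs:
        key = (src.strip(), tgt.strip())
        # Assumed cleaning rule: skip pairs where either side is empty.
        if not key[0] or not key[1]:
            continue
        # Keep only the first occurrence of each pair.
        if key in seen:
            continue
        seen.add(key)
        unique.append(key)
    # A seeded generator keeps the shuffle reproducible.
    rng = random.Random(seed)
    rng.shuffle(unique)
    return unique
```

Shuffling at the segment level means the original document order of the press releases cannot be recovered from the output, so each segment pair should be treated as standing on its own.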

English - Hindi
Corpus Size    Segments    Source Tokens    Target Tokens
Crawled        24,130      481,296          581,805

English - Urdu
Corpus Size    Segments    Source Tokens    Target Tokens
Crawled        23,487      513,493          693,867
Pricing (Crawled)
Language Pair      Member Price (Euro / Partner Credits)    Member Price (Data Cloud Credits)    Non-Member Price (Euro)
English - Hindi    € 3,850                                  8 million                            € 4,620
English - Urdu     € 3,900                                  9 million                            € 4,680
