HLP Colloquial Corpus #1

Initiator: TAUS
Domain: Colloquial Text
Language(s): English - Hindi, English - Urdu, English - Tamil, English - Nepali, English - Turkish, English - Pashto, English - Sorani, English - Bengali, English - Burmese, English - Assamese, English - Telugu, English - Sinhalese, English - Dari, English - Punjabi (Pakistan), English - Punjabi (India), English - Lao

Through the Human Language Project (HLP), TAUS offers any domain in the Matching Data catalogue, or a new domain, in a language of your choice. For this corpus, a carefully selected part of the colloquial Matching Data corpus was translated and reviewed by native speakers of many long-tail languages, giving you a high-quality, customized set for MT training.

The corpus is well suited to training MT engines for chatbots and social media content, and will give conversations with your local audience a friendly, casual tone. From product reviews and blog comments to everyday business small talk, your MT engine will be able to handle even the most creative user voices.

This corpus contains over 1 million words and a total vocabulary of more than 37,000 distinct words. Need more data? In the coming months, TAUS will release more equally sized corpora for the same domain and language combinations, with a significant increase in vocabulary.

TAUS HLP Corpus: size and token counts per language pair

Language pair                 Segments   Source tokens   Target tokens
English - Hindi                100,000       1,057,654       1,199,515
English - Urdu                 100,000       1,057,654       1,249,464
English - Tamil                100,000       1,057,654         775,480
English - Nepali               100,000       1,057,654       1,008,586
English - Turkish              100,000       1,022,886         758,634
English - Pashto               100,000       1,022,888         932,515
English - Sorani               100,000       1,022,888         932,515
English - Bengali              100,000       1,022,888         951,283
English - Burmese              100,000       1,022,888         480,416
English - Assamese             100,000       1,022,888         973,656
English - Telugu               100,000       1,022,888         951,995
English - Sinhalese            100,000       1,022,888         946,670
English - Dari                 100,000       1,022,888       1,141,651
English - Punjabi (Pakistan)   100,000       1,022,888       1,208,265
English - Punjabi (India)      100,000       1,022,888       1,205,446
English - Lao                  100,000       1,022,888         244,543