HLP Colloquial Corpus #1

  Initiator: TAUS
  Domain: Colloquial Text
  Language(s): English - Hindi, English - Urdu, English - Tamil, English - Nepali, English - Turkish, English - Sorani, English - Bengali

This corpus is the first of its kind, created from scratch through the Human Language Project, TAUS's custom-made platform for creating corpora in long-tail languages. Starting from a carefully selected colloquial English source, the target data is translated and reviewed by native speakers to give you a high-quality, customized set for your MT training.

The corpus is a great fit for training chatbots and MT engines for social media content, and will give conversations with your local audience a friendly, casual tone. From product user reviews and blog post comments to everyday business small talk, your MT engine will be able to handle even the most creative user voices.

This corpus contains over 1 million words and a total vocabulary of more than 37,000 different words. Need more data? In the coming months, TAUS will release more equally sized corpora for the same domain and language combinations, with a significant increase in vocabulary.

Corpus size (TAUS HLP Corpus), in segments and tokens:

Language pair        Segments   Source tokens   Target tokens
English - Hindi      100,000    1,057,654       1,199,515
English - Urdu       100,000    1,057,654       1,249,464
English - Tamil      100,000    1,057,654         775,480
English - Nepali     100,000    1,057,654       1,008,586
English - Turkish    100,000    1,022,886         758,634
English - Sorani     100,000    1,022,888         932,515
English - Bengali    100,000    1,022,888         951,283
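
When sizing an MT training run, it can help to turn the token counts listed above into per-segment averages and target-to-source ratios. The following is a minimal sketch in Python. The figures are copied from the listing as published, and the names `TOKENS`, `SEGMENTS`, and `summarize` are illustrative only and not part of any TAUS tooling.

```python
# Token counts copied from the corpus size listing: (source tokens, target tokens).
TOKENS = {
    "Hindi": (1_057_654, 1_199_515),
    "Urdu": (1_057_654, 1_249_464),
    "Tamil": (1_057_654, 775_480),
    "Nepali": (1_057_654, 1_008_586),
    "Turkish": (1_022_886, 758_634),
    "Sorani": (1_022_888, 932_515),
    "Bengali": (1_022_888, 951_283),
}

# Every language pair is listed with the same number of segments.
SEGMENTS = 100_000


def summarize(tokens, segments):
    """Return average source tokens per segment and target/source token ratio."""
    rows = {}
    for lang, (src, tgt) in tokens.items():
        rows[lang] = {
            "src_per_segment": src / segments,
            "tgt_to_src_ratio": tgt / src,
        }
    return rows


summary = summarize(TOKENS, SEGMENTS)
for lang, row in summary.items():
    print(
        f"English - {lang}: "
        f"{row['src_per_segment']:.1f} source tokens/segment, "
        f"target/source ratio {row['tgt_to_src_ratio']:.2f}"
    )
```

A ratio above 1 means the target side uses more tokens than the English source. A ratio below 1 means it uses fewer, which may simply reflect how words are tokenized in that language.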
