HLP Colloquial Corpus #2

Initiator: TAUS
Domain: Colloquial Text
Language(s): English - Hindi

With the Human Language Project, TAUS offers any of the domains in the Matching Data catalogue, or any new domain for that matter, in a language of your choice. A carefully selected part of the colloquial Matching Data corpus has been translated and reviewed by native speakers in many long-tail languages, giving you a high-quality, customized set for your MT training.

The corpus is a great fit for training chat bots or MT engines for social media content, and will give conversations with your local audience a friendly, casual tone. From product user reviews and blog post comments to everyday business small talk, your MT engine will be able to handle even the most creative user voices.

This corpus contains over 1 million words and a total vocabulary of more than 37,000 distinct words. Need more data? In the coming months, TAUS will release more corpora of equal size for the same domain and language combinations, with a significantly larger vocabulary.

English - Hindi

Corpus             Segments    Source tokens    Target tokens
TAUS HLP Corpus      72,302        1,011,778        1,164,506
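
For a quick sense of scale, the figures in the table can be turned into average segment lengths and a target-to-source token ratio. The following is only a minimal sketch: the function name `corpus_stats` and the dictionary keys are illustrative assumptions, not part of any TAUS tooling, and the three input numbers are taken directly from the table.

```python
# Hypothetical helper (not a TAUS API): derives simple statistics
# from the segment and token counts listed in the table.
def corpus_stats(segments: int, source_tokens: int, target_tokens: int) -> dict:
    return {
        # Average number of English tokens per segment
        "source_tokens_per_segment": source_tokens / segments,
        # Average number of Hindi tokens per segment
        "target_tokens_per_segment": target_tokens / segments,
        # How many target tokens there are for each source token
        "target_to_source_ratio": target_tokens / source_tokens,
    }


# Figures from the English - Hindi table
stats = corpus_stats(
    segments=72_302,
    source_tokens=1_011_778,
    target_tokens=1_164_506,
)

for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

With these figures, segments average roughly 14 English tokens each, and the Hindi side contains about 15% more tokens than the English side.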