HLP Colloquial Corpus #1
With the Human Language Project, TAUS offers any of the domains in the Matching Data catalogue, or any new domain for that matter, in a language of your choice. A carefully selected part of the colloquial Matching Data corpus has been translated and reviewed by native speakers in many long-tail languages, giving you the highest-quality customized set for your MT training.
The corpus is a great fit for training chatbots or engines that handle social media content, and will give conversations with your local audience a friendly, casual tone. From product user reviews and blog post comments to everyday business small talk, your MT engine will be able to handle even the most creative user voices.
This corpus contains over 1 million words and a total vocabulary of more than 37,000 distinct words. Need more data? In the coming months, TAUS will release more equally sized corpora for the same domain and language combinations, with a significant increase in vocabulary.