This corpus is the first of its kind, created from scratch through the Human Language Project, TAUS's custom-made platform for creating corpora in long-tail languages. Based on a carefully selected colloquial English source, the target data is translated and reviewed by native speakers to give you the highest-quality customized set for your MT training.
The corpus is a great fit for training MT engines that handle chatbot or social media content, and will give conversations with your local audience a friendly, casual tone. From user product reviews and blog post comments to everyday business small talk, your MT engine will be able to handle even the most creative user voices.
This corpus contains over 1 million words, with a total vocabulary of more than 37,000 distinct words. Need more data? In the coming months, TAUS will release more equally sized corpora for the same domain and language combinations, with a significant increase in vocabulary.