Off-the-Shelf Language Data for MT Training
Choose from our Library of ready-made, domain-specific corpora created using TAUS Matching Data clustered search, web-scale crawling and crowdsourced data generation.
Convenient: Ready to download whenever you need it
Clean data: Data and corpora validated by TAUS members
Easy overview: Number of segments, word counts and samples
Human Generated Corpora
This corpus is the first of its kind, created from scratch through the Human Language Project, TAUS's custom-made platform for corpora creation in long-tail languages. Based on a carefully selected colloquial English source, the target data is translated and reviewed by native speakers to give you the highest-quality customized set for your MT training.
IT-related communication comes in many shapes and forms. One of the main forms is consumer-oriented instruction - the type of text that appears in FAQs, short setup instructions, troubleshooting manuals or responses and requests for assistance.
Crawled / News / Low-Resource Languages
International news and global public affairs shift focus frequently and have constantly evolving language and terminology. U.S. Department of State press releases closely mirror these developments and sometimes originate them. This corpus lets you incorporate these shifts in language and topic into news, diplomatic and other current affairs translations.
This corpus is based on many monolingual support chat sessions and includes only short sentences of no more than 10 words each. This makes it a targeted corpus of simple lines with a lot of technical terminology and a conversational tone.
The corpus was created in collaboration with Systran, based on marketing and e-learning query content. It contains carefully crafted translations for outward-facing marketing and fun learning.
This corpus combines just the right conversational, instructional, marketing and e-commerce elements from a wide variety of customer service tickets and chats to power seamless customer interactions in Hungarian. Helping users make a payment, change a password or sign in to their account, or simply asking them how they like a service or product and responding politely: these are all situations where this data set can add value.
Univerzita Karlova v Praze
Legal / Financial / VAT
These carefully selected parallel corpora are meant to make legal translations a breeze and give customer support systems handling tax-related or legal queries a nudge. The query corpus originated from UFAL, an institute that is part of Charles University in Prague. The provided corpora were bilingual, with added monolingual terms in the target languages, and specifically focused on value added tax.
Reliable product descriptions and information are a crucial asset in any e-commerce environment. In these corpora you'll find carefully filtered and cleaned data on a great variety of product types that will make it even easier for your global customers to click on the 'Add to shopping cart' button!
Universitat Autonoma de Barcelona
Medical / Pharmaceutical
High-fidelity MT training data is always important, even more so when it comes to medical subjects. This is a must-have corpus for anyone seeking pharma-related data.
Need help fine-tuning your customer support data for Dutch? Be it for your webshop, product documentation or website, information related to customer support is usually pretty standardized and therefore best handled with automation.
Is your chatbot not chatty enough? Does your MT engine look puzzled when it has to deal with informal business communication or user-generated content? This corpus will give the conversation with your local audience a friendly, casual tone.