Data Library

Off-the-Shelf Language Data for MT Training

Choose from our Library of ready-made, domain-specific corpora created using TAUS Matching Data clustered search, web-scale crawling and crowdsourced data generation.

 
Convenient
Ready to download whenever you need it
 
Clean data
Data and corpora validated by TAUS members
 
Easy overview
Number of segments, word counts and samples
TAUS
Colloquial Text
Human Generated Corpora

HLP Colloquial Corpus #1


With the Human Language Project, TAUS offers any of the domains in the Matching Data catalogue, or any new domain for that matter, in a language of your choice. A carefully selected part of the colloquial Matching Data corpus has been translated and reviewed by native speakers in many long-tail languages, to get the highest-quality customized set for your MT training.
English - Hindi, English - Urdu, English - Tamil, English - Nepali, English - Turkish, English - Pashto, English - Sorani, English - Bengali, English - Burmese, English - Assamese, English - Telugu, English - Sinhalese, English - Dari, English - Punjabi (Pakistan), English - Punjabi (India), English - Lao, English - Kurmanji (lat), English - Kurmanji (arab), English - Hausa
TAUS
Colloquial Text

Colloquial


Is your chatbot not chatty enough? Or does your MT engine look puzzled when it has to deal with informal business communication or user-generated content? This corpus will give the conversation with your local audience a friendly, casual tone.
English - Spanish, English - Portuguese (Brazil), English - Chinese (PRC), English - Korean, English - Japanese, English - French, English - Dutch, English - German, English - Russian, English - Arabic, English - Italian, English - Romanian, English - Chinese (Traditional), English (United States) - Danish, English (United States) - Finnish, English (United States) - Czech, English (United States) - Polish, English (United States) - Swedish
Universitat Autonoma de Barcelona
Medical / Pharmaceutical

Medical/Pharmaceutical


High-fidelity MT training data is always important, even more so when it comes to medical subjects. This is a must-have corpus for anyone seeking pharma-related data.
English - Spanish, English (United States) - German (Germany), English (United States) - French (France), English - Italian, English - Polish, English - Russian, English - Swedish, English - Portuguese (Brazil), English - Chinese (PRC), English - Danish, English - Dutch, English - Slovene, English - Slovak, English - Norwegian, English - Croatian, English - Finnish, English - Bulgarian, English - Latvian, English - Hungarian, English - Estonian, English - Greek, English - Czech, English - Lithuanian
eBay
E-commerce

E-commerce


Reliable product descriptions and information are a crucial asset in any e-commerce environment. In these corpora you'll find carefully filtered and cleaned data on a great variety of product types that will make it even easier for your global customers to click on the 'Add to shopping cart' button!
French - Dutch, German - Polish, German - Italian, English - Japanese, German (Germany) - Spanish (Spain), German (Germany) - French (France), English - German, English - Spanish, English - French, English - Italian, English - Danish, English - Hungarian, English - Swedish, English - Finnish, English - Polish, English - Dutch, English - Portuguese (Brazil), English - Russian, English - Portuguese, Chinese (PRC) - Spanish, Chinese (PRC) - French, Chinese (PRC) - Russian
SYSTRAN
Customer support, IT

FAQ and customer support


This corpus targets customer support that is defined by short, clear questions and to-the-point answers. Suitable for the FAQ section of your software or IT-related documentation.
English (United States) - Swedish, English (United States) - Korean, English (United States) - Japanese
SYSTRAN
Energy, public utilities

Energy and public utilities


This corpus is based on a collection of documents from a large utility corporation, including a large terminology base. The resulting segments are rooted in science, education, human resources, recruitment, marketing and news articles, all centered around electricity and energy.
English - Japanese, English - French
SYSTRAN
Business, Legal

Legal, contracts and obligations


Contracts and agreements: this corpus specializes exclusively in that aspect of business. It contains the correct formulations for what is carefully set out in subsections and articles.
English (United States) - German (Germany), English (United States) - French (France), English (United States) - Spanish (Spain), English (United States) - Chinese (PRC), English (United States) - Chinese (Taiwan)
TAUS
Colloquial Text
Human Generated Corpora

HLP Colloquial Corpus #2


With the Human Language Project, TAUS offers any of the domains in the Matching Data catalogue, or any new domain for that matter, in a language of your choice. A carefully selected part of the colloquial Matching Data corpus has been translated and reviewed by native speakers in many long-tail languages, to get the highest-quality customized set for your MT training.
English - Hindi, English - Urdu
Univerzita Karlova v Praze
Legal / Financial / VAT

Legal/Financial with a focus on Value Added Tax


These carefully selected parallel corpora are meant to make legal translations a breeze and give customer support systems handling tax-related or legal queries a nudge. The query corpus originated from UFAL, an institute that is part of Charles University in Prague. The provided corpus was bilingual, with added monolingual terms in the target languages, and focused specifically on value added tax.
English - Czech, English - German, English - Polish, English - Romanian, English - Spanish, English - Hungarian, German - Czech, English - Dutch, English - French, English - Chinese (PRC)
SYSTRAN
Medical, Pharmaceutical, Health, Biochemistry

Pharmaceutics and Biochemistry, Top Selection


This corpus is a top selection of pharmaceutical descriptions and biochemistry reporting.
English - Japanese
SYSTRAN
Marketing

Business to Consumer Retail Marketing


This corpus is about retail marketing, focusing on business-to-consumer communication, product information and lifestyle.
English - German
TAUS
Financial, Business

Financial & business reporting


This corpus is taken from a wide range of annual reports, financial statements, accounting reports, and business strategy reports. It also contains content from news stories, human interest and journalism.
Chinese (Taiwan, traditional) - English
TAUS
Business, Legal

Business and legal communication


This corpus is about business, contracts and agreements. It contains translations for any content that appears in professional business communications.
English (United States) - French (France)
RWS Moravia
Customer Support/Help

Customer Support


Need help fine-tuning your MT engine for customer support content in Dutch? Be it for your webshop, product documentation or website, information related to customer support is usually pretty standardized and therefore best handled with automation.
English - Dutch, English (United States) - Russian
TAUS
IT services

IT Instructions


IT-related communication comes in many shapes and forms. One of the main forms is consumer-oriented instruction - the type of text that appears in FAQs, short setup instructions, troubleshooting manuals or responses and requests for assistance.
English - Spanish, English - Portuguese (PT), English - Bulgarian, English - Dutch, English - Czech, English - German, English (United States) - Russian
Unbabel
Customer Support

Customer Support Tickets & Chat


This corpus combines just the right conversational, instructional, marketing and e-commerce elements from a wide variety of customer service tickets and chats to power seamless customer interactions in Hungarian. Helping users make a payment, change a password or sign in to their account, or simply asking them how they like a service or product and answering politely: these are all situations where this data set can add value.
English - Hungarian, English (United States) - Russian
RWS Moravia
Business, Pharmaceuticals

Professional services for pharmaceutical industry


Business, human resource management and administration for the pharmaceutical industry are the main focus of this corpus. It contains translations for any content that is related to doing business, with a connection to life sciences and pharmacy.
English (United States) - German (Germany)
RWS Moravia
Automotive

Automotive


The corpus was created in collaboration with RWS Moravia based on automotive query content. It contains translations for the wider context of the automotive industry and includes related topics as well.
English (United States) - German (Germany)
Dell
Customer Support

Customer Support (short sentences)


This corpus is based on many monolingual support chat sessions and includes only short sentences of no more than 10 words each. This makes it a targeted corpus of simple lines with a lot of technical terminology and a conversational tone.
English - German, English - Portuguese (Brazil), English - French
TAUS
Crawled / News / Low-Resource Languages

U.S. State Department Crawled Corpus


International news and global public affairs shift focus frequently and have constantly evolving language and terminology. U.S. Department of State press releases closely mirror these developments and sometimes originate them. This corpus allows you to incorporate these shifts in language and topic into news, diplomatic and other current affairs translations.
English - Hindi, English - Urdu
SYSTRAN
Marketing

Retail Marketing & Training


The corpus was created in collaboration with SYSTRAN based on marketing and e-learning query content. It contains carefully crafted translations for outward-facing marketing and fun learning.
English - Spanish

Couldn't find what you were looking for?

Do you have a query corpus to submit?
Request Matching Data
Contact us to get more information
Contact us