Profile of UMC005

Created on 2014.08.16
Created by Administratorus
License: Proprietary

UMC005 English-Urdu is a sentence-aligned parallel corpus of English and Urdu texts. The corpus can be used for experiments with statistical machine translation.

The texts come from four different sources:

- Quran
- Bible
- Penn Treebank (Wall Street Journal)
- Emille corpus

We provide the religious texts of the Quran and the Bible for direct download. For licensing reasons, the Penn and Emille texts cannot be redistributed freely. However, if you already hold a license for the original corpora, we can provide scripts that will recreate our data on your disk.

Our modifications include but are not limited to the following:

- Correction of Urdu translations and manual sentence alignment of the Emille texts.
- Manually corrected sentence alignment of the other corpora.
- Our data split (training-development-test), so that our published experiments can be reproduced.
- Tokenization (optional, but needed to reproduce our experiments).
- Normalization (optional) of e.g. European vs. Urdu numerals, European vs. Urdu punctuation, and removal of Urdu diacritics.

UMC005 is available free of charge for research, educational and non-profit use. Contact us if you are interested in obtaining a different type of license.

Note: These terms apply to our modifications and additions to the data. You need to obtain a separate license for the English texts of the Penn Treebank and for the Emille corpus. The Urdu translation of the Penn Treebank texts has been provided by CRULP and is distributed under the GNU GPL license. The Quran and the Bible are religious texts whose copyright expired long ago; our version has been collected from the web.

Bushra Jawaid, Daniel Zeman: Word-Order Issues in English-to-Urdu Statistical Machine Translation. Submitted for publication in: The Prague Bulletin of Mathematical Linguistics, No. 95, Copyright © Univerzita Karlova, Praha, Czechia, ISSN 0032-6585, May 2011
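The optional normalization mentioned among the modifications (numerals, punctuation, diacritics) can be illustrated with a minimal Python sketch. This is not the corpus authors' script: the function name `normalize_urdu`, the direction of the mapping (Urdu forms to European forms), and the exact character sets chosen are all assumptions made for illustration.

```python
import re

# Assumption: Urdu-side digits are mapped to ASCII digits. Both the Extended
# Arabic-Indic block (U+06F0..U+06F9) and the Arabic-Indic block
# (U+0660..U+0669) are covered, since both occur in web-collected text.
_DIGITS = {0x06F0 + i: str(i) for i in range(10)}
_DIGITS.update({0x0660 + i: str(i) for i in range(10)})

# Assumption: a small, illustrative set of Urdu punctuation marks mapped to
# their European counterparts.
_PUNCT = {
    0x06D4: ".",  # ARABIC FULL STOP
    0x060C: ",",  # ARABIC COMMA
    0x061F: "?",  # ARABIC QUESTION MARK
    0x061B: ";",  # ARABIC SEMICOLON
}

# Assumption: "diacritics" means the short-vowel and related marks
# U+064B..U+0652 plus the superscript alef U+0670.
_DIACRITICS = re.compile("[\u064B-\u0652\u0670]")


def normalize_urdu(text, digits=True, punctuation=True, strip_diacritics=True):
    """Return text with the selected normalizations applied."""
    table = {}
    if digits:
        table.update(_DIGITS)
    if punctuation:
        table.update(_PUNCT)
    # str.translate accepts a dict from code points to replacement strings.
    text = text.translate(table)
    if strip_diacritics:
        text = _DIACRITICS.sub("", text)
    return text
```

Each step is switchable because the description lists normalization as optional; a call such as `normalize_urdu(line, punctuation=False)` would leave Urdu punctuation untouched while still converting digits and removing diacritics.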