Author

Profile of Giga-fren

Created on 2013.05.27
Made by Guest
License: Proprietary with permission for Glosbe

This is a great source of data, described by its author, Chris Callison-Burch, here: https://groups.google.com/d/msg/wmt09/GEo6ZvCWhCA/-Z0O1ArhGvcJ . His description is reproduced below:

----
Chris Callison-Burch 12/19/08

I will be writing a more detailed description of how I created the corpus in our overview paper, but here's a short description:

(1) I crawled a variety of Canadian, European and international web sites, gathering somewhere on the order of 40 million files consisting of more than 1TB of data.
(2) I converted pdf, doc, html, asp, php, etc. files into text, preserving the directory structure of the web crawl.
(3) I wrote a set of simple heuristics to transform French URLs into English URLs (i.e. replacing "fr" with "en" and about 40 other hand-written rules), and assume that these documents are translations of each other.
(4) This yielded 2.9 million French files paired with their English equivalents.
(5) I split each of these files into their sentences, and put <P> markers between paragraphs.
(6) I used Bob Moore's sentence aligner to align the files in batches of 10,000 files.
(7) I de-duplicated the corpus, removing all sentence pairs that occur more than once in the parallel corpus. A lot of the documents are duplicates or near duplicates, and a lot of the text is repeated (for instance web site navigation). I used a Bloom Filter to do de-duplication, so I might have thrown out more than I need to.
(8) I further cleaned the corpus by eliminating sentence pairs that are mainly numbers, or which varied from previous sentences by only numbers.
(9) I deleted sentence pairs where the "French" and "English" sentences are identical. Sometimes one or the other of the documents wasn't actually translated. This is an easy way of handling many of the untranslated documents without having to do language identification.
(10) Finally, I concatenated all of the cleaned sentence alignments together.

--Chris
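
As an illustration only, the URL mapping in step (3) and the cleaning filters in steps (7) to (9) could be sketched in Python roughly as follows. This is not the author's code. The function and class names, the single "fr" to "en" rule, the 0.5 digit threshold, the Bloom filter sizing, and the choice to keep the first occurrence of a repeated pair are all assumptions made for this sketch.

```python
import hashlib
import re


def french_to_english_url(url):
    """Step (3), one hypothetical rule: swap a standalone 'fr' token for 'en'."""
    return re.sub(r"(?<![A-Za-z])fr(?![A-Za-z])", "en", url)


class BloomFilter:
    """Small Bloom filter; false positives mean some unseen items look seen."""

    def __init__(self, num_bits=1 << 20, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def mostly_numbers(sentence, threshold=0.5):
    """True if more than `threshold` of the non-space characters are digits."""
    chars = [c for c in sentence if not c.isspace()]
    if not chars:
        return True
    return sum(c.isdigit() for c in chars) / len(chars) > threshold


def mask_numbers(sentence):
    """Replace every run of digits with '0' so number-only changes vanish."""
    return re.sub(r"\d+", "0", sentence)


def clean_pairs(pairs):
    """Yield (french, english) pairs that survive the step (7)-(9) filters."""
    seen = BloomFilter()
    previous_masked = None
    for fr, en in pairs:
        key = fr + "\t" + en
        if key in seen:            # step (7): repeated (or false-positive) pair
            continue
        seen.add(key)
        masked = (mask_numbers(fr), mask_numbers(en))
        only_numbers_changed = masked == previous_masked
        previous_masked = masked
        if mostly_numbers(fr) or mostly_numbers(en):   # step (8), first half
            continue
        if only_numbers_changed:                       # step (8), second half
            continue
        if fr == en:               # step (9): probably untranslated
            continue
        yield fr, en
```

For example, `french_to_english_url("http://example.ca/fr/page_fr.html")` gives `"http://example.ca/en/page_en.html"`. The Bloom filter trades memory for accuracy: it never misses a true repeat, but a false positive drops a pair that was not actually seen before, which matches the remark in step (7) about possibly throwing out more than necessary.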