Multilingual Lexicon Extraction from Comparable Corpora (MULTILEX)
Start date: 01 Sep 2014
End date: 31 Aug 2018
"Given large collections of parallel (i.e. translated) texts, it is well-known how to, by successively applying a sentence- and a word-alignment step, establish correspondences between words across languages. However, parallel texts are a scarce resource for most language pairs involving lesser-used languages. On the other hand, human second language acquisition seems not to require the reception of large amounts of translated texts, which indicates that there must be another way of crossing the language barrier. Apparently, the human capabilities are based on looking at comparable resources, i.e. texts or speech on related topics in different languages, which, however, are not translations of each other. Comparable (written or spoken) corpora are far more common than parallel corpora, thus offering the chance to overcome the data acquisition bottleneck. Despite its cognitive motivation, in the proposed project we will not attempt to simulate the complexities of human second language acquisition, but will show that it is possible by purely technical means to automatically extract information on word- and multiword-translations from comparable corpora. The aim is to push the boundaries of current approaches, which typically utilize correlations between co-occurrence patterns across languages, in several ways: 1) Eliminating the need for initial lexicons by using a bootstrapping approach which only requires a few seed translations. 2) Implementing a new methodology which first establishes alignments between comparable documents across languages, and then computes cross-lingual alignments between words and multiword-units. 3) Improving the quality of computed word translations by applying an interlingua approach, which, by relying on several pivot languages, allows a highly effective multi-dimensional cross-check. 4) We will show that, by looking at foreign citations, language translations can even be derived from a single monolingual text corpus."
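The "correlations between co-occurrence patterns across languages" that the abstract names as the baseline can be illustrated with a minimal sketch. This is not the project's method or code: the toy sentences, the three-entry seed lexicon, the sentence-level co-occurrence counts, and every function name below are assumptions chosen for illustration. A source word's context vector is mapped into the target language through the seed translations and compared with target-language context vectors by cosine similarity.

```python
from collections import Counter
import math

# Toy comparable (not parallel) corpora; purely illustrative data.
EN_SENTENCES = [
    ["the", "doctor", "works", "in", "the", "hospital"],
    ["the", "teacher", "works", "in", "the", "school"],
]
DE_SENTENCES = [
    ["der", "arzt", "arbeitet", "im", "krankenhaus"],
    ["der", "lehrer", "arbeitet", "in", "der", "schule"],
]
# A few seed translations (hypothetical seed lexicon).
SEED = {"works": "arbeitet", "hospital": "krankenhaus", "school": "schule"}


def cooccurrence_vectors(sentences):
    """Count, for every word, the words it shares a sentence with."""
    vectors = {}
    for tokens in sentences:
        for i, word in enumerate(tokens):
            context = vectors.setdefault(word, Counter())
            for j, other in enumerate(tokens):
                if j != i:
                    context[other] += 1
    return vectors


def project(vector, seed):
    """Map a source context vector into target-language dimensions.

    Context words without a seed translation are dropped.
    """
    projected = Counter()
    for context_word, count in vector.items():
        if context_word in seed:
            projected[seed[context_word]] += count
    return projected


def cosine(a, b):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_candidates(word, candidates, src_vectors, tgt_vectors, seed):
    """Order target-language candidates by similarity to the source word."""
    projected = project(src_vectors.get(word, Counter()), seed)
    return sorted(
        candidates,
        key=lambda c: -cosine(projected, tgt_vectors.get(c, Counter())),
    )


EN_VECTORS = cooccurrence_vectors(EN_SENTENCES)
DE_VECTORS = cooccurrence_vectors(DE_SENTENCES)
print(rank_candidates("doctor", ["lehrer", "arzt"], EN_VECTORS, DE_VECTORS, SEED))
```

On this toy data "arzt" is ranked ahead of "lehrer" for "doctor". A bootstrapping variant in the spirit of point 1 could add the best-scoring pairs to the seed lexicon and repeat, though how the project itself does this is not stated in the abstract.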