Abstract
Cross-lingual word embeddings represent the vocabularies of two or more languages in a single common continuous vector space and are widely used in various natural language processing tasks. A state-of-the-art way to generate cross-lingual word embeddings is to learn a linear mapping, under the assumption that the vector representations of similar words in different languages are related by a linear relationship. However, this assumption does not always hold, especially for substantially different languages. We therefore propose to use kernel canonical correlation analysis to capture a non-linear relationship between the word embeddings of two languages. By extensively evaluating the learned word embeddings on three tasks (word similarity, cross-lingual dictionary induction, and cross-lingual document classification) across five language pairs, we demonstrate that our proposed approach achieves substantially better performance than previous linear methods on all three tasks, especially for language pairs with large typological differences.
| Field | Value |
|---|---|
| Original language | English |
| Article number | 29 |
| Journal | ACM Transactions on Asian and Low-Resource Language Information Processing |
| Volume | 17 |
| Issue number | 4 |
| State | Published - Jul 2018 |
Keywords
- Cross-lingual word representation
- Kernel canonical correlation analysis (KCCA)
- Word embedding evaluation