Abstract:
Wikipedia is the largest encyclopedia ever assembled, built on the vision of enabling every human being to freely share in the sum of all knowledge. Its English edition currently has more than six million articles and over 17 billion words. Unfortunately, millions of people cannot access this resource because it is not available in their language. For instance, the Tigrinya Wikipedia currently has only 218 articles and the Amharic Wikipedia only 15,018.
In this project, we investigate the problem of translating Wikipedia articles from a high-resource language into low-resource languages using human-in-the-loop machine translation (MT) systems. In particular, we investigate different approaches to translating a sample of English Wikipedia articles into Tigrinya and Amharic. Currently, this repository contains 100k English Wikipedia abstracts translated into Amharic and Tigrinya using Lesan (https://lesan.ai).
Structure of data directory:
data
├── human
└── mt
    ├── google
    ├── lesan
    │   ├── am.txt
    │   ├── en.txt
    │   └── ti.txt
    └── microsoft
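
As a rough illustration of how the files under data/mt/lesan might be consumed, the sketch below reads en.txt, am.txt, and ti.txt side by side. It assumes the three files are line-aligned (line N in each file is the same abstract), which the directory listing suggests but does not confirm. The function name `read_parallel` and its parameters are hypothetical and not part of the repository.

```python
from contextlib import ExitStack
from itertools import zip_longest
from pathlib import Path


def read_parallel(mt_dir, langs=("en", "am", "ti")):
    """Yield one dict per line position, keyed by language code.

    ASSUMPTION: each <lang>.txt file holds one abstract per line and the
    files are line-aligned. This layout is not documented in the source.
    """
    with ExitStack() as stack:
        # Open every language file; ExitStack closes them all on exit.
        handles = [
            stack.enter_context((Path(mt_dir) / f"{lang}.txt").open(encoding="utf-8"))
            for lang in langs
        ]
        # zip_longest pads shorter files with None, which exposes misalignment.
        for n, lines in enumerate(zip_longest(*handles), start=1):
            if any(line is None for line in lines):
                raise ValueError(f"files differ in length at line {n}")
            yield {lang: line.rstrip("\n") for lang, line in zip(langs, lines)}
```

Under that alignment assumption, `next(read_parallel("data/mt/lesan"))` would give the first English abstract together with its Amharic and Tigrinya translations. Raising an error on a length mismatch is a deliberate choice here, since silently truncating to the shortest file could hide misaligned pairs.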