Title
Peru is Multilingual, Its Machine Translation Should Be Too?
Date Issued
01 January 2021
Access level
metadata only access
Resource Type
conference paper
Author(s)
University of Edinburgh
Publisher(s)
Association for Computational Linguistics (ACL)
Abstract
Peru is a multilingual country with a long history of contact between the indigenous languages and Spanish. Taking advantage of this context for machine translation is possible with multilingual approaches for learning both unsupervised subword segmentation and neural machine translation models. The study proposes the first multilingual translation models for four languages spoken in Peru: Aymara, Ashaninka, Quechua and Shipibo-Konibo, providing both many-to-Spanish and Spanish-to-many models and outperforming pairwise baselines in most of them. The task exploited a large English-Spanish dataset for pretraining, monolingual texts with tagged back-translation, and parallel corpora aligned with English. Finally, by fine-tuning the best models, we also assessed the out-of-domain capabilities in two evaluation datasets for Quechua and a new one for Shipibo-Konibo.
Start page
194
End page
201
Language
English
OECD Knowledge area
Computer and Information Sciences; Linguistics; Computer Science; Ethnology
Publication version
Version of Record
Scopus EID
2-s2.0-85123956688
Resource of which it is part
Proceedings of the 1st Workshop on Natural Language Processing for Indigenous Languages of the Americas, AmericasNLP 2021
ISBN of the container
978-195408544-2
Sponsor(s)
The author is supported by funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 825299 (GoURMET) and the EPSRC fellowship grant EP/S001271/1 (MTStretch). The author is also thankful for the insightful feedback of the anonymous reviewers.
Sources of information: Directorio de Producción Científica; Scopus