Title
Assessing back-translation as a corpus generation strategy for non-English tasks: A study in reading comprehension and word sense disambiguation
Date Issued
01 January 2019
Access level
metadata only access
Resource Type
conference paper
Author(s)
Universidade de São Paulo
Universidade de São Paulo
Publisher(s)
Association for Computational Linguistics (ACL)
Abstract
Corpora curated by experts have sustained Natural Language Processing mainly in English, but the expensiveness of corpora creation is a barrier for the development in further languages. Thus, we propose a corpus generation strategy that only requires a machine translation system between English and the target language in both directions, where we filter the best translations by computing automatic translation metrics and the task performance score. By studying Reading Comprehension in Spanish and Word Sense Disambiguation in Portuguese, we identified that a more quality-oriented metric has high potential in the corpora selection without degrading the task performance. We conclude that it is possible to systematise the building of quality corpora using machine translation and automatic metrics, besides some prior effort to clean and process the data.
Start page
81
End page
89
Language
English
OECD Knowledge area
Computer science; Linguistics; Systems and communications engineering
Publication version
Version of Record
Scopus EID
2-s2.0-85084294944
Resource of which it is part
LAW 2019 - 13th Linguistic Annotation Workshop, Proceedings of the Workshop
ISBN of the container
978-195073738-3
Conference
13th Linguistic Annotation Workshop, LAW 2019, held in conjunction with the Annual Meeting of the Association for Computational Linguistics, ACL 2019
Sponsor(s)
This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 825299. Besides, we acknowledge the support of the NVIDIA Corporation with the donation of the Titan Xp GPU used for this study. Finally, the first author is granted by the “Programa de apoyo al desarrollo de tesis de licenciatura” (Support programme of undergraduate thesis development, PADET 2018, PUCP).
Sources of information: Directorio de Producción Científica; Scopus