Title
No data to crawl? Monolingual corpus creation from PDF files of truly low-resource languages in Peru
Date Issued
01 January 2020
Access level
metadata only access
Resource Type
conference paper
Publisher(s)
European Language Resources Association (ELRA)
Abstract
We introduce new monolingual corpora for four indigenous and endangered languages from Peru: Shipibo-konibo, Ashaninka, Yanesha and Yine. Given the total absence of these languages on the web, the extraction and processing of texts from PDF files is relevant in a truly low-resource language scenario. Our procedure for monolingual corpus creation considers language-specific and language-agnostic steps, and focuses on educational PDF files with multilingual sentences, noisy pages and low-structured content. Through an evaluation based on language modelling and character-level perplexity on a subset of manually extracted sentences, we determine that our method allows the creation of clean corpora for the four languages, a key resource for natural language processing tasks nowadays.
Start page
2914
End page
2923
Language
English
OECD Knowledge area
General language studies
Linguistics
Scopus EID
2-s2.0-85096526337
Resource of which it is part
LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings
ISBN of the container
979-109554634-4
Conference
12th International Conference on Language Resources and Evaluation, LREC 2020
Sponsor(s)
We are grateful to the computational linguistics team at PUCP: John Miller, Erasmo Gómez, Kervy Rivas, Gema Silva, Gildo Valero, Jaime Montoya and Gonzalo Acosta. Similarly, we thank the bilingual teachers from the UCSS who provided their own crafted material for the evaluation, and more specifically Juan Rubén Ruiz for his support. Besides, we appreciate the comments of Fernando Alva-Manchego on a draft version and the feedback of our anonymous reviewers. Finally, we acknowledge the research grant of the “Consejo Nacional de Ciencia, Tecnología e Innovación Tecnológica” (CONCYTEC, Peru) under the contract 183-2018-FONDECYT-BM-IADT-MU, and the support of NVIDIA Corporation with the donation of a Titan Xp GPU used for the study.
Sources of information:
Directorio de Producción Científica
Scopus