Title
Language identification with scarce data: A case study from Peru
Date Issued
2018
Access level
restricted access
Resource Type
conference paper
Author(s)
Espichán-Linares A.
Publisher(s)
Springer Verlag
Abstract
Language identification is a fundamental task in natural language processing, where corpus-based methods achieve state-of-the-art results in multilingual settings. However, the task still needs to be extended to scenarios with scarce data and many classes, and it remains to be analyzed which of the best-known methods fits them best. Peru, as a multicultural and multilingual country, offers a considerable challenge in this respect. This study therefore proceeds in three steps: (1) building from scratch a digital annotated corpus for 49 Peruvian indigenous languages and dialects, (2) fitting both standard and deep learning approaches for language identification, and (3) statistically comparing the results obtained. As expected, the standard model outperforms the deep learning one, with 95.9% average precision, and both the corpus and the model will be useful inputs for more complex tasks in the future. © Springer International Publishing AG, part of Springer Nature 2018.
Start page
90
End page
105
Volume
795
Number
1
Language
English
Scopus EID
2-s2.0-85045991573
Source
Communications in Computer and Information Science
ISSN of the container
1865-0929
ISBN of the container
9783319905952
Conference
4th Annual International Symposium on Information Management and Big Data, SIMBig 2017
Sponsor(s)
The support of the “Consejo Nacional de Ciencia, Tecnología e Innovación Tecnológica” (CONCYTEC Perú) under contract 225-2015-FONDECYT is acknowledged.
Sources of information: Directorio de Producción Científica