Title
Using lexical language models to detect borrowings in monolingual wordlists
Date Issued
01 December 2020
Access level
open access
Resource Type
journal article
Publisher(s)
Public Library of Science
Abstract
Lexical borrowing, the transfer of words from one language to another, is one of the most frequent processes in language evolution. To detect borrowings, linguists use a range of strategies, combining evidence from various sources. Despite the increasing popularity of computational approaches in comparative linguistics, automated approaches to lexical borrowing detection are still in their infancy and disregard many aspects of the evidence that human experts routinely consider. One example of this kind of evidence is phonological and phonotactic clues, which are especially useful for detecting recent borrowings that have not yet been adapted to the structure of their recipient languages. In this study, we test how these clues can be exploited in automated frameworks for borrowing detection. By modeling phonology and phonotactics with Support Vector Machines, Markov models, and recurrent neural networks, we propose a framework for the supervised detection of borrowings in monolingual wordlists. Based on a substantially revised dataset in which lexical borrowings have been thoroughly annotated for 41 languages from different families, featuring large typological diversity, we use these models to conduct a series of experiments investigating their performance in monolingual borrowing detection. While the general results appear largely unsatisfying at first glance, further tests show that the performance of our models improves with increasing amounts of attested borrowings and in those cases where most borrowings were introduced by a single donor language. Our results show that phonological and phonotactic clues derived from monolingual language data are often not sufficient to detect borrowings when used in isolation. Based on our detailed findings, however, we express hope that they could prove useful in integrated approaches that take multilingual information into account.
Volume
15
Issue
12 (December)
Language
English
OECD Knowledge area
Linguistics; Languages and Literature; Computer Science
Publication version
Version of Record
Scopus EID
2-s2.0-85097836723
PubMed ID
Source
PLoS ONE
ISSN of the container
1932-6203
Sponsor(s)
JEM has received funding and encouragement from the Graduate School of the Pontificia Universidad Católica del Perú (PUCP) through the Huiracocha-2019 scholarship program (https://posgrado.pucp.edu.pe). TT and JML have received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. 715618, "Computer-Assisted Language Comparison"; https://erc.europa.eu). RZ has received funding from the Pontificia Universidad Católica del Perú (PUCP) through the project (604 DGI-PUCP) "¿Gramáticas que mueren?: Aproximación crítica a la obsolescencia de las lenguas desde la documentación y la tipología lingüísticas, las ciencias de la información y la inteligencia artificial". The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. We thank Mei-Shin Wu for work on the White Hmong and Mandarin profiles used to convert WOLD word forms to IPA sound segments. We also thank the Chana team (promoting technologies for indigenous languages of Peru) of the Pontificia Universidad Católica del Perú (PUCP) for their help and encouragement.
Sources of information: Directorio de Producción Científica Scopus