Title
Using lexical language models to detect borrowings in monolingual wordlists
Date Issued
01 December 2020
Access level
open access
Resource Type
journal article
Author(s)
Miller J.E.
Tresoldi T.
ZARIQUIEY BIONDI, ROBERTO DANIEL
Morozova N.
BELTRAN CASTAÑON, CESAR ARMANDO
List J.M.
Publisher(s)
Public Library of Science
Abstract
Lexical borrowing, the transfer of words from one language to another, is one of the most frequent processes in language evolution. In order to detect borrowings, linguists make use of various strategies, combining evidence from various sources. Despite the increasing popularity of computational approaches in comparative linguistics, automated approaches to lexical borrowing detection are still in their infancy, disregarding many aspects of the evidence that is routinely considered by human experts. One example of this kind of evidence is phonological and phonotactic clues, which are especially useful for the detection of recent borrowings that have not yet been adapted to the structure of their recipient languages. In this study, we test how these clues can be exploited in automated frameworks for borrowing detection. By modeling phonology and phonotactics with the support of Support Vector Machines, Markov models, and recurrent neural networks, we propose a framework for the supervised detection of borrowings in monolingual wordlists. Based on a substantially revised dataset in which lexical borrowings have been thoroughly annotated for 41 different languages from different families, featuring a large typological diversity, we use these models to conduct a series of experiments to investigate their performance in monolingual borrowing detection. While the general results appear largely unsatisfying at first glance, further tests show that the performance of our models improves with increasing amounts of attested borrowings and in those cases where most borrowings were introduced by one donor language alone. Our results show that phonological and phonotactic clues derived from monolingual language data alone are often not sufficient to detect borrowings when using them in isolation. Based on our detailed findings, however, we express hope that they could prove to be useful in integrated approaches that take multilingual information into account.
Volume
15
Issue
12 December
Language
English
OCDE Knowledge area
Linguistics; Languages and Literature; Computer Science
Publication version
Version of Record
Scopus EID
2-s2.0-85097836723
PubMed ID
Source
PLoS ONE
ISSN of the container
1932-6203
Sponsor(s)
JEM has received funding and encouragement from the Graduate School of the Pontificia Universidad Católica del Perú (PUCP) through the Huiracocha-2019 scholarship program (https://posgrado.pucp.edu.pe). TT and JML have received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No. ERC Grant #715618, "Computer-Assisted Language Comparison") (https://erc.europa.eu). RZ has received funding from the Pontificia Universidad Católica del Perú (PUCP) through the project (604 DGI-PUCP) ¿Gramáticas que mueren?: Aproximación crítica a la obsolescencia de las lenguas desde la documentación y la tipología lingüísticas, las ciencias de la información y la inteligencia artificial. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. We thank Mei-Shin Wu for work on the White Hmong and Mandarin profiles used to convert WOLD word forms to IPA sound segments. We also thank the Chana team (promoting technologies for indigenous languages of Peru) of the Pontificia Universidad Católica del Perú (PUCP) for their help and encouragement.
Sources of information: Directorio de Producción Científica Scopus