Using lexical language models to detect borrowings in monolingual wordlists

Miller J.E.; Tresoldi T.; ZARIQUIEY BIONDI, ROBERTO DANIEL; Morozova N.; BELTRAN CASTAÑON, CESAR ARMANDO; List J.M.

Date Issued

01 December 2020

Access level

open access

Resource Type

journal article

Author(s)

Miller J.E.

Tresoldi T.

ZARIQUIEY BIONDI, ROBERTO DANIEL

Morozova N.

BELTRAN CASTAÑON, CESAR ARMANDO

List J.M.

Pontificia Universidad Católica del Perú

Publisher(s)

Public Library of Science

Abstract

Lexical borrowing, the transfer of words from one language to another, is one of the most frequent processes in language evolution. In order to detect borrowings, linguists make use of various strategies, combining evidence from various sources. Despite the increasing popularity of computational approaches in comparative linguistics, automated approaches to lexical borrowing detection are still in their infancy, disregarding many aspects of the evidence that is routinely considered by human experts. One example of this kind of evidence is phonological and phonotactic clues, which are especially useful for the detection of recent borrowings that have not yet been adapted to the structure of their recipient languages. In this study, we test how these clues can be exploited in automated frameworks for borrowing detection. By modeling phonology and phonotactics with Support Vector Machines, Markov models, and recurrent neural networks, we propose a framework for the supervised detection of borrowings in monolingual wordlists. Based on a substantially revised dataset in which lexical borrowings have been thoroughly annotated for 41 different languages from different families, featuring a large typological diversity, we use these models to conduct a series of experiments to investigate their performance in monolingual borrowing detection. While the general results appear largely unsatisfying at first glance, further tests show that the performance of our models improves with increasing amounts of attested borrowings and in those cases where most borrowings were introduced by one donor language alone. Our results show that phonological and phonotactic clues derived from monolingual language data alone are often not sufficient to detect borrowings when used in isolation. Based on our detailed findings, however, we express hope that they could prove to be useful in integrated approaches that take multilingual information into account.
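As an illustration of the general idea of phonotactic borrowing detection with Markov models, the following is a minimal sketch and not the authors' implementation. It assumes two add-one smoothed bigram models over sound segments, one trained on words annotated as native and one on words annotated as borrowed, and flags a word as borrowed when the borrowed-word model gives it the lower per-transition cross-entropy. The class and function names, the boundary symbol, and the toy wordlists are all hypothetical and invented for this example.

```python
import math
from collections import Counter

BOUNDARY = "#"  # hypothetical word-boundary symbol, not taken from the study


class BigramModel:
    """Add-one smoothed bigram (first-order Markov) model over sound segments."""

    def __init__(self, words, vocabulary):
        self.vocab = set(vocabulary) | {BOUNDARY}
        self.bigrams = Counter()
        self.contexts = Counter()
        for segments in words:
            padded = [BOUNDARY] + list(segments) + [BOUNDARY]
            for prev, cur in zip(padded, padded[1:]):
                self.bigrams[(prev, cur)] += 1
                self.contexts[prev] += 1

    def cross_entropy(self, segments):
        """Average negative log2 probability per segment transition."""
        padded = [BOUNDARY] + list(segments) + [BOUNDARY]
        total = 0.0
        for prev, cur in zip(padded, padded[1:]):
            count = self.bigrams[(prev, cur)] + 1
            norm = self.contexts[prev] + len(self.vocab)
            total -= math.log2(count / norm)
        return total / (len(padded) - 1)


def train(native_words, borrowed_words):
    """Fit one model per class over a shared segment vocabulary."""
    vocab = {s for word in native_words + borrowed_words for s in word}
    return BigramModel(native_words, vocab), BigramModel(borrowed_words, vocab)


def is_borrowed(segments, native_model, borrowed_model):
    """Flag a word when the borrowed-word model fits it better."""
    return borrowed_model.cross_entropy(segments) < native_model.cross_entropy(segments)


# Toy, invented segmented wordlists (not data from the study).
NATIVE = [w.split() for w in ["p a t a", "t a k a", "k a p a", "t a p a"]]
BORROWED = [w.split() for w in ["s t r i", "s p r i t", "t r i s"]]
native_model, borrowed_model = train(NATIVE, BORROWED)

print(is_borrowed("t r i".split(), native_model, borrowed_model))
print(is_borrowed("t a k a".split(), native_model, borrowed_model))
```

In this toy setting the consonant clusters occur only in the borrowed list, so the two models separate the words easily. With few attested borrowings or several donor languages the borrowed-word model is much less sharply defined, which is consistent with the limitation the abstract reports.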

Volume

15

Issue

12 December

Language

English

OECD Knowledge area

Languages, Literature; Computer Science; Linguistics

Publication version

Version of Record

DOI

10.1371/journal.pone.0242709

Scopus EID

2-s2.0-85097836723

PubMed ID

33296372

Source

PLoS ONE

ISSN of the container

1932-6203

Sponsor(s)

JEM has received funding and encouragement from the Graduate School of the Pontificia Universidad Católica del Perú (PUCP) through the Huiracocha-2019 scholarship program (https://posgrado.pucp.edu.pe). TT and JML have received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. 715618, "Computer-Assisted Language Comparison"; https://erc.europa.eu). RZ has received funding from the Pontificia Universidad Católica del Perú (PUCP) through the project (604 DGI-PUCP) ¿Gramáticas que mueren?: Aproximación crítica a la obsolescencia de las lenguas desde la documentación y la tipología lingüísticas, las ciencias de la información y la inteligencia artificial. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. We thank Mei-Shin Wu for work on the White Hmong and Mandarin profiles used to convert WOLD word forms to IPA sound segments. We also thank the Chana team (promoting technologies for indigenous languages of Peru) of the Pontificia Universidad Católica del Perú (PUCP) for their help and encouragement.

Sources of information: Directorio de Producción Científica; Scopus
