Machine learning for language technology requires an amount of training data that simply does not exist for many languages. Can we exploit existing models for high-resource languages to provide tools and resources for low-resource ones? This topic will be addressed in "Finding Words that Aren't There: Using Word Embeddings to Improve Dictionary Search for Low-resource Languages", an invited talk by Antti Arppe of the University of Alberta. The talk is part of OFAI's 2022 Lecture Series.
Members of the public are cordially invited to attend the talk via Zoom on Wednesday, 14 September 2022 at 18:30 CEST:
Meeting ID: 842 8244 2460
Talk abstract: Modern machine learning techniques have produced many impressive results in language technology, but these techniques generally require an amount of training data that is many orders of magnitude greater than what exists for low-resource languages in general, and endangered ones in particular. However, dictionary definitions written in a much better-resourced majority language can provide a link between low-resource languages and machine learning models trained on massive amounts of majority-language data. By leveraging a pre-trained English word embedding to compute sentence embeddings for definitions in a Plains Cree (nêhiyawêwin) dictionary, we have obtained promising results for dictionary search. Not only are the search results in the majority language of the definitions more relevant, but they can be semantically relevant in ways not achievable with classic information retrieval techniques: users can perform successful searches for words that do not occur at all in the dictionary. These techniques are directly applicable to any bilingual dictionary providing translations between a high- and low-resource language.
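To make the idea in the abstract concrete, here is a minimal sketch of embedding-based dictionary search. It is an illustration under stated assumptions, not the speaker's implementation: the abstract does not say how sentence embeddings are computed, so the sketch assumes the common approach of averaging word vectors and ranking by cosine similarity. The three-dimensional vectors in `TOY_EMBEDDING` are invented stand-ins for a real pre-trained English embedding, and the dictionary keys are placeholder identifiers rather than actual Plains Cree headwords.

```python
import math
import re

# Invented 3-dimensional vectors standing in for a pre-trained English
# word embedding (assumption: real embeddings have hundreds of dimensions).
TOY_EMBEDDING = {
    "dog": [0.9, 0.1, 0.0],
    "puppy": [0.8, 0.2, 0.0],
    "horse": [0.7, 0.0, 0.3],
    "water": [0.0, 0.9, 0.1],
    "river": [0.1, 0.8, 0.2],
    "drink": [0.1, 0.7, 0.0],
}

# Placeholder entry identifiers mapped to English definitions.
# In a real bilingual dictionary the keys would be low-resource-language headwords.
DEFINITIONS = {
    "entry_dog": "a dog",
    "entry_horse": "a horse",
    "entry_water": "water to drink",
}


def sentence_embedding(text):
    """Average the vectors of all known words; return None if none are known."""
    tokens = re.findall(r"[a-z]+", text.lower())
    vectors = [TOY_EMBEDDING[t] for t in tokens if t in TOY_EMBEDDING]
    if not vectors:
        return None
    dims = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def search(query, top_k=3):
    """Rank dictionary entries by similarity between query and definition."""
    query_vec = sentence_embedding(query)
    if query_vec is None:
        return []
    scored = []
    for entry, definition in DEFINITIONS.items():
        def_vec = sentence_embedding(definition)
        if def_vec is not None:
            scored.append((entry, cosine(query_vec, def_vec)))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    # "puppy" appears in no definition, yet the entry defined as "a dog" ranks first.
    for entry, score in search("puppy"):
        print(f"{entry}: {score:.3f}")
```

In this toy setup a keyword match on "puppy" would return nothing, because the word occurs in no definition, while the embedding-based ranking still places the entry defined as "a dog" first. This is the sense in which a search can succeed for a word that does not occur in the dictionary.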
Speaker biography: Dr. Antti Arppe received his Ph.D. in General Linguistics from the University of Helsinki in 2009. Before his graduate studies in linguistics, he completed an M.Sc. in Industrial Management in 1995 at the former Helsinki University of Technology (now part of Aalto University), after which he worked in the late 1990s supervising the development of proofing tools for the majority Nordic languages at Lingsoft, a small language technology company based in Helsinki, Finland. He is currently an Associate Professor of Quantitative Linguistics at the University of Alberta, where he has been a faculty member since 2012, and he has been the Founding Director of the Alberta Language Technology Laboratory (ALTLab) since 2013. His research interests include lexical semantics, corpus linguistics, statistical and computational methods, as well as exploiting multiple methods and sources of evidence. More recently he has started work in language documentation and in developing language technology tools and applications for Indigenous languages to support their revitalization, in particular for Plains Cree (nêhiyawêwin) but also for other languages in the Algonquian, Dene, and other Indigenous language families spoken in North America (e.g. https://itwewina.altlab.app and https://speech-db.altlab.app). In this vein, he is the Project Director for the research and development Partnership "21st Century Tools for Indigenous Languages", funded for 2019-2026 by the Social Sciences and Humanities Research Council (SSHRC) of Canada and involving some 30 researchers and language community members in over 10 academic and non-academic partner organizations (universities, First Nations, and non-governmental organizations) in Canada, the US, and Norway. Furthermore, he is the founding and current President of SIGEL, the Special Interest Group for Endangered Languages under the Association for Computational Linguistics (ACL).