Learning about phraseology from corpora: A linguistically motivated approach for Multiword Expression identification

Inurrieta, Uxoa; Aduriz, Itziar; Diaz de Ilarraza, Arantza; Labaka, Gorka; Sarasola, Kepa

Please use this identifier to cite or link to this item: http://diposit.ub.edu/dspace/handle/2445/174917

Title:	Learning about phraseology from corpora: A linguistically motivated approach for Multiword Expression identification
Author:	Inurrieta, Uxoa; Aduriz, Itziar; Diaz de Ilarraza, Arantza; Labaka, Gorka; Sarasola, Kepa
Keywords:	Morphology (Grammar); Semantics; Machine learning
Issue Date:	27-Aug-2020
Publisher:	Public Library of Science (PLoS)
Abstract:	Multiword Expressions (MWEs) are idiosyncratic combinations of words which pose important challenges to Natural Language Processing. Some kinds of MWEs, such as verbal ones, are particularly hard to identify in corpora due to their high degree of morphosyntactic flexibility. This paper describes a linguistically motivated method to gather detailed information about verb+noun MWEs (VNMWEs) from corpora. Although the main focus of this study is Spanish, the method is easily adaptable to other languages. Monolingual and parallel corpora are used as input, and data about the morphosyntactic variability of VNMWEs are extracted. This information is then tested in an identification task, obtaining an F score of 0.52, which is considerably higher than the scores reported in related work.
Note:	Reproduction of the document published at: https://doi.org/10.1371/journal.pone.0237767
It is part of:	PLoS One, 2020, vol. 15, num. 8, p. e0237767
URI:	https://hdl.handle.net/2445/174917
Related resource:	https://doi.org/10.1371/journal.pone.0237767
ISSN:	1932-6203
Appears in Collections:	Articles published in journals (Catalan Philology and General Linguistics)

Files in This Item:

File	Description	Size	Format
703445.pdf		1.07 MB	Adobe PDF

This item is licensed under a Creative Commons License