DISCOver: DIStributional approach based on syntactic dependencies for discovering COnstructions

One of the goals in Cognitive Linguistics is the automatic identification and analysis of constructions, since they are fundamental linguistic units for understanding language. This article presents DISCOver, an unsupervised methodology for the automatic discovery of lexico-syntactic patterns that can be considered as candidates for constructions. This methodology follows a distributional semantic approach. Concretely, it is based on our proposed pattern-construction hypothesis: those contexts that are relevant to the definition of a cluster of semantically related words tend to be (part of) lexico-syntactic constructions. Our proposal uses Distributional Semantic Models (DSMs) to model the context, taking into account syntactic dependencies. After a clustering process, we link all those clusters with strong relationships and use them as a source of information for deriving lexico-syntactic patterns, obtaining a total of 220,732 candidates from a 100-million-token corpus of Spanish. We evaluated the patterns intrinsically, applying statistical association measures, and qualitatively, through expert review. Our results were superior to the baseline in both quality and quantity in all cases. While our experiments were carried out using a Spanish corpus, the methodology is language independent and only requires a large corpus annotated with parts of speech and dependencies.


Introduction
In cognitive models of language [Croft and Cruse, 2004], a construction is a conventional symbolic unit that involves a pairing of form and meaning that occurs with a certain frequency. Constructions can be of different types depending on their complexity: morphemes, words, compound words, collocates, idioms and more schematic patterns [Goldberg, 1995, 2006]. Cognitive Linguistics assumes the hypothesis that these constructions are learned from usage and stored in human memory [Tomasello, 2000], where they are accessed during both the production and comprehension of language. Therefore, constructions are fundamental linguistic units for inferring the structure of language, and their identification is crucial for understanding language.
Although a broad range of these linguistic structures have been subjected to linguistic analysis [Nunberg et al., 1994, Wray and Perkins, 2000, Fillmore et al., 2012], we assume that a huge number of constructions remain undiscovered. Approaches to the task of identifying and discovering them differ widely, depending on the type of construction being sought, so a wide range of methods has been applied to this kind of linguistic unit. We distinguish between two different approaches: those that have been guided by previously gathered empirical data, and those that apply methods oriented to discovering new constructions from scratch (see Section 2).
Following the latter approach, this article presents DISCOver, an unsupervised methodology for the automatic identification and extraction of lexico-syntactic patterns that are candidates for consideration as constructions (see Section 3). It is based on the Harris distributional hypothesis [Harris, 1954], which states that semantically related words (or other linguistic units) will share the same context. We propose the pattern-construction hypothesis, which states that those contexts that are relevant to the definition of a cluster of semantically related words tend to be (part of) lexico-syntactic constructions. What is new in our hypothesis is that we consider all the contexts that are relevant to define a cluster of semantically related words to be part of a construction. In distributional approaches, Distributional Semantic Models (DSMs) are used to represent the semantics of words on the basis of the contexts they share. This is in line with the idea proposed by Landauer et al. [2007], who state that DSMs are plausible models of some aspects of human cognition [Baroni and Lenci, 2010].
In our methodology, the DSM consists of a frequency lemma-context matrix, in which the context is modeled taking into account syntactic dependency relations. Then, we build up clusters of semantically related words that share the same context and link them using the information present in their contexts.
We automatically calculate a threshold in order to determine which clusters are more strongly related. We filter out those related clusters that do not reach the determined threshold and derive lexico-syntactic patterns that are candidates to be considered as constructions. These candidates are tuples involving two lexical items (lemmas) related both by a dependency direction and a dependency label (see the examples in (1)). The tuples correspond to different kinds of linguistic constructions, ranging from collocates (1a) to (parts of) verbal argument structures (1b). All the lexico-syntactic patterns obtained are instances of one of the syntactic dependencies present in the source corpus. We applied this methodology to the Diana-Araknion corpus, obtaining 220,732 patterns that are good candidates to be constructions 5.
Finally, we evaluated the quality of these patterns in two ways: by applying statistical association measures and through manual review by human experts. The results show significant improvement with respect to several baselines (see Section 4).
Although this method has been applied to the extraction of Spanish constructions, it is language independent and only requires a large corpus annotated with part-of-speech (POS) tags and syntactic dependencies.
The article is structured as follows. After presenting the related work in Section 2, the methodology applied for obtaining the constructions is described in Section 3. The evaluation of our methodology is presented in Section 4 and, finally, the conclusions and future work are presented in Section 5.

Related Work
The boundaries of what a construction is are fuzzy: constructions can be lexical, syntactic, lexico-syntactic, morphological and can combine different levels of abstraction from concrete forms to abstract categories, including the possibility of using variables, so they cover a wide range of linguistic constructs. For more examples, see Goldberg [2013].
4 The symbols '<' and '>' indicate the dependency direction and mod, subj and dobj are dependency labels (where mod stands for modifier, and subj and dobj stand for subject and direct object respectively).
5 All patterns obtained will be made available online.
As a consequence, there is no single accepted typology of this kind of linguistic unit [Wray and Perkins, 2000]. There is, therefore, a broad field of research in which to explore the characteristics, limits and properties of constructions. In this context, an important task is to acquire the maximum amount of empirically grounded data concerning this kind of unit. Thus, when attempting to identify the possible constructions that constitute the core of languages, it is difficult to decide what to look at or where to start [Sag et al., 2002]. For this reason, constructions are a challenge for Linguistics and Natural Language Processing (NLP), where both statistical and symbolic approaches are used to deal with them.
Several linguistic traditions converge when we try to define the diverse forms that a construction can take. On the one hand, there is an (almost total) overlap between constructions and argument structure [Goldberg, 1995] and diathesis alternations [Levin, 1993]; on the other hand, in the lexicographic tradition, constructions also overlap with idioms and collocates. In the field of Computational Linguistics, these linguistic units tend to be grouped under the umbrella term MultiWord Expressions (MWEs). Baldwin and Kim [2010] define MWEs as those lexical items that are decomposable into multiple lexemes and present idiomatic behaviour at some level of linguistic analysis; as a consequence, they should be considered as a unit at some level of computational processing. Also in the Computational Linguistics field, Stefanowitsch and Gries [2003] propose the term "collostruction" to refer to the wide range of complex linguistic units as defined in theoretical proposals of Cognitive Grammar. In our approach we consider as constructions those syntactic units consisting of two or more lexical items with internal semantic coherence. These constructions are compositional and appear with a frequency higher than expected.
From the NLP perspective, most approaches for dealing with constructions tend to apply methods that use previously defined empirical knowledge to find instances and variants of specific types of constructions in corpora. This approach allows us to obtain preidentified units and their variations at different degrees of complexity, but does not allow for the identification of as yet unidentified constructions. In order to discover new knowledge, we need an open and flexible method that gives us usable and interpretable results. We have organised this overview around those approaches that try to find or discover constructions.
A frequent approach to gathering empirical data about constructions using NLP techniques is to look for well-known, highly conventionalized and previously defined constructions (see the works of Hwang et al. [2010], Muischnek and Sajkan [2009], Kesselmeier et al. [2009], O'Donnell and Ellis [2010], Duffield et al. [2010]).
Closely tied to Construction Grammar theory, and within the framework of methodologies based on statistical metrics, it is worth noting the works of Stefanowitsch and Gries [2003], Stefanowitsch and Gries [2008], and Gries et al. [2005]. Their research always focuses on specific types of constructions, on the analysis of their variants and on the degree of entrenchment between their elements. Gries and Ellis [2015] summarize different statistical measures applied to the analysis of constructions and evaluate their linguistic interpretation and impact.
From the perspective of methods oriented to the discovery of new constructions, we should distinguish between those approaches that include some kind of linguistic filtering of the type of constructions to be dealt with and those that do not apply any kind of restriction. All these methods are strongly grounded in statistical measures: Evert [2008] and Pecina [2010] provide an exhaustive summary and critique of statistical measures that calculate the degree of association between words. Looking for ways to identify potential collocations in corpora using statistical measures, Bartsch [2004] explores certain types of collocations involving verbs of verbal communication. Her approach is semiautomatic and involves a manual revision of the results. We also highlight the work of Pecina [2010], based on fully statistical methods. However, supervised machine learning requires annotated data, which creates a bottleneck in the absence of large corpora annotated for collocation extraction. A solution to this problem is presented by Dubremetz and Nivre [2014], who propose the use of the MWEtoolkit [Ramisch et al., 2010] to automatically extract candidates that fit a certain POS pattern. See also the work of Forsberg et al. [2014], Farahmand and Martins [2014], Tutubalina [2015].
From a different perspective, based on the calculation of n-grams, we also consider the results of the StringNet project [Wible and Tsao, 2010], a knowledge base (KB) which contains candidates to be constructions. In this case, no filters are applied to the lexico-syntactic patterns obtained. As a result, StringNet is a lexicogrammatical KB automatically extracted from the British National Corpus (BNC), consisting of a massive archive of hybrid n-grams of co-occurring combinations of POS tags, lexemes and specific word forms.
We also want to highlight the approaches that use syntactic information for obtaining constructions, such as the work of Zuidema [2006], Sangati and van Cranenburgh [2015], based on the framework of Tree Substitution Grammar (TSG).
The Harris distributional hypothesis is widely accepted in the treatment of linguistic semantics as a way to overcome traditional symbolic representations. Relying on this hypothesis, Gamallo et al. [2005] developed an unsupervised strategy to acquire syntactico-semantic restrictions for nouns, verbs and adjectives from partially parsed corpora. Although the resulting data could be used for deriving lexico-syntactic patterns, their objective was to capture semantic generalizations, both for the predicates and their arguments.
Currently, there is an increasing interest in the use of distributional models for representing semantics, such as DSMs [Turney and Pantel, 2010, Baroni, 2013] or word embeddings [Mikolov et al., 2013]. These models derive word representations in an unsupervised way from very large corpora. All of them rely on co-occurrence patterns but differ in the way they reduce dimensionality. As pointed out in Murphy et al. [2012], the representations they derive from corpora are lacking in cognitive plausibility, with exceptions such as those defined in Baroni et al. [2010]. Our proposal shares with these authors the same semantic approach (distributional hypothesis), because we consider that these models are a good option in which to frame our methodology. Specifically, we used DSMs because they are highly linguistically interpretable and allow us to model the context, a key point in our methodology.
DSMs have been applied successfully in linguistic research [Shutova et al., 2010], in different NLP tasks and applications [Baroni and Lenci, 2010] and, especially, in tasks related to measuring different kinds of semantic similarity between words [Turney and Pantel, 2010]. Like us, Shutova et al. [2017] use distributional clustering techniques, though they use DSMs to investigate how to find metaphorical expressions. Recently, DSMs have been extended to phrases and sentences by means of composition operations deriving meaning representations for phrases and sentences from their parts (see Baroni [2013] and Mitchell and Lapata [2010] for an overview). Nevertheless, DSMs have rarely focused on the discovery of constructions. In this line, it is worth noting the papers presented in the shared task of the Workshop on Distributional Semantics and Compositionality [Biemann and Giesbrecht, 2011]. This workshop focused on the extraction of non-compositional phrases from large corpora by applying distributional models that assign a graded compositional score to a phrase. This score denotes the extent to which compositionality holds for a given expression. The participants applied a variety of approaches that can be classified into lexical association measures and Word Space Models. Approaches based on Word Space Models performed slightly better than methods relying solely on statistical association measures.
In the next section, we describe in depth the DISCOver methodology that we developed to discover lexico-syntactic constructions.

Methodology for discovering constructions
Following a distributional semantic approach, we developed an unsupervised bottom-up method for obtaining the lexico-syntactic patterns that can be considered candidates for constructions. This method uses a medium-sized corpus (100 million tokens) to obtain the distributional properties of words and to establish similarity relations among them from their contexts. The representation of the contexts is based on syntactic dependencies. Figure 1 depicts the five main steps involved in obtaining the lexico-syntactic patterns. Briefly, the first step is the linguistic processing of the Diana-Araknion corpus (see Section 3.2). In the next step, a DSM matrix is constructed with the frequencies of the lemmas in each one of the contexts (see Section 3.3).
Step 3 focuses on clustering semantically related lemmas, that is, those lemmas that share a set of contexts (see Section 3.4). In the fourth step, we apply a generalization process by linking all clusters, taking into account the information contained in the contexts, and then keeping only those links that maintain the strongest relationships (see Section 3.5). Finally, we generate the lexico-syntactic patterns to be considered as candidates to be constructions from the related clusters selected in the previous step (see Section 3.6).

Description of the task
Our methodology is based on the pattern-construction hypothesis, which states that those contexts that are relevant to the definition of a cluster of semantically related words tend to be (part of) lexico-syntactic constructions. In our experiments, "lexico-syntactic constructions" are patterns of the form [lemma, dependency_direction, dependency_label, context_lemma], where dependency_label is a type of syntactic relation between lemma and context_lemma, while dependency_direction is the direction of the dependency_label. To be considered candidates to be constructions, patterns must have the following properties:
- Syntactic-semantic coherence: We expect the two lemmas in each pattern candidate to be syntactically and semantically related.
- Generalizability: The patterns can be generalized and/or derived from other patterns through generalization.
Based on these properties of constructions and the initial pattern-construction hypothesis, the main aims of the DISCOver methodology are the following:
1. To identify the contexts that are relevant for the definition of a cluster of semantically related words. Each of these contexts is part of a candidate pattern attested in the corpus (henceforth Attested-Patterns).
2. To use the previous contexts in a generalization process in order to identify unseen, but possible, candidate patterns (henceforth Unattested-Patterns).
As a result we obtain two sets of qualitatively different patterns that are candidates to be constructions: attested and unattested patterns. We then proceed to evaluate the internal syntactic-semantic coherence of these patterns.

The Corpus
As shown in Figure 1, corpus creation is the first step in the process of obtaining lexico-syntactic patterns. Specifically, we built the Diana-Araknion corpus, a Spanish corpus which consists of approximately 100 million tokens (corresponding to 3 million sentences) gathered mainly from the Spanish Wikipedia (2009), literary works and texts from Spanish parliamentary discussions, news reports, news agency documents, and Spanish Royal Family speeches.
The corpus was automatically tokenized and linguistically processed with POS and lemma tagging, and syntactic dependency parsing. We used the Spanish analyzers available in the Freeling open source language-processing library [Padró and Stanilovsky, 2012].
For the purpose of evaluation, we built Diana-Araknion++, a new corpus gathered from web-pages in Spanish. It includes Wikipedia 2015, articles from online newspapers, speeches from the European Parliament, university articles and sites from the Spanish webspace. This corpus was automatically tokenized and POS tagged and consists of 600M tokens.

Matrix
To generate the frequency matrix (see Step 2 in Figure 1), we used only the 15,000 most frequent lemmas extracted from the Diana-Araknion corpus, including nouns (N), verbs (V), adjectives (A) and adverbs (R). We modeled the context in which the words occur, giving rise to a lemma-dep matrix. This matrix corresponds to the type of word-context matrix defined in Turney and Pantel [2010] and in Baroni and Lenci [2010]. In the lemma-dep matrix, the context is based on parsed texts in which both dependency directions and dependency labels are taken into account. Each context is a triple of [dependency_direction, dependency_label, context_lemma_POS].
In what follows, we introduce how this lemma-context matrix is formally represented (see Section 3.3.1) and then we describe the matrix in more detail (see Section 3.3.2).
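As a rough illustration of how such lemma-context frequencies could be collected, the following minimal sketch counts contexts from a list of dependency edges. It is not the authors' implementation. The function names, the toy edges and the lemma bigote_n are hypothetical, and the reading of '<' (context seen from the dependent) and '>' (context seen from the head) is an assumption inferred from the example context [<:dobj:afeitar_v] of barba_n.

```python
from collections import Counter


def contexts_from_edge(head, dependent, label):
    """Return (lemma, context) pairs for one dependency edge.

    Assumption: '<' marks the context seen from the dependent (its head is
    the context lemma) and '>' marks the context seen from the head.
    """
    return [
        (dependent, f"<:{label}:{head}"),
        (head, f">:{label}:{dependent}"),
    ]


def build_frequency_counts(edges):
    """Count how often each lemma occurs in each dependency context."""
    counts = Counter()
    for head, dependent, label in edges:
        counts.update(contexts_from_edge(head, dependent, label))
    return counts


# Toy edges as (head, dependent, label); lemmas carry their POS suffix.
edges = [
    ("afeitar_v", "barba_n", "dobj"),
    ("afeitar_v", "barba_n", "dobj"),
    ("afeitar_v", "bigote_n", "dobj"),
]
counts = build_frequency_counts(edges)
print(counts[("barba_n", "<:dobj:afeitar_v")])
```

Each key of `counts` pairs a row (lemma) with a column (context), so the counter can be read as a sparse version of the frequency matrix.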

Formalization of the lemma-context matrix
Our DSM consists of a lemma-context PPMI matrix $X$ with $n$ rows and $m$ columns. Each row vector corresponds to a lemma, each column corresponds to a co-occurrence context, and each cell in $X$ has a numerical weighted value, $x_{ij}$. This weighted value is the result of applying Positive Pointwise Mutual Information (PPMI) [Niwa and Nitta, 1994] to a lemma-context frequency matrix $F$ of size $n \times m$. Each element in this matrix, $f_{ij}$, is computed as the number of occurrences of lemma $i$ in context $j$ in the whole corpus. Lapesa and Evert [2014] perform a large-scale evaluation of different co-occurrence DSM models over various tasks. They show that term weighting through association scores significantly improves the performance of the DSM model.
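The PPMI weighting can be illustrated with a small sketch. It uses the standard definition, PPMI = max(0, log(P(lemma, context) / (P(lemma) P(context)))), with probabilities estimated from the frequency counts. The base-2 logarithm, the function name and the toy matrix are assumptions made for illustration, not details taken from the article.

```python
import math


def ppmi_matrix(freq):
    """Convert a lemma-by-context frequency matrix into PPMI weights."""
    total = sum(sum(row) for row in freq)
    row_sums = [sum(row) for row in freq]
    col_sums = [sum(col) for col in zip(*freq)]
    weighted = []
    for i, row in enumerate(freq):
        out_row = []
        for j, f_ij in enumerate(row):
            if f_ij == 0:
                # Unseen lemma-context pairs get weight 0 (PMI would be -inf).
                out_row.append(0.0)
                continue
            pmi = math.log2(f_ij * total / (row_sums[i] * col_sums[j]))
            out_row.append(max(0.0, pmi))  # keep only positive associations
        weighted.append(out_row)
    return weighted


# Toy 2 x 2 frequency matrix: each lemma occurs only in its own context.
freq = [
    [2, 0],
    [0, 2],
]
weights = ppmi_matrix(freq)
print(weights)
```

In this toy case each observed pair occurs twice as often as expected under independence, so its weight is log2(2) = 1.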

Lemma-dep matrix
The matrix proposed in this work is a lemma-context matrix, hereafter lemma-dep matrix, based on syntactic dependencies. In this matrix, the context of a lemma is a context word (context_lemma) directly related to the lemma by a dependency direction (dep_dir) and a dependency label (dep_lab). The lemmas belong to the following POS: N, V, A and R. Each lemma is assigned its corresponding POS. Therefore, in the matrix, a context contains three elements, as defined in (1):
$$\text{context} = [\text{dep\_dir} : \text{dep\_lab} : \text{context\_lemma}] \quad (1)$$
where:
- dep_dir has two possible values, '<' or '>', indicating the direction of the dependency.
- dep_lab indicates the dependency label between the lemma and the context_lemma. The possible values are {subj, dobj, iobj, creg, cpred, atr, cc, cag, spec, sp and mod}. In the case of dependencies between a preposition and a noun, adjective or verb, the dependency label is labeled by the same preposition and its corresponding dep_lab, that is, dobj, iobj, creg, cag, sp or/and cc.
- context_lemma is the lemma of the context word with its corresponding POS, which can be N, V, A, R, preposition (P), number (Z) and date (W). Proper nouns are replaced by the pn_n (proper noun) POS.
Figure 2 shows an example of a dependency parsed sentence from which, for instance, three different contexts of the noun lemma barba_n are generated.
For each context obtained from the dependency structure, three different dependency contexts are generated: one that makes all the elements of the context explicit, that is, the dep_dir, dep_lab and context_lemma (for example, [<:dobj:afeitar_v]); another in which the dep_lab is generalized by the variable 'oth' (for example, [<:oth:afeitar_v]) 18; and, finally, one context that generalizes the context_lemma by substituting it with the variable '*' (for example, [<:dobj:*_v]) 19. The three lemmas represented in example (2) do not share any context; therefore, they could not be semantically related in our model. By applying the generalization of contexts, however, we obtain a relationship between lemma 1 and lemma 2 in example (3), and between lemma 1 and lemma 3 in example (4). In example (3), the dep_lab is generalized, whereas in example (4) the context_lemma is generalized.
(2) lemma 1 [<:…:…] 20, lemma 2 [<:…:…], lemma 3 [<:…:…]
In this way, the generalization of contexts allows us to take into account contexts that are similar (they share two, but not all, of the elements of their context) but not identical. Therefore, we can distinguish between those lemmas that share the same or a similar context and those that have a completely different context. By adding these contexts that are similar but not identical, we add new knowledge, that is, knowledge not directly present in the corpus. This new knowledge is used to generate the Unattested-Patterns.
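The three variants generated for each context can be sketched as follows, reusing the [<:dobj:afeitar_v] example from the text. This is only an illustrative sketch: the function names are hypothetical, and the contexts with subj and with cortar_v are invented to mimic the situation in which two lemmas become related only through a generalized context.

```python
def generalize_context(dep_dir, dep_lab, context_lemma):
    """Return the explicit context plus its two generalized variants."""
    pos = context_lemma.rsplit("_", 1)[1]  # POS suffix, e.g. 'v'
    return [
        f"{dep_dir}:{dep_lab}:{context_lemma}",  # fully explicit
        f"{dep_dir}:oth:{context_lemma}",        # dependency label generalized
        f"{dep_dir}:{dep_lab}:*_{pos}",          # context lemma generalized
    ]


def shared_contexts(context_a, context_b):
    """Contexts that two lemmas share once generalization is applied."""
    return set(generalize_context(*context_a)) & set(generalize_context(*context_b))


variants = generalize_context("<", "dobj", "afeitar_v")

# Same direction and context lemma, different label: related through 'oth'.
by_label = shared_contexts(("<", "dobj", "afeitar_v"), ("<", "subj", "afeitar_v"))

# Same direction and label, different context lemma: related through '*_v'.
by_lemma = shared_contexts(("<", "dobj", "afeitar_v"), ("<", "dobj", "cortar_v"))
print(variants, by_label, by_lemma)
```

Without the generalized variants, both pairs of contexts would have an empty intersection, which is why the corresponding lemmas could not be related in the model.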

Clustering
Once we had described the matrix, we proceeded to the third step detailed in Figure 1, which is devoted to the clustering of this matrix. The motivation of the clustering process is to find, for each lemma in the matrix, all semantically related words (lemmas). This will allow us to create new Unattested-Patterns after the cluster linking and filtering processes. To perform this clustering step, we used the Cluto toolkit [Karypis, 2003], which is used to cluster a collection of objects (in our case, lemmas) into a predetermined number of clusters labeled $k$. We applied a methodology based on Caliński and Harabasz [1974], using cosine similarity and Cluto's $\mathcal{H}_2$ metric, to estimate the optimal number of clusters.
18 The tag 'oth' (other) means that the dependency label is not specified.
19 The symbol '*_v' means that a verb occurs in this position, but we do not specify which one it is.
20 'to_rob'
21 'to_steal'
We experimented with a number of different clustering configurations. The variables we took into account were: a) the number of most frequent lemmas, with the 10,000 to 15,000 most frequent lemmas giving the best results; b) the inclusion of proper nouns or their substitution for their POS; and c) considering the lemmas with and without their POS.
We evaluated the results of these configurations manually and opted for 15,000 lemmas with proper nouns grouped according to their POS tag (pn_n) and with the POS tag assigned to the lemmas. This configuration gave an optimal $k$ of 1,500 clusters applying the Caliński and Harabasz [1974] method and the $\mathcal{H}_2$ metric.
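To give an idea of how a Caliński and Harabasz style criterion compares candidate clusterings, the sketch below computes the textbook index: the between-cluster dispersion divided by (k - 1), over the within-cluster dispersion divided by (n - k). It is a simplification under stated assumptions: it uses squared Euclidean distances, whereas the article combines the method with cosine similarity and Cluto's criterion function, and the function name and toy points are hypothetical.

```python
def calinski_harabasz(points, labels):
    """Textbook Calinski-Harabasz index (squared Euclidean distances)."""
    n = len(points)
    dims = len(points[0])
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    k = len(clusters)

    def mean(vectors):
        return [sum(v[d] for v in vectors) / len(vectors) for d in range(dims)]

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    overall = mean(points)
    between = 0.0
    within = 0.0
    for members in clusters.values():
        centroid = mean(members)
        between += len(members) * sq_dist(centroid, overall)
        within += sum(sq_dist(p, centroid) for p in members)
    # Assumes k > 1, n > k and a non-zero within-cluster dispersion.
    return (between / (k - 1)) / (within / (n - k))


points = [[0, 0], [0, 2], [10, 0], [10, 2]]
good = calinski_harabasz(points, [0, 0, 1, 1])  # compact, well separated
bad = calinski_harabasz(points, [0, 1, 0, 1])   # clusters mix both groups
print(good, bad)
```

A higher value indicates a more compact and better separated partition, so the index can be computed for several values of k and the best-scoring one retained.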
The inclusion of POS improves the internal consistency of the clusters. Since the POS tagger does not distinguish between subclasses of proper names (person, organization, place, etc.), grouping them according to the pn_n tag also gives better results. Regarding the number of lemmas, all configurations using between 10,000 and 15,000 lemmas gave satisfactory results. The choice of the number of lemmas determines the number and the content of the clusters. In all cases, the quality of the clusters obtained was acceptable. We consider a cluster acceptable when all or almost all of the words contained in it share one of the following relations: synonymy, hypernymy, or hyponymy. This would allow one or more configurations to be used to obtain the final lexico-syntactic patterns (see Section 3.6).
Using Cluto with the selected configuration, we obtained a set of clusters $\mathcal{C} = \{C_i : 1 \le i \le k\}$ from matrix $X$. Formally, the content of each cluster $C_i \in \mathcal{C}$ is defined in (2), where $L_i$ is a set of related lemmas and $T_i$ is a set of contexts. Each lemma_pos belongs to only one cluster (i.e., it can only be defined in one $L_i$), whereas a context_lemma can be in several contexts ($T_i$) of different clusters.
$$C_i = \langle L_i, T_i \rangle \quad (2)$$
Formally, a context in $T_i$ is described as follows:
$$\text{context} = \langle \text{dep\_dir}, \text{dep\_lab}, \text{context\_lemma}, \text{score} \rangle \quad (3)$$
where dep_dir, dep_lab and context_lemma correspond to the definition of a context given in Section 3.3.2. The score is the sum of the different scores given by Cluto.
For example, Table 1 describes the lemmas, $L_i$, and the highest-scored contexts, $T_i$, in cluster number 421_n (one of the clusters obtained in the corpus analyzed).

Results of the clustering process
Following our configuration, we obtained a total of 1,500 clusters in the clustering process ($k$ = 1,500). The clusters are highly cohesive both morphosyntactically and semantically, and they contain lemmas belonging mostly to the same POS. More than half of the clusters are noun clusters (54.20%), followed by verb clusters (25.80%) and adjective clusters (16.67%). Clusters of adverbs make up only 3.33% of the total.
Clusters contain relevant implicit information, in the sense that their lemmas belong to well-defined semantic categories, often at a very fine-grained level. For instance, we obtained clusters of adjectives with a Positive Polarity (5) and with a Negative Polarity (6). These results encourage us to tag all the clusters with one or more semantic labels, which will enrich the patterns obtained.

Linking and filtering clusters
The process of linking clusters (see Step 4 in Figure 1) is based on the set of clusters and contexts obtained using Cluto. The processes of linking clusters and pattern generation detailed in Section 3.6 are the core steps of the DISCOver methodology. The process of linking clusters uses the set of the twenty-five highest scored contexts in each cluster. According to our pattern-construction hypothesis (see Section 3.1), the goal of the linking of clusters is to establish the relationships between clusters using their contexts, as defined in (3), obtaining as a result a matrix of all possible contextual relations between clusters (see Section 3.5.1). Next, we apply a filtering process in order to select strongly related links taking into account different criteria (see Section 3.5.2).

Linking clusters and building the matrix of related clusters
Basically, the aim of the cluster linking process is to establish the relationships between clusters and to store them in a matrix of related clusters, $R$, with $k$ rows and $k$ columns. The $k$-value corresponds to the number of clusters obtained in the clustering step.
For building the matrix, for each origin cluster ($C_o$), the dep_dir and dep_lab of each context (defined in Equation 3) are converted into a contextual relation (see Equation 4), while the context_lemma of the context is used to locate the cluster ($C_d$) in which it occurs. We obtain as a result a matrix, $R$, in which clusters are related according to a set of contextual relations stored in the corresponding cell. The scores of the contexts in (3) are added together in a score matrix, $S$. The $S$ matrix is later used in the process for determining filtering thresholds.
$$\text{contextual relation} = \langle \text{dep\_dir}, \text{dep\_lab} \rangle \quad (4)$$
For the contextual relation defined in (4), dep_dir and dep_lab are the dependency direction and the dependency label defined in a context of cluster $C_o$ related to cluster $C_d$. Note that the set of contextual relations of a cluster with itself is empty.
Following the example of cluster 421_n, described in Table 1, the result of the cluster linking process for this particular cluster ($C_o$ = 421_n) is shown in Table 2. The first column in this table shows the related clusters, $C_d$; the second column shows the relation_type that relates cluster 421_n to the related clusters (i.e., strong, semi or weak; see Section 3.5.2); and the last column describes the lemmas in the related clusters.
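The linking step can be sketched as follows. This is a minimal illustration rather than the authors' code: the data layout, the function name, the cluster identifiers c1 and c2 and the toy scores are all hypothetical. The limit of twenty-five contexts per cluster follows the text, as does the rule that a cluster is not linked to itself.

```python
from collections import defaultdict


def link_clusters(clusters, top_n=25):
    """Link clusters through the context lemmas of their top-scored contexts.

    clusters maps a cluster id to a dict with:
      'lemmas':   set of lemma_pos strings
      'contexts': list of (dep_dir, dep_lab, context_lemma, score) tuples
    """
    lemma_to_cluster = {
        lemma: cid for cid, c in clusters.items() for lemma in c["lemmas"]
    }
    relations = defaultdict(set)  # (origin, destination) -> {(dep_dir, dep_lab)}
    scores = defaultdict(float)   # (origin, destination) -> summed score
    for origin, cluster in clusters.items():
        top = sorted(cluster["contexts"], key=lambda c: c[3], reverse=True)[:top_n]
        for dep_dir, dep_lab, context_lemma, score in top:
            destination = lemma_to_cluster.get(context_lemma)
            if destination is None or destination == origin:
                # Generalized ('*_v') or unclustered context lemma, or self-link.
                continue
            relations[(origin, destination)].add((dep_dir, dep_lab))
            scores[(origin, destination)] += score
    return relations, scores


clusters = {
    "c1": {
        "lemmas": {"barba_n"},
        "contexts": [
            ("<", "dobj", "afeitar_v", 0.5),
            ("<", "oth", "afeitar_v", 0.25),
            ("<", "dobj", "*_v", 0.1),
        ],
    },
    "c2": {
        "lemmas": {"afeitar_v"},
        "contexts": [(">", "dobj", "barba_n", 0.5)],
    },
}
relations, scores = link_clusters(clusters)
print(dict(relations), dict(scores))
```

The two dictionaries play the role of sparse versions of the relation matrix and the score matrix: only cluster pairs with at least one contextual relation are stored.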

Filtering related clusters
In the matrix of related clusters, not all contextual relationships between clusters are accepted, since some of them have a low score. For this reason, we established two criteria to automatically determine which relationships will be maintained and which ones are filtered out in the pattern generation process. For each criterion, only those relations with a score higher than a predetermined value are considered. The criteria are the following:
1. […]
2. The score of the relation is higher than a predetermined value, that is, $threshold_2$, which is determined by finding a value that allows for the grouping of 50% of the clusters. The relations that fulfill criterion 2 are called Semi relations.
Considering the example of cluster 421_n, the result of the filtering process is that, out of the three clusters linked to cluster 421_n in our example (1223_a, 932_v, and 405_n), we select only those with strong and semi relations, that is, 1223_a and 932_v. Those labelled as weak (e.g., 405_n, shown in Table 2) are filtered out because they do not reach the established thresholds.

Pattern generation
Once the process for automatically linking and filtering clusters was carried out, we proceeded to generate the lexico-syntactic patterns to be considered as candidates for constructions (see Step 5 in Figure 1). Each generated pattern is defined as follows:
$$\text{pattern} = [l_o, \text{dep\_dir}, \text{dep\_lab}, l_d]$$
where $l_o$ and $l_d$ are the lemmas contained in the related clusters ($C_o$ and $C_d$), and dep_dir and dep_lab are the dependency direction and the dependency label between the related clusters. So, there is a pattern for each $l_o$ and $l_d$ pair.
As we mentioned in Section 3.4, all possible configurations using between 10,000 and 15,000 lemmas gave acceptable related clusters. In order to increase the number of patterns generated, we carried out the same process with a configuration using 10,000 lemmas. We combined the patterns obtained using the 10,000 and 15,000 lemmas and removed those that were shared by both configurations. In Tables 3, 4 and 5, we show the number of resulting clusters and patterns, after removing the overlapping patterns, for the two configurations.
As shown in Table 3 (second and third columns), more than 55% of the linked clusters maintain Strong and Semi relationships, whereas only 2.68% of the clusters remain unrelated. The total number of lexico-syntactic patterns obtained from the two configurations of clusters (780 and 857 Strong and Semi related clusters) is 237,444. For the purpose of pattern generation, Strong and Semi clusters have been treated equally. From these patterns, we removed the 16,712 patterns that were present in both sets of generated patterns, giving a total of 220,732 patterns (see Table 5).
The DISCOver methodology allows for the generation of patterns that actually occur in the corpus (Attested-Patterns), but also of lexico-syntactic patterns that are not present in the corpus but which are highly plausible in Spanish (Unattested-Patterns), since the components of the clusters are closely semantically related. As a result, we are able to enlarge the descriptive power of the source corpus. Among the patterns we generated, 61,820 were Attested-Patterns, that is, patterns that are present in the source corpus, and 175,624 were Unattested-Patterns, that is, new patterns (see Table 5).
Returning to the example of cluster 421_n and its related clusters, we obtain patterns such as those shown in (7), which are perfectly acceptable in Spanish. These patterns would not have been extracted using, for example, an n-gram based method or plain statistical methods.
It is worth noting the high degree of semantic cohesion between the lemmas of the same cluster and between the lemmas of the related clusters (see (8), (9), (10) and (11)). This strong cohesion allows for a semantic annotation of the clusters to obtain more abstract syntactico-semantic constructions that combine semantic categories, as in (12) and (13). The semantic labels associated with each cluster were added manually, taking into account the WordNet upper ontologies.
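The pattern generation step described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Pattern`, `generate_patterns`, `split_by_attestation` and `merge_configurations` are hypothetical, and so are the toy lemmas, the direction symbol and the dependency label. The sketch also assumes that "removing" shared patterns means keeping one copy of each pattern found in both configurations.

```python
from itertools import product
from typing import NamedTuple


class Pattern(NamedTuple):
    lemma_a: str
    direction: str  # dependency direction between the two clusters (toy encoding)
    label: str      # dependency label between the two clusters
    lemma_b: str


def generate_patterns(cluster_a, cluster_b, direction, label):
    """Build one pattern per lemma pair drawn from two linked clusters."""
    return {Pattern(a, direction, label, b) for a, b in product(cluster_a, cluster_b)}


def split_by_attestation(patterns, corpus_patterns):
    """Separate patterns observed in the corpus from generalized (new) ones."""
    attested = patterns & corpus_patterns
    return attested, patterns - attested


def merge_configurations(patterns_x, patterns_y):
    """Combine two configurations; each shared pattern is kept only once."""
    shared = patterns_x & patterns_y
    return patterns_x | patterns_y, len(shared)


# Toy usage with invented clusters and an invented dependency label.
patterns = generate_patterns(["beber", "tomar"], ["café", "vino"], ">", "dobj")
observed = {Pattern("beber", ">", "dobj", "vino")}
attested, unattested = split_by_attestation(patterns, observed)
```

Because the patterns are the Cartesian product of the two lemma sets, the number of Unattested-Patterns grows quickly with cluster size, which is consistent with the unattested set being much larger than the attested one.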

Evaluation
In this section we evaluate the quality of the results obtained through the DISCOver methodology: the clusters obtained (see Section 4.1) and the lexico-syntactic patterns (see Section 4.2).

Clustering evaluation
DISCOver is a methodology for discovering lexico-syntactic patterns. The clusters of semantically related words are a by-product that we obtain as part of the process. Since the focus of this work is the methodology used and the patterns obtained, the evaluation of all possible representation and clustering algorithms is outside the scope of this article. Nevertheless, we prepared a cluster evaluation experiment in order to justify our choice and show that the quality of the obtained vectors and clusters is at least comparable with other state-of-the-art methods. As a baseline, we use standard Word2Vec representations [Mikolov et al., 2013] with the recommended built-in k-means clustering algorithm. We evaluate the resulting clusters with respect to two criteria: a) the POS purity of each cluster, calculated automatically; and b) the semantic coherence of the lemmas in each cluster, evaluated manually by experts. The criterion applied was to check whether the words in a cluster hold one of the following semantic relations: synonymy, hypernymy or hyponymy.
CLUTO obtained much higher results in terms of both evaluation criteria. The POS coherence of the obtained clusters was 98%, compared to 70% obtained by Word2Vec. Manual evaluation shows that 99% of the clusters obtained by CLUTO were more semantically coherent than the corresponding ones obtained by Word2Vec. These results justify the representations and parameters as adequate for the task and as comparable with the state of the art. Kovatchev et al. [2016] present a more in-depth comparison of the clustering algorithms using corpora of different sizes.
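The text does not give the exact formula used for POS purity, so the following is only one plausible reading: the purity of a cluster is taken to be the share of its lemmas that carry the cluster's majority POS tag, and the overall figure is the mean over clusters. The function names are hypothetical.

```python
from collections import Counter


def pos_purity(cluster_pos_tags):
    """Share of lemmas in a cluster that carry its majority POS tag."""
    counts = Counter(cluster_pos_tags)
    majority_count = counts.most_common(1)[0][1]
    return majority_count / len(cluster_pos_tags)


def mean_pos_purity(clusters):
    """Average purity over clusters; each cluster is a list of POS tags."""
    return sum(pos_purity(tags) for tags in clusters) / len(clusters)


# Toy usage: a mostly nominal cluster and a mixed cluster.
overall = mean_pos_purity([["n", "n", "n", "v"], ["v", "a"]])
```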

Pattern evaluation
Obtaining high quality lexico-syntactic patterns is the main objective of the DISCOver methodology. In this section, we present two different evaluations of the obtained patterns: (1) an automatic evaluation, applying statistical association measures; and (2) a manual evaluation by expert linguists. For these evaluations, we used the sum of the patterns of both the 15,000 and 10,000 word configurations.
First, we evaluated the patterns automatically using statistical association measures and a different, much larger, corpus (Diana-Araknion++). In Section 3.1, we define two main properties of constructions: 1) Syntactic-semantic coherence and 2) Generalizability. "Syntactic-semantic coherence" entails that the words in each pattern need to be syntactically and semantically related. The "syntactic coherence" of the patterns is not evaluated explicitly, as it is considered to be a by-product of the methodology: all linked clusters from which the patterns are derived have a plausible syntactic relationship and a high connectivity score (see Section 3.5.1). However, we need to evaluate the semantic coherence of the patterns, that is, whether there is a semantic relation between the two lemmas. Defining and evaluating "semantic relatedness" is a non-trivial task, which often requires the use of external resources, such as WordNet [Miller, 1995] and BabelNet [Navigli and Ponzetto, 2012]. However, these resources are built considering the paradigmatic relationship between words (such as synonymy, hypernymy, and hyponymy), while we are interested in evaluating syntagmatic relationships.
Evert [2008] and Pecina [2010] discuss the use of association measures for identifying collocations. They define collocations as "the empirical concept of recurrent and predictable word combinations, which are a directly observable property of natural language". In the context of distributional semantics, this definition corresponds to "semantic coherence".
In the DISCOver process, we obtained two qualitatively different types of candidates-to-be-constructions: Attested-Patterns, which are observed in the corpus and Unattested-Patterns, which are obtained as a result of a generalization process that includes clustering, linking and filtering. In order to evaluate the quality of these candidates-to-be-constructions, we formulate two hypotheses and disprove their corresponding null hypotheses.
-Hypothesis 1: The two lemmas in each construction are semantically related.
Null hypothesis 1 (henceforth H0 1): The degree of statistical association between the two lemmas in each of the Attested-Patterns, measured in a corpus other than the one they were extracted from, is equal to statistical chance.
-Hypothesis 2: Constructions can be generalized and/or derived from other constructions through generalization. Unattested-Patterns (derived through a generalization process) should be possible language expressions and have the property of semantic coherence.
Null hypothesis 2.1 (henceforth H0 2.1): Unattested-Patterns are not possible language expressions. They cannot appear in a corpus.
Null hypothesis 2.2 (henceforth H0 2.2): If Unattested-Patterns appear in a corpus, they will not have the property of semantic coherence. That is, they will have association scores equal to statistical chance.
In order to prove the two main hypotheses we needed to disprove the three null hypotheses.
As a baseline for H0 1, we extracted a list of all bigrams (BI-Patterns) from the original Diana-Araknion corpus. Each bigram contains at least one of the 15,000 most frequent words. We removed all bigrams containing non-content words. All of the Attested-Patterns and the BI-Patterns were found in, and extracted from, the Diana-Araknion 100M token corpus.
As a baseline for H0 2.1, we generated patterns by combining frequent lemmas (FL-Patterns): FL-Patterns-15 contain all combinations of the 15,000 most frequent lemmas found in the Diana-Araknion corpus; FL-Patterns-30 contain all combinations in which one lemma is among the 15,000 most frequent lemmas and the other among the 30,000 most frequent ones; FL-Patterns-all contain all word combinations which contain at least one of the 15,000 most frequent lemmas.
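The frequent-lemma baseline can be illustrated with a short sketch. It assumes that a combination is an unordered pair of two distinct lemmas, which the text does not state explicitly; `fl_patterns` and its parameters are hypothetical names, and the toy list stands in for a frequency-ranked lemma list.

```python
def fl_patterns(ranked_lemmas, k_first, k_second):
    """Unordered lemma pairs with one member among the top k_first lemmas
    and the other among the top k_second lemmas (k_first <= k_second)."""
    top_first = ranked_lemmas[:k_first]
    top_second = ranked_lemmas[:k_second]
    pairs = set()
    for a in top_first:
        for b in top_second:
            if a != b:
                # frozenset makes (a, b) and (b, a) the same pattern
                pairs.add(frozenset((a, b)))
    return pairs


# Toy usage: lemmas sorted by descending corpus frequency.
ranked = ["ser", "tener", "casa", "libro"]
fl_small = fl_patterns(ranked, 2, 2)  # analogue of FL-Patterns-15
fl_wide = fl_patterns(ranked, 2, 4)   # analogue of FL-Patterns-30
```

With the real cut-offs (15,000 and 30,000) the number of pairs runs into the hundreds of millions, so a practical version would stream the pairs rather than hold them in a set.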
We use two different statistical measures [Evert, 2008]: simple Mutual Information (MI), which is an effect-size measure, and the Z-score (Z-sc), which is an evidence-based measure. Effect-size measures and evidence-based measures are qualitatively different and can be used complementarily for evaluation. Our final experimental setup includes the following:
-Attested-Patterns, in five different test groups, based on their observed frequency in the Diana-Araknion corpus:
  -Att-Patterns-all with an original frequency of 1 or more
  -Att-Patterns-2 with an original frequency of 2 or more
  -Att-Patterns-3 with an original frequency of 3 or more
  -Att-Patterns-4 with an original frequency of 4 or more
  -Att-Patterns-5 with an original frequency of 5 or more
-BI-Patterns, with an original frequency of 5 or more
-Unattested-Patterns
-FL-Patterns-15, FL-Patterns-30, FL-Patterns-all
Evaluating H0 1: We calculated the MI and Z-sc association scores of the two words in each of the Attested-Patterns and BI-Patterns in the Diana-Araknion++ 600M token corpus. The association score was calculated based on the sentential co-occurrence of the two words. Patterns that co-occurred fewer than 5 times obtained a score of 0. First, we compared the obtained association scores with standard thresholds representing statistical chance: 0, 0.5, and 1 for MI; 0, 1.96, and 3.29 for Z-sc. Second, we compared the average association score of the Attested-Patterns with that of the BI-Patterns. Table 6 shows what percentage of the Attested-Patterns in each group obtains scores higher than statistical chance. Overall, the majority of the Attested-Patterns outperform the statistical chance baseline. The results are consistent for both measures and their thresholds, even though they measure the association in a qualitatively different manner. It is important to note that filtering out the Attested-Patterns with a frequency of 1 significantly improves the results. We believe this factor should be taken into consideration in future experiments.
As a complementary evaluation, we directly compared the association scores of the Attested-Patterns with those of the BI-Patterns. Table 7 shows the average association scores for the two types of patterns. The Attested-Patterns have a much higher degree of association than the BI-Patterns. In the case of MI, the Attested-Patterns obtain scores more than two times higher than the BI-Patterns. In the case of Z-sc, the Attested-Patterns obtain scores between 30% and 100% higher than the BI-Patterns. The obtained results disprove H0 1 and confirm Hypothesis 1. That is, we can conclude that the Attested-Patterns are semantically coherent.
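The scoring procedure used to test the first null hypothesis can be sketched as follows. The formulas are the simple association measures as commonly given following Evert [2008]: with observed co-occurrence O and expected co-occurrence E under independence, MI = log2(O / E) and Z-sc = (O - E) / sqrt(E). The sketch assumes that all counts are numbers of sentences, in line with the sentential co-occurrence mentioned above; the function and parameter names are hypothetical.

```python
import math


def expected_cooccurrence(f1, f2, n_sentences):
    """Expected number of sentences containing both words under independence."""
    return f1 * f2 / n_sentences


def simple_mi(observed, f1, f2, n_sentences, min_cooc=5):
    """Simple mutual information log2(O / E); 0 below the frequency cut-off."""
    if observed < min_cooc:
        return 0.0
    return math.log2(observed / expected_cooccurrence(f1, f2, n_sentences))


def z_score(observed, f1, f2, n_sentences, min_cooc=5):
    """Z-score (O - E) / sqrt(E); 0 below the frequency cut-off."""
    if observed < min_cooc:
        return 0.0
    expected = expected_cooccurrence(f1, f2, n_sentences)
    return (observed - expected) / math.sqrt(expected)


def share_above(scores, threshold):
    """Percentage of patterns whose score exceeds a chance threshold."""
    return 100.0 * sum(s > threshold for s in scores) / len(scores)


# Toy usage: a pair seen together in 40 of 1,000 sentences,
# each word occurring in 100 sentences.
mi = simple_mi(40, 100, 100, 1000)
z = z_score(40, 100, 100, 1000)
```

Reporting `share_above` at each threshold (0, 0.5 and 1 for MI; 0, 1.96 and 3.29 for Z-sc) gives the kind of percentages summarized per pattern group.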
Evaluating H0 2.1: We checked how many of the Unattested-Patterns were present in Diana-Araknion++. As a baseline we used the FL-Patterns. Neither the Unattested-Patterns nor the FL-Patterns are obtained directly from a corpus; both are the result of generalization and generation using different methodologies. For each group, we calculated the percentage of the patterns that appear at least once and the percentage of the patterns that appear at least five times. Table 8 shows the results obtained.
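The two percentages computed for each group of generated patterns can be sketched as below. The sketch assumes that the first percentage counts patterns appearing at least once, and that corpus frequencies are available as a mapping from pattern to count; `observance_rates` and `corpus_freq` are hypothetical names.

```python
def observance_rates(patterns, corpus_freq):
    """Percentage of generated patterns seen at least once and at least
    five times in a reference corpus (corpus_freq maps pattern -> count)."""
    total = len(patterns)
    seen_once = sum(corpus_freq.get(p, 0) >= 1 for p in patterns)
    seen_five = sum(corpus_freq.get(p, 0) >= 5 for p in patterns)
    return 100.0 * seen_once / total, 100.0 * seen_five / total


# Toy usage: four generated patterns, three of them found in the corpus.
rates = observance_rates(["p1", "p2", "p3", "p4"], {"p1": 7, "p2": 1, "p3": 5})
```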
Unattested-Patterns appear much more frequently than the patterns generated by simply combining frequent lemmas. 56% of the Unattested-Patterns were observed in Diana-Araknion++. This is more than double the observance rate of FL-Patterns-15 and five times higher than that of FL-Patterns-30. 24% of the Unattested-Patterns appear in Diana-Araknion++ with a frequency of 5 or more. This is almost three times higher than for FL-Patterns-15 and six times higher than for FL-Patterns-30. The results of FL-Patterns-all are much lower, showing that unfiltered pattern generation is not effective. Unattested-Patterns are linguistic patterns, given that they appear in a corpus with a much higher probability than patterns generated using a simpler frequency-based methodology. These results disprove H0 2.1.
Evaluating H0 2.2: We calculated the association score (MI and Z-sc) between the lemmas in each of the Unattested-Patterns that occurred at least 5 times in Diana-Araknion++ (see footnote 37). We compared the scores with the same thresholds we used when evaluating H0 1. Table 9 shows the percentage of patterns with a score higher than the statistical chance thresholds. The observed degree of association is very high. Over 90% of the observed Unattested-Patterns obtained a positive association score with respect to both measures. When compared with the statistical chance thresholds, the results are similar to those obtained by the Attested-Patterns for H0 1. The Unattested-Patterns, when observed in a different corpus, are semantically coherent. This disproves H0 2.2.
Footnote 37: Calculating this score for patterns with lower frequency is unreliable due to the low-frequency bias in some of the measures.
In conclusion, the automated statistical evaluation of the patterns obtained by DISCOver shows that: (1) Attested-Patterns are semantically coherent, as they outperform two baselines: statistical chance thresholds and BI-Patterns. These results disprove H0 1; (2) A significant percentage (56%) of the Unattested-Patterns can be found in Diana-Araknion++, which is much higher than the occurrence of FL-Patterns. These results disprove H0 2.1; (3) Whenever Unattested-Patterns occur in Diana-Araknion++, the statistical association between the lemmas in the patterns is much higher than the statistical chance baseline. This disproves H0 2.2.
As we have disproved all three null hypotheses, we can conclude that the patterns obtained by the DISCOver methodology have both properties of constructions: syntactic-semantic coherence and generalizability. Therefore, they are good candidates-to-be-constructions.
We also performed a manual evaluation of the lexico-syntactic patterns. This complementary validation reinforces the results obtained in the statistical evaluations. We prepared a dataset of 600 patterns for the manual evaluation: 300 patterns obtained by applying the DISCOver methodology (randomly selected from all Attested and Unattested Patterns) and 300 of the FL-Patterns-15. Three experts were asked to classify each pattern as a correct or incorrect construction. The instructions given to them were: a) evaluate whether the pattern is a possible Spanish pattern in your judgement as a native speaker; b) in case of doubt, consult the Google search engine to check whether the pattern is in actual use. Our research questions in this evaluation were: 1) How do the experts evaluate the patterns obtained by DISCOver? 2) Are experts more likely to accept patterns obtained by DISCOver than random patterns of frequent words?
The average percentage of agreement between the three annotators was 81.67% (see Table 10), which is considered high for a semantic evaluation task. The corresponding Fleiss' kappa score is 0.602, with an expected agreement of 0.539, which is statistically significant. The results of the evaluation are shown in Table 11. We use three pattern quality categories: "Strict Positive" includes patterns that were annotated as positive by all three annotators, "Positive" includes patterns that were annotated as positive by at least two annotators, and "Negative" groups together patterns that were annotated as positive by one or none of the annotators. The experts accepted the majority of the DISCOver patterns as constructions. At the same time, they rejected the majority of the FL-Patterns. We also want to highlight that the percentage of "Strict Positive" patterns is very similar to the percentage of patterns that obtain a high association score. These findings confirm the results that we obtained in the automatic evaluation (see Tables 6 and 9).
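The reported agreement figures are mutually consistent under the usual definition of Fleiss' kappa, (P_o - P_e) / (1 - P_e): (0.8167 - 0.539) / (1 - 0.539) is approximately 0.602. A minimal sketch of the computation for binary judgements, together with the three quality categories, is given below. The function names are hypothetical and the ratings are toy data, not the annotations from the study.

```python
def fleiss_kappa(ratings, n_categories=2):
    """Fleiss' kappa; ratings[i] lists the category (0..n_categories-1)
    chosen by each rater for item i. Every item needs the same number of raters."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    # counts[i][c] = number of raters assigning category c to item i
    counts = [[row.count(c) for c in range(n_categories)] for row in ratings]
    # observed agreement: share of agreeing rater pairs, averaged over items
    p_obs = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items
    # expected agreement from the overall category proportions
    props = [
        sum(row[c] for row in counts) / (n_items * n_raters)
        for c in range(n_categories)
    ]
    p_exp = sum(p * p for p in props)
    return (p_obs - p_exp) / (1 - p_exp)


def quality_category(votes):
    """Most specific category for one pattern; votes are 1 (positive) or 0.
    Note that "Positive" as defined in the text also covers "Strict Positive"."""
    positives = sum(votes)
    if positives == len(votes):
        return "Strict Positive"
    if positives >= 2:
        return "Positive"
    return "Negative"


# Toy usage with three annotators and four patterns.
toy_ratings = [[1, 1, 1], [0, 0, 0], [1, 1, 0], [1, 0, 0]]
kappa = fleiss_kappa(toy_ratings)
labels = [quality_category(v) for v in toy_ratings]
```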

Conclusions and Future Work
This article describes DISCOver, an unsupervised methodology for automatically identifying lexico-syntactic patterns to be considered as constructions. We based this methodology on the pattern-construction hypothesis, which states that the linguistic contexts that are relevant for defining a cluster of semantically related words tend to be (part of) a lexico-syntactic construction. Following this assumption, we developed a bottom-up, language-independent methodology to discover lexico-syntactic patterns in corpora. The DSM developed allows us to model the contexts of words (lemmas) taking into account their dependency directions and dependency labels. We applied a clustering process to the resulting matrix to obtain clusters of semantically related lemmas. Then we linked all the clusters that were strongly semantically related and we used them as a source of information for deriving lexico-syntactic patterns, obtaining a total of 220,732 candidates to be constructions. We evaluated the DISCOver methodology in several ways. First, the patterns were automatically evaluated using statistical association measures and a different, much larger, corpus. We evaluated whether the patterns we generated obtained a significantly higher association score than statistical chance. We also compared the association scores of the DISCOver patterns with a baseline of bigrams. DISCOver obtained better results with respect to both baselines. The patterns obtained by generalization were additionally evaluated against a baseline of randomly generated patterns. DISCOver significantly outperforms these baselines. Second, the patterns were manually evaluated by expert linguists, obtaining good results (89.33%).
This methodology only requires a medium-sized corpus automatically annotated with POS tags and syntactic dependencies. Therefore, our methodology can be easily replicated with other corpora and other languages. For instance, the DISCOver patterns were also used in a text classification task [Franco-Salvador et al., 2015]. The patterns obtained using our methodology were compared to other representations (i.e., tf-idf, tf-idf n-grams, and enriched graph). The use of these patterns results in an accuracy of 91.69%, which outperforms the representations based on tf-idf (25.26%), tf-idf n-grams (79.26%) and an enriched graph (43.98%), proving to be the best option to represent the content of the corpus.
Furthermore, our methodology increases the descriptive power of the source corpus. First, the lexico-syntactic patterns generated constitute a structured and formalized semantic representation of the corpus. Second, the linking process enlarges the content of the initial data with new relationships not directly present in the corpus (i.e., a total of 167,443 Unattested-Patterns).
The Diana-Araknion-KB can be used as a source of information to derive relevant linguistic information, such as the selection restrictions of verbs, nouns and adjectives; to disambiguate syntactic analysis in order to discard candidate parse trees; to provide a knowledge base of related words with a high degree of association for psycholinguistic research; and to allow for a fine-grained corpus comparison.
The methodology presented and the results obtained, which are available in the Diana-Araknion-KB, open several lines of future research.
First, the Diana-Araknion-KB can be used as a source of information for the development of patterns at different levels of abstraction, in such a way as to obtain a hierarchy of patterns with components belonging to different levels of linguistic knowledge, that is, combining lexical, morpho-syntactic and semantic information. Second, since the same semantic category can be shared by more than one cluster, we could group them into metaclusters containing all the clusters with the same semantic category. Third, a further cluster linking process could be carried out allowing all members of a metacluster to combine with all the target clusters that are related with at least one of the members of the metacluster. Fourth, constructions could be linked in terms of transitivity to obtain larger structures. That is, if cluster A combines with cluster B, and B combines with cluster C, we have the candidate construction: A+B+C. Fifth, the methodology can be used to extract and study patterns in corpora from a specific area, such as the Biomedical domain.
To sum up, we consider that this methodology for discovering constructions outperforms the results of other proposals in the sense that it is fully automatic, language independent, and easily replicable in other corpora and languages. The quality of the results obtained and their wide range of possible applications confirm the DISCOver methodology as a promising line of research and DSMs as a good choice for discovering linguistic knowledge.