The challenges of statistical patterns of language: the case of Menzerath's law in genomes

The importance of statistical patterns of language has been debated for decades. Although Zipf's law is perhaps the most popular case, Menzerath's law has recently entered the debate. Menzerath's law manifests in language, music and genomes as a tendency of the mean size of the parts to decrease as the number of parts increases in many situations. In genomes, for instance, it emerges as a tendency of species with more chromosomes to have a smaller mean chromosome size. It has been argued that the instantiation of this law in genomes is not indicative of any parallel between language and genomes because (a) the law is inevitable and (b) non-coding DNA dominates genomes. Here mathematical, statistical and conceptual challenges of these criticisms are discussed. Two major conclusions are drawn: the law is not inevitable and languages also have a correlate of non-coding DNA. However, the wide range of manifestations of the law in and outside genomes suggests that the striking similarities between non-coding DNA and certain linguistic units could be anecdotal for understanding the recurrence of that statistical law.


INTRODUCTION
Attempts to demonstrate that statistical patterns of language have a trivial explanation have a long history that goes back at least to the research by G. A. Miller and collaborators questioning the relevance of Zipf's law for word frequencies around 1960 [1][2][3]. Zipf's law states that the curve that relates the frequency of a word $f$ and its rank $r$ (the most frequent word having rank 1, the second most frequent word having rank 2, and so on) should follow $f \propto r^{-\alpha}$ [4]. Miller argued that if monkeys were chained ''to typewriters until they had produced some very long and random sequence of characters'' one would find ''exactly the same 'Zipf curves' for the monkeys as for the human authors'' [3]. Under his view, Zipf's law would be an inevitable consequence of the fact that words are made of units, e.g., letters or phonemes. The typewriter argument has been revived many times since then [5][6][7][8]. However, rigorous analyses indicate that the curves do not really look the same and the parameters of this random typing model giving a good fit to real word frequencies are not forthcoming [9,10]. Here, we review a recent claim that the finding of another statistical pattern of language, Menzerath's law, is also inevitable [11]. P. Menzerath hypothesized that ''the greater the whole, the smaller its constituents'' (''Je größer das Ganze, desto kleiner die Teile'') in the context of language [12] (p. 101). Converging research in music and genomes [13][14][15][16] suggests that Menzerath's law is a general law of natural and human-made systems. In this article, we reserve the term Menzerath-Altmann law for the exact mathematical dependency that has been proposed by the quantitative linguistics tradition for the relationship between $x$, the size of the whole (in parts), and $y$, the mean size of the parts, i.e. $y = a x^{b} e^{cx}$ (Eq. (1)) [17], where $a$, $b$, and $c$ are the parameters of Menzerath-Altmann law.
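Miller's typewriter argument can be made concrete with a small simulation. The following is only a minimal sketch of a random typing model, not the specific models analyzed in Refs. [9,10]: the alphabet, the space probability `p_space`, and the function names are assumptions introduced here for illustration.

```python
import random
from collections import Counter


def random_typing(n_chars, alphabet="abcde", p_space=0.2, seed=0):
    """Type n_chars random characters and return the resulting 'words'.

    Assumption: every letter is equally likely and the space bar is hit
    with probability p_space (both are illustrative choices).
    """
    rng = random.Random(seed)
    symbols = alphabet + " "
    letter_weight = (1 - p_space) / len(alphabet)
    weights = [letter_weight] * len(alphabet) + [p_space]
    text = "".join(rng.choices(symbols, weights=weights, k=n_chars))
    # split() without arguments drops the empty strings produced by
    # consecutive spaces.
    return text.split()


def rank_frequency(words):
    """Word frequencies sorted from most to least frequent (rank 1 first)."""
    return sorted(Counter(words).values(), reverse=True)


words = random_typing(200_000)
freqs = rank_frequency(words)
for rank in (1, 10, 100, 1000):
    if rank <= len(freqs):
        print(rank, freqs[rank - 1])
```

Plotting `freqs` against rank on doubly logarithmic axes, next to the rank-frequency curve of a real text, is one way to inspect how similar the two curves actually look.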
In the pioneering research by Wilde and Schwibbe [14] and later work [15,20], Menzerath's law emerged as a negative correlation between $L_c$ and $L_g$, where $L_c$ is the mean chromosome length (the size of the constituents) and $L_g$ is the chromosome number (the size of the construct measured in constituents). More recently, the law has been found in the dependency between mean exon size (the size of the constituents) and the number of exons of human genes (the size of the construct) [16].
However, it has been argued that this negative correlation is trivial [11]: the definition of $L_c$ as a mean, i.e. $L_c = G/L_g$, where $G$ is the genome size, leads (according to Ref. [11]) unavoidably to $L_c \propto L_g^{b}$ with $b = -1$, which is supported by the fact that mammals and plants give values of $b$ that are very close to $b = -1$ ($b = -1.04$ for mammals and $b = -1.07$ for plants [11]). In the present article, $\propto$ is used to indicate proportionality. Furthermore, it has also been argued that a proper connection between human language and genomes cannot be established a priori using genomes as wholes and chromosomes as parts, due to the fluid nature of chromosomal arrangements and the vast dominance of noncoding DNA, which has no parallel in language [11]. Examining those arguments is critical for musicology, quantitative linguistics, and genomics. If they were correct, the relationship between the mean size of the constituents ($y$) and the number of constituents ($x$), which has been the subject of many studies [13][16][17][18], would be a trivial consequence of the definition of the size of the constituents as a mean. Following Miller's argument, producing Menzerath's law would be as easy as producing Zipf's law by monkeys chained to a typewriter. More precisely, the inevitability of $L_c \propto 1/L_g$ [11] predicts that Menzerath-Altmann law must always be Eq. (1) with $b = -1$ and $c = 0$ when defining the size of the parts as a mean. If such inevitability is correct, exponents deviating significantly from $b = -1$ should be the exception, not the rule, in language, music and genomes.
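The special case singled out by the inevitability argument can be written out in one step. This is only a worked sketch: it assumes that Eq. (1) has the usual Menzerath-Altmann form $y = a x^{b} e^{cx}$, with $x = L_g$ and $y = L_c$, and it treats the genome size $G$ as a constant that plays the role of $a$.

```latex
\begin{align*}
  y &= a\,x^{b}\,e^{cx}
     && \text{(assumed form of Eq. (1))} \\
    &= a\,x^{-1}\,e^{0} = \frac{a}{x}
     && (b = -1,\; c = 0) \\
  L_c &= \frac{G}{L_g} \propto \frac{1}{L_g}
     && (x = L_g,\; y = L_c,\; a = G \text{ constant})
\end{align*}
```

The last line holds only while $G$ does not vary with $L_g$, which is the point examined in Section 2.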
Here we address the challenge of Menzerath's law in genomes [14][15][16] and beyond [13,17,18] by reviewing Solé's criticisms [11]: his mathematical and statistical arguments, essentially the inevitability of $L_c \propto 1/L_g$ (Section 2), as well as his conceptual arguments, mainly the mismatch between human language and genomes (Section 3). Finally, we will discuss some general questions that are crucial for understanding the recurrence of Menzerath's law (Section 4).
2. The mathematical and statistical debate.

2.1. Mixing Angiosperm and Gymnosperm Plants
Solé does not distinguish between angiosperm and gymnosperm plants [11]. However, our analyses have revealed important differences between them: (1) concerning the relationship between $L_g$ and $L_c$, Menzerath's law is only found in angiosperms [15], (2) $G$ tends to increase as $L_g$ increases in gymnosperms but $G$ increases as $L_g$ decreases in angiosperms [19] and (3) the fit of $L_c \propto L_g^{b}$ yields $b = -0.95 \pm 0.05$ for angiosperms and $b = -0.3 \pm 0.2$ for gymnosperms [20], the latter being statistically inconsistent with the $b = -1$ that Solé predicts [11]. As his division of plants differs from that of Ferrer-i-Cancho and Forns [15] and gymnosperms do not follow Menzerath's law, we proceed assuming that his notion of plant is equivalent or can be reduced to angiosperms.
2.2. $L_c = G/L_g$ does not Imply $L_c \propto 1/L_g$
It has been argued that the definition of $L_c$ as $G/L_g$ unavoidably leads to an inverse proportionality dependency between $L_g$ and $L_c$, i.e. $L_c \propto 1/L_g$ [11]. This can be refuted in two ways: empirically and mathematically.

Empirical Refutation
- Amphibians exhibit a positive correlation between $L_c$ and $L_g$ that is incompatible with $L_c \propto 1/L_g$ [15].
- Menzerath's law (a significant negative correlation between $L_c$ and $L_g$) was not found for gymnosperm plants and ray-finned fishes [15].
- Many empirical studies of Menzerath-Altmann law compute the size of the parts as an average, as Ferrer-i-Cancho and Forns did [15], but the fit of Eq. (1) gives parameters that deviate from $b \approx -1$ (see Table 1 for a summary of research).
- $b = -0.6$ is reported for ants in the pioneering work by Wilde and Schwibbe [14] that is cited by Ferrer-i-Cancho and Forns [15].
- Solé reports estimates of $b$ only for mammals and plants (according to his analysis $b = -1.04$ and $b = -1.07$, respectively) [11], whereas Ferrer-i-Cancho and Forns considered a total of 11 major groups [15] (see also Ref. [19]). Thus, nine groups have not been considered. $|b + 1|$ is a measure of the deviation from his prediction, i.e. $L_c \propto 1/L_g$; $|b + 1| = 0$ means a perfect match with his prediction. Mammals and angiosperm plants are among the three groups with the smallest value of $|b + 1|$ (Table 2).
- A careful statistical analysis reveals that $b$ deviates significantly from $b = -1$ in fungi, gymnosperm plants, insects, reptiles, jawless fishes, ray-finned fishes, and amphibians, groups for which Solé reports no result [11]. Furthermore, the parameter $b$ of $L_c \propto L_g^{b}$ contributes significantly to improving the quality of the fit with regard to that of $L_c \propto 1/L_g$ for the same groups [20]. Put differently, if $b$ is left free, then the error of the model is reduced significantly for these groups with regard to keeping it equal to $-1$.
- In a recent study of Menzerath-Altmann law in genomes at the gene-exon level, the relationship between the mean exon size in bases and the number of exons of a human gene yields $b \approx -0.5$ [16].
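The deviation measure $|b + 1|$ used in the list above can be illustrated with a short script. This is a sketch under stated assumptions: it estimates $b$ by ordinary least squares on log-transformed values, which is a simpler technique than the nonlinear regression of Ref. [20], and the data are synthetic values made up for illustration, not measurements from any taxonomic group.

```python
import math


def fit_exponent(l_g, l_c):
    """Least-squares slope of log(L_c) against log(L_g).

    The slope is an estimate of b in L_c ∝ L_g**b. Assumption: a log-log
    linear fit stands in for the nonlinear regression of the cited work.
    """
    xs = [math.log(v) for v in l_g]
    ys = [math.log(v) for v in l_c]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    return s_xy / s_xx


# Synthetic chromosome numbers (hypothetical values).
chromosome_numbers = [2, 4, 8, 16, 32, 64]

# Two made-up mean chromosome sizes: one follows the predicted b = -1,
# the other follows b = -0.5.
inverse_sizes = [50.0 / n for n in chromosome_numbers]
shallow_sizes = [50.0 * n ** -0.5 for n in chromosome_numbers]

b_inverse = fit_exponent(chromosome_numbers, inverse_sizes)
b_shallow = fit_exponent(chromosome_numbers, shallow_sizes)

# |b + 1| is (numerically) zero only for the first series.
print(abs(b_inverse + 1), abs(b_shallow + 1))
```

With real data the estimate of $b$ carries an error, so deciding whether $|b + 1|$ differs from zero requires a statistical test such as the ones reported in Ref. [20].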

Mathematical Refutation
If $G$ is constant, then the definition $L_c = G/L_g$ gives $L_c \propto 1/L_g$, as it is argued by Solé [11]. Yet, if $G$ is not constant, then $b = -1$ is not necessarily expected: (1) the exponent may change (e.g., if $G \propto L_g^{-2}$ then $b = -3$) and (2) the power-law $L_c \propto L_g^{b}$ could be lost (e.g., if $G \propto L_g e^{-L_g}$ then we would have $L_c \propto e^{-L_g}$).
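Both counterexamples follow directly from the definition $L_c = G/L_g$. As a short worked sketch, suppose that $G \propto L_g^{\gamma}$, where $\gamma$ is a hypothetical exponent introduced here only for illustration:

```latex
\begin{align*}
  L_c = \frac{G}{L_g} \propto \frac{L_g^{\gamma}}{L_g} = L_g^{\gamma - 1}
    &\quad\Longrightarrow\quad b = \gamma - 1, \\
  \gamma = 0 \ (G \text{ constant})
    &\quad\Longrightarrow\quad b = -1, \\
  \gamma = -2
    &\quad\Longrightarrow\quad b = -3, \\
  G \propto L_g\, e^{-L_g}
    &\quad\Longrightarrow\quad L_c \propto \frac{L_g\, e^{-L_g}}{L_g} = e^{-L_g}.
\end{align*}
```

Under this assumption, $b = -1$ is recovered only for $\gamma = 0$, and the last case is no longer a power law in $L_g$ at all.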
A mathematical analysis indicates that $L_c \propto 1/L_g$ requires $G$ and $L_g$ to be uncorrelated [19]. Therefore, $L_c \propto 1/L_g$ is rejected if $G$ and $L_g$ are correlated. The empirical evidence for such correlation is the following: (1) $G$ tends to increase as $L_g$ increases in gymnosperm plants and mammals while $G$ tends to decrease as $L_g$ increases in angiosperm plants [19] and (2) among the major taxonomic groups considered in Ref. [15], only birds and cartilaginous fishes show no significant correlation between $G$ and $L_g$ [19].
It is not clear that Eq. (1) (with the possibility of $b = -1$ and/or $c = 0$, following Solé's arguments) is the best, or simply the most suitable, equation for modeling the actual relationship between $L_c$ and $L_g$ in genomes. When preparing our original article [15], we were already aware of the challenge of designing biologically realistic equations and evaluating the goodness of their fit rigorously. Therefore, we decided to use a simple correlation analysis between $L_c$ and $L_g$ to stay neutral about the actual dependency. While our original approach was nonparametric (based on a Spearman rank correlation test), Solé followed the parametric track with the assumption that genomes follow $L_c \propto L_g^{b}$ [11]. Our approach to test Menzerath's law [15] and our approach to reject $L_c \propto 1/L_g$ [19] are both nonparametric. In sum, our analysis requires fewer assumptions than his. However, we have had to follow a parametric approach in one of the branches of our genome research to show that even when strong assumptions are made about the actual dependency, his arguments do not stand, even for mammals and plants [20].

[Table 1 caption: The summary is based upon the pioneering work of G. Altmann and collaborators. N.A. means that the two-parameter version of Eq. (1), with $c = 0$, was fitted.]
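The nonparametric route mentioned above can be sketched in a few lines. This is an illustrative sketch only: the source states that a Spearman rank correlation test was used, while the one-sided permutation test, the function names, and the synthetic data below are assumptions introduced here.

```python
import random


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / (s_xx * s_yy) ** 0.5


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


def negative_correlation_test(xs, ys, trials=2000, seed=0):
    """One-sided permutation test for a negative rank correlation.

    Assumption: a permutation test is used here for simplicity; it is not
    necessarily the test procedure of the cited studies.
    """
    rng = random.Random(seed)
    observed = spearman(xs, ys)
    shuffled = list(ys)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        if spearman(xs, shuffled) <= observed:
            hits += 1
    return observed, (hits + 1) / (trials + 1)


# Synthetic data (hypothetical): mean size decreases with the number of
# parts, but not as a power law, so no functional form is assumed.
l_g = [4, 6, 8, 10, 12, 16, 20, 24, 30, 40]
l_c = [100.0 / (1.0 + n ** 0.5) for n in l_g]

rho, p_value = negative_correlation_test(l_g, l_c)
print(rho, p_value)
```

Because only ranks enter the statistic, the test detects any monotonically decreasing dependency without committing to $L_c \propto L_g^{b}$ or to Eq. (1).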

The Unsupported Fluid Nature of Chromosomal Rearrangements
Solé states that ''the fluid nature of chromosomal rearrangements through time rules against any special multiscale link between genome-level and chromosome-level patterns'' [11]. If the mathematical interpretation of this statement is that the genome and the chromosome level are statistically independent, then a large amount of research indicates that $G$ and $L_g$ are not independent in real genomes and that independence is in conflict with chromosome well-formedness (see Ref. [9] and references therein).

Languages also have ''Dark Matter''
Solé argues that the dominance of noncoding DNA (which he also calls ''information-lacking DNA'' or ''junk DNA'') should prevent us from using large-scale structures such as genomes as meaningful information-related units [11]. However, the view of noncoding DNA as ''dark matter'' or ''junk'' in a strict sense is outdated from the point of view of molecular biology [21][22][23][24]. Some researchers have suggested that ''there is in fact much less, if any, 'junk' in the genomes of the higher organisms than has previously been supposed'' [25].
Linguistic sequences and genomes are not so radically different concerning real or apparent ''junk,'' ''dark matter,'' or ''information-lacking DNA''. In general, words are classified into content words (e.g., verbs and nouns) and function words (e.g., prepositions and conjunctions). While content words are said to have lexical meaning, function words are said to have grammatical meaning [26], i.e. function words lack lexical meaning [27] (p. 55). For this reason they are called ''empty words'' by certain scholars [26]. Similarly, noncoding DNA is empty, in the sense that it does not code for specific proteins. The term ''junk words'' has also been used for referring to function words and particles in language sciences [28]. However, the closest analogy for the term ''junk'' in human language is the so-called filler words such as ''um,'' ''oh,'' ''well'' (Searls DB, Personal Communication, 2011).
Function words such as prepositions and conjunctions have an inherently relational meaning [29] and they are very important nodes in word networks: they are hubs or ''authorities'' in a network theory sense [30,31]. The logical structure of the sentence ''Mary bought an apartment in spite of the economic crisis'' is radically different from that of ''Mary bought an apartment thanks to the economic crisis''. The conjunctions ''in spite of'' and ''thanks to'' regulate the relationship between ''Mary bought an apartment'' and ''the economic crisis'' in the sentences above. In sum, lexical meaning and protein coding appear to be parallel terms from the linguistic and the genetic world, respectively. The same applies to grammatical meaning and regulation, the latter being a function served by noncoding DNA [22,24].
If we consider linguistic units with grammatical function as equivalent to noncoding DNA, then not only function words or particles parallel noncoding DNA, but also bound morphemes (e.g., the -ed ending of walked), as they also contain grammatical meaning. Just as linguistic sequences at many levels contain a mixture of elements with lexical and grammatical meaning (e.g., lexemes and bound morphemes in words), a DNA sequence may be a combination of coding and noncoding parts (e.g., exons and introns in genes). Words, phrases, clauses, sentences, i.e. units on which Menzerath's law has been reported (Ref. [33] and references therein), are ''polluted'' to some extent by ''dark matter''.

[Table 2 caption: A summary of $|b + 1|$, the difference between the exponent $b$ obtained from the fit of $L_c \propto L_g^{b}$ and the exponent $-1$ that is expected from the arguments by Solé [11]. Groups are sorted increasingly by $|b + 1|$. $b$ was estimated using nonlinear regression as in Ref. [20]. The dataset is the same as that of Refs. [19] and [20]. The values of $|b + 1|$ were rounded to leave only two significant digits. Superscript a is used for the only two groups used by Solé [11]. Two interpretations of Solé's notion of plant are offered: angiosperms and a mixture of angiosperms and gymnosperms.]
The statistics of the amount of function and content words provide us with an estimate of the amount of ''dark matter'' in language. Table 3 indicates that the proportion of words that parallel noncoding DNA, i.e. function and filler words, is about 59% in an English conversation, while it is about 37% in a news report. Therefore, languages also have a large proportion of elements reminiscent of noncoding DNA. But the true proportion of ''noncoding'' elements in languages could be higher if the grammatical morphemes that are attached to lexemes were included in the counts.
Interestingly, the evolution of the view of ''fillers'' in linguistics parallels the evolution of the view of noncoding regions in molecular biology. Progress in linguistic research indicates that ''fillers'' are more than mere ''fillers'' while progress in genomics indicates that ''junk'' DNA is more than mere ''junk''. As for linguistics, the understanding of filler words has evolved from the term filler [32], as their meaning and their role in the sentence were gradually recognized, to particular kinds of discourse-related particles or cue words (Ref. [34] and references therein). At present, the consensus is that ''words'' originally called fillers ''have no apparent grammatical relation to the sentences in which they appear'', and ''contrary to what prescriptivists' accusations, they do have a meaning, in that they seem to convey something about the speaker's relation to what is asserted in the sentence'' [34]. The view of other function words has evolved similarly: function words believed to be empty do indeed contain meaning [34,35]. As for molecular biology, the field is moving from the view of noncoding DNA as ''junk'' to that of functionally relevant material [21][22][23][24]. The view of repetitive segments in DNA sequences as mere ''fillers'' is being abandoned in molecular biology [36]. In both biology and linguistics, ''dark matter'' is becoming meaningful or functional matter, thanks to progress in core molecular biology and linguistics.

Misunderstanding of a Metaphor
Solé's focus on noncoding DNA as an obstacle for a proper connection between human language and genomes [11] shows that he has misunderstood the ''metaphor that genomes are words and chromosomes are syllables'' (abstract of Ref. [15]).
Patterning consistent with Menzerath's law is found at many linguistic levels: morphemes (in the seminal work by G. Altmann [17] that he cites) or sentences [18]; see Table 1. Probably the most radical example is music (see also Table 1), where the whole and the parts lack a ''meaning'' equivalent to that of content words. This suggests that Menzerath's law is a manifestation of abstract principles as many have proposed (see Ref. [13] and references therein [15]). In contrast, Solé shows a lack of abstraction when considering that language and genomes, in order to resemble each other statistically, must be practically identical [11]. Indeed, he interprets the linguistic metaphor that inspired our original article (genomes ''are'' words and chromosomes ''are'' syllables) not as a metaphor but as a narrow equivalence. We could have replaced words and syllables by other units: morphemes and syllables, sentences and clauses, or mr-segments and F-motifs (Table 1). Words and syllables were probably the simplest metaphors for a general audience.

Discussion
We have seen that Menzerath's law is not inevitable in genomes and that it suffices that the number of parts (e.g., the number of chromosomes) and the size of the whole in the units of the parts (the size of chromosomes in bases) are correlated in order to reject a trivial case of the law [16,19]. However, we do not mean that the finding of a nontrivial Menzerath's law in the relationship between mean chromosome size and chromosome number [15,19] is due to the striking similarities between noncoding DNA and linguistic units with grammatical meaning that we have highlighted here but Solé neglected [11]. We have never argued that the finding of the law in genomes is indicative of meaning, syntax, or any other important property of language. The finding of Menzerath's law both when noncoding DNA is excluded [16] and when noncoding and coding DNA are mixed [15], and beyond, i.e. in language (see Ref. [33] for a review) and music [13], suggests that a higher level of abstraction is necessary for understanding the recurrence of the law.
To our knowledge, it has not yet been investigated whether noncoding DNA alone could lead to Menzerath's law, or more interestingly, a nontrivial Menzerath's law. Without this research, it is not possible either to have a clearer understanding of the role of noncoding DNA in the emergence of Menzerath's law in genomes or to question the relevance of the law in genomes. Perhaps, rather than precluding the emergence of the law or leading to a trivial law, noncoding DNA may contribute to the emergence of the law in a way that defies a trivial explanation. Languages and genomes show a striking similarity at the semantic level: both possess units that have an arbitrary semantic reference of symbolic nature [37]. Our comparison goes further and suggests that genomes code for some abstract version of grammatical and lexical meaning, the former in noncoding regions and the latter in coding regions. However, the depth of the similarity and the possible DNA-specific properties must be investigated further. One of the challenges for language research is estimating the proportion of material with grammatical meaning including both free function words and bound morphemes.
Quantitative linguistics offers powerful tools for discovering and investigating nontrivial connections between human language and genomes [37,38]. However, the evolutionary mechanisms and the constraints that may underlie the recurrence of Menzerath's law still must be understood.