Emergence of Zipf's Law in the Evolution of Communication

Zipf's law seems to be ubiquitous in human languages and appears to be a universal property of complex communicating systems. Following the early proposal made by Zipf concerning the presence of a tension between the efforts of speaker and hearer in a communication system, we introduce evolution by means of a variational approach to the problem based on Kullback's Minimum Discrimination of Information Principle. Therefore, using a formalism fully embedded in the framework of information theory, we demonstrate that Zipf's law is the only expected outcome of an evolving, communicative system under a rigorous definition of the communicative tension described by Zipf.


I. INTRODUCTION
Zipf's law is one of the most common power laws found in nature and society [1-6]. Although it was observed early in the distribution of money income [7] and city sizes [1], it was popularized by the linguist George Kingsley Zipf, who observed that it accounts for the frequency of words within written texts [2,3]. Specifically, if we rank all the words occurring in a text from the most common to the least common, Zipf's law states that the probability $q(s_m)$ that, in a random trial, we find the $m$th most common word ($m = 1, \ldots, n$) falls off as
$$q(s_m) \propto m^{-\gamma},$$
with $\gamma \approx 1$. The ubiquity of this scaling behavior has prompted several mechanisms to account for the emergence of this distribution; among many others, see [4,8-12].
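As a purely illustrative aside, the rank-probability form of Zipf's law can be written down in a few lines. The sketch below assumes a finite vocabulary of n ranks; the function name `zipf_distribution` and the choice n = 1000 are hypothetical and are not taken from the references.

```python
def zipf_distribution(n, gamma=1.0):
    """Return normalized rank probabilities proportional to m**(-gamma), m = 1..n."""
    weights = [m ** (-gamma) for m in range(1, n + 1)]
    total = sum(weights)  # normalization constant of the truncated power law
    return [w / total for w in weights]


# Illustrative vocabulary size (an assumption, not a value from the text).
q = zipf_distribution(1000, gamma=1.0)

# For gamma = 1 the ratio q(s_1) / q(s_m) is approximately the rank m.
print(q[0] / q[1], q[0] / q[9])
```

With γ = 1, the most common word is about twice as likely as the second one and about ten times as likely as the tenth.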
Within the context of human language, G. K. Zipf conjectured early on that this scaling law is the outcome of a tension between two forces acting in a communication system [3]. According to Zipf's proposal, speakers and hearers need to simultaneously minimize their efforts under what he called vocabulary balance, a particular case of the so-called Principle of Least Effort. This simultaneous minimization triggers a tension between the two communicative agents. The speaker's economy would favor reducing the vocabulary to a single word, whereas the hearer's economy would favor enlarging the vocabulary to the point where there is a different word for each meaning. The resulting vocabulary would emerge out of this unification-diversification conflict [3]. Although both numerical and theoretical studies have explored this idea [10,11,13], no truly analytic proof of unicity has been provided under realistic information-theoretic constraints. We can view the proposals made in [10,11,13] as static because they consider a code of fixed size.
A recent approach, which goes beyond the communicative framework, defined the key complexity properties a system must have for its event statistics to follow Zipf's law: an open, unbounded number of accessible states and a linear loss of entropy due to generic internal constraints [12]. The linear loss of entropy captures the intuitive idea that the studied systems are in an intermediate state between order and disorder (or, as we shall see, that a possible informative tension is balanced), and the unbounded number of accessible states reflects their open nature. It was shown that, under a very general parametrization, and imposing scale invariance on the solution, Zipf's law was the only possible outcome. Here we adapt and enrich the general framework proposed in [12] to the communicative context. As we shall see, Zipf's hypothesis can be interpreted in such a way that the system can be studied within the framework proposed in [12]. Moreover, the parameters that were arbitrary in the general mathematical framework mentioned above can now be naturally interpreted in the communicative framework as the key pieces of the mathematical statement of Zipf's hypothesis.
Beyond the mathematical formalization of the communicative conflict described by Zipf, we need another ingredient, pointed out in a different context in [14]; namely, the active role played by the evolutionary path followed by the code. As occurs with other systems growing out of equilibrium, such as scale-free networks [15], we will consider the evolution of the communicative exchange as the system grows.
Here the evolutionary component is variationally introduced by minimizing the divergence between code configurations belonging to successive time steps. This minimal change follows the so-called Minimum Discrimination Information Principle (henceforth MDIP), a general variational principle considered analogous to the Maximum Entropy Principle [16], from which statistical mechanics can be properly formalized [17,18]. The MDIP states that, under changes in the constraints of the system, the most expected probability distribution is the one minimizing the Kullback-Leibler divergence (also referred to as Kullback-Leibler entropy or relative entropy) from the original one [17]. Such a variational principle constrains the changes of the internal configurations of a statistical ensemble when the external conditions change, in the same way that the internal configurations of a statistical ensemble change when we introduce moment constraints in a Jaynesian formalism. In our context, this information-theoretic functional assumes the role of a Lagrangian whose minimization along the process defines the possible ensemble configurations one can observe at a certain point of an evolutionary path.
Using the MDIP and the framework provided in [12], we provide a proof of unicity for the emergence of Zipf's law in evolving codes. We stress that no arbitrary assumptions are made on the nature of solutions.
The remainder of the paper is structured as follows: In Sec. II we rigorously define the communicative tension intuitively defined by Zipf and explicitly characterize the evolutionary process in terms of the mathematical statement of such a tension. In Sec. III we apply the MDIP as the guiding, variational principle which accounts for the possible evolutionary paths of the code. Finally, we demonstrate that the consequences of the application of both the communicative tension and the MDIP account for the emergence of Zipf's law as the unique possible solution of the evolving code. In Sec. IV we discuss the implications of our results.

II. THE EVOLUTION OF THE COMMUNICATIVE SYSTEM
In this section we mathematically define (1) the communicative tension described by Zipf and (2) the evolution or growth of a given code subject to such a tension. We furthermore define the range of application of our formalism. As we shall see in Sec. III, the proposal made in this section defines a framework whose key piece to work with is Eq. (6).

A. The explicit description of the communicative conflict
The first task is to properly define the communicative tension between the coder and the decoder and how this tension is solved. Following the standard nomenclature used in studies of the evolution of communicating autonomous agents [19-21], our system has two agents: the coder agent P, which encodes information from a set of external events $\Omega$, and the decoder or external observer, which infers the behavior of $\Omega$ through the code provided by the coder agent P. In this way, $\Omega$ is the set of external events acting as the input alphabet, and $S = \{s_1, \ldots, s_n\}$ is the set of signals or output alphabet. The coder module P [Fig. 1(a)] is fully described by a matrix $P(X_s|X_\Omega)$, where $X_\Omega$ is a random variable taking values on the set $\Omega$ following the probability measure $p$, with $p(m_k)$ being the probability of having symbol $m_k$ as the input in a given computation. Complementarily, $X_s$ is a random variable taking values on $S$ and following the probability distribution $q$ which, for a given $s_i \in S$, reads
$$q(s_i) = \sum_{k} p(m_k)\, P(s_i|m_k), \qquad (1)$$
that is, the probability of obtaining $s_i$ as the output of a codification. We assume that both distributions are normalized, $\sum_{k} p(m_k) = \sum_{i} q(s_i) = 1$.
FIG. 1. A growing communication system. (a) Possible meaning-signal associations made by the coder module P, for which Eq. (2) holds. (b) Summary of the evolution rules of our communicative system. Suppose that the symmetry between coder and decoder [i.e., Eq. (2)] holds at step n (above). At each step (below) a new element is added to the set $\Omega$ and Eq. (2) holds again for this new configuration. Furthermore, the new configuration is constrained by the MDIP, which introduces a path dependency in the evolutionary process.
For the decoder agent inferring the input set from the output set with least effort, the best scenario is a one-to-one mapping between $\Omega$ and $S$. In this case, P generates an unambiguous code, and no supplementary information is required to successfully reconstruct $X_\Omega$. However, from the perspective of the coding device, this coding has a high cost. In order to characterize this conflict, let us properly formalize the above intuitive statement: The decoder agent wants to reconstruct $X_\Omega$ through the intermediation of the coding performed by P. Therefore, the amount of bits needed by the decoder of $X_s$ to unambiguously reconstruct $X_\Omega$ is
$$H(X_\Omega, X_s) = -\sum_{k}\sum_{i} P(m_k, s_i) \log P(m_k, s_i),$$
with $P(m_k, s_i) = p(m_k) P(s_i|m_k)$, which is the joint Shannon entropy or, simply, joint entropy of the two random variables $X_\Omega, X_s$. From the codification process, the decoder receives $H(X_s)$ bits and, thus, the remaining uncertainty it must face will be
$$H(X_\Omega|X_s) = H(X_\Omega, X_s) - H(X_s),$$
where
$$H(X_s) = -\sum_{i} q(s_i) \log q(s_i),$$
that is, the entropy of the random variable $X_s$, and
$$H(X_\Omega|X_s) = -\sum_{i} q(s_i) \sum_{k} P(m_k|s_i) \log P(m_k|s_i)$$
is the conditional entropy of the random variable $X_\Omega$ conditioned on the random variable $X_s$. The tension between the coder and the decoder is solved by imposing a symmetric balance between their associated efforts [see Fig. 1(a)]. That is, the coder sends as many bits as the additional bits the observer needs to perfectly reconstruct $X_\Omega$:
$$H(X_s) = H(X_\Omega|X_s). \qquad (2)$$
The above ansatz is the mathematical formulation of the symmetric balance between the efforts of the coder and the decoder. We will refer to this equation as the symmetry condition and, as pointed out in [11], it mathematically describes how the communicative tension is solved by using a cooperative strategy between the coder and the decoder agents. It is worth noting that different equations sharing the same spirit were formerly proposed within the framework of the so-called code-length game [10].
From Eq. (2), we can state that
$$H(X_\Omega, X_s) = 2 H(X_s).$$
Knowing the classical inequalities
$$H(X_\Omega) \le H(X_\Omega, X_s) \le H(X_\Omega) + H(X_s),$$
we reach a general relation between the informative richness of the input variable $X_\Omega$ and the informative richness of the messages sent by the coder, constrained by Eq. (2):
$$\frac{1}{2} H(X_\Omega) \le H(X_s) \le H(X_\Omega). \qquad (3)$$
The first relation becomes an equality only when P performs a deterministic codification process. The second relation becomes an equality when the coding device performs completely random associations. It is clear that Eqs. (2) and (3) alone cannot explain the emergence of Zipf's law, since one could tune the parameters of, say, an exponential distribution to reach the desired relation between entropies. Therefore, we need to introduce another ingredient to obtain Zipf's law as the unique possible solution to our problem.
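The entropies entering the symmetry condition can be checked numerically on a toy code. The following is a minimal sketch, not part of the original derivation: the function names, the four equiprobable input events, and the two-signal deterministic coding matrix are all hypothetical choices made for illustration. The conditional entropy is obtained as the joint entropy minus the entropy of the signals.

```python
import math


def entropy(dist):
    """Shannon entropy, in bits, of a probability vector (zero entries are skipped)."""
    return -sum(x * math.log2(x) for x in dist if x > 0)


def code_entropies(p, P):
    """p[k] is the input probability of m_k; P[k][i] is the probability of s_i given m_k.

    Returns H(input), H(signals), H(joint) and H(input | signals).
    """
    n_inputs, n_signals = len(p), len(P[0])
    joint = [p[k] * P[k][i] for k in range(n_inputs) for i in range(n_signals)]
    q = [sum(p[k] * P[k][i] for k in range(n_inputs)) for i in range(n_signals)]
    h_in, h_out, h_joint = entropy(p), entropy(q), entropy(joint)
    return h_in, h_out, h_joint, h_joint - h_out


# Hypothetical toy code: four equiprobable events, two signals, deterministic mapping.
p = [0.25, 0.25, 0.25, 0.25]
P = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
h_in, h_out, h_joint, h_cond = code_entropies(p, P)

# Here H(signals) = H(input | signals) = 1 bit, so the symmetric balance holds,
# and H(signals) sits at the deterministic lower bound H(input) / 2.
print(h_in, h_out, h_joint, h_cond)
```

In this toy case each signal is shared by two events, which is exactly the kind of ambiguity the symmetric balance tolerates: the decoder needs one extra bit per event, the same amount the coder sends.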

B. Evolution
The unicity of the solution is provided by the evolution, which is now explicitly introduced [see Fig. 1(b)]. Let us suppose that our communicative success grows over time, thereby increasing the number of input symbols that P can encode. Formally, this implies that the cardinality of the set $\Omega$ defined above increases. We introduce this feature by defining a sequence of sets $\Omega(1), \ldots, \Omega(k), \ldots$ satisfying an inclusive ordering; that is,
$$\Omega(1) \subset \Omega(2) \subset \cdots \subset \Omega(k) \subset \cdots,$$
which is introduced, without any loss of generality, assuming that
$$\Omega(1) = \{m_1\},\ \Omega(2) = \{m_1, m_2\},\ \ldots,\ \Omega(k) = \{m_1, \ldots, m_k\},\ \ldots$$
At time step $n$, P will be able to process the $n$ symbols of $\Omega(n)$. The elements $m_1, \ldots, m_i, \ldots$ are members of some infinite countable set $\tilde{\Omega}$ [i.e., $(\forall i)(\Omega(i) \subset \tilde{\Omega})$]. $\tilde{\Omega}$ can be understood, using a thermodynamical metaphor, as a reservoir of information. Following this characterization, we say that, for every set $\Omega(i)$, there is a random variable $X_\Omega(i)$, taking values in $\Omega(i)$ following the ordered probability distribution $p_i$. Furthermore, we assume that there exists a unique $\mu \in (0,1)$ such that $(\forall \epsilon > 0)(\exists N) : (\forall n > N)$,
$$\left| \frac{H(X_\Omega(n))}{\log n} - \mu \right| < \epsilon. \qquad (4)$$
This means that the entropy of the input set is unbounded when its size increases, which implies that the potential input set $\tilde{\Omega}$ acts as an infinite reservoir of information.
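The behavior of the output set at stage $n$ is described by a random variable $X_s(n)$, which follows the ordered probability distribution $q_n$, as defined in Eq. (1). Since the symmetry condition must hold at every step, Eq. (3) gives $\frac{1}{2} H(X_\Omega(n)) \le H(X_s(n)) \le H(X_\Omega(n))$ and, together with Eq. (4), the normalized entropy of the output is characterized by a constant $\nu$, with $\mu/2 \le \nu \le \mu$, such that
$$\lim_{n \to \infty} \frac{H(X_s(n))}{\log n} = \nu. \qquad (6)$$
The above equation depicts a crucial fact for the forthcoming derivations: If the potential informative richness of the input set is unbounded, so is the informative richness of the output set, under the constraints imposed by the symmetry condition [see Eq. (2)].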

III. THE EMERGENCE OF ZIPF'S LAW UNDER THE MDIP
The MDIP is presented in this section as the variational principle guiding the evolution of the code. As we shall see at the end of this section, the consequences of its application result in a proof of unicity for the emergence of Zipf's law in evolving codes.

A. The MDIP and its consequences for the evolution of codes
The question is thus how the probability distribution $q_n$ evolves along the growth process. Under the MDIP we face a variational problem which is stated as follows: During the growth process, the most likely code at step $n + 1$ is the one minimizing the divergence with respect to the code at step $n$. Furthermore, the evolution of the code must satisfy, along all the evolutionary steps, the symmetry condition depicted by Eq. (2). The crucial contribution of the MDIP is that it naturally introduces the footprints of the path dependence imposed by evolution. Following the thermodynamical metaphor, this variational principle acts, in our context, as a principle of energy minimization acting over the transitions between successive codes. Putting it formally, let
$$D(q_n \| q_{n+1}) = \sum_{i \le n} q_n(s_i) \log \frac{q_n(s_i)}{q_{n+1}(s_i)}$$
be the Kullback-Leibler divergence of the distribution $q_{n+1}$ with respect to the distribution $q_n$ [22]. Therefore, the MDIP is achieved by minimizing the following functional [17]:
$$\mathcal{L}(q_{n+1}) = D(q_n \| q_{n+1}) + \lambda \left( \sum_{i \le n+1} q_{n+1}(s_i) - 1 \right).$$
We observe that this functional has a role equivalent to the one attributed to the Lagrangian function in a given continuous, differentiable system; therefore, the trajectories minimizing it will govern the evolution of the system. Furthermore, the symmetry condition on coding-decoding [Eq. (2)] imposes that the solutions must lie in the region defined by Eq. (6). The minimum of $\mathcal{L}$ is found when $q_{n+1}$ satisfies, for all $i \le n$,
$$q_{n+1}(s_i) = \frac{1}{\lambda}\, q_n(s_i), \qquad (7)$$
where $\lambda$ is the Lagrange multiplier, which is a positive unique constant for all elements of the probability distribution $q_{n+1}$. We observe that, for $\lambda = 1$, $D(q_n \| q_{n+1}) = 0$, but, in this case, $H(X_s(n)) = H(X_s(n + 1))$, in contradiction to the assumption provided by Eq. (6), according to which informative richness grows during the evolutionary process. Now we want to find the asymptotic behavior of $q_n$, $n \to \infty$, under the above-justified conditions (6) and (7). The key feature we derive from the path dependency in the evolution imposed by the MDIP is that the following quotient does not depend on $n$:
$$f(k, k+j) = \frac{q_n(s_{k+j})}{q_n(s_k)}. \qquad (8)$$
Therefore, along the evolutionary process, as soon as $q_n(s_k), q_n(s_{k+j}) > 0$,
$$\frac{q_{n+1}(s_{k+j})}{q_{n+1}(s_k)} = \frac{q_n(s_{k+j})}{q_n(s_k)} = f(k, k+j).$$
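The path dependence described above can be illustrated with a single growth step. The sketch below assumes that the MDIP solution rescales every existing signal probability by one common factor and that the mass left over is assigned to the newly added signal; the names `grow_code` and `scale`, and the three-signal starting code, are hypothetical. Under that assumption, quotients between existing probabilities are untouched and the Kullback-Leibler divergence between successive codes equals minus the logarithm of the common factor.

```python
import math


def kl_divergence(q_old, q_new):
    """D(q_old || q_new) in nats, summed over the signals present in q_old."""
    return sum(a * math.log(a / b) for a, b in zip(q_old, q_new) if a > 0)


def grow_code(q_old, scale):
    """One growth step: rescale old probabilities by `scale`, add one new signal.

    `scale` (an illustrative parameter, 0 < scale < 1) is the common factor
    applied to all existing signals; the new signal receives 1 - scale.
    """
    if not 0.0 < scale < 1.0:
        raise ValueError("scale must lie strictly between 0 and 1")
    return [scale * x for x in q_old] + [1.0 - scale]


q_n = [0.5, 0.3, 0.2]            # hypothetical code at step n
q_next = grow_code(q_n, 0.9)     # code at step n + 1

# The quotient between two existing signals does not change with the step.
print(q_n[1] / q_n[0], q_next[1] / q_next[0])
# The divergence between successive codes equals -log(scale).
print(kl_divergence(q_n, q_next), -math.log(0.9))
```

Because every later step again rescales by a common factor, the quotient between any two signals is frozen from the moment both exist, which is why the shape of the tail is fixed by the history of the code.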

B. The emergence of Zipf's law
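The asymptotic behavior of the quotient $f$ and, thus, the tail of $q_n$ is strongly constrained by the entropy restriction provided by Eq. (6) [12]. As we shall see, the key to the forthcoming derivations is the convergence properties of the normalized entropies of a random variable $X$ having $n$ possible states whose (ordered) probabilities follow a power-law distribution function; namely, $g(s_i) \propto i^{-\gamma}$. The explicit form of these entropies is
$$\frac{H(X)}{\log n} = -\frac{1}{\log n} \sum_{i \le n} \frac{i^{-\gamma}}{Z_\gamma} \log \frac{i^{-\gamma}}{Z_\gamma} = \frac{\gamma}{Z_\gamma \log n} \sum_{i \le n} \frac{\log i}{i^{\gamma}} + \frac{\log Z_\gamma}{\log n}, \qquad (9)$$
where, consistently, $Z_\gamma = \sum_{i \le n} i^{-\gamma}$ is the normalization constant. The first observation is that the convergence properties of the Riemann $\zeta$ function on $\mathbb{R}^+$ [23] strongly constrain the convergence properties of the entropy of a given probability distribution [12]. Indeed, we find that, if $(\forall \delta > 0, n > m)(\exists N)$ such that, for $m > N$,
$$f(m, m+1) < \left( \frac{m}{m+1} \right)^{1+\delta},$$
then $(\exists C < \infty \in \mathbb{R}^+)$ such that $(\forall n)[H(X_s(n)) < C]$, which contradicts the assumptions of the problem depicted by Eq. (6). Indeed, one can observe that the above statement implies that $q_n$ is dominated by a power law having exponent $1 + \delta$; that is, that $q_n$ decays faster than $q'_n$, which is defined as
$$q'_n(s_i) = \frac{1}{Z_{1+\delta}}\, i^{-(1+\delta)},$$
where $Z_{1+\delta}$ is the normalization constant. Now, we write the explicit form of the entropy of $X'_s(n) \sim q'_n$, denoted $H(X'_s(n))$, when $n \to \infty$ by multiplying the expression derived in Eq. (9) by $\log n$:
$$\lim_{n \to \infty} H(X'_s(n)) = \frac{1+\delta}{\zeta(1+\delta)} \sum_{i=1}^{\infty} \frac{\log i}{i^{1+\delta}} + \log \zeta(1+\delta).$$
We observe that all the elements of the above equation are finite constants, since, for $\delta > 0$,
$$\zeta(1+\delta) = \sum_{i=1}^{\infty} \frac{1}{i^{1+\delta}} < \infty \quad \text{and} \quad \sum_{i=1}^{\infty} \frac{\log i}{i^{1+\delta}} < \infty.$$
Thus, having $q'_n$ as defined above, $\lim_{n \to \infty} H(X'_s(n)) < \infty$.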
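Therefore, during the growth process, due to the constraint imposed by Eq. (6),
$$f(m, m+1) > \left( \frac{m}{m+1} \right)^{1+\delta}, \qquad (10)$$
with $\delta$ arbitrarily small, provided that $n$ can increase unboundedly. Otherwise, its normalized entropy [see Eq. (9)] will have as an asymptotic value $H(X_s(n))/\log n \to 0$, in contradiction to the assumption that $\nu > 0$ as depicted in Eq. (6). Furthermore, we observe that, if $(\forall \delta > 0, n > m)(\exists N)$ such that, for $m > N$,
$$f(m, m+1) > \left( \frac{m}{m+1} \right)^{1-\delta}, \qquad (11)$$
then $\lim_{n \to \infty} H(X_s(n))/\log n = 1$, again in contradiction to Eq. (6), except in the extreme pathological case where $\nu = 1$, when the coding process is completely noisy. To see how we reach this latter point we observe that statement (11) implies that $q_n$ is not dominated by a power-law probability distribution $q'_n$ having exponent $1 - \delta$; namely,
$$q'_n(s_i) = \frac{1}{Z_{1-\delta}}\, i^{-(1-\delta)},$$
where $Z_{1-\delta}$ is the normalization constant. Writing explicitly the expression of the normalized entropy [see Eq. (9)] for the random variable $X'_s(n)$, one obtains
$$\lim_{n \to \infty} \frac{H(X'_s(n))}{\log n} = \lim_{n \to \infty} \left( \frac{1-\delta}{Z_{1-\delta} \log n} \sum_{i \le n} \frac{\log i}{i^{1-\delta}} + \frac{\log Z_{1-\delta}}{\log n} \right) = 1,$$
which is the desired result. Accordingly, since from Eq. (6) $\nu$ is generally different from 1,
$$f(m, m+1) < \left( \frac{m}{m+1} \right)^{1-\delta}. \qquad (12)$$
Thus, combining Eqs. (10) and (12), we have shown that the asymptotic solution is bounded by the following chain of inequalities:
$$\left( \frac{m}{m+1} \right)^{1+\delta} < f(m, m+1) < \left( \frac{m}{m+1} \right)^{1-\delta}.$$
The crucial step is that it can be shown that, if $n \to \infty$, we can set $\delta \to 0$.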
(The mathematical technicalities of this result can be found in [12].) This implies, in turn, that, for $n \gg 1$,
$$f(m, m+1) \approx \frac{m}{m+1},$$
and, from the definition of $f$ provided in Eq. (8), we conclude that
$$q_n(s_i) \propto i^{-1},$$
which leads us to Zipf's law as the unique asymptotic solution. In Fig. 2 we numerically explored the behavior of the rank probability distribution of signals belonging to a growing code under the assumption of symmetry in coding-decoding provided by Eqs. (2) and (6), and the MDIP, whose consequences for the evolution of $q_n$ are depicted in Eq. (7). The outcome agrees with the mathematical derivations, showing well-defined power laws with exponents close to 1, even though the convergence values $\nu$ range from 0.2 to 0.5. This numerical validation shows that the predicted asymptotic effects (i.e., the convergence of $q_n$ to Zipf's law) are clearly visible even in finite simulations where $10^5$ signals are at work.
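The entropy argument of this section can be probed numerically with truncated power laws. The sketch below is an illustration only and does not reproduce the simulations of Fig. 2: the function names and the chosen exponents and sizes are assumptions. It computes the normalized entropy of a power-law distribution over n ranks for an exponent below, at, and above 1.

```python
import math


def power_law_entropy(n, gamma):
    """Entropy, in nats, of probabilities proportional to i**(-gamma), i = 1..n."""
    weights = [i ** (-gamma) for i in range(1, n + 1)]
    z = sum(weights)  # normalization constant of the truncated power law
    return -sum((w / z) * math.log(w / z) for w in weights)


def normalized_entropy(n, gamma):
    """Entropy divided by log n, the quantity constrained along the growth."""
    return power_law_entropy(n, gamma) / math.log(n)


# Illustrative exponents and sizes (assumptions, not values from the text).
for gamma in (0.5, 1.0, 1.5):
    row = [round(normalized_entropy(n, gamma), 3) for n in (10**2, 10**3, 10**4, 10**5)]
    print(gamma, row)
```

For an exponent above 1 the entropy saturates, so the normalized entropy keeps shrinking as n grows; for an exponent below 1 it drifts toward 1. Only around exponent 1 does the entropy keep growing with log n while staying strictly below it, although at these finite sizes the convergence is slow.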
We end this section with a remark on the boundary conditions needed for the emergence of Zipf's law. In Sec. II B, we imposed that the potential information richness of the source must be unbounded. Such a condition is mathematically stated by Eq. (4). We observe that, more than an assumption, Eq. (4) is a boundary condition under which a growing code can (asymptotically) exhibit Zipf's law.² In this way, since $H(X_s(n))$ has a linear relation with $H(X_\Omega(n))$, the divergence of the latter implies the divergence of the former. And it is a required condition, since the entropy of a system exhibiting a power law with an exponent equal to 1 diverges with $n$. Otherwise, exponents are higher, or other probability distributions can emerge.
² We notice that Eq. (4) depicts a linear relation between $H(X_\Omega(n))$ and $\log n$; that is, $H(X_\Omega(n)) \sim \mu \log n$. There are strong reasons to believe that one could generalize this statement by saying that the only condition needed is that, even if $\lim_{n \to \infty} H(X_\Omega(n))/\log n = 0$, if $H(X_\Omega(n))$ is a monotonic, growing, and unbounded function of $n$, then Zipf's law would emerge using arguments similar to the ones used in this paper. The lack of a rigorous demonstration for this latter point forces us to restrict our arguments to the region of application of Eq. (4).

IV. DISCUSSION
The results provided in our study define a general rationale for the emergence of Zipf's law in the abundance of signals of evolving communication systems. The variational approach taken here as a formal picture of the least effort hypothesis has two ingredients. First, starting from Zipf's conjecture, we reach a static symmetry equation to solve the communicative tension between coder and decoder. This is consistent with previous work, but proves insufficient to derive Zipf's law as the unique solution, for it is easy to check that static equations of the kind of Eqs. (2) and (3) have infinitely many solutions, even in the asymptotic regime, due to the possible parametrizations of the solutions. Secondly, and crucially, we consider that the code evolves through time and that, consistently, there is a path dependence in its evolution, which is mathematically stated by imposing a variational principle, the MDIP, between successive states of the code. It is only by imposing evolution (and thus, path dependence) that we reach Zipf's law as the only asymptotic solution. Therefore, the origin of the power law with exponent γ = 1 derives from three complementary and very general conditions: (1) the unbounded informative potential of the code, (2) the loss of information resulting from the symmetry condition, depicted in Eq. (2), and (3) evolution, and its associated path dependence, variationally imposed by the application of the MDIP over successive states of the evolution of the system.
There is another very interesting point, intimately tied to a code exhibiting Zipf's law and, more specifically, to the consequences of the symmetry condition, the mathematical ansatz which abstractly encodes Zipf's hypothesis of vocabulary balance: the presence of an inevitable ambiguity in the code. It is a common observation that natural languages are ambiguous; namely, that linguistic utterances or parts of linguistic utterances can be assigned more than one interpretation. If the principle of least effort is at work and, thus, a cooperative strategy exists between the coder and the decoder, then the presence of a certain amount of ambiguity is expected, provided that the speaker tends to assign more than one meaning to certain signals. Therefore, ambiguity is a byproduct of efficient communication rather than a fingerprint of poor communicative design.
The presented framework is general, and rigorously demonstrates that Zipf's law is a natural outcome of a broad class of communication systems evolving under coding-decoding tensions. In other words, Zipf's law emerges in a system where the coder and decoder coevolve under a general problem of energy minimization. The range of application to real-world phenomena, however, must be checked against the validity of the data, for it has been pointed out that many supposed power-law behaviors show deviations when the statistical analysis is performed accurately [24,25]. It should be noted, however, that a deviation from the predicted behavior need not necessarily be attributed to a failure of the framework. One should take into account that other constraints, such as general memory limitations, can play a role in shaping the final distribution.