Self-similar community structure in organisations

The formal chart of an organisation is designed to handle routine and easily anticipated problems, but unexpected situations arise which require the formation of new ties so that the corresponding extra tasks can be properly accomplished. The characterisation of the structure of such informal networks behind the formal chart is a key element for successful management. We analyse the complex e-mail network of a real organisation with about 1,700 employees and determine its community structure. Our results reveal the emergence of self-similar properties that suggest that some universal mechanism could be the underlying driving force in the formation and evolution of informal networks in organisations, as happens in other self-organised complex systems.

Although the formal chart of an organisation is intended to prescribe how employees interact, ties between individuals arise for personal, political, and cultural reasons [1]. Understanding the formation and structure of such informal networks is a key element of successful management [1,2,3]. The traditional way of investigating informal networks within organisations consists of conducting surveys using employee questionnaires. However, employees' answers often contain subjective elements such as "political" motives and the worry about offending colleagues. Another significant limitation of questionnaire-based analysis is that time and effort costs make it prohibitively expensive to map the entire network, even for medium-sized organisations. The rapid development of electronic communications provides a powerful alternative to the traditional analysis of informal networks. Indeed, the exchange of e-mails between individuals in organisations reveals how people interact and allows the informal network to be mapped in a non-intrusive, objective, and quantitative way.
We surmise that the exchange of e-mails between individuals in organisations reveals how people interact [4,5] and therefore provides a map of the real network structure behind the formal chart. We analyse the complex e-mail network of a real organisation with about 1,700 employees and determine its community structure [6,7,8,9]. Our results reveal the emergence of self-similar properties that suggest that some universal mechanism could be the underlying driving force in the formation and evolution of informal networks in organisations, as happens in other self-organised complex systems [10].
Every time an e-mail is sent, the addresses of the sender and the receiver are routinely registered in a server. Therefore, an e-mail network can be built by regarding each e-mail address as a node and linking two nodes if there is an e-mail communication between them. In particular, we take as a case study the e-mail network of University Rovira i Virgili (URV) in Tarragona, Spain, containing around 1,700 users (Fig. 1). Bulk e-mails provide little or no information about how individuals or teams collaborate and, once they are removed, the connectivity distribution of the e-mail network is exponential. This result is in contrast with recent findings indicating that some technology-based social networks, such as rough e-mail networks [4], the Instant Messaging Network [11] or the PGP encryption network [12], show heavily skewed degree distributions, but it is consistent with the proposal of Amaral and coworkers that the truncation of the scale-free behaviour in real-world networks is due to the existence of limitations or costs in the establishment of connections [6]. Indeed, it seems plausible that there are costs to maintaining active social acquaintances and therefore active communications. However, it is relatively easy to keep many electronic acquaintances open, although most of them are probably inactive from a social point of view.

FIG. 1: The e-mail network of URV. The network comprises approximately 1,700 users, including faculty, researchers, technicians, managers, administrators, and graduate students. We consider e-mails exchanged between university addresses during the first three months of 2002. Each individual is represented by a node, with two individuals (A and B) being connected if A has sent an e-mail to B and B has also sent an e-mail to A. Bulk e-mails provide little or no information about how individuals or teams collaborate. To minimise their effect: (i) we eliminate e-mails that are sent to more than 50 different recipients and (ii) we disregard links that are unidirectional, that is, we consider only e-mails that represent a real communication link, where e-mails flow in both directions. With these two restrictions, the network is undirected and is formed by a main component comprising 1133 nodes and many isolated nodes or pairs of nodes. These little islands are not plotted, to keep the figure as simple as possible. The colour of each node identifies an individual's affiliation to a specific centre within the university.
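The construction of the network described in the caption of Fig. 1 can be sketched in a few lines. This is only an illustration under stated assumptions: the input format (a list of sender and recipient-list pairs) and the function names `build_network` and `main_component` are hypothetical and do not come from the original study; only the two filtering rules (more than 50 recipients, reciprocated links only) are taken from the text.

```python
from collections import defaultdict


def build_network(messages, max_recipients=50):
    """Build an undirected network from (sender, [recipients]) records.

    Hypothetical input format. The two rules follow the text:
    (i) drop e-mails sent to more than `max_recipients` different recipients,
    (ii) keep a link A-B only if A wrote to B and B wrote to A.
    """
    sent = set()
    for sender, recipients in messages:
        unique = set(recipients)
        if len(unique) > max_recipients:
            continue  # bulk e-mail: carries little information on collaboration
        for receiver in unique - {sender}:
            sent.add((sender, receiver))

    adjacency = defaultdict(set)
    for a, b in sent:
        if (b, a) in sent:  # reciprocated communication only
            adjacency[a].add(b)
            adjacency[b].add(a)
    return dict(adjacency)


def main_component(adjacency):
    """Return the node set of the largest connected component."""
    best, seen = set(), set()
    for start in adjacency:
        if start in seen:
            continue
        component, stack = {start}, [start]
        while stack:
            v = stack.pop()
            for w in adjacency[v] - component:
                component.add(w)
                stack.append(w)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


# Toy log: a<->b and b<->c are reciprocated, a->d is not, d's bulk mail is dropped.
messages = [
    ("a", ["b", "d"]), ("b", ["a", "c"]), ("c", ["b"]),
    ("x", ["y"]), ("y", ["x"]),
    ("d", ["a"] + ["u%d" % i for i in range(60)]),
]
network = build_network(messages)
largest = main_component(network)
```

In this toy log, `d` never appears in the network: its only message to `a` is a bulk e-mail, so the link a-d remains unidirectional and is discarded.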
To understand the structure of the informal network of the organisation, we are interested in determining how individuals interact and form groups that, in turn, interact with each other giving rise to higher-order groups, that is, groups of groups. In other words, we want to unravel the community structure of the network. To do so we use the algorithm proposed recently by Girvan and Newman (GN) [9] to identify communities in complex networks (see Fig. 2). The algorithm proceeds by splitting the network recursively until single nodes are left. The information about the community structure of the original network can be deduced from the topology of the binary tree that represents this splitting procedure, whose leaves correspond to addresses of the e-mail network (Fig. 2b). The different communities of the original network appear as branches in this tree, which are easily identified by visual inspection.

The community binary tree for URV is shown in Fig. 3. Each colour in Fig. 3a corresponds to one centre of the university, that is, to a faculty or college, or to management units such as the office of the Rector of the university. Two properties of the tree are worth noting. First, a clear branching structure emerges, with branches essentially containing nodes of the same colour. This shows that the identification of communities is successful, despite the complexity of the network. Second, the branching structure is far from simple. Indeed, each branch is formed, in general, by a system of nested smaller subbranches that give rise to a complicated structure that visually resembles some self-similar systems in nature such as river networks [13] or diffusion-limited aggregates [14]. For comparison, we also show the tree generated by the GN algorithm from a random graph of the same size and average connectivity as the e-mail network (Fig. 3c). In contrast to the tree for the URV e-mail network, the branching structure is almost trivial, with most of the branches containing only 1 or 2 nodes. This is the expected result for a network that does not have any sort of community structure.
Once the binary tree has been obtained, we look for a quantitative characterisation of the community structure. First we consider the cumulative community size distribution, $P(s)$, i.e. the probability of a community having a size larger than or equal to $s$. Fig. 4a shows how to compute this probability, and the resulting distribution for the e-mail network is presented in Fig. 4d. The distribution is heavily skewed, following a power law behaviour $P(s) \propto s^{-\alpha}$ with $\alpha = 0.48$ between $s = 2$ and $s \approx 100$. Beyond this value, the distribution shows a sharp decay and, at $s \approx 1000$, a cutoff that corresponds to the size of the system. The power law of the community size distribution suggests that there is no characteristic community size in the network (up to $s \approx 100$). To rule out the possibility that this behaviour is due to the community identification algorithm, we also consider the community size distribution for a random graph with the same size and average connectivity as the e-mail network.
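As a minimal sketch of how the cumulative size distribution can be obtained from a community binary tree, the following example represents the tree as nested pairs. The representation, the function names, and the example tree are assumptions made for illustration: the tree is one arrangement consistent with the community counts quoted in the caption of Fig. 4a (three communities of size 2, three of size 3, and one each of sizes 6, 7 and 10), with leaf labels G to J invented for the purpose.

```python
def community_sizes(tree):
    """Return (number of leaves, sizes of all communities) for a nested-pair tree.

    A leaf is any non-tuple value; a community is a pair (left, right) whose
    size is the sum of the sizes of its two offspring.
    """
    if not isinstance(tree, tuple):
        return 1, []
    left_size, left_communities = community_sizes(tree[0])
    right_size, right_communities = community_sizes(tree[1])
    size = left_size + right_size
    return size, left_communities + right_communities + [size]


def cumulative_distribution(sizes):
    """P(s): fraction of communities whose size is larger than or equal to s."""
    total = len(sizes)
    return {s: sum(1 for x in sizes if x >= s) / total for s in sorted(set(sizes))}


# Hypothetical tree consistent with the counts given for Fig. 4a.
example_tree = (
    ((((("A", "B"), "E"), (("C", "D"), "F")), "G")),
    (("H", "I"), "J"),
)
n_leaves, sizes = community_sizes(example_tree)
P = cumulative_distribution(sizes)
```

For a tree obtained from real data, plotting `P` on doubly logarithmic axes is the natural way to look for the power law region described above.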
The characterisation of the community binary tree using the cumulative size distribution has its analogy in the river network literature [13,15,16]. The equivalent measure is the distribution of drainage areas, which represents the amount of water that is generated upstream of a given point (see Fig. 4b). The similarity between the community size distribution of the current e-mail network in Fig. 4d and the area distribution of the Fella river network in Italy reported in Fig. 2 of Ref. [16] is striking. The exponent $\alpha = 0.45$ for the power law region of this river [16] and the average exponent for several rivers, $\alpha_{\mathrm{river}} = 0.43 \pm 0.03$ [15], are very close to the current $\alpha = 0.48$. Moreover, the behaviour shown in Fig. 4d, with first a sharp decay and then a final cutoff, is also shared by river networks, which are known to evolve to a state where the total energy expenditure is minimised [15,17,18]. The possibility that communities within organisations might also spontaneously self-organise into a form in which some quantity is optimised is very appealing and deserves further investigation.
To further understand this point, it is pertinent to ask the question: does the similarity between community trees in organisations and river networks arise just by chance, or are there other emergent properties shared by both? To answer this question we consider a standard measure for categorising binary trees: the Horton-Strahler (HS) index, originally introduced for the study of river networks by Horton [19], and later refined by Strahler [20]. Consider the binary tree depicted in Fig. 4c. The leaves of the tree are assigned a HS index $i = 1$. For any other branch that ramifies into two branches with HS indices $i_1$ and $i_2$, the index is calculated as follows:

$$ i = \begin{cases} i_1 + 1 & \text{if } i_1 = i_2, \\ \max(i_1, i_2) & \text{if } i_1 \neq i_2. \end{cases} $$

Note that the index of a branch changes when it meets a branch with higher index, or when it meets a branch with the same value and both of them join forming a branch with higher index. In terms of communities, the interpretation of the HS index is the following. The index of a community changes when it joins a community of the same index. Consider, for instance, the lowest levels: individuals ($i = 1$) join to form a group (or team, with $i = 2$), which in turn will join other groups to form a second-level group (or department, $i = 3$). Therefore, the index reflects the level of aggregation of communities.

The number of branches $N_i$ with index $i$ can be determined once the HS index of each branch is known. The bifurcation ratios $B_i$ are then defined by

$$ B_i = \frac{N_i}{N_{i+1}}. $$

When $B_i \approx B$ for all $i$, the structure is said to be topologically self-similar, because the overall tree can be viewed as being comprised of $B$ sub-trees, which in turn are comprised of $B$ smaller sub-trees with similar structures, and so forth for all scales. River networks are found to be topologically self-similar with $3 < B < 5$ [14]. We find that the community tree as generated by the process described above is topologically self-similar with $B_i \approx B = 5.76$ (see Fig. 4e). The same analysis for the communities in a random graph shows that topological self-similarity does not hold, since the values $B_i$ are not constant; they fluctuate around a smaller value, 3.46.

FIG. 4: Self-similarity in the community structure. a, Calculation of the community size distribution for a binary tree generated by the community identification algorithm. Black nodes represent the actual nodes of the original graph, while white nodes are just graphical representations of communities that arise as a result of the splitting procedure. Nodes A and B belong to a community of size 2, and together with E form a community of size 3. Similarly, C, D and F form another community of size 3. These two groups together form a higher-level community of size 6. Following up to higher and higher levels, the community structure can be regarded as a set of nested groups. The size, $s_i$, of a community $i$ is just the sum of the sizes of its two offspring $j_1$ and $j_2$: $s_i = s_{j_1} + s_{j_2}$. In this case there are three communities of size 2, three communities of size 3, one community of size 6, one community of size 7, and one community of size 10. Note that a single node belongs to different communities at different levels. b, Calculation of the drainage area distribution for a river network. The drainage area of a given point is the number of nodes upstream of it plus one. For a point $i$ with offspring $j_1$ and $j_2$, $s_i = s_{j_1} + s_{j_2} + 1$. c, Calculation of the Horton-Strahler index. The index of a branch changes when it meets a branch with higher index, or when it meets a branch with the same value and both of them join forming a branch with higher index. In this case, there are 10 branches with index 1, 3 branches with index 2, and 1 branch with index 3. d, The distribution of community sizes, $P(s)$, showing a power law region with exponent $-0.48$, followed by a sharp decrease at $s \approx 100$ and a cutoff corresponding to the size of the system at $s \approx 1000$. The distribution of community sizes in a random network is shown with a dotted line for comparison. e, The number of branches with HS index $i$, as a function of $i$. From the definition of the branching ratio, it is straightforward to show that, when topological self-similarity holds, $N_i = N_1 / B^{i-1}$. A fit of this function to the points obtained for the e-mail community tree yields excellent agreement with $B = 5.76$. A much worse agreement is obtained for the community tree corresponding to the random network, with $B_i$ fluctuating around 3.46.
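The Horton-Strahler bookkeeping can be illustrated with a short sketch. It assumes a binary tree stored as nested pairs and takes the bifurcation ratio as the number of branches of index i divided by the number of branches of index i + 1, which is the form implied by the relation between the branch counts and B quoted in the caption of Fig. 4e. The function names and the example tree are hypothetical; the tree is one arrangement consistent with the branch counts given for Fig. 4c (10, 3 and 1 branches of index 1, 2 and 3).

```python
from collections import Counter


def horton_strahler(tree, branch_counts):
    """Return the HS index of `tree` and count branches per index.

    A new branch is counted at every leaf (index 1) and at every junction of
    two branches with equal index (index + 1). When the indices differ, the
    higher-index branch simply continues, so no new branch is counted.
    """
    if not isinstance(tree, tuple):
        branch_counts[1] += 1
        return 1
    i1 = horton_strahler(tree[0], branch_counts)
    i2 = horton_strahler(tree[1], branch_counts)
    if i1 == i2:
        branch_counts[i1 + 1] += 1
        return i1 + 1
    return max(i1, i2)


def bifurcation_ratios(branch_counts):
    """Ratio of the number of branches of index i to that of index i + 1."""
    return {
        i: branch_counts[i] / branch_counts[i + 1]
        for i in sorted(branch_counts)
        if i + 1 in branch_counts
    }


# Hypothetical tree consistent with the branch counts given for Fig. 4c.
example_tree = (
    ((((("A", "B"), "E"), (("C", "D"), "F")), "G")),
    (("H", "I"), "J"),
)
counts = Counter()
root_index = horton_strahler(example_tree, counts)
ratios = bifurcation_ratios(counts)
```

A tree this small cannot test self-similarity; with real data one would check whether the ratios stay approximately constant across all indices.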
The methods presented here open interesting doors regarding the possibility of mapping the informal network of large organisations in a non-intrusive, objective, and quantitative way. Moreover, the emergence of scaling and self-similarity in the community structure, as well as the similarity with river networks, raises important questions about the mechanisms underlying the interactions between individuals within an organisation. Self-similarity is a fingerprint of the replication of the structure at different levels of organisation, and could be the result of the trade-off between the need for cooperation and the physical constraints on establishing connections at any organisational level. At the same time, the similarity with river networks suggests that a common principle of optimisation (of the flow of information in organisations or of the flow of water in rivers) could be the underlying driving force in the formation and evolution of informal networks in organisations.

FIG. 2: Community identification according to the GN algorithm. a, The betweenness of an edge is defined as the number of minimum paths connecting pairs of nodes that go through that edge [21,22]. The GN algorithm is based on the idea that the edges which connect highly clustered communities have a higher edge betweenness (in this case, edge BE) and therefore cutting these edges should separate communities. The algorithm proceeds by identifying and removing the link with the highest betweenness in the network. After every removal, the betweenness of the edges is recalculated. This process is repeated until the 'parent' network splits, producing two separate 'offspring' networks. The offspring can be split further in the same way until they comprise only one individual. b, In order to describe the entire splitting process, we generate a binary tree, in which bifurcations (white nodes) depict communities and leaves (black nodes) represent individual addresses of the e-mail network. At the beginning of the process, the network is a single entity, represented by node 1 in the tree. After the removal of the edge BE, the network is split into two subnetworks, 2 and 3, containing addresses A to D and E to I respectively. The two offspring networks have no further internal community structure. Consider first subnetwork 2, containing nodes A to D. When all the links are equivalent and have the same betweenness, as in the present case, one of them will be selected at random for removal. It is straightforward to show that, iterating the link removal procedure, nodes will be separated one by one and randomly by the GN algorithm, generating a branch in the binary tree. As an example, the figure represents a situation in which B is separated first, then A, and finally D and C, but a different random selection of links would lead to a different separation order. Similarly, in subnetwork 3 nodes will be separated one by one and at random, except for the fact that the most central node, E, will always be separated last. In general, for large networks in which the probability of having two links with the same betweenness is very small, it will still be true that communities will appear as branches in the community binary tree and that the tips of the branches will correspond to the most central agents in the network.
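The splitting procedure described in this caption can be sketched as follows. This is an illustrative reimplementation, not the code used in the study. Two points are assumptions: the toy network (a fully connected group A to D and a star centred on E, joined by the edge BE) is only a guess at a topology compatible with the caption, and ties in betweenness are broken deterministically here rather than at random. Edge betweenness is computed with a standard breadth-first accumulation of shortest paths.

```python
from collections import deque


def edge_betweenness(adj):
    """Number of shortest paths through each edge (fractional when paths tie)."""
    score = {frozenset((a, b)): 0.0 for a in adj for b in adj[a]}
    for source in adj:
        dist, paths, preds, order = {source: 0}, {source: 1.0}, {source: []}, []
        queue = deque([source])
        while queue:  # breadth-first search counting shortest paths
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w], paths[w], preds[w] = dist[v] + 1, 0.0, []
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
                    preds[w].append(v)
        credit = {v: 0.0 for v in order}
        for w in reversed(order):  # accumulate from the farthest nodes back
            for v in preds[w]:
                share = paths[v] / paths[w] * (1.0 + credit[w])
                score[frozenset((v, w))] += share
                credit[v] += share
    return {edge: s / 2.0 for edge, s in score.items()}  # each pair counted twice


def reachable(adj, start):
    """Set of nodes connected to `start`."""
    seen, stack = {start}, [start]
    while stack:
        v = stack.pop()
        for w in adj[v]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen


def split(adj):
    """Remove highest-betweenness edges until a connected network splits in two."""
    adj = {v: set(neighbours) for v, neighbours in adj.items()}
    start = next(iter(adj))
    while True:
        scores = edge_betweenness(adj)  # recalculated after every removal
        a, b = max(scores, key=scores.get)  # ties: first maximum, not random
        adj[a].discard(b)
        adj[b].discard(a)
        part = reachable(adj, start)
        if len(part) < len(adj):
            rest = set(adj) - part
            return {v: adj[v] for v in part}, {v: adj[v] for v in rest}


def community_tree(adj):
    """Binary tree of splits: leaves are nodes, communities are pairs."""
    if len(adj) == 1:
        return next(iter(adj))
    left, right = split(adj)
    return community_tree(left), community_tree(right)


def leaves(tree):
    """Set of leaf labels below a tree node."""
    if isinstance(tree, tuple):
        return leaves(tree[0]) | leaves(tree[1])
    return {tree}


# Hypothetical topology: clique A-D and star centred on E, bridged by B-E.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("B", "D"), ("C", "D"),
         ("B", "E"), ("E", "F"), ("E", "G"), ("E", "H"), ("E", "I")]
adjacency = {}
for a, b in edges:
    adjacency.setdefault(a, set()).add(b)
    adjacency.setdefault(b, set()).add(a)
tree = community_tree(adjacency)
first_split = sorted(sorted(leaves(side)) for side in tree)
```

In this toy network the bridge BE carries all 20 shortest paths between the two groups, so it is removed first and the root of the tree separates A to D from E to I. The recursion depth grows with the number of nodes, so an iterative version would be preferable for networks of realistic size.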

FIG. 3: Communities in the e-mail network of URV. a, Binary tree showing the result of applying the GN algorithm to the e-mail network of URV. The position indicated by the arrow represents the root of the tree (equivalent to node 1 in Fig. 2b) and branches are depicted so that they can be clearly differentiated. In particular, only the leaves of the tree, which correspond to e-mail addresses, are plotted, as shown in the zoomed detail. The colour of each of the leaves represents a different centre within the university (five small centres containing fewer than 10 individuals are assigned the same colour). Nodes of the same colour (from the same centre) tend to stick together in the same branch, meaning that individuals within the same department tend to communicate more, and that the algorithm is capable of resolving separate centres to a good degree of accuracy. The complicated branching structure resembles self-similar systems in nature such as river networks or diffusion-limited aggregates. b, Same as before but without showing the leaves. Branches are now coloured according to their Horton-Strahler index (see text). c, Binary tree showing the result of applying the GN algorithm to a random graph with the same size and connectivity as the e-mail network. The lack of community structure is reflected in the absence of branches in the tree, which contrasts with the intricate self-similar structure of a and b. Again, colours correspond to Horton-Strahler indices.