In silico evidence that protein unfolding is as a precursor of the protein aggregation

We present a computational study on the folding and aggregation of proteins in aqueous environment, as function of its concentration. We show how the increase of the concentration of individual protein species can induce a partial unfolding of the native conformation without the occurrence of aggregates. A further increment of the protein concentration results in the complete loss of the folded structures and induces the formation of protein aggregates. We discuss the effect of the protein interface on the water fluctuations in the protein hydration shell and their relevance in the protein-protein interaction.


Introduction
Proteins cover a range of fundamental functions in the human body: i) the enzymes and hormones are proteins; ii) proteins can carry other biomolecules within the cellular environment; iii) proteins are a source of energy; iv) proteins are necessary to build and repair tissues [1]. A protein is synthesized in the ribosome and, despite the fact that the cellular environment is very crowded, it is capable to reach its native conformation (mostly dictated by the protein sequence). This process is usually spontaneous-at least for small protein-or is driven by complex interactions with other biomolecules, like the chaperones.
Proteins can aggregate after they folded in the native state -through the formation of chemical bonds or self-assembling -or via unfolded intermediate conformations and their propensity to aggregate is related to a series of factors, like the flexibility of the protein structure [2] or the sub-cellular volume where the protein resides [3]. In particular, non-native protein aggregates are commonly formed through a multi-step process and are composed by nativelike-partially folded intermediate structures [4,5,6,7]. Inappropriate protein aggregation represents a crucial issue in biology and medicine, being associated to a growing number of diseases such as Alzheimer's and Parkinson's disease [8,9,10,11]. In order to guarantee the correct biological functions, proteins have evolved to have a low enough propensity to aggregate within a range of protein expression required for their biological activity, but with no margin to respond to external factors increasing/decreasing their expression/solubility [12,13,3]. Indeed, protein aggregation is mostly unavoidable when proteins are expressed at concentrations higher than the natural ones.
The mechanisms leading to the failure of the folding process and to the formation of potentially dangerous protein aggregates are matter of large scientific debate [14], where computational tools have largely contributed to elucidate some crucial aspects. Nevertheless, to date an extensive computational study of protein aggregation with all-atom simulations including the solvent explicitly remains not practicable, making the coarse-grain approach a valid tool to rationalize those complex systems [15,16]. In particular, lattice models have been largely exploited to address fundamental questions on protein folding and aggregation [17,18,19,20,21,22,23,24,25,26,27]. According to these studies, the presence of more than one chain leads to aggregate-although each protein contains a considerable fraction of native structure-with consequent loss of the funnel-like free-energy landscape [17,19,24].
All these studies, usually performed with a fixed sequence [17,15] of with Go-like models [19,20], miss the explicit contribution of water, which instead is supposed to play an important role in the protein-protein recognition and aggregation [28,29,30,31,32,33]. Moreover, works implicitly accounting for water show that proteins with hydrophobic amino acids on the surface are prone to aggregate [25], although in nature many proteins present a considerable fraction of hydrophobic and non-polar amino acids on their native surface.
Here we present a computational study on the folding, stability and aggregation of proteins optimized according to the environment. We consider a series of native protein structures and for each we determine one or more sequences designed to make the protein fold into the aqueous environment [34]. Each sequence exhibits a different ratio between the number of hydrophilic amino acids exposed to the solvent and the number of hydrophobic amino acids buried into the core of the protein in its native conformation. For each protein, we study its capability to fold as function of its concentration. We show that the propensity to aggregate is not strictly related to the hydrophobicity of the protein surface.

Simulation scheme
To perform this study we adopt a coarse-grained lattice representation of proteins which is computationally affordable and has been largely adopted in leterature [35,36,37,38,39,40,34]. A protein is represented as a self-avoiding heteropolymer, composed by 20 amino acids, interacting each other through a nearest-neighbour potential given by the Miyazawa Jernigan interaction matrix [41,42,43] The protein is embedded in water, explicitly modeled via the many-body water model which has been proven to reproduce, at least qualitatively, the thermodynamic and dynamic behavior of water [44,45,46,47,48], including its interplay with proteins [49,50,39,51,34]. The coarse-grain representation of the water molecules, adopted to describe water at constant number of molecules N , constant temperature T and constant pressure P , replaces the coordinates and orientations of the water molecules by a continuous density field and discrete bonding variables, respectively. The discrete variables describe the local hydrogen-bond (HB) formation and its cooperativity, leading to a local open-tetrahedral structure of the water molecules.
Since the protein is composed by hydrophilic ζ and hydrophobic Φ amino acids, we assume that the first interact with water decreasing the local energy, while the second affect the water-water HB in the Φ hydration shell 2 . In particular, we assume that i) the water-water HB at the Φ interface are stronger than HB formed in the bulk consistent with the observation that water-water HBs in the Φ hydration shell are more stable and more correlated with respect to the bulk HBs [52,53,54,55,56,57]; ii) the local density fluctuations at the Φ interface are reduced upon pressurization, as observed in [58,59,60,61].
A detailed description of the model is reported in the nexs section, and in Ref. [39,51,34].
We consider 8 different proteins, which we label as A 0 , A 1 , A 2 , B, C, D, E and F , which native states are shown in Fig. 1. Each capital letter in the protein label identifies a different native structure, while different subscript numbers refer to different sequences associated to the structure. Therefore, proteins A 0 , A 1 and A 2 share the same native structure, but have a different sequences. 1 The matrix has been scaled of a factor 2, increasing the effective amino acid-amino acid interaction, to account for a lower surface-volume ratio in two dimensions. 2 The hydration shell is defined by the water molecules which are first-neighbors of the amino acids.  amino acids on the surface (core) and the total number of amino acids exposed to the solvent (buried into the core) when the protein attains its native conformation.
All the native structures have been selected considering maximally compact conformations, composed by 36 or 49 amino acids. Then, for each native structure, the protein sequence has been established through a design scheme, based on the standard approach introduced by Shakhnovich and Gutin [62,63] and successfully adopted to design realistic off-lattice proteins [64,65,66,67], but accounting explicitly for the water properties in the protein hydration shell [34].
We perform a Monte Carlo simulations in the isobaric-isothermal ensemble at ambient conditions, keeping fixed the protein conformation in its native state and mutating the amino acids, to explore the phase space of sequences. For each sequence the surrounding water is equilibrated and the average enthalpy H of the hydrated protein (residue-residue energy plus the average enthalpy of the water molecules in the hydration shell) is computed. The sequence to whom correspond the minimum value of H is selected as best folder. The design scheme leads to sequences which are not perfectly hydrophilic on the surface and hy-drophobic into the core, consistent with what is observed in real proteins [68,69].
The hydropathy of the designed protein surface and core is shown in Fig. 2, while the full amino acid composition composition of each sequence is shown in Fig. 6. It is worth to be noted that all the sequences generated differ each other, exhibiting different values for the hydrophilicity (hydrophobicity) of the protein surface (core), irrespective of the native structure, and the maximum overlap between the sequences is of 10 amino acids 3 . Each designed sequence is folded alone at ambient conditions to prove its capability to reach the native state.
Once the proteins have been designed, we simulate the folding of multi-protein systems in a range of concentrations c ∈ [1%, 55%], considering homogeneous solutions, i.e. when all the sequences are equal. Along the simulations we compute the free energy landscape as function of the total number of native contacts N c and inter-protein contacts I c to study the folding-aggregation competition.

Water Model
The coarse-grain representation of the water molecules replaces the coordinates and orientations of the water molecules by a continuous density field and discrete bonding variables, respectively. The density field is defined on top of a partition of the volume V into a fixed number N of cells, each with volume v ≡ V /N ≥ v 0 , being v 0 ≡ r 3 0 the water excluded volume and r 0 ≡ 2.9Å the water van der Waals diameter. The size of a cell r ≥ r 0 is a stochastic variable and coincides, by construction, with the average distance between first-neighbour water molecules. The general formulation of the model envisages to each cell i an index n i = 1 or n i = 0 according to the size r (which varies a lot from the gas phase to the super-cooled one), to distinguish when the molecule can form hydrogen bonds (HBs) or not, respectively. Here, since we perform the study at ambient conditions, we assume that all the molecules can form HB, 3 The maximum overlap between two sequences is computed shifting and overlapping one sequence with respect to the other, and counting the number of amino acids on both sequences which coincides along the overlapped region.
The Hamiltonian of the bulk water is The first term, summed over all the water molecules i and j at oxygen-oxygen distance r ij , is given by 4 [(r 0 /r) 12 − (r 0 /r) 6 ] for r 0 < r < 6r 0 , U = ∞ for r ≤ r 0 , and U = 0 for r ≥ 6r 0 (cutoff distance). We fix = 2.9 kJ/mol. of this angle is associated to a bonded state. Fixing q = 6 we correctly account for the entropic loss due to the HB formation.
The third interaction term in Eq. (1) corresponds to the cooperative interaction of the HBs due to the oxygen-oxygen-oxygen correlation. This effect originates from quantum many-body interactions of the HB [73] and in the bulk leads the molecules toward an ordered tetrahedral configuration [74]. This term is modelled as an effective interaction-with coupling constant J σ -between each of the six different pairs of the four indexes σ ij of a molecule i. Hence, we have coop ≡ ikl δ σ ik ,σ il which defines the cooperativity of the water molecules. By assuming J σ J we guarantee the asymmetry between the two terms [44]. HB /v 0 represents the average volume increase between high-density ices VI and VIII and low-density ice Ih [44]. Hence, the volume of bulk molecules is given by The water-water hydrogen bonding in the protein hydration shell depends on the hydrophobic (PHO) or hydrophilic (PHI) nature of the hydrated amino acids, and is described by the Hamiltonian where N PHO HB , N PHI HB and N MIX HB indicate respectively the number of HB formed between two molecules hydrating two hydrophobic amino acids, two hydrophilic amino acids, one hydrophobic amino acid and one hydrophilic amino acid. Analogously N PHO coop , N PHI coop and N MIX coop represent the cooperative bonds at the hydrophobic, hydrophilic and mixed interface.
The hydrophobic interface strengthens the water-water hydrogen bonding in the first hydration shell [75,76,56,54] and increases the local water density upon pressurization [54,77,78,79]. The first effect is included by assuming J PHO > J and J PHO σ > J σ . This condition guarantees that the solvation free energy of a hydrophobic amino acid decreases at low temperature T [80]. The second one is accounted assuming that the volume associate to the HB at the PHO interface decreases upon increasing P , v PHO HB /v PHO HB,0 ≡ 1 − k 1 P [39]. In this way, the density fluctuations at the PHO interface are reduced at high P . The volume contribution V PHO to total volume V due the HBs in the hydrophobic shell is V PHO ≡ N PHO HB v PHO HB . We assume that the water-water hydrogen bonding and the water density at the hydrophilic interface are not affected by the protein. Therefore, J PHI = J, HB . Finally, we fix J MIX ≡ (J PHO + J PHI )/2 and J MIX σ ≡ (J PHO σ + J PHI σ )/2. Lastly, we assume that the protein-water interaction energy is −ε PHO or −ε PHI , depending if the residue is hydrophobic or hydrophilic, respectively. As reported in Ref. [34], we express the model parameters in units of 8 , and fix the value with respect to a three-dimensional system.
All the results presented in this work have been tested under the change of parameters. In particular, we have decreased the effect of the protein interface on the water-water interaction observing a decrease in the concentration thresholds at which the proteins unfold and aggregate, but the phenomenology observed is substantially the same.

Folding vs Aggregation in homogeneous protein solutions
In Fig. 3 we show the free energy landscape of proteins A 0 , B and C as function of N c and I c simulated in a concentration range c ∈ [1%, 55%]. In all the cases we observe that for low concentrations, c 5%, the minimum of the free energy correspond to N c = 1 and I c = 0, i.e. all the proteins reach their native folded state and, on average, are not in contact to each other.
By looking at the separate free energy profile as function of N c (Fig. 4a,b,c) and I c (Fig. 4d,e,f) (obteined integrating the free energy profiles shown if Fig.   3 along the axes I c and N c respectively), respectively indicated with F (N c ) and    part of its native contacts leading to F min (N c ) for 0.8 N c < 1 while the aggregated state is still less favourable being F min (I c ) for I c = 0. The peculiar characteristic of the U N F state is that there are no inter-protein contacts (I c = 0 remains by far the lowest free energy minima Fig.4b).
In Fig. 5 we prove that, for protein isolated pairs, the unfolding starts before the residues can interact directly. Since the proteins are not interacting directly, and there are no long-range interactions in the model the logical conclusion is that the water is mediating the interaction that stabilises the misfolded states compared to the folded one. When we switched off the water terms in the model the U N F state disappears, and the systems go directly into the AGG state at even lower concentrations c (see Fig. 8 in the Supplementary Information).
Hence, it is clear that the water is creating a barrier against aggregation.
Such an unexpected role of the water has to the best of our knowledge never been observed before.
The U N F state holds for quite large values of c, where protein gradually unfold by increasing c. Eventually, at very high concentrations (c ≥ 20% for protein A 0 ), we observe the appearance of a clear minimum in the free energy (I c > 0 in Fig.3.d-f) signifying that we reached an aggregated state AGG. The occurrence of aggregates AGG comes with a loss of the native conformations (F min (N c )| Nc<0.8 ) consistent with previous observations [19].
It is important to stress that the concentration thresholds of the F OL → U N F and U N F → AGG transitions, which we indicate with symbols c (i) F OL→U N F and c (i) U N F →AGG with i = A 0 , B, C, depend on the specific sequence (Fig. 4). By comparing the U N F → AGG transition points for proteins A 0 and B (Fig. 4d,e), which have the same fraction of hydrophilic amino acids on the surface and hydrophobic amino acids into the core (Fig. 2), we observe that the protein A 0 is less prone to aggregate with respect to B, since c (A0) On the other hand, by comparing the same transition points between proteins B and C (Fig. 4b,c), we find that both transitions occur at similar values of c within the numerical error, although their surface and core composition are quite different, being the protein C more hydrophobic on the interesting result points out that the propensity to aggregate of proteins is not strictly related to the hydrophobic content of its surface, as long as this amount has been "designed" according to the environment [34].
Similar F OL → U N F and U N F → AGG transitions are observed also in the proteins A 1 , A 2 , D, E and F (not shown here). It is interesting to observe that, although proteins A 0 , A 1 and A 2 share the same native structure (the sequence of each proteins has been obteined with an independent design procedure), the concetration threshold fot the F OL → U N F and U N F → AGG transitions are different in each case.

Water-mediated protein-protein interaction
In this section we focus on the protein-protein interaction mediated by water molecules, for binary systems. In particular, we consider the cases A 0 -A 0 proteins and C 0 -C 0 proteins (homogeneous systems). In Fig. 5 we report the average number of native contact N c 4 as function of the minimum protein distance 5 . We observe that the value of N c is constant for a wide range of of protein distances, with higher or lower values (respectively for the systems A 0 -A 0 and C-C) reflecting the width of the free energy minima and hence the intrinsic stability of the native conformation. The interesting feature in Fig. S5 occurs when N c starts decreasing linearly when the protein gets close to each other. These results demonstrate that the proteins start to unfold before interacting directly. Moreover, the transition distances correlate with the protein stability as A 0 (red square points), being overall more stable than the protein C proaching each other. It is also important to notice that the transition distances are close to the distance between proteins at the F OL → U N F transition concentrations. Finally, the transition distances correlate with the protein stability as A 0 (red square points), being overall more stable than the protein C (blue circle points), show a an interaction radius smaller (∼ 3r 0 ) with respect the one of protein C (∼ 5r 0 ). 4 The average is calculated over all conformations hence the maximum average value will smaller compared to the global minimum that is 1 for all proteins. 5 The minimum protein distance is the minimum value between all the possible distances among any amino acid of the first protein and any amino acid of the second protein.

Conclusions
We have presented a computational study on the competition between folding and aggregation of proteins in homogeneous solutions. However, when globular proteins replace crowding agents, the behaviour of the system becomes difficult to explain because of the influence of protein-protein.
Our results offer a qualitative description of such an influence, separating the role of water, protein and steric interactions at different concentrations.   for the proteins A 3 . We designed the sequence of protein A 3 switching off all the water-water interaction terms in the hydration shell. Protein A 3 is not surprisingly less stable than the sequence designed with explicit water [34]. The data show the disappearance of the U N F state and the direct transition to the AGG state. Moreover, the F OL → AGG transition takes place at much lower concentrations with respect to the case where the hydration water is explicitly accounted for (in the present case as low as 2%). Hence, the hydration water acts as a barrier against the aggregation.