Chromophore-protein coupling beyond non-polarizable models: Understanding absorption in green fluorescent protein

The nature of the coupling of the photoexcited chromophore with the environment in a prototypical system like green fluorescent protein (GFP) is to date not understood and its description still defies state-of-the-art multiscale approaches. To identify which theoretical framework of the chromophore-protein complex can realistically capture its essence, we employ here a variety of electronic-structure methods, namely, time-dependent density functional ∗To whom correspondence should be addressed †MESA+ Institute for Nanotechnology, University of Twente, P.O. Box 217, 7500 AE Enschede, The Netherlands ‡Universitat de Barcelona, Av. Joan XXIII, s/n 08028, Barcelona, Spain ¶University of Siena, Via A. Moro, 2, 53100 Siena, Italy §University of Pisa, Via Giuseppe Moruzzi 3, 56124 Pisa, Italy ‖MESA+ Institute for Nanotechnology, University of Twente, P.O. Box 217, 7500 AE Enschede, The Netherlands


Introduction
The development of multiscale methods to treat complex environments is at the forefront of research in computational chemistry. [1][2][3][4][5][6] An important application of these approaches is the description of the excited states of large photosensitive systems, where the excitation of a limited active region is modeled at a high-level electronic structure method and a lower level of theory describes the rest of the system. While several possible combinations of computational ingredients are possible in such hierarchical models, the simplest embedding approach of a static classical en-vironments in so-called quantum mechanics in molecular mechanics (QM/MM) calculations has seen widespread use for more than a decade. 1,3 There is however mounting evidence that this approach is inadequate for the treatment of excitation energies: Several recent papers using classical point charges in combination with a variety of quantum approaches [7][8][9][10][11][12][13] have reported significant errors in the computed excitation energies with respect to experimental absorption maxima in different photoactive proteins.
There are essentially two main paths one may follow to alleviate the shortcomings stemming from the inadequacy of classical point charges. One may either employ a more sophisticated but still classical environment or attempt to treat parts of the environment at the quantum level. In the first class of models, the use of a polarizable embedding (MMpol) [14][15][16][17] is gaining popularity: Within this framework, the introduction of induced dipoles appears to be a quite effective strategy in describing electronic excitations 13,[18][19][20][21] and offers the appealing feature that different polarization effects induced in the ground state or responding to the excitation of the embedded system can be systematically studied. 22 The second class of models, seeking to treat more atoms at a quantum level, has several possible members. In the so-called "mechanical embedding" schemes, the high-level quantum calculation is only performed on an isolated part of the total system and determines a correction to a low-level supermolecular quantum calculation. 23 A self-consistent quantum method is sub-system density functional theory (DFT) embedding, where the environment density creates a one-electron embedding potential to be used in a high-level calculation of the active subsystem. Similarly to the case of polarizable dipoles, ground-state 24,25 and state-specific 26 schemes are available to study the effect of the environment response. Finally, a more "brute-force" approach 10,13,27 is to simply increase the quantum region to include part of the protein environment and treat the whole cluster at the same level of theory. The clusters might however be much larger than what is viable with most quantum methods: A recent study 8 reported that more than 700 quantum atoms are necessary to converge the excitation energy of the photoactive yellow protein.
Here, we focus on wild-type Aequorea green fluorescent protein (GFP), as a particularly ideal case, which challenges our understanding of how the excitation of the photoreceptor (the chro-mophore) couples to the protein environment, and of what the necessary ingredients in a computational method are to describe its photophysics. The importance of GFP as photobiological system stems from being the progenitor of the large family of intrinsically fluorescent proteins 28,29 which have launched a revolution in cell biology being compatible with non-invasive imaging in living cells. More recently, the development of fluorescent proteins with targeted photochemical properties has also enabled remarkable advances in super-resolution bio-imaging techniques capable of beating the diffraction limit. 30,31 Relevantly to the purpose of the current sudy, the environment surrounding the chromophore in these proteins is known to play a critical role in tuning its excitedstate behavior: Through the mutagenis and continuous discovery of GFP-like proteins in different sea organisms, 28 this class of proteins now covers almost the entire visible spectrum both in emission and absorption. On the other hand, predicting the relation between relatively fine structural changes in the chromophore pocket and the spectral properties of the chromophore-protein complex remains a very demanding task for multiscale approaches. Several recent studies 12,19,32 have attempted a computational characterization of these photoactive systems but reproduced trends in absorption between the various proteins with limited degree of success.
For wild-type GFP, in particular, many computational articles have appeared in the literature 7,9,12,13,[18][19][20][21]27,[32][33][34][35][36][37][38][39][40] due to the importance of the system and abundance of experiments characterizing its structural and spectral properties. The bulk of these studies produced however excitation energies blue-shifted with respect to absorption experiments when using a classical static treatment of the protein. 7,9,12,32,[34][35][36][37][38][39][40] Consequently, the focus has recently shifted towards improving the description of the environment either through polarizable dipoles [18][19][20][21] or simply by increasing the quantum region and therefore treating all atoms at the same level of theory. 10,27 While both approaches appear to improve on experimental agreement, many fundamental questions remain unanswered that transcend the choice of GFP as system of study: Does the excitation of the chromophore polarize the environment and how long range is this effect? Is the coupling between the active site and the protein of electrostatic nature or do we ultimately need a fully quantum mechanical description? Are there clear signatures indicating which computational route one must follow in modeling other photobiological systems?
Here, we seek to provide robust answers to these questions and, to this aim, employ a large battery of tools to describe the chromophore-protein complex. We will begin with extensive QM/MM molecular dynamics simulations and generate a large number of representative frames (about 100 in total) to investigate the interplay between structure and excited states of both protonation states of GFP. Analyzing numerous frames allows us to faithfully assess temperature effects, explore the main factors that determine the spread of absorption energies, and avoid the danger of drawing far-reaching conclusions on a single configuration of the system. We will consider the three main approaches to multiscale modeling of excited states, namely, static point charges, polarizable dipoles, and DFT embedding (the latter being here applied for the first time to fluorescent proteins) and, for many configurations sampled from the dynamics, either compare their results to supermolecular calculations on very large clusters or change the quantum method within the same hybrid scheme to understand how the response varies with the sophistication of the quantum treatment. For this purpose, we will perform calculations with several different electronic-structure methods: Time-dependent density functional theory (TDDFT) with two different functionals and three wave-function methods, that is, the complete active space perturbation theory (CASPT2), n-electron valence state perturbation theory (NEVPT2), and quantum Monte Carlo (QMC). In particular, we will explore the nature of polarization effects on the excitation energies of GFP using a frozen and a polarizable environment both at the level of DFT and MMpol embedding to identify the origin of the failure of a static classical description. Finally, we will establish the validity of representing the whole protein as a small cluster in several different ways; this analysis has been lacking in the bulk of previous literature.
The remainder of this paper is organized as follows: In Section 2, we describe the computational details and, in Section 3, we present our main results. We discuss them and draw our conclusions in Section 4.

5
For the QM/MM simulations, 41,42 we use the CPMD 3.17.1 43 and Gromos96 44 codes with the Amber 03 45 force field and the TIP3P 46 water model for the MM atoms. We treat the QM region with density functional theory (DFT) and employ the PBE 47,48 exchange-correlation functional, the Martins-Toullier pseudopotentials 49 and a plane-wave cutoff of 70 Ry. The size of the quantum box is such that the distance between periodic replicas is at least 8 Å. We perform the Car-Parrinello molecular dynamics (MD) simulations with a time step of 4 a.u (about 0.1 fs) and a fictitious electron mass of 400 a.u. For the first 0.25 ps of the room-temperature (300 K) runs, we use the Berendsen thermostat 50 with a time constant of 10 a.u. We then switch to a Nosé-Hoover thermostat 51,52 with a characteristic frequency of 2000 cm −1 and default settings for the rest of the 27 and 25 ps MD simulations for B and A form, respectively. In the annealing runs, we rescale the velocities at each step by 0.999 until the temperature is below 1 K for both forms.
We use Molcas 7.4 and 7.8 53 for the CASPT2 calculations and a module adapted from the Molcas-Embed interface developed by Carter and co-workers 54,55 for the CASPT2/DFT embedding runs. We employ the ANO-S-VDZP 56,57 basis set and, in the convergence tests, the aug-cc-pVDZ. 58 We use the Cholesky decomposition of the two-electron integrals 59 with a threshold of 10 −4 . In the CASPT2 calculations, we always adopt the default IPEA zero-order Hamiltonian 60 with an additional constant imaginary shift 61 of 0.1 a.u. and, unless otherwise noted, report the excitation energies computed at the single-state level. We also freeze as many σ orbitals as there are heavy atoms. In the MM calculations, we describe the protein with the Amber 99 62 force field and set the charges of the charge group closest to the MM boundary to zero, redistributing them to the next charge group, weighted by nuclear charge. All CASSCF calculations are carried out in state average over two states. We list the CAS spaces employed in Section S4 of the SI together with additonal tests on the inclusion of a third state in the the state average.
We also perform CASPT2/MMpol calculations in the presence of a polarizable environment using the Amber pol12 parameters (AL model in Ref. 63,64 ) to describe the protein and the water solvent. For water, we derived the point charges from a standard RESP fit using the electrostatic potential computed at the MP2/aug-cc-pVTZ level on the TIP3P water geometry (q O = −0.726, q H = 0.363). These CASPT2/MMpol results are estimated using a two-step strategy recently presented, 65 where the MM induced dipoles and charges from a CASSCF/MMpol calculation performed using the Gaussian code are later used in Molcas as a static external potential to obtain the CASPT2 results. Such CASSCF/MMpol calculations, performed using a locally modified version of Gaussian09, revision A.02, 66 are obtained using the state-average procedure for the ground and first excited states and adapting the MM polarization either to the ground state (polGS), or to the ground and excited states in two separate calculations (polSS).
The time-dependent density functional theory (TDDFT) runs are also performed using a locally modified version of Gaussian09, revision A.02, 66 and with the CAM-B3LYP 67 and LC-BLYP 68 functionals. We use the AL Amber pol12 parameters 63,64 for the polarizable dipole calculations and Amber 99 for the non-polarizable runs for consistency with CASPT2/MM. We employ the 6-31G 69 basis set on the hydrogens and the 6-31+G(d) [69][70][71] for the other atoms. In all QM/MMpol calculations, the polarizability of the MM atoms located at distances greater than 20 Å to any QM atom is set to zero in order to reduce the computational cost, whereas all partial charges were included. This value of polarization cutoff has been shown to provide highly converged excitation energies. 72 To generate the embedding potentials for the CASPT2/DFT and TDDFT/DFT calculations, we use the ADF 2013.2 code 73-75 with the Slater DZP 76 basis set. We employ the M06HF 77 exchange-correlation functional for the intramolecular calculations, and the PW91k 78 kinetic functional and the PW91 79 exchange-correlation potential for the non-additive part of the embedding potential. For frozen-density embedding TDDFT/DFT calculations, we use the ADF program with the same embedding potentials used in the CASPT2 runs and the CAMY-B3LYP functional (where the separation of 1/r is based on a Yukawa potential). 67,80,81 The impact on the excitation energies of the use of CAMY-B3LYP/DZP instead of CAM-B3LYP/6-31+G(d) is discussed in Section S8 of the SI.
For the NEVPT2 calculations in the strongly contracted formulation, 82-84 we use ORCA 3.0.2 85 with the ANO-L-VDZP 56,57 basis set and the aug-cc-pVTZ 58 as auxiliary basis set for the resolution of identity. 86 The RIJCOSX approximation 87 is employed in the CASSCF step. The orbital energies for the doubly occupied and virtual orbitals appearing in the Dyall Hamiltonian 88 (used for the definition of the zero order Hamiltonian in NEVPT2) were obtained by the diagonalization of a generalization of the Fock operator 89 (canonical orbital option in ORCA). In the construction of the third-and fourth-order density matrices, the CASSCF wave function is truncated so that configurations with lower weight than a threshold of 10 −8 are discarded. The NEVPT2/MM runs are performed in the presence of the same point charges as in the CASPT2 calculations.
For the QMC calculations, we use CHAMP 90 with scalar-relativistic energy-consistent Hartree-Fock pseudopotentials and the corresponding cc-pVDZ basis sets 91,92 augmented with diffuse s and p functions on the heavy atoms. 93 We employ a two-body Jastrow factor to account for electron-electron and electron-nucleus correlations and use different Jastrow factors to describe different atom types. 94 The starting orbitals are obtained in a state-average CASSCF calculations with Molcas 7.8 (in the presence of the MM charges, if appropriate) and are not optimized in the QMC runs. We truncate the CAS expansion in state-average natural orbitals with a cutoff on the CI coefficients as detailed in Section S4 of the SI. The CI coefficients and Jastrow parameters are optimized with the linear method 95 in a state-averaged fashion 96 within variational Monte Carlo (VMC). The pseudopotentials are treated beyond the locality approximation 97 and an imaginary time step of 0.05 or 0.075 a.u. is used in the diffusion Monte Carlo (DMC) calculations. The potential generated by the external MM charges is put on a grid which is padded to have at least 5.0 Å distance from any atom in the chromophore and has a step size of 0.2 a.u. A finer grid with a step size of 0.1 a.u. yields VMC excitation energies compatible to better than 0.02 eV. The electron loss ratio (attempted evaluations outside the cubic grid) is less than 10 −6 in all runs. All QMC excitation energies presented below are computed within DMC and the VMC results are reported Section S11 in the SI.

Models
To prepare the structures of the A and B forms, we start from the protein models obtained in our previous study. 9 For the A form, we remove one GFP barrel from the previously used dimeric system as well as all water molecules which are more than 5 Å from the remaining monomer.
We then solvate the system and bring the added water in equilibrium by performing a 5 ns NPT cluster is populated by frames that are within a certain cutoff distance of the central frame of the same cluster. As distance between two frames, we employ the mass-weighted fitted RMSD where the frames are first aligned to obtain the lowest possible RMSD. In the computation of the distance between two frames, we include all the atoms in the protein but not the water molecules, and use a cutoff ditance of 0.5 Å for the A form and 0.6 Å for the B form to define a cluster. The central frames of the most populated 10 clusters from the cluster analysis are then used in the calculation of the excitation energies. For further analysis, we isolate 50 equidistant frames sampled every 400 fs from the QM/MM MD runs and also determine a final annealed structure for both forms of the protein.
For the B form, we also consider a larger QM region, which includes the chromophore and the residues Gln94 and Arg96, and the boundary between the QM and MM regions also includes the single bonds C OOH -C α of Val93, N-C α of Thr97, and C α -C β of Glu95 (the backbone of Glu95 is modeled at the quantum level but its sidechain classically). With the resulting QM region, containing 93 atoms, 5 of which are capping hydrogens, we perform a QM/MM simulation of 1.5 ps and determine then an annealed structure.
Finally, we construct QM cluster models for the TDDFT, CASPT2/DFT, and TDDFT/DFT calculations. The smallest cluster contains 168 atoms from 10 residues (with deep cuts at C β or even deeper) and 8 water molecules, and is similar to the models used in Ref. 12 For the three larger clusters of 279, 345, and 529 atoms (4 waters and 8 residues, 7 waters and 10 residues, and 7 waters and 19 residues, respectively), we use capping methyl groups and cut the residues at the neighboring C α atoms in the backbone. The list of residues included in the clusters is given in Table S1 of the SI.

Results
In equilibrium, the chromophore of wild-type GFP exists in either a neutral A form or an anionic B form. Upon excitation in the blue, the A form of the chromophore loses the phenolic proton converting to an intermediate anionic I form, which in turn infrequently transforms into the thermodynamically stable B form. 101 The neutral chromophore is known to transfer its proton to Glu222 through the wire CRO-Wat-Ser205-Glu222 102,103 but other differences in the surroundings are difficult to assess experimentally: Since the predominant form of wild-type GFP at room temperature is the neutral one, the structure of the anionic form cannot be established through X-ray diffraction. As workaround, one can adopt as starting model a suitably modified X-ray structure of the mutant S65T which stabilizes the anionic form as also done in earlier computational work. 9,104 The available X-ray structures of this mutant (PDB entries 1EMA, 1EMG, and 1Q4A 105-107 ) suggest that the hydrogen-bond network of the B form is similar to that of the A form except for a different orientation of Thr203. A recent theoretical study 40 challenges the accepted orientation of Thr203 and also suggests a new conformation of the anionic chromophore. We will explore this possibility and its consequences in Subsection 3.5.
The solvated chromophore of GFP does not exhibit fluorescence at room temperature (quantum yield of less than 0.001 108 ) but fluoresces in solution at 77 K 109,110 and also in the protein, being rigidly held by the environment which inhibits its numerous available modes of relaxation. [110][111][112] Temperature effects on the chromophore and its surroundings have nevertheless an impact on the excited-state properties of GFP: When temperature is reduced, the absorption maximum moves from 2.59 to 2.63 eV for the B form, and from 3.12 to 3.05 eV for the A form and, consequently, the distance between the peaks of two forms decreases from 0.5 to 0.4 eV. Furthermore, the width of the peaks is much smaller in the low-temperature than in the room-temperature spectrum. 113 We begin our study by investigating the effects of temperature on the chromophore-protein complex of both forms of GFP and explore the range of conformations visited by the system in our QM/MM MD trajectories.

Thermal stability of hydrogen-bond network
For the A form, the hydrogen-bond network around the chromophore is quite stable during the 27 ps of QM/MM dynamics. In particular, the CRO-WAT-Ser205-Glh222-CRO wire is always connected except for a very short instance in the equilibration part where the water loses the Hbond with Ser205 (see Figure 1 for the position of the residues with respect to the chromophore and the labeling). His148 is always bonded to the water close to the phenolic oxygen and also the imidazolinone ring has a stable environment. In particular, O12 is bonded to Arg96, Gln94, and a water molecule, while N2 is weakly bonded to a water molecule. Only the O12-water interaction is sporadically broken. The annealed frame has the same hydrogen-bond network as the one in the dynamics.
For the B form, we observe instead that the hydrogen-bond network adopts several configurations even during the relatively short 25 ps dynamics as can be seen in Figure 2. The CRO-WAT-Ser205-Glu222-CRO wire is occasionally broken as the water molecule loses the bond to the phenolic O1. There are in fact four residues around O1 that compete to form hydrogen bonds with it: Tyr145, Hid148, Thr203, and a water molecule as illustrated in Figure 1. Through the trajectory, either two or three of these four residues are bonded to the phenolic oxygen. In particular, in most of the trajectory, exclusively one of the two residues Tyr145 and His148 (see Figure 2) is bonded to the oxygen, while Thr203 is bonded most of the time only rarely losing the bond due to the Figure 1: B form: The hydrogen-bond network around the chromophore (CRO). The phenolic oxygen O1 is bonded to the water molecule (orange), His148 (red), Thr203 (gray), while Tyr148 (red) does not participate in this particular snapshot. The hydrogen-bond network is completed by Ser205 (purple) and Glu222 (green). The hydrogen-bond network of the A form is similar but Glu222 loses its proton to O1 and Thr203 is no longer bonded to CRO owing to a change in conformation. movement of the H atom (the heavy atoms of the residue are always at a distance consistent with a hydrogen bond). Finally, we note that the water molecule close to the phenolic oxygen O1 forms a bond with Thr203 for a short time interval, losing the one with O1 but keeping the other hydrogen bonded with Ser205. The annealed frame of the B form has Tyr145 bonded to the phenolic O1 even though, for most of the dynamics, it had His148 instead. This simply indicates that multiple close minima are available to the system. In fact, the annealing run with the larger quantum region results instead in Tyr145, His148, and the water bonded to the phenolic ring, while Thr203 forms a bond with the carbonyl group in His148, mainly due to the movement of the gamma hydrogen atom in the threonine residue. Consequently, the commonly used approach of using a single frame obtained through annealing or optimization can lead to a configuration which is not representative of the average behavior.

Relationship between structure and excitation energies
To analyze the excitation energies over a wide range of representative conformations of the protein, we choose 50 equidistant frames from the last 20 ps of the QM/MM trajectories of both the A and B forms (i.e. one frame every 500 fs). Given the significant number of frames, we perform a first screening of the behavior of the excitation energies over these frames using the relatively cheap TDDFT/MM method. We choose CAM-B3LYP as exchange-correlation functional since it has been shown 114,115 to ameliorate problems with spurious charge-transfer effects that may plague conventional hybrid methods. 116,117 Later in this paper, we will also employ more sophisticated methods to treat either the chromophore or the protein environment and focus then on 10 representative frames obtained from cluster analysis for each protonation state. Our findings on the 50 frames for the A and B forms are summarized in   within the chromophore. The BLA was then defined as a linear combination of the 15 heavy-atom bonds in the chromophore, whose weights were fitted to the excitation energies in various ways.
We also perform such an analysis but use a much simpler definition of the BLA than in previous works, namely, where the atom labeling is given in Figure 2. The significance of this parameter can be understood in terms of the two resonance structures of the anionic chromophore shown in Figure 4: The BLA defined above is positive when the chromophore is in the dominant benzenoid structure and negative in the quinoid one. Such a simple definition avoids the danger of over-fitting our data and, as shown in Figure 3, suffices to describe a sizable part of the tuning in the A form. In this case, since the chromophore adopts exclusively the benzenoid resonance structure, when the BLA approaches zero through thermal fluctuations, the electronic structure of the ground state is severely perturbed, thereby closing the gap. On the other hand, for the anionic B form, small values of the BLA are more typical since the quinoid structure acquires importance and the ground state is more capable of adjusting also to conformations with a negative BLA and a dominant quinoid resonance structure. The dependence on the BLA of the excitation energy is therefore much less steep as shown in Figure 3.
We also note that the quality of the fit for the B form is lower, indicating that other factors than the BLA affect its excitation energy. We can in fact improve the fit if we account for environmental effects by adding for instance fitting parameters which measure the presence or lack of a bond between O1 and His148 and/or Tyr145 (see section S3 of the SI). Even though more data would be required to elucidate the precise role of the different amino acids in the surrounding of the chromophore, our analysis suggests that the spread of the absorption energies has different causes in the two forms: For the A form, it is mostly determined by the internal coordinates (which in turn are coupled to the external conformations) and, for the B form, the flexible hydrogen-bond network around the chromophore seems to be a more directly relevant factor.
While the average BLA is smaller for the B form than the A form, this fact alone is not enough to explain why the A form has a higher excitation energy. If we attempt to fit the dependence of the excitation energies on the BLA for both forms together as done for instance in Ref., 34 we observe that the excitation energies of the A form are systematically higher than those of the B form as is evident from Figure 3 and that a single fit is inadequate (we reach the same conclusion if we allow arbitrary weights for the bonds in the fitting procedure). We therefore add an energy shift between the two forms in the linear model to account for this offset, which is found to be 0.22 eV and improves the quality of the fit as shown in Figure 5. Consequently, of the total shift of 0.29 eV between the average excitation energies of the A and B forms, only 0.07 eV is accounted for by a lower average BLA in the B form than in the A form and the remaining 0.22 eV is caused by the different electronic structure of the two protonation states. We note in passing that, for both forms, the excitation energy displays the strongest dependence on the bridging distance C6-C7 (double bond in the A form and predominantly double bond in the B form) when we search for correlations between the excitation and any single bond length, while the other bridging distance, C5-C6, has a less significant influence. This is in line with the Hückel model description 37 for the excitation energy of the A form, which is approximated as a simple ethylene-like HOMO-LUMO transition on the C6-C7 bond. However, the same model assumes the bonds C5-C6 and C6-C7 to be equivalent in the B form, which does not fit our findings. Further insight on the electronic structure of the chromohore and relation to its geometry can be obtained through the development of alternative simplified models as in Refs. [119][120][121] In agreement with previous studies, 12,19,39,122   analysis (the BLA dependence is qualitatively identical for CASPT2 energies), the CASPT2/MM average excitation energy of 2.80 ± 0.01 eV is still clearly blue-shifted compared to experiment.
This means that thermal sampling by itself is not sufficient to overturn the strong indications coming from previous studies 9,11,39,123 that describing the protein environment as only classical static point charges produces an unphysical blue shift in the excitation energies of related photoactive systems.
To verify that the choice of quantum method is not responsible for the blue shift, we extend our investigation to other computational approaches for 10 representative frames obtained through cluster analysis. We employ the LC-BLYP functional in the TDDFT calculations and also use quantum Monte Carlo and NEVPT2 as alternative wave function-based methods. To pinpoint the effect of the MM environment, we consider the difference between the QM/MM description and the excitation energies computed on the isolated chromophore at the geometry obtained within the protein. As shown in Figure 7, the trends across the frames are qualitatively similar at all levels of theory but the shifts induced by the protein are larger for the wave function approaches. As regards the absolute excitation energies with the other WF methods, we find that NEVPT2 and QMC are rather close and about 0.2 eV blue-shifted compared to CASPT2 (see Table S6 in the SI for the excitation energies on all frames), thereby further worsening the agreement with experiments.
This finding renders even more pressing the search for an alternative, better description of the environment, which will be the focus of the following subsection.

Improved description of the environment
To further investigate whether the MM description of the protein is responsible for the signifi- This specificity has made polLR the standard one in combination with TDDFT methods where transition densities are directly available. In previous analyses 125, 126 of the different physical nature of the responses introduced in the two models in combination with polarizable continuum models, it has been shown that while polSS recovers the "exact" electrostatic response of the environmment, polLR introduces an effect that in the old literature of solvatochromism was described as part of the dispersion interactions. In the same analyses it was also shown that a full response should include both terms in addition to a purely dispersion term.
As summarized in Figure 8 Figure 9. From this analysis, we conclude that the chromophore-environment interactions in the ground state are almost equally described by the non-polarizable and the polarizable models.
To account for full response to the excitation, we need instead to move beyond a non-polarizable model and include a direct coupling between the transition density of the chromophore and the corresponding polarization of the protein environment.
An alternative method for treating the environment is DFT embedding, 127 which one would expect to be superior to a classical treatment. To generate the embedding potential for the chromophore, we use here either the density of the isolated environment, relaxing the chromophore density, 128 or bring the environmental and chromophore densities at self-consistency after an appropriate number of so-called freeze&thaw cycles. 129 For the A form, we find that DFT embedding  To investigate the effect of polarizing the environment at the DFT level, we also apply the recently introduced state-specific formulation of DFT embedding 26,130 and find that the response to state-specific potentials at the TDDFT/DFT level is on average very small for both forms (see Table S7   A summary of the results obtained for all 10 frames by applying the different ways of treating the environment in combination with a TDDFT description of the excitations is reported in Figure 11 and in Table 1, where we also list the excitation energies obtained with the LC-BLYP functional. We find that LC-BLYP responds to the changes in the description of the environment very similarly to CAM-B3LYP and that the agreement between the polLR and cluster excitation energies is equally good (see also Figures S6-S8 of the SI). We finally note in Table 1 and Figure 8 that   As already discussed in the previous section, all WF methods considered here agree with TDDFT in predicting that the MM description of the protein leads to significantly blue-shifted excitation energies of the B form with respect to the experimental absorption maximum: The average excitation energy over the 10 frames is 2.79 ± 0.02 and 3.02 ± 0.02 eV for CASPT2/MM and NEVPT2/MM, respectively. The QMC/MM excitation energies on a subset of frames are very close to the NEVPT2 values as illustrated in Figure 12 and detailed for a subset of frames in Table 2 (see Table S7 of the SI for the NEVPT2 and CASPT2 results on all frames). Also the anomalous blue shift observed with TDDFT/DFT for the B form with respect to the MM embedding is confirmed at the wave function level, with CASPT2/DFT predicting an even larger blue shift of 0.25 ± 0.02 eV and QMC/DFT a similar correction of 0.11 ± 0.04 eV. The further localization of the excitation energy induced by DFT embedding is possibly due to a too high/steep barrier in the embedding potential induced at the edge of the chromophore by the approximate kinetic functional. 130 Increasing the cluster size worsens the situation, causing further blue shifts at the CASPT2 level as shown in Figure 13. As in the case of TDDFT/DFT, we find that the use of state-specific DFT potentials has on average almost no effect on the excitation energies at the CASPT2 level (less than 0.01 eV as reported in Table S7 in the SI). Additional tests on the use of different basis set or functionals in the construction of the potentials in the DFT embedding scheme are provided in Table S8 of the SI.   All findings with WF methods for the B form are summarized in Figure 12, which clearly illustrates how the trends for the different embedding methods mirror the TDDFT ones of Figure 11, with the difference that the shifts are generally slightly larger for the WF methods. One might therefore hope that the large red-shift between TDDFT/MM and TDDFT/cluster or TDDFT/MMpol would be even larger between WF/MM and WF/MMpol or a possible WF/cluster run. Unfortunately, WF/MMpol in the polLR formulation is not yet available for the chosen WF methods (an estimate of such effects at the CASPT2/MMpol level will be given in the Conclusions), while a calculation on a cluster of almost 300 atoms is out of reach for all but the simplest WF approaches due to prohibitive costs. For the A form, we encounter severe difficulties in the WF calculations due to the presence of multiple minima in the CASSCF reference calculations and a strong sensitivity of the CASPT2 excitation energies on the size of the active space, the minimal but yet large expansion not being sufficient. Therefore, in Table 2, we only present CASPT2/MM and QMC/MM results for four frames (the annealed and three room-temperature ones) where the CASPT2 calculations are robust at both the single-and multi-state level as discussed in detail in Section S4 of the SI. The available WF excitation energies are in line with the TDDFT results also for the A form with QMC/MM being a bit higher and CASPT2/MM a bit lower than the corresponding TDDFT/MM excitation energies on the same frames, and both being blue-shifted with respect to the experimental absorption maximum of 3.12 eV also of this form.
If we focus on the polarizable (MMpol) embedding in the polLR formulation which all tests indicate as the most reliable for this system, we can only attempt a comparison with experiments at the TDDFT level. Even though a direct comparison of the CAM-B3LYP results to experimental absorption maxima is tenuous as previously noted, we can use the data for each form to validate our account of thermal effects. Since we have performed MMpol calculations on both room-temperature and annealed frames as listed in Table 1, comparing shifts between the excitation energies of the two data sets to experimental shifts in the absorption maxima is possible. We find that for the A form, the annealed (frozen) frame has a lower excitation energy than the average of the room-temperature frames by 0.05 ± 0.02 eV, while for the B form, the annealed frame has a higher excitation energy than the room-temperature average (0.04 ± 0.02 eV if we focus on the annealing performed with the same QM region). These two shifts are rather close to the experimental values 113 of 0.07 eV red-shift between room-and low-temperature absorption maxima for the A form and 0.04 eV blue-shift for the B form. The precise values of the shifts are difficult to ascertain, as especially the A form has a wide peak, but reassuringly, the direction and order of magnitude is well reproduced by combining an MMpol embedding with our thermal sampling in the QM/MM MD simulations.

On the use of a cluster representation of the protein
In the previous subsection, we employed cluster calculations to validate the MMpol approach and to perform the DFT embedding calculations. Quite often in the literature, a cluster representation of the protein is instead used to compute the excitation energies for a direct comparison with experiments and the question therefore arises on how large a cluster must be for a converged calculation of the excitation energies. Distressingly, a recent article 8 suggested that clusters as large as 723 atoms are necessary for reliable results on the photoactive yellow protein.
In Table 3

Alternative H-bond network for the B form
Our results strongly indicate that a poor description of the environment via classical point charges is mainly responsible for the blue shift observed in the excitation energies with respect to experiments. Nevertheless, we will here also explore the possibility that a drastically different configuration of the chromophore might lead to the desired agreement with experiments. In particular, we will investigate the impact on the excitation energies of the formation of an internal hydrogen bond between O γ in the Ser65 side chain and N11 of the imidazolinone ring. The extra hydrogen bond on the imidazolinone side of the chromophore is expected to stabilize the excited state as the excitation has a partial charge transfer character from the phenolic to the imidazolinone ring, and therefore red shift the excitation energy. (We note that the an analogous internal hydrogen bond network between the Thr65 side chain and the chromophore is observed in the X-ray structure 1Q4B 107 of the S65T mutant.) The distortion of Ser65 however significantly perturbs the surroundings of the chromophore since it is also coupled to a twist of Glu222 from the anti to the syn conformation and, consequently, to a breaking of the proton-transfer wire, so the combined effect on the excitation energy is in fact hard to predict.
Such a structure was recently put forward as the "true" conformation of the B form of wild-type GFP while the standard hydrogen-bond network we have adopted above was assigned to the I form of GFP. 40 This assignment was motivated by the the small blue shift of 0.02-0.04 (depending on the quantum method) computed between the excitation energies of their B and I forms and the fact that, experimentally, the 0→0 transition of the B form is also blue shifted (albeit by 0.1 eV) with respect to the I form.
To prepare this alternative B form, we appropriately twist Ser65 and Glu222, but keep the S65T-like orientation of Thr203 bonded to the phenolic oxygen for consistency with our standard B form and with the bulk of the literature (Thr203 is twisted away from the phenolic oxygen in Ref. 40 ). We then optimize a small cluster within DFT with fixed surrounding residues and, starting from this structure, perform a QM/MM simulation of 15 ps followed by annealing. During the 15 ps run, the hydrogen-bond network around the phenolic oxygen behaves similarly to what observed for the standard setup. In particular, the water molecule and Thr203 are constantly bonded to the phenolic oxygen but His148 and Tyr145 intermittently bond and detach. The final structure resembles the one of Ref. 40 but displays a hydrogen bond between His148 and the phenolic oxygen of the chromophore instead of having the water molecule bridging the histidine to the chromophore (see Figure S5 of the SI).
Finally, we compute the excitation energies on this structure at the level of CASPT2/MM, TDDFT/MM, and TDDFT on a cluster model. Disappointingly and in agreement with Ref., 40 we obtain a small blue shift of 0.04-0.08 in the excitation energies with respect to the annealed structure of our standard B form (see Table S5 of the SI). Our results therefore suggest that the conformational changes of Ser65 and Glu222 do indeed have some effect on the excitation energy but in the opposite direction than what a simple analysis would have suggested. We stress however that we do not agree with Ref. 40 in assigning these structural changes to the difference between the I and B forms based on this small difference in excitation energies. The only available experimental data for the I form 113 is the 0→0 transition at room temperature and not the absorption maximum at low temperature, we should compare our vertical excitation energy for the annealed structure.
Furthermore, we have shown above that temperature effects on the excitation energies are of this same order of magnitude. Finally, temperature effects were neglected altogether in Ref., 40

Discussion and conclusions
In this paper, we have thoroughly investigated the multi-scale computation of the excitation energies of wild-type GFP, examining the causes of the spectral spread and identifying the most suitable protocol for an accurate and affordable calculation of the absorption properties of this prototypical fluorescent protein.
Through extensive QM/MM molecular dynamics simulations, we explored the possible conformations of the protein in both the neutral A and anionic B forms, and found that the A form displays a very stable hydrogen-bond network, while the B form exhibits more significant deviations from the average structures even in the 20 ps time interval we examined. Analyzing 50 equidistant frames for each protonation form at the TDDFT/MM level, we found remarkably large spreads in the excitation energies of up to 0.5 eV in both cases and, for the B form, confirmed the correctness of the TDDFT excitation-structure relation through CASPT2/MM calculations on the same 50 frames. Surprisingly, the structurally more stable A form has a larger standard deviation in the energies, which we find to be more sensitive to the internal coordinates of the chromophore, while the environment plays a bigger influence on the excitations of the B form. Including several frames, as done here, ensures a faithful sampling of the possible conformations of the chromophore and its surroundings. While there is clearly a spread in the values stemming from the choice of excited-state quantum method, the coarseness of the MM description is responsible for a systematic blue shift. To demonstrate this, we used 10 frames from cluster analysis for each protonation form and compared the MM results at the TDDFT level to calculations done on clusters containing almost 300 atoms, which we showed to be sufficiently large to account for most of the tuning by the protein environment. Within TDDFT, which is the only method allowing the treatment of such large clusters, the MM description causes a blue shift of about 0.03-0.19 eV when compared to the reference cluster calculations of both the A and the B forms. Even though the use of large clusters does not greatly ameliorate the agreement of TDDFT with experiments, the measure of success/failure of a particular description of the environment is its ability/inability to reproduce the excitation energies on large clusters treated at the same level of theory. Given the evident inadequacy of static point charges, we explored two alternative approaches to improve the description of the environment.
Remaining within classical methods, we investigated the inclusion of induced dipoles (MMpol) and performed TDDFT/MMpol in linear response (polLR) on ten frames for each protonation form. The cluster TDDFT excitation energies and the MMpol calculations on exactly the same cluster agree remarkably well for both the A and B forms.
We note that, when comparing with previous studies with a similar MMpol methodology, 19,21 we find a good agreement for the B form but not for the A form, which displays a higher sensitivity in the excitation to the internal coordinates of the chromophore. As discussed in Section S9 of the SI, if one accounts for the use of a different DFT functional in optimizing the quantum chromophore and the differences in the structural relaxation, the discrepancy reduces to about 0.1 eV also for the A form. Importantly for our discussion, the shifts with respect to the MM values induced by their multipoles obtained in a LoProp construction 131 are in very good agreement with our findings. This shows that the more cost-effective choice of using semi-empirical force-field parameters and neglecting the static quadrupoles included in Refs. 19,21 is adequate for the description of these systems as also apparent from the agreement of our TDDFT/MMpol results with the cluster calculations.
The good performance shown by the TDDFT/MMpol approach allows us to better understand the role of the environment on determining the excitation energies of the two forms of the GFP chromophore. By comparing three different formulations of the embedding model, namely, the ground-state (polGS), state-specific (polSS) and linear-response (polLR) flavors, it is apparent that, in GFP, the response of the environment to the excitation cannot be represented in terms of purely electrostatic effects induced by the change in the electronic density upon excitation: In fact, the polSS result which accounts for the electrostatic response of the environment to the excitation is almost equivalent to that obtained by assuming that the polarization of the environment is frozen in the configuration corresponding to the chromophore in its ground state (polGS). On the contrary, if we switch on the polLR interaction, a quite different picture is obtained as regards the nature of the chromophore-protein coupling. This interaction comes from the dynamical response of the environment (here represented by the induced dipoles) to the transition density of the chromophore and it can be described as an environment polarization oscillating at the frequency of the chromophore excitation. Such a "resonance" term, which has been classified as a part of dispersion, leads to a significant red-shift with respect to polGS (or the non-polarizable MM scheme) and a good agreement with the reference supermolecular calculations is finally recovered.
In view of these findings, it is perhaps not so surprising that the other route we followed to improve over static MM charges, namely, sub-system DFT embedding was not successful: This scheme turned out to be at best equivalent to the non-polarizable MM description and, in the case of the anionic B form, to produce further unphysical blue shifts which persist across several quantum methods (CASPT2, QMC, and TDDFT) if the environment is relaxed through freeze&thaw cycles.
The recently introduced state-specific formulation of DFT embedding 26,130 did not capture any response of the environment, and increasing the quantum DFT environment brought further blue shifts. While displaying a somewhat worse performance than the polGS and polSS formulations of MMpol probably due to the quality of the approximate kinetic functional, the behavior of DFT embedding confirms that the state-specific formulation of sub-system DFT can recover electrostatic effects but that these are not dominant in this system. Similarly, one expects that other alternative DFT-based embedding techniques that eliminate the need for kinetic-energy functionals either by reconstructing the embedding potentials 132 or imposing orthogonality of the orbitals of the subsystems 133 will not possess the necessary ingredients to describe the excited states of GFP: They will improve the ground-state description of the environment and, even if they included statespecificity, will not easily capture the coupling with the environment beyond electrostatic effects.
While the MMpol embedding scheme has allowed us to identify the necessary ingredients in the description of the excited states of GFP, we have been able to include either the state-specific (electrostatic) or the linear-response (resonance) polarization of the environment but not both in a united formulation without resorting to a complete "brute-force" quantum calculation on a large Exp. quantum cluster. Furthermore, we have done so only at the TDDFT level. Since both the SS and the LR responses depend on the quantum method used to describe the excitation, what can we achieve with a correlated method considering that a large quantum cluster is out of reach and an LR formulation often not available? In an effort towards a more complete and accurate simulation of the excitation process in GFP, we attempt here to estimate the best possible excitation values using a CASPT2 description of the excitation process instead of TDDFT. At the CASPT2 level, one can evaluate the excitation energy in a polSS scheme using the state-specific induced dipoles obtained within CASSCF/MMpol (see Computational Details). Then, we can estimate the polLR correction as the TDDFT/MMpol (polLR) shift with respect to the corresponding polGS value and rescaling it with the ratio squared of the CASPT2 and TDDFT transition dipole moments.
The final picture emerging from this approximate CASPT2 estimate of the effects of the environment is illustrated for the B form in Figure 14 where the different CASPT2 excitation energies computed with static point charges, ground-state and state-specific induced dipoles are compared to the extrapolated polLR+polSS value. As expected from the results in the previous Section, the response at the correlated level is stronger than in TDDFT when the MM description is added and then enhanced with induced dipoles in a polGS and, subsequently, polSS formulation. The estimated polLR correction is instead rather similar to the TDDFT shift since the ratio of the CASPT2 and TDDFT transition dipole moments is in this case very close to one. That the extrapolated CASPT2 energy correlates very well with experiments is an appealing result, which should however not be overstated. One should apply a similar analysis to the other correlated methods used in this study, which further blue-shift the excitation energy with respect to CASPT2 but also respond more strongly to changes in the description of the environment (a perfect agreement between vertical excitation energies and absorption maximum is anyhow a misguided expectation also for the fluorescent B form). However, our extrapolation clearly shows that environmental effects on the excitation processes in GFP are the result of different interactions in the ground and the excited state; only by adopting a polarizable approach which can account for both state-specific relaxation and "resonance" coupling we can get a semiquantitatively correct description.