Prediction of the n-octanol/water partition coefficients in the SAMPL6 blind challenge from MST continuum solvation calculations

The IEFPCM/MST continuum solvation model is used for the blind prediction of n-octanol/water partition of a set of 11 fragment-like small molecules within the SAMPL6 Part II Partition Coefficient Challenge. The partition coefficient of the neutral species (log P) was determined using an extended parametrization of the B3LYP/6-31G(d) version of the Miertus–Scrocco–Tomasi continuum solvation model in n-octanol. Comparison with the experimental data provided for partition coefficients yielded a root-mean square error (rmse) of 0.78 (log P units), which agrees with the accuracy reported for our method (rmse = 0.80) for nitrogen-containing heterocyclic compounds. Out of the 91 sets of log P values submitted by the participants, our submission is within those with an rmse < 1 and among the four best ranked physical methods. The largest errors involve three compounds: two with the largest positive deviations (SM13 and SM08), and one with the largest negative deviations (SM15). Here we report the potentiometric determination of the log P for SM13, leading to a value of 3.62 ± 0.02, which is in better agreement with most empirical predictions than the experimental value reported in SAMPL6. In addition, further inclusion of several conformations for SM08 significantly improved our results. Inclusion of these refinements led to an overall error of 0.51 (log P units), which supports the reliability of the IEFPCM/MST model for predicting the partitioning of neutral compounds.


Introduction
Lipophilicity is a crucial physicochemical property to understand the biological and pharmaceutical properties of drugs, as it reflects the differential solubility of solutes in aqueous and organic environments [1][2][3][4]. Generally, the lipophilicity is estimated from the partitioning of compounds between aqueous and n-octanol phases [5][6][7], which have been also used as reference systems in the development of a variety of computational approaches [8][9][10][11][12].
The logarithm of the partition coefficient of a neutral solute (log P) can be determined by combining the transfer free energy of the solute between water and n-octanol ( ΔG o∕w ), which in turn is related to the difference in the solvation free energy upon transfer from the gas phase to the two solvents ( ΔG w sol and ΔG o sol ; Scheme 1). Therefore, a robust prediction of the partition coefficient for (bio)organic compounds depends on the accuracy of the calculated solvation free energies. In this context, quantum mechanical (QM) self-consistent reaction field models (SCRF) have demonstrated to be a powerful approach for the calculation of the solvation free energy, especially when one considers the tradeoff between chemical accuracy and computational cost [13][14][15][16]. Thus, after a careful parametrization of QM-SCRF methods, the solvation free energy of neutral compounds can generally be predicted with an uncertainty < 1 kcal/mol [17,18].
In this work we report the results obtained for the SAMPL6 Part II Partition Coefficient Challenge using the B3LYP/6-31G(d) version of the integral IEFPCM/MST solvation model [19], which relies on the Integral Equation Formalism of the Polarizable Continuum model [20,21]. The reliability of the IEFPCM/MST model for predicting hydration and tautomerism free energy changes was already checked in the first two editions of the SAMPL blind test [22,23], yielding predictions of hydration free energies with a root-mean square error (rmse) of 2.3 kcal/mol compared to the experimental values. Recently a series of refinements have been introduced in order to improve the accuracy for predicting the solvation free energy of nitrogen-containing heterocyclic compounds [24]. In this context, the current edition of the SAMPL challenge is thus an excellent opportunity to reassess the performance of the MST method.
In the following a critical assessment of the results obtained from IEFPCM/MST continuum solvation calculations for the set of molecules proposed in the SAMPL6 challenge is made. The reader is addressed to Ref. [25] for a comprehensive analysis of the performance of the different methods used in the blind challenge. The IEFPCM/ MST results can be found under the identifier "kivfu" in the web page of the SAMPL6 challenge [26]. The overall performance of the MST model is discussed, with especial attention to the compounds that exhibit the largest deviations between experimental and calculated log P values. A critical review of the results estimated for these compounds, including both the extension of the calculations to new chemical species and the potentiometric determination of the log P of compound SM13, leads to an improvement in the agreement between MST results and experimental data, as will be detailed below.

The IEF-MST model
Since the IEFPCM/MST model has been widely described in the literature, the reader is addressed to Refs. [14,27] for details about the formalism of this continuum solvation model. Here, we limit ourselves to provide a succint description of the MST method. The IEFPCM/MST model computes the ΔG sol from the addition of electrostatic (ΔG ele ), cavitation (ΔG cav ) and van der Waals (ΔG vW ) components, which are calculated using a double molecular-shaped cavity for the solute embedded in the polarizable continuum medium.
The ΔG ele term measures the work needed to build up the solute charge distribution in the solvent. To this end, a solvent-excluded surface is obtained by scaling the atomic radii by a solvent-dependent factor (λ), which varies from 1.25 for the solvation in water to 1.80 for the solvation in carbon tetrachloride. The value of this scaling factor for solvation in n-octanol was adjusted to be 1.50 [28]. In contrast, no scaling is used for the non-electrostatic terms (ΔG cav , ΔG vW ), which are evaluated using a van der Waals surface.
where N is the number of atoms, w i is a weighting factor determined from the ratio between the solvent-exposed surface of atom i ( S i ) and the total surface of such an atom, and ΔG P cav,i is the cavitation free energy of the sphere associated to atom i.
Finally, the ΔG vW term is computed using a linear relationship to the solvent-exposed surface of each atom (Eq. 2).
where i denotes the atomic surface tension of atom i, which is determined by fitting to the experimental free energies of solvation in the parametrization procedure.

Dataset
The SAMPL6 dataset consists of 11 fragment-like small molecules (see Fig. 1) endowed with kinase inhibitory Thermodynamic cycle used to determine the transfer free energy of a compound (X) between two immiscible solvents 1 3 activity. The dataset includes nine basic compounds, an acidic molecule, and a zwitterionic one, with experimental log P values in the range of 1.95-4.09 [31].

Computational details
The structures of the 11 molecules were generated from the SMILES codes given in the SAMPL6 log P webpage [32] using the online SMILES translator and structure file generator of the National Cancer Institute [33]. Starting from these structures, the molecular geometries of the compounds were fully optimized at the B3LYP/6-31G(d) level of theory, without exploration of additional conformations. The solvation effect of water and n-octanol on the geometrical parameters of solutes was taken into account in geometry optimizations, which were performed using the IEFPCM version of the MST model. Single-point calculations in the gas phase and in solution were performed for the optimized geometries of the compounds to estimate the ΔG sol in the two solvents. All calculations were performed using Gaussian 09 [34].
For one of our outliers, SM08, a more comprehensive conformational analysis was performed using FRee Online druG conformation generation (Frog, version 2) [35], including both keto and enol tautomeric forms (see Fig. S1).
For all conformers, the geometries were fully optimized at the B3LYP/6-31G(d) level, and the resulting minima were verified by vibrational frequency analysis, which gave positive frequencies in all cases. Then, the relative energies of the whole set of conformational species of both tautomers were refined from single-point computations performed at the MP2/aug-cc-pVDZ and MP2/aug-cc-pVTZ levels of theory, which were used to estimate the energy difference by extrapolation to the complete basis set (CBS) limit. Since the spin-component-scaled version of MP2 (SCS-MP2) provides a significant improvement in ground state energies by scaling parallel and antiparallelspin pair correlation energies [36], the CBS energy was determined upon extrapolation of the SCS-MP2 correlation energies computed using Dunning's aug-cc-pVDZ and aug-cc-pVTZ basis sets according to the formalisms proposed by Halkier [37] and Truhlar [38]. Finally, higherorder electron correlation effects were estimated from the difference between CCSD and SCS-MP2 energies calculated with the 6-31G(d) basis set.
The best gas phase estimate of the free energy difference was derived by combining the SCS-MP2/CBS results with the CCSD higher-order electron correlation correction, and the thermal contribution obtained in the B3LYP vibrational analysis (SCS-MP2/CBS The gas-phase free energies were then combined with solvation free energies in both water and n-octanol (at 298 K) computed using the B3LYP/6-31G(d) version of the IEFPCM/MST model. The partition coefficient was determined using a Boltzmann's weighting scheme to the relative stabilities of the conformational species determined for the tautomers of this compound in the two solvents. (3)

Determination of partition coefficient (log P) for SM13 by potentiometry
The compound 6,7-dimethoxy-N-(3-methylphenyl)quinazolin-4-amine (SM13) was purchased from MolPort with a purity grade of ~ 97.5% [31]. The log P values were obtained from the difference between the aqueous pKa of the species and the apparent pKa determined from dual-phase titrations (n-octanol/KCl 0.15 M) using PCA200 from Sirius Instruments (UK). The experimental aqueous pKa (5.77 ± 0.01) was taken from the value reported in the SAMPL6 part 1 [39].
Typically, ~ 3 mg of the samples were dissolved in the appropriate volume ratio of n-octanol (saturated with 0.15 M KCl aqueous solution) and 0.15 M KCl aqueous solution (saturated with n-octanol), followed by a preacidification of the sample with HCl 0.5 M and subsequent titration with KOH 0.5 M. Due to the low solubility of the sample, it was necessary to increase the temperature to approximately 30 °C for obtaining a totally homogeneous sample. Several phase ratios, R, (between 0.03 and 0.4 v o /v w ) were selected according the expected solubility and hydrophobicity of the sample [40,41]. The log P values were estimated and refined by a weighted non-linear least-squares fit, where the aqueous pKa values were used as unrefined contributions (see Fig. S4).

Results and discussion
The partition coefficients determined from IEFPCM/MST B3LYP/6-31G(d) calculations in water and n-octanol are given in Table 1. On average, the IEFPCM/MST results deviate from the experimental values by a root-mean square error (rmse) of 0.78 log units, which places the IEFPCM/ MST results among the most accurate values provided by physical methods, keeping in mind the small differences observed between methods with rmse < 1 (see Fig. S2). Thus, the rmse of the other best ranked QM-based solvation models, the Cosmotherm version of COSMO-RS [16,42] and the Minnesota's solvation models SMD [43], SM8 [44] and SM12 [45], are in a narrow range comprised between 0.38 and 0.65 log units, whereas the use of 3D integral equation theory with a cluster embedding approach [46] yields a rmse of 0.47. The expected accuracy of the IEFPCM/MST for predicting log P values of nitrogen-containing heterocyclic compounds was estimated to be 0.80 log units, after refinement of the atomic surface tension for nitrogen atoms ( N and NH ) in heterocyclic aromatic compounds in the calculation of the van der Waals component of solvation free energy in n-octanol [24]. Thus, present results support the suitability of the latest refinements introduced in the IEF-PCM/MST model. The largest errors in the IEFPCM/MST results involve three compounds: the overestimation of the log P of compounds SM08 and SM13, and the underestimation of the log P of SM15 (see Figs. 2 and S3). Upon exclusion of these compounds, the statistical error is significantly reduced

Compound SM13
Both SM13 and SM09 share the 4-aminoquinazoline scaffold (Fig. 1), but the Δlog P error increases from 0.18 for SM09 to 1.36 for SM13 (Table 1). Compared to SM09, the additional methoxy and methyl substituents present in SM13 suggest that this compound would be more lipophilic than SM09, a trend not reflected in the experimental log P values of 2.92 and 3.03 for SM13 and SM09, respectively. In fact, comparison of the log P values calculated from the different computational approaches that participated in the challenge reveals that the log P of SM13 is in most cases predicted to be larger compared to the log P of SM09 (Fig. 3). Furthermore, a survey of the main outliers in the best-ranked physical methods and in the most accurate submissions in other categories points out that SM13 is an outlier in the most succesful methods (Table 2). On the basis of these considerations, we performed a potentiometric determination to determine the log P of compound SM13, obtaining a value of 3.62 ± 0.02 (see Fig.  S4), which seems in better agreement with chemical intuition. After insightful discussion with the organizers of the challenge, we think that the main difference between the two experimental measurements may arise from the sensitivity of the profiles of volumetric ratios between partition solvents, expressed as log R, to both the partition of neutral and ionic species of the compound, and the low solubility of SM13. In the original measurement [31], the log R range was comprised between − 1.20 and − 0.18 but with a maximum analysis vial volume of 3 mL. In our case, we have used a similar log R range (− 1.48 to − 0.40). However, the volumes of partition solvents were larger than those used in the original measurement, as our volume limit was 25 mL. We considered that these experimental conditions allowed both a better solubility of SM13 and partition of ionic species. According to the experimental data given for SM13 by the organizers of SAMPL6 [31], the partition of the ionic species was negligible in most replicas (− 4.35, − 7.57 and 0.09). On the other hand, for the structurally similar compound SM09, two replicas show a similar partition of the ionic species (1.05 and 0.91). The discrepancies in the partition of the ionic species may also affect the experimental determination of the log P for the neutral species of these compounds [47].
On the other hand, we do not think that temperature (25 °C in the experimental values of the challenge, and 30 °C in our measurement) might justify the difference Fig. 3 Representation of the relative log P prediction between SM13 and SM09 for all 91 submissions and classified by category and performance in the challenge between the two values, since according to previous studies one should expect an increase in hydrophobicity with a decrease in temperature [48,49].

Compound SM08
The error in the estimated log P of SM08 was partly due to a misassignment of the atomic surface tension for the nitrogen atom ( NH = − 0.295 kcal mol −1 Å −2 instead of NH = − 0.234 kcal mol −1 Å −2 ) in the original submission. Correction of this mistake would have led to a log P of 4.20. However, this value still deviates significantly from the experimental one (log P = 3.10), which could originate from the use a single conformational species of the keto tautomer (denoted SM08-T1 hereafter). This led us to carry out a more detailed analysis considering the involvement of alternative conformational and/or tautomeric species. Three main conformational species were found for SM08-T1, which are shown in Table 3 together with the estimated population in the gas phase, n-octanol and water (populations in the gas phase are presented with the aim to demonstrate the relevant effect played by the solvent on the conformational preferences). The results point out the decrease of the intramolecularly hydrogen-bonded conformation (III) upon transition from the gas phase to n-octanol to water. Furthermore, the results show the modulation of the relative weigth for the conformations with the solvent-exposed carboxylic acid.
Taking into account the weigth of the distinct conformers improves the prediction of the log P of SM08-T1, which is estimated to be 4.01 when the population distribution is determined at the DFT level. Moreover, weighting such conformations according to the SCS-MP2 energies corrected for CCSD high-order electron correlation effects yields an estimated log P of 3.48, which almost matches the experimental value (log P = 3.10). Finally, we also determined the partition coefficient of the enol tautomer (SM08-T2; Fig. S1), leading to a log P of 3.30 (detailed data not shown). However, the contribution of the enol tautomer is expected to be negligible according to the relative stability in water and n-octanol (Table S1).

Compound SM15
The IEFPCM/MST calculations lead to remarkable different errors in the predicted log P of the benzimidazole-containing compounds SM14 and SM15, as the Δlog P error is much higher in the latter (− 0.18 and − 1.26 for SM14 and SM15, repectively). In fact, all the submissions predict that SM15 is less lipophilic compared to the experimental value (log Table 3 Relative populations (%) of the major conformational species for the keto tautomer of compound SM08 estimated from SCS-MP2/ CBS + ΔCCSD calculations Only one of the symmetry-related conformations is shown  Fig. 4. In addition, Table 2 also points out that SM15 can be viewed as an outlier in the best-ranked submissions. The reasons that may explain this behaviour are unclear, but the uncertainty might be attributed to the population of cationic and anionic species during the potentiometric determination of the log P for a compound with distinct titratable sites. In this case, according to previous studies [50], the use of a shake-flask technique would be valuable for comparison purposes with the potentiometric method, because these compounds are prone to exhibit a larger difference between the log P values obtained by using these experimental techniques.
On the basis of the preceding considerations, if one keeps in mind the new experimental determination of log P reported here for SM13 and the recalculated value that accounts for the conformational flexibility of SM08, a significant improvement is found in the ability of the IEFPCM/ MST model for predicting the experimental data of the compounds in the SAMPL6 challenge, as the rmse is reduced to 0.51 log units (0.35 upon exclusion of SM15; Fig. 5). Overall, the results support the suitability of the IEFPCM/ MST model, as well as other QM-based continuum solvation models, for predicting the partitioning of neutral compounds with excellent accuracy.

Conclusions
The SAMPL6 Part II partition coefficient challenge has allowed us to examine the refinements made in the IEF-PCM/MST continuum solvation model for the treatment of nitrogen-containing heterocyclic compounds in n-octanol. The results obtained with our submision ID "kivfu" support the suitability of the IEFPCM/MST model for predicting the n-octanol/water partitioning of neutral compounds. However, the analysis of the compounds with the largest uncertainties reveals the need to maintain a close cross talk between theoretical predictions and experimental measurements. Thus, the experimental evaluation of the partition coefficient for SM13 yields a value that is in better agreement with chemical intuition, as it reflects an increased lipophilicity compared with the structurally-related compound SM09, but also with the general behaviour predicted from theoretical calculations. In turn, the experimental and predicted log P values for SM08 can be reconciled after accounting for the solvent effect on the relative stability of the different conformational species. Overall, the availability of blind high-quality datasets, such as the set of compounds included in the SAMPL6 Part II log P challenge, is a powerful tool not only to calibrate the strengths and weaknesses of theoretical methods, but also to implement refinements to improve the chemical accuracy of predictions and to gain deeper insight into the physicochemical factors that regulate the partitioning of (bio)organic compounds.