Prediction of whole-genome DNA G+C content within the genus Aeromonas based on housekeeping gene sequences

Different methods are available to determine the G+C content (e.g. thermal denaturation temperature or high performance liquid chromatography, HPLC), but obtained values may differ significantly between strains, as well as between laboratories. Recently, several authors have demonstrated that the genomic DNA G+C content of prokaryotes can be reliably estimated from one or several protein coding gene nucleotide sequences. Few G+C content values have been published for the Aeromonas species described and the data, when available, are often incomplete or provide only a range of values. Our aim in this current work was twofold. First, the genomic G+C content of the type or reference strains of all species and subspecies of the genus Aeromonas was determined with a traditional experimental method in the same laboratory. Second, we wanted to see if the sequence-based method to estimate the G+C content described by Fournier et al. [7] could be applied to determine the G+C content of the different species of Aeromonas from the sequences of the genes used in taxonomy or phylogeny for this genus.

The DNA base composition is one of the most straightforward genomic characteristics to measure, and has been determined in thousands of bacteria, in which the genomic guanine plus cytosine content ranges from 25 to 77 mol% [8]. Many evolutionary mechanisms have been proposed to explain this G+C content diversity among bacteria, but most authors agree that the genomic G+C content of a species is set by a balance between selective constraints at the level of codons and amino acids and directional mutational pressure at the nucleotide level [33,8].
The base composition of DNA is a key parameter of prokaryotic genomes and is routinely used in taxonomic classification. The current recommendation for the description of a novel bacterial species is based on a polyphasic approach, which includes determination of the G+C content together with other characteristics such as DNA-DNA relatedness and phylogenetic classification [32].
Several different methods are available to determine the G+C content (e.g. thermal denaturation temperature or high performance liquid chromatography, HPLC), but the values obtained may differ significantly between strains as well as between laboratories. The thermal denaturation temperature (Tm) method is one of the most common techniques for determining this value, and is based on monitoring the increase in absorbance at 260 nm during DNA denaturation [18]. The Tm of DNA is influenced by the ionic strength of the DNA solution, and thus the value is difficult to reproduce from one laboratory to another. To minimize experimental errors, a reference DNA is used as a standard, and the G+C content is calculated by a formula reported by Mandel et al. [17]. However, this formula cannot be applied to prokaryotes that have an extremely high or low G+C content, as the resulting value differs from those obtained by HPLC [5,34]. For all these methodological reasons, a variation of up to 5% is generally accepted in the G+C content value within a single species [9]. Currently, the thermal denaturation temperature method has largely been replaced by the HPLC technique [23]. The HPLC method is more rapid and sensitive, but has disadvantages in cost and methodological complexity.
In this study, we developed a method to predict the genomic G+C content in the genus Aeromonas at the interspecific level. The genus Aeromonas Stanier 1943 comprises Gram-negative, non-sporing, oxidase- and catalase-positive, facultatively anaerobic bacilli that are resistant to vibriostatic agent O/129 and are generally motile by means of a polar flagellum. They reduce nitrate to nitrite and do not require NaCl for growth [1,19].
Taxonomically, this genus belongs to the family Aeromonadaceae and seems to form a monophyletic group in the γ-subgroup of the class Proteobacteria [19]. Aeromonads are often associated with aquatic animals and are frequently isolated from foods. There is strong evidence for the role of aeromonads as aetiological agents of a variety of infections in ectothermic animals (fish, frogs, turtles and snails). During the last 20 years the genus Aeromonas has been increasingly recognized as an agent of disease in humans, and associated with a variety of clinical manifestations. However, the correlation between species and disease remains to be elucidated and requires additional information about the taxonomy of these ubiquitous bacteria [19,6].
The classification of the genus Aeromonas remains complex from a taxonomical point of view due to the continuous description of novel species, the rearrangement of strains and species described thus far, and the discrepancies observed in different DNA-DNA hybridization studies [10,11,13,20,25].
Recent studies based on the partial sequences of cpn60, dnaJ, gyrB, rpoB, and rpoD genes have shown that the use of several housekeeping genes is an effective approach to the phylogeny and taxonomic identification of Aeromonas species [31,15,29,27,26].
Our aim in this current work was twofold. Few G+C content values have been published for the Aeromonas species described, and the data, when available, are often incomplete or only provide a range of values. Our first objective was thus to determine the genomic G+C content of the type or reference strains of all species and subspecies of the genus Aeromonas with a traditional experimental method in the same laboratory. Secondly, we wanted to see if the sequence-based method to estimate the G+C content described by Fournier et al. [7] could be applied to determine the G+C content of the different species of Aeromonas from sequences of the genes used in taxonomy or phylogeny in this genus.

Gene sequences
We selected five conserved genes widely used in the taxonomic classification and phylogeny of Aeromonas (cpn60, dnaJ, gyrB, rpoB and rpoD). The nucleotide sequences of these genes were obtained from the GenBank database for the strains used in this work. Nine sequences not included in the database were determined in our laboratory according to previously described methods (cpn60, dnaJ, rpoB, rpoD) [31,15,27,26]. The GenBank accession numbers of all nucleotide sequences used in this study are indicated in Table 1.
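The G+C content of a gene sequence is the percentage of G and C bases among all counted bases. The following is a minimal sketch of that calculation, assuming plain nucleotide strings as input. The function name `gc_content` and the choice to ignore gaps and ambiguity codes are assumptions made for illustration, not details taken from this study.

```python
def gc_content(sequence: str) -> float:
    """Return the G+C content of a nucleotide sequence in mol%.

    Assumption for this sketch: only unambiguous bases (A, C, G, T)
    are counted; gaps and ambiguity codes such as N are ignored.
    """
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    if total == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return 100.0 * gc / total


# Toy fragment for illustration only (not an Aeromonas sequence).
example = "ATGGCGCCGTTAAC"
print(round(gc_content(example), 1))
```

In practice the input would be the partial gene sequences retrieved by accession number; how ambiguous positions should be treated is a choice the analyst has to make.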

Statistical analysis
All statistical analyses were carried out using R software [28] and Excel spreadsheets (Microsoft). The statistical significance of the regression analysis between the experimental genomic G+C content and the G+C content calculated from the sequences of the cpn60, dnaJ, gyrB, rpoB and rpoD genes was determined using the t-test [$t = r\sqrt{(n-2)/(1-r^2)}$], where r is Pearson's correlation coefficient, r² is the coefficient of determination and n represents the number of species analyzed [16]. As a measure of the goodness of each regression model we used the coefficient of determination (r²) and Akaike's information criterion (AIC). AIC was obtained using the stats package for R software and calculated as $\mathrm{AIC} = n\ln(\mathrm{RSS}/n) + 2p + n\ln(2\pi)$, where n is the number of observations (31), p represents the number of parameters in the model (2) and RSS is the residual sum of squares of the linear regression model. Given a data set, several competing models may be ranked according to their AIC, with the one having the lowest AIC being the best [16]. Observed differences were considered significant when
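The two statistics described in this section can be sketched in a few lines. The code below follows the formulas as written in the text; the function names (`pearson_r`, `t_statistic`, `aic_linear`) are hypothetical, and the AIC value returned by R's own `AIC()` function may differ from this expression by an additive constant that does not change the ranking of models fitted to the same n observations.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t = r * sqrt((n - 2) / (1 - r^2)), with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


def aic_linear(rss, n, p=2):
    """AIC = n ln(RSS/n) + 2p + n ln(2*pi), as written in the text."""
    return n * math.log(rss / n) + 2 * p + n * math.log(2 * math.pi)


# Example: a coefficient of determination of 0.8326 with n = 31 species.
r = math.sqrt(0.8326)
print(round(t_statistic(r, 31), 2))
```

With r² = 0.8326 and n = 31, the t statistic is about 12.0 on 29 degrees of freedom.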

Experimental determination of G+C content
At present, the DNA G+C content has only been reported in a few species and subspecies of the genus Aeromonas (Table 2). In this study we experimentally determined the genomic G+C content of 31 type and reference strains of the species and subspecies of Aeromonas (Table 2). The variation in the G+C content for this genus was 5.3%, ranging from a minimum of 57.4% (A. sobria) to a maximum

Correlation between experimental and sequence gene methods
We performed a regression analysis between the experimental DNA G+C content and the G+C content calculated from the sequence of each of the aforementioned five genes. The regression equations and the Pearson's correlation coefficients (r), as well as their significance, are shown in Table 3. Two of the five selected genes, dnaJ and rpoB, were later excluded from this study because of their low significance (r and AIC values). The average values obtained from the sequences of the three remaining genes (cpn60, gyrB and rpoD) were used to perform a regression analysis with the experimentally determined G+C content (Table 3). As the sequences of the three chosen genes differed in length, we weighted their average G+C content values by the mean length of the sequences (Table 3). However, the differences between the regression analysis performed with the weighted average and that performed with the simple mean were minimal (data not shown). The scatter plot, the regression line, the regression equation and the coefficient of determination are shown in Figure 2. The coefficient of determination obtained (r² = 0.8326) is reasonably good, and suggests that this method is a reliable way of estimating the G+C content of Aeromonas species. The results obtained using this regression equation (3 genes) for each of the analyzed strains are shown in Table 2. The difference between the experimentally determined and the predicted values did not exceed 3% (Table 2), and is therefore within the range of variation observed in G+C content determination with conventional methods.
As a way of checking the reliability of our approach, we inferred the G+C content of four strains of A. molluscorum not included in the previous analysis, using the regression equation shown in Table 3. These strains were chosen because we had previously determined their G+C content experimentally. Similarly, we also calculated the G+C content of the two Aeromonas strains (A. hydrophila ATCC 7966T and A. salmonicida A449) whose genomes have been sequenced. The predicted values were close to the reference values, with absolute differences not exceeding 1% (Table 4).
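The workflow described above (length-weighted averaging of the gene G+C values, least-squares regression against the experimental values, then prediction for new strains) can be sketched as follows. All function names are hypothetical and the numbers in the usage lines are synthetic placeholders, not values from Tables 2 to 4; the actual regression coefficients are those reported in Table 3.

```python
def weighted_mean_gc(gc_values, lengths):
    """Average per-gene G+C values (mol%), weighting each by sequence length."""
    total = sum(lengths)
    return sum(g * l for g, l in zip(gc_values, lengths)) / total


def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept


def predict(slope, intercept, gene_gc):
    """Predict genomic G+C content from the gene-based G+C value."""
    return slope * gene_gc + intercept


# Synthetic placeholder data: gene-based G+C (x) and experimental G+C (y).
gene_gc = [58.0, 59.5, 61.0, 62.5]
genomic_gc = [57.9, 59.0, 60.4, 61.8]
slope, intercept = fit_line(gene_gc, genomic_gc)

# Gene-based value for a hypothetical new strain (three genes, three lengths).
new_strain = weighted_mean_gc([60.2, 59.8, 61.1], [555, 1100, 820])
print(round(predict(slope, intercept, new_strain), 1))
```

Comparing `predict` on the weighted mean with the same call on the simple mean reproduces the kind of check the text describes, where the two averaging schemes gave nearly identical regressions.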
In order to examine the intraspecies variation, we calculated the G+C content from the sequences of the cpn60, gyrB and rpoD genes in a collection of 50 strains belonging to A. bestiarum, A. hydrophila, A. molluscorum and A. salmonicida. As seen in Table 5, all the standard error values ranged between 0.1 and 0.2 mol%, except in the case of cpn60 for A. molluscorum (0.4 mol%). The higher variation observed in A. molluscorum is due to an anomalous value (60.7 mol%) obtained from the strain 849T. Despite this rather high value, all the intraspecific variation values are well below those obtained interspecifically for this genus.

Selection of cpn60
Since sequence determination of three genes might sometimes be cumbersome, we investigated whether one of these genes alone might be representative of the whole. Recently, we have demonstrated that cpn60, whose sequencing is simple and rapid, is a good genetic marker for the identification of Aeromonas species [26]. To investigate whether the cpn60 gene could be a suitable candidate, we compared the G+C content values calculated from cpn60 alone with those obtained from the three genes (Table 3). The coefficient of determination obtained (r² = 0.8181) indicated that there is a good correlation between the cpn60 G+C content values and those obtained from the three genes, and allows us to suggest that the cpn60 sequences might be representative of all the genes studied. Table 2 shows the predicted G+C content using only cpn60 sequences for all the strains analyzed in this study. A mean difference of 0.66 ± 0.53 mol% was observed, which is only slightly higher than that obtained when using the regression model for all three genes. These values are also within the range of variation observed in G+C content determination with conventional methods. Table 4 also shows the predicted values obtained with the same strains but using the regression equation of cpn60.
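A summary such as "mean difference ± deviation" between experimental and predicted values can be computed as in the sketch below. It assumes that the reported spread is the sample standard deviation of the absolute differences, which the text does not state explicitly; the function name and the input values are placeholders for illustration.

```python
import statistics


def abs_difference_summary(experimental, predicted):
    """Return (mean, sample SD, maximum) of absolute differences in mol%."""
    diffs = [abs(e - p) for e, p in zip(experimental, predicted)]
    return statistics.mean(diffs), statistics.stdev(diffs), max(diffs)


# Synthetic placeholder values, not taken from Table 2.
experimental = [58.1, 59.4, 60.7, 61.9]
predicted = [58.6, 59.1, 61.5, 61.8]
mean_diff, sd_diff, max_diff = abs_difference_summary(experimental, predicted)
print(f"{mean_diff:.2f} ± {sd_diff:.2f} mol% (max {max_diff:.2f})")
```

Reporting the maximum alongside the mean makes it easy to check the result against the tolerance accepted for conventional G+C determinations.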