New Distance-Based Approach for Genome-Wide Association Studies

With the rise of genome-wide association studies (GWAS), the analysis of typical GWAS data sets with thousands of single-nucleotide polymorphisms (SNPs) has become crucial in biomedical research. Here, we propose a new method to identify SNPs related to disease in case-control studies. The method, based on genetic distances between individuals, takes into account possible population substructure and avoids the issues of multiple testing. The method provides two ordered lists of SNPs: one with SNPs whose minor alleles can be considered risk alleles for the disease, and another with SNPs whose minor alleles can be considered protective. These two lists help the researcher decide where to focus attention in a first stage.


INTRODUCTION
Genome-wide association studies (GWAS) have been increasingly used thanks to advances in high-throughput genotyping methods. A typical GWAS data set contains thousands of single-nucleotide polymorphisms (SNPs), and the aim is to identify genes involved in human disease by seeking SNP alleles that occur more frequently in subjects with a particular disease than in individuals without it. In case-control association studies, the frequency of SNP alleles among individuals diagnosed with the disease under study is compared with that in the control group. Association analysis typically involves regressing each SNP separately on a given trait, adjusted for patient-level clinical, demographic, and even environmental factors. The assumed underlying genetic model of association for each SNP (e.g., dominant, recessive, or additive) will affect the resulting findings; however, because of the large number of SNPs and their generally uncharacterized relationships to the outcome, a single additive model is typically used. In this case, each SNP is represented as the corresponding number of minor alleles (0, 1, or 2). Genome-wide association analysis typically includes data pre-processing with sample-level and SNP-level filtering to remove SNPs and samples that will not be included in the analysis [1]. Samples are generally filtered in relation to missing data, sample contamination, relatedness (for population-based investigations), and racial, ethnic, or gender ambiguity or discordance. SNPs are usually removed in relation to missing data, low variability, possible genotyping errors, or violations of Hardy-Weinberg equilibrium (HWE). In case-control association studies, the HWE filter is applied only to controls, as a violation in cases may be an indication of association. Furthermore, in the context of association studies, the presence of population substructure can result in spurious associations.
One approach is to stratify the analysis by ethnic groups; another is to account for the population substructure in the analysis of association. Usually, the first Principal Components (PC) are included as covariates, as these PCs are intended to capture information on latent population substructure that is typically not available in self-reported variables [2], [3]. Once the data have been filtered, statistical analysis is performed to test for associations. Many methodologies for the identification of disease-related SNPs use univariate tests that individually measure the dependency between each SNP and the trait of interest [4], [5], [6], [7]. With univariate testing, single association analysis involves regressing each SNP separately on a given trait, adjusted for possible covariates, and assessing the significance after correction for multiple comparisons using methods such as Bonferroni, Benjamini-Hochberg, or other false discovery rate (FDR) procedures [8], [9], [10]. However, all the p-value adjustment methods lead to a loss of sensitivity, which reduces the chance of detecting true positives. Furthermore, as analysing SNPs one at a time can neglect information about the joint distribution, multi-association analysis may be more suitable [11]. One possibility is to group the SNPs over a moving window and look for associations of groups with the disease, but the selection of the window is very subjective [12], [13]. Another approach in this direction is to consider stochastic search algorithms [14].
This article outlines a new method to identify relevant SNPs in case-control studies. The method provides two ordered lists of SNPs: one list with SNPs whose minor alleles can be considered risk alleles favouring the presence of the disease in individuals, and another list with SNPs whose minor alleles would be protective. These two lists help researchers decide where to focus their attention first.
The rest of the article is organized as follows. In the next section, we describe the proposed procedure. Then, we present the behaviour of the procedure using two published simulated data sets. Finally, we apply our method to an empirical data set of single-nucleotide polymorphisms related to attention deficit hyperactivity disorder (ADHD), a prevalent and highly heritable neurodevelopmental disorder that affects children and adults. We conclude with a brief discussion.

METHOD
We focus our attention on case-control studies. Let $Y$ be a categorical variable indicating the presence (coded by 1 in cases) or absence (coded by 0 in controls) of the disease of interest (e.g., ADHD). Let $X = (x^y_{ij})$ be an $n \times m$ data matrix containing the genotypes for the $j$th SNP ($j = 1, \ldots, m$) on the $i$th individual ($i = 1, \ldots, n$), with $n = n_1 + n_2$ ($n_1$ cases and $n_2$ controls); the superscript $y \in \{0, 1\}$ indicates the class of the individual. We consider the single additive model as the underlying genetic model of association. In this case, each SNP with alleles A and a tested in the case-control study generates three genotypes (AA, Aa, aa) that are represented as the corresponding number of minor alleles (0, 1, or 2). The model assumes that a SNP will be related with the disease if the number of values equal to 1 or 2 is substantially different in the case group than in the control group; that is, having one or two copies of the a allele will increase the probability of presenting the disease. Let $D = (d_{il})$ be the $n \times n$ Manhattan distance matrix between all the individuals, defined by
$$d_{il} = d(x^y_i, x^y_l) = \sum_{j} |x^y_{ij} - x^y_{lj}|.$$
Note that this distance differentiates between genotypes with values 1 and 2. For each individual $x^y_i = (x^y_{i1}, \ldots, x^y_{im})'$ in the case or control group ($i = 1, \ldots, n$), we consider its $K$-nearest neighbours among the $n_1$ cases, $NN_1(x^y_i)$, and its $K$-nearest neighbours among the $n_2$ controls, $NN_0(x^y_i)$. The method associates each SNP $j$ with a value $i^j_1$ obtained from the variable $I^j_1$, where
$$I^j_1 = \frac{1}{n_1 K} A^j_{1,0} - \frac{1}{n_2 K} A^j_{0,0}, \qquad A^j_{1,0} = \sum_{i=1}^{n_1} \sum_{k=1}^{K} B(p^j_{ik}), \qquad A^j_{0,0} = \sum_{i=1}^{n_2} \sum_{k=1}^{K} B(q^j_{ik}),$$
with $B(p^j_{ik})$ a Bernoulli variable that takes the value 1, with probability $p^j_{ik}$, if case $i$ takes values 1 or 2 and its $k$th control neighbour takes value 0 on the $j$th SNP; otherwise, it takes the value 0, with probability $1 - p^j_{ik}$. Similarly, $B(q^j_{ik})$ is a Bernoulli variable that takes the value 1, with probability $q^j_{ik}$, if control $i$ takes values 1 or 2 and its $k$th control neighbour takes value 0 on the $j$th SNP; otherwise, it takes the value 0, with probability $1 - q^j_{ik}$.
In other words, $A^j_{1,0}$ sums, over each case $i$ with values 1 or 2 in the fixed $j$th SNP, the number of times that this SNP takes the value 0 among the control neighbours $NN_0(x^1_i)$. In a similar way, $A^j_{0,0}$ sums, over each control $i$ with values 1 or 2 in the $j$th SNP, the number of times that this SNP takes the value 0 among the control neighbours $NN_0(x^0_i)$.
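The counting rule above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' implementation: the function names (`manhattan`, `k_nearest`, `i1_scores`) are hypothetical, ties between equidistant neighbours are broken by index order, and the two sums are assumed to be divided by $n_1 K$ and $n_2 K$ so that the score lies between -1 and 1.

```python
def manhattan(u, v):
    """Manhattan distance between two genotype vectors coded 0/1/2."""
    return sum(abs(a - b) for a, b in zip(u, v))


def k_nearest(index, pool, genotypes, k):
    """Indices of the k members of `pool` closest to `index` (itself excluded).

    Ties are broken by index order because Python's sort is stable.
    """
    others = [l for l in pool if l != index]
    others.sort(key=lambda l: manhattan(genotypes[index], genotypes[l]))
    return others[:k]


def i1_scores(genotypes, status, k):
    """Return one I_1 value per SNP.

    genotypes: list of rows, each a list of minor-allele counts (0, 1 or 2).
    status:    list of 0 (control) / 1 (case), same length as genotypes.
    Assumption: each sum is divided by (group size * k).
    """
    cases = [i for i, y in enumerate(status) if y == 1]
    controls = [i for i, y in enumerate(status) if y == 0]
    m = len(genotypes[0])
    a10 = [0] * m  # carrier cases paired with control neighbours at value 0
    a00 = [0] * m  # carrier controls paired with control neighbours at value 0
    for group, counts in ((cases, a10), (controls, a00)):
        for i in group:
            neighbours = k_nearest(i, controls, genotypes, k)
            for j in range(m):
                if genotypes[i][j] > 0:
                    counts[j] += sum(1 for l in neighbours if genotypes[l][j] == 0)
    n1, n2 = len(cases), len(controls)
    return [a10[j] / (n1 * k) - a00[j] / (n2 * k) for j in range(m)]


# Toy data: 2 cases, 3 controls, 2 SNPs; SNP 0 is carried only by cases.
toy_genotypes = [[2, 0], [1, 0], [0, 0], [0, 1], [0, 0]]
toy_status = [1, 1, 0, 0, 0]
print(i1_scores(toy_genotypes, toy_status, k=2))
```

In the toy data the first SNP, carried only by cases, receives the larger score. The brute-force neighbour search is quadratic in the number of individuals, which is adequate for illustration but would need a faster search for genome-wide data.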

Proposition :
Consider case $i$ and its control neighbours $NN_0(x^1_i)$. Let $p_i$ be the probability of observing values 1 or 2 in SNP $j$ for case $i$ given that the $j$th SNP is related with the disease, and let $w_j$ be the probability that the $j$th SNP is related with the disease. Then,
$$p^j_{ik} = w_j\, p_i (1 - p) + (1 - w_j) Q_j \quad \text{and} \quad q^j_{ik} = w_j\, p (1 - p) + (1 - w_j) Q_j,$$
with $p$ the probability of observing values 1 or 2 by chance, and $Q_j$ the probability that individual $i$ (case or control) takes values 1 or 2 and its $k$th control neighbour takes value 0, given that SNP $j$ is not related with the disease. Proof: Let $X^1_{ij}$ be the random variable representing SNP $j$ for case $i$ and $X^0_{i_k j}$ the corresponding variable for its $k$th control neighbour. The superscript 1 or 0 stands for case or control individual and is included as a reminder of the class of the individual at hand.
The probability $p^j_{ik}$ is the sum of the probabilities of the events
$$\{X^1_{ij} \in \{1, 2\},\ X^0_{i_k j} = 0\} \cap R_j \quad \text{and} \quad \{X^1_{ij} \in \{1, 2\},\ X^0_{i_k j} = 0\} \cap R^c_j,$$
with $R_j = \{\text{SNP } j \text{ is related with the disease}\}$ and $R^c_j = \{\text{SNP } j \text{ is not related with the disease}\}$. Thus,
$$p^j_{ik} = P(X^1_{ij} \in \{1, 2\},\ X^0_{i_k j} = 0 \mid R_j)\, P(R_j) + P(X^1_{ij} \in \{1, 2\},\ X^0_{i_k j} = 0 \mid R^c_j)\, P(R^c_j).$$
Given that the SNP is related with the disease, it is expected that the control neighbours have value 0, and therefore the probability $p$ of observing values 1 or 2 can be assumed to be due to chance and equal for all of them. If the $j$th SNP is not related with the disease, we expect a similar behaviour between cases and controls, so the probability $P\{\text{case } i \text{ takes values 1 or 2 and its } k\text{th control neighbour takes value 0} \mid \text{SNP } j \text{ is not related with the disease}\}$ is expected to be the same for a fixed case or control, and equal to $Q_j$.
For this reason,
$$p^j_{ik} = w_j\, p_i (1 - p) + (1 - w_j) Q_j.$$
In a similar way, we obtain $q^j_{ik} = w_j\, p (1 - p) + (1 - w_j) Q_j$.

Proposition :
SNPs that favour the presence of the disease have positive and large $I^j_1$ values. Proof: $A^j_{1,0} = \sum_{i=1}^{n_1} \sum_{k=1}^{K} B(p^j_{ik})$ and $A^j_{0,0} = \sum_{i=1}^{n_2} \sum_{k=1}^{K} B(q^j_{ik})$ are sums of Bernoulli variables with different parameters. Supposing independence, these sums follow a Poisson binomial distribution with means $\sum_{i=1}^{n_1} \sum_{k=1}^{K} p^j_{ik}$ and $\sum_{i=1}^{n_2} \sum_{k=1}^{K} q^j_{ik}$, respectively. Therefore, $E(I^j_1)$ is equal to
$$\frac{1}{n_1 K} \sum_{i=1}^{n_1} \sum_{k=1}^{K} \big(w_j\, p_i (1 - p) + (1 - w_j) Q_j\big) - \frac{1}{n_2 K} \sum_{i=1}^{n_2} \sum_{k=1}^{K} \big(w_j\, p (1 - p) + (1 - w_j) Q_j\big),$$
and then
$$E(I^j_1) = w_j (1 - p)\, \frac{1}{n_1} \sum_{i=1}^{n_1} (p_i - p).$$
Thus, SNPs with positive and large $I^j_1$ values are those that, broadly, show a lower probability of observing values 1 or 2 for the control neighbours than for case individuals, along with a large $w_j$ value. Hence, they are the SNPs of interest for further study as SNPs that favour the presence of the disease.
For the reasons explained above, the list of $i^j_1$ values in decreasing order provides a tool to focus a genetic study on those SNPs that favour the presence of the disease.
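As a quick numerical illustration of the expectation above, consider a hypothetical SNP with $w_j = 0.8$, chance probability $p = 0.1$, and $p_i = 0.5$ for every case. These values are chosen only for illustration, and the calculation assumes that the two sums in $I^j_1$ are normalised by $n_1 K$ and $n_2 K$, so that the $Q_j$ terms cancel.

```latex
\begin{align*}
p^j_{ik} &= 0.8 \cdot 0.5 \cdot 0.9 + 0.2\, Q_j = 0.36 + 0.2\, Q_j, \\
q^j_{ik} &= 0.8 \cdot 0.1 \cdot 0.9 + 0.2\, Q_j = 0.072 + 0.2\, Q_j, \\
E(I^j_1) &= p^j_{ik} - q^j_{ik} = 0.288
          = w_j (1 - p)(p_i - p) = 0.8 \cdot 0.9 \cdot 0.4 .
\end{align*}
```

For a SNP unrelated to the disease ($w_j = 0$) the same calculation gives $E(I^j_1) = 0$, whatever the value of $Q_j$.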
In a similar way, the method associates each SNP $j$ with a value $i^j_2$ obtained from the variable $I^j_2$, where
$$I^j_2 = \frac{1}{n_2 K} B^j_{0,1} - \frac{1}{n_1 K} B^j_{1,1}, \qquad B^j_{0,1} = \sum_{i=1}^{n_2} \sum_{k=1}^{K} B(p^j_{ik}), \qquad B^j_{1,1} = \sum_{i=1}^{n_1} \sum_{k=1}^{K} B(q^j_{ik}),$$
with $B(p^j_{ik})$ a Bernoulli variable that takes the value 1, with probability $p^j_{ik}$, if control $i$ takes values 1 or 2 and its $k$th case neighbour takes value 0 on the $j$th SNP; otherwise, it takes the value 0, with probability $1 - p^j_{ik}$. Similarly, $B(q^j_{ik})$ is a Bernoulli variable that takes the value 1, with probability $q^j_{ik}$, if case $i$ takes values 1 or 2 and its $k$th case neighbour takes value 0 on the $j$th SNP; otherwise, it takes the value 0, with probability $1 - q^j_{ik}$. That is, $B^j_{0,1}$ sums, over each control $i$ with values 1 or 2 in the fixed $j$th SNP, the number of times that this SNP has value 0 among the case neighbours $NN_1(x^0_i)$, and $B^j_{1,1}$ sums, over each case $i$ with values 1 or 2 in the $j$th SNP, the number of times that this SNP has value 0 among the case neighbours $NN_1(x^1_i)$. The next result can be proved in a similar way. Proposition: SNPs that protect against the disease have positive and large $I^j_2$ values. Therefore, the list of $i^j_2$ values in decreasing order provides a tool to focus a genetic study on those SNPs that protect individuals against the disease.

Comments
(1) As functions for a thorough quality control (QC) of the data, such as the Hardy-Weinberg equilibrium test and missingness filters, are well implemented in PLINK or GenABEL [15], we assume that the data have been cleaned by a standard QC process before applying our procedure.
(2) The distinction between value 0 and values 1 or 2 in the coded SNPs is clearly important. When it is also important to distinguish between values 1 and 2, we propose the use of the Manhattan distance. Otherwise, the Hamming distance is a good option.
(3) In general, the distribution of $I_1$ is unknown and, in order to determine a threshold for SNP selection, it is necessary to obtain the null distribution by permutation resampling. However, under some conditions, namely when $n_1$ and $n_2$ are large (> 2,000), $p^j_{ik}$ and $q^j_{ik}$ are small, and all the SNPs have the same probabilities $w_j = w$ and $Q_j = Q$, the Normal distribution is a good approximation [16] of the $I_1$ distribution. The same is true for the $I_2$ distribution.
(4) An important point is the possible influence of the number of neighbours, $K$, on the results. The value of $K$ must be moderate, since otherwise the method could not retain information on possible population substructure or possible dependence between the SNPs, if it exists. However, very low values, $K < 5$, may not be convenient, especially if there is large variability between individuals. Among moderate values the method is stable. Consider, for instance, a toy simulated example with 30 cases, 69 controls and 9 SNPs. We generated SNPs 6, 7, 8 and 9 to be related to case-control status and the other SNPs to have no association with it. Moreover, SNP.6 was highly correlated with SNP.7, and SNP.8 with SNP.9 (Fig. 1). As shown in Table 1, with $B = 500$ resamples, only small values of $K$ correctly identified SNP.6 to SNP.9 as significant at $\alpha = 0.05$; $I_2$ was not significant for any of the SNPs. Furthermore, Fisher's exact test, a standard test of association without population structure control, did not identify SNP.6 as significant. On the other hand, the gold standard method with population structure control, based on regressing each SNP using the first Principal Components (PC) as covariates [17], [18], did not identify SNP.6, SNP.8 and SNP.9 as significant, although they were generated with odds ratios equal to 1.52, 5.86 and 6.16, respectively.
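Comment (3) relies on permutation resampling to set a selection threshold. The text does not spell out how the null values are aggregated, so the following is only one plausible sketch: it shuffles the case-control labels `n_resamples` times, pools the per-SNP scores from all resamples, and returns the empirical $(1 - \alpha)$ quantile. The names `permutation_threshold` and `carrier_gap` are hypothetical, and `carrier_gap` is a simple stand-in statistic used to keep the example short; any function returning one score per SNP, such as an $I_1$ implementation, could be passed instead.

```python
import random


def permutation_threshold(statistic, genotypes, status, alpha=0.05,
                          n_resamples=500, seed=0):
    """Empirical (1 - alpha) quantile of `statistic` under shuffled labels.

    statistic: callable (genotypes, status) -> list of per-SNP scores.
    Assumption: null scores from all SNPs and all resamples are pooled.
    """
    rng = random.Random(seed)
    labels = list(status)
    null_scores = []
    for _ in range(n_resamples):
        rng.shuffle(labels)  # breaks any genotype/status association
        null_scores.extend(statistic(genotypes, labels))
    null_scores.sort()
    # Pick a pooled value with at least a (1 - alpha) share at or below it.
    position = min(len(null_scores) - 1, int((1 - alpha) * len(null_scores)))
    return null_scores[position]


def carrier_gap(genotypes, status):
    """Stand-in statistic: carrier frequency in cases minus that in controls."""
    cases = [g for g, y in zip(genotypes, status) if y == 1]
    controls = [g for g, y in zip(genotypes, status) if y == 0]
    return [
        sum(g[j] > 0 for g in cases) / len(cases)
        - sum(g[j] > 0 for g in controls) / len(controls)
        for j in range(len(genotypes[0]))
    ]


toy_genotypes = [[1, 0], [2, 0], [1, 1], [0, 0], [0, 1], [0, 0]]
toy_status = [1, 1, 1, 0, 0, 0]
threshold = permutation_threshold(carrier_gap, toy_genotypes, toy_status,
                                  alpha=0.05, n_resamples=200, seed=1)
observed = carrier_gap(toy_genotypes, toy_status)
print(threshold, [score > threshold for score in observed])
```

Fixing the seed makes the threshold reproducible; with real data the number of resamples would typically match the $B = 500$ used in the text.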

PUBLISHED SIMULATED DATA SETS
In this section, we describe the performance of our procedure on previously published simulated data sets. We also compare it with two alternative methods for single variant analysis, Fisher's exact test and PCA. In all cases the significance level was $\alpha = 0.0001$ and the number of neighbours was $K = 10$.

Simulated Data Set 1
Consider the simulated case-control data set simuCC included in the genMOSS R package [19]. It contains the genotype information for 6000 SNPs and the disease status for 2000 individuals, 1000 cases and 1000 controls. Two SNPs, rs4491689 and rs6869003, and a random environmental factor were associated with the presence of the disease. Both Fisher's exact test and the PCA approach, with Bonferroni or BH correction, identified 4 SNPs as significant (see Table 3 top and middle, respectively): the two associated with the disease (rs6869003 and rs4491689) and two (rs6722027, rs6730761) located in the genetic regions around rs4491689 and rs6869003.
Our method, using the permutation distribution of $I_1$ with $B = 500$ resamples, identified these two SNPs as the first and second SNPs in the ranked list of significant SNPs that favour the disease (see Table 3 bottom and Fig. 2). The method identified 8 SNPs as significant: the four identified by the Fisher and PCA methods and four with a value of $I_1$ almost equal to the threshold value (threshold value = 0.051). These four SNPs, which have a very low nominal p-value under the Fisher and PCA approaches, were lost after adjustment by either the Bonferroni or the BH method (see Table 3). Furthermore, using the permutation distribution of $I_2$ with $B = 500$ resamples, no protective SNPs were found, as expected.
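To make concrete how SNPs with low nominal p-values can be lost after adjustment, the two corrections mentioned above can be sketched as follows. This is a generic textbook sketch, not code from the cited packages; the function names are hypothetical and the p-values in the example are invented for illustration.

```python
def bonferroni_reject(p_values, alpha):
    """Reject H0 for every p-value at or below alpha / (number of tests)."""
    cutoff = alpha / len(p_values)
    return [p <= cutoff for p in p_values]


def benjamini_hochberg_reject(p_values, alpha):
    """Benjamini-Hochberg step-up procedure controlling the FDR at level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_rank = 0
    for rank, i in enumerate(order, start=1):
        # Keep the largest rank whose p-value is below its step-up bound.
        if p_values[i] <= rank * alpha / m:
            largest_rank = rank
    rejected = [False] * m
    for i in order[:largest_rank]:
        rejected[i] = True
    return rejected


# Invented nominal p-values for ten hypothetical SNPs.
nominal = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(sum(bonferroni_reject(nominal, alpha=0.05)))
print(sum(benjamini_hochberg_reject(nominal, alpha=0.05)))
```

With 6000 SNPs and $\alpha = 0.0001$, the Bonferroni cutoff is $0.0001 / 6000 \approx 1.7 \times 10^{-8}$, which illustrates why SNPs with small but not extreme nominal p-values do not survive the correction.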

Simulated Data Set 2
The simulated data set simGWAS in the simGWAS package [20] contains 250 controls and 250 cases, with 1000 SNPs. The variables SNP.1 to SNP.990 were simulated to have no association with the response, and the variables SNP.991 to SNP.1000 have the population odds ratios shown in Table 4. The variables age and sex were two additional control variables without association with the response. Results of the two standard procedures considered are shown in Table 5. Neither Fisher's exact test nor the PCA procedure identified significant SNPs with Bonferroni or BH correction.
Our method, using the permutation distributions of $I_1$ and $I_2$ with $B = 500$ resamples, detected 6 SNPs as associated with the disorder, with $\alpha = 0.0001$: SNP.1000, SNP.991, SNP.992 and SNP.993 as SNPs favouring the presence of the disease (see Fig. 3 top); and SNP.994 and SNP.998 as SNPs protecting from the disease (see Fig. 3 bottom).
The logistic regression performed using the 10 SNPs (SNP.991 to SNP.1000) confirmed that the roles assigned to the detected SNPs by the proposed procedure are correct. The corresponding coefficients in the logistic model for SNP.1000, SNP.991, SNP.992 and SNP.993 were positive, and those for SNP.994 and SNP.998 were negative. Furthermore, our 6 selected SNPs (see Fig. 3) indicated that our procedure detected only the most important SNPs, as the contribution of the 4 SNPs that were not detected is very small. The logistic regressions and AUC values using the 10 SNPs and our 6 selected SNPs are shown in Table 6, and Fig. 4 shows the corresponding ROC curves. When the gender of the individuals is known, we should separate the first term in $I_1$, $\sum_{i=1}^{n_1} \sum_{k=1}^{10} B(p^j_{ik})$, into two terms indicating the contributions of men and women separately, and assess whether, on average, their contributions are equal. As expected, no differences were found between the average contributions made by men and women, indicating that the behaviour of the SNPs was not related to gender.
[Table note: at the bottom, significant SNPs and the $I_1$ value according to the new procedure; the SNPs are shown ordered by $I_1$ value. OR$_1$: odds ratio for minor allele 1; OR$_2$: odds ratio for minor allele 2; OR: odds ratio for minor alleles 1 and 2.]

REAL DATA SET
Consider the following data set, previously used in different case-control attention-deficit/hyperactivity disorder (ADHD) studies [21]. The sample consisted of Spanish [...]. Cases and controls were genotyped using the same platform (HumanOmni1-Quad BeadChip, Illumina Inc., San Diego, USA) and only those who reported Caucasian origin were recruited, as described in [21]. In addition, we assessed ancestry using genome-wide data, by estimating principal components (PC) of a dataset including individuals of the study population (cases and controls) and a reference panel of individuals with known ancestries (1000G phase 1, www.internationalgenome.org), and excluding those individuals with PC1 or PC2 values greater than three standard deviations from the mean obtained for European individuals. A total of 155,802 SNPs covering the whole genome were considered for the analysis. They were obtained after clumping an initial set of around 4 million SNPs from GWAS data produced in [21] to minimize genetic redundancy; for each clump of correlated SNPs ($r^2 > 0.2$) within 500 kb windows, only the SNP with the most significant p-value of association with case-control status was kept. Considering two alleles for a SNP, A and a, we assume that having one or more copies of the A allele increases risk compared to a (i.e., Aa or AA genotypes coded by 1 and 2, respectively, have higher risk than aa coded by 0). Using the 155,802 SNPs, a Multidimensional Scaling using the Manhattan distance showed a perfect separation between cases and controls (see Fig. 5). The question is whether a smaller number of SNPs is enough to obtain a perfect discrimination between cases and controls.
First, aiming to identify SNPs with minor alleles that favour the presence of ADHD, we applied the proposed method to the whole data set (846 individuals and 155,802 SNPs). Once the SNPs have been selected, we apply DB-discriminant analysis, a discriminant analysis method based on distances [22], [23]. Fig. 6 shows that although the distribution of $I_1$ does not exactly follow a normal distribution, we can approximate the right tail of the distribution with a normal distribution with mean and standard deviation equal to 0.0002 and 0.025, respectively. We selected the 200 SNPs with the highest $I_1$ values, corresponding to $\alpha = 0.0012$ with a threshold value of 0.07503 ($P[I_1 > 0.07503] = 0.0012$), and the DB-discriminant method obtained a 91.13 percent leave-one-out total correct classification, with high sensitivity, specificity and predictive values in both controls and cases (see Table 7). Fig. 7 shows the Manhattan plot, with the positive $I_1$ values on the vertical axis and the physical position of SNPs along chromosomes on the horizontal axis.
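The link between a threshold and a significance level under the normal approximation can be sketched with the standard library alone. The helper names `normal_tail_probability` and `normal_threshold` are hypothetical. Because the mean and standard deviation quoted in the text are rounded, and the normal curve is only an approximation to the right tail, the values computed this way should be close to, but need not exactly match, the reported threshold and level.

```python
from statistics import NormalDist


def normal_tail_probability(threshold, mean, sd):
    """P[I > threshold] when I is approximated by Normal(mean, sd)."""
    return 1.0 - NormalDist(mean, sd).cdf(threshold)


def normal_threshold(alpha, mean, sd):
    """Value t such that P[I > t] = alpha under Normal(mean, sd)."""
    return NormalDist(mean, sd).inv_cdf(1.0 - alpha)


# Rounded tail parameters quoted in the text for the I_1 distribution.
print(normal_tail_probability(0.07503, mean=0.0002, sd=0.025))
print(normal_threshold(0.0012, mean=0.0002, sd=0.025))
```

Selecting a fixed number of top-ranked SNPs, as done in the text, and reading off the implied level is equivalent to choosing the level first and then the threshold; the two helpers simply convert between the two views.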
To assess the possible influence of the sample size on the results, we split the sample 20 times at random into train (90 percent) and test (10 percent) data. Taking SNPs with minor alleles favouring the presence of ADHD allows a highly reliable assignment of cases and controls, reaching correct classification percentages over 90 percent with, again, only 200 SNPs (see Tables 8 and 9).
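A repeated random split of this kind can be sketched as follows. The text does not say whether the splits were stratified by case-control status, so this sketch assumes plain unstratified splits; `random_splits` is a hypothetical helper name.

```python
import random


def random_splits(n_individuals, n_splits=20, test_fraction=0.10, seed=0):
    """Yield (train_indices, test_indices) pairs for repeated random splits."""
    rng = random.Random(seed)
    n_test = round(n_individuals * test_fraction)
    for _ in range(n_splits):
        shuffled = list(range(n_individuals))
        rng.shuffle(shuffled)
        # The first n_test shuffled indices form the test set, the rest train.
        yield sorted(shuffled[n_test:]), sorted(shuffled[:n_test])


for train, test in random_splits(846):
    # A SNP selection and classifier would be fitted on `train`
    # and evaluated on `test` at this point.
    pass
print(len(train), len(test))
```

Each pair partitions the individuals, so classification rates computed on the test indices are not influenced by individuals used for fitting within the same split.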
Looking in more detail at the 200 SNPs selected from the whole sample, we observed that the top finding is SNP rs739465 in the VAV2 gene, which encodes an angiogenic protein and was previously associated with multiple sclerosis. Other findings point to the NF1 gene, which encodes neurofibromin 1 and is causal for a Mendelian disorder, neurofibromatosis, but is also associated with risk-taking behaviour, alcohol consumption or anxiety.
Furthermore, we browsed the National Center for Biotechnology Information (NCBI) website [24] to find information on the identified SNPs. Consider, for instance, the NCBI data on SNP rs6797465. This SNP is located in an intronic region of the FHIT gene, so we subsequently used the GeneCards: The Human Gene Database website [25] to explore possible connections of this gene with ADHD in the section Phenotypes from the GWAS Catalog or in the section Disorders. As a result, we obtained several literature items associating FHIT with attention-deficit/hyperactivity disorder. In this way, we found that nine of the SNPs we identified are located in genes previously reported as related to ADHD (Table 10). For instance, RBFOX1, encoding a splicing factor, was found to be associated with depression and was also highlighted in a recent GWAS meta-analysis of 8 psychiatric disorders, including ADHD [26]. Also, CDH13, encoding a protein with cell adhesion properties and high expression in brain, has been associated with ADHD and several comorbid psychiatric disorders [27].
Finally, as we know the gender of the individuals, we studied the contributions of men and women to the first term of $I_1$. We observed that for only two of the 200 selected SNPs the mean contribution is larger for women than for men. For this reason, we considered it of interest to analyse the two genders independently. Thus, we calculated the $I_1$ values for men, $I_{1M}$, and for women, $I_{1W}$. Considering the analysis for men, the DB-discriminant analysis using the 200 top $I_{1M}$ SNPs obtained a correct classification rate equal to 92.67 percent. This list of SNPs selected according to the $I_{1M}$ values contains six genotyped SNPs that are located in genes previously reported as related to ADHD (see Table 10), and it has 74 SNPs in common with the 200 SNPs selected when using the whole sample. Considering the analysis for women, the DB-discriminant analysis using the 200 top $I_{1W}$ SNPs obtained a correct classification rate equal to 97.84 percent. In this list of SNPs selected according to the $I_{1W}$ values, eleven genotyped SNPs are located in genes previously reported as related to ADHD (see Table 10), and there are only 8 SNPs in common with the 200 SNPs selected when using the whole sample. Moreover, the $I_{1W}$ and $I_{1M}$ lists did not have any SNP in common. This limited overlap between the lists of the 200 top selected SNPs was surprising.
In order to shed light on this subject, we performed an MDS analysis, including the representation of these three lists of selected SNPs. Figs. 8 and 9 show that the SNPs are highly correlated, and this is the reason why, although the overlap between the SNP lists is low, a high rate of correct classification is always achieved. When Fisher's exact test was performed, SNPs were found not to be significant (smallest nominal p-value = 4.25e-07, 1.89e-08 and 3.14e-07 for the all-subjects, men and women samples, respectively). If we consider the usual Bonferroni-corrected significance threshold of 5e-8, only SNP rs9768620 for the men sample was significant. However, this SNP was not selected by our procedure. Using the all-individuals, men or women samples with PCA, the maximum likelihood estimates of the logistic regression parameters do not converge, as the first PC produces a complete separation of individuals [28], [29]. In this situation, statistical packages generally report results based on the last maximum likelihood iteration, and the validity of the model fit is questionable.
We end by pointing out that, analogously, the method identified significant associations with protective SNP alleles. Several SNPs identified as significant by our procedure, such as rs11644983 or rs1962749, were also identified as nominally associated with ADHD in a previous study [21]. To analyse these lists of selected SNPs in depth (with or without previously reported ADHD candidate genes), it is necessary to perform other analyses, for instance identifying the functional groups into which the identified genes fall or conducting enrichment studies. Another interesting issue is the possible connection between our findings and those from other psychiatric disorders, as there is an extensive genetic overlap between some of these diseases. However, all these questions are outside the aim of this work.

CONCLUSIONS AND FUTURE WORK
Within case-control genome-wide association studies, which interrogate hundreds of thousands of single-nucleotide polymorphisms (SNPs), this work proposes a new methodology to detect true signals of association with a phenotypic trait of interest. To accomplish this, we propose a method based on genetic distances between individuals that uses all the SNPs included in the data set. Thus, these distances contain all the information that can be obtained from the observed genotype data, for instance the population substructure. This is particularly attractive and represents an advantage over other methodologies. Another advantage of the proposed procedure is that it does not require paying attention to multiple testing issues, and the usual Bonferroni-corrected significance threshold of 5e-8 is not needed. Furthermore, linkage equilibrium is not required and the proposed procedure can handle missing data, so no imputation of missing values is required; however, it is advisable to retain only SNPs and individuals with less than 5 percent missing values, as usual. The method obtains two lists of SNPs which are deemed to be in statistically significant association with the categorical variable that indicates presence or absence of the disease. These lists rank the selected SNPs from most to least significant. The selected SNPs are candidates for a true disease association pending confirmation in the laboratory. One challenge is to analyse how to include covariates such as age or other clinical information in the proposed methodology. One possibility is the use of a convenient distance capable of synthesizing all the information, like the Gower distance [30] or the weighted related scaling metric distance [31], although more research is needed in this direction. We hope that the proposed methodology will help GWAS researchers gain a better understanding of the genetic basis of complex diseases.

ACKNOWLEDGMENTS
The authors would like to thank Paula Rovira for providing the list of SNPs from GWAS data on persistent ADHD. Conflict of Interest: JAR-Q was on the speakers' bureau and/or acted as consultant for Eli-Lilly, Janssen-Cilag, Novartis, Shire, Takeda, Bial, Shionogui, Lundbeck, Almirall, Braingaze, Sincrolab, Medice and Rubió in the last 5 years. He also received travel awards (air tickets + hotel) for taking part in psychiatric meetings from Janssen-Cilag, Rubió, Shire, Takeda, Shionogui, Bial, Medice and Eli-Lilly. The Department of Psychiatry chaired by him received unrestricted educational and research support from the following companies in the last 5 years: Eli-Lilly, Lundbeck, Janssen-Cilag, Actelion, Shire, Ferrer, Oryzon, Roche, Psious, and Rubió.

Concepción Arenas received the PhD degree in mathematics, specializing in statistics, from the University of Barcelona, Spain, where she is now a research professor with the Department of Genetics, Microbiology and Statistics, Statistics section. Her research interests include multivariate analysis as applied to bioinformatics, with emphasis on DNA sequence analysis and microarray interpretation. She also works in biomedical statistics.
Bru Cormand received the PhD degree in biological sciences from the University of Barcelona, where he is now a full professor. He is head of the Department of Genetics, Microbiology and Statistics at the University of Barcelona. He leads the Neurogenetics research group, with focus on the etiology of neuropsychiatric disorders, including attention-deficit/hyperactivity disorder, autism spectrum disorder and substance use disorders.
Itziar Irigoien received the PhD degree in informatics from the University of the Basque Country, Donostia, Spain, where she is now a research professor with the Department of Computation Science and Artificial Intelligence. Her research interests include the development of new statistical methods and software to solve bioinformatics and biomedical questions.