Kernel conditional embeddings for associating omic data types

Computational methods are needed to combine diverse types of genome-wide data in a meaningful manner. Based on the kernel embedding of conditional probability distributions, a new measure for inferring the degree of association between two multivariate data sources is introduced. We analyze the performance of the proposed measure in integrating mRNA expression, DNA methylation and miRNA expression data.


Introduction
Modern genomic and clinical studies are in strong need of integrative machine learning models to better exploit large volumes of heterogeneous information for a deeper understanding of biological systems and for the development of predictive models. For example, in current biomedical research it is not uncommon to have access to a large amount of data from a single patient, such as clinical records (e.g. age, gender, medical history, pathologies and therapeutics), high-throughput omics data (e.g. genomics, transcriptomics, proteomics and metabolomics measurements) and so on. How data from multiple sources are incorporated into a learning system is a key step for successful analysis.
Some of the most powerful methods for integrating heterogeneous data types are kernel-based methods [1]. Kernel-based data integration approaches can be described in two basic steps. First, an appropriate kernel is chosen for each data set. Second, the kernels from the different data sources are combined to give a complete representation of the available data for the statistical task at hand.
In this paper we propose a measure, novel to the best of our knowledge, for inferring the degree of association between two multivariate data sources, based on the embedding of conditional probability distributions in the framework of kernel methods.

Kernel conditional embeddings
Reproducing kernel Hilbert space (RKHS) methods provide a general and rigorous foundation for learning predictive models, where a model is determined by specifying a kernel function, a loss function and a penalty function [2]. The representer theorem [2] shows that the solutions of a large class of optimization problems in an RKHS can be expressed as kernel expansions over the sample points. A question that arises naturally in the context of inference is how to represent a probability distribution in an RKHS. With this goal, Smola et al. [3] and Fukumizu et al. [4], among others, introduced the RKHS versions of the fundamental multivariate statistics, the mean vector and the covariance matrix. These RKHS counterparts of the mean vector and the covariance matrix are called the mean element and the covariance operator, respectively.
Let $\mathcal{F}$ be an RKHS on the separable metric space $\mathcal{X}$, with continuous feature mapping $\varphi(x) \in \mathcal{F}$ for each $x \in \mathcal{X}$. The inner product between feature mappings is given by the kernel function $k(x, x') := \langle \varphi(x), \varphi(x') \rangle$. Let $P(X)$ be a probability distribution on $\mathcal{X}$. We can represent $P(X)$ by an element in the RKHS associated with the kernel $k$, through its mean embedding and the corresponding empirical estimator

$$\mu_X := \mathbb{E}_X[\varphi(X)], \qquad \hat{\mu}_X := \frac{1}{m} \sum_{i=1}^{m} \varphi(x_i). \qquad (1)$$

It has been shown that if $\mathbb{E}_X[k(X, X)] < \infty$, $\mu_X$ is guaranteed to be an element of the RKHS. The embedding $\mu_X$ of $P(X)$ enjoys two attractive properties. First, if the kernel is characteristic, the mapping from $P(X)$ to $\mu_X$ is injective, which means that different distributions are mapped to different points in the RKHS; an example of a characteristic kernel is the Gaussian kernel. Second, the expectation of any function $f \in \mathcal{F}$ can be evaluated as a scalar product in $\mathcal{F}$, $\mathbb{E}_X[f(X)] = \langle f, \mu_X \rangle_{\mathcal{F}}$.

Let $(X, Y)$ be a random variable taking values on $\mathcal{X} \times \mathcal{Y}$, and let $(\mathcal{F}, k)$ and $(\mathcal{G}, g)$ be RKHSs with measurable kernels on $\mathcal{X}$ and $\mathcal{Y}$, respectively. Let $\varphi(x) = k(\cdot, x)$ and $\phi(y) = g(\cdot, y)$ denote the feature maps. By analogy with the kernel embedding of the probability distribution $P(X)$, the kernel embedding of the conditional distribution $P(Y \mid x)$ is $\mu_{Y|x} := \mathbb{E}_{Y|x}[\phi(Y)]$. Given a data set $\mathcal{D} = \{(x_1, y_1), \ldots, (x_m, y_m)\}$ drawn i.i.d. from $P(X, Y)$, where $\Phi := (\varphi(x_1), \ldots, \varphi(x_m))$ and $\Upsilon := (\phi(y_1), \ldots, \phi(y_m))$ are the implicitly formed feature matrices and $K = \Phi^{\top} \Phi$ is the kernel matrix for the samples from variable $X$, Song et al. [5] estimate the conditional embedding as

$$\hat{\mu}_{Y|x} = \Upsilon (K + \lambda m I)^{-1} k_x = \Upsilon \beta(x), \qquad (2)$$

where $\lambda$ is a regularization parameter, $k_x := (k(x, x_1), \ldots, k(x, x_m))^{\top}$ and

$$\beta(x) := (K + \lambda m I)^{-1} k_x. \qquad (3)$$

The empirical estimator of the conditional embedding is similar to the estimator of the ordinary embedding in equation (1). The difference is that, instead of applying uniform weights $1/m$, the former applies non-uniform weights $\beta_i(x)$ to the observations, which are in turn determined by the value of the conditioning variable. These non-uniform weights reflect the effect of conditioning on the embedding.
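As an illustration, the estimator (2) with weights (3) can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation; the Gaussian kernel and the values of lam and sigma are illustrative choices.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of A and B."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma ** 2))

def conditional_weights(X, x, lam=1e-3, sigma=1.0):
    """Weights beta(x) = (K + lam*m*I)^{-1} k_x of the empirical
    conditional embedding mu_{Y|x} = sum_i beta_i(x) phi(y_i)."""
    m = X.shape[0]
    K = gaussian_kernel(X, X, sigma)             # kernel matrix on the X samples
    k_x = gaussian_kernel(X, x[None, :], sigma)  # (k(x, x_1), ..., k(x, x_m))
    return np.linalg.solve(K + lam * m * np.eye(m), k_x).ravel()

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
beta = conditional_weights(X, X[0])  # non-uniform weights, unlike the 1/m in (1)
```

Note that only the weight vector beta(x) is computed explicitly; the feature matrix Υ never needs to be formed.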

Measuring the discrepancy between conditional embeddings
Conditional embeddings allow us to quantify the differential effect on the response vector $Y$ when the value of the conditioning vector $X$ varies. For instance, the conditioning values at which $X$ is fixed may correspond to the mean vectors of $X$ measured under different experimental conditions. We propose the quantity $\| \mu_{Y|x_1} - \mu_{Y|x_2} \|_{\mathcal{G}}^2$ for measuring the differential effect on $Y$ when conditioning on $x_1$ versus $x_2$. From (2) we can estimate this quantity by using the statistic

$$\hat{\gamma}(x_1, x_2) := \| \hat{\mu}_{Y|x_1} - \hat{\mu}_{Y|x_2} \|_{\mathcal{G}}^2 = \beta(x_1)^{\top} G \beta(x_1) - 2\, \beta(x_1)^{\top} G \beta(x_2) + \beta(x_2)^{\top} G \beta(x_2), \qquad (4)$$

where $G = \Upsilon^{\top} \Upsilon$ is the kernel matrix for the samples from variable $Y$. To assess significance, we generate a null distribution by permuting the rows of $Y$ while keeping the rows of $X$ fixed. Thus, after $B$ permutations we have data sets $\mathcal{D}_1 = (X, Y_1), \ldots, \mathcal{D}_B = (X, Y_B)$, where each $Y_b$ results from a random permutation of the rows of $Y$. In this way we obtain $\hat{\gamma}_1, \ldots, \hat{\gamma}_B$, and we can estimate a p-value by computing the proportion of the $\hat{\gamma}_b$, $b = 1, \ldots, B$, that are greater than $\hat{\gamma}$.
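The statistic (4) and its permutation test can be sketched as follows; Gaussian kernels and the values of the regularization parameter lam, the kernel width sigma and the number of permutations B are illustrative assumptions, not the settings used in the paper's experiments.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of A and B."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma ** 2))

def gamma_stat(X, Y, x1, x2, lam=1e-3, sigma=1.0):
    """gamma = ||mu_{Y|x1} - mu_{Y|x2}||^2 via the kernel trick:
    b1'G b1 - 2 b1'G b2 + b2'G b2, with G the kernel matrix on Y."""
    m = X.shape[0]
    R = np.linalg.inv(gaussian_kernel(X, X, sigma) + lam * m * np.eye(m))
    b1 = R @ gaussian_kernel(X, x1[None, :], sigma)   # beta(x1)
    b2 = R @ gaussian_kernel(X, x2[None, :], sigma)   # beta(x2)
    d = b1 - b2
    G = gaussian_kernel(Y, Y, sigma)
    return (d.T @ G @ d).item()

def permutation_pvalue(X, Y, x1, x2, B=1000, seed=0):
    """Null distribution: permute the rows of Y, keep X fixed; the p-value
    is the fraction of permuted statistics exceeding the observed one."""
    rng = np.random.default_rng(seed)
    obs = gamma_stat(X, Y, x1, x2)
    null = [gamma_stat(X, Y[rng.permutation(len(Y))], x1, x2)
            for _ in range(B)]
    return obs, sum(g > obs for g in null) / B

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 2))
Y = X + 0.1 * rng.normal(size=(80, 2))   # strongly associated toy data
x1 = X[X[:, 0] < 0].mean(0)              # two conditioning values
x2 = X[X[:, 0] > 0].mean(0)
obs, pval = permutation_pvalue(X, Y, x1, x2, B=200)
```

On strongly associated toy data such as this, the observed statistic typically exceeds all permuted values, giving a small p-value.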
We aim to determine the degree of association between DNA methylation and mRNA expression. To this end, we measure the effect on mRNA expression ($Y$) when conditioning on different DNA methylation conditions ($X$). In particular, the DNA methylation conditions are fixed at the centers of the clusters discovered by spectral clustering of the DNA methylation data. In accordance with [7], we set the number of clusters to three. Patients were grouped into three clusters with 18, 140 and 57 patients each. Using (3) we computed $\beta(c_i)$, where $c_i$, $i = 1, 2, 3$, denotes the mean vector (centroid) of cluster $i$. Then, from (4) we computed $\hat{\gamma}_{ij} = \| \hat{\mu}_{Y|c_i} - \hat{\mu}_{Y|c_j} \|_{\mathcal{G}}^2$, where the indexes $i$ and $j$ denote the pair of vectors $c_i$ and $c_j$ on which the conditional embeddings were compared.
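The centroid-conditioning step can be sketched as follows, with cluster labels assumed to be given by some clustering method (synthetic labels stand in here for the spectral clustering used in the paper); the Gaussian kernel and the lam and sigma values are again illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma ** 2))

def pairwise_gamma(X, Y, labels, lam=1e-3, sigma=1.0):
    """Condition on each cluster centroid c_i of X and return the matrix of
    statistics gamma_ij = ||mu_{Y|c_i} - mu_{Y|c_j}||^2."""
    m = X.shape[0]
    C = np.stack([X[labels == c].mean(0) for c in np.unique(labels)])
    K = gaussian_kernel(X, X, sigma)
    G = gaussian_kernel(Y, Y, sigma)
    # columns of B are the weight vectors beta(c_1), beta(c_2), ...
    B = np.linalg.solve(K + lam * m * np.eye(m), gaussian_kernel(X, C, sigma))
    D = B.T @ G @ B                         # inner products of the embeddings
    g = np.diag(D)
    return g[:, None] + g[None, :] - 2 * D  # gamma_ij, zero on the diagonal

# toy data: three well-separated clusters in X, with Y strongly tied to X
rng = np.random.default_rng(2)
labels = np.repeat(np.arange(3), 30)
X = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])[labels]
X = X + 0.3 * rng.normal(size=X.shape)
Y = X + 0.1 * rng.normal(size=X.shape)
Gam = pairwise_gamma(X, Y, labels)          # 3 x 3 symmetric matrix
```

Computing all weight vectors in one linear solve avoids inverting the regularized kernel matrix once per centroid.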
We used a Gaussian kernel for both $X$ and $Y$; the kernel parameters were adjusted using the sigest function in the kernlab package [8]. Figure 1 shows the heatmap of the kernel matrix corresponding to the methylation data. We observe that the kernel matrix reveals the same patterns of similarity found by spectral clustering. In fact, when the samples are ordered according to the clusters found by spectral clustering, identified in the heatmap by the upper color bar, the similarity values in the kernel matrix show three homogeneous groups that coincide with the clusters: a small group of samples in the bottom left corner (group 1), the largest group in the central part of the heatmap (group 2), and a group of samples in the upper right corner (group 3). Figure 2 shows the vectors $\beta(c_i)$, $i = 1, 2, 3$, and, in the last column, $\beta(\bar{x})$, which define the weights of the conditional embeddings (3). Samples, grouped according to the cluster they belong to, are shown in rows; for each sample, row-normalized weights are displayed. Observe that the normalized weights change consistently across conditions (cluster centroids); that is, the samples with the highest weights belong to the cluster on which we are conditioning. To assess the statistical significance of the empirical values $\hat{\gamma}_{ij}$, we applied a permutation-based test with 5000 permutation samples. We observe (Table 1) that the significant pairwise comparisons are those involving group 3. On the other hand, the comparisons with respect to the conditional embedding on the overall mean are only significant for clusters 2 and 3. Table 1 also includes a summary of the null distribution of the test.

In addition, we study the association between gene expression ($Y$) and miRNA expression ($X$). In analogy with the previous analysis, the miRNA conditions were determined by the centroids of the clusters from the spectral clustering of the miRNA data set. In accordance with [7], we set the number of clusters to three.
The clusters have 70, 84 and 61 patients each. Next, from (4) we computed $\hat{\gamma}_{ij} = \| \hat{\mu}_{Y|c_i} - \hat{\mu}_{Y|c_j} \|_{\mathcal{G}}^2$, where the indexes $i$ and $j$ denote the pair of vectors $c_i$ and $c_j$ on which the conditional embeddings were compared. Figure 3 shows the vectors that define the weights of the conditional embeddings (3). The samples in the rows are grouped according to the cluster they belong to; for each sample, row-normalized weights are displayed. The normalized weights change almost consistently across conditions (cluster centroids). We applied a permutation-based test with 5000 permutation samples to evaluate the significance of the empirical values $\hat{\gamma}_{ij}$. We observe (Table 2) that only the comparison between groups 1 and 2 is significant; no other pairwise comparison is significant, nor are the comparisons between the conditional embeddings and the mean embedding.
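Both analyses set the Gaussian kernel width with kernlab's sigest, which inspects quantiles of the squared distances between samples. A rough Python analogue is sketched below; the use of the median rather than sigest's exact quantile choices is an assumption for illustration.

```python
import numpy as np

def sigma_heuristic(X, n_pairs=1000, seed=0):
    """Quantile-based bandwidth heuristic in the spirit of kernlab's sigest:
    look at squared distances between randomly chosen pairs of samples."""
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(X), n_pairs)
    j = rng.integers(0, len(X), n_pairs)
    d2 = ((X[i] - X[j]) ** 2).sum(1)
    d2 = d2[d2 > 0]                    # drop accidental self-pairs
    # median heuristic: typical squared distance is of order 2 * sigma^2
    return float(np.sqrt(np.median(d2) / 2))

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))
s = sigma_heuristic(X)
```

Such a data-driven width keeps the kernel matrix away from the degenerate extremes of all-ones or near-identity.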

Conclusions
We propose a measure to integrate data in the framework of kernel methods. The methodology is based on the kernel embedding of conditional probability distributions. Our measure allows us to infer the degree of association between two types of multivariate measurements by quantifying the effect on the mean element associated with the response vector when it is conditioned on different values of the explanatory vector, representing different experimental or clinical conditions.

Table 2. Gene expression and miRNA analysis. Summary of the permutation test.