Identification of Novel Loci and New Risk Variant in Known Loci for Colorectal Cancer Risk in East Asians

Background— Risk variants identified so far for colorectal cancer explain only a small proportion of familial risk of this cancer, particularly in Asians. Methods— We performed a genome-wide association study (GWAS) of colorectal cancer in East Asians, including 23,572 colorectal cancer cases and 48,700 controls. To identify novel risk loci, we selected 60 promising risk variants for replication using data from 58,131 colorectal cancer cases and 67,347 controls of European descent. To identify additional risk variants in known colorectal cancer loci, we performed conditional analyses in East Asians. Results— An indel variant, rs67052019 at 1p13.3, was found to be associated with colorectal cancer risk at P = 3.9 × 10 −8 in Asians (OR per allele deletion = 1.13, 95% confidence interval = 1.08–1.18). This association was replicated in European descendants using a variant (rs2938616) in complete linkage disequilibrium with rs67052019 (P = 7.7 × 10 −3 ). Of the remaining 59 variants, 12 showed an association at P < 0.05 in the European-ancestry study, including rs11108175 and rs9634162 at P < 5 × 10 −8 and two variants with an association near the genome-wide significance level (rs60911071, P = 5.8 × 10 −8 ; rs62558833, P = 7.5 × 10 −8 )in the combined

Asians differ significantly from European descendants in genetic architectures. Genetic studies in Asians may provide an opportunity to explore the genetic architecture of colorectal cancer including identification of novel variants. We established the Asia Colorectal Cancer Consortium (ACCC) in 2010 to identify new genetic risk factors for colorectal cancer. Over the past 10 years, we have identified about 30 novel colorectal cancer risk loci (6-8, 11, 14). To further increase the statistical power of uncovering novel susceptibility loci for colorectal cancer, we utilized data from studies of 58,131 cases and 67,347 controls of European ancestry (10) to replicate promising risk variants identified in GWAS of 23,572 cases and 48,700 controls recruited from 15 studies conducted in three East Asian Countries (China, Japan, and Korea). Furthermore, we performed conditional analyses to identify potential independent signals at each of the colorectal cancer risk loci identified in previous studies in Asians.
Supplementary Data). To increase the statistical power, we used data from European descendants to replicate promising findings from the ACCC.
On the basis of the results from the meta-analysis of all Asian studies, we selected 60 promising variants (P < 5 × 10 −3 , Supplementary Table S2) that are 500 kb away from any established colorectal cancer risk loci (10,11) at the time of study design for replication using data from 58,131 cases and 67,347 controls of European descent (10). These cases and controls were derived from three colorectal cancer study consortia: the Genetics and Epidemiology of Colorectal Cancer Consortium (GECCO), the Colorectal Transdisciplinary (CORECT) Study, and the Colon Cancer Family Registry (CCFR). The details of each included study have been reported previously (9,10). For this analysis, the Haplotype Reference Consortium panel without indel variants was used as reference for imputation.
All study protocols were approved by the relevant Institutional Review Boards (10,11) and informed consents were obtained from study participants. Research was conducted in accordance with the Belmont report.

Genotyping and imputation
Details of genotyping, genotype calling, quality control and imputation for the ACCC have been reported previously (Supplementary Data;11,14). The genotypes for samples from six studies (Shanghai-4, Aichi-2, Korea-NCC, Korea-NCC2, Korea-Seoul, and HCES2-CRC; Supplementary Table S1) genotyped with Illumina MEGA-Expanded Array (Illumina Inc.) were updated on the basis of the manually reclustered genotype cluster files (for 123K SNPs). Little evidence of population stratification was found in these studies (6,8,11), based on principal component (PC) analysis, using EIGENSTRAT (Supplementary Fig.  S1; ref. 15). To increase the genome coverage and facilitate the meta-analysis, the 1000 Genomes Project phase III mixed reference haplotypes (version 5) were used to impute untyped genotype data with Michigan Imputation Server (minimac3 for imputation and SHAPEIT for prephasing).
We evaluated the quality of imputation using whole-genome sequencing data for the five variants (rs67052019, rs60911071, rs62558833, rs11108175, and rs9634162; Supplementary Table S3) that were highly significantly associated with colorectal cancer risk. Data for these five variants were extracted from whole-genome sequence datasets for 290 Shanghai colorectal cancer samples and compared with the genotypes derived from the imputationbased approach. The whole genome sequence was performed using the BGISEQ-500 sequencing platform with paired-end reads in length with 2 × 100 bp (mean read depth ~50M). The sequencing reads for each sample were mapped to the human reference genome (hg38) using the Burrows-Wheeler Aligner BWA program (version 0.75). The aligned reads were processed using the Genome Analysis ToolKit (GATKv3.7). Variant calling was performed individually for each sample with the GATK HaplotypeCaller tool and all samples together with GenotypeGVCFs to create a complete list of SNPs and indel VCFs. The Variant Quality Score Recalibration (VQSR) was then applied to filter variants of low quality.

Statistical analysis
We used the score test implemented in Rvtest (16) to associate genotype dosages with colorectal cancer risk after adjusting for age, sex, and the first five PCs of each individual study. SNPs with a low imputation quality (R 2 < 0.3) or a low MAF (<0.1% in the combined samples) were excluded from the downstream analysis. Summary statistics from each of 15 ACCC case-control studies were meta-analyzed using METAL (17) with the inverse variance-weighted fixed effect model. Associations with a P < 5 × 10 −8 in the Asian studies alone or in combination with European studies were regarded as genome-wide significant. Each independent locus was defined as ±500 kb on either side of the most significant SNP that reached a genome-wide significant threshold (P < 5 × 10 −8 ). The Cochran Q test (18) was used to evaluate the heterogeneity across studies and subgroups. We did not observe any apparent inflation in association statistics from the ACCC (Supplementary Fig. S2; refs. 6,8,11).
We performed approximate conditional analyses based on meta-analysis summary statistics to identify additional independent association signals at each locus using the GCTA-COJO method (19). The linkage disequilibrium (LD) matrix used in the analyses were based on 6,684 unrelated East Asian samples (interindividual genetic relationships < 0.025; ref. 11). Considering relatively small sample sizes of colorectal cancer studies in populations of East Asian ancestry (10,11), variants that are conditionally independent of the index SNPs within a region and reached an empirical locus-wide significance of P < 5 × 10 −5 were considered to be distinct association signals.

In silico functional characterization of novel Ioci
To identify potential functional variants and target genes at novel risk loci identified in this study, we annotated all variants that are in LD (r 2 ≥ 0.8, EAS in the 1000 Genomes Project; 131 variants in total; Supplementary Table S4) with each of identified GWAS index variants within 500 kb using ANNOVAR (20). We used PolyPhen2 (21) and SIFT (22) to assess potential functional impact of coding variants. We further used regulatory information from the Roadmap Epigenomic project (epigenomic profiling) and the ENCODE project (regulatory protein binding and regulatory motifs) to characterize these 131 selected variants with the web-based HaploReg v4 (https://pubs.broadinstitute.org/mammals/haploreg/ haploreg.php) as described previously (11). We tested for enrichment of colorectal cancer index variants with functional domains using the software of Genomic Regulatory Elements and Gwas Overlap algoRithm (GREGOR;ref. 23). This method tests for an increase in the number of colorectal cancer-associated index variants (94 independent GWAS index variants identified in populations of East Asian ancestry at P < 5 × 10 −8 or replicated in populations of East Asian ancestry at P < 5 × 10 −2 for those variants originally identified in populations of European ancestry; Supplementary Table S5), or their LD proxies (r 2 0.7, EAS in the 1000 Genomes Project), overlapping with the regulatory features (peaks from H3K4me1 and H3K27ac as enhancers, peaks from H3K4me3 and H3K9ac as promoters, and open chromatin as measured by DNase hypersensitivity) more often than expected by chance by comparing to permuted control sets in which the variants are matched on variant frequency, number of LD proxies, and distance to the nearest gene. A saddle-point approximation was used to estimate the P value based on the distribution of permuted statistics (23).

cis-expression quantitative trait loci analysis
We conducted cis-expression quantitative trait loci (cis-eQTLs) analyses for each of identified novel risk variants. The RNA was extracted from tumor-adjacent normal tissues obtained from 133 East Asian patients with colorectal cancer (7,8,11). We profiled gene expression using RNA sequencing with total mapped reads > 14M for each sample (11). We quantified gene expression levels using FPKM (Fragments Per Kilobase of transcript per Million mapped reads) values. For cis-eQTLs analyses, we defined as a region ±1 mb within each risk variant. These patients were genotyped using the Illumina MEGA array as described before (11). We used linear regression models to associate gene expression levels with SNP genotypes with adjustment for sex and the top two PCs. We also evaluated associations of novel risk variants with gene expressions in 246 transverse colon tissues included in the Genotype-Tissue Expression (GTEx) database (24).
Of the remaining 59 variants for replication in populations of European descent (58,131 colorectal cancer cases and 67,347 controls; Supplementary Table S2), 12 variants were replicated in the same direction as observed in East Asian populations at P < 0.05. Three variants (rs4308634, rs11108175, and rs9634162) were identified to be associated with colorectal cancer risk at the genome-wide significance level (P < 5 × 10 −8 ) in the combined meta-analysis results of East-Asians and Europeans (Table 1; Fig. 1; Supplementary Figs. S3 and S4). However, the variant of rs4308634 [OR for the G allele = 1.04 (95% CI = 1.03-1.06) and P = 6.5 × 10 −9 in the combined meta-analyses] at the 7p12.3 was in high LD (r 2 EAS = 0.68) with a nearby variants (rs10951878) recently identified in populations of European ancestry (13). In addition, two variants (rs60911071 and rs62558833) were associated with colorectal cancer risk near the genome-wide significance level (P = 5.8 × 10 −8 and 7.5 × 10 −8 , respectively; Table 1; Fig. 1; Supplementary Fig. S4). Little heterogeneity of the effect size was observed between these two populations for these variants (P ≥ 0.01, Table 1). Stratification analyses of these newly identified risk variants (rs67052019, rs60911071, rs62558833, rs11108175, and rs9634162) by tumor site (colon or rectum) did not identify any significant heterogeneity (P > 0.05; Supplementary Table S6).
The average imputation quality for these five highly significant variants (rs67052019, rs60911071, rs62558833, rs11108175, and rs9634162) for colorectal cancer risk ranged from 0.78 to 0.99 in Asian cohorts and from 0.98 to 0.99 in European cohorts (Supplementary Table S3). The overall genotype concordance rates for these five common variants (minor allele frequency ≥0.18; Table 1) between imputed data and whole-genome sequencing data were high (>0.90) based on 290 Shanghai colorectal cancer samples, indicating that the imputation quality was excellent for these SNPs (Supplementary Table  S3).

Functional characterization of risk loci and cis-eQTL analyses
We used functional genomics data to annotate each of the identified five novel variants (Table 1) as well as their correlated variants (r 2 EAS ≥ 0.80).Aligning these risk variants with histone methylation/acetylation marks and DNase hypersensitivity sites (26) revealed that variants at three loci (1p13.3, 8p21.2, and 12q22) overlapped with the promoter/enhancer histone marks or DNase hypersensitivity sites in gastrointestinal tissues (Supplementary  Table S4). This suggests that these variants may be involved in regulating gene expressions in gastrointestinal tissues. We further tested for tissue regulatory element enrichments of 94 colorectal cancer-associated variants identified or replicated in populations of East Asian ancestry (Supplementary Table S5 and Materials and Methods) using GREGOR (23). We found that colorectal cancer-associated variants were strongly associated with regulatory elements in gastrointestinal tissues (fetal stomach, fetal small intestine, fetal large intestine, small intestine, sigmoid colon, colonic mucosa, rectal mucosa, and stomach mucosa; >2.0 × enrichment; Supplementary Tables S7 and S8). Interestingly, monocytes were also enriched as indicated by DNase I hypersensitive sites (P = 8.1 10 −13 , 2.3 × enrichment; Supplementary Tables S7 and S8). Similar results were observed on the basis of colorectal cancer-associated variants identified separately in populations of East Asian ancestry (fetal stomach, fetal small intestine, fetal large intestine, sigmoid colon, rectal mucosa, stomach mucosa, and monocytes; >2.0 × enrichment and P < 7.2 × 10 −6 ) or populations of European ancestry (fetal stomach, fetal small intestine, fetal large intestine, sigmoid colon, colonic mucosa, and rectal mucosa; >2.0 × enrichment and P < 6.7 × 10 −10 ).
We performed cis-eQTL analyses using transcriptome data from tumor-adjacent normal colon tissues from 133 patients with colorectal cancer of East Asian ancestry (Supplementary Table S9) and transverse colon tissues from 246 individuals predominantly of European ancestry in the GTEx (Supplementary Table S10). Significant correlations at P < 0.05 were found for 3 and 6 SNP-gene expression pairs in the East Asian and GTEx datasets, respectively. The colorectal cancer risk (deletion) allele of rs67052019 was associated with reduced expression of UBL4B and GPR61, but increased expression of KCNC4-AS1 (consistent between the East Asian and GTEx datasets) and TMEM167B. The colorectal cancer risk T allele of rs62558833 was associated with increased expression of SMU1, DCAF12, and NUDT2. The CRC risk A allele of rs11108175 was associated with increased expression of VEZT.

Discussion
In this meta-analysis of 23,572 colorectal cancer cases and 48, 700 controls in East-Asians and follow-up replication analyses of 58,131 colorectal cancer cases and 67, 347 controls in individuals of European descent, we identified three novel risk loci and two highly suggestive loci for colorectal cancer risk. In addition, we identified 13 secondary signals at 11 known colorectal cancer risk loci in East Asians. Using functional genomics data, we showed that three of the newly identified risk variants, or their highly correlated variants, are located in regulatory regions of the genome. It indicates that these variants potentially regulate the expression of nearby genes in gastrointestinal tissues. In addition, monocytes were implicated in colorectal cancer carcinogenesis for the first time based on tissue regulatory element enrichment analyses. Our cis-eQTL analyses provide additional supports for a possible role of several risk variants identified inour study in regulating expression of cancer-related genes. Our study provides novel information toward the understanding of the genetic and biological basis of colorectal cancer.
At the 1p13.3 locus, the deletion allele of rs67052019 was associated with increased colorectal cancer risk. The variant rs67052019 was 58.8 Kb upstream of the EPS8L3 gene. The function of the encoded protein by EPS8L3 is unknown. Interestingly, the deletion allele of rs67052019 was associated with increased KCNC4-AS1 (KCNC4 antisense RNA) expression in both Asian and GTEx datasets. The encode protein by KCNC4 is the voltage gated Kv3.4 potassium channel protein involved in regulating mammalian cellcycle (27,28). Theroles of Kv channels in cancer development and progression have been well established, and they are not only involved in cell proliferation and tumor growth, but also in cell migration and metastasis (29,30).
At the 8p21.2 locus, rs60911071 is 35 kb downstream of STC1. Another variant (rs2928679, r 2 EAS or EUR = 0.0, between rs60911071 and rs2928679) in this region was reported to be associated with prostate cancer risk in populations of European ancestry (31). The encoded protein by STC1 is a secreted glycoprotein, and is expressed ubiquitously, including in the gastrointestinal tract. STC-1 is reported to mediate the metastatic effect of platelet-derived growth factor signaling in colorectal cancer-associated fibroblasts (32). High expressions of STC1 are correlated with poor postoperative survival in patients with colorectal cancer (33).
At the 9p13.3 locus, rs62558833 is an intronic variant of UBAP2. The encoded protein by UBAP2 functions in the ubiquitination pathway. It inhibits the invasion of hepatocellular carcinoma cell by ubiquitinating and degrading Annexin A2 (34). The colorectal cancer risk T allele of rs62558833 was associated with an increased expression of SMU1 (Supplementary Table S9), DCAF12, and NUDT2 (Supplementary Table S10). The encoded protein by SMU1 is suggested to be involved in genome stability maintenance by negatively regulating DNA synthesis (35). The encoded protein by DCAF12 is required for developmental apoptosis (36). The encoded protein by NUDT2 is suggested to be a tumorpromoting factor, and high expressions of NUDT2 are associated with poor prognosis and an increased risk of breast cancer recurrence (37,38).
At the 12q22 locus, rs11108175 is a downstream variant of NTN4. Another nearby variant of rs17356907 (r 2 EAS or EUR = 0.0,between rs11108175 and rs17356907) was reported to be associated with breast cancer risk in European populations (39). NTN4 encodes a member of the netrin family of proteins that functions in various biological processes, including axonal guidance, tumorigenesis, and angiogenesis (40). NTN4 overexpression is observed to suppress primary and metastatic colorectal tumor progression through inhibiting tumor growth and angiogenesis (41)(42)(43). The colorectal cancer risk A allele of rs11108175 was nominally associated with an increased VEZT expression (Supplementary Table S10). The encode protein by VEZT is a ubiquitous transmembrane protein that is localized to adherens junctions in epithelial cells (44). How the aberrant expression of VEZT affects colorectal cancer risk warrants further investigations. At the 12q24.21 locus, rs9634162 is 9.9 kb downstream of the TBX3 gene. Two nearby variants, rs59336 (P = 3.7 × 10 −7 on colorectal cancer risk, r 2 EAS = 0.58 and r 2 EUR = 0.66 between rs9634162 and rs59336) (45) and rs1427760 (P = 2.5 × 10 −7 on colorectal cancer risk, r 2 EAS = 0.78 and r 2 EUR = 0.89 between rs9634162 and rs1427760; ref. 10) were reported to be associated with colorectal cancer risk in populations of European ancestry. We established this locus as one of the bona fide colorectal cancer risk loci at genome-wide significance. The protein encoded by TBX3 belongs to the evolutionarily conserved T-box family of transcription factors that play critical roles in early embryonic development. Overexpression of Tbx3 is associated with multiple cancers potentially by modulating cell proliferation and survival, tumor formation and metastasis, and drug resistance (46). Emerging evidence suggests that Tbx3 is not only important for stem cell self-renewal, but also is extensively involved in cancer stemness (46,47).
Independent secondary signals were observed at colorectal cancer risk loci in both East-Asian populations (7,8,11,14,25) and European populations (10,13,(48)(49)(50)(51). We observed 13 independent secondary signals at 11 additional colorectal cancer loci in East Asians. Conditional joint analysis could refine association signals and uncover additional GWAS loci for colorectal cancer. The loci of MYL2/SH2B3 at 12q24.11 and LRP1/ARHGAP9 at 12q13.3 that were reported in European populations for colorectal cancer risk (10,52) only reached the genome-wide significance in the conditional joint analysis in East Asians. The missense variants of rs78894077 in SH2B3 at 12q24.11 only exist in East Asian populations (Table 2) and is predicted to be highly "deleterious" in both PolyPhen2 (21) and SIFT (22). This strongly implicates that SH2B3 is the underlying causal gene for colorectal cancer risk at this locus. The low-frequency intronic variant of rs368674461 in MYHAS at 17p12 also only exists in East Asian populations (Table 2); however, the function of MYHAS is currently unknown.
In summary, we identified three novel variants and two highly suggestive loci for colorectal cancer risk in this large GWAS of colorectal cancer. Combining information from functional annotations, cis-eQTL analyses and literature review, we propose the putative candidate genes for these loci: KCNC4-AS1 at 1p13.3, STC1 at 8p21.2, NUDT2 at 9p13.3, NTN4 at 12q22, and TBX3 at 12q24.21. Some of the putative target genes suggested by the results from our studies are located in established pathways for colorectal tumorigenesis, such as the maintaining of colon stem cells (TBX3). Multiple independent signals exist at many colorectal cancer loci; therefore, more familial relative risk of colorectal cancer could be explained at a locus when multiple independent signals were considered. However, extensive fine-mapping and functional follow-up studies are needed to identify the causal variants and target genes at each of the identified regions.
Victoria. Cases and their vital status were ascertained through the Victorian Cancer Registry (VCR) and the Regional association plots of five colorectal cancer risk loci in Asian descendants (A, rs67052019; B, rs60911071; C, rs62558833; D, rs11108175; and E, rs9634162). Each dot represents the P value (on a log 10 scale) of an SNP with colorectal cancer risk based on the meta-analysis results in East Asians only and is presented according to its genomic position (NCBI Build 37). The most significantly associated SNP in the combined meta-analyses is represented by purple. The color of all other SNPs indicates the level of LD with the lead SNP (estimated by ASN r 2 from the 1000 Genome Project data). Recombination rates were also estimated from 1000 Genomes Project data, and gene annotations within the 2 Mb regions that are centered on the newly identified risk variants were obtained from the UCSC Genome Browser. Lu  1.69 × 10 −9 1.10 (1.07-1.14) 1.06 × 10 −9 Abbreviations: CHR, chromosome; CI, confidence interval; EA, effective allele; Freq, frequency of the effective allele; Pos, position (hg19); r, linkage disequilibrium correlation between an SNP and the next adjacent SNP at a locus. a LD correlation between an SNP and the next adjacent SNP at a locus. b Known: loci known to contain independent secondary signals for CRC risk in East Asians; new: loci not reported before to contain independent secondary signals for CRC risk in East Asians.