Please use this identifier to cite or link to this item:
Title: Comprehensive identification and characterisation of germline structural variation within the Iberian population
Author: Valls Margarit, Jordi
Director/Tutor: Torrents Arenales, David
Keywords: Human population genetics
Gene therapy
Personalized medicine
Issue Date: 8-Jun-2021
Publisher: Universitat de Barcelona
Abstract: [eng] One of the central aims of biology and biomedicine has been to characterise and understand genetic variation across humans, both to answer important evolutionary questions and to explain phenotypic variability related to disease. Understanding genetic variability is key to studying this relationship (through imputation and GWAS) and to translating the results into improved clinical protocols. Different initiatives have emerged around the world to systematically characterise the genetic variability of specific human populations from whole-genome sequences, usually by selecting geographical regions. Projects such as 1000 Genomes (1000G) [1], GoNL [2], HRC, UK10K [3] and the Estonian population [4] have already identified and characterised millions of genetic variants across different populations. In combination with imputation analysis, these sequence-based projects increase the statistical power and resolution of Genome-Wide Association Studies (GWAS), enabling the discovery of new disease-associated variants [5]. Additionally, genetic variability among population groups is associated with geographic ancestry and can affect disease risk or treatment efficacy differently [6,7]. For this reason, population-specific reference panels are necessary to characterise the genetic diversity of each population and to assess its effect on human phenotypes, improving GWAS as one of the cornerstones of precision medicine [7]. Existing genetic variability panels include Single Nucleotide Variants (SNVs) and indels (<50bp) but are limited in large Structural Variants (SVs) (≥50bp). Technical and methodological limitations have hindered the discovery of SVs using Next-Generation Sequencing (NGS) technologies, which yielded false-discovery rates of 9-89% and recall of 10-70%, depending on the SV type and size [8]. On average, the genomic variation between two human genomes is around 0.1%, but this difference increases to 1.5% when SVs are included [8].
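For readers less familiar with the benchmarking metrics quoted above, the following is a minimal Python sketch of how false-discovery rate and recall are conventionally computed from counts of true positives, false positives and false negatives. The function names and the counts are hypothetical and purely illustrative; they are not taken from the thesis.

```python
def false_discovery_rate(tp: int, fp: int) -> float:
    """FDR = FP / (TP + FP): fraction of called variants that are not real."""
    called = tp + fp
    if called == 0:
        raise ValueError("no variants were called")
    return fp / called


def recall(tp: int, fn: int) -> float:
    """Recall = TP / (TP + FN): fraction of real variants that were called."""
    truth = tp + fn
    if truth == 0:
        raise ValueError("truth set is empty")
    return tp / truth


# Hypothetical benchmark of one SV caller against a truth set (made-up counts).
tp, fp, fn = 700, 300, 300
fdr = false_discovery_rate(tp, fp)
rec = recall(tp, fn)
print(f"FDR = {fdr:.0%}, recall = {rec:.0%}, precision = {1 - fdr:.0%}")
```

Precision is simply one minus the false-discovery rate, so a caller with an FDR of 89% reports mostly false calls.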
SVs also affect 3-10 times more nucleotides than SNVs [9] (4M SNVs per genome [10]), showing their potential effect on human phenotypes. For this reason, including a complete catalogue of SVs in reference panels will increase the power of GWAS and provide opportunities to find new disease-associated variants. To overcome these limitations, in this thesis we have generated the first genome-wide Iberian haplotype reference panel, mainly focused on Structural Variants, using 785 samples whole-genome sequenced (WGS) at high coverage (30X) from the GCAT-Genomics for life project. We designed a complete strategy that combines extensive benchmarking of multiple variant-calling programs, specific Logistic Regression Models (LRM) for SV types, and phasing strategies, to produce a high-quality and comprehensive genetic variability panel. This strategy was benchmarked using different controlled sets of variants, showing high precision and recall values across all variant types and sizes. Applying this strategy to our GCAT whole-genome samples identified 35,431,441 genetic variants, classified as 30,325,064 SNPs, 5,017,199 small indels (<50bp), and 89,178 larger SVs (≥50bp). The latter group was further subclassified into 33,244 deletions, 6,269 duplications, 12,782 insertions, 10,115 inversions, 18,779 transposons and 7,989 translocations, covering all ranges of frequencies and sizes. Besides, 60% of the discovered SVs were not catalogued in any repository, thus extending current knowledge of SVs in humans. Additionally, 52.44% of common and 71.63% of low-frequency SVs were not included in any haplotype reference panel. These new SVs could therefore be used in GWAS, adding further value to the Iberian-GCAT catalogue. The prediction of the functional impact of the SVs suggests that these variants might have a central role in several diseases.
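The variant counts reported above can be cross-checked with simple arithmetic. The short Python sketch below uses only figures stated in the abstract (the total, the SNP count, the SV count and the six SV subtypes); the variable names are illustrative, and the small-indel count is derived as the remainder rather than hard-coded.

```python
# Figures as reported in the abstract.
total_variants = 35_431_441
snps = 30_325_064
structural_variants = 89_178

sv_by_type = {
    "deletions": 33_244,
    "duplications": 6_269,
    "insertions": 12_782,
    "inversions": 10_115,
    "transposons": 18_779,
    "translocations": 7_989,
}

# The six SV subtypes should add up to the reported SV total.
assert sum(sv_by_type.values()) == structural_variants

# Whatever is neither a SNP nor an SV must be a small indel (< 50 bp).
small_indels = total_variants - snps - structural_variants
print(f"small indels: {small_indels:,}")  # small indels: 5,017,199
```

The SV subtypes sum exactly to 89,178, and the remainder implies 5,017,199 small indels.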
Of all SVs included in the Iberian-GCAT catalogue, 46% overlapped genes (both protein-coding and non-protein-coding), highlighting their potential impact on human traits. Besides, 92.7% of protein-coding genes were located outside low-complexity (repeated) genomic regions, so short reads from NGS are expected to capture most of the interpretable SVs in humans [11]. Moreover, 32.93% of SVs affected protein-coding genes predicted to be intolerant to loss of function (pLI), further supporting the potential implication of these variants in complex diseases and therefore enabling a better explanation of missing heritability. Importantly, taking advantage of high coverage (30X), we accurately determined the genotypes of SVs, which allowed them to be phased together with SNVs and indels and increased the SV phasing accuracy, in contrast to 1000G and GoNL. High coverage also allowed the use of Phasing Informative Reads (PIRs), increasing the phasing performance. The overall strategy enables the community to expand and improve the imputation possibilities within GWAS. The Iberian-GCAT haplotype reference panel created in this thesis accurately imputes common SVs, with close to 100% agreement with sequencing results. Although the Iberian-GCAT haplotype reference panel can be used in all populations from different continental groups, imputation performance is highest in European and Latin American populations, owing to their closer ancestries; this is reflected in the number of low-frequency (1% ≤ MAF < 5%) and rare (MAF < 1%) variants imputed at high info scores. These results demonstrate the versatility of our resource, whose performance increases for closer ancestries. In general, we observed that imputation accuracy drops as allele frequency decreases, highlighting the need to include more samples in reference panels in order to efficiently impute low-frequency and rare variants, which are normally expected to have more functional impact on diseases.
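The frequency classes used above can be made concrete with a small Python sketch. The low-frequency (1% ≤ MAF < 5%) and rare (MAF < 1%) thresholds are those given in the abstract; treating MAF ≥ 5% as "common" is an assumption based on common usage, and the function names are hypothetical.

```python
def minor_allele_frequency(alt_allele_freq: float) -> float:
    """Return the frequency of the less common of two alleles."""
    if not 0.0 <= alt_allele_freq <= 1.0:
        raise ValueError("allele frequency must lie in [0, 1]")
    return min(alt_allele_freq, 1.0 - alt_allele_freq)


def frequency_class(maf: float) -> str:
    """Bin a variant by minor allele frequency (MAF)."""
    if not 0.0 <= maf <= 0.5:
        raise ValueError("MAF must lie in [0, 0.5]")
    if maf < 0.01:
        return "rare"            # MAF < 1%, as stated in the abstract
    if maf < 0.05:
        return "low-frequency"   # 1% <= MAF < 5%, as stated in the abstract
    return "common"              # MAF >= 5%: assumed, not stated in the abstract


# Illustrative alternate-allele frequencies (not taken from the thesis).
for freq in (0.004, 0.03, 0.20, 0.97):
    maf = minor_allele_frequency(freq)
    print(f"allele frequency {freq:.3f} -> MAF {maf:.3f} -> {frequency_class(maf)}")
```

An alternate allele at 97% frequency has a MAF of 3%, so it falls into the low-frequency class.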
Finally, we compared the imputation possibilities of the 1000G and GoNL reference panels with those of our Iberian-GCAT reference panel. We observed that the Iberian-GCAT reference panel outperformed 1000G and GoNL in the imputation of high-quality SVs by 2.7- and 1.6-fold, respectively. The overall imputation quality is also higher, showing the value of this new resource for GWAS, as it includes more SVs than previous reference panels. Combining different reference panels will improve the resolution and statistical power of GWAS, thus increasing the chances of finding more risk variants in complex diseases and, ultimately, of translating these insights into precision medicine.
Appears in Collections: Tesis Doctorals - Facultat - Biologia

Files in This Item:
File: JVM_PhD_THESIS.pdf
Size: 7.91 MB
Format: Adobe PDF

This item is licensed under a Creative Commons License.