A genome-wide association study of rheumatoid arthritis without antibodies against citrullinated peptides

Introduction Rheumatoid arthritis (RA) patients can be classified based on presence or absence of anticitrullinated peptide antibodies (ACPA) in their serum. This heterogeneity among patients may reflect important biological differences underlying the disease process. To date, the majority of genetic studies have focused on the ACPA-positive group. Therefore, our goal was to analyse the genetic risk factors that contribute to ACPA-negative RA. Methods We performed a large-scale genome-wide association study (GWAS) in three Caucasian European cohorts comprising 1148 ACPA-negative RA patients and 6008 controls. All patients were screened using the Illumina Human Cyto-12 chip, and controls were genotyped using different genome-wide platforms. Population-independent analyses were carried out by means of logistic regression. Meta-analysis with previously published data was performed as follow-up for selected signals (reaching a total of 1922 ACPA-negative RA patients and 7087 controls). Imputation of classical HLA alleles, amino acid residues and single nucleotide polymorphisms was undertaken. Results The combined analysis of the studied cohorts resulted in identification of a peak of association in the HLA-region and several suggestive non-HLA associations. Meta-analysis with previous reports confirmed the association of the HLA region with this subset and an observed association in the CLYBL locus remained suggestive. The imputation and deep interrogation of the HLA region led to identification of a two amino acid model (HLA-B at position 9 and HLA-DRB1 at position 11) that accounted for the observed genome-wide associations in this region. Conclusions Our study shed light on the influence of the HLA region in ACPA-negative RA and identified a suggestive risk locus for this condition.


INTRODUCTION
Rheumatoid arthritis (RA) is a common autoimmune disease that is associated with a progressive loss of the joints induced by a chronic inflammation of the joint synovium. 1 In this inflammatory environment, different products of cell apoptosis and necrosis accumulate, and citrullinated proteins can be detected. 2 The production of anticitrullinated peptide/protein antibodies (ACPA) is a common but not essential characteristic of RA patients, which is thought to be influenced by the genetic background. Indeed, a strong correlation exists between ACPA and alleles of the HLA-DRB1 gene at chromosome 6 known as the shared epitope (SE). 3 4 Hence, the presence or absence of ACPA divides RA patients into two serological and clinical subgroups, ACPA positive (ACPA+) and ACPA negative (ACPA−) RA.
ACPA are highly specific for RA and appear years before the first clinical manifestation of RA. 5 Moreover, a number of studies have shown that ACPA+ patients suffer more aggressive radiological joint damage. [6][7][8][9] It is worth mentioning that, the most recent RA diagnostic criteria proposed by the American College of Rheumatology (ACR) includes ACPA among the classification factors. 10 Remarkably, different studies suggested the existence of a partial genetic overlap between the ACPA− and the ACPA+ phenotypes, but evidence supporting important genetic differences is increasing. 11 In line with this, a study by Viatte et al 12 reported that the SE, a HLA-DRB1*0401 tag-single nucleotide polymorphism (SNP) and variants located in TNFAIP3, GIN1/C5orf30, STAT4, ANKRD55/ IL6ST, BLK and PTPN22 showed significant associations with ACPA− RA patients. However, several RA susceptibility factors showed no association with the ACPA− subset, and those shared independently of the serotype revealed different strength of association and effect size. Two high-throughput genotyping efforts, a genome-wide association study (GWAS) and a SNP-dedicated Immunochip-based dense mapping, including ACPA− RA patients have been recently carried out. The GWAS focused on the genetic comparison of both serological RA subgroups. 13 Although no genome-wide level association with the ACPA− subset was described, significant differences between ACPA+ and ACPA− patients in the HLA region were apparent 13 The Immunochip approach revealed genome-wide level associations in the HLA region and the ANKRD55 locus with ACPA− RA, and supported differences with the ACPA+ subgroup in several loci. 14 Additionally, a few loci have been associated with ACPA− RA but not with ACPA+ RA, such as CLEC16A and IRF5 that have been found following a candidate gene strategy. 15 16 In spite of the increasing interest in the identification of genetic risk factors associated with ACPA− RA, the underlying genetic background in this subset of RA patients remains widely unknown. 17 The aim of this study was to identify novel susceptibility loci implicated in ACPA− RA susceptibility through a hypothesis-free GWA strategy. Therefore, we carried out a genome-wide combined analysis of three independent Caucasian European cohorts, and a follow-up phase using a previously studied GWAS cohort, comprising a total of 1922 ACPA− RA patients and 7087 unaffected controls.

PATIENTS AND METHODS Populations
All RA patients participating in this study met the 1987 ACR criteria for the classification of RA and were classified as ACPA− using standard methods as described elsewhere. 13 18 Control populations comprised unaffected unrelated individuals from the same geographical and ethnical origin as cases. All individuals in this study were of European ancestry (self-reported and/or principal component analysis (PCA) derived, figure 1). This study was approved by the local ethics committees of the participating hospitals and all participating individuals gave written informed consent. A description of the analysed populations is provided in online supplementary note 1.
The power to detect an association with an OR of 1.40 considering 1922 cases and 7087 controls, at the p value <5×10 −8 level, under an additive model and using a minor allele frequency (MAF) of 0.20 was of 96% (additional power calculations are provided in online supplementary table S1). For power estimation we used Power Calculator for Genome Wide Studies. 19

Genotyping and imputation
The genotyping platforms for The Netherlands I, The Netherlands II and Spain cohorts are described in table 1. Quality control (QC) and imputation was conducted separately for each cohort and each chip type, using a common approach for all datasets (details in online supplementary note 2) using PLINK. 20 The Swedish cohort genotyping and QC were performed as described in Padyukov et al. 13 After QC, a total of 452 367 genotyped or imputed SNPs were shared between The Netherlands I, The Netherlands II and the Spanish case-control series; and 363 330 genotyped or imputed SNPs were shared between the previously mentioned cohorts and the Swedish population in Padyukov et al. 13 All SNP genotypes in the HLA region (chromosome 6 between the positions 29 000 000 and 34 000 000 reaching a total of 3882 SNPs) underwent a specific HLA imputation process as described in previous reports. [21][22][23] This novel approach resulted in the imputation of classical HLA genotypes (HLA-A, HLA-B, HLA-C, HLA-DPA1, HLA-DPB1, HLA-DQA1, HLA-DQB1 and HLA-DRB1), their corresponding amino acid sequences and SNPs (see online supplementary note 3 and table S2). 21 23

STATISTICAL METHODS
The association of the imputed set of SNPs with ACPA− RA was tested by logistic regression, and this approach was followed  figure S1). We carried out individual population PCA using EIGENSTRAT software in order to detect population substructure. 24 25 Then, the association analyses for The Netherlands I, The Netherlands II and Spain cohorts were adjusted for population substructure by including the first 10 PCs of each population as covariates in the logistic regression (λ The Netherlands I =1.07; λ The Netherlands II =1.02; λ Spain =1.07; see online supplementary figure S1).
A combined analysis using the inverse variance method under a fixed effects model was performed on the basis of the PCA-adjusted association results of The Netherlands I, The Netherlands II and Spain cohorts using PLINK. The inflation value for this analysis was λ=1.02. Heterogeneity across the datasets was evaluated using Cochran's Q test, and those loci showing a high heterogeneity (Q<0.05) were not considered for the validation step.
Non-HLA SNPs and showing a significant p value at a tier 2 association level ( p<5×10 −5 ) in the combined analysis of The Netherlands I, The Netherlands II and Spain cohorts were selected for a validation step (table 2).
In the validation phase, a combined analysis of the previously selected SNPs using the inverse variance method under a fixed effects model was performed using the association results of the three analysed populations and the ACPA− RA patients and controls of the Swedish cohort in Padyukov et al. 13 A p value <5×10 −6 was established as arbitrary threshold to consider a SNP as a suggestive association in the meta-analysis. Additionally, loci showing a high heterogeneity (Q<0.05) were discarded.
For the analysis of the imputed data in the HLA region, we performed the association analyses by means of unconditional and conditional logistic regression analysis to account for dependency between the observed signals (details in online supplementary note 3). Finally, we searched recursively for models which better explained all association present in the HLA region (see online supplementary note 3).
Regional plots were generated using LocusZoom, and the remaining figures were generated using R V.2.15.1. 26 27

RESULTS
Independent and combined analyses of three European ACPA− populations Independent analysis of The Netherlands I, The Netherlands II and Spain cohorts showed no associations at genome-wide significance level ( p<5×10 −8 ) (see online supplementary figures S2-S4). However, the combined analysis of the three cohorts identified a SNP in the CLYBL locus and two in SMIM21 showing significant risk associations at this level (figure 2).
Moreover, 47 variants in 34 non-HLA loci showed a suggestive p value <5×10 −5 and no heterogeneity, and were selected for follow-up (table 2 and figure 3). Additionally, the combined analysis of the three cohorts resulted in a peak of association in the HLA region in chromosome 6.

HLA region deep interrogation
We further explored the association in the HLA region applying a new imputation method, which inferred the classical HLA alleles, polymorphic amino acid positions and SNPs in the studied populations. 21 22 After imputation, the most significant association corresponded with amino acid position 67 of the HLA-DRβ1 molecule ( p = 4.13×10 −9 , see online supplementary table S3). Three amino acids in this HLA-DRβ1 position were observed in our population (leucine, Leu; isoleucine, Ile; phenylalanine, Phe), but only the Leu67 variant showed genome-wide association ( p=9.41×10 −10 . Furthermore, three additional amino acid residues showed a genome-wide level significant association in our study: two located in the HLA-DRB1 gene (the presence of threonine in the 181 amino acid position, Thr181, p value = 2.74×10 −8 , and the combination of serine, Ser, valine, Val and leucine, Leu, in the position 11, SerValLeu11, p value = 4.27×10 −8 ) and one in the HLA-B locus (the presence of aspartic acid in the ninth position, Asp9, p value = 9.27×10 −9 ). Moreover, these residues corresponded with the most associated amino acid positions in the omnibus test (see online supplementary note 3 and table S3). Then, we hypothesised that the association in the HLA region might be explained by polymorphic amino acid residues in the HLA molecules as observed in previous reports. 14 21 22 Therefore, we performed step-wise conditional regression analysis including each of the four amino acid positions that had an amino acid residue reaching genome-wide level association. These analyses identified two independent signals among the selected amino acid positions: HLA-B at position 9 and HLA-DRB1 at position 11(see online supplementary table S4). After controlling for the two previously mentioned associations, no signal showing a p<5×10 −5 remained in the HLA region ( figure 4).
Additionally, we observed that the HLA-B*0801 allele was the only classical HLA allele associated at genome-wide level in our study. However, this allele was indistinguishable from the HLA-B Asp9 variant using conditional logistic regression, due to their high linkage disequilibrium (r 2 =0.996). Regarding the HLA-DRB1 independent position, several classical HLA alleles share the genome-wide associated amino acid residues (see online supplementary note 5). Additionally, we confirmed that the hypothesis-free model was equivalent to the amino acid model, and addition of each amino acid variant to our model outperformed the model without it and achieved a better

goodness-of-fit (table 3 and see online supplementary note 3).
We also confirmed that the model including the two independent amino acid variants was the most parsimonious explanation for our data, and the addition of the remaining amino acids did not lead to a better goodness-of-fit (table 3).

Follow-up meta-analysis
After imputation, 363 330variants were shared among all three previously described European populations and the Swedish cohort described in Padyukov et al. 13 The final set of 452 367 SNPs analysed in our cohorts tagged (with r 2 >0.8) a 52.37% of the SNPs with MAF>0.05 included in the HapMap phase 3 Caucasian of EUropean Ancestry (CEU) population and 51.63% of the SNPs in the CEU/Tuscans in Italy (TSI) populations (following the same parameters). As it can be observed in figure 5 and table 4, the variant in CLYB that previously showed genomewide level association (rs9557321) did not reach the genomewide significance threshold after meta-analysis but maintained a suggestive association ( p value=5.82×10 −8 OR=1.73). In the case of the genome-wide level associations in the SMIM21 locus (rs1943199, rs11663465), significant heterogeneity was found in the meta-analyses (Q<0.05) and the SNPs were discarded from analysis. Additionally, 30 out of 34 loci selected for the follow-up phase were shared between the genotyped and the Swedish populations, and two of them, CLYBL and SMIM21, were still significant considering a p value <5×10 −6 in the meta-analysis (figure 3, regional plots are given in figure 2). The variants showing tier 2 associations were rs518167 (located in the 2nd intron of the GRM5 gene) and rs3790022 (located in the RNASEH2B-FAM124A intergenic region, ie, GUCY1B2 pseudogene) (table 4  and see online supplementary table S5).
Regarding the HLA region, no classical HLA allele information was available for the Swedish cohort, and unfortunately, our two-variant models could not be tested in this cohort. Nevertheless, the genome-wide associated variant in the HLA class I region, rs2596565, was included in the combined analysis, showed nominal association in the Swedish cohort (p=2.12×10 −2 OR=1.24) and was the only SNP showing genome-wide significance in the HLA region ( p=9.26×10 −9 OR=1.4) (figure 6).  (see online supplementary  table S6). In the case of the HLA-DRB1* 04 : 01 tag-SNP, a trend association was observed, but the meta-analysis showed significant heterogeneity. Thus, we applied a random effects model and the initial association was lost (see online supplementary table S6).
The previously reported variant of CLEC16A (rs6498169) was filtered out from the current analyses during QC, but the most strongly associated variant located in the gene showed a nominal association in the combined analysis (rs17803698 p=2.12×10 −2 OR=1.14, r 2 =0.12). 15 In the case of TNPO3-IRF5 region, we selected variants from Viatte et al 12 and Sigurdsson et al 16 that were included in the combined analysis. The rs12531711 showed a remarkably strong association with ACPA− RA in the combined analysis ( p=4.35×10 −5 OR=1.30).

DISCUSSION
This report comprises the largest ACPA− RA cohort genotyped and analysed with a genome-wide platform to date (1922 ACPA − RA patients and 7087 non-affected controls). Our study identified a suggestive new risk factor for this condition, (CLYBL), confirmed the association of the HLA region with this subset and proposed a two-variant model including the HLA-DRB1 and HLA-B loci to explain the observed HLA association peak.
The first GWAS that focused on the ACPA+ versus ACPA− genetic differences did not show evidence for an association of ACPA− RA with the HLA region in a cohort of Swedish RA patients. 13 Nevertheless, recent reports have suggested a previously unidentified role for the HLA region. 12 14 Taking into account that the Immunochip platform includes a dense mapping of the HLA region, and that the cohorts analysed in the present study are partially overlapping with the populations in Eyre et al, we consider that the resemblance of the results in genome-wide and Immunochip platforms confirm previous findings, and indicate that the coverage of the HLA region in our study was appropriate. 14 Moreover, in Eyre et al, the five-amino acid model which was described by Raychaudhuri et al 21 and accounted for the observed association in the HLA region was confirmed. 14 Nevertheless, this novel HLA allele and amino acid residue imputation approach was not applied separately in the ACPA− group. In our ACPA− study, we used this approach to detect two-amino acid position (HLA-B at position 9 and HLA-DRB1 at position 11), as the most parsimonious model explaining the observed association in the HLA region in our data. The two of the identified amino acid position, HLA-DRB1 11th amino acid and HLA-B 9th amino acid, were located in the peptide binding groove of their corresponding HLA molecules. 21 Moreover, both amino acid positions were shared between our model and the ACPA+ RA model proposed by Raychaudhuri et al 21 and confirmed by Eyre et al 14 However, no significant association was found for the SE variants or other ACPA+-related HLA-DRB1 amino acid positions (HLA-.DRB1 13, 71 and 74 positions), which can be due to real association divergences in these subgroups. Moreover, given the observed association of HLA-B, it can be hypothesised that HLA class I has a more relevant role in ACPA− RA than in ACPA+ RA, while the HLA class II association seems more predominant ACPA+ RA. However, the partial overlap and the differences between the proposed HLA models for both RA serological subgroups would require further studies to clarify if the HLA-DRB1 and HLA-B signals belong to both RA subgroups or are restricted to ACPA+ or ACPA−, respectively. Additionally, individuals showing borderline ACPA titres and seroconversion from negative to positive ACPA is a rare event (only 2%), but it may act as a confounding factor. 28 In relation to the novel candidate association with a SNP in the CLYBL gene, although the signal is slightly below the genome-wide significance level, our findings might support future studies to clarify the veracity of this association. CLYBL encodes a citrate lyase subunit β-like protein which has citrate ( pro-3S)-lyase and ion-binding activities that are transported to the mitochondria. Interestingly, a SNP in the CLYBL locus has been recently associated with low serum levels of vitamin B12 (also known as cyanocobalamin). 29 30 Although the relation between CLYBL and the vitamin B12 serum levels remains  Table 3 Comparison of goodness-of-fit of the different amino acid position models for the HLA region The goodness-of-fit for each model was compared with the previous one. Best fitting model in bold. χ 2 Dif, improvement in the goodness of fit (by calculating the deviance, defined as −2 × the log likelihood, and following a χ 2 distribution of the model compared with the preceding one; p value, p value for the model comparison. *Comparison between the hypothesis free model and the amino acid model. unclear, a link between vitamin B12 and RA has been described. In a report by Regal et al, 24% of RA patients had low vitamin B12 serum levels and 26% of them had true vitamin B12 deficiency. 31 Besides, a previous study described that RA patients with vitamin B12 deficiency, folate deficiency, vitamin B6 deficiency and impaired renal function showed associated hyperhomocysteinemia that may have a role in promoting high cardiovascular morbidity in patients with RA. 32 Moreover, significant associations with the familial risk of RA in offspring according to parental proband were reported for pernicious anaemia, which is usually the result of intrinsic factor insufficient secretion and consequent vitamin B12 deficiency. 33 Additionally, the design of our study led to some limitations mainly due to the variety of genotyping platforms used.  Therefore, some associations might have been overlooked due to a limited coverage in the region despite the imputation step (ie, the ANKRD55/IL6ST locus). In spite of the large size of the ACPA− patient cohort, our study might have been underpowered to detect modest associations. Additionally, our conclusions may be applicable only in populations of European ancestry. The present report analysed the genetic component of a large cohort of ACPA− RA patients compared to non-affected controls following a genome-wide strategy, replicated previous findings in different non-HLA loci, such as TNFAIP3, GIN1/ C5orf30, PTPN22, CLEC16A and TNPO3/IRF5, and revealed a novel suggestive susceptibility gene, CLYBL. Moreover, our study provided a deep insight into the influence of the HLA region in ACPA− RA and identified a two-amino acid residue model explaining this association. The present study together with previous evidence supported the existence of an