Genetic associations in community context: a mixed model approach identifies a functional variant in the RBP4 gene associated with HDL-C dyslipidemia

Background The objective of this study was to examine individual and community factors that influence high-density lipoprotein cholesterol (HDL-C) dyslipidemia in Newfoundland and Labrador (NL), a genetically isolated population in Canada with a high prevalence of HDL-C dyslipidemia. Methods First, a group of single nucleotide polymorphisms from 10 metabolic trait candidate genes was tested using a multivariate logistic regression model. The significant SNPs were entered into the second phase, where a mixed logistic model incorporated the community disease risk factors together with the individual factors as the fixed part of the model and the geographic region as a random effect. Results Analysis of 1489 subjects (26.9% HDL-C dyslipidemia) identified rs3758539, a non-coding variant in the 5’UTR of RBP4, to be associated with HDL-C dyslipidemia (odds ratio = 1.45, 95% confidence interval = 1.08–1.97, p = 0.01). The association remained significant, and the effect size did not change after the incorporation of individual and community risk factors from 17 geographic regions (odds ratio: 1.41, 95% confidence interval = 1.03–1.93, p = 0.03) in NL. Besides this variant, sex, BMI, and smoking also showed significant associations with HDL-C dyslipidemia, whereas no role was identified for the community factors. Conclusions This study demonstrates the use of community-level data in a genetic association testing. It reports a functional variant in the promoter of RBP4, a gene directly involved in lipoprotein metabolism, to be associated with HDL-C dyslipidemia. These findings indicate that individual factors are the main reason for a higher prevalence of HDL-C dyslipidemia in the NL population.


Background
Low serum levels of High-Density Lipoprotein Cholesterol (HDL-C) is the most common lipid abnormality found among patients with coronary artery disease [1,2]. HDL-C dyslipidemia is a complex trait resulting from the cumulative and interactive effects of numerous genomic variants and environmental factors. It has a strong genetic component, such that through twin studies its heritability has been estimated to be around 40-60% [3]. Our previous study has shown that the prevalence of dyslipidemia of HDL-C is 11% higher in the Newfoundland and Labrador (NL) province of Canada compared to the rest of the country (38% vs. 27%) [4,5]. It is not clear what causes an increased rate of HDL-C dyslipidemia in this population; however, the unique population history of NL can lead to clues as to what factors might be involved.
The majority of today's NL population descend from 20,000 migrants who arrived in NL from England and Ireland in the mid-1700s [6]. This founding population experienced a low rate of in-migration over centuries, which resulted in NL to be one of the few remaining isolated founder Caucasian populations worldwide. As a result, the NL population has a unique cultural and genetic background, both of which can contribute to a higher rate of dyslipidemia. This makes NL a unique population for studying both the genetic and environmental factors that contribute to a human complex trait, e.g., dyslipidemia. In population genetics terms, NL has been shown to encompass a reduced rate of heterozygosity, increased runs of homozygosity, and extended genomic linkage disequilibrium [6,7]. This founder population, however, does not have an entirely homogenous genetic structure, and according to the religion (i.e., Catholic vs. Protestant) and the ethnicity of the founding groups (i.e., British vs. Irish) which tended to reside and remain in different geographical areas, the current NL population can be grouped into multiple clusters [7]. This geographic variability in the NL's population structure enables studying, simultaneously, both individual-and community-based factors that contribute to a complex human trait.
Traditional genetic studies of complex traits use epidemiological approaches to search for the association of genetic variants with the trait of interest. Such studies are performed on an individual level basis, and often the environment in which the individual resides is not considered [8,9]. Mounting evidence suggests that an individual's health is directly influenced by the community health indices of the society they live in [10,11], including the community's disease status, availability of health care and treatment, and the level of exposure to hazards. Therefore, a study that takes into account both individual-and geographical/community-level data may add more to the mechanism of involvement of a candidate genetic variant or other risk factors in a complex trait, e.g., HDL-C dyslipidemia.
Mixed models have been introduced as an advanced analytical tool to model the population stratification and unequal relatedness among individuals in genetic association studies [12]. The present study utilized a mixed model design to examine the individual (genetic and non-genetic) and community factors that contribute to a higher rate of HDL-C dyslipidemia in the NL population.

Source of data
We conducted a secondary analysis of individual and community level data from three existing databases. The study only included individuals over 20 years of age living in NL at the time of blood test, excluding pregnant women. The first dataset was only used to obtain individual-level data, whereas the other two datasets provided the community-level data. These data sources include: 1) COmplex DIseases in the Newfoundland population, environment, and Genetics (CODING) study: this ongoing study explores genetic, endocrine and nutritional links to metabolic traits in NL [13]. At the time of the study, over 3200 third generation Newfoundlanders participated in the program. Recruitment criteria were as follows: (i) at least third generation Newfoundlander; (ii) healthy, without any serious metabolic, cardiovascular or endocrine disease; and (iii) not pregnant at the time of participation. Extensive measurements including a complete lipid profile, history of diabetes, smoking, blood pressure and body mass index (BMI) were completed for these participants. All participants provided blood samples following a 12-h fasting period. Serum was isolated for measuring the concentrations of lipoproteins using Synchron reagents, which was performed on an Lx20 analyzer (Beckman Coulter, Fullerton, CA). Additionally, genotyping of 40 single nucleotide polymorphisms (SNPs) from 10 genes associated with metabolic traits was performed for a subset of the cohort. Tagging SNP selection was done according to the linkage disequilibrium (LD) block patterns of every gene to maximize the gene coverage using a pairwise R-squared approach (SNPbrowser Version 3.5; Applied Biosystems; based on HapMap data for CEU population). Genomic DNA was isolated from 5 ml of whole blood using the Wizard Genomic DNA Purification kit (Promega, Madison, WI). Genotyping was performed using Taqman validated or functionally tested SNP Genotyping Assays (Applied Biosystems, Foster City, CA) on an ABI Prism 7000 Sequence Detection System (Applied Biosystems). All experiments were conducted according to the manufacturers' protocols.
To assess the reproducibility of genotyping, 5% of the samples were randomly selected and re-genotyped, all of which were confirmed to match their original called genotypes. Only subjects for whom complete genotyping was performed were included in the analysis.

2) Newfoundland and Labrador Laboratory
Information System (LIS) Data: for each laboratory service provided in NL, the patient's identification, date of service, and laboratory test results are entered into this database. Therefore, this dataset represents all of the HDL-C measures that were reported by the clinical laboratories across the province. From this dataset, we obtained all of the HDL-C test results performed in NL during January 2011 and December 2011, along with the sex of the patients. The year 2011 was chosen because it encompassed the most comprehensive data across all of the three databases in this study. It was also the year in which the majority of the CODING study participants were recruited.

3) Digital Epidemiology Chronic Disease Tool
(DEPICT) Data: DEPICT is a spatiotemporal information system in NL which collects disease and demographic information on adults with chronic diseases from provincial medicoadministrative data. For this study, we obtained population age and prevalence of diabetes and hypertension in the year 2011.
HDL-C dyslipidemia in both the CODING and LIS datasets was defined as HDL-C levels ≤1.0 mmol/dl for males and ≤ 1.3 mmol/dl for females according to the Canadian guidelines for the diagnosis and treatment of dyslipidemia [14].

Determination of geographic/community level
For community-level analysis, the collected data was aggregated by a geographic stratifier. The level of geography to be used for this purpose was determined by examining the ratio of individuals included in the above datasets in each existing geographic unit in NL to the total population count living in that region. Canadian postal codes are composed of six alphanumeric characters, the first three of which are called the forward sortation area (FSA). Using sensitivity analysis we found that FSA has the best distribution of the records across the three data sources, providing the most optimal representation of NL's population distributions. Therefore, the variables from the community-level data (DEPICT and LIS) were aggregated by FSA.

Statistical analysis
Individuals with HDL-C dyslipidemia were compared with those with normal levels of HDL-C. The analysis was performed in two steps: 1) The first step was conducted on the CODING study data only at the individual level. All of the SNPs were tested for the Hardy-Weinberg Equilibrium among the control group using a Pearson chisquared test, and those with a P-value< 0.01 were determined to be deviating from the equilibrium and were excluded. With the assumption of an additive genetic effect, a multivariable logistics regression model was fit between HDL-C dyslipidemia and every SNP. The genotypes were coded as 0, 1 and 2, representing individuals with zero, one, and two copies of the risk allele for every SNP to allow for testing the association between the SNPs and HDL-C dyslipidemia. Individual factors related to dyslipidemia including sex, age, smoking, BMI, hypertension, and diabetes were treated as covariables in the model. Using the median values as cut-offs, these continuous variables were transformed to binary categorical variables to improve the model's goodness of fit. Male was considered as the base for the sex variable.
2) The SNPs that resulted in a p-value of less than 0.05 were used in the second step of the analysis. In this step, a multilevel logistic regression modeling was performed. The individual-level variables from the CODING study along with the community-based factors from the LIS and DEPICT were fit as the fixed effect, while the community classification code (FSA) together with the same community factors from LIS and DEPICT were incorporated as the random effects. The community-based data from these two datasets were obtained by aggregating the variables on the FSA level. These include mean population age and sex compositions, as well as the rate of HDL-C dyslipidemia, hypertension, and diabetes in every FSA unit. The individual level HDL-C dyslipidemia from CODING was treated as the binary outcome. Residual plots of the fitted models were inspected to ensure no deviation from normality or homoscedasticity was present. The analysis was performed using the lme4 package in R 3.3.2 [15].

Data descriptions
The CODING dataset was composed of 962 males and 2358 females aged 42.6 ± 13.2 with a mean BMI of 26.7 ± 5.2. Among this population, 4.0% were diabetic, 32.6% had hypertension, and 15.3% were smokers. The prevalence of HDL-C dyslipidemia in this cohort was 26.9%. The SNP genotyping was performed for 1489 subjects, who were included in the analysis. Characteristics of the healthy and dyslipidemic groups are presented in Table 1.
The risk allele, its frequencies among those with and without HDL-C dyslipidemia and the overlapping genes are shown in Table 2 for every SNP. The LIS dataset contained 121,090 records. The mean age was 56.0 ± 14.1 years, and 23.1% were classified as having HDL-C dyslipidemia. DEPICT data was composed of 226,530 electronic records (52.0% females). Table 3 shows this information for every FSA region.

Identification of the genetic variants associated with HDL-C dyslipidemia
This phase was performed on individual-level data from the CODING cohort. All of the available SNPs (n = 40) were queried using the SNP Annotation and Proxy search tool (SNAP Version 2.2; BROAD Institute, Cambridge, Massachusetts, US) for pair-wise LD measurements, eight of which were found to be in high LD (r 2 > 0.8) with the others. Following the removal of these   Table 2.

Identification of individual and community factors influencing HDL-C dyslipidemia
The second phase of the study linked the LIS and DE-PICT datasets to the CODING data using the geographic identifier (FSA). The postal code information was available for about 80-85% of the subjects in the CODING dataset. NL is composed of 35 FSA regions (Fig. 1). The total number of FSA regions with at least five records in the CODING data for every variable was 17. Table 3 shows the level of variation in the community-level data as well as the risk allele frequency of rs3758539 across these 17 FSA regions. These results indicate a degree of variability for the calculated indices across these FSAs. The inclusion of community-level factors was performed in two steps. First, only the geographic identifier was included as a random effect into the original model fitted earlier, which did not significantly change the parameters of the observed associations (Table 4), indicating that the variability between different communities for these factors is minimal. Next, the community-level factors were added as both the fixed effect and the random effect to the model. It was observed that the effect of these community factors on the individual factor associations is also negligible ( Table 4). None of the community-level factors showed any association with HDL-C dyslipidemia or affected its association with individual-level factors. The analysis was also repeated after removal of the community-level factors from the random effect, which produced a similar result (data not shown). This final model indicated that only individual factors of sex (Odds Ratio: 2.10; 95% confidence interval: 1.36-3.23), BMI (Odds Ratio: 1.12; 95% confidence interval: 1.08-1.15), and smoking (Odds Ratio: 1.69; 95% confidence interval: 1.12-2.55) besides rs3758539 influence HDL-C dyslipidemia.

Discussion
This study attempted to use a multilevel modeling approach to identify individual and community level factors that influence HDL-C dyslipidemia in NL, a genetically isolated population. Mainstream studies of individual, genetic or community factors in complex traits do not combine these lines of investigations despite their interactive role in the development of complex conditions. This is particularly important in genetic association testing where the majority of the reported genetic associations are known to be population specific, and replication rates are as low as 5% [16,17], despite controlling for population structures. One reason for this, which has not been investigated enough, might be the variations in the community disease risks across different populations and/or subpopulations. We used a mixed model approach to model the community factors into the association between genetic variants and HDL-C dyslipidemia. Mixed models in genetic association studies have the advantage of preventing false-positive findings due to population or family-relatedness structures and increasing the power by applying a correction to them [18]. As such, this approach would have increased the power in our study by correcting for sub-population variations in the community risk factors of HDL-C dyslipidemia. Our study, however, did not find that community factors can significantly influence the association between a SNP and HDL-C dyslipidemia. Yet, this should not lead to the conclusion that community-level factors do not alter the associations of genetic or non-genetic factors in complex conditions, since we did not test the multitude of other genomic loci associated with dyslipidemia or did not extend the study to other complex traits.
Our study found that the risk allele (C) of SNP rs3758539 (C > T), adjusted for other individual risk factors, is associated with~1.40 increased odds of having HDL-C dyslipidemia, and this association remains significant when other individual-or community-based factors are taken into account. This SNP has a potential to be a causal variant in HDL-C dyslipidemia. rs3758539 is located in the 5'UTR of the Retinol Binding Protein 4 (RBP4) gene. This region is an active promoter and harbors an enhancer element. A query in HaploReg v4.1 database (BROAD Institute, Cambridge, Massachusetts, US) reveals that it encompasses regulatory chromatin features and is in high LD (r 2 > 0.8) with multiple other functional SNPs. In vitro studies show that alternative alleles of this variant induce a differential transcriptional activity and variable binding affinities to the transcription factor hepatocyte nuclear factor 1 alpha (HNF1α) [19]. The minor allele (T) is specifically shown to be associated with a higher expression of RBP4 in adipocytes [20]. Multiple sequence alignments of human, mouse, rat, and cattle RBP4 promoter also indicates that the SNP is within a conserved element harboring seven transcription factor binding motifs [20]. Human genetic studies suggest that rs3758539 is associated with an increased risk of type 2 diabetes [19] and obesity [21]. Another SNP (rs12265684), which is in a tight LD with rs3758539, is associated with serum levels of triglycerides, and insulin resistance [22]. The present study will be the first report to directly associate this variant with HDL-C dyslipidemia. A previous study comparing 304 lean and 283 obese Caucasian children found a slight increase in the risk allele of this SNP in the obese group and those having more undesirable cardiometabolic parameters including dyslipidemia, being consistent with the findings of our study; however, their analysis failed to produce a significant p-value (p = 0.3) likely due to insufficient sample size [23]. Another study in a Chinese cohort did not identify any association between HDL-C levels and rs3758539 [24]. The same observation has been found in a second Chinese cohort [25]. These studies indicate that the observed association is most likely specific to the Caucasian population, yet not all ethnic groups have been explored. A previous study in NL population by Shea et al. [26] did not find an association between this variant and HDL-C serum levels. Instead, they found two other RBP4 variants (rs10882280 and rs11187545) associated with the serum levels of HDL-C. These two SNPs are not in LD with our SNP (r 2 = 0.01), but have a strong LD with each other (r 2 = 1.0). We tested one of them in our database and did not observe an association with HDL-C dyslipidemia ( Table 2). The putative reason for the discrepancy between these two studies might be related with designing of the outcome variables in the two studies (HDL-C serum levels vs. HDL-C dyslipidemia). HDL-C dyslipidemia is defined using different serum HDL-C cut-offs for males and females, and this gives a differential weight to males and females in statistical modeling, which cannot be resolved simply by adjusting for the sex co-variable. Regardless of which approach is the most appropriate, both studies identify RBP4 as a potential candidate for HDL-C dyslipidemia.
The RBP4 gene encodes a binding protein to retinol (vitamin-A alcohol) which is involved in preventing vitamin-A loss during glomerular filtration and delivering it from the liver stores to the peripheral tissues. It is assumed that RBP4 regulates lipid homeostasis by the activation of nuclear receptors through retinol metabolism [27]. This regulation has been under such scrutiny that recent studies have attempted to use RBP4 as an early marker of metabolic abnormalities including dyslipidemia [28,29]. Serum levels of RBP4 have also been reported to be positively correlated with HDL-C serum levels [30]. This is consistent with our finding that the risk allele (C) of rs3758539, being associated with a lower expression of RBP4 [20], confers a risk towards having HDL-C dyslipidemia. However, several studies have reported negative correlations between the serum levels of HDL-C and RBP4 [31,32], and thus further research is warranted to assign a causal relationship between rs3758539 and HDL-C dyslipidemia.
Our data also identified female sex (odds ratio: 2.25) as the strongest associated factor with HDL-C dyslipidemia followed by smoking (odds ratio: 1.71) and BMI (odds ratio: 1.12). We also tested age and comorbidities, but no association was observed. Smoking and obesity are amongst the most well-known risk factors for HDL-C dyslipidemia. A higher rate of HDL-C dyslipidemia in females and smokers in this study is also consistent with our previous studies of the primary-care population in NL [33][34][35][36].
Low HDL-C level is a major risk factor for cardiovascular diseases, the third cause of death worldwide. Knowing the genes involved in its pathogenesis is the frontline of further research for designing therapeutic and diagnostic interventions. Our study, for the first time, reports the association of a functional SNP with HDL-C dyslipidemia using a mixed model analysis, which is not frequently conducted in genetic association studies. Our study can be regarded as a proof of principle for incorporation of a multi-level approach in genetic association testings. This study is limited by a few factors. Conducting a community-based study requires a large sample size from every geographic location. This is particularly important in genetic studies of complex traits where the effect sizes are expected to be small. In several of our geographic FSA regions, the sample sizes were not significantly large, and it is possible that this obscured the community effect that we expected to observe. It should also be noted that genetic associations are not valid unless replicated. This should be taken into account particularly in the presence of moderate effect sizes.

Conclusion
Using a multilevel modeling approach, this study identified a genetic variant with potentially functional attributes in the promoter of the RBP4 gene to be associated with HDL-C dyslipidemia. We found that this association is robust and is not altered if the community factors are taken into account. Also, our analyses did not establish any associations between the community level factors and HDL-C dyslipidemia in the NL population. However, we observed a significant role for individual obesity and smoking in suboptimal HDL-C levels. These findings suggest that the individual genetic and lifestyle factors are the main reason for a higher prevalence of HDL-C dyslipidemia in the NL province of Canada.