A genome-wide association study for blood lipid phenotypes in the Framingham Heart Study

Background
Blood lipid levels including low-density lipoprotein cholesterol (LDL-C), high-density lipoprotein cholesterol (HDL-C), and triglycerides (TG) are highly heritable. Genome-wide association is a promising approach to map genetic loci related to these heritable phenotypes.

Methods
In 1087 Framingham Heart Study Offspring cohort participants (mean age 47 years, 52% women), we conducted genome-wide analyses (Affymetrix 100K GeneChip) for fasting blood lipid traits. Total cholesterol, HDL-C, and TG were measured by standard enzymatic methods and LDL-C was calculated using the Friedewald formula. The long-term averages of up to seven measurements of LDL-C, HDL-C, and TG over a ~30 year span were the primary phenotypes. We used generalized estimating equations (GEE), family-based association tests (FBAT) and variance components linkage to investigate the relationships between SNPs (on autosomes, with minor allele frequency ≥10%, genotypic call rate ≥80%, and Hardy-Weinberg equilibrium p ≥ 0.001) and multivariable-adjusted residuals. We pursued a three-stage replication strategy of the GEE association results with 287 SNPs (P < 0.001 in Stage I) tested in Stage II (n ~1450 individuals) and 40 SNPs (P < 0.001 in joint analysis of Stages I and II) tested in Stage III (n ~6650 individuals).

Results
Long-term averages of LDL-C, HDL-C, and TG were highly heritable (h² = 0.66, 0.69, 0.58, respectively; each P < 0.0001). Of 70,987 tests for each of the phenotypes, two SNPs had p < 10⁻⁵ in GEE results for LDL-C, four for HDL-C, and one for TG. For each multivariable-adjusted phenotype, the number of SNPs with association p < 10⁻⁴ ranged from 13 to 18 and with p < 10⁻³, from 94 to 149. Some results confirmed previously reported associations with candidate genes including variation in the lipoprotein lipase gene (LPL) and HDL-C and TG (rs7007797; P = 0.0005 for HDL-C and 0.002 for TG). The full set of GEE, FBAT and linkage results are posted at the database of Genotype and Phenotype (dbGaP). After three stages of replication, there was no convincing statistical evidence for association (i.e., combined P < 10⁻⁵ across all three stages) between any of the tested SNPs and lipid phenotypes.

Conclusion
Using a 100K genome-wide scan, we have generated a set of putative associations for common sequence variants and lipid phenotypes. Validation of selected hypotheses in additional samples did not identify any new loci underlying variability in blood lipids. Lack of replication may be due to inadequate statistical power to detect modest quantitative trait locus effects (i.e., <1% of trait variance explained) or reduced genomic coverage of the 100K array. GWAS in FHS using a denser genome-wide genotyping platform and a better-powered replication strategy may identify novel loci underlying blood lipids.


Introduction
Blood lipid levels are a major contributor to atherosclerotic cardiovascular disease [1]. Current evidence suggests that blood lipids are complex genetic phenotypes, influenced by both environmental and genetic factors. Heritability estimates for blood lipids are high, including ~40-60% for high-density lipoprotein cholesterol (HDL-C), ~40-50% for low-density lipoprotein cholesterol (LDL-C), and ~35-48% for triglycerides (TG) [2]. These estimates indicate that DNA sequence variation plays an important role in explaining inter-individual variation in blood lipid levels. Indeed, sequence variants in individual genes have been consistently related to blood lipid phenotypes, including APOE/PCSK9 with LDL-C [3-5], CETP/LIPC/LPL with HDL-C [6-9], and APOA5/LPL with TG [10,11], among others. However, the extent to which common genetic variants across the genome account for total variation in blood lipid levels is unknown.
Recent advances in genomics enable a genome-wide association study (GWAS), an approach in which a substantial fraction of common genetic variation is tested for a role in determining phenotypic variation [12]. These advances include a map of the correlation structure for approximately 4 million common genetic variants (minor allele frequency >5%) and whole-genome genotyping technologies capable of assaying 100,000-500,000 single nucleotide polymorphisms (SNPs) in an individual [13]. Utilizing a fixed genotyping marker set such as the Affymetrix 100K GeneChip in an association study tests a substantial fraction of the genome in whites, ~30-45% in some estimates [14]. GWAS has been successfully applied to identify novel genetic loci related to several medical phenotypes including age-related macular degeneration [15], inflammatory bowel disease [16], and electrocardiographic QT interval [17]. Identifying novel genetic variants related to blood lipid phenotypes may provide new drug targets to alter blood lipid levels and may aid in the prediction of cardiovascular disease.
We hypothesized that common genetic variants explain a proportion of the inter-individual variability in LDL-C, HDL-C, and TG. Accordingly, we conducted genome-wide linkage and association studies for these three phenotypes in Framingham Heart Study (FHS) participants.

GWAS sample
Of the 1345 FHS participants who are part of the family plate set (see Executive Summary), we focused our analyses on the 1087 participants from the Offspring cohort who had Affymetrix 100K genotypes. Lipid phenotypes were measured at various examinations as described in Table 1. Each study participant provided written informed consent for genetic analyses and the study was approved by Boston University's Institutional Review Board.

Phenotype definition and methods
Blood lipids were measured from fasting venous blood collected at each of seven clinical examination time points extending from 1971 to 2001. Total cholesterol, HDL-C, and TG were measured by standard enzymatic methods. LDL-C was calculated using the Friedewald formula, with a missing value assigned for participants with a measured TG > 400 mg/dL. Clinical covariates utilized in phenotypic regression modeling included age at the time of blood lipid measurement, age², body mass index (weight in kg divided by the height in m²), alcohol intake (drinks per week), current cigarette smoking (yes, no), menopausal status (postmenopausal yes, no), and hormone replacement therapy (yes, no).
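The Friedewald calculation described above can be sketched as follows. This is a minimal illustration, assuming all inputs are in mg/dL; the function name `friedewald_ldl` and the use of `None` for a missing value are hypothetical choices, not part of the study's analysis code.

```python
from typing import Optional


def friedewald_ldl(total_chol: float, hdl: float, tg: float) -> Optional[float]:
    """Estimate LDL-C (mg/dL) as total cholesterol - HDL-C - TG/5.

    Returns None (a missing value) when TG exceeds 400 mg/dL,
    mirroring the exclusion rule stated in the text.
    """
    if tg > 400:
        return None
    return total_chol - hdl - tg / 5.0


if __name__ == "__main__":
    # Hypothetical participant values, for illustration only.
    print(friedewald_ldl(200.0, 50.0, 150.0))
    print(friedewald_ldl(250.0, 40.0, 450.0))
```

The TG/5 term approximates very-low-density lipoprotein cholesterol, which is why the estimate is not used at high TG levels.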
Commonly-used lipid lowering therapies affect total cholesterol and TG. To account for treatment effect, we imputed total cholesterol and TG values for those treated with lipid-lowering therapy. The imputation procedure was modeled after prior work on imputing blood pressure values for those on antihypertensive medication [18]. For each treated individual, a correction factor was added to the observed [treated] lipid value (total cholesterol or TG). This correction factor consisted of the difference between an "expected" residual and the "calculated" residual. The "calculated" residual for each individual was generated in a sex-specific manner after adjustment for age, age², age³, and examination year (by decade). The "expected" residual was generated within each sex and 10-year age group as the average of "calculated" residuals
Lipoprotein subclass profiles were measured by a commercially available proton NMR spectroscopic assay (LipoScience, Raleigh, NC) on plasma samples stored at -70°C as described previously [19]. The particle concentration of the following 9 lipoprotein species were determined: 3 [19], since concentrations of both have very similar relations to lipid levels.

Genotyping methods
All analyses were based on the Affymetrix 100K GeneChip genotyping data generated in Framingham Heart Study participants as described previously [20]. In order to minimize false positive associations due to genotyping artifact, we limited our analyses to SNPs with a genotyping call rate ≥80% and a Hardy-Weinberg Equilibrium P ≥ 0.001. Given lower statistical power to detect associations with rarer SNPs, we limited our results to SNPs with a minor allele frequency ≥10%.
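The three SNP filters above (call rate, Hardy-Weinberg equilibrium, minor allele frequency) can be sketched as below. This is an illustration under stated assumptions: the HWE test shown is a 1-degree-of-freedom Pearson chi-square test, which may differ from the test actually used in the study, and the names `SnpCounts`, `hwe_p_value`, and `passes_qc` are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class SnpCounts:
    n_aa: int       # homozygotes for allele A
    n_ab: int       # heterozygotes
    n_bb: int       # homozygotes for allele B
    n_missing: int  # samples with no genotype call


def hwe_p_value(n_aa: int, n_ab: int, n_bb: int) -> float:
    """Pearson chi-square (1 df) test of Hardy-Weinberg proportions (assumed test)."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        return 1.0
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Upper tail of a chi-square distribution with 1 df.
    return math.erfc(math.sqrt(chi2 / 2.0))


def passes_qc(s: SnpCounts, min_call: float = 0.80,
              min_hwe_p: float = 0.001, min_maf: float = 0.10) -> bool:
    """Apply the call-rate, HWE, and MAF thresholds stated in the text."""
    called = s.n_aa + s.n_ab + s.n_bb
    total = called + s.n_missing
    if called == 0 or total == 0:
        return False
    call_rate = called / total
    freq_a = (2 * s.n_aa + s.n_ab) / (2 * called)
    maf = min(freq_a, 1.0 - freq_a)
    hwe_p = hwe_p_value(s.n_aa, s.n_ab, s.n_bb)
    return call_rate >= min_call and hwe_p >= min_hwe_p and maf >= min_maf
```

A SNP is retained only when all three conditions hold; the default thresholds are the ones given in the text.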

Statistical analysis methods
TG levels were log-transformed to approximate a normal distribution. For each blood lipid phenotype, the long-term average of 4 to 7 serial measurements was used as the primary phenotype. Participants contributing fewer than 4 of 7 measures of a given phenotype were excluded from that analysis. MeanLDL-C, MeanHDL-C, and MeanTG were adjusted for covariates in sex-specific linear regression models. Two sets of phenotypic models were created: Model 1 (age, age²) and Model 2 (age, age², body mass index, alcohol intake, cigarette smoking, menopausal status, and hormone replacement therapy). For quantitative covariates (age, body mass index, and alcohol intake), the mean value across examinations was used as a covariate. For categorical covariates, the proportion of exams scored as 'yes' was used. The residual MeanLDL-C, MeanHDL-C, and MeanlogTG values from Model 1 and Model 2 served as the primary phenotypes.
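The long-term averaging rule above (at least 4 of 7 serial measurements, with TG log-transformed) can be sketched as follows. The function name `long_term_mean` is hypothetical, and the sketch assumes that each TG measurement is log-transformed before averaging; the text does not state whether the transformation was applied before or after taking the mean.

```python
import math
from typing import Optional, Sequence


def long_term_mean(values: Sequence[Optional[float]],
                   min_measures: int = 4,
                   log_transform: bool = False) -> Optional[float]:
    """Average the available serial measurements for one participant.

    Returns None when fewer than `min_measures` values are present,
    which corresponds to excluding the participant from that analysis.
    """
    observed = [v for v in values if v is not None]
    if len(observed) < min_measures:
        return None
    if log_transform:
        # Assumption: natural log applied per measurement, then averaged.
        observed = [math.log(v) for v in observed]
    return sum(observed) / len(observed)


if __name__ == "__main__":
    # Hypothetical HDL-C values across seven exams (None = not measured).
    print(long_term_mean([45.0, 50.0, None, 48.0, 52.0, None, None]))
```

The resulting means would then be regressed on the Model 1 or Model 2 covariates, with the residuals carried forward as phenotypes.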
For genotype-phenotype association analyses, we assumed an additive model of inheritance. We conducted multivariable linear regression using GEE, family-based association testing using FBAT, and linkage using Merlin for computation of IBDs and SOLAR for variance component models as described in the Executive Summary.

Heritability analyses
Heritability estimates for the lipid phenotypes were obtained from extended families with at least two members by variance-components methods using the Sequential Oligogenic Linkage Analysis Routines (SOLAR) package [21]. Using this approach, maximum-likelihood estimation was applied to a mixed-effects model that incorporated fixed covariate effects, additive genetic effects, and residual error. The additive genetic effects and residual errors were assumed to be normally distributed and to be mutually independent. The analyses were performed using residuals from the multivariable models (Model 1 and Model 2) mentioned above. For phenotypes with kurtosis > 1, heritability estimates were computed on ranked normalized deviates.

Replication samples
The second stage consisted of ~1450 biologically unrelated individuals from the FHS unrelated plate set. The third stage consisted of ~1450 participants from GOLDN and ~5200 participants from MDC-CC. GOLDN is a family-based sample recruited from two National Heart, Lung, and Blood Institute's Family Heart Study field centers (Minneapolis, MN and Salt Lake City, UT). The Family Heart Study is a multi-center, population-based cohort designed to study the genetic and environmental determinants of cardiovascular disease.
The MDC study is a community-based prospective epidemiologic cohort of 28,098 persons recruited for a baseline examination between 1991 and 1996. From this cohort, 6103 persons were randomly selected to participate in the MDC-CC, which sought to investigate risk factors for cardiovascular disease. Of the MDC-CC participants, 5466 had DNA and lipid phenotypes available. Individuals on lipid lowering therapy and with outlier values of LDL-C, HDL-C, or TG (top 0.5% of the distribution) were excluded, leaving 5212 individuals available for the SNP-lipid association analyses.

Staged replication strategy
For follow-up into Stage II (the FHS unrelated plate set), we selected all SNPs in the GWAS with an association P < 0.001 for the MeanLDL-C, MeanHDL-C, or MeanTG phenotypes from the minimally-adjusted phenotypic model (Model 1, adjustment for age and age² only). We next conducted a joint analysis of Stage I (GWAS 100K data) and Stage II (FHS unrelated plate set). The joint analysis consisted of a weighted average of the beta estimates and standard errors from Stages I and II and used the inverse of the variance in each stage as weights.
For follow-up into Stage III (GOLDN and MDC-CC), we selected for genotyping all SNPs with a P < 0.001 in the joint analysis of Stages I and II. For genotype-phenotype association analyses in MDC-CC and GOLDN, we assumed an additive model of inheritance. In MDC-CC, we conducted multivariable linear regression analyses to test the null hypothesis that LDL-C, HDL-C, or TG residuals (sex-specific residuals adjusted for age and age²) did not differ by increasing minor allele copy number. In GOLDN, to account for correlated observations due to family relationships, we used linear mixed-effects methods in SOLAR.
To summarize the statistical evidence for association for each SNP across all three stages, we repeated the inverse-variance weighted averaging of beta estimates and standard errors described above, this time across Stages I, II, and III.
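The joint analysis described above can be sketched as a fixed-effect, inverse-variance weighted combination of per-stage estimates. The combined standard error formula and the two-sided normal P value shown here are the standard forms for this kind of analysis and are assumptions about the exact computation used; the function name `inverse_variance_meta` and the example numbers are hypothetical.

```python
import math
from typing import Sequence, Tuple


def inverse_variance_meta(betas: Sequence[float],
                          ses: Sequence[float]) -> Tuple[float, float, float]:
    """Combine per-stage betas using weights w_i = 1 / se_i**2.

    Returns (combined beta, combined standard error, two-sided P value).
    """
    if len(betas) != len(ses) or not betas:
        raise ValueError("betas and ses must be non-empty and the same length")
    weights = [1.0 / se ** 2 for se in ses]
    total_weight = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    se = math.sqrt(1.0 / total_weight)
    z = beta / se
    # Two-sided P value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return beta, se, p_value


if __name__ == "__main__":
    # Hypothetical per-stage estimates for one SNP (Stages I, II, III).
    beta, se, p = inverse_variance_meta([0.20, 0.10, 0.02], [0.06, 0.05, 0.02])
    print(f"beta={beta:.4f} se={se:.4f} p={p:.3g}")
```

Because larger stages have smaller standard errors, they dominate the combined estimate, which is why a strong Stage I signal can be diluted by null results in a much larger Stage III sample.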

Results
Clinical characteristics of the FHS sample of 1345 subjects are presented in the Executive Summary. Table 1 displays the variables that were studied in our analyses of lipid phenotypes. Further information on these phenotypes can be found at http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?id=phs000007. Since Original cohort members were non-fasting at examination, our analyses considered only the 1087 Offspring Study participants with fasting lipid measurements and Affymetrix 100K SNP genotypes. For this paper we focus on longitudinal mean levels of serially measured values (minimum of 4, maximum of 7) of LDL-C, HDL-C, and TG (labeled MeanLDL-C, MeanHDL-C, and MeanTG).
Heritability estimates for long-term average lipid phenotypes (MeanLDL-C, MeanHDL-C, and MeanTG) were greater than those from single time-point measurements (Table 1). For example, the heritabilities of MeanLDL-C, MeanHDL-C, and MeanTG were 0.66, 0.69, and 0.58, respectively, whereas heritabilities for LDL-C, HDL-C, and TG measured at FHS Examination 1 (a single time-point) were 0.59, 0.52, and 0.48, respectively. The highest heritability estimate for any available lipid phenotype was that for lipoprotein (a) at 0.90.
Linkage LOD scores > 2.0 are presented in Table 2c. The best evidence for linkage was a peak LOD score of 3.3 on chromosome 7 for the MeanHDL-C phenotype.
Because the prior probability of any SNP relating to a phenotype is low and given the number of tests, the P value distribution in a GWAS should approach a null distribution. Any strong departure from this expectation might suggest artifacts in genotyping or analysis. For the 70,987 SNPs that passed quality-control filters, the distribution of association P values (generated by the GEE methodology) approached a null distribution but with a slight excess of low P values. For example, for MeanLDL-C, whereas one would expect 1% of SNPs to demonstrate a P < 0.01 by chance, we found that 1.34% of SNPs displayed a P < 0.01. Similar results were seen for MeanHDL-C and MeanTG (data not shown).
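The comparison of expected and observed proportions above translates into SNP counts as sketched below. This is simple arithmetic on the figures quoted in the text; the function name `excess_of_small_p` is hypothetical, and the sketch makes no claim about statistical significance of the excess, since tests at correlated SNPs are not independent.

```python
from typing import Tuple


def excess_of_small_p(n_tests: int, threshold: float,
                      observed_fraction: float) -> Tuple[float, float, float]:
    """Return (expected count, observed count, observed/expected ratio).

    Under a uniform null distribution of P values, a fraction equal to
    `threshold` of all tests is expected to fall below the threshold.
    """
    expected_count = n_tests * threshold
    observed_count = n_tests * observed_fraction
    return expected_count, observed_count, observed_fraction / threshold


if __name__ == "__main__":
    # Figures from the text: 70,987 SNPs, 1.34% with P < 0.01 for MeanLDL-C.
    expected, observed, ratio = excess_of_small_p(70_987, 0.01, 0.0134)
    print(f"expected ~{expected:.0f}, observed ~{observed:.0f}, ratio {ratio:.2f}")
```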
We evaluated the association results for each SNP across a set of four correlated phenotypes: ApoA-I, LDLNMRsm, MeanHDL-C, and MeanTG (Table 3). Several SNPs were associated at P < 0.01 with 3 of the 4 phenotypes.
Among the GEE association results, a SNP (rs7007797) in the lipoprotein lipase gene (LPL) was associated with MeanHDL-C (p = 0.0005) and MeanTG (p = 0.002) (Table 4). This SNP is a perfect proxy (r² = 1) for the previously studied rs328 (also known as S447X) [22]. The minor allele of rs328 has been consistently related to higher HDL-C and lower TG. The direction of effect for SNP rs7007797 in our dataset was consistent with previous observations. Due to a lack of SNPs in the Affymetrix 100K GeneChip correlated with previously reported variants (at an r² > 0.5 threshold) in the APOE, PCSK9, CETP, LIPC, and APOA5 genes, we were unable to confirm these other previously reported associations (Table 4).
Replication is critical to distinguish true positives from false ones in a GWAS. We pursued a three-stage replication strategy with 287 SNPs (P < 0.001 in Stage I) tested in Stage II (n ~1450 individuals) and 40 SNPs (P < 0.001 in joint analysis of Stages I and II) tested in Stage III (n ~6650 individuals). Results are displayed in Table 5. After three stages of replication, there was no convincing statistical evidence for association (i.e., joint analysis of Stages I, II, and III P < 10⁻⁵) between any of the tested SNPs and lipid phenotypes.

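[Table of association results by phenotype; columns: Phenotype, SNP rs ID*, Chr, Physical location (bp), GEE P-value, FBAT P-value, Gene (in or near). First row group: MeanLDL-C. Table body not recovered.]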

Discussion
We examined associations of Affymetrix 100K SNPs and lipid traits in FHS and identified putative associations with lipid phenotypes. We studied the long-term average of up to 7 measurements each of LDL-C, HDL-C, and TG as the primary phenotypes and for one phenotype, the MeanLDL-C, we observed a nominal P that exceeded genome-wide significance [13]. However, validation of selected hypotheses in additional samples did not identify any new loci underlying variability in blood lipids.
GWAS offers the potential to identify novel genetic variants/loci that are associated with blood lipid variation, unlimited by our current knowledge of lipoprotein biology. However, a central limitation of GWAS is that the true signals are mixed amidst a large number of false positive results. Validation in additional samples is required to distinguish the true positives from the false ones.
Replication of initial GWAS findings using a staged design has been suggested to minimize genotyping cost and maximize statistical power [23,24]. An important consideration in such a design is the proportion of markers taken forward to a second stage. We estimated the statistical power for our three-stage GWAS strategy. Assuming that a modest number of markers (all SNPs with P < 0.001 for each phenotype, ~0.1% of markers) are taken forward to Stage II, a Stage II sample size of 1450, that SNPs with P < 0.001 are taken forward from Stage II to Stage III, a Stage III sample size of 6650, and that the final alpha (after Stages I, II, and III) is set at a conservative 5 × 10⁻⁸, we estimated that we had 89% power to detect a quantitative trait locus explaining 2% of phenotypic variance, 48% power to detect a locus explaining 1% of the variance, and 13% power to detect a locus explaining 0.5% of the variance.
[Table fragment (body not recovered): APOE, LDL-C, rs429358, rs7412. Table footnotes: *The allele on the positive strand of the reference genome was modeled in all analyses. †Beta refers to the proportion of 1 standard deviation unit change in phenotype (phenotype is sex-specific residual adjusted for age and age²) per copy of the allele modeled. ‡"Failed" refers to SNP genotype failure in the sample.]
With our replication effort, we failed to identify any novel loci related to blood lipids. At least two potential explanations are possible. First, our study design had limited statistical power to detect common SNPs that explain ≤1% of trait variance. In the Diabetes Genetics Initiative genome-wide association study for blood lipid traits, we recently showed that there are few common variants that explain >2% of the variance and that most SNPs explain <1% of trait variance [25]. To have adequate statistical power to detect these effects given an initial GWAS sample size of ~1000, many more markers (i.e., hundreds of SNPs) will need to be taken to the second and third stages. Second, the limited genomic coverage of the Affymetrix 100K array may have limited our ability to replicate previously reported loci and discover novel loci. For example, using the Affymetrix 500K array, we recently identified glucokinase regulatory protein (GCKR) as a novel locus associated with TG [25]. Of any SNP on the 500K array, an intronic GCKR SNP (rs780094) explained the greatest proportion of blood TG variance in the Diabetes Genetics Initiative study. However, on the Affymetrix 100K array, there are no SNPs within the 60 kb spanning GCKR.

Strengths and limitations
This study is distinguished by the availability of serial lipid phenotypes over a 30-year time span, the community-based nature of the collection, and the routine ascertainment of covariates in a standardized clinical examination. We acknowledge several limitations. These include the lack of validation for the imputation methodology used to address lipid lowering therapy, limited statistical power due to sample size, and confinement to a single ancestral group: whites of European ancestry.

Conclusions & future directions
Using a 100K genome-wide scan, we present association and linkage results for a rich set of lipid phenotypes in FHS. This resource may be useful for comparisons with other GWAS currently in progress. GWAS in FHS using a denser genome-wide genotyping platform and a better-powered replication strategy may identify novel loci underlying blood lipids.