The Framingham Heart Study 100K SNP genome-wide association study resource: overview of 17 phenotype working group reports

Background The Framingham Heart Study (FHS), founded in 1948 to examine the epidemiology of cardiovascular disease, is among the most comprehensively characterized multi-generational studies in the world. Many collected phenotypes have substantial genetic contributors; yet most genetic determinants remain to be identified. Using single nucleotide polymorphisms (SNPs) from a 100K genome-wide scan, we examine the associations of common polymorphisms with phenotypic variation in this community-based cohort and provide a full-disclosure, web-based resource of results for future replication studies.
Methods Adult participants (n = 1345) of the largest 310 pedigrees in the FHS, many biologically related, were genotyped with the 100K Affymetrix GeneChip. These genotypes were used to assess their contribution to 987 phenotypes collected in FHS over 56 years of follow-up, including: cardiovascular risk factors and biomarkers; subclinical and clinical cardiovascular disease; cancer and longevity traits; and traits in pulmonary, sleep, neurology, renal, and bone domains. We conducted genome-wide variance components linkage and population-based and family-based association tests.
Results The participants were white of European descent and from the FHS Original and Offspring Cohorts (examination 1 Offspring mean age 32 ± 9 years, 54% women). This overview summarizes the methods, selected findings and limitations of the results presented in the accompanying series of 17 manuscripts. The presented association results are based on 70,987 autosomal SNPs meeting the following criteria: minor allele frequency ≥ 10%, genotype call rate ≥ 80%, Hardy-Weinberg equilibrium p-value ≥ 0.001, and satisfying Mendelian consistency. Linkage analyses are based on 11,200 SNPs and short-tandem repeats. Results of phenotype-genotype linkages and associations for all autosomal SNPs are posted on the NCBI dbGaP website at http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?id=phs000007.
Conclusion We have created a full-disclosure resource of results, posted on the dbGaP website, from a genome-wide association study in the FHS. Because we used three analytical approaches to examine the association and linkage of 987 phenotypes with thousands of SNPs, our results must be considered hypothesis-generating and need to be replicated. Results from the FHS 100K project with NCBI web posting provide a resource for investigators to identify high priority findings for replication.

In this manuscript, we summarize the strategies that we pursued to conduct the 100K genome-wide study, providing an overview for a series of 17 companion manuscripts (Table 1 of the Overview) describing associations with specific collections of traits [26][27][28][29][30][31][32][33][34][35][36][37][38][39][40][41][42]. The primary purpose of this project was to generate hypotheses regarding genetic factors that may contribute to the wide spectrum of phenotypic variables collected in the FHS through a genome-wide approach. More specifically, we primarily hypothesized that common genetic variants contributing to phenotypic variation can be detected through a genome-wide association study (GWAS) and that genetic loci contributing to phenotypic variation can be detected through linkage. Each manuscript also examines whether the 100K analyses replicated previously reported associations with consistent evidence from the literature for some specific traits. The main purpose of this series of publications is to describe the association results made available for investigators and to direct readers to their free availability in the database of Genotype and Phenotype (dbGaP) public repository http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?id=phs000007 at the National Center for Biotechnology Information (NCBI), where these comprehensive results are posted and may be browsed in the context of multiple genomic tracks including Entrez Gene, RefSeq, dbSNP, genetic markers, and OMIM. The deposition of these data in a public repository is consistent with the long tradition of publishing preliminary results from the FHS to benefit the wider scientific community.
To organize the evaluation of the rich resource of data collected over nearly 60 years of follow up, we established a set of "Phenotype Working Groups" that included clinicians, epidemiologists, geneticists, and biostatisticians. These groups specified the traits to be studied, along with covariate adjustment and subgroups for analyses. In all, 987 phenotypes were examined for association, 835 for linkage. Some phenotypes are the same trait with different covariate adjustments, at different examinations or evaluated in different subgroups. For example, many traits were evaluated with both age and sex adjustment as well as with additional multivariable adjustments, yielding more than one phenotype for analysis. Each manuscript in this series provides a platform for the web posted results. Not every trait is described in the manuscripts; rather, the purpose of each manuscript is to introduce the trait areas and to present a brief summary of the results. In the present manuscript, we describe the general approach to analysis of the traits, provide an overview of some results, and discuss the limitations of the studies.

Study sample
The Framingham Heart Study (FHS) began in 1948 with recruitment of 5209 men and women (2336 men and 2873 women) between the ages of 28 and 62 years in the town of Framingham, Massachusetts, about 20 miles west of Boston [43][44][45][46]. These individuals were recruited through a two-thirds systematic sample of the households of Framingham, Massachusetts. Although not initially intended as a family study, many households consisted of spouse pairs (1644 pairs). The primary purpose of the Study was to follow individuals over time for development of cardiovascular disease events to evaluate the interplay among multiple risk factors that lead to disease and their individual and joint effects. The participants in the Original Cohort have been examined every two years since.
In 1971, an Offspring Cohort of 5124 men and women, who were adult children of Original Cohort members or were spouses of these offspring, was recruited and has been examined every four to eight years since [47,48]. The subjects in this report are drawn from the largest 310 pedigrees in these two generations. The participants were recruited without regard to phenotypes. Thus, the Offspring Cohort of 5124 (2483 men and 2641 women) was recruited by inviting all offspring of the spouse pairs (2616 and 34 stepchildren), the offspring spouses (1576) and, additionally, those offspring (898) of singleton Original Cohort members with elevated lipid levels. Further information regarding recruitment can be seen in Cupples et al. [49] and Dawber [43].
In the late 1980s and through the 1990s, DNA was collected from living study participants. As many of the Original Cohort members were deceased by that time, these DNAs were mostly collected in Offspring Study participants. During the mid- to late-1990s, 1702 DNA samples were genotyped by the Mammalian Genotyping Service in the largest 330 two-generation pedigrees consisting of 2885 Framingham Study participants. These pedigrees were used for linkage analyses of blood pressure [15], lipids [17,50], body mass index [25] and a wide variety of other traits [51][52][53][54][55][56]. The numbers of relative pairs among the 1345 subjects both genotyped and phenotyped in this study are 435 parent-offspring pairs, 988 sib pairs, 300 avuncular pairs and 634 first-cousin pairs. Among the 1087 Offspring Cohort participants, who were the only participants evaluated in some analyses, there were 936 sib pairs, 63 avuncular pairs and 612 first-cousin pairs.
Original Cohort study subjects return to the Study every two years for a detailed medical history, physical examination and laboratory tests. The Original Cohort subjects are currently in their 29th examination. Participants in the Offspring Cohort return every 4 to 8 years for similar examinations and the 8th examination is currently underway.
In the early 2000s, a family DNA plate set with 1,399 participants from these 330 pedigrees (http://www.nhlbi.nih.gov/about/framingham/policies/index.htm) was established. Only subjects with lymphoblast cell lines were included on the plate set, although a substantial number of the DNA samples on the plate set were derived from whole blood or buffy coat. The family plate set was used for genotyping of the Affymetrix 100K GeneChip. After cleaning the genotyping data, the study sample comprised 1345 FHS participants, 258 from the Original Cohort and 1087 from the Offspring Cohort.
heart rate variability. In addition, we established a statistical and analytical methodology group. These groups were convened by the FHS Genetic Steering Committee to define phenotypes to be evaluated, including the covariates used in analyses, to review results of linkage and association analyses, to foster communication among various Framingham investigators who were working on different traits, and to suggest possible follow-up strategies.
For the 100K genome-wide project, each Working Group defined the phenotypes to be studied. Since most traits have well established factors that contribute to their variation, each group created a set of residuals from multivariable regression models accounting for the primary known covariates, in order to control for confounding from these variables and to increase the ability to detect genetic signals. For quantitative traits, the adjusted standardized residuals were generated using linear regression models. For qualitative traits, we used a variety of approaches including Cox proportional hazards with martingale residuals for time-to-event (survival) traits and logistic regression with deviance residuals for dichotomous traits. These methods are described below. In some cases, several different covariate adjustments were used for a single trait. Each manuscript describes the specific adjustments that were applied. We used residuals from regression models that included all subjects with traits in each Cohort, rather than limiting analyses to those who were genotyped, to produce residuals based on all subjects with phenotypic values, regardless of availability of genotypic data. This approach avoids potential biases in covariate adjustment based only upon the subset of individuals with both genotype and phenotype data and produces robust estimates of covariate effects.
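The residual approach for quantitative traits can be illustrated with a minimal sketch. It assumes a single covariate (age) and ordinary least squares, whereas the Working Groups used multivariable models; the cohort records, the `genotyped_ids` set and the function names are hypothetical. The point it shows is that the regression is fit on all phenotyped subjects and only afterwards restricted to the genotyped subset.

```python
import statistics

def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def standardized_residuals(x, y):
    """Residuals from the fitted line, scaled to unit standard deviation."""
    a, b = fit_simple_ols(x, y)
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    sd = statistics.stdev(resid)
    return [r / sd for r in resid]

# Hypothetical full cohort: (participant id, age, trait value).
cohort = [(1, 35, 118.0), (2, 42, 125.0), (3, 50, 131.0),
          (4, 58, 140.0), (5, 63, 139.0), (6, 47, 122.0)]
genotyped_ids = {2, 4, 6}  # hypothetical genotyped subset

ages = [age for _, age, _ in cohort]
traits = [value for _, _, value in cohort]
z = standardized_residuals(ages, traits)  # fit uses ALL phenotyped subjects

# Only the genotyped subset is carried forward to the genetic analyses.
residual_phenotype = {pid: zi for (pid, _, _), zi in zip(cohort, z)
                      if pid in genotyped_ids}
```

Because the fit includes an intercept, the residuals average to zero over the full cohort, but not necessarily over the genotyped subset.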

Genotyping methods
Genomic DNA derived from whole blood or buffy coat was phenol-chloroform extracted and DNA from immortalized lymphoblast cell lines was salt-precipitate extracted. Genotyping of the 100K SNPs in FHS families was performed through an ancillary study to Drs. Michael Christman and Alan Herbert at Boston University School of Medicine in the Department of Genetics and Genomics using the GeneChip Human Mapping 100K set from Affymetrix, following the manufacturer's protocol as previously described [57]. Genotypes were determined using the Dynamic Modeling (DM) algorithm [58]. For linkage analyses, we also included microsatellites that had been genotyped by the NHLBI Mammalian Genotyping Service, Center for Medical Genetics, Marshfield Medical Research Foundation (http://research.marshfieldclinic.org/genetics). A set of 401 microsatellite markers [59], covering the genome at an average density of one marker every 10 cM and with an average heterozygosity of 0.77, were genotyped in 1702 subjects in the mid to late 1990s (Screening Set v. 8) [60]. An additional 190 participants on the Family Plate Set were genotyped later with microsatellites using Screening Set v. 13 and some additional microsatellites were also genotyped in the FHS Genetics Laboratory. With the addition of these microsatellites and changes in the marker sets from Set 8, there were 613 microsatellite markers available for analysis.

Statistical analysis methods
Data cleaning A total of 1380 individuals were successfully genotyped. First, familial relationships were checked using the sib_kin utility in the Aspex software package [61]. Because this study focused on participants of families, nine individuals were excluded as they no longer had biologic relatives in the sample. Twenty-six individuals were excluded due to inconsistencies; the majority of these individuals were found to have an excessive number of Mendelian errors as identified by the software PedCheck, Version 1.1 [62]. Others were excluded for having a relationship inconsistency, for a sex discrepancy or for a low genotyping call rate. Mendelian inconsistencies were resolved by removing the genotypes of all individuals within nuclear families in which the error occurred. These steps left 1345 individuals with genotypes available for analyses.
For Hardy-Weinberg equilibrium (HWE) testing, we randomly selected one individual per family to form a sample of unrelated individuals. Then, for each of the 100K SNPs, the observed genotype frequencies were compared to those expected under HWE using an exact chi-square test statistic [63] implemented in the Genetics package [64].
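As an illustration of an exact HWE test, the sketch below uses one common formulation: it conditions on the observed allele counts, enumerates every possible heterozygote count, and sums the probabilities of all configurations no more likely than the observed one. Whether this matches the cited implementation in every detail is an assumption, and the function name and genotype counts are hypothetical.

```python
from math import exp, lgamma, log

def hwe_exact_pvalue(n_aa, n_ab, n_bb):
    """Exact HWE p-value from genotype counts in unrelated individuals."""
    n = n_aa + n_ab + n_bb      # number of individuals
    n_a = 2 * n_aa + n_ab       # copies of allele A
    n_b = 2 * n_bb + n_ab       # copies of allele B

    def log_prob(het):
        # log P(het heterozygotes | allele counts) under HWE
        hom_a = (n_a - het) // 2
        hom_b = (n_b - het) // 2
        return (het * log(2) + lgamma(n + 1)
                - lgamma(hom_a + 1) - lgamma(het + 1) - lgamma(hom_b + 1)
                + lgamma(n_a + 1) + lgamma(n_b + 1) - lgamma(2 * n + 1))

    # The heterozygote count must share the parity of the allele count.
    probs = {het: exp(log_prob(het))
             for het in range(n_a % 2, min(n_a, n_b) + 1, 2)}
    p_obs = probs[n_ab]
    # Sum configurations as likely or less likely than the observed one.
    tail = sum(p for p in probs.values() if p <= p_obs * (1 + 1e-9))
    return min(1.0, tail)

# Hypothetical counts: one SNP in perfect HWE, one with no heterozygotes.
p_balanced = hwe_exact_pvalue(25, 50, 25)
p_no_hets = hwe_exact_pvalue(50, 0, 50)
```

A SNP would then be retained for the association results only if its p-value met the reporting threshold stated in the abstract (HWE p-value ≥ 0.001).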

Linkage analyses
Both microsatellites previously genotyped by the Mammalian Genotyping Service and SNPs from the 100K were used to calculate identity by descent probabilities. We constructed genetic maps using all microsatellite NCBI genetic markers with Marshfield genetic location available and whose physical order and genetic order were consistent. Using this NCBI Marshfield map as our skeleton, we applied linear interpolation from physical to genetic distance to obtain approximate genetic locations (in centiMorgans) for all SNPs in the 100K set with known physical location.
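The interpolation step can be sketched as follows for a single chromosome. The skeleton positions are made up for illustration, the function name is hypothetical, and clamping SNPs that fall outside the skeleton to the nearest end marker is an assumption; the text does not say how such SNPs were handled.

```python
from bisect import bisect_right

def interpolate_cm(skeleton, bp):
    """Interpolate a genetic position (cM) for a physical position (bp).

    skeleton: list of (bp, cM) pairs for one chromosome, sorted by bp,
    with physical and genetic order consistent.
    """
    positions = [p for p, _ in skeleton]
    if bp <= positions[0]:        # assumption: clamp to the first marker
        return skeleton[0][1]
    if bp >= positions[-1]:       # assumption: clamp to the last marker
        return skeleton[-1][1]
    i = bisect_right(positions, bp)
    (bp0, cm0), (bp1, cm1) = skeleton[i - 1], skeleton[i]
    frac = (bp - bp0) / (bp1 - bp0)
    return cm0 + frac * (cm1 - cm0)

# Hypothetical skeleton of three microsatellites on one chromosome.
skeleton = [(1_000_000, 2.0), (3_000_000, 6.0), (4_000_000, 7.0)]
snp_cm = interpolate_cm(skeleton, 2_000_000)
```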
Because current linkage analysis software cannot handle the marker density available from a 100K scan, we selected a subset of 10,592 SNPs to supplement 613 genome scan microsatellite markers available on 1886 members of the largest 330 Framingham families. We selected SNPs to minimize linkage disequilibrium (LD) because current linkage software assumes that markers are in linkage equilibrium, and violation of this assumption has been shown to create spurious linkage evidence in certain contexts [66,67]. Thus, for calculation of identity by descent (IBD) probabilities for linkage analyses we used SNPs with a call rate of at least 85%, HWE p-value > 0.05 and more informative markers with MAF > 5%. We iteratively identified SNP pairs with LD measure D' > 0.5, as estimated from HapMap data, and eliminated the SNP that was least informative for linkage (lowest MAF). We started with SNP pairs most closely located (physical distance) and continued until no pairs of SNPs had a D' measure exceeding 0.5. The final set of 10,592 SNPs combined with the 613 microsatellites were checked for excess recombination using MERLIN, Version 0.10.2 [68], and 4 SNPs and 1 microsatellite were omitted from linkage analyses based on a high number of possible errors, leaving a total of 11,200 markers to perform linkage analysis (10,588 SNPs + 612 short tandem repeats).
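The iterative LD-based thinning described above can be sketched as a greedy pass over SNP pairs ordered by physical distance. This is an illustrative reading of the procedure, not the authors' code: the SNP names, positions, MAFs and D' values are hypothetical, and a real implementation would only consider nearby pairs rather than all pairs.

```python
from itertools import combinations

def prune_ld(snps, d_prime, threshold=0.5):
    """Return the SNP names kept after LD-based thinning.

    snps: dict name -> (position_bp, maf)
    d_prime: dict frozenset({name_a, name_b}) -> D' estimate
    """
    kept = set(snps)
    # Pairs exceeding the threshold, closest pairs first.
    pairs = sorted(
        (pair for pair in combinations(snps, 2)
         if d_prime.get(frozenset(pair), 0.0) > threshold),
        key=lambda pair: abs(snps[pair[0]][0] - snps[pair[1]][0]),
    )
    for a, b in pairs:
        if a in kept and b in kept:
            # Drop the SNP that is less informative for linkage (lower MAF).
            kept.discard(min((a, b), key=lambda name: snps[name][1]))
    return kept

# Hypothetical SNPs: name -> (position in bp, minor allele frequency).
snps = {"snp1": (1000, 0.30), "snp2": (1500, 0.12),
        "snp3": (4000, 0.45), "snp4": (9000, 0.20)}
d_prime = {frozenset(("snp1", "snp2")): 0.9,
           frozenset(("snp2", "snp3")): 0.8,
           frozenset(("snp1", "snp3")): 0.6,
           frozenset(("snp3", "snp4")): 0.2}
kept = prune_ld(snps, d_prime)
```

Because removing a SNP can only eliminate pairs, a single pass in distance order leaves no retained pair with D' above the threshold.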
Variance component linkage analyses were performed on residuals of up to 1341 individuals in 310 full pedigrees. Four of the 1345 subjects were the only person in a pedigree and were excluded from linkage analyses since they contributed no information. Multipoint probabilities of IBD between relative pairs were computed at each genetic marker location with the program MERLIN, Version 0.10.2. Due to size limitations for exact identity by descent (IBD) multipoint computation in MERLIN software [68], the 310 full pedigrees were broken into 356 smaller pedigrees. The hypothesis of "no linkage at a specific genomic location" was tested by comparing models incorporating an effect of a putative quantitative trait locus (QTL) in complete linkage to the genetic marker, in the form of multipoint IBD sharing probabilities at the locus, to models incorporating only polygenic effects without a QTL effect. At each genetic location, a LOD score was computed as the logarithm to base 10 of the likelihood ratio of the locus-specific model to the polygenic model using the program SOLAR, Version 3.0.4 [69]. Allele frequencies were estimated by simple allele counts. In Framingham family data, which were collected from randomly sampled pedigrees, we have found the allele frequency estimates by simple allele counting closely match those calculated by maximum likelihood methods accounting for familial correlations.
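The LOD score definition above translates directly into a small calculation from the two maximized log-likelihoods. The sketch below also converts a LOD score to a pointwise p-value using the 50:50 mixture of a point mass at zero and a chi-square with 1 degree of freedom, which is the usual asymptotic approximation for a variance component constrained to be non-negative; that conversion is an assumption added for illustration and is not stated in the text. Function names and the example log-likelihoods are hypothetical.

```python
from math import erfc, log, sqrt

LN10 = log(10.0)

def lod_score(loglik_qtl, loglik_polygenic):
    """LOD = log10 likelihood ratio, from natural-log likelihoods."""
    return (loglik_qtl - loglik_polygenic) / LN10

def pointwise_p_value(lod):
    """Approximate pointwise p-value under a 50:50 chi-square mixture."""
    if lod <= 0:
        return 1.0
    # The chi-square(1) survival function at x is erfc(sqrt(x / 2)),
    # and the likelihood ratio statistic is x = 2 * ln(10) * LOD.
    return 0.5 * erfc(sqrt(lod * LN10))

# Hypothetical log-likelihoods for the QTL and polygenic-only models.
lod = lod_score(-1000.0, -1006.907755)
p = pointwise_p_value(lod)
```

Under this approximation a LOD score of 3 corresponds to a pointwise p-value of roughly 0.0001.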

Association testing
We applied population-based and family-based methods to test for association between the 100K SNPs and residual phenotypes using an additive model unless otherwise specified. We used family-based association test methods, implemented in the program FBAT, Version 1.5.5 [70,71], to test for differences in probability of transmission of a genotype from parents to offspring based on phenotype, as a test of linkage and association. FBAT has limited power because it requires association within families, and many families are non-informative. However, because FBAT examines association only within families, the type-I error rate is not affected by population stratification bias [72,73]. We did not report results if the number of informative families was fewer than 10.
For the population-based approach, we used generalized estimating equation (GEE) [74] regression models to test for association between the 100K SNPs and each residual phenotype while taking into account the correlation among related individuals. We implemented the GEE approach by breaking families into sibships and used an exchangeable working correlation matrix to account for correlation within each sibship. Parental correlations with their children were not considered in these analyses. The analyses were performed using the gee program package, Version 4.13-10 [75] in R [65]. The GEE association test is a population-based approach that uses all individuals with both genotype and phenotype, regardless of genotype configuration within a family. Therefore, it is expected to be a powerful test of association if population stratification bias is not believed to be an issue, as in the FHS [76].
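The clustering step that precedes the GEE fit, breaking families into sibships, can be sketched as below. The pedigree and the function name are hypothetical, and treating individuals whose parents are not both recorded as single-member clusters is an assumption about how such individuals were handled.

```python
from collections import defaultdict

def split_into_sibships(pedigree):
    """Group individuals into sibship clusters for a GEE analysis.

    pedigree: dict id -> (father_id, mother_id); None marks an unknown parent.
    Returns a list of clusters, each a list of individual ids.
    """
    sibships = defaultdict(list)
    clusters = []
    for pid, (father, mother) in pedigree.items():
        if father is None or mother is None:
            clusters.append([pid])   # assumption: singleton cluster
        else:
            sibships[(father, mother)].append(pid)
    clusters.extend(sibships.values())
    return clusters

# Hypothetical two-generation pedigree.
pedigree = {
    1: (None, None), 2: (None, None),   # founders
    3: (1, 2), 4: (1, 2), 5: (1, 2),    # children of 1 and 2
    6: (None, None),                    # spouse of 3
    7: (3, 6), 8: (3, 6),               # children of 3 and 6
}
clusters = split_into_sibships(pedigree)
```

Each cluster would then be given an exchangeable working correlation, so that a parent and child fall in different clusters and their correlation is not modeled, as the text notes.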

Participant characteristics
Of the 1345 subjects who satisfied appropriate familial relationships and who were considered in the presentation of results in these manuscripts, 258 were Original Cohort participants (90 men and 168 women) and 1087 were Offspring Cohort participants (527 men and 560 women). Table 2 of the Overview presents descriptive information on these participants at enrollment (examination one). The Original and Offspring Cohort participants included on the family plates had lower mean age than other examination 1 participants, as these subjects needed to survive to the mid 1990s to provide DNA. We note that we used residuals based upon all subjects, as opposed to only those who were genotyped. Thus, the phenotypes reflect deviations of these subjects based on regressions for the full sample of subjects and are thus representative of the full sample.

Format of the FHS 100K manuscripts
We present 17 manuscripts, each displaying selected results for an epidemiologically related group of traits. Table 1 of the Overview presents the title and first author of each manuscript. The web resource displaying genetic association and linkage results is available at the NCBI dbGaP website, http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?id=phs000007. Each manuscript describes the traits that were studied and presents some results. These manuscripts are not intended to be comprehensive and generally do not include results for all phenotypes and covariate adjustment schemes that were studied and presented on the website. Full listings of all traits evaluated are provided in Additional file 1 (phenotypes for population-based GEE analyses), Additional file 2 (phenotypes for family-based FBAT analyses) and Additional file 3 (phenotypes for linkage analyses), including URL links to the corresponding analytical results on the NCBI dbGaP website. To facilitate the reading of these manuscripts, we have used a common format for all manuscripts. Table 1 of each manuscript presents a general description of the phenotypes that were evaluated. Table 2 of each manuscript displays the top results (lowest p-values) from GEE analyses, the top results (lowest p-values) from FBAT analyses and linkage results where the LOD score was 2 or more. Whereas top association results are based solely on p-value rank, the Working Groups also applied various additional strategies to identify SNPs that the group would prioritize to pursue further. For Table 3 of each manuscript, the groups devised schema to summarize results for related traits, grouping phenotypic traits within biologically plausible domains, or traits examined longitudinally. Each manuscript provides a description of the strategy employed and the results for its Table 3.
Finally, Table 4 in each manuscript lists some SNPs that are the same as or correlated with genetic variants in genes that have been reported in the literature to be associated with the manuscript's phenotypes and indicates whether our results replicate those reports. Physical locations of the SNPs are provided according to NCBI Build35, whereas the dbGaP website uses a more recent version. Thus, the physical locations reported in the manuscripts may differ from those on the website. Each manuscript provides criteria for choosing which results were reported.

SNP allele frequencies and distribution
Allele frequencies for the 100K Affymetrix GeneChip in the Framingham sample are displayed in Figure 1. About 38% of SNPs have MAF < 10% and are not considered in the series of manuscripts, although they are included on the dbGaP website. Among SNPs with MAF ≥ 10%, a large number had MAF between 10% and 25%, and the remainder were spread fairly evenly over the range from 25% to 50%. Many SNPs on the Affymetrix chip are not near genes (Figure 2). About 30,000 SNPs with MAF ≥ 10% are within 5 kb of a gene; another 10,000 with MAF < 10% are within 5 kb. The remaining SNPs are further away from known genes. We expect only a small percentage of all tested SNPs to be truly associated with any phenotype. Therefore, to obtain an approximation of the null distribution of p-values, we examined the distribution of p-values for 415 phenotypes from the Metabolic Working Group and 14 CVD event phenotypes. If one assumes that only a few true associations exist for each phenotype, these p-value distributions approximate the null distribution, because only a few SNPs out of the large total number tested would be expected to exceed any critical value due to true associations. Table 3 of the Overview displays the proportion of p-values among all SNPs below specific nominal alpha levels, summarized (mean, minimum and maximum) across all phenotypes in the trait group for GEE and FBAT results.

P-value distribution
Many of the phenotypes in the Metabolic Working Group were approximately normally distributed (about 90% had absolute value of skewness < 1 and about 80% had absolute value of kurtosis < 2) and thus may reflect the situation for which the assumptions of the analytical methods were generally satisfied. We display two sets of results in Table 3 of the Overview for these phenotypes, those used in the publication of the manuscripts with the number of SNPs equal to 70,987 and the larger set of results displayed on the website with the number of SNPs equal to ~100-103 K. The difference in the number of SNPs evaluated for GEE and FBAT results arises from those SNPs that are uninformative for FBAT analyses (those with sufficiently rare minor allele so that fewer than 10 nuclear families were informative for transmission). The p-value distributions suggest that FBAT p-values generally follow the expected null distribution, assuming that nearly all results are false positives, and may actually be somewhat conservative. In contrast, the GEE p-values exhibit an excess of small p-values, especially for smaller nominal alpha levels. For example, for SNPs reported in the manuscripts, the average proportion of SNPs for a phenotype with p-value below specified alpha levels ranged from 1.3 to 19 times greater than the nominal level (1.3 times larger for nominal alpha of 0.01, 19 for nominal alpha of 10⁻⁷ and 10 for 10⁻⁸). The excess is higher for the full set of SNPs reported on the website. Here we found that the average proportion ranged from 1.2 times larger for nominal level of 0.05 to 19 times greater for nominal level of 10⁻⁵ and 2500 times greater for nominal level 10⁻⁸.
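The "times greater than the nominal level" figures quoted above are ratios of the observed proportion of p-values below alpha to alpha itself. A minimal sketch of that calculation, using made-up p-values and a hypothetical function name, is:

```python
def excess_ratio(p_values, alpha):
    """Observed proportion of p-values below alpha, divided by alpha."""
    observed = sum(p < alpha for p in p_values) / len(p_values)
    return observed / alpha

# Hypothetical p-values: an evenly spaced grid mimics a well-calibrated test,
# and raising each value to the power 1.5 mimics an anti-conservative one.
calibrated = [(i + 0.5) / 1000 for i in range(1000)]
inflated = [p ** 1.5 for p in calibrated]

ratio_calibrated = excess_ratio(calibrated, 0.01)
ratio_inflated = excess_ratio(inflated, 0.01)
```

A ratio near 1 indicates agreement with the nominal level, a ratio below 1 a conservative test, and a ratio above 1 an excess of small p-values.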
The CVD phenotypes represent an extreme case, as the phenotypes were residuals from survival models, were generally bimodal, and do not satisfy general assumptions for normality. We see the same general pattern that we observed for the Metabolic Working Group phenotypes with somewhat conservative FBAT tests and excess numbers of small p-values for GEE tests.
We also examined the dependence of the p-value distribution for GEE results on the genotyping call rate for the Metabolic Working Group phenotypes. Our sample was not ascertained on trait status; so genotyping failures were likely to be randomly distributed. Therefore, one might expect that the effect of genotyping error on type I error would be more modest than for case-control studies [77].
As expected, we continued to find an excess of small p-values, despite increasingly stringent call rate thresholds. More importantly, we found that this excess occurred regardless of call rate. For example, for nominal alpha of 0.001 and genotyping call rate > 95%, we found that the ratio of the number of observed to expected significant results ranged from 1.6 for MAF in the range of (0.2, 0.5) to 7.0 for MAF in the range (0, 0.05). Similarly, for call rate less than 80% we found similar ratios of 1.6 to 8.1, respectively. For nominal alpha of 10⁻⁶ we found this ratio varied from 9.5 to 614 for call rate > 95% and 6.8 to 667 for call rate < 80%. Thus, we used a liberal genotyping call rate of > 80% for presentation of results in our manuscripts to err on the side of including a result rather than not, even though we expect nearly all results to be false positives.
We see that FBAT p-values tend to be less significant than expected (conservative) whereas GEE p-values tend to be more significant than expected (liberal), especially for smaller expected p-values. While we would expect most p-values to fall on the line if there were no genetic associations, p-values that reflect true associations will be more extreme (smaller) than expected. In looking at the figure for mean fasting glucose, SNPs represented by the blue dots (GEE) far above the expected line on the right hand side of the figure may represent true associations with mean fasting plasma glucose. The plot for mean fasting HDL cholesterol also suggests that there may be some true positives, as even a few FBAT p-values are more extreme than expected.
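The comparison of observed with expected p-values described above is the standard quantile-quantile construction. A minimal sketch follows; the plotting position (rank − 0.5)/n for the expected p-value is one common convention and is an assumption here, as are the function name and the example p-values.

```python
from math import log10

def qq_points(p_values):
    """Return (expected, observed) pairs on the -log10 scale.

    Pairs are ordered from the most to the least significant result.
    All p-values must be strictly positive.
    """
    n = len(p_values)
    ordered = sorted(p_values)
    return [(-log10((rank - 0.5) / n), -log10(p))
            for rank, p in enumerate(ordered, start=1)]

# Hypothetical p-values for a handful of SNPs.
points = qq_points([0.5, 0.05, 0.005])
expected_top, observed_top = points[0]
```

Points lying above the line observed = expected correspond to p-values smaller than expected under the null (liberal tests or true associations); points below it correspond to conservative tests.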
As in any GWAS, we expect that most results with small p-values are false positives. The p-value distributions support this notion and further suggest that the GEE results may have more false positives than one would expect. Table 2 in each manuscript ranks results by p-value, but each paper also pursues its own strategy to identify which results may be more worthy of follow-up in Table 3 of each manuscript, usually by considering evidence from several sources such as correlated traits.

Power estimations for population-based association approach
To assess the power of the population-based association approach with GEE, we simulated a trait following a normal distribution with 30% polygenic heritability in this sample of 1345 subjects. We generated a SNP with MAF 0.10 and assumed that the SNP was the QTL with an additive effect and QTL heritability varying from 1% to 5%. We also varied the proportion of phenotyped individuals from 60% to 100%, as some traits were not available in all subjects genotyped. The phenotype and genotype data were simulated using SOLAR simqtl, Version 3.0.4. We tested the association between the SNP and the trait using GEE. One thousand replicates were performed for each scenario. The results are displayed in Table 4 of the Overview. For a conservative alpha level such as 10⁻⁸, we have more than 80% power to detect a SNP explaining 4% or more of the total phenotypic variation when 60% or more of individuals are phenotyped. With higher MAF, the power remains similar for the same QTL heritability (data not shown). Thus, we have sufficient power to detect SNPs explaining 4% or more of the phenotypic variance using the population-based GEE association test approach, controlling for multiple testing for a single trait. The effect size for a specific QTL heritability, defined as the increase or decrease of the phenotype value with each additional copy of the allele tested, depends on the MAF of the SNP. For example, for a SNP explaining 4% of the phenotypic variation, the effect size is 0.47 times the phenotypic standard deviation (SD) for a MAF of 0.1, and 0.28 SD for a MAF of 0.5.
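The two effect sizes quoted above follow from the usual expression for the variance explained by an additive biallelic locus under Hardy-Weinberg proportions, h² = 2p(1 − p)a²/σ², where p is the allele frequency and a is the per-allele effect. The sketch below solves this for a in SD units; the formula is the standard additive model and the function name is hypothetical.

```python
from math import sqrt

def additive_effect_in_sd(qtl_h2, maf):
    """Per-allele effect, in phenotypic SD units, for a given QTL heritability.

    Solves qtl_h2 = 2 * maf * (1 - maf) * a**2 for a, with variance scaled to 1.
    """
    return sqrt(qtl_h2 / (2.0 * maf * (1.0 - maf)))

effect_maf_10 = additive_effect_in_sd(0.04, 0.10)
effect_maf_50 = additive_effect_in_sd(0.04, 0.50)
```

Rounded to two decimals, these reproduce the 0.47 SD and 0.28 SD values given in the text for a SNP explaining 4% of the phenotypic variation.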

Synthetic strategies
Beyond simple examination of individual p-values for single tests, the authors of each manuscript developed their own synthetic strategies to prioritize SNPs that may be worthy of follow-up (results shown in Table 3 of each manuscript). In one such strategy [29], within each trait group, SNPs were ranked according to the proportion of traits in the group with p < 0.01 for both FBAT and GEE.

Some results of interest
The main results for each Working Group are presented in the individual manuscripts of this series. Here we highlight some results that address some of our expectations.

Overlap of linkage and association results
Whereas strategies for genetic studies have been undergoing substantial changes in recent years, partly due to changes in the laboratory, we hypothesized that genomic regions that harbor significant linkage results would also contain significant association results.

SNPs overlapping across phenotypes
We did not expect the same SNPs to appear in many manuscripts, as cardiovascular disease is complex and involves a large and varied number of pathways for its development. In contrast, some manuscripts report on correlated traits. Thus, we examined overlap among the top 500 SNPs associated with the phenotypes across 3 Metabolic Working Groups: glycemic/diabetes phenotypes, lipid phenotypes and obesity phenotypes. Of 11 SNPs found in more than one group, none were found among the top 500 SNPs in all three groups. However, 7 SNPs were found in the glycemia and obesity groups, 2 in glycemia and lipid groups and 2 in lipid and obesity groups [38][39][40].

Replication of prior associations
In Table 4 of each manuscript, we investigated whether our results replicated previous reports in the literature. The 100K chip does not contain many SNPs in well-known lipid genes, such as APOE. On the other hand, we found that SNP rs7007797 in the LPL gene was associated with both HDL and triglycerides [39]; we replicated recent findings of association of a SNP in the TCF7L2 gene with diabetes [38]. Strong statistical support was found for the association of factor VII concentrations with SNP rs561241 on chromosome 13 (4 × 10⁻¹⁶) [34], which resides near the factor VII gene and is in complete linkage disequilibrium (r² = 1) with the Arg/Gln FVII SNP previously shown to account for 9% of the total phenotypic variance [78]. Similarly, we found associations of circulating levels of C-reactive protein with a SNP in the gene encoding C-reactive protein [33]. Two SNPs in SORL1, a gene recently related to the risk of Alzheimer's disease [79], were found to be associated with performance on tests of abstract reasoning (rs1131497, FBAT p = 3.2 × 10⁻⁶; rs726601, FBAT p = 8.2 × 10⁻⁴) [37]. We found that SNPs (rs2543600 and rs27225364) near the WRN gene that causes premature aging are associated with age at death and morbidity-free survival at age 65 years [35]. The LD between these SNPs and those previously reported in the WRN gene is unknown as the previously reported SNPs are not in the HapMap. We found that SNP rs2478518 in the AGT gene was associated with both systolic and diastolic blood pressure [28]. The association of common variation at the NOS1AP locus with electrocardiographic QT interval duration was replicated with p-values ranging from 0.0001 to 0.0009 for 4 partially correlated SNPs [27]. In contrast, a number of results reported in the literature were not replicated in our results. For example, we did not find significant SNPs in ACE associated with blood pressure [28]. PPARG P12A (rs1801282) was not associated with diabetes or related traits including body mass index [38]. Some of these 'negative' results may be due to low power or low LD between a SNP reported in the literature and SNPs on the 100K chip.
Figure 2. Distribution of Affymetrix 100K GeneChip SNPs by distance from known genes. The X axis is the distance from known genes and the Y axis is the number of SNPs according to each distance. Blue represents monomorphic SNPs, maroon is for SNPs with MAF < 10% and green is for SNPs with MAF ≥ 10%.

Distribution of SNPs by MAF and Distance from a Gene
We have also identified a few results that are biologically compelling, although replication of the SNP association is warranted. For example, we found that a SNP (rs1158167) near the CST3 gene was highly associated with serum cystatin-C levels (GEE p-value = 8.5 × 10⁻⁹) [42]. This SNP explained 2.5% of the variation in serum cystatin-C levels in our data and has been previously reported to be associated with cystatin-C. These results are presented in more detail in the Renal Endocrine Working Group manuscript [42].

Replication of results from other genome-wide studies
While we have been preparing this series of manuscripts, several genome-wide studies have been published [80][81][82][83][84][85][86]. Some results in our analyses support results reported in these recent studies. For example, we find significant associations for coronary heart disease, cardiovascular disease and coronary artery calcium [29,30] in the same chromosomal region on 9p recently reported to be associated with myocardial infarction by Helgadottir et al. [82] and McPherson et al. [83]. While our results need to be compared more closely with results being reported by other genome-wide studies, this example provides evidence that our results replicate strong associations from other genome-wide studies.

Discussion
We have presented a brief description of the methods and a few selected results derived from analyses of the 100K Affymetrix GeneChip with a large number of FHS traits. These traits range from CVD events and subclinical measures to the traditional cardiovascular risk factors of diabetes, lipid levels and blood pressure, and also include more novel biomarker measures that reflect modern hypotheses, such as the role of inflammatory pathways in the development of CVD. We have also reported on a number of neurological, renal, cancer and aging traits, including longevity (age at death) and bone mass and structure. None of these manuscripts provides a comprehensive report. Rather, the purpose of this set of manuscripts is to provide a brief summary of the results and to introduce readers to the data posted on the dbGaP website http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?id=phs000007. We note that the genotypes in this sample have also been evaluated by Drs. Michael Christman and Alan Herbert. Some of their results are reported online as described by Herbert et al. [87].
Several aspects of our investigation merit comment. First, the present investigation represents a comprehensive GWAS analysis of numerous phenotypes in a large community-based cohort. To our knowledge, it is the largest GWAS performed in an observational cohort in terms of the number of phenotypes analyzed and posted on the web. Second, we exploited the phenotypic diversity and richness of the Framingham Offspring Study database to analyze a set of phenotypes that were for the most part collected by detailed, direct measurements of study participants. Further, many of the phenotypes are quantitative traits. Phenotypes have been broadly categorized into seventeen different domains for manuscripts in this supplement. It is noteworthy that key risk factor phenotypes, such as blood pressure and lipid levels, were collected at multiple examinations, and thus we were able to conduct analyses using time-averaged traits, maximizing the scientific yield from the longitudinal prospective design of our cohort study. Further, several recently collected phenotypes, in particular biomarkers and imaging measures, were collected using highly reproducible, state-of-the-art modalities. Correlated phenotypes facilitated the assessment of pleiotropy by seeking associations of SNPs with such phenotypes. These investigations occurred primarily among the variables in each individual manuscript. Finally, for most phenotypes, there was evidence for a significant heritable component from FHS or other studies. We acknowledge that some phenotypic domains may represent analytical constructs, rather than truly distinct groups from a biological standpoint. Sixth, we note that use of the 80% genotyping call rate is unusually liberal by today's standards in GWAS. We used this threshold in these manuscripts to be inclusive, rather than exclusive, in a first look such as this. We recognize that this threshold may permit consideration of some results that could be spurious due to problems with genotyping.
A further limitation of our genotypes is that the genotype calls were made with the DM algorithm, which is less precise than algorithms that have been introduced more recently. Seventh, in our analyses we found that the GEE results appear to have an excess of significant results. We suspect that one reason is low MAF. Given the small sample of at most 1345 subjects, we would expect only 13-14 individuals to be homozygous for the minor allele even at a MAF of 10%. Thus, we limited the results that we present in the manuscripts to those SNPs with MAF ≥ 10%. Further analyses have indicated that a linear mixed effects model, in which the SNP is incorporated as a covariate in a regression model whose error terms have a correlation structure that fully represents the familial correlations, remedies this problem and has a valid type I error rate in simulated data.
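The figure of 13-14 minor-allele homozygotes follows from Hardy-Weinberg proportions at a MAF of 10%. The short sketch below reproduces that arithmetic; it assumes Hardy-Weinberg equilibrium and treats the 1345 subjects as a simple sample, ignoring relatedness, and the function name is a hypothetical helper introduced here for illustration.

```python
def expected_genotype_counts(n, maf):
    """Expected genotype counts under Hardy-Weinberg equilibrium.

    n is the number of genotyped subjects; maf is the minor allele frequency.
    Relatedness among subjects is ignored (an assumption of this sketch).
    """
    q = maf
    p = 1.0 - q
    return {
        "major_homozygote": n * p * p,
        "heterozygote": n * 2 * p * q,
        "minor_homozygote": n * q * q,
    }


counts = expected_genotype_counts(n=1345, maf=0.10)
print(round(counts["minor_homozygote"], 2))  # 13.45
```

At lower MAF the expected count falls quadratically (for example, about 3.4 subjects at a MAF of 5%), which is consistent with the decision to restrict reported results to SNPs with MAF ≥ 10%.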
Eighth, coverage of LD is incomplete with the 100K scan. Nicolae et al. report that the Affymetrix 100K GeneChip includes fewer SNPs in coding regions and more SNPs in intergenic regions than are represented on the HapMap [90]. Further, our sample size is modest. These two facts combined likely limit the power for detection of associations with several traits in these data. For instance, while we noted modest to high heritability of numerous phenotypes, underscoring the contribution of additive genetic effects to interindividual variation in these traits, we did not find significantly low p-values for several heritable traits in relation to the SNPs evaluated. Factors contributing to this observation include the limited coverage of the Affymetrix 100K GeneChip and the possibility that some of the less significant p-values (for example, between 0.05 and 10⁻⁵) may represent true positive findings. The limited power to detect SNPs of small effect size offered by our relatively modest sample of approximately 1300 participants contributes to this phenomenon as well: we have high power only to detect a SNP explaining 4% or more of the phenotypic variance in the population-based GEE association test, and the power of FBAT and variance component linkage analysis is even lower.
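To give a rough sense of why a SNP explaining 4% of the phenotypic variance is detectable while smaller effects are not, the following back-of-the-envelope sketch approximates the power of a 1-degree-of-freedom association test. It is not the power calculation used in these manuscripts. It assumes unrelated subjects (so it overstates power for a family-based sample) and a Bonferroni threshold of 0.05/70,897, which is an assumption made here for illustration; the function name is likewise hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(n, var_explained, alpha):
    """Approximate power of a two-sided 1-df test for a SNP explaining a
    given fraction of phenotypic variance, treating n subjects as unrelated.
    """
    z = NormalDist()
    # Non-centrality parameter of the 1-df chi-square test statistic.
    ncp = n * var_explained / (1.0 - var_explained)
    z_crit = -z.inv_cdf(alpha / 2.0)  # two-sided critical value
    shift = sqrt(ncp)
    return (1.0 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)


alpha = 0.05 / 70897  # assumed Bonferroni threshold, not taken from the paper
power_4pct = approx_power(1345, 0.04, alpha)
power_1pct = approx_power(1345, 0.01, alpha)
```

Under these assumptions the approximate power is high for a SNP explaining 4% of the variance and low for one explaining 1%, in line with the qualitative statement above. Accounting for familial correlation would lower both values.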
Additionally, for several of the analyzed phenotypes we did not observe any overlap between the top SNP-phenotype associations noted in GEE and FBAT analyses. The inherent differences between the two analytical methods, especially in the context of the modest sample sizes and particularly for FBAT with small numbers of informative trios, may contribute to this phenomenon. FBAT is limited by the number of informative transmissions, whereas GEE is limited by potential bias due to stratification, although we suspect that there is little population stratification in our sample [76]. Furthermore, for several phenotypes the SNPs associated with the top LOD scores in linkage analyses were not among the top 50 SNPs in association analyses (GEE or FBAT). The following phenotype working groups did not have any traits achieving nominal genome-wide significance: echocardiography, flow-mediated dilation and exercise tolerance testing; blood pressure and tonometry; subclinical cardiovascular disease; cardiovascular outcomes; cancer; electrocardiography and heart rate variability; pulmonary function testing; aging; bone; lipids; obesity.
Ninth, we were limited in our ability to replicate genetic variants previously reported to be associated with phenotypes in our database because specific coverage of such genetic variation in these candidates was limited in the Affymetrix 100K GeneChip. We view such analyses as more illustrative of the potential utility of our GWAS, rather than as definitive evidence for or against an association described with a putative candidate gene in the published literature.
Our data do suggest several interesting biological candidates among the SNPs most strongly associated with different traits in the various analytical approaches. The strongest and most clear-cut of the associations were for those phenotypes that represent the direct protein product of a gene. Examples include the association of CRP levels with a SNP in the gene encoding C-reactive protein [33], of factor VII concentrations with a SNP near the factor VII gene [34], and of serum cystatin-C levels with a SNP near the CST3 gene [42].
Finally, the Framingham Study participants were white of European descent and predominantly middle-aged to elderly. Hence, the genetic associations may not be generalizable to other ethnicities/races or to younger individuals.

Conclusion
In summary, the results from the FHS 100K association and linkage studies described herein and posted on the NCBI website provide a GWAS resource for investigators. We have presented a description of the methods and general strategies used for analysis of the 100K Affymetrix GeneChip in relation to a broad range of traits measured in the FHS. Brief descriptions of results of these analyses are provided in a series of 17 manuscripts, with results for all successfully genotyped autosomal SNPs displayed at http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?id=phs000007. Interested investigators can also access the data through a standing protocol, described at http://www.nhlbi.nih.gov/about/framingham/policies/index.htm. Key to interpretation of these results is their replication and evaluation in other cohorts and, ultimately, functional studies. We encourage investigators to examine the results and to pursue the genetic signals therein in their own cohorts. In the near future we will provide results and data from approximately 550,000 SNPs on more than 9000 participants from three generations in the FHS SNP Health Association Resource (SHARe) project. Data will be available to qualified investigators through an application process to dbGaP. It is our hope that the results from these two genome-wide association studies will lead to a much deeper understanding of the role of common genetic variation in the development of cardiovascular disease and its risk factors.