Skip to main content
  • Research article
  • Open access
  • Published:

Adjusting heterogeneous ascertainment bias for genetic association analysis with extended families

Abstract

Background

In family-based association analysis, each family is typically ascertained from a single proband, which renders the effects of ascertainment bias heterogeneous among family members. This is contrary to case–control studies, and may introduce sample or ascertainment bias. Statistical efficiency is affected by ascertainment bias, and careful adjustment can lead to substantial improvements in statistical power. However, genetic association analysis has often been conducted using family-based designs, without addressing the fact that each proband in a family has had a great influence on the probability for each family member to be affected.

Method

We propose a powerful and efficient statistic for genetic association analysis that considered the heterogeneity of ascertainment bias among family members, under the assumption that both prevalence and heritability of disease are available. With extensive simulation studies, we showed that the proposed method performed better than the existing methods, particularly for diseases with large heritability.

Results

We applied the proposed method to the genome-wide association analysis of Alzheimer’s disease. Four significant associations with the proposed method were found.

Conclusion

Our significant findings illustrated the practical importance of this new analysis method.

Peer Review reports

Background

Genome-wide association studies (GWASs) have been used to identify many genes involved in human diseases, and during the last decade, many disease-susceptibility variants have been identified. However, despite these successes, we have found that variants discovered from GWASs often explain only a small proportion of the heritability of diseases [1, 2]. For example, SNPs significantly associated with human height explain only about 5 % of phenotypic variance, despite studies of tens of thousands of people [3]. Many reasons, such as rare causal variants and gene/gene interactions, have been attributed to this so-called “missing heritability”. However, the low power induced by the multiple-testing problem is still an intractable issue in GWASs, and further investigations of the most efficient strategies for genetic association analysis are necessary.

Careful selection of samples based on phenotypes can lead to improved power for the discovery of risk variants [411]. One such example is the extreme discordant sib-pair design in linkage analysis, which may result in a substantial increase in statistical power when compared to other sib-pair designs [11, 12]. Similarly, ascertaining the extremes of quantitative phenotypes from large population cohorts has also been shown to increase the power to identify associated variants [1315]. In such a design, the effect of ascertainment conditions are homogeneous between individuals, and existing methods, such as the Cochran-Armitage(CA) trend test [16], can be an efficient choice. However, in association analysis using extended families, the effects of ascertainment bias are often heterogeneous among family members, and depending on their relationships with probands, different magnitudes of ascertainment bias may be generated. In particular, the probability of each individual being affected when his or her relatives are affected is similar to the prevalence, if the heritability is small, which indicates that the heterogeneous effect of the ascertainment bias depends on the magnitude of heritability. However, the heterogeneous effects of ascertainment conditions and the influence of heritability on it have not yet been investigated, and should therefore be taken into account for association analysis.

Recently, the CA trend test was extended for association analysis of dichotomous phenotypes with family-based samples [17, 18]. These statistics compares the genotype frequencies between affected and unaffected individuals, and the genetic association with family-based samples is tested by building a genotype correlation matrix with either kinship coefficients or an empirical correlation matrix estimated from large-scale genetic data. This approach has been extended to include family members with known phenotypes and missing genotypes or vice versa. By the nature of these statistics, it performs well for ascertained family-based samples and it can be an efficient choice, even for a case–control design, if the relatives’ phenotype information is available. However, their statistical efficiency is affected by the heterogeneous effect of the ascertainment bias on family members, and for extended families, its effects on statistical efficiency can be substantial.

In this report, we consider the heterogeneous effects of the ascertainment bias on family members for dichotomous phenotypes. By the nature of the proposed methods, individuals with missing genotypes and non-missing phenotypes can be utilized, and incorporation of the estimated kinship matrix to the proposed statistic provided robustness against the population substructure. The proposed method consists of two steps; the probability for each family member to be affected was calculated using a latent continuous liability [19], and then this probability is incorporated into a quasi-likelihood score test. With an extensive simulation, we showed that the proposed method performed better than the existing methods, particularly for a disease with large heritability. Application of our method to Alzheimer’s disease (AD) demonstrated its practical use in the detection of genetic associations in ascertained family-based samples.

Methods

Notations and statistic

We assumed that there were n families and n i family members in each family. We considered the situation where the family of size n i was ascertained because it contained a particular set of p i members, and we let q i  = n i p i . We called the members of the set of p i family members “probands”, and the remaining q i individuals “non-probands”. To provide a clearer motivation on this concept, we randomly selected two families, family 1 and 2, from our AD data (see Fig. 1). In family 1 (Fig. 1-(a)), individual 9 was diagnosed as AD and individuals 3–8 were selected as her relatives for genetic analysis. In family 2 (Fig. 1-(b)), individual 3 was diagnosed as AD, and individuals 4–6 were selected. Therefore p1 = p2 = 1, q1 = 6 and q2 = 3 in this example. In real data analysis, p i is often 1 and q i  = n i – 1. We assumed that N individuals were available and thus N = ∑ i n i . The genotypes were coded as 0, 1, or 2, according to the number of disease alleles. x P ij and x Ni ' j ' were defined as the genotypes of proband j and non-proband j' in family i and family i', respectively. Phenotypes were coded as 0 for an unaffected individual and 1 for an affected individual. If we let the prevalence of the disease be p, a missing phenotype was coded as p. We denoted the phenotypes of a proband and non-proband by y P ij and y Ni ' j ' , respectively, and the vectors for genotypes and phenotypes in family i were defined by

Fig. 1
figure 1

Two sample pedigree structures from AD data; (a) individual 9 of family 1 was selected as a "proband" and individual 3–8 of family 1 were selcted as "non-probands" (b) individual 3 of family 2 was selected as a "proband" and individual 4–6 of family 2 were selcted as "non-probands"

$$ {\mathbf{X}}_i^P=\left(\kern1em \begin{array}{c}{x}_{i1}^P\kern1em \\ {}\kern1em {x}_{i2}^P\kern1em \\ {}\kern1em \vdots \kern1em \\ {}\kern1em {x}_{i{p}_i}^P\end{array}\kern1em \right),\kern0.5em {\mathbf{X}}_i^N=\left(\kern1em \begin{array}{c}{x}_{i1}^N\kern1em \\ {}\kern1em {x}_{i2}^N\kern1em \\ {}\kern1em \vdots \kern1em \\ {}\kern1em {x}_{i{q}_i}^N\end{array}\kern1em \right)\kern0.5em ,\kern0.5em {\mathbf{X}}_i=\left(\kern1em \begin{array}{c}{\mathbf{X}}_i^P\kern1em \\ {}\kern1em {\mathbf{X}}_i^N\end{array}\kern1em \right),\kern0.5em {\mathbf{Y}}_i^P=\left(\kern1em \begin{array}{c}{y}_{i1}^P\kern1em \\ {}\kern1em {y}_{i2}^P\kern1em \\ {}\kern1em \vdots \kern1em \\ {}\kern1em {y}_{i{p}_i}^P\end{array}\kern1em \right),\kern0.5em {\mathbf{Y}}_i^N=\left(\kern1em \begin{array}{c}{y}_{i1}^N\kern1em \\ {}\kern1em {y}_{i2}^N\kern1em \\ {}\kern1em \vdots \kern1em \\ {}\kern1em {y}_{i{q}_i}^N\end{array}\kern1em \right),\kern0.5em \mathrm{and}\kern0.5em {\mathbf{Y}}_i=\left(\kern1em \begin{array}{c}{\mathbf{Y}}_i^P\kern1em \\ {}\kern1em {\mathbf{Y}}_i^N\end{array}\kern1em \right) $$

We also denoted the w × w identity matrix by I w , and the w × 1 column vector 1 w indicated a vector in which all elements were 1. Let π Pijj ' and π Nijj ' be the kinship coefficient between probands j and j' in family i, and non-proband j and j' in family i, respectively. In addition, we let π PNijj ' be the kinship coefficient between proband j and non-proband j' in family i, and let d ij P and d ij' N be the inbreeding coefficient for proband j and non-proband j' in family i, respectively. The inbreeding coefficient is the parameter that quantifies the departure from Hardy-Weinberg equilibrium (HWE) and ranges from 0 to 1. Several approaches [20, 21] that can estimate d ij have been proposed. We let

$$ {\mathbf{R}}_i^P=\left(\kern1em \begin{array}{ccc}1+{d}_{i1}^P\kern1em & \kern1em \cdots \kern1em & \kern1em 2{\pi}_{i1{p}_i}^P\kern1em \\ {}\kern1em \vdots \kern1em & \kern1em \ddots \kern1em & \kern1em \vdots \kern1em \\ {}\kern1em 2{\pi}_{i{p}_i1}^P\kern1em & \kern1em \dots \kern1em & \kern1em 1+{d}_{i{p}_i}^P\end{array}\kern1em \right),\kern0.5em {\mathbf{R}}_i^N=\left(\kern1em \begin{array}{ccc}1+{d}_{i1}^N\kern1em & \kern1em \cdots \kern1em & \kern1em 2{\pi}_{i1{q}_i}^N\kern1em \\ {}\kern1em \vdots \kern1em & \kern1em \ddots \kern1em & \kern1em \vdots \kern1em \\ {}\kern1em 2{\pi}_{i{q}_i1}^N\kern1em & \kern1em \dots \kern1em & \kern1em 1+{d}_{i{q}_i}^N\end{array}\kern1em \right),\kern0.5em {\mathbf{R}}_i^{PN}=\left(\kern1em \begin{array}{ccc}2{\pi}_{i11}^{PN}\kern1em & \kern1em \cdots \kern1em & \kern1em 2{\pi}_{i1{q}_i}^{PN}\kern1em \\ {}\kern1em \vdots \kern1em & \kern1em \ddots \kern1em & \kern1em \vdots \kern1em \\ {}\kern1em 2{\pi}_{i{p}_i1}^{PN}\kern1em & \kern1em \dots \kern1em & \kern1em 2{\pi}_{i{p}_i{q}_i}^{PN}\end{array}\kern1em \right), $$

and R i is defined by

$$ {\mathbf{R}}_i=\left(\kern1em \begin{array}{c}{\mathbf{R}}_i^P\kern1em \\ {}\kern1em {\left({\mathbf{R}}_i^{PN}\right)}^t\kern1em \end{array}\kern0.5em \begin{array}{c}\kern1em {\mathbf{R}}_i^{PN}\kern1em \\ {}\kern1em {\mathbf{R}}_i^N\end{array}\kern1em \right) $$

If we let q A be the disease allele frequency, E(X i ) was \( 2{q}_A{1}_{n_i} \), and q A is estimated with the best linear unbiased estimator (BLUE). var(X i ) is expressed by σ2R i , and σ2 is equal to 2q A (1 –q A ) under HWE.

When we analyzes the distribution of genotypes as in the FBAT approach, the statistical efficiency of the test statistic could be improved by adjustments of the phenotype with the so-called offset [22]. If we let μ ij P and μ i'j' N be offsets for proband j and non-proband j' in family i and family i', respectively, the offset vector for family i is defined as

$$ {\upmu}_i^P=\left(\kern1em \begin{array}{c}{\mu}_{i1}^P\kern1em \\ {}\kern1em {\mu}_{i2}^P\kern1em \\ {}\kern1em \vdots \kern1em \\ {}\kern1em {\mu}_{i{p}_i}^P\end{array}\kern1em \right),\kern0.5em {\upmu}_i^N=\left(\kern1em \begin{array}{c}{\mu}_{i1}^N\kern1em \\ {}\kern1em {\mu}_{i2}^N\kern1em \\ {}\kern1em \vdots \kern1em \\ {}\kern1em {\mu}_{i{q}_i}^N\end{array}\kern1em \right),\kern0.5em \mathrm{and}\kern0.5em {\upmu}_i=\left(\kern1em \begin{array}{c}{\upmu}_i^P\kern1em \\ {}\kern1em {\upmu}_i^N\end{array}\kern1em \right) $$

Setting T i  = Y i μ i , we can define

$$ \mathbf{X}=\left(\kern1em \begin{array}{c}{\mathbf{X}}_1\kern1em \\ {}\kern1em {\mathbf{X}}_2\kern1em \\ {}\kern1em \vdots \end{array}\kern1em \right),\kern0.5em \mathbf{Y}=\left(\kern1em \begin{array}{c}{\mathbf{Y}}_1\kern1em \\ {}\kern1em {\mathbf{Y}}_2\kern1em \\ {}\kern1em \vdots \end{array}\kern1em \right),\kern0.5em \mathbf{T}=\left(\kern1em \begin{array}{c}{\mathbf{T}}_1\kern1em \\ {}\kern1em {\mathbf{T}}_2\kern1em \\ {}\kern1em \vdots \end{array}\kern1em \right),\kern0.5em \mathrm{and}\kern0.5em \mathbf{R}=\left(\kern1em \begin{array}{c}{\mathbf{R}}_1\kern1em \\ {}\kern1em 0\kern1em \\ {}\kern1em \vdots \kern1em \end{array}\begin{array}{c}\kern1em 0\kern1em \\ {}\kern1em {\mathbf{R}}_2\kern1em \\ {}\kern1em \vdots \kern1em \end{array}\begin{array}{c}\kern1em \cdots \kern1em \\ {}\kern1em \cdots \kern1em \\ {}\kern1em \ddots \end{array}\kern1em \right). $$

We denoted a minor allele frequency (MAF) of a variant in unaffected individuals by q. We assumed [18] that for a constant γ,

$$ E\left(\mathbf{X}\Big|\mathbf{T}\right)=2p{\mathbf{1}}_N+\gamma \mathbf{T}, $$

where 0 < 2p + γ < 1. Then, the score for a variant [18, 23] can be defined by

$$ S={\mathbf{T}}^t\left(\mathbf{X}-\widehat{E}\left(\mathbf{X}\right)\right)\kern0.5em \mathrm{and}\kern0.5em \widehat{E}\left(\mathbf{X}\right)={\mathbf{1}}_N{\left({\mathbf{1}}_N^t{\mathbf{R}}^{-1}{\mathbf{1}}_N\right)}^{-1}{\mathbf{1}}_N^t{\mathbf{R}}^{-1}\mathbf{X}. $$

The variance of S is

$$ var(S)={\sigma}^2{\mathbf{T}}^t{\mathbf{V}}^{-1}\left(\mathbf{R}-{\mathbf{1}}_N{\left({\mathbf{1}}_N^t{\mathbf{R}}^{-1}{\mathbf{1}}_N\right)}^{-1}{\mathbf{1}}_N^t\right){\mathbf{V}}^{-1}\mathbf{T}, $$

and we considered the following statistic [17, 18]:

$$ \frac{{\mathbf{T}}^t\left({\mathbf{I}}_N-{\left({\mathbf{1}}_N^t{\mathbf{R}}^{-1}{\mathbf{1}}_N\right)}^{-1}{\mathbf{1}}_N^t{\mathbf{R}}^{-1}\right)\mathbf{X}}{\sqrt{\sigma^2{\mathbf{T}}^t\left(\mathbf{R}-{\mathbf{1}}_N{\left({\mathbf{1}}_N^t{\mathbf{R}}^{-1}{\mathbf{1}}_N\right)}^{-1}{\mathbf{1}}_N^t\right)\mathbf{T}}}\sim N\left(0,1\right)\mathrm{if}\gamma =0. $$

This statistic will be denoted by WL in the remainder of this report.

Adjusting the heterogeneous ascertainment bias

Families are often selected based on some probands, and the probability for family members to be affected depends on their relationship with the probands. Additional file 1 shows that the incorporation of conditional probability of each individual being affected to WL as offset lead to asymptotically smaller variance and therefore the adjustment of heterogeneous ascertainment bias is required to improve the statistical power of WL. This probability could be estimated with the liability model if the heritabilities, h2, and prevalence, p, were available. We let l P ij and l Ni ' j ' be the liability of proband j and non-proband j' in family i and family i', respectively, and let \( {\mathbf{L}}_i^P=\left({l}_{i1}^P\kern0.5em ,\dots, \kern0.5em {l}_{i{p}_i}^P\right) \) and \( {\mathbf{L}}_i^N=\left({l}_{i1}^N\kern0.5em ,\dots, \kern0.5em {l}_{i{q}_i}^N\right) \). We assumed that each liability followed the standard normal distribution, and their joint distributions were

$$ \left(\kern0.75em \begin{array}{c}{L}_i^P\kern0.75em \\ {}\kern0.5em {L}_i^N\end{array}\kern0.75em \right)\sim MVN\left(0,{h}^2\left(\kern0.75em \begin{array}{c}{\mathbf{R}}_i^P\kern1em \\ {}\kern0.75em {\left({\mathbf{R}}_i^{PN}\right)}^t\kern0.62em \end{array}\begin{array}{c}\kern1em {\mathbf{R}}_i^{PN}\kern1em \\ {}\kern1em {\mathbf{R}}_i^N\end{array}\right)\kern1em +\left(1-{h}^2\right){\mathbf{I}}_{n_i}\right). $$

Benchek and Morris [24] reported that significant asymptotic biases are likely to arise when the multivariate normal (MVN) liability assumption is not met and in such a case, different assumptions should be considered. We assume that M P * i and V P * i are the expectation and variances of L i P when their disease statuses are conditioned. If all probands are affected, they becomes

$$ {{\mathbf{M}}_i}^{P^{\ast }}\equiv E\left({L_i}^P\Big|{l_{i1}}^P>c,\dots, {l_{i{p}_i}}^P>c\right) $$

and

$$ {{\mathbf{V}}_i}^{P^{\ast }}\equiv var\left({L_i}^P\Big|{l_{i1}}^P>c,\dots, {l_{i{p}_i}}^P>c\right). $$

They can be calculated with the numerical algorithms [25]. If p i is 1, both can be simply calculated. We denote the cumulative and probability density function of standard normal distribution by Ф(·) and ϕ(·). If we let c be the (1–p)th quantile of the standard normal distribution, M P * i and V P * i becomes

$$ {\mathbf{M}}_i^{P\ast}\Big\{\kern1em \begin{array}{c}\phi (c)/\left[1-\varPhi (x)\right]\kern1em \\ {}\kern1em -\phi (c)/\varPhi (x)\kern1em \end{array}\kern2em \begin{array}{c}\kern1em \mathrm{if}\kern0.5em {y}_{i1}^P=1\kern1em \\ {}\kern1em \mathrm{if}\kern0.5em {y}_{i1}^P=0\kern1em \end{array},\kern1em \mathrm{and}\kern0.5em {\mathbf{V}}_i^{P\ast}\kern0.5em =1-{\left({\mathbf{M}}_i^{P\ast}\right)}^2+{\mathbf{M}}_i^{P\ast }c. $$

With Pearson-Aitken formula [26, 27], we could obtain the conditional mean and variance-covariance matrix of L i N given \( {\mathbf{L}}_i^P>{1}_{p_i}\kern0.5em \cdot c \) as follows:

$$ {\mathbf{M}}_i^{N\ast }=0+{\left({\mathbf{R}}_i^{PN}\right)}^t{\left({\mathbf{V}}_i^P\right)}^{-1}\left({\mathbf{M}}_i^{P\ast }-0\right) $$

and

$$ {\mathbf{V}}_i^{N\ast }={\mathbf{V}}_i^N+{\left({\mathbf{R}}_i^{PN}\right)}^t\left({\left({\mathbf{V}}_i^P\right)}^{-1}-{\left({\mathbf{V}}_i^P\right)}^{-1}{\mathbf{V}}_i^{P\ast }{\left({\mathbf{V}}_i^P\right)}^{-1}\right){\mathbf{R}}_i^{PN}. $$

We denoted the jth element in M N * i by m N * j and the jth diagonal element in V N * i by v N * j . Then the probability of being affected for a non-proband under multivariate normality of the liabilities could be calculated as

$$ \varPhi \left(\frac{c-{m}_j^{N\ast }}{v_j^{N\ast }}\right), $$

and this will be incorporated into the proposed statistic as offset. Thus far, we have assumed that there was a well-designed set of p i individuals who were “probands”, and for this situation, we calculated the statistic as indicated and denoted FQLS1. However in practice, different ascertainment condition such as sequential sampling frame [28] are often utilized, and the set of p i individuals will not be well defined. For this situation, we calculated the probability for each individual to be affected under the assumption that all the other family members were “probands”, and thus p i  = n i – 1 and q i  = 1. The statistic calculated this way was denoted by FQLS2.

Results

The simulation model

In our simulation studies, we considered two types of family structures; nuclear families with five offspring and the extended families that consist of 13 individuals along 3 generations (see Fig. 2). The latter will be called extended families in the remainder of this report. The disease allele frequency, p, was assumed to be 0.2. If we denoted the disease allele frequency by q A , the genotype frequencies for AA, Aa, and aa became q A 2, 2q A (1 – q A ), and (1 – q A )2 under HWE, respectively, and founders’ genotypes were generated under the corresponding multinomial distribution. The genotypes for non-founders were generated with randomly generated Mendelian transmission. The disease status was generated with the liability threshold model. Once continuous liabilities that consisted of polygenic effects and random errors were generated, they were transformed to being affected if they were larger than the threshold; and otherwise, they were considered to be unaffected. The threshold was chosen to preserve the prevalence, and prevalence was assumed to be 0.2. Continuous liability was determined by combining the phenotypic mean, polygenic effect, main genetic effect, and random error. The main genetic effect for each individual was the product of β and the number of disease alleles. If we denoted the relative proportion of the phenotypic variance attributable to the main disease gene by h a 2, and h2 was a heritability for continuous liability, β was calculated by

Fig. 2
figure 2

Family tree. There are two different types of family structures, including (a) nuclear family and (b) extended family, which were considered in our simulation study

$$ \beta =\sqrt{\frac{h_a^2}{2{q}_A\left(1-{q}_A\right)\left(1-{h}^2\right)}}. $$

For the evaluation of type-1 errors and power, h a 2 was assumed to be 0 and 0.005, respectively. Phenotypic correlations between family-members were explained by the polygenic effects. Parental polygenic effects were generated from N(0, h2), and h2 was assumed to be 0.2, 0.5, or 0.8. For non-founders, the average of maternal and paternal polygenic effects was combined with the values independently sampled from N(0, 0.5 h2) for the polygenic effects of offspring. Random errors were generated from N(0, σ e 2 = 1–h2). For each replicate, sampling was repeated until a given number of ascertained families was generated. Type-1 error estimates were calculated with 5000 replicates, and empirical power estimates were calculated with 1000 replicates.

Evaluation of the proposed methods with simulated data

The empirical type-1 errors for FQLS1 and FQLS2 were evaluated from 5000 replicates under the situation of no association (h a 2 = 0), and 900 nuclear families with five offspring in Fig. 2 were generated for each replicate. Fig. 3 shows the quantile quantile (QQ) plots from 5000 replicates, and the nominal significance levels for both methods were preserved for various significance levels. We also estimated the empirical type-1 error rates at the 0.01 and 0.05 significance levels; the empirical type-1 error estimates of FQLS1 and FQLS2 preserved these nominal significance levels (Table 1). These results verified that the use of the approximation to the standard normal distribution resulted in an accurate assessment of significance for the proposed methods.

Fig. 3
figure 3

QQ plots for FQLS1 and FQLS2 under the null hypothesis. QQplots for FQLS1 and FQLS2 are obtaind when h2 is 0.2((a), (b)), 0.5((c), (d)), or 0.8((e), (f)). P-values were calculated based on 5000 replicates when the number of families was 900. The genetic effect β was assumed to be 0, and the minor allele frequency was 0.2

Table 1 Empirical type-1 error estimates. The empirical type-1 error rates and their 95 % confidence intervals were estimated with 5000 replicates at the 0.01 and 0.05 significance level for h2 = 0.2, 0.5, and 0.8. The number of families was assumed to be 900, and the disease allele frequency was 0.2

The empirical powers at the various significance levels were measured based on 1000 replicates at the 0.01 and 0.001 significance levels. The relative proportion, h a 2, of phenotypic variance attributable to the main disease gene, 2p A (1 – p A )β2, was assumed to be 0.005, and nuclear and extended families in Fig. 2 were considered for the power comparison. In the first simulation setting, the numbers of nuclear families were assumed to be 100, 300, 600, 900, 1200, and 1400, and half of the families were ascertained if the number of affected family members was larger than or equal to n proband , and the other half of the families were ascertained if the number of unaffected family members was larger than or equal to n proband . Therefore, if 100 nuclear families were generated, half of nuclear families should have more than or equal to n proband affected family members, and the other half should have at least n proband unaffected family members. We assumed that the heritabilities were 0.2, 0.5, and 0.8, and results are shown in Tables 2, 3, 4, respectively. In the second simulation setting, the numbers of extended families were assumed to be 100, 300, 600, and 900, and all families were ascertained if the number of affected amily members was larger than or equal to n proband . Empirical power estimates for scenario 2 were calculated when h2 = 0.2, 0.5, and 0.8, and the data are shown in Tables 5, 6, 7, respectively. Our results showed that either FQLS1 or FQLS2 was usually the most efficient statistic, and the least efficiency was provided from WL. In particular, the power gap between the proposed methods and WL was largest if h2 was 0.8, which indicates that power improvement may be proportional to the heritability. If h2 was 0.2, the proposed methods were only slightly better than WL. While all methods in our power comparison focused on the distribution of genotypes to calculate statistics, the proposed methods uniquely considered the heterogeneous effects of ascertainment bias among family members which were proportional to the magnitude of heritability; this explained the power improvement of the proposed methods. Furthermore the differences of empirical power estimates from WL and the proposed methods are larger for Tables 5, 6, 7 than Tables 2, 3, 4, which indicates that the heterogeneity of ascertainment condition may be positively related with family size and the proposed methods become more efficient for large families. Last our simulation results show that FQLS2 was slightly better than FQLS1, and this may be induced by the uncertainty of probands in our simulation studies. Therefore, we concluded that the incorporation of a sampling scheme to the offset could make a substantial difference, and test statistic should be carefully selected depending on type of sampling scheme.

Table 2 Empirical power estimates for scenario 1 when h2 is 0.2. The empirical power estimates for scenario 1 were calculated with 1000 replicates at the both 0.01 and 0.001 significance levels. The disease allele frequency was assumed to be 0.2, and the prevalence was assumed to 0.2. The relative phenotypic variance attributable to the main disease gene was assumed to be 0.005

The bold text indicates the highest empirical estimate of the power for each situation

Table 3 Empirical power estimates for scenario 1 when h2 is 0.5. The empirical power estimates for scenario 1 were calculated with 1000 replicates at the both 0.1 and 0.001 significance levels. The disease allele frequency was assumed to be 0.2, and the prevalence was assumed to 0.2. The relative phenotypic variance attributable to the main disease gene was assumed to be 0.005

The bold text indicates the highest empirical estimate of the power for each situation

Table 4 Empirical power estimates for scenario 1 when h2 is 0.8. The empirical power estimates for scenario 1 were calculated with 1000 replicates at the both 0.01 and 0.001 significance levels. The disease allele frequency was assumed to be 0.2, and the prevalence was assumed to 0.2. The relative phenotypic variance attributable to the main disease gene was assumed to be 0.005

The bold text indicates the highest empirical estimate of the power for each situation

Table 5 Empirical power estimates for scenario 2 when h2 is 0.2. The empirical power estimates for scenario 2 were calculated with 1000 replicates at the both 0.01, and 0.001 significance levels. The disease allele frequency was assumed to be 0.2, and the prevalence was assumed to 0.2. The relative phenotypic variance attributable to the main disease gene was assumed to be 0.005

The bold text indicates the highest empirical estimate of the power for each situation

Table 6 Empirical power estimates for scenario 2 when h2 is 0.5. The empirical power estimates for scenario 2 were calculated with 1000 replicates at the both 0.1 and 0.001 significance levels. The disease allele frequency was assumed to be 0.2, and the prevalence was assumed to 0.2. The relative phenotypic variance attributable to the main disease gene was assumed to be 0.005

The bold text indicates the highest empirical estimate of the power for each situation

Table 7 Empirical power estimates for scenario 2 when h2 is 0.8. The empirical power estimates for scenario 2 were calculated with 1000 replicates at the both 0.01 and 0.001 significance levels. The disease allele frequency was assumed to be 0.2, and the prevalence was assumed to 0.2. The relative phenotypic variance attributable to the main disease gene was assumed to be 0.005

The bold text indicates the highest empirical estimate of the power for each situation

Robustness of the proposed methods against the misspecification of prevalence

The statistical powers of the proposed methods may depend on the accuracy of the prevalence and we evaluated the sensitivity of the proposed method to the misspecified prevalence with simulated data. h a 2 and h2 were assumed to be 0.05 and 0.8, and nuclear families (Fig. 2-(a)) were considered in this simulation. The number of nuclear families was assumed to be 900 and n proband was assumed to be 1, 2, or 3. Prevalence was assumed to be 0.2 for phenotype generation, and the offset for which we recommended prevalence was set to be 0.1, 0.2 or 0.3 for calculation of FQLS1 and FQLS2. In particular, y ij for individuals with missing phenotypes are coded by the assumed prevalence, and sensitivity of the proposed methods can be substantial when there are individuals with missing phenotypes. Therefore, individuals were randomly selected from non-probands, and their phenotypes were assumed to be unknown for calculation of the proposed statistics. The number of family members with missing phenotypes in each family was denoted by n missing . The empirical powers were calculated at the 0.01 significance level with 1000 replicates. Table 8 shows that the results obtained by setting prevalence to be 0.1 and 0.3 are similar to the results when the prevalence was set to be 0.2, which indicates that the power loss attributable to the misspecified prevalence is not substantial. Furthermore the empirical power estimates are positively related with n proband and inversely related with n missing . If n missing is larger than 3, the power loss may be more substantial.

Table 8 Empirical power estimates for three situations when h2 is 0.8. The empirical power estimates for three situations were calculated with 1000 replicates at the 0.01 significance levels. Phenotypes were generated under the assumption that the prevalence was assumed to 0.2. Prevalence was set to be 0.1, 0.2 or 0.3 to calculate the proposed statistics. The relative phenotypic variance attributable to the main disease gene was assumed to be 0.005

The bold text indicates the reference for the proposed statistics to compare with the misspecified prevalence.

Application of the proposed method to AD

AD is an irreversible, progressive brain disorder characterized by genetic heterogeneity. However, the genetic variations that contribute to AD still remain elusive. Thus, we applied the proposed method for identification of the disease susceptibility loci for AD. The heritability and prevalence of AD are approximately 0.8 [29, 30] and 0.1, respectively; therefore, we chose heritabilities of 0.8, and a prevalence of 0.1 for the calculation of proposed methods. Samples were collected as part of the National Institute of Mental Health Genetics Initiative (NIMH). The NIMH Alzheimer’s Disease Genetics Family Sample was used along with the information about the genotype platform (Affy 6.0) [20, 31], and ethical approach and participant approval were obtained through the NIMH IRB panel. Families were selected based on the disease status of a certain family member. However the proband for each family was not clear, and FQLS2 was uniquely applied for the proposed method. 1376 individuals from 410 families were available, and all families were nuclear. All individuals were of self-reported European ancestry. HWE for each single nucleotide polymorphism (SNP) was tested, and MAFs were estimated. SNPs for which p-values for HWE were less than 10−6 or MAFs less than 0.05 were excluded, and therefore, 417,680 SNPs were analyzed for genetic association analysis.

The choice of kinship coefficient matrix for the proposed method depends on the presence of population substructure. To confirm the presence of population substructure, multidimensional-scaling [32] analysis was performed with PLINK1.07 [20], and we constructed a multidimensional-scaling plot (Fig. 4) to provide evidence concerning the presence of population substructure. Therefore, the genomic control method [31, 33, 34] was used for explicit detection and correction of population stratification with common SNPs, and they were incorporated into both FQLS2 and WL. Fig. 5 shows QQ plots for FQLS2 and WL; these plots revealed that the presence of population substructure was appropriately adjusted. Results showed four genome-wide significant results for FQLS2 and one significant result for WL.

Fig. 4
figure 4

Multidimensional scaling plots from samples for the GWAS for AD. Founders were selectively used, and multidimensional scaling plots were obtained with the first and second PC scores

Fig. 5
figure 5

QQ plots of results from GWAS for AD. QQ plots are provided with the results from (a) WL and (b) FQLS2

Detailed information for these significant results is provided in Table 9. We also considered FBAT statistics [35]. FQLS2 identified four genome-wide significant SNPs, while WL and FBAT identified one genome-wide significant SNP. In addition, one SNP that was significant according to WL was more significant according to FQLS2. The most significant result acquired using FQLS2 was SNP4 (p = 2.10× 10−10). The other three SNPs, i.e., SNP1, SNP2 and SNP3, reached the genome-wide significance level.

Table 9 Top significant results of GWAS for AD. For the genome-wide significant SNPs from FQLS2 and WL, their p-values are given

Discussion

Although major advances in high-density genome scans have enabled the genetic association analysis of more than 10,000 individuals, disappointing results in the mapping of many common diseases have illustrated the need for more powerful methods for detecting disease susceptibility loci. Statistical efficiency is known to being affected by the ascertainment bias, and its careful adjustment can lead to substantial improvement of statistical power [3638]. In particular, genetic association analysis has often been conducted using family-based designs, but without addressing the fact that the probability for each family member to be affected is inversely related with the familial relationship with affected probands. In this report, we proposed new methods to adjust this heterogeneity with known prevalence and heritability. Our simulation studies showed that the proposed methods provided substantial power improvement. In particular, the mis-specified heritability and prevalence can lead to the statistical power loss for the proposed methods, but it was found to be not substantial at least in our simulation studies. FQLS1 and FQLS2 were suggested, and FQLS1 is an efficient choice if probands for each family are clearly defined and all remaining family members are incorporated to the genetic analysis. However, these conditions are often not satisfied, and different methods such as sequential sampling frame [28] are usually utilized. Simulation studies showed that FQLS2 is usually better than FQLS1 if the ascertaining condition is not clearly defined and thus we recommend FQLS2 unless probands are clearly defined. However we considered the limited ascertainment conditions and comprehensive simulation studies are still necessary.

Furthermore, the proposed method was conceptually simple and can be applied to the large families. Our methods require only a single calculation of offset for all markers, and the real data analysis could be completed with a single CPU in a few hours. For M markers and N individuals, the time complexity is O(N3 + MN2) for the proposed method. The proposed method was implemented with C++, and can be downloaded from http://healthstat.snu.ac.kr/mfqls/.

Heterogeneity between samples is an important issue in large-scale genetic analysis, and the proposed method can likely be applied to various additional scenarios with some modifications. For instance, the disease status of relatives reveals the importance of genetic components for each individual, and for this reason, such information has been used, albeit only on occasion, in genetic association analysis. The effect of relatives’ disease statuses is dependent on prevalence and heritability, and the probability for each individual to be affected could be calculated with the proposed method. This probability can be used to improve the statistical efficiency of genetic association analysis. In addition, the heterogeneity of the ascertainment bias is often an important issue for genome-wide meta-analysis because samples are collected from multiple medical centers [39, 40], and different sampling schemes among studies need to be adjusted to improve statistical efficiency. Therefore, we believe that proposed method can be extended to provide a statistical framework that adjusts the heterogeneity between samples.

Conclusions

We proposed FQLS method to adjust this heterogeneity with known prevalence and heritability and the software was implemented with C++. We identified several significant associations between AD and SNPs, and their potential functional information will provide the better understanding of the pathogenesis of AD. Although this study has some limitations, our proposed methods illustrated important features required for genetic analysis with family-based samples, and an extension of the proposed method to rare variant association analysis such as FARVAT [41] will be investigated in future studies.

Abbreviations

GWAS:

Genome-wide association studies

CA:

Cochran-Armitage

AD:

Alzheimer’s diseas

PD:

Parkinson’s disease

HWE:

Hardy-Weinberg equilibrium

BLUE:

Best linear unbiased estimator

QQ:

Quantile quantile

MVN:

Multivariate normal

MAF:

Minor allele frequency

SNP:

Single nucleotide polymorphism

FBAT:

Family Based Association Testing software

FQLS:

Family based quasi-likelihood score test

References

  1. Maher B. Personal genomes: the case of the missing heritability. Nature. 2008;456(7218):18–21.

    Article  CAS  PubMed  Google Scholar 

  2. Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, et al. Finding the missing heritability of complex diseases. Nature. 2009;461(7265):747–53.

  3. Visscher PM. Sizing up human height variation. Nat Genet. 2008;40(5):489–90.

    Article  CAS  PubMed  Google Scholar 

  4. Clement K, Vaisse C, Manning BS, Basdevant A, Guygrand B, Ruiz J, et al. Genetic-variation in the beta(3)-adrenergic receptor and an increased capacity to gain weight in patients with morbid-obesity. New Engl J Med. 1995;333(6):352–4.

  5. Gu C, Todorov AA, Rao DC. Genome screening using extremely discordant and extremely concordant pairs. Genet Epidemiol. 1997;14(6):791–6.

    Article  CAS  PubMed  Google Scholar 

  6. Hu SW, Zhong YF, Hao YT, Luo MQ, Zhou Y, Guo H, et al. Novel rare alleles of ABCA1 are exclusively associated with extreme high-density lipoprotein-cholesterol levels among the Han Chinese. Clin Chem Lab Med. 2009;47(10):1239–45.

  7. Khor CC, Goh DLM. Strategies for identifying the genetic basis of dyslipidemia: genome-wide association studies vs. the resequencing of extremes. Curr Opin Lipidol. 2010;21(2):123–7.

    Article  CAS  PubMed  Google Scholar 

  8. Li BS, Leal SM. Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am J Hum Genet. 2008;83(3):311–21.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  9. Liang KY, Huang CY, Beaty TH. A unified sampling approach for multipoint analysis of qualitative and quantitative traits in sib pairs. Am J Hum Genet. 2000;66(5):1631–41.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  10. Price RA, Li WD, Zhao H. FTO gene SNPs associated with extreme obesity in cases, controls and extremely discordant sister pairs. BMC Med Genet. 2008;9:4.

    Article  PubMed  PubMed Central  Google Scholar 

  11. Risch N, Zhang H. Extreme discordant sib pairs - the design of choice for mapping quantitative trait loci in humans. Am J Hum Genet. 1995;57(4):1159–9.

    Google Scholar 

  12. Risch NJ, Zhang HP. Mapping quantitative trait loci with extreme discordant sib pairs: Sampling considerations. Am J Hum Genet. 1996;58(4):836–43.

    CAS  PubMed  PubMed Central  Google Scholar 

  13. Kryukov GV, Shpunt A, Stamatoyannopoulos JA, Sunyaev SR. Power of deep, all-exon resequencing for discovery of human trait genes. Proc Natl Acad Sci U S A. 2009;106(10):3871–6.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  14. Lander ES, Botstein D. Mapping Mendelian factors underlying quantitative traits using Rflp linkage maps. Genetics. 1989;121(1):185–99.

    CAS  PubMed  PubMed Central  Google Scholar 

  15. Van Gestel S, Houwing-Duistermaat JJ, Adolfsson R, van Duijn CM, Van Broeckhoven C. Power of selective genotyping in genetic association analyses of quantitative traits. Behav Genet. 2000;30(2):141–6.

    Article  PubMed  Google Scholar 

  16. Sasieni PD. From genotypes to genes: doubling the sample size. Biometrics. 1997;53(4):1253–61.

    Article  CAS  PubMed  Google Scholar 

  17. Won S, Lange C. A general framework for robust and efficient association analysis in family-based designs: quantitative and dichotomous phenotypes. Stat Med. 2013;32(25):4482–98.

  18. Thornton T, McPeek MS. Case–control association testing with related individuals: a more powerful quasi-likelihood score test. Am J Hum Genet. 2007;81(2):321–37.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  19. Falconer DS. The inheritance of liability to diseases with variable age of onset, with particular reference to diabetes mellitus. Ann Hum Genet. 1967;31(1):1–20.

    Article  CAS  PubMed  Google Scholar 

  20. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007;81(3):559–75.

  21. Yang JA, Lee SH, Goddard ME, Visscher PM. GCTA: a tool for genome-wide complex trait analysis. Am J Hum Genet. 2011;88(1):76–82.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  22. Lange C, DeMeo DL, Laird NM. Power and design considerations for a general class of family-based association tests: Quantitative traits. Am J Hum Genet. 2002;71(6):1330–41.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  23. McPeek MS, Wu X, Ober C. Best linear unbiased allele-frequency estimation in complex pedigrees. Biometrics. 2004;60(2):359–67.

    Article  PubMed  Google Scholar 

  24. Benchek PH, Morris NJ. How meaningful are heritability estimates of liability? Hum Genet. 2013;132(12):1351–60.

    Article  PubMed  Google Scholar 

  25. Leppard P, Tallis GM. Evaluation of the mean and covariance of the truncated multinormal distribution. J R Stat Soc C. 1989;38(3):543–53.

    Google Scholar 

  26. Aitken AC. Note on selection from a multivariate normal population. Proc Edinburgh Mathematical Society B. 1934;4:106–10.

    Article  Google Scholar 

  27. Pearson K. Mathematical contributions to the theory of evolution. XI. On the influence of natural selection on the variability and correlation of organs. Philos Transact A Math Phys Eng Sci. 1903;200:1–66.

    Article  Google Scholar 

  28. Elston RC, Sobel E. Sampling considerations in the gathering and analysis of pedigree data. Am J Hum Genet. 1979;31(1):62–9.

    CAS  PubMed  PubMed Central  Google Scholar 

  29. Gatz M, Reynolds CA, Fratiglioni L, Johansson B, Mortimer JA, Berg S, et al. Role of genes and environments for explaining Alzheimer disease. Arch Gen Psychiatry. 2006;63(2):168–74.

  30. Gatz M, Pedersen NL, Berg S, Johansson B, Johansson K, Mortimer JA, et al. Heritability for Alzheimer’s disease: the study of dementia in Swedish twins. J Gerontol A Biol Sci Med Sci. 1997;52(2):M117–25.

  31. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D. Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet. 2006;38(8):904–9.

    Article  CAS  PubMed  Google Scholar 

  32. Li Q, Yu K. Improved correction for population stratification in genome-wide association studies by identifying hidden population structures. Genet Epidemiol. 2008;32(3):215–26.

    Article  CAS  PubMed  Google Scholar 

  33. Devlin B, Roeder K. Genomic control for association studies. Biometrics. 1999;55(4):997–1004.

    Article  CAS  PubMed  Google Scholar 

  34. Reich DE, Goldstein DB. Detecting association in a case–control study while correcting for population stratification. Genet Epidemiol. 2001;20(1):4–16.

    Article  CAS  PubMed  Google Scholar 

  35. Laird NM, Horvath S, Xu X. Implementing a unified approach to family-based tests of association. Genet Epidemiol. 2000;19 Suppl 1:S36–42.

    Article  PubMed  Google Scholar 

  36. Vieland VJ, Hodge SE. Inherent intractability of the ascertainment problem for pedigree data: a general likelihood framework. Am J Hum Genet. 1995;56(1):33–43.

    CAS  PubMed  PubMed Central  Google Scholar 

  37. Nielsen R, Hubisz MJ, Clark AG. Reconstituting the frequency spectrum of ascertained single-nucleotide polymorphism data. Genetics. 2004;168(4):2373–82.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  38. Clark AG, Hubisz MJ, Bustamante CD, Williamson SH, Nielsen R. Ascertainment bias in studies of human genome-wide polymorphism. Genome Res. 2005;15(11):1496–502.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  39. Cantor RM, Lange K, Sinsheimer JS. Prioritizing GWAS results: A review of statistical methods and recommendations for their application. Am J Hum Genet. 2010;86(1):6–22.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  40. Ripke S, Wray NR, Lewis CM, Hamilton SP, Weissman MM, Breen G, et al. A mega-analysis of genome-wide association studies for major depressive disorder. Mol Psychiatry. 2013;18(4):497–511.

  41. Choi S, Lee S, Cichon S, Nothen MM, Lange C, Park T, Won S: FARVAT: a family-based rare variant association test. Bioinformatics. 2014;30(22):3197-205.

Download references

Acknowledgements

SP, SL, YL, and SW were supported by the Industrial Core Technology Development Program (10040176, Development of Various Bioinformatics Software Using Next Generation Bio-Data) funded by the Ministry of Trade, Industry and Energy (MOTIE, Korea), and by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (2013R1A1A2010437). CP was supported by the National Research Foundation of Korea Grant funded by the Korean Government (NRF-2011-220-C00004). BH, KM, CL and RT were supported by NIMH RO1 grant (2R01MH060009) and the Cure Alzheimer’s Fund.

Author information

Authors and Affiliations

Authors

Corresponding authors

Correspondence to Rudolph Tanzi or Sungho Won.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

SP participated in the design of the study, and performed the simulation studies and statistical analysis. SL and TP developed the program. YL conducted the statistical analysis. CH, BH, KM, CP, LB, CL and RT conceived of the study. SW conceived of the study, and participated in data analysis. All authors read and approved the final manuscript.

Additional file

Additional file 1:

Web-based Supporting Materials for “Adjusting heterogeneous ascertainment bias for genetic association analysis with extended families” by Suyeon Park, Sungyoung Lee, Young Lee, Christine Herold , Basavaraj Hooli, Kristina Mullin, Lars Bertram, Taesung Park, Changsoon Park, Christoph Lange, Rudolph Tanzi , and Sungho Won.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Park, S., Lee, S., Lee, Y. et al. Adjusting heterogeneous ascertainment bias for genetic association analysis with extended families. BMC Med Genet 16, 62 (2015). https://doi.org/10.1186/s12881-015-0198-6

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/s12881-015-0198-6

Keywords