In this section, we will give details of the data set and describe a new analytical method to analyze this data set.
The Data Set from GWAS for Sporadic ALS
Schymick et al. have made their data set publicly available through the website of the National Institute of Neurological Disorders and Stroke (NINDS) Human Genetics Resource Center at the Coriell Institute http://ccr.coriell.org/ninds[1]. The data set contained 555,352 unique SNPs across the genome in 276 patients with sporadic ALS and 271 neurologically normal controls. The 555,352 SNPs were carefully chosen tagging SNPs from phase I and II of the HapMap Project. The sampled individuals were all non-Hispanic white Americans. There were 102 females and 174 males in cases, and 142 females and 129 males in controls. All sampled individuals had a more than 95% genotype call rate. The average call rate across all samples was 99.6%. Of the 555,352 SNPs studied, the genotype call rate was greater than 99% for 514,088 (representing 92.6% of all SNPs assayed) and greater than 95% for 549,062 (98.9%) SNPs. The phenotype file of this data set contained the status of sporadic ALS, age of onset, site of onset (bulbar-onset, upper-limb-onset, and lower-limb-onset), gender, and smoking status among other information.
Statistical Analysis
Two-locus Analysis Based on Seventeen Two-locus Models
In this article, we used seventeen two-locus models to analyze the genome-wide association data. For each SNP, we called one allele a high-risk allele if its frequency in cases was larger than the frequency in controls. For SNP A with alleles A, a and SNP B with alleles B, b, Figure 1 and 2 give eight epistatic two-locus models and nine multiplicative two-locus models with high-risk alleles A and B, respectively. Some of the eight epistatic two-locus models have been used and discussed by Xiong et al. and Zhao et al. [34, 35]. The multiplicative models that are good approximations of additive models have been discussed by Hodge and Risch [36, 37].
Under each of the epistatic models, the nine two-locus genotypes were divided into two groups: high-risk genotype group and low-risk genotype group. For example, under the model Dom∩Dom, the high-risk group was G
H
= {aAbB, AAbB, aABB, AABB} and the low-risk group was G
L
= {aabb, aAbb, AAbb, aaBB} For the eight epistatic models, we used one degree of freedom (df) χ
2 test statistic given by
to test for association of two-locus joint effects, where
,
, and
denote the frequencies of the high-risk genotype group in cases, controls and the pooled sample (cases and controls are pooled together).
For the nine multiplicative models, we constructed a two-locus association test as follows. Let P(Disease|g) denote the penetrance of two-locus genotype combination g = (g
1, g
2), where g
1 and g
2 are the genotypes in the first and second markers, respectively. Let β
0 denote the logarithm of the penetrance of genotypes with a relative risk of 1 in the models (see Figure 2) and β
1 = logθ, where θ is the relative risk given in Figure 2. Then, the nine multiplicative models can be described by the following log linear model log P(Disease|g) = β
0 + β
1
X, where X = x
1 + x
2, x
1 is the numerical code of g
1 and is given by
for a dominant, recessive or multiplicative model, respectively; x
2 is similarly defined as the numerical code of g
2. Under the log linear model log P(Disease|g) = β
0 + β
1
X, β
1 = 0 means that all the genotypes have the same penetrance which implies that θ = 1. So a test of the association between the disease and the two loci under the nine multiplicative models is equivalent to a test of the null hypothesis H
0: β
1 = 0. For the i
thindividual, let y
i
denote the trait value (1 for diseased individual and 0 for normal individual) and X
i
denote the numerical code of the genotype (X in the log linear model). The score test statistic is given by
where N is the sample size,
is the average of X
1,..., X
N
, and
is the average of y
1,..., y
N
. Under the null hypothesis, T
score
follows a χ
2 distribution with 1 df. Note that under each of the two-locus epistatic models, if we code X = 1 for a high-risk genotype group and X = 0 for a low-risk genotype group, then T
epi
= T
score
.
The method to search for significant two-locus combinations for each of the seventeen models has the following two steps:
Step 1: For each SNP, let n and m denote the number of individuals in cases and controls (different SNPs may have a different number of cases and controls due to missing genotypes). Let n
1, n
2, n
3 and m
1, m
2, m
3 denote the number of three genotypes in cases and controls, respectively. The 2 df genotypic test statistic is given by
where
and
. We applied this test statistic to each SNP, calculated the corresponding p-value, and returned M SNPs with the smallest p-values (M = 1,000 was used in this article).
Step 2: Under each of the seventeen two-locus models, we applied a two-locus association test to each of the L two-locus combinations among the M retained SNPs, where L = M(M-1)/2. For a two-locus epistatic model given in figure 1, we used the two-locus test T
epi
. For a multiplicative model given in figure 2, we used the score test T
score
. In this step, we got a p-value (called raw p-value) for each of the L two-locus combinations and each of the seventeen two-locus models.
A permutation procedure was used to adjust for multiple tests and multiple models. In each permutation, we randomly shuffled the cases and controls and repeated step 1 and step 2 based on the permuted data. We performed the permutation procedure B times (B = 1,000 was used in this article). For the i
thmodel and l
thtwo-locus combination (i = 1,...,17; l = 1,..., L), let p
il and
denote the raw p-values of the two-locus tests in step 2 based on the original data and on the b
thpermutated data, respectively. Let
Then, for the i
thmodel and l
thtwo-locus combination, P
il
, the p-value adjusted for multiple tests and multiple models, was given by
.
A New Method to Estimate Penetrance
When a study identifies a locus or locus-combination that shows evidence of association with a disease, it is common to estimate the impact of this locus or locus-combination on the phenotype of interest. This impact is often expressed as an odds ratio. Estimation of the odds ratio is also helpful for planning successful replication studies.
It is recognized that the traditional estimate of odds ratio is up-biased because it is typically estimated for the locus which was significant for association [32, 33]. Recently, Zollner and Pritchard proposed a new method to estimate penetrance (odds ratio can be calculated based on the penetrance) [32]. This new method was based on the likelihood of observed genotypes given that the locus was significant for association. We modified Zollner and Pritchard's method to estimate the penetrance and odds ratio for two-locus combinations under each of the seventeen models given in Figure 1 and Figure 2. We use the Dom∩Dom model given in Figure 1 as an example to describe our method.
We use the following notation:
n, m: the number of cases and controls
the data D = {n
1,..., n
9; m
1,..., m
9}: the counts of nine two-locus genotypes in cases and controls that constitute the significant signal for association
(q
1,..., q
9): the population frequencies of the genotypes
R: the relative risk of high-risk genotype combination to low-risk genotype combination, R = β/α.
F: the population prevalence of the disease which is assumed to be known.
Because ALS is a rare disease with F = 0.0001, we can estimate q
i
from the sampled controls. Thus, we assume that q
i
= (number of i
thgenotype in controls)/m is known in the following discussion. In the Dom∩Dom model, the 5th, 6th, 8thand 9thgenotype combination {(aA, bB), (AA, bB), (aA, BB), (AA, BB)} is the high-risk genotype combination, and the combination of the other genotypes is the low-risk genotype combination. Let q
H
= q
5 + q
6 + q
8 + q
9 denote the population frequency of the high-risk genotype combination. Then, the penetrance α and β (see Figure 1) can be calculated by
Thus, we have only one unknown parameter R Let S indicate that the two-locus combination of interest shows significant association. As described in the previous section, we use a two-step approach for the two-locus analysis. A significant association of the two-locus combination from our two-step method means that each of the two loci shows significant marginal association at level α
1 in step 1 and significant joint association at level α
2 in step 2. We calculate the likelihood L(R) using the equation
where the data D = {n
1,..., n
9; m
1,..., m
9}. Since the data D constitutes, by definition, a significant result, so D implies S; hence Pr(S|D,R) = 1. If the value of L(R) can be calculated for each given R, we can obtain the MLE of R by using a numerical optimization method (grid search was used in this article). For each R, the numerator can be calculated by the product of two multinomial distributions
where
if the k
thgenotype is a low-risk genotype;
otherwise. The traditional method to estimate the relative risk is to maximize Pr(D|R), the numerator in the likelihood function L(R), without considering the fact that the loci were significant for association. There is no simple method to calculate the denominator Pr(S|R), the power of our two-step test. We propose to use a simulation method as described below. For a given R, the values of α and β can be calculated by equation (1). When α, β, and q
i
are known, we can generate the two-locus genotypes for n cases and m controls. Next, we will perform the single-marker test and the two-locus test on the data set. If the p-values of the two single-marker tests are less than α
1 and the p-value of the two-locus test is less than α
2, the data set is said to be significant for association. We repeat the process to generate the data sets many times (1 million was used in this article). The proportion of significant data sets is the estimate of Pr(S|R).
When the relative risk R has been estimated, the corresponding estimates of α and β can be obtained from equation (1). The estimate of odds ratio of the high-risk genotype group is given by
.
Following Zollner and Prichard, when there are more than two genotype groups in the models such as these in Figure 2, we define the odds ratio of one group to be the odds of this group divided by the odds of the combination of the others. For example, there are three genotype groups in the Dom × Dom model: low risk genotype group G
L
= {aabb}, middle risk genotype group G
M
= {aabB, aaBB, aAbb, AAbb}, and high risk genotype group G
H
= {aAbB, aABB, AAbB, AABB}. The odd ratio of the high risk group OR
His the odds of G
H
divided by the odds of G
M
∪ G
L
= {aabb, aabB, aaBB, aAbb, AAbb}. The odd ratio of the low risk genotype group OR
Lis the odds of G
L
divided by the odds of G
M
∪ G
H
= {aabB, aaBB, aAbb, AAbb, aAbB, aABB, AAbB, AABB}. The odds ratio estimation method will be the same as the case of two genotype groups.
We used this new proposed method to estimate the odds ratio for each of the two-locus combinations that showed significant association with ALS in our two-locus analysis. Based on the estimated penetrance, we used a simulation method to estimate the sample size required to replicate the findings with 80% power.