Quantitative phenotype scan statistic (QPSS) reveals rare variant associations with Alzheimer's disease endophenotypes.

BACKGROUND
Current sequencing technologies have provided for a more comprehensive genome-wide assessment and have increased genotyping accuracy of rare variants. Scan statistic approaches have previously been adapted to genetic sequencing data. Unlike currently-employed association tests, scan-statistic-based approaches can both localize clusters of disease-related variants and, subsequently, examine the phenotype association within the resulting cluster. In this study, we present a novel Quantitative Phenotype Scan Statistic (QPSS) that extends an approach for dichotomous phenotypes to continuous outcomes in order to identify genomic regions where rare quantitative-phenotype-associated variants cluster.


RESULTS
We demonstrate the performance and practicality of QPSS with extensive simulations and an application to a whole-genome sequencing (WGS) study of cerebrospinal fluid (CSF) biomarkers from the Alzheimer's Disease Neuroimaging Initiative (ADNI). Using QPSS, we identify regions of rare variant enrichment associated with levels of AD-related proteins, CSF Aβ1-42 and p-tau181P.


CONCLUSIONS
QPSS is implemented under the assumption that causal variants within a window have the same direction of effect. Typical self-contained tests employ a null hypothesis of no association between the target variant set and the phenotype. Therefore, an advantage of the proposed competitive test is that it is possible to refine a known region of interest to localize disease-associated clusters. The definition of clusters can be easily adapted based on variant function or annotation.

Next-generation sequencing technologies have allowed for more comprehensive genome-wide approaches, enabling accurate genotyping of rare variants (often defined as a variant with minor allele frequency (MAF) < 1-5%). Consequently, whole-exome sequencing (WES) and whole-genome sequencing (WGS) are ideal approaches to identify novel genes and rare variants associated with complex traits.
Traditional single-variant-based association tests are underpowered to detect rare variant associations unless either or both of the sample size and the effect size are very large [9]. Instead of testing single variants individually, more powerful and computationally efficient approaches for aggregating the effects of rare variants have become the standard for association testing. Many such approaches for testing association between rare variants within a pre-specified region and a disease have been proposed. Burden and variance component tests are two of the most common classes of rare variant analysis methods. Burden tests collapse rare variant effects from a specified region (e.g., gene) using a weighted average of variant counts, whereas variance component tests like the sequence kernel association test (SKAT) use the variance of effect sizes to examine association [10,11]. Burden tests are more powerful than the SKAT when most of the variants are causal and have the same direction of effect. On the other hand, SKAT is powerful when both risk and protective variants are mixed and when a small proportion of variants are causal [10]. The sheer number of published rare variant methods makes systematic evaluation of relative performance across a spectrum of realistic scenarios challenging [12].
A scan-statistic-based test was introduced into human genetics by Hoh et al. [13] and later adapted to find a window in which rare disease-associated risk variants cluster [1]. The underlying premise is that variants within a functional protein-coding domain may be located in close proximity and may play complementary roles in the genetic mechanisms of a disease. Unlike association tests or other cluster detection analyses, the scan-statistic-based test that we extend in this work can both detect the location of clusters and test for association [14]. Here, the null hypothesis is that rare variants within a certain genetic region/scan window are as likely to confer disease risk as are those outside the window. This approach is generally powerful when there are clusters of disease-related variants with the same direction of association within a selected region/window [14].
In this study, we propose the Quantitative Phenotype Scan Statistic (QPSS), expanding Ionita-Laza et al's scanstatistic from dichotomized responses to continuous outcomes by way of a normal probability model ( [15]; Supplementary Method 1). We then apply QPSS to WGS data from the Alzheimer's Disease Neuroimaging Initiative (ADNI) with continuous outcomes to identify clusters harboring rare variants associated with Alzheimer's disease-linked cerebrospinal fluid (CSF) biomarkers. Implementation QPSS: quantitative phenotype scan statistic Assume a study comprising n subjects, each with a continuous outcome of interest y i (i = 1, …, n). Let m G denote the total number of rare variants in a genetic region of interest G, where "rare" is defined by a specified cutoff value (e.g., MAF < 5%). Variants are sorted by physical position in ascending order. We set a subwindow W containing m W variants (m W < m G ) within the genetic region G (i. e., W ⊆ G). Let n G be the number of individuals who carry at least one rare variant within the genetic region G (n G ≤ N). Of the n G rare variants carriers, n W þ ð ≤ n G Þ carry a rare variant in the window W. Similarly, n W − ð¼ n G −n W þ ) do not carry a rare variant within window W. We can then partition the maximum likelihood estimate of the trait variance σ 2 W among the n G rare variant carriers aŝ Under the null hypothesis that variants within the subwindow W are equally as likely to correlate with the disease outcome that those outside the window (but still within the region of interest G), the outcome variance should be similar between the n W − and n W þ subjects so that the pooled variance would equal the overall variance, that is, where σ 2 0 is the variance under the null hypothesis and its maximum likelihood estimate iŝ Under the alternative hypothesis, we expect the outcome values of the n W þ subjects carrying a rare variant within W to be more similar to each other than to the rare variant carrying subjects whose variants are all outside W, that is σ 2 W < σ 2 0 . Using these definitions, we calculate the following log likelihood ratio test statistic for the window W (see Supplementary Method 1 for full details). In order to distinguish between risk and protective clusters, we can adjust the log likelihood ratio test statistic based on the estimated effect direction. For instance, Kulldorff et al. suggested the simple indicator function I ðμ W þ >μ W − Þ for risk clusters with high values of the outcome, which effectively removes from consideration any window W where the trait mean among subjects in W + is less that that for subjects in W − . Similarly, Iðμ W þ <μ W − Þ can be used for protective clusters with low values of the outcome [15]. To evaluate both positive and negative associations of clusters with the phenotype simultaneously, we can employ the sign function sgn The window harboring rare variants associated with a phenotype can be identified as the window that maximizes the log likelihood ratio, i.e., The null distribution of the test statistic is unknown; thus, p-values are calculated by a permutation approach [16]. To minimize computation time, we applied a GPD approximation [17] (see Supplementary Method 2 for full details) for estimating p-value of permutation test in the application study.

Simulation study Genotype data
We generated 10,000 haplotypes of a 1 Mb genomic region from a European population as implemented in the software Cosi2 (https://software.broadinstitute.org/mpg/ cosi2/) [18]. We randomly extracted haplotype pairs to create genotypes for sample sizes (n) of 500 and 1000. We then removed common variants (MAF > 0.05) and singletons.

Phenotype data
We considered three cluster sizes: small (200 basepairs (bp)) and moderate (500 bp), both of which have consecutively located disease-associated rare variants; and large (2 kb) where 20% of rare variants are diseaserelated (thus, not consecutive). In each of the clusters, we randomly chose a start position and then generated quantitative phenotypes from the model where β j = c|log 10 MAF j |, m C is the number of variants in the cluster, g ij is the additively-coded genotype (g ij = 0, 1, 2 according to the minor allele count) for the i th individual at the j th variant (j = 1, …, m C ), and ε i is the error for the i th individual generated from a standard normal (i.e., ε i ∼ N(0, 1)). We set c = 0.2, c = 0.4 and c = 0.6 for the empirical power simulations, and c = 0 for the type I error simulations. For each scenario, we generated 1000 simulation replicates.
Specification of the overall genetic region G and the subwindows W are flexible. To find a window that maximizes the log likelihood ratio, we thus employed a sliding window approach considering several different window sizes. We used windows sizes of 5 k, 2 k, 1 k, and 500 bp, and then slid each of those windows by half its respective size (i.e., by 2.5 k, 1 k, 500, and 250 bp). These scenarios are depicted both graphically and via a table ( Fig. 1 and Supplementary Table 1

Type I error simulation results
The empirical type I error rates were calculated as the proportion of p-values less than or equal to the corresponding Bonferroni-adjusted significance level (α * = α/ the number of examined sliding windows in the region G) for the window in which ln c LR W was maximized. As shown in Table 1, the type I error rates were acceptable in all scenarios.

Power simulation results
Power was calculated as the proportion of simulation replicates with an empirical p-value (corresponding to the window with maximum value of ln c LR W ) reaching Bonferroni-adjusted significance. Supplementary Figures 3 to 20 show the means of ln c LR W (over the 1000 simulation replicates) in each of the windows. Supplementary Tables 2 and 3 show the number of the targeted sliding windows (those containing true disease-related variants) with achieved the maximum value of ln c LR W . Figure 2 shows the empirical powers for the targeted sliding windows under the alternative hypothesis in each scenario when all disease-related variants had positive associations with the phenotype.   Empirical type I error rates were estimated as proportion of p-values less than or equal to the corresponding Bonferroni-corrected significance level (α* = α /m) for the window where ln b LR W is maximized When the effect size and the cluster size were small (c = 0.2 and cluster size = 200 bp), the proposed QPSS had low power in all scanning windows. In other scenarios, power depends on the scanning window size relative to the cluster size. When the scanning window is too large compared to the cluster size, power can decrease considerably. On the other hand, when the scanning window size is smaller than the size of the variantcontaining cluster, multiple sliding windows can overlap the outcome-related cluster; however, it was hard to identify the window most likely to harbor risk variants. Not surprisingly, the best scenario was when the scanning window is of similar length to the true cluster.

Application to Alzheimer's Disease Neuroimaging Initiative (ADNI)
Data used in the preparation of this article were obtained from the ADNI database (adni.loni.usc.edu). The ADNI was launched in 2003 as a public-private partnership, led by Principal Investigator Michael W. Weiner, MD. The primary goal of ADNI has been to test whether serial magnetic resonance imaging (MRI), positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment (MCI) and early AD. We retrieved baseline data on CSF amyloid β 1-42 (Aβ 1-42 ) and phosphorylated tau at the threonine 181 (p-tau 181P ) levels measured at the ADNI Biomarker Core laboratory at the University of Pennsylvania Medical Center [19,20] and single nucleotide variants (SNVs) from WGS data sequenced on the Illumina HiSeq2000 using paired-end 100-bp reads. We used data from 538 subjects aged 65 years or older at the time when the baseline CSF sample was drawn.

Discussion
We showed the performance and practicality of QPSS with extensive simulations and in application to a WGS dataset with CSF biomarkers from ADNI. We identified regions enriched with rare variants in TOMM40 and the surrounding intergenic region that were associated with   Tables 4 and 5 and Supplementary Figure 25). Unlike TOMM40, ARHG AP23 went undetected using commonly-used genebased tests (i.e., burden test, SKAT, and SKAT-O). Using a 1 k bp window (and corresponding 500 bp slide), the window 36,637,321 -36,638,320 in ARHG AP23 was the most likely to harbor a cluster of variants associated increased CSF p-tau 181P levels, which was significant after Bonferroni correction (p = 9.13 × 10 − 9 using generalized Pareto distribution (GPD) approximation). There were three rare variants in the implicated window in ARHGAP23; each minor allele associated with increased CSF p-tau 181P levels (Supplementary Table 6). The association between ARHG AP23 and CSF p-tau 181P levels is novel and warrants attempts at replication in future work. Notably, ARHGAP23 is located~7 Mb from the MAPT gene suggesting that this is an independent signal.

Conclusions
QPSS is implemented under the assumption, similar to burden tests, that causal variants within a window have the same direction of effect. However, there is a difference in the nature of the tested hypotheses between these methods. The null hypothesis of the competitive tests, like our proposed method, is that associations between the phenotype and the set of variants within a specified window are the same as those outside the window. Typical self-contained tests employ a null hypothesis of no association between the target variant set and the phenotype. Therefore, an advantage of the proposed competitive test is that it is possible to refine a known region of interest to localize diseaseassociated clusters. Note that the definition of clusters can be easily adapted based on variant function or annotation. A limitation of these approaches is the possibility of population structure confounding as the proposed method does not take into account covariate adjustment. We aim to address this limitation in future work.
License: Free academic research use. Any restrictions to use by non-academics: License required.
Additional file 1: Supplementary Method 1. Scan statistics for the normal probability model developed by Kulldorff et al. [1]. Supplementary Method 2. Computing empirical p-values based on permutation test and approximation by a generalized Pareto distribution described by Knijnenburg et al. [2]. Table S1. Simulation scenario for type I error and power evaluations. Table S2. Frequency of the number of targeted sliding windows that produced the maximum value of lnLR̂W (n = 500). Table S3. Frequency of the number of targeted sliding windows that produced the maximum value of lnLR̂W (n = 1000). Table S4. Single variant associations on the significant window of TOMM40 (45,403,046 -45,405,045) with log-transformed cerebrospinal fluid amyloid β 1-42 levels in ADNI. Table S5. Single variant associations on the significant window of intergenic region (45,412,796 -45,413,295) with log-transformed cerebrospinal fluid amyloid β 1-42 levels in ADNI. Table S6. Single variant associations on the significant window of ARHG AP23 (36,637,321 -36,638,320) with log-transformed cerebrospinal fluid phosphorylated tau levels in ADNI. Figure S1. Mean of continuous phenotype y in each scenario. Figure S2. Estimate of heritability in each scenario. Figure S3.  Figure S11. Mean of lnLR̂W for n = 500, cluster size = 2 kbp (containing 20% disease-related variants), and effect size c = 0.6. Each point represents the center position of each of the windows, and the blue vertical line indicates the center of the cluster position. Figure S12 Figure S16. Mean of lnLR̂W for n = 1000, cluster size = 500 bp, and effect size c = 0.4. Each point represents the center position of each of the windows, and the blue vertical line indicates the center of the cluster position. Figure S17. Mean lnLR̂W for n = 1000, cluster size = 500 bp, and effect size c = 0.6. Each point represents the center position of each of the windows, and the blue vertical line indicates the center of the cluster position. Figure S18. Mean of lnLR̂W for n = 1000, cluster size = 2 kbp (containing 20% disease-related variants), and effect size c = 0.2. Each point represents the center position of each of the windows, and the blue vertical line indicates the center of the cluster position. Figure S19. Mean of lnLR̂W for n = 1000, cluster size = 2 kbp (containing 20% disease-related variants), and effect size c = 0.4. Each point represents the center position of each of the windows, and the blue vertical line indicates the center of the cluster position. Figure S20. Mean of lnLR̂W for n = 1000, cluster size = 2 kbp (containing 20% disease-related variants), and effect size c = 0.6. Each point represents the center position of each of the windows, and the blue vertical line indicates the center of the cluster position.  Figure S22. Gene-based associations between rare variants located on chromosome 17 and log-transformed CSF phosphorylated tau levels in ADNI using the burden test, SKAT, and SKAT-O. The red horizontal line indicates the significance level with Bonferroni correction (α = 0.05/the number of genes on chromosome 17). Figure S23. QPSS p-values computed by the permutation with generalized Pareto distribution approximation for the associations between rare variants around APOE (± 10 Mbp) located on chromosome 19 and logtransformed CSF amyloid β 1-42 in ADNI. Figure S24. QPSS p-values computed by the permutation with generalized Pareto distribution approximation for the associations between rare variants around MAPT (± 10 Mbp) located on chromosome 17 and log-transformed CSF phosphorylated tau levels in ADNI. Figure S25. Single variant associations between rare variants around APOE (± 10 Mbp) located on chromosome 19 and log-transformed CSF amyloid β 1-42 in ADNI.