The InterLymph application
The InterLymph Consortium (International Consortium of Investigators Working on Non-Hodgkin Lymphoma Epidemiologic Studies) is an open scientific forum for epidemiologic research on mature B-cell malignancies, including NHL. Formed in 2001, the Consortium is a group of international investigators who have completed or are in charge of ongoing case-control studies and who discuss and undertake collaborative research projects that pool data across studies to elucidate the etiology of lymphoma.
In the past few years, the genetics working group of the Consortium has been engaged in large-scale GWAS, targeting, among others, the most prevalent NHL subtypes: chronic lymphocytic leukemia (CLL), diffuse large B-cell lymphoma (DLBCL), and follicular lymphoma (FL). For an investigation into the etiological relevance of genetic variability in epigenetic enzymes and regulators for NHL risk, the Consortium provided imputed data for 366 pre-selected genes for all three subtypes from a total of 29 study sites, covering 8,628 cases and 8,748 controls. Part of these data, restricted to the CLL and DLBCL subtypes, will be used to illustrate the method developed here. In addition, we pre-selected a specific chromosome; hence, the results should not be interpreted from a biological perspective, but serve to illustrate the proposed method. More comprehensive analyses from a subject matter perspective are ongoing.
In the InterLymph Consortium, the use of different genotyping platforms, for example the Illumina OMNIexpress-24 BeadChip or the Illumina OMNI2.5-8 BeadChip, resulted in studies lacking complete SNP overlap. In theory, imputing the data and performing an analysis based on the superset of all SNPs available in any of the studies would be preferable. However, this cannot always be guaranteed, because usually only high-quality imputed SNPs are taken into account. These may vary due to platform-specific differences in the coverage of genomic regions, which in turn leads to non-concordant SNPs.
Synthesis regression
Molecular data from case-control designs are frequently analyzed by univariate approaches. Although such initial analyses are univariate, the markers identified in case-control studies frequently feed into multi-SNP genetic risk scores. Multivariable approaches that can perform variable selection are able to provide such risk scores directly, specifically taking the correlation between markers into account.
The underlying idea in our setting is to construct a stable multivariable genetic risk score by selecting those SNPs that best explain the outcome. In such situations, regularized regression approaches can perform variable selection to obtain sparse models. Such approaches are widely used in high-dimensional data settings, when classical maximum likelihood estimation fails. Specifically for SNP data, approaches such as the lasso [12] or componentwise likelihood-based boosting [13] have been suggested. We use the latter as a basis for a synthesis regression approach [19] that can deal with partial overlap of the molecular data to address a challenge likely encountered when data are pooled from several studies, such as in the context of the InterLymph Consortium.
An advantage of componentwise boosting, compared to black-box approaches, is that it can be expressed in terms of univariate estimators. Therefore, we will briefly introduce the corresponding univariate estimators before subsequently describing componentwise boosting and its adaptation to partial overlap settings.
The model and univariate estimators
In the following, we consider a set of p SNPs in total across k studies, i.e. the superset of all SNPs. Corresponding to a partial overlap scenario, let us further assume that covariate j (j=1,…,p), corresponding to a specific SNP, is present for only \(k_{j}\) of the k studies. Let \(K_{j}=\left \{l\in \{1,\ldots,k\}: \text {covariate } j \text { is present for study } l\right \}\), with \(|K_{j}|=k_{j}\), be the set of studies comprising covariate j, and let \(n_{l}\) be the number of individuals in study l=1,…,k. Thus, in total, covariate j is present for \(n_{j}=\sum \nolimits _{l\in K_{j}} n_{l}\) individuals.
We assume additive coding, i.e. SNP values are available as 0, 1, and 2. Thus, we have a single covariate \(x_{lij}\) for SNP j=1,…,p of patient i=1,…,\(n_{l}\) from study l=1,…,k. In the following, the SNP values are assumed to be centered and standardized such that \(\sum \nolimits _{i=1}^{n_{l}} x_{lij}^{2} = n_{l}\). Such a standardization to equal variance is not specific to the present proposal, but typical for regularized regression approaches.
Cases and controls are treated as in logistic regression to determine whether some markers occur more frequently in cases than in controls (or vice versa). To obtain such an outcome \(y_{li}\) for our regression model, the case-control status is coded as 1 for cases and −1 for controls and centered per study. The centering could be omitted, but it allows the intercept terms to be ignored subsequently. For simplified notation, we will still refer to the values 1 and −1 in the following.
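To make this concrete, the per-study preprocessing might be sketched as follows in numpy; the function and array names (preprocess_study, geno, status) are illustrative only and not part of the original proposal.

```python
import numpy as np

def preprocess_study(geno, status):
    """Center and standardize additively coded SNP values (0/1/2) and
    code the case-control status as +1/-1, both per study.

    geno:   (n_l, p_l) array of SNP values for one study
    status: (n_l,) array, 1 for cases, 0 for controls
    """
    x = geno - geno.mean(axis=0)             # center each SNP
    scale = np.sqrt((x ** 2).mean(axis=0))   # so that sum_i x_lij^2 = n_l
    x = x / np.where(scale > 0, scale, 1.0)  # guard against monomorphic SNPs
    y = np.where(status == 1, 1.0, -1.0)     # 1 for cases, -1 for controls
    y = y - y.mean()                         # center the outcome per study
    return x, y
```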
To investigate whether SNPs are linked to the case-control outcome, i.e. whether they should be considered as risk markers, we use a linear model
$$\begin{array}{*{20}l} \mathbb{E}(Y|X=x)=x'\beta, \end{array} $$
(1)
where x is a vector comprising one or more of the SNP covariates, and β is the corresponding parameter vector to be estimated. This is non-standard, but allows for analytical tractability in the following. As we deal with a binary outcome, this is a quasi-likelihood approach, e.g. compared to a logistic regression model. Yet, the linear model will typically provide non-zero estimates for β whenever a logistic regression model would also have provided them, i.e. the linear model should be sufficient for marker selection. At the same time, it allows for a simple presentation and adaptation to partial overlap settings, as shown in the following.
If only a single SNP at a time is considered in model (1), a separate parameter \(\hat {\beta }_{lj}\) is estimated for each SNP j and study l, and the univariate estimate for \(\beta _{lj}\) takes the form
$$\begin{array}{*{20}l} \Delta_{lj}&=\frac{1}{n_{l}}\sum\limits_{i=1}^{n_{l}}x_{lij}y_{li} \end{array} $$
(2)
$$\begin{array}{*{20}l} &=\frac{1}{n_{l}}\sum\limits_{\substack{i\in\{1,\ldots,n_{l}\}:\\y_{li}=1}} x_{lij}- \frac{1}{n_{l}}\sum\limits_{\substack{i\in\{1,\ldots,{n_{l}}\}:\\y_{li}=-1}} x_{lij} \end{array} $$
(3)
being, up to a constant factor, the mean difference between the SNP values of cases and those of controls. This statistic can be pooled across the studies in which a SNP is available, using inverse variance weighting as established in GWAS settings. The resulting joint statistic (up to a constant factor, assuming equal error variance across studies) is
$$\begin{array}{*{20}l} \Delta_{j}&=\frac{1}{\sum\nolimits_{l \in K_{j}} {n_{l}}} \sum\limits_{l \in K_{j}} {n_{l}}\Delta_{lj} \end{array} $$
(4)
$$\begin{array}{*{20}l} &=\frac{1}{n_{j}} \sum\limits_{l \in K_{j}} \sum\limits_{i=1}^{n_{l}} x_{lij}y_{li}, \end{array} $$
(5)
i.e. an average of the per-study mean differences, corresponding to the least squares estimate obtained by pooling all individuals for whom SNP j has been measured.
While such a statistic is not commonly used in practice, it is expected to result in SNP rankings similar to rankings obtained from standard statistics. The advantage of this non-standard statistic is that it provides a straightforward link to multivariable approaches, as shown in the following.
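As a sketch of how Eqs. (2) to (5) translate into code under partial overlap, assume each study stores only its own SNP columns together with an index array mapping them into the superset; the names studies, snp_ids, and the helper itself are hypothetical.

```python
import numpy as np

def pooled_delta(studies, snp_ids, p):
    """Pooled univariate statistic Delta_j of Eq. (5).

    studies: list of per-study (x, y) pairs, e.g. from preprocess_study
    snp_ids: per study, the superset indices of its SNP columns
    p:       size of the SNP superset
    """
    xy = np.zeros(p)   # accumulates sum_{l in K_j} sum_i x_lij * y_li
    n_j = np.zeros(p)  # accumulates n_j = sum_{l in K_j} n_l
    for (x, y), ids in zip(studies, snp_ids):
        xy[ids] += x.T @ y
        n_j[ids] += x.shape[0]
    # Delta_j stays 0 for SNPs not present in any study
    return xy / np.maximum(n_j, 1), n_j
```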
Stagewise regression
Componentwise likelihood-based boosting [13] is a stagewise approach for estimating multivariable regression models, i.e. when x in model (1) comprises all SNPs. This approach performs variable selection by delivering estimates \(\hat {\beta }=\left (\hat {\beta }_{1},\ldots,\hat {\beta }_{p}\right)'\) with many elements equal to zero. It is closely linked to (forward) stagewise regression and is more cautious than classical (forward) stepwise selection, i.e. the final model is built up in very small steps [20]. Due to this relation, the resulting variable selection is similar to that of the lasso, but tends to be more robust in the presence of strong linkage disequilibrium between SNPs [13]. Therefore, we used this approach as the basis for synthesis regression in a setting with partial overlap.
The basic idea of componentwise likelihood-based boosting is to start with an initial estimate for the parameter vector β with all elements set to zero, i.e. none of the SNPs is part of the genetic risk score. Subsequently, in each of a number of steps, a single element of the parameter vector is selected to be updated when accounting for the SNPs that have been selected in earlier steps by an offset term, or equivalently, when considering the results from the previous step as an outcome. In doing so, the correlation between covariates is incorporated.
More formally, the boosting algorithm is as follows for each boosting step m=0,…,M:
1. For each covariate j, we determine the parameter estimate \(\hat {\gamma }_{j}^{(m+1)}\) from a univariate regression model, taking the previous boosting steps into account (more details are given below).
2. Determine the index j∗ of the covariate j with maximum value of \(\left (\hat {\gamma }_{j}^{(m+1)}\right)^{2}\), which corresponds to the score statistic. To obtain a weak learner, set \(\bar {\gamma }_{j}^{(m+1)}=\nu \cdot \hat {\gamma }_{j}^{(m+1)}\), where 0<ν≤1 is a shrinkage parameter fixed in advance [21].
3. Update the parameter estimates
$$ \hat{\beta}_{j}^{(m+1)}= \left\{\begin{array}{ll} \hat{\beta}_{j}^{(m)}+\bar{\gamma}_{j}^{(m+1)}&\text{ if } j=j^{*}\\ \hat{\beta}_{j}^{(m)}&\text{ else.}\\ \end{array}\right. $$
(6)
This iterative procedure is stopped when the chosen stopping criterion is met. This could be, for example, a pre-defined number of covariates having non-zero estimates (the number of SNPs to be selected) or a pre-specified number of boosting steps [22].
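The loop structure of steps 1 to 3 might be sketched as follows; compute_gamma is a placeholder for the candidate-model estimator, which is specified per study in Eq. (7) and in pooled form in Eq. (8) below.

```python
import numpy as np

def componentwise_boosting(compute_gamma, p, nu=0.1, M=100):
    """Generic componentwise boosting loop (steps 1-3 and Eq. 6).

    compute_gamma: callable mapping the current beta to the vector of
                   candidate estimates gamma_hat^(m+1)
    nu:            shrinkage factor yielding a weak learner
    M:             number of boosting steps (one possible stopping rule)
    """
    beta = np.zeros(p)                       # start: no SNP selected
    for _ in range(M):
        gamma = compute_gamma(beta)          # step 1: candidate estimates
        j_star = int(np.argmax(gamma ** 2))  # step 2: score-type criterion
        beta[j_star] += nu * gamma[j_star]   # step 3: shrunken update, Eq. (6)
    return beta
```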
We first consider the estimation per study, which requires specification of \(\hat {\gamma }_{lj}^{(m+1)}\). A regression model for the residuals \(r_{li}^{(m)}=y_{li} - \hat {y}_{li}^{(m)} = y_{li} - x_{li}'\hat {\beta }^{(m)}\) results in the following parameter estimate of the candidate model:
$$ \begin{aligned} \hat{\gamma}_{lj}^{(m+1)}=&\frac{1}{n_{l}} \sum\limits_{i=1}^{n_{l}} x_{lij}r_{li}^{(m)}\\ =&\frac{1}{n_{l}} \sum\limits_{i=1}^{n_{l}} x_{lij}\left(y_{li}-\hat{y}_{li}^{(m)}\right)\\ =&\frac{1}{n_{l}} \sum\limits_{i=1}^{n_{l}} x_{lij}y_{li}\\ &-\frac{1}{n_{l}}\sum\limits_{s:|\hat{\beta}_{s}^{(m)}|>0}\hat{\beta}_{s}^{(m)}\sum\limits_{i=1}^{n_{l}} x_{lij}x_{lis}\\ =&\Delta_{lj}-\frac{1}{n_{l}}\sum\limits_{s:|\hat{\beta}_{s}^{(m)}|>0}\hat{\beta}_{s}^{(m)}\sum\limits_{i=1}^{n_{l}} x_{lij}x_{lis}. \end{aligned} $$
(7)
This can be interpreted as a decorrelation based on the estimated effects of the other SNPs, or alternatively as adjusting the (scaled) difference of means \(\Delta _{lj}\) for effects that are due to other SNPs already included in the model.
Furthermore, this parameter estimate of the candidate model depends only on the univariate statistic \(\Delta _{lj}\) and the (scaled) covariance \(\frac {1}{n_{l}} \sum \nolimits _{i=1}^{n_{l}} x_{lij}x_{lis}\). This implies a straightforward way of estimating \(\gamma _{j}^{(m+1)}\), pooled across the studies where SNP j is available. Specifically, building on the univariate meta-analysis ideas described above, we propose using
$$ \begin{aligned} \hat{\gamma}_{j}^{(m+1)}&=\frac{1}{n_{j}} \sum\limits_{l \in K_{j}} \sum\limits_{i=1}^{n_{l}} x_{lij}y_{li}\\ &-\frac{1}{n_{j}}\sum\limits_{s:|\hat{\beta}_{s}^{(m)}|>0}\hat{\beta}_{s}^{(m)}\sum\limits_{l \in K_{j}} \sum\limits_{i=1}^{n_{l}} x_{lij}x_{lis}\\ &=\Delta_{j}-\frac{1}{n_{j}}\sum\limits_{s:|\hat{\beta}_{s}^{(m)}|>0}\hat{\beta}_{s}^{(m)}\sum\limits_{l \in K_{j}} \sum\limits_{i=1}^{n_{l}} x_{lij}x_{lis}, \end{aligned} $$
(8)
i.e. not only the (scaled) differences are pooled, but also the covariances.
In this way, our proposal for synthesis regression is based only on pairwise covariances. This enables us to incorporate data from several studies at the same time. More precisely, all information on a specific covariate j that is available in the different studies can be utilized, irrespective of whether data for this covariate are available in only one, several, or all studies.
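A sketch of how the pooled candidate estimate of Eq. (8) could be obtained from precomputed pairwise cross-products and plugged into the loop above; names are again hypothetical, and cross-products of a SNP pair accumulate only over studies in which both SNPs are present.

```python
import numpy as np

def make_synthesis_gamma(studies, snp_ids, p):
    """Return a compute_gamma callable implementing Eq. (8)."""
    xy = np.zeros(p)       # pooled sum_{l in K_j} sum_i x_lij * y_li
    n_j = np.zeros(p)      # n_j for each SNP of the superset
    xx = np.zeros((p, p))  # pooled pairwise cross-products x_lij * x_lis
    for (x, y), ids in zip(studies, snp_ids):
        xy[ids] += x.T @ y
        n_j[ids] += x.shape[0]
        xx[np.ix_(ids, ids)] += x.T @ x
    n_safe = np.maximum(n_j, 1)
    delta = xy / n_safe    # pooled Delta_j of Eq. (5)

    def compute_gamma(beta):
        active = np.flatnonzero(beta)  # SNPs with non-zero estimates
        if active.size == 0:
            return delta.copy()        # first step: gamma equals Delta_j
        # decorrelation: subtract effects of already selected SNPs, Eq. (8)
        return delta - (xx[:, active] @ beta[active]) / n_safe

    return compute_gamma
```

With these pieces, a call such as componentwise_boosting(make_synthesis_gamma(studies, snp_ids, p), p) would run the synthesis regression on the pooled data.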
Stability selection
Application of covariance-based boosting for synthesis regression yields a selection of SNPs from the (pooled) molecular data. However, the approach itself does not allow for type 1 error control. So-called stability selection [16] is a tool for addressing the question of statistical significance in situations where subsampling is combined with variable selection. Judging the relevance of the (significant) effects is a separate issue that is beyond the scope of these investigations.
We refer to subsampling as a resampling method in which B subsamples of all studies are drawn randomly without replacement [23]. The size of the subsamples is set to n/2, where n is the size of the full sample. In the following, we use the inclusion frequency (IF) to quantify how frequently a SNP has been selected across the B subsamples.
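A subsampling loop for the inclusion frequencies might look as follows; fit_and_select stands for one full run of the variable selection (e.g. the synthesis regression sketched above) on a subsample, returning the indices of the selected SNPs. In this sketch, each study is subsampled to half of its size, which preserves the study structure (an assumption of the illustration, not a prescription).

```python
import numpy as np

def inclusion_frequencies(studies, snp_ids, p, fit_and_select, B=100, seed=1):
    """Estimate inclusion frequencies from B subsamples of size about n/2."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(p)
    for _ in range(B):
        sub = []
        for x, y in studies:  # draw half of each study without replacement
            idx = rng.choice(x.shape[0], size=x.shape[0] // 2, replace=False)
            sub.append((x[idx], y[idx]))
        selected = fit_and_select(sub, snp_ids)  # indices of selected SNPs
        counts[np.asarray(list(selected), dtype=int)] += 1
    return counts / B  # IF per SNP of the superset
```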
The idea of the approach is to determine whether the variables that are selected more frequently than others across all subsamples can be declared significant in a way that controls the type 1 error. In the following, we detail the approach, which can be directly applied to our synthesis regression proposal.
\(\mathbb {E}(V)\), the expected number of false positives or per-family error rate, is bounded by a value determined from the resampled data and the variable selection procedure:
$$ \mathbb{E}(V)\leq\frac{1}{2\pi_{thr}-1}\cdot\frac{q^{2}}{p}, $$
(9)
where V is the number of false positives, p is the total number of covariates, and q is the average number of covariates selected over all B subsamples in the last step M of the variable selection procedure [16]. \(\pi _{thr}\in (0.5,1)\) denotes the threshold on the IF across the B subsamples for calling a SNP significant. In general, different values for \(\pi _{thr}\) should be considered, as they correspond to different type 1 error levels.
When the chosen parameters and the results from resampling yield \(\mathbb {E}(V)\leq 0.05\), the familywise error rate \(\mathbb {P}(V\geq 1)\) is controlled at the 5% level, since \(\mathbb {P}(V\geq 1)\leq \mathbb {E}(V)\leq 0.05\).
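As a worked example of Eq. (9), with numbers chosen purely for illustration:

```python
def expected_false_positives(q, p, pi_thr):
    """Upper bound on E(V) from Eq. (9); requires 0.5 < pi_thr < 1."""
    return q ** 2 / ((2 * pi_thr - 1) * p)

# With q = 10 SNPs selected on average, p = 2000 candidate SNPs, and
# pi_thr = 0.9: E(V) <= 10**2 / (0.8 * 2000) = 0.0625, so the familywise
# error rate is not yet controlled at the 5% level; raising pi_thr to
# 0.95 gives E(V) <= 100 / (0.9 * 2000) ~ 0.056, still slightly above.
```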