Prediction of lung cancer risk in a Chinese population using a multifactorial genetic model

Background Lung cancer is a complex polygenic disease. Although recent genome-wide association (GWA) studies have identified multiple susceptibility loci for lung cancer, most of these variants have not been validated in a Chinese population. In this study, we investigated whether a genetic risk score combining multiple susceptibility loci could predict lung cancer risk in a Chinese population.
Methods Five single-nucleotide polymorphisms (SNPs) identified in previous GWA or large cohort studies were genotyped in 5068 Chinese case–control subjects. The genetic risk score (GRS) based on these SNPs was estimated by two approaches: a simple risk allele count (cGRS) and a weighted (wGRS) method. The area under the receiver operating characteristic (ROC) curve (AUC), in combination with bootstrap resampling, was used to assess the predictive performance of the genetic risk score for lung cancer.
Results Four independent SNPs (rs2736100, rs402710, rs4488809 and rs4083914) were found to be associated with lung cancer risk. The wGRS based on these four SNPs was a better predictor than the cGRS. Using a liability threshold model, we estimated that these four SNPs accounted for only 4.02% of the genetic variance in lung cancer. Smoking history contributed significantly to lung cancer risk (P < 0.001) [AUC = 0.619 (0.603-0.634)], and combining it with the wGRS gave an AUC of 0.639 (0.621-0.652) after adjustment for over-fitting. This model shows promise for assessing lung cancer risk in a Chinese population.
Conclusion Our results indicate that although genetic variants related to lung cancer added only moderate discriminatory accuracy, they still improved the predictive ability of the assessment model in a Chinese population.


Background
Lung cancer is one of the leading causes of cancer death worldwide [1,2]. Most patients are diagnosed at an advanced stage and so are not able to undergo surgical removal of tumors [1]. As a result, the overall 5-year survival rate is low. Early-stage detection, when treatment might be more effective, would therefore help reduce lung cancer mortality. For this reason, a well-established assessment model that could identify individuals at high risk would greatly benefit patients, clinicians and researchers.
Lung cancer is a polygenic disease, for which many genetic factors appear to play an important role in disease development [2,3]. During the past three years, several genome-wide association (GWA) studies have identified a number of genetic susceptibility loci associated with lung cancer risk [4][5][6][7][8][9], but most of these studies were conducted in populations of European descent, and many identified risk alleles have not been adequately evaluated in Asian populations.
In addition, when examined individually, each of the genetic susceptibility loci confers only a small to moderate disease risk, and is of limited utility in risk prediction. Combining multiple disease-related loci with modest effects into a genetic risk score (GRS) may be useful for identifying subgroups that are at high risk of lung cancer [10,11]. Several lung cancer risk assessment models have been proposed, including the Bach model, the Spitz model, and the Liverpool Lung Project (LLP) model [12][13][14][15]. However, most predictors in these models are demographic and clinical factors, and, to our knowledge, no report has quantified the risk of lung cancer using a combination of newly identified risk loci in a Chinese population.
In this case-control study, we evaluate the discriminatory and predictive ability of the cumulative effect of several SNPs associated with lung cancer risk in populations of European descent, and estimate the proportion of genetic variance explained by the selected risk loci in a Chinese population.

Subjects
A total of 2,283 lung cancer cases and 2,785 cancer-free controls, all genetically unrelated Han Chinese, were enrolled in this study from Shanghai Zhongshan Hospital, Shanghai Chest Hospital, First Affiliated Hospital of Nanjing Medical University, Beijing Union Medical College Hospital, and Wuhan Union Hospital, China. Eligible patients had histopathologically confirmed lung cancer, had no previous cancer history, and were not receiving radiotherapy or chemotherapy for other conditions. Control participants were randomly selected from individuals receiving routine physical examinations in local hospitals or from those who participated in a community-based screening program for non-communicable diseases. They were frequency-matched to the cases according to age, gender and residential area.
Information on smoking was collected by means of interviews. Individuals who had smoked less than one cigarette per day and for less than one year in their lifetime were defined as nonsmokers. The remaining individuals were divided into light and heavy smokers according to a threshold of 25 pack-years (the median pack-years in the controls). All participants provided written informed consent for study participation, and the study was approved by the institutional review boards of each participating institution.

Selection of genetic risk factors and genotyping
We reviewed the literature on GWAS and large cohort studies published up to June 2011, and selected lung cancer risk SNPs from GWAS demonstrating p < 5E-6 or from large cohort studies with evidence of replication at p < 0.05. In total, five SNPs were selected for analysis (Table 1).
Blood samples were collected from each subject at the time of recruitment, and genomic DNA was extracted using the QIAamp DNA Maxi kit (Qiagen GmbH). All SNPs were genotyped on the Sequenom MassARRAY iPLEX platform using matrix-assisted laser desorption/ionization time-of-flight (MALDI-TOF) mass spectrometry. Primer sequences are available on request. Overall, more than 98% of genotypes were successfully determined for all the SNPs; 5% of samples were randomly selected for re-genotyping as quality control, and showed a reproducibility of 100%.

Genetic risk score computation
Two approaches were used to calculate the genetic risk score (GRS): a simple risk allele count method (count GRS, cGRS) and a weighted method based on the genotype frequencies for each SNP and effect sizes (allelic odds ratios) from our study (weighted GRS, wGRS). Based on the log-additive model, the three genotypes AA, AB, and BB (A, low-risk allele; B, high-risk allele) for an SNP had relative risks of $1$, $\mathrm{OR}$ and $\mathrm{OR}^2$, respectively. If the B allele had frequency $p$, the average relative risk in the population was calculated as $u = (1-p)^2 + 2p(1-p)\,\mathrm{OR} + p^2\,\mathrm{OR}^2$. The adjusted risk values for the AA, AB, and BB genotypes were $1/u$, $\mathrm{OR}/u$, and $\mathrm{OR}^2/u$, respectively. Missing genotypes were assigned a value of 1. The combined weighted risk score was wGRS = SNP1 × SNP2 × SNP3 × SNP4, where SNP1-4 were the weighted risk scores for the individual SNPs.
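The two scores can be illustrated with a minimal Python sketch. It is not the authors' code: the function names are hypothetical, the BB genotype is normalized by the population average u (an assumption about the intended scaling), and missing genotypes are skipped in the count score, which the text does not specify.

```python
from math import prod

def count_grs(risk_allele_counts):
    """cGRS: number of risk alleles carried across SNPs.
    Missing genotypes (None) are skipped here; this handling is an assumption."""
    return sum(c for c in risk_allele_counts if c is not None)

def snp_weight(n_risk_alleles, odds_ratio, risk_allele_freq):
    """Population-adjusted relative risk of one SNP under the log-additive model."""
    if n_risk_alleles is None:
        return 1.0  # missing genotypes are assigned a value of 1
    p = risk_allele_freq
    # Average relative risk u over AA, AB, BB at Hardy-Weinberg proportions
    u = (1 - p) ** 2 + 2 * p * (1 - p) * odds_ratio + p ** 2 * odds_ratio ** 2
    # Relative risks 1, OR, OR^2 for 0, 1, 2 risk alleles, each divided by u
    return odds_ratio ** n_risk_alleles / u

def weighted_grs(risk_allele_counts, odds_ratios, risk_allele_freqs):
    """wGRS: product of the per-SNP adjusted risk values."""
    return prod(
        snp_weight(g, o, p)
        for g, o, p in zip(risk_allele_counts, odds_ratios, risk_allele_freqs)
    )

# Hypothetical subject and hypothetical per-SNP parameters (not values from Table 2)
genotypes = [2, 1, 0, None]
print(count_grs(genotypes))
print(weighted_grs(genotypes, [1.2, 1.1, 1.15, 1.2], [0.4, 0.3, 0.5, 0.3]))
```

With this scaling, the expected weight of each SNP over the population is 1, so a wGRS above 1 indicates a higher-than-average genetic risk under the model.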

Percentage of genetic variance explained
The percentage of genetic variance explained was estimated under a liability threshold model [16]. Allele frequencies and effect sizes corresponding to the ORs were used to calculate the variance attributable to each SNP as $2p(1-p)\beta^2$ ($p$, risk allele frequency; $\beta$, additive allelic effect).
Table 1 footnotes: *Characteristics of the loci from the cited genome-wide association and fine-mapping studies. CI, confidence interval; OR, odds ratio. †Rs402710 showed relatively little LD with rs2736100 (D' = 0.020). ‡Odds ratio per copy of the risk allele, as reported in the cited study.
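The per-SNP variance term 2p(1-p)β² from this section can be sketched as follows. This is a simplified illustration only: taking β as the natural log of the allelic OR is an assumption of the sketch, the full liability threshold calculation in [16] involves further steps that are not reproduced here, and the input values are hypothetical rather than taken from Table 2.

```python
from math import log

def variance_explained(risk_allele_freq, odds_ratio):
    """Additive variance term 2p(1-p)*beta^2 for one SNP.
    beta is taken as ln(OR), which is an assumption of this sketch."""
    p = risk_allele_freq
    beta = log(odds_ratio)
    return 2 * p * (1 - p) * beta ** 2

# Hypothetical risk allele frequency and odds ratio
share = variance_explained(0.30, 1.20)
print(f"{100 * share:.2f}% of variance")
```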

Statistical analysis
Logistic regression was employed to test the association between genetic variants and lung cancer risk. The classification ability of the model was assessed using the area under the receiver operating characteristic (ROC) curve (AUC), also known as the concordance (c) statistic. The Hosmer-Lemeshow test was used to evaluate the calibration of the risk estimates in our data. Internal validation of the models was carried out using a bootstrap method with 1000 replications to adjust model parameters for potential over-fitting. A second validation was performed by randomly dividing the study population into two unequal groups (one with 75% of the population, and the second with the remaining 25%). The larger group (training set) was used to rebuild the same model, which was then tested on the remaining 25% of the population (test set). All analyses were conducted using Statistical Analysis System (SAS) software (version 8.2; SAS Institute, Cary, NC). All p values were two-sided, and p values < 0.05 were considered statistically significant.
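The c statistic, the bootstrap resampling and the 75%/25% split described above can be sketched in plain Python. This is a simplified illustration, not the SAS procedure used in the study: the bootstrap here only resamples subjects to obtain a percentile interval for the AUC of a fixed score, whereas adjusting model parameters for over-fitting would require refitting the model in every replicate. All function names are hypothetical.

```python
import random

def auc(scores, labels):
    """Concordance (c) statistic: probability that a case scores higher than
    a control, with ties counted as one half."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((c > k) + 0.5 * (c == k) for c in cases for k in controls)
    return wins / (len(cases) * len(controls))

def bootstrap_auc(scores, labels, n_boot=1000, seed=0):
    """Percentile 95% interval of the AUC over bootstrap resamples of subjects."""
    rng = random.Random(seed)
    n = len(scores)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # a resample needs at least one case and one control
            stats.append(auc([scores[i] for i in idx], ys))
    stats.sort()
    return stats[int(0.025 * n_boot)], stats[int(0.975 * n_boot) - 1]

def split_75_25(n, seed=0):
    """Random split of subject indices into a 75% training and a 25% test set."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = int(0.75 * n)
    return idx[:cut], idx[cut:]

# Hypothetical risk scores (1 = case, 0 = control)
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]
labels = [0, 0, 1, 1, 0, 1]
print(auc(scores, labels), bootstrap_auc(scores, labels, n_boot=200))
```

The pairwise AUC is quadratic in sample size, which is adequate for a sketch but would be replaced by a rank-based computation for thousands of subjects.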

Association between genetic risk alleles and lung cancer
Five lung cancer-associated SNPs identified in previous GWA studies were evaluated in this study (Table 1). Each SNP was in Hardy-Weinberg equilibrium (p > 0.05) in the control group. The associations of the selected risk alleles with lung cancer are shown in Table 2. Four SNPs (rs2736100, rs402710, rs4083914, and rs4488809) were significantly associated with lung cancer in our study. Rs2736100 and rs402710 displayed weak linkage disequilibrium (D' = 0.022), but each was still associated with lung cancer risk after adjusting for the other. The four significantly associated SNPs were selected for further analysis. A liability threshold model was used to estimate the percentage of genetic variance explained by each of the risk alleles. Our data showed that rs2736100, rs402710, rs4083914 and rs4488809 accounted for 1.33%, 0.40%, 0.47% and 1.82% of the genetic variance, respectively (Table 2).
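A Hardy-Weinberg equilibrium check of the kind reported for the control group can be sketched as a one-degree-of-freedom Pearson chi-square test on genotype counts. The text does not state which test was used, so this choice is an assumption, and the genotype counts below are hypothetical.

```python
from math import erfc, sqrt

def hwe_chi_square(n_aa, n_ab, n_bb):
    """Pearson chi-square test (1 df) of Hardy-Weinberg equilibrium.
    Assumes a polymorphic SNP (allele frequency strictly between 0 and 1)."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_bb + n_ab) / (2 * n)  # frequency of allele B
    expected = [n * (1 - p) ** 2, 2 * n * p * (1 - p), n * p ** 2]
    observed = [n_aa, n_ab, n_bb]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 degree of freedom
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical control genotype counts
chi2, p_value = hwe_chi_square(1010, 1320, 455)
print(chi2, p_value, "in HWE" if p_value > 0.05 else "deviates from HWE")
```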

Genetic risk score association
Associations between lung cancer risk and the genetic risk score based on the two methods (cGRS and wGRS) were evaluated, and the results are shown in Table 3. To estimate the risk associated with the genetic risk score, we calculated odds ratios according to wGRS deciles. Compared with participants in the lowest decile, those in the highest decile had a 2.01-fold (95% confidence interval (CI), 1.59-2.54; p < 0.001) increased risk of lung cancer. We next compared the discriminative ability of the two scores by calculating the AUC, and wGRS was shown to be significantly better than cGRS in lung cancer risk prediction (Table 3 and Figure 1). The AUC was 0.551 for wGRS versus 0.542 for cGRS (p < 0.001).
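A decile comparison such as highest versus lowest wGRS decile can be illustrated with an unadjusted odds ratio and a Woolf (log-scale) confidence interval from a 2 × 2 table. This is only a sketch: the study used logistic regression, the function name is hypothetical, and the counts are invented for illustration rather than taken from Table 3.

```python
from math import exp, log, sqrt

def odds_ratio_ci(cases_high, controls_high, cases_low, controls_low, z=1.96):
    """Unadjusted odds ratio (high vs low group) with a Woolf 95% CI."""
    or_ = (cases_high * controls_low) / (controls_high * cases_low)
    # Standard error of ln(OR) from the four cell counts
    se = sqrt(1 / cases_high + 1 / controls_high + 1 / cases_low + 1 / controls_low)
    return or_, exp(log(or_) - z * se), exp(log(or_) + z * se)

# Hypothetical counts for the highest and lowest score deciles
or_, lower, upper = odds_ratio_ci(280, 230, 190, 310)
print(f"OR = {or_:.2f} (95% CI {lower:.2f}-{upper:.2f})")
```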

Discrimination performance of wGRS × demographic characteristics
General demographic characteristics (including age, gender and smoking status) were analyzed (Table 3), and the factor associated with lung cancer risk (smoking history) was further evaluated in the risk assessment model. The AUC of the model containing smoking alone was 0.619 (0.603-0.634). To estimate the relative risk of lung cancer for individuals with a given combination of risk factors, we used logistic regression to construct an assessment model including the wGRS and smoking status. As shown in Table 4 and Figure 1, the prediction model was expressed as follows: OR = exp(−0.9372 + 0.0610 × wGRS + 0.5142 × smoking status).
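The fitted model above can be evaluated for an individual with a few lines of Python. The coefficients are those reported in the text; the numeric coding of smoking status (for example 0, 1, 2 for non-, light and heavy smokers) is an assumption, since the coding is not stated here. Because the intercept comes from a case-control sample, the derived probability reflects the case fraction in the study and should not be read as an absolute population risk.

```python
from math import exp

# Coefficients as reported for the full model
INTERCEPT, B_WGRS, B_SMOKING = -0.9372, 0.0610, 0.5142

def model_odds(wgrs, smoking_status):
    """exp(linear predictor) of the reported model."""
    return exp(INTERCEPT + B_WGRS * wgrs + B_SMOKING * smoking_status)

def model_probability(wgrs, smoking_status):
    """Logistic transform of the linear predictor (within-sample probability)."""
    odds = model_odds(wgrs, smoking_status)
    return odds / (1 + odds)

# Hypothetical individuals; smoking coded 0/1/2 is an assumption
for smoking in (0, 1, 2):
    print(smoking, round(model_probability(1.3, smoking), 3))
```

Under this coding, each step in smoking status multiplies the model value by exp(0.5142), independent of the wGRS.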
The AUC of the full model was 0.639 (p-value for the difference in AUC versus wGRS alone < 0.001; p-value for the difference in AUC versus smoking status alone = 0.024). To correct for potential over-fitting, we adjusted the model parameters using the bootstrap method [17]. The adjusted and unadjusted values were almost the same, indicating that there was little over-fitting of the model. The adjusted AUC for the full model described above was 0.637 (Table 4 and Figure 1). The contribution of wGRS to the model was 0.020 (assessed by the reduction in c statistic when wGRS was removed from the full model). Smoking status was the strongest predictor in the model, with a contribution of 0.088.
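The two contributions follow directly from the c statistics reported in the text, taking each contribution as the drop in c statistic when that predictor is removed from the full model (the subscripts are labels introduced here for illustration):

```latex
\begin{align*}
\Delta c_{\text{wGRS}} &= c_{\text{full}} - c_{\text{smoking only}} = 0.639 - 0.619 = 0.020,\\
\Delta c_{\text{smoking}} &= c_{\text{full}} - c_{\text{wGRS only}} = 0.639 - 0.551 = 0.088.
\end{align*}
```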

Internal model validation
After the model was rebuilt on the training set, it displayed similar discrimination ability to the original one (c statistic, 0.641). The model was then tested on the test set, and also showed similar discrimination ability (c statistic, 0.633). We selected a cut-off value corresponding to the maximum sensitivity and specificity. The predictive performance of the rebuilt model for defining a high-risk group was then assessed by sensitivity, specificity, accuracy, positive predictive value (PV+) and negative predictive value (PV-) (Table 5). The predictive performance of this model in the two separate groups was similar (accuracy in the training set, 61.72%; accuracy in the test set, 61.48%).
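The performance measures listed above can be computed from predicted scores and a cut-off as in the sketch below. Choosing the cut-off that maximizes the sum of sensitivity and specificity (Youden's index) is one reading of "maximum sensitivity and specificity" and is an assumption here; function names and data are hypothetical.

```python
def _ratio(a, b):
    """a / b, or NaN when the denominator is zero."""
    return a / b if b else float("nan")

def classification_metrics(scores, labels, cutoff):
    """Sensitivity, specificity, accuracy, PV+ and PV- when scores at or above
    the cut-off are classified as high risk."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= cutoff and y == 1)
    fp = sum(1 for s, y in pairs if s >= cutoff and y == 0)
    tn = sum(1 for s, y in pairs if s < cutoff and y == 0)
    fn = sum(1 for s, y in pairs if s < cutoff and y == 1)
    return {
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "accuracy": _ratio(tp + tn, len(pairs)),
        "pv_pos": _ratio(tp, tp + fp),
        "pv_neg": _ratio(tn, tn + fn),
    }

def best_cutoff(scores, labels):
    """Observed score maximizing sensitivity + specificity (Youden's index)."""
    def youden(c):
        m = classification_metrics(scores, labels, c)
        return m["sensitivity"] + m["specificity"]
    return max(sorted(set(scores)), key=youden)

# Hypothetical predicted scores (1 = case, 0 = control)
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]
labels = [0, 0, 1, 1, 0, 1]
cut = best_cutoff(scores, labels)
print(cut, classification_metrics(scores, labels, cut))
```

In a training/test design, the cut-off would be chosen on the training set only and then applied unchanged to the test set.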

Discussion
In this study, we systematically evaluated the clinical utility of five SNPs identified in recent GWA and large cohort studies of lung cancer. Using data from a large case-control study that enrolled 5,068 participants, we found that most of the genetic variants (rs2736100, rs402710, rs4488809, and rs4083914) identified previously in other populations were also associated with risk of lung cancer in a Chinese population. In addition, we showed that a wGRS accounting for the adjusted effect size of each SNP was a better predictor than a cGRS, and had a stronger association with lung cancer risk than any single SNP alone. Although the weighted genetic risk score had only a moderate predictive ability on its own, it gave better discrimination between lung cancer cases and cancer-free controls (AUC of ROC curve, 0.639) when combined with smoking status in the logistic regression model. Several lung cancer risk assessment models have previously been proposed [12][13][14][15], but most predictors were traditional risk factors such as family history of lung cancer, smoking status, environmental exposure, age and gender. In contrast to these, genetic scores derived from inherited genetic variation offer the advantage of stability over the lifetime of the individual.
The 15q25 region, which encodes nicotinic acetylcholine receptor subunits, has been reported to be associated with lung cancer risk [6][7][8]. We evaluated the rs1051730 SNP from this region in the present study, but it showed no association with disease risk. It is conceivable that the rs1051730 allele frequency in the Chinese Han population (MAF, 0.02) is too low to confirm the effects seen in European populations [22]. Reported risk SNPs at 6p21 (rs3117582 and rs3131379) are not polymorphic in the Chinese Han population, so they were excluded from this study. Rs4488809 and rs4083914, previously identified by GWA and large cohort investigations, were also shown to be significantly associated with lung cancer risk in this study [23,24].
Of the five SNPs evaluated in this study, the strongest signal was found for rs4488809, for which there was a 21% elevated risk of lung cancer with each risk allele. The three other SNPs (rs2736100, rs402710, and rs4083914) were also associated with a risk of lung cancer, albeit at lower levels (<18%) for each risk allele. The estimated proportion of genetic variance explained by these four SNPs together was 4.02%, which includes 1.82% due to rs4488809 and 1.33% due to rs2736100. This suggests that the genetic susceptibility loci identified by GWA and large cohort studies in other populations confer only a small to moderate risk in a Chinese population when considered alone, and are of little use in lung cancer risk assessment.
To overcome this, a genetic risk score combining multiple loci might improve the identification of persons at high risk of developing lung cancer. Our results showed that although wGRS was highly associated with lung cancer susceptibility, a model including wGRS alone did not provide better predictive capacity than a model including traditional factors (c statistic for wGRS alone, 0.551). Smoking history was also associated with lung cancer risk in this study, in agreement with previous reports [12,25]. Moreover, wGRS in combination with smoking status showed a better predictive ability (c statistic, 0.639). Indeed, the c statistic decreased by 0.020 when wGRS was removed from the full model, indicating that genetic risk factors could improve the discriminatory ability of the traditional assessment model, although this effect was moderate.
This study has a number of limitations. First, the susceptibility loci identified by GWA and large cohort studies with evidence of replication were associated with lung cancer risk through strong linkage disequilibrium, and generally conferred only moderate effects. Many additional susceptibility loci for lung cancer remain to be discovered, and it is possible that rare variants with high penetrance would explain the remaining heritability [26]. Next-generation sequencing technologies offer hope for future research on such variants [27]. Recently, several newly identified SNPs were reported [28][29][30]. Combining these new SNPs might improve the classification of lung cancer risk. Second, because only a limited number of traditional factors were included, the full predictive model established in this study provided only a moderate level of classification accuracy, with a c statistic of 0.639, which is inadequate for risk prediction. The discriminatory capability of our model might be improved by including additional factors such as history of bronchitis, emphysema or pneumonia, asbestos exposure, and family history of lung cancer. Third, our assessment model lacked external validation, even though our estimates of the ROC AUC were corrected for over-fitting by bootstrap and internal validation was conducted. Finally, as this was a retrospectively designed study, the results need to be validated in a large-scale, prospective study.

Conclusions
We have shown that most of the genetic susceptibility loci identified by previous GWA and large cohort studies in other populations were also associated with lung cancer risk in a Chinese population. Although the weighted genetic risk score had only a moderate discriminatory accuracy, it still improved the predictive ability of the assessment model, which might help in the identification of individuals at a high risk of developing lung cancer. Future studies should focus on establishing a risk assessment model that incorporates both genetic variants and established traditional factors for lung cancer.