- Research article
- Open Access
- Open Peer Review
Assessment of risk based on variant pathways and establishment of an artificial neural network model of thyroid cancer
BMC Medical Genetics volume 20, Article number: 92 (2019)
This study aimed to establish an artificial neural network (ANN) model based on variant pathways to predict the risk of thyroid cancer.
The RNASeq data of 482 thyroid cancer samples were downloaded from the TCGA database. The samples were divided into low-risk and high-risk groups, followed by identification of differentially expressed genes (DEGs). Co-expression analysis and pathway enrichment analysis were then performed. The variant pathways were screened according to the functional deviation score of each pathway, and an ANN model was established. Finally, the efficiency of the ANN model for risk assessment was validated by survival analysis and analysis of an independent microarray dataset (GSE34289) for thyroid cancer.
In total, 190 DEGs (85 up-regulated and 105 down-regulated) were identified between the low-risk and high-risk groups. Ten risk-related variant pathways were identified between the low-risk and high-risk groups, which were related to inflammatory and immune responses. Based on these variant pathways, an ANN model was built, consisting of an input layer, two hidden layers, and an output layer, corresponding to 15, 8, 5, and 1 neuron, respectively. Survival analysis showed that this model could effectively distinguish the samples with different risks. Analysis of microarray dataset GSE34289 showed that the accuracy of this model for predicating low-risk and high-risk samples was 77.5 and 86.0%, respectively.
This study suggests that the ANN model based on variant pathways can be used for effectively evaluating the risk of thyroid cancer.
Thyroid cancer is the most prevalent endocrine malignant cancer, with a steadily increasing incidence rate over the past several years . Differentiated thyroid cancer comprises the majority (> 90%) of all thyroid cancers, including papillary and follicular cancer [2, 3]. The main treatments for thyroid cancer are surgery, TSH suppressive treatment, and radioactive iodine (RAI) ablation therapy, and the average overall 5-year survival rate is up to 97.7% . However, approximately 10–20% patients lose their life due to the recurrence or progression of thyroid cancer . Therefore, exploring effective approaches is of great importance for the diagnosis of the risk of thyroid cancer.
During the past several decades, multiple computer-aided diagnostic models have been used for predicting the risk of a variety of cancers, such as logistic regression, Cox proportional hazard model, and decision trees [6,7,8]. Artificial neural networks (ANNs) represent a more recent approach for risk assessment of multiple diseases, including Parkinson’s disease , cardiovascular autonomic dysfunction , metabolic disorders , and various cancers [12,13,14]. Notably, ANN-based exploration of gene-nutrient interactions in folate and xenobiotic metabolic pathways can be used for investigating how micronutrients regulate susceptibility to breast cancer . In thyroid cancer, Notch signaling is found to regulate tumor growth . PI3K/Akt signaling pathway is considered a key mechanism to regulate the tumor-suppressive effects of metallothionein 1G in thyroid cancer . The sonic hedgehog signaling pathway can induce snail expression and thereby control the self-renewal of stem cells in anaplastic thyroid cancer . Although many signaling pathways have been found to be involved in thyroid cancer, studies on the use of ANN-based pathways for predicting thyroid cancer risk are rare.
In this study, we downloaded the RNASeq data for thyroid cancer samples with different cancer risks from The Cancer Genome Atlas (TCGA) database. Differentially expressed genes (DEGs) were identified between low-risk and high-risk groups, followed by co-expression analysis and pathway enrichment analysis. The variant pathways between the low-risk and high-risk groups were screened according to functional deviation score of each enriched pathway and an ANN model was established. Moreover, combined with the survival data for these samples, the efficiency of ANN model for risk assessment was validated by survival analysis and an independent microarray dataset of thyroid cancer, GSE34289. Microarray dataset GSE34289 has been utilized to build a gene-expression classifier for improving preoperative risk assessment . Our study results should provide new insights for predicting the risk of thyroid cancer.
The RNASeq data of thyroid cancer, including 482 thyroid cancer samples, were downloaded from TCGA (https://portal.gdc.cancer.gov/) in April 2017, based on the thca_tcga_pub_rna_seq_v2_mrna platform. The clinical information of thyroid cancer samples downloaded from the TCGA database was shown in Additional file 1: Table S1. The downloaded data had been preprocessed.
Data reconstruction and grouping
Based on the clinical data and information, the thyroid cancer samples downloaded from TCGA database were divided into low-risk (including T0, T1, N0, and M0) and high-risk (including T2, N1, M1, and above stages) groups according to the International Union Against Cancer (UICC) tumor–node–metastasis (TNM) classification. Finally, 114 high-risk samples and 368 low-risk samples were distinguished.
Data normalization and identification of DEGs
To eliminate the inherent expression differences between genes, the low-risk group was used as control to normalize the expression value of all samples using Z-score transformation . The expression values with more than 2.5-fold change in terms of the standard deviation were defined as abnormal and subjected to correction. The DEGs between high-risk samples and low-risk samples were then identified using the limma package (version 3.10.3, http://www.bioconductor.org/packages/2.9/bioc/html/limma.html) . P < 0.05 and coefficient of variance (CV) > 33.8 or < − 35.1 were used as the cut-off values to identify the DEGs.
Coexpression network analysis
The coexpression analysis of different genes between low-risk and high-risk samples was evaluated using Pearson correlation coefficient. R > 0.5 was identified as positive correlation, whereas R < − 0.5 indicated negative correlation. In both low-risk and high-risk samples, the coexpressed genes were defined as stable pairs, whereas the coexpressed genes found only in one of the groups were considered specific pairs. The stable pairs were recognized as the key genes with important functions, whose coexpression was not markedly affected by environment or disease stimulation. Nevertheless, the coexpression of specific pairs would change during tumor initiation and progression, which could be used for evaluating the cancer risk. After identifying the stable and specific pairs, unsupervised hierarchical clustering analysis  for the genes in these pairs was performed using the heatmap2 package in R , for distinguishing the samples with different cancer risks. In addition, based on the coexpression relationships, the respective gene coexpression networks under the low-risk and high-risk status were constructed using Cytoscape 3.4.0 . Two topological properties, including average shortest path (ASLP) and degree distribution, were then analyzed to measure the connectivity of the network.
Pathway enrichment analysis and pathway deviation score
To further analyze the biological functions of the genes in stable pairs and specific pairs, KEGG pathway enrichment analysis was carried out using Fisher’s exact test in the Database for annotation, visualization, and integrated discovery (DAVID) online tool . A value of P < 0.05 was used as the cut-off value. As these DEGs were associated with the risk of cancer, these significantly enriched pathways might be used for the evaluation of gene functions related to cancer risk. Based on the expression of DEGs in each sample, the functional deviation score for significantly enriched pathways was calculated using the Euclidean distance algorithm according to the following formula:
In the formula, n represents the number of enriched genes, Gi represents the expression of a single gene, and mean is the mean value of Gi in the low-risk group. High scores indicate a marked pathway deviation from normal levels of the low-risk group, whereas low scores indicate that the pathway expression levels were close to the normal levels of the low-risk group. Using this method, the differing pathways between the low-risk and high-risk groups were screened out using Student’s t-test. P < 0.01 was used as the cut-off value for significance.
Establishment of ANN model based on the variant pathways
Based on the DEGs and functional pathway analysis, an ANN model was established using the supervised classification method, with multilayer perceptron (MLP) algorithm . Meanwhile, the ROC curve was drawn with five-time cross validation to evaluate the classification efficiency of ANN model. Meanwhile, logistic regression as a reference was also analyzed to evaluate the efficiency of ANN model, based on variant pathways for risk assessment of thyroid cancer.
Validation with survival analysis
Using the above ANN model, 482 thyroid cancer samples obtained from the TCGA database was classified into two groups with different cancer risks. Using the survival data for these samples, survival analysis for the low-risk and high-risk groups was conducted using the survival package in R, and significant differences in survival times between two groups were determined using log-rank test. To further validate the efficiency of the ANN model for risk assessment, an independent dataset, with accession number GSE34289 deposited by Alexander et al. , was downloaded from the Gene Expression Omnibus (GEO, https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi), generated on the GPL14961 platform ([HuEx-1_0-st] Affymetrix Human Exon 1.0 ST Array [transcript (gene) version]). A total of 318 thyroid cancer samples were used in this study, including 40 benign, 233 indeterminate, and 45 malignant samples. The clinical information of thyroid cancer samples downloaded from the GEO database was shown in Additional file 2: Table S2. The downloaded data had been preprocessed. The risk for these 318 thyroid cancer samples was then predicted using this ANN model.
Identification of DEGs in thyroid cancer
Based on the criteria of P < 0.05 and CV > 33.8 or CV < − 35.1, a total of 190 DEGs were screened out between low-risk and high risk groups, including 85 up- and 105 down-regulated genes.
Co-expression network analysis
Pearson correlation coefficient was used to evaluate the correlation between a given pair of DEGs. Figure 1 displays the heat map of the top 30 risk-related DEGs with the highest correlation. According to the coexpression relationships between the gene pairs, the respective gene coexpression networks for the DEGs between low-risk and high-risk groups were constructed (Fig. 2a and b). From the topological perspective of coexpression networks, we found that these risk-related DEGs had a high degree of aggregation and central. Compared with the low-risk group, the high-risk group had more low-degree nodes (Fig. 2c), suggesting that, with increasing risk of thyroid cancer, the network node degree distribution gradually decreased and the association between the genes was gradually lost. Meanwhile, compared with the low-risk group, the average shortest path length in the high-risk group was higher, indicating a decreased capacity of information transfer through the network with increasing risk of thyroid cancer (Fig. 2d). In addition, supervised hierarchical clustering analysis was performed for these coexpressed DEGs, and the results showed that the samples with different cancer risk could be successfully distinguished (Fig. 3).
Variant pathway analysis and establishment of ANN model for risk assessment
KEGG pathway enrichment analysis for risk-related DEGs revealed 21 significantly enriched pathways, including those for Graft-versus-host disease, Allograft rejection, Type I diabetes mellitus, Autoimmune thyroid disease, Viral myocarditis, and Herpes simplex infection (Table 1). In order to quantify these significantly enriched pathways, the functional deviation score for each pathway was calculated for the identification of variant pathways associated with cancer risk. Using Student’s t-test with a significance threshold of P < 0.01, 10 risk-related variant pathways with significant differences between low-risk and high-risk groups were identified, including those for Measles, Antigen processing and presentation, Rheumatoid arthritis, Phagosome, Systemic lupus erythematosus, Herpes simplex infection, Inflammatory bowel disease IBD, Tuberculosis, Type I diabetes mellitus, and Toxoplasmosis (Table 2), most of which were related to inflammatory and immune responses. An ANN model for risk assessment was then constructed, for which the above 10 risk-related pathways served as the input variables (Fig. 4). This ANN model consisted of an input layer, two hidden layers, and an output layer, respectively, corresponding to 15, 8, 5, and 1 neuron, respectively. To assess the performance of this ANN model, ROC curves for this model and logistic regression model (control) were prepared (Fig. 5). We found that the area under the receiver operating curve (AUC) for the ANN model and logistic regression model was 0.85 and 0.73, respectively, indicating that the ANN model based on variant pathways had a better prediction accuracy for risk assessment.
Survival analysis validation
The 482 thyroid cancer samples were classified into low-risk and high-risk groups using ANN model. The results of survival analysis showed that the survival time of thyroid cancer samples in the low-risk group was significantly greater than that of the high-risk group samples (P = 0.0166, Fig. 6), indicating that this model could effectively distinguish the samples with different risks and yielded accurate predictions for the risk of thyroid cancer. Furthermore, the efficiency of this model was validated using the microarray dataset GSE34289, containing 318 samples representing different stages of thyroid cancer (benign, indeterminate, and malignant). The results showed that the accuracy of this model for low-risk (benign) samples and high-risk (indeterminate and malignant) samples was 77.5 and 86.0%, respectively.
In this study, we developed an ANN model for thyroid cancer, using 10 significantly enriched pathways related to inflammatory and immune responses. Analysis using the survival data of these samples showed that this model could effectively distinguish the samples with different risks. Analysis of microarray dataset GSE34289 showed that the accuracy of this model for predicting low-risk and high-risk samples was 77.5 and 86.0%, respectively. These findings highlight the efficiency of ANN in predicting the risk of thyroid cancer and merit further discussion.
One of the important findings of this study was that 10 variant pathways were identified to establish an ANN model, all of which were related to inflammatory and immune responses. Approximately 15–20% of human tumors are attributed to infection-driven inflammations, and two interrelated pathways have been reported as the link inflammation and cancer . The incidence of thyroid cancer is increased in autoimmune thyroid diseases, and an inflammatory cell infiltrate is often observed, which contributes to the development of thyroid cancer . In addition, the elevated serum concentration of antithyroglobulin antibody and TSH ≥ 1 μIU/ml in patients with Hashimoto’s thyroiditis (a common autoimmune disease) have been confirmed as independent predictors for thyroid cancer . An immunological association has also been reported between Hashimoto’s thyroiditis and thyroid cancer , and the pathology of Hashimoto’s thyroiditis can increase the risk of thyroid cancer . The inflammatory microenvironment is shown to play a crucial role in thyroid carcinogenesis . Considering the key roles of inflammatory and immune responses in the development of thyroid cancer, we speculate that these variant pathways may be associated with the risk of thyroid cancer.
Furthermore, ANN and logistic regression model are the most accepted type of models in biomedicine [32,33,34]. ANN model is considered to be better suited than logistic-regression-based model for predicting outcomes . Consistent with this finding, the ANN model developed here had a higher AUC than the logistic regression model, highlighting the better risk assessment accuracy of the ANN model based on variant pathways. Moreover, the results of survival analysis showed that the survival time of thyroid cancer samples in the low-risk group was greater than that of the high-risk group samples, indicating that this model could effectively distinguish the samples with different risk and was accurate for predicting the risk of thyroid cancer. Furthermore, based on the analysis of microarray dataset GSE34289, the accuracy of this model for low-risk (benign) samples and high-risk (indeterminate and malignant) samples was computed as 77.5 and 86.0%, respectively. These data further confirm the high accuracy of ANN model for the prediction of disease risk, as reported previously . Nevertheless, the performance of the prediction model still needs to be verified through comparison with multiple computer-aided diagnostic models.
In conclusion, this study suggests that the ANN model based on variant pathways could be used to evaluate the risk of thyroid cancer. With this model, we can identify the patients with a high risk of thyroid cancer, and the model-predicted risk probability would be helpful for clinicians in guiding the management and prevention of cancer high-risk patients.
Availability of data and materials
The microarray data GSE34289 was downloaded from the Gene Expression Omnibus (GEO, https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi).
Artificial neural network
Average shortest path
Coefficient of variance
Database for annotation, visualization, and integrated discovery
Differentially expressed genes
The Cancer Genome Atlas
International Union Against Cancer
Davies L, Welch H. Increasing incidence of thyroid cancer in the United States, 1973-2002. JAMA. 2006;295(18):2164–7.
Davies L, Welch H. Current thyroid cancer trends in the United States. JAMA Otolaryngol Head Neck Surg. 2014;140(4):317–22.
Lubitz CC, Sosa JA. The changing landscape of papillary thyroid cancer: epidemiology, management, and the implications for patients. Cancer. 2016;122(24):3754–9.
Haugen BR, Alexander EK, Bible KC, Doherty GM, Mandel SJ, Nikiforov YE, Pacini F, Randolph GW, Sawka AM, Schlumberger M, et al. 2015 American Thyroid Association management guidelines for adult patients with thyroid nodules and differentiated thyroid Cancer: the American Thyroid Association guidelines task force on thyroid nodules and differentiated thyroid Cancer. Thyroid. 2016;26(1):1–133.
Hollenbeak CS, Boltz MM, Schaefer EW, Saunders BD, Goldenberg D. Recurrence of differentiated thyroid cancer in the elderly. Eur J Endocrinol. 2013;168(4):549–56.
Ayer T, Chhatwal J, Alagoz O, Shavlik J, Kahn CE, Burnside ES: Comparison of Artificial Neural Network and Logistic Regression Model for Breast Cancer Risk Prediction. In: Radiological Society of North America 2008 Scientific Assembly and Meeting.
Zhu L, Luo W, Su M, Wei H, Wei J, Zhang X, Zou C: Comparison between artificial neural network and Cox regression model in predicting the survival rate of gastric cancer patients. 2013, 1(5):757–760.
Bartfay E, Mackillop WJ, Pater JL. Comparing the predictive value of neural network models to logistic regression models on the risk of death for small-cell lung cancer patients. Eur J Cancer Care (Engl). 2006;15(2):115–24.
Sharma RK, Gupta AK. Voice analysis for Telediagnosis of Parkinson disease using artificial neural networks and support vector machines. Int J Intell Syst Technol Appl. 2015;7(6):41–7.
Tang ZH, Liu J, Zeng F, Li Z, Yu X, Zhou L. Comparison of prediction model for cardiovascular autonomic dysfunction using artificial neural network and logistic regression analysis. PLoS One. 2013;8(8):e70571.
Hirose H, Takayama T, Hozawa S, Hibi T, Saito I. Prediction of metabolic syndrome using artificial neural network system based on clinical data including insulin resistance index and serum adiponectin. Comp Biol Med. 2011;41(11):1051.
Bertolaccini L, Solli P, Pardolesi A, Pasini A. An overview of the use of artificial neural networks in lung cancer research. J Thorac Dis. 2017;9(4):924.
Cancilla JC, Wierzchoś K, Shehadeh N, Haick H, Leja M, Torrecilla JS: Artificial neural networks in the determination of different types of cancer. In: International Symposium on Profiling: 2015.
Jerez JM, Franco L, Alba E, Llombartcussac A, Lluch A, Ribelles N, Munárriz B, Martín M. Improvement of breast cancer relapse prediction in high risk intervals using artificial neural networks. Breast Cancer Res Treat. 2005;94(3):265–72.
Naushad SM, Ramaiah MJ, Pavithrakumari M, Jayapriya J, Hussain T, Alrokayan SA, Gottumukkala SR, Digumarti R, Kutala VK. Artificial neural network-based exploration of gene-nutrient interactions in folate and xenobiotic metabolic pathways that modulate susceptibility to breast cancer. Gene. 2016;580(2):159–68.
Yamashita AS, Geraldo MV, Fuziwara CS, Kulcsar MA, Friguglietti CU, Da CR, Baia GS, Kimura ET. Notch pathway is activated by MAPK signaling and influences papillary thyroid cancer proliferation. Transl Oncol. 2013;6(2):197–205.
Fu J, Lv H, Guan H, Ma X, Ji M, He N, Shi B, Hou P. Metallothionein 1G functions as a tumor suppressor in thyroid cancer through modulating the PI3K/Akt signaling pathway. BMC Cancer. 2013;13(1):462.
Heiden KB, Williamson AJ, Doscas ME, Ye J, Wang Y, Liu D, Xing M, Prinz RA, Xu X. The sonic hedgehog signaling pathway maintains the cancer stem cell self-renewal of anaplastic thyroid cancer by inducing snail expression. J Clin Endocrinol Metab. 2014;99(11):2178–87.
Alexander EK, Kennedy GC, Baloch ZW, Cibas ES, Chudova D, Diggans J, Friedman L, Kloos RT, LiVolsi VA, Mandel SJ. Preoperative diagnosis of benign thyroid nodules with indeterminate cytology. N Engl J Med. 2012;367(8):705–15.
Cheadle C, Vawter MP, Freed WJ, Becker KG. Analysis of microarray data using Z score transformation. J Mol Diagn. 2003;5(2):73–81.
Smyth GK: Limma: linear models for microarray data, in Bioinformatics and computational biology solutions using R and Bioconductor. Springer 2005:397–420.
Sturn A, Quackenbush J, Trajanoski Z. Genesis: cluster analysis of microarray data. Bioinformatics. 2002;18(1):207–8.
Warnes GR, Bolker B, Bonebakker L, Gentleman R, Liaw WHA, Lumley T, Maechler M, Magnusson A, Moeller S, Schwartz M. Gplots: various R programming tools for plotting data. R package version 2.17. 0. Computer software. Available online at: https://cran.r-project.org/web/packages/gplots/index.html.
Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 2003;13(11):2498–504.
Shimray BA, Singh KM, Khelchandra T, Mehta RK: Ranking of sites for installation of hydropower plant using MLP neural network trained with GA: a MADM approach. Computational Intelligence & Neuroscience 2017, 2017.
Allavena P, Garlanda C, Borrello MG, Sica A, Mantovani A. Pathways connecting inflammation and cancer. Curr Opin Genet Dev. 2008;18(1):3–10.
Guarino V, Castellone MD, Avilla E, Melillo RM. Thyroid cancer and inflammation. Mol Cell Endocrinol. 2010;321(1):94–102.
Azizi G, Keller J, Lewis M, Piper K, Puett D, Rivenbark KM, Malchoff C. Association of Hashimoto's thyroiditis with thyroid cancer. Endocr Relat Cancer. 2014;21(6):845–52.
Ehlers M, Schott M. Hashimoto's thyroiditis and papillary thyroid cancer: are they immunologically linked? Trends Endocrinol Metab. 2014;25(12):656–64.
Paparodis R, Imam S, Todorova-Koteva K, Staii A, Jaume JC. Hashimoto's thyroiditis pathology and risk for thyroid cancer. Thyroid. 2014;24(7):1107–14.
Cunha LL, Marcello MA, Ward LS. The role of the inflammatory microenvironment in thyroid carcinogenesis. Endocr Relat Cancer. 2014;21(3):R85–R103.
Levy PS, Stolte K. Statistical methods in public health and epidemiology: a look at the recent past and projections for the next decade. Stat Methods Med Res. 2000;9(1):41–55.
Ding W, Zhou L, Bao Y, Zhou L, Yang Y, Lu B, Wu X, Hu R. Autonomic nervous function and baroreflex sensitivity in hypertensive diabetic patients. Acta Cardiol. 2011;66(4):465–70.
Chen Z, Liu J, Liang K, Liang W, Ma S, Zeng G, Xiao S, He J. The diagnostic value of a multivariate logistic regression analysis model with transvaginal power Doppler ultrasonography for the prediction of ectopic pregnancy. J Int Med Res. 2012;40(1):184–93.
Warner B, Misra M. Understanding neural networks as statistical tools. Am Stat. 1996;50(4):284–93.
Li H, Luo M, Zheng J, Luo J, Zeng R, Feng N, Du Q, Fang J. An artificial neural network prediction model of congenital heart disease based on risk factors: a hospital-based case-control study. Medicine. 2017;96(6):e6090.
This work was supported by Natural Science Foundation of Jilin Province (No. 20180101101JC) and Finance Department Health Special Project of Jilin Province (No. 3D518V613429).
Both funds have contributed to data collection, analysis and interpretation.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Zhao, Y., Zhao, L., Mao, T. et al. Assessment of risk based on variant pathways and establishment of an artificial neural network model of thyroid cancer. BMC Med Genet 20, 92 (2019) doi:10.1186/s12881-019-0829-4
- Thyroid cancer
- Risk assessment
- Variant pathway
- Artificial neural network