Assessment of risk based on variant pathways and establishment of an artificial neural network model of thyroid cancer

Background This study aimed to establish an artificial neural network (ANN) model based on variant pathways to predict the risk of thyroid cancer. Methods The RNASeq data of 482 thyroid cancer samples were downloaded from the TCGA database. The samples were divided into low-risk and high-risk groups, followed by identification of differentially expressed genes (DEGs). Co-expression analysis and pathway enrichment analysis were then performed. The variant pathways were screened according to the functional deviation score of each pathway, and an ANN model was established. Finally, the efficiency of the ANN model for risk assessment was validated by survival analysis and analysis of an independent microarray dataset (GSE34289) for thyroid cancer. Results In total, 190 DEGs (85 up-regulated and 105 down-regulated) were identified between the low-risk and high-risk groups. Ten risk-related variant pathways were identified between the low-risk and high-risk groups, which were related to inflammatory and immune responses. Based on these variant pathways, an ANN model was built, consisting of an input layer, two hidden layers, and an output layer, corresponding to 15, 8, 5, and 1 neuron, respectively. Survival analysis showed that this model could effectively distinguish the samples with different risks. Analysis of microarray dataset GSE34289 showed that the accuracy of this model for predicating low-risk and high-risk samples was 77.5 and 86.0%, respectively. Conclusions This study suggests that the ANN model based on variant pathways can be used for effectively evaluating the risk of thyroid cancer. Electronic supplementary material The online version of this article (10.1186/s12881-019-0829-4) contains supplementary material, which is available to authorized users.


Background
Thyroid cancer is the most prevalent endocrine malignant cancer, with a steadily increasing incidence rate over the past several years [1]. Differentiated thyroid cancer comprises the majority (> 90%) of all thyroid cancers, including papillary and follicular cancer [2,3]. The main treatments for thyroid cancer are surgery, TSH suppressive treatment, and radioactive iodine (RAI) ablation therapy, and the average overall 5-year survival rate is up to 97.7% [4]. However, approximately 10-20% patients lose their life due to the recurrence or progression of thyroid cancer [5]. Therefore, exploring effective approaches is of great importance for the diagnosis of the risk of thyroid cancer.
During the past several decades, multiple computeraided diagnostic models have been used for predicting the risk of a variety of cancers, such as logistic regression, Cox proportional hazard model, and decision trees [6][7][8]. Artificial neural networks (ANNs) represent a more recent approach for risk assessment of multiple diseases, including Parkinson's disease [9], cardiovascular autonomic dysfunction [10], metabolic disorders [11], and various cancers [12][13][14]. Notably, ANN-based exploration of gene-nutrient interactions in folate and xenobiotic metabolic pathways can be used for investigating how micronutrients regulate susceptibility to breast cancer [15]. In thyroid cancer, Notch signaling is found to regulate tumor growth [16]. PI3K/Akt signaling pathway is considered a key mechanism to regulate the tumor-suppressive effects of metallothionein 1G in thyroid cancer [17]. The sonic hedgehog signaling pathway can induce snail expression and thereby control the selfrenewal of stem cells in anaplastic thyroid cancer [18]. Although many signaling pathways have been found to be involved in thyroid cancer, studies on the use of ANN-based pathways for predicting thyroid cancer risk are rare.
In this study, we downloaded the RNASeq data for thyroid cancer samples with different cancer risks from The Cancer Genome Atlas (TCGA) database. Differentially expressed genes (DEGs) were identified between low-risk and high-risk groups, followed by co-expression analysis and pathway enrichment analysis. The variant pathways between the low-risk and high-risk groups were screened according to functional deviation score of each enriched pathway and an ANN model was established. Moreover, combined with the survival data for these samples, the efficiency of ANN model for risk assessment was validated by survival analysis and an independent microarray dataset of thyroid cancer, GSE34289. Microarray dataset GSE34289 has been utilized to build a gene-expression classifier for improving preoperative risk assessment [19]. Our study results should provide new insights for predicting the risk of thyroid cancer.

Data source
The RNASeq data of thyroid cancer, including 482 thyroid cancer samples, were downloaded from TCGA (https://portal.gdc.cancer.gov/) in April 2017, based on the thca_tcga_pub_rna_seq_v2_mrna platform. The clinical information of thyroid cancer samples downloaded from the TCGA database was shown in Additional file 1: Table S1. The downloaded data had been preprocessed.

Data reconstruction and grouping
Based on the clinical data and information, the thyroid cancer samples downloaded from TCGA database were divided into low-risk (including T0, T1, N0, and M0) and high-risk (including T2, N1, M1, and above stages) groups according to the International Union Against Cancer (UICC) tumor-node-metastasis (TNM) classification. Finally, 114 high-risk samples and 368 low-risk samples were distinguished.

Data normalization and identification of DEGs
To eliminate the inherent expression differences between genes, the low-risk group was used as control to normalize the expression value of all samples using Z-score transformation [20]. The expression values with more than 2.5-fold change in terms of the standard deviation were defined as abnormal and subjected to correction. The DEGs between high-risk samples and low-risk samples were then identified using the limma package (version 3.10.3, http://www. bioconductor.org/packages/2.9/bioc/html/limma.html) [21]. P < 0.05 and coefficient of variance (CV) > 33.8 or < − 35.1 were used as the cut-off values to identify the DEGs.

Coexpression network analysis
The coexpression analysis of different genes between low-risk and high-risk samples was evaluated using Pearson correlation coefficient. R > 0.5 was identified as positive correlation, whereas R < − 0.5 indicated negative correlation. In both low-risk and high-risk samples, the coexpressed genes were defined as stable pairs, whereas the coexpressed genes found only in one of the groups were considered specific pairs. The stable pairs were recognized as the key genes with important functions, whose coexpression was not markedly affected by environment or disease stimulation. Nevertheless, the coexpression of specific pairs would change during tumor initiation and progression, which could be used for evaluating the cancer risk. After identifying the stable and specific pairs, unsupervised hierarchical clustering analysis [22] for the genes in these pairs was performed using the heatmap2 package in R [23], for distinguishing the samples with different cancer risks. In addition, based on the coexpression relationships, the respective gene coexpression networks under the low-risk and highrisk status were constructed using Cytoscape 3.4.0 [24]. Two topological properties, including average shortest path (ASLP) and degree distribution, were then analyzed to measure the connectivity of the network.

Pathway enrichment analysis and pathway deviation score
To further analyze the biological functions of the genes in stable pairs and specific pairs, KEGG pathway enrichment analysis was carried out using Fisher's exact test in the Database for annotation, visualization, and integrated discovery (DAVID) online tool [13]. A value of P < 0.05 was used as the cut-off value. As these DEGs were associated with the risk of cancer, these significantly enriched pathways might be used for the evaluation of gene functions related to cancer risk. Based on the expression of DEGs in each sample, the functional deviation score for significantly enriched pathways was calculated using the Euclidean distance algorithm according to the following formula: In the formula, n represents the number of enriched genes, Gi represents the expression of a single gene, and mean is the mean value of Gi in the low-risk group.
High scores indicate a marked pathway deviation from normal levels of the low-risk group, whereas low scores indicate that the pathway expression levels were close to the normal levels of the low-risk group. Using this method, the differing pathways between the low-risk and high-risk groups were screened out using Student's t-test. P < 0.01 was used as the cut-off value for significance.

Establishment of ANN model based on the variant pathways
Based on the DEGs and functional pathway analysis, an ANN model was established using the supervised classification method, with multilayer perceptron (MLP) algorithm [25]. Meanwhile, the ROC curve was drawn with five-time cross validation to evaluate the classification efficiency of ANN model. Meanwhile, logistic regression as a reference was also analyzed to evaluate the efficiency of ANN model, based on variant pathways for risk assessment of thyroid cancer.

Validation with survival analysis
Using the above ANN model, 482 thyroid cancer samples obtained from the TCGA database was classified into two groups with different cancer risks. Using the survival data for these samples, survival analysis for the low-risk and high-risk groups was conducted using the survival package in R, and significant differences in survival times between two groups were determined using log-rank test. To further validate the efficiency of the ANN model for risk assessment, an independent dataset, with accession number GSE34289 deposited by Alexander et al. [19], was downloaded from the Gene Expression Omnibus (GEO, https://www.ncbi.nlm.nih. gov/geo/query/acc.cgi), generated on the GPL14961 platform ([HuEx-1_0-st] Affymetrix Human Exon 1.0 ST Array [transcript (gene) version]). A total of 318 thyroid cancer samples were used in this study, including 40 benign, 233 indeterminate, and 45 malignant samples. The clinical information of thyroid cancer samples downloaded from the GEO database was shown in Additional file 2: Table S2. The downloaded data had been preprocessed. The risk for these 318 thyroid cancer samples was then predicted using this ANN model.

Identification of DEGs in thyroid cancer
Based on the criteria of P < 0.05 and CV > 33.8 or CV < − 35.1, a total of 190 DEGs were screened out between low-risk and high risk groups, including 85 up-and 105 down-regulated genes.

Co-expression network analysis
Pearson correlation coefficient was used to evaluate the correlation between a given pair of DEGs. Figure 1 displays the heat map of the top 30 risk-related DEGs with the highest correlation. According to the coexpression relationships between the gene pairs, the respective gene coexpression networks for the DEGs between lowrisk and high-risk groups were constructed ( Fig. 2a and  b). From the topological perspective of coexpression networks, we found that these risk-related DEGs had a high degree of aggregation and central. Compared with the low-risk group, the high-risk group had more lowdegree nodes (Fig. 2c), suggesting that, with increasing risk of thyroid cancer, the network node degree distribution gradually decreased and the association between the genes was gradually lost. Meanwhile, compared with the low-risk group, the average shortest path length in the high-risk group was higher, indicating a decreased capacity of information transfer through the network with increasing risk of thyroid cancer (Fig. 2d). In addition, supervised hierarchical clustering analysis was performed for these coexpressed DEGs, and the results showed that the samples with different cancer risk could be successfully distinguished (Fig. 3). KEGG pathway enrichment analysis for risk-related DEGs revealed 21 significantly enriched pathways, including those for Graft-versus-host disease, Allograft rejection, Type I diabetes mellitus, Autoimmune thyroid disease, Viral myocarditis, and Herpes simplex infection (Table 1). In order to quantify these significantly enriched pathways, the functional deviation score for each pathway was calculated for the identification of variant pathways associated with cancer risk. Using Student's t-test with a significance threshold of P < 0.01, 10 risk-related variant pathways with significant differences between low-risk and high-risk groups were identified, including those for Measles, Antigen processing and presentation, Rheumatoid arthritis, Phagosome, Systemic lupus erythematosus, Herpes simplex infection, Inflammatory bowel disease IBD, Tuberculosis, Type I diabetes mellitus, and Toxoplasmosis ( Table 2), most of which were related to inflammatory and immune responses. An ANN model for risk assessment was then constructed, for which the above 10 risk-related pathways served as the input variables (Fig. 4). This ANN model consisted of an input layer, two hidden layers, and an output layer, respectively, corresponding to 15,8,5, and 1 neuron, respectively. To assess the performance of this ANN model, ROC curves for this model and logistic regression model (control) were prepared (Fig. 5). We found that the area under the receiver operating curve (AUC) for the ANN model and logistic regression model was 0.85 and 0.73, respectively, indicating that the ANN model based on variant pathways had a better prediction accuracy for risk assessment.

Survival analysis validation
The 482 thyroid cancer samples were classified into lowrisk and high-risk groups using ANN model. The results of survival analysis showed that the survival time of thyroid cancer samples in the low-risk group was significantly greater than that of the high-risk group samples (P = 0.0166, Fig. 6), indicating that this model could effectively distinguish the samples with different risks and yielded accurate predictions for the risk of thyroid cancer. Furthermore, the efficiency of this model was validated using the microarray dataset GSE34289, containing 318 samples representing different stages of thyroid cancer (benign, indeterminate, and malignant). The results showed that the accuracy of this model for low-risk (benign) samples and high-risk (indeterminate and malignant) samples was 77.5 and 86.0%, respectively.

Discussion
In this study, we developed an ANN model for thyroid cancer, using 10 significantly enriched pathways related to inflammatory and immune responses. Analysis using the survival data of these samples showed that this model could effectively distinguish the samples with different risks. Analysis of microarray dataset GSE34289 showed that the accuracy of this model for predicting low-risk and high-risk samples was 77.5 and 86.0%, respectively. These findings highlight the efficiency of ANN in predicting the risk of thyroid cancer and merit further discussion.
One of the important findings of this study was that 10 variant pathways were identified to establish an ANN model, all of which were related to inflammatory and immune responses. Approximately 15-20% of human tumors are attributed to infection-driven inflammations, and two interrelated pathways have been reported as the link inflammation and cancer [26]. The incidence of thyroid cancer is increased in autoimmune thyroid diseases, and an inflammatory cell infiltrate is often observed, which contributes to the development of thyroid cancer [27]. In addition, the elevated serum concentration of antithyroglobulin antibody and TSH ≥ 1 μIU/ml in patients with Hashimoto's thyroiditis (a common autoimmune disease) have been confirmed as independent predictors for thyroid cancer [28]. An immunological association has also been reported between Hashimoto's thyroiditis and thyroid cancer [29], and the pathology of Hashimoto's thyroiditis can increase the risk of thyroid cancer [30]. The inflammatory microenvironment  is shown to play a crucial role in thyroid carcinogenesis [31]. Considering the key roles of inflammatory and immune responses in the development of thyroid cancer, we speculate that these variant pathways may be associated with the risk of thyroid cancer. Furthermore, ANN and logistic regression model are the most accepted type of models in biomedicine [32][33][34]. ANN model is considered to be better suited than logistic-regression-based model for predicting outcomes [35]. Consistent with this finding, the ANN model developed here had a higher AUC than the logistic regression model, highlighting the better risk assessment accuracy of the ANN model based on variant pathways. Moreover, the results of survival analysis showed that the survival time of thyroid cancer samples in the low-risk group was greater than that of the high-risk group samples, indicating that this model could effectively distinguish the samples with different risk and was accurate for predicting the risk of thyroid cancer. Furthermore,   based on the analysis of microarray dataset GSE34289, the accuracy of this model for low-risk (benign) samples and high-risk (indeterminate and malignant) samples was computed as 77.5 and 86.0%, respectively. These data further confirm the high accuracy of ANN model for the prediction of disease risk, as reported previously [36]. Nevertheless, the performance of the prediction model still needs to be verified through comparison with multiple computeraided diagnostic models.