Systems analysis of inflammatory bowel disease based on comprehensive gene information

Background The rise of systems biology and availability of highly curated gene and molecular information resources has promoted a comprehensive approach to study disease as the cumulative deleterious function of a collection of individual genes and networks of molecules acting in concert. These "human disease networks" (HDN) have revealed novel candidate genes and pharmaceutical targets for many diseases and identified fundamental HDN features conserved across diseases. A network-based analysis is particularly vital for a study on polygenic diseases where many interactions between molecules should be simultaneously examined and elucidated. We employ a new knowledge driven HDN gene and molecular database systems approach to analyze Inflammatory Bowel Disease (IBD), whose pathogenesis remains largely unknown. Methods and Results Based on drug indications for IBD, we determined sibling diseases of mild and severe states of IBD. Approximately 1,000 genes associated with the sibling diseases were retrieved from four databases. After ranking the genes by the frequency of records in the databases, we obtained 250 and 253 genes highly associated with the mild and severe IBD states, respectively. We then calculated functional similarities of these genes with known drug targets and examined and presented their interactions as PPI networks. Conclusions The results demonstrate that this knowledge-based systems approach, predicated on functionally similar genes important to sibling diseases is an effective method to identify important components of the IBD human disease network. Our approach elucidates a previously unknown biological distinction between mild and severe IBD states.


Background
Inflammatory Bowel Disease (IBD) is a chronic disease of unknown etiology that causes inflammation and ulcer in intestinal mucosa. Although IBD is still much less prevalent in Japan than in Western countries, the number of Japanese IBD patients has rapidly increased in the last 20 years [1]. This rising trend, also observed in the Asia-Pacific region [2,3] indicates that IBD is rapidly becoming a world-wide disease. There are two major sub-categories of IBD: Crohn's disease (CD) and ulcerative colitis (UC) [4]. Although the pathogenesis of IBD is not fully explained, genetic factors are suggested to contribute to dysregulation of intestinal immunity, leading to gastrointestinal injury.
A genetic study of IBD was first reported in 1988 as an epidemiological study of CD patients [5]. Genome-wide scanning (GWS) studies have revealed nine IBD susceptibility loci (IBD1-9) [6] and one susceptibility gene (NOD2) [7]. Genome-Wide Association Studies (GWAS) and corresponding meta-analyses identified 71 susceptibility loci for CD [8] and 47 loci for UC [9]. Recently the genetic susceptibility to IBD was comprehensively reviewed in [10]. Another study based on and analysis of molecular pathways suggested a significant overlap between IBD and autoimmune disorder, type 1 diabetes, ankylosing spondylitis (AS), multiple sclerosis, asthma [11], and rheumatoid arthritis (RA) [12].
IBD medical treatment policy and drug selection is determined according to IBD 'severity' (mild, moderate, or severe) [4,[13][14][15], which reflects the frequency of rectal bleedings and stools as well as mucosal appearances on sigmoidoscopy. Aminosalicylate (e.g., Mesalazine) is a primary drug for the mild state, and anti-TNF antibody (e.g., Infliximab) is a primary drug for the severe state. Both drugs act on inhibition of inflammatory cascades; however, while the anti-TNF antibody specifically interacts with TNF, aminosalicylate is similar to other members of nonsteroidal anti-inflammatory drugs (NSAIDs) having several molecular targets.
Recently, human diseases have been studied with a systems approach to address the need to understand the network of genes and molecules acting in concert to produce the pathogenesis and progression of the disease. A human disease network (HDN) is a representation of the etiology of the disease by protein-protein interaction (PPI) data and information. Goh et al. constructed a comprehensive HDN to cover all diseases and showed that essential human genes encoded hub proteins in the network [16]. Hase et al. investigated physical properties of the comprehensive HDN, and elucidated that there were extensive interconnections among middle-degree nodes that formed the backbone of the network [17]. Hase et al. also investigated drug-target genes in the HDN, and found significant preference of drug targets for middle-and low-degree nodes [17].
Wall et al. studied the HDN for autism. They constructed the network by genes related to autism and other neurological disorders in order to elucidate common features in the diseases [18]. They identified 154 genes not previously linked to autism. Based on this study, Wall et al. developed a web tool for comprehensive genetic annotation of diseases called "Genotator" [19]. Genotator provides an up-to-date and comprehensive collection of disease genes and a reliable gene-to-disease ranking for any disease. They showed that the integration of all the available databases gave a more complete picture than any one database alone, using Alzheimer disease as a case study.
The idea of Genotator inspired us to investigate IBD by a knowledge driven HDN gene and molecular database systems approach. Like autism and Alzheimer disease, IBD is known to be a complex disease with a large number of genes and molecules implicated in the etiology of the disease but without direct experimental evidence. A primary difference between our approach and Wall's is our use of available drugs, their targeted pathways and genes and related drug treatment information to highlight the more important genes to the disease (in this case IBD). We assume that current effective drug treatments and their targets provide essential information that will help identify key pathways important in the pathogenesis and progression of the disease. We collected genes related to IBD and its drugs, constructed a disease network with the genes, and investigated the functional similarity of drug targets to the putative IBD genes. The human disease network results demonstrate a new approach to characterizing IBD and its progression from early stage into the chronic and more malignant state.

Drug information on IBD and sibling diseases
We use the World Gastroenterology Organization [4] guidelines and IBD societies of USA [20], Europe [21][22][23][24], and Asia [2,3] to identify drugs for IBD and sibling diseases. Additional drug information was collected from PubMed and the Cochrane Library databases. The drug indications were investigated by databases: Package Inserts Database (in Japanese) [25] for Japanese drugs, FDA Approved Drug Products [26] for US drugs, and the electronic Medicines Compendium (eMC) [27] for UK drugs. Thereafter, target molecules and pathways of a drug were retrieved from the DrugBank database [28]. We then selected a target molecule of all the drugs (compendium of drugs are found in Table 1).

Collection of disease genes from public databases
The basic gene association approach originated from Genotator [19] which collected evidence for the association of gene to disease from multiple databases and then scored the evidence thus providing an evidenced-based ranked list of genes associated with a given disease. Here, we collected IBD and sibling disease genes from HuGE Navigator, PharmGKB, GeneCards and Genetic Association Database (GAD). We downloaded the PharmGKB and GAD datasets and 'scraped' genes from the HuGE Navigator and GeneCards web applications. All data was selected in August, 2010. On the gene collection, we employed all the variation of synonymous names of IBD and sibling diseases. The collection of genes were then merged by matching EntrezGene IDs. If a database did not provide a gene's EntrezGene IDs, we used the DAVID bioinformatics tools [29,30] to retrieve the EntrezGene IDs.

Protein-Protein Interaction network
A Protein-Protein Interaction (PPI) network was produced by STRING8 (October, 2010) using at least one experimental association result, information from PPI databases, including additional edges of one path (depth = 2), and with high confidence (confidence > 0.9). We used CYTOS-CAPE [31] to visualize the PPI networks.

Semantic similarity between two genes
We employed the same scoring as [32]. First, define where p(c) is the frequency of annotation of the term c and its children in the GO graph. Here, Information Content (IC) is a measure of how informative a term c is relative to other terms [33]. Second, define the similarity between two GO terms c 1 and c 2 as: where A(c 1 ,c 2 ) is the set of common ancestors of c 1 and c 2 . SIM term (c 1 ,c 2 ) is defined as the most informative common ancestor of c 1 and c 2 [34,35]. Then, define the similarity between two GO term sets A and B (SIM termset (A, B)), assuming that the gene g A and the gene g B are annotated respectively with n and m GO terms as A = {GO 1A , GO 2A ,...,GO nA }, B = {GO 1B , GO 2B ,...,GO mB }. SIM termset (A,B) is defined as summation of the maximum similarities between a term in set A and any of terms in set B, normalized by summation of IC of the terms in set A: Please note that SIM termset (A, B) is not equal to SIMtermset (B, A). Then the expected similarity SSM between the g A and the g B is defined as: (symbol for member of) 0 [1]. A pseudo-code for SSM calculation is available in Additional file 1. We calculated the random distribution of the SSM score; 1,000 randomly collected human genes were tested against the drug targets for IBDmild and IBDsevere separately. Both distributions showed upper quartile threshold scores to be 0.5 (0.50760 and 0.50940 for IBDmild and IBDsevere respectively), and we determined that the upper quartile was best suited to define both those genes most likely closely associated with the disease but least likely to be unimportant. Consequently, we used the 0.5 threshold as criteria for our SSM scoring.

Gene score for record frequency
Define Record Frequency RF g as the total frequency of records of a certain gene related to the sibling disease (sib) contained in the database (db): where N db is the total number of tested databases (four in this study: HugeNavigator, PharmGKB, GeneCards, and GAD), N sib is the total number of the sibling diseases, N ref is the total number of papers reporting on any human gene, and DF g is the frequency of papers reporting a certain gene. We calculated DF g and N ref by Entrez Programming Utilities provided by NCBI. When calculating DF g , we replaced a name of a certain gene with its synonyms obtained from a complete gene information table ("Homo_sapiens.gene_info") provided by NCBI. N ref was evaluated by the number of papers annotated with MeSH terms of "homo sapiens" and "gene" or "protein". N ref~2 .3 million (Aug 10th, 2010).

Gene expression data and analysis
We retrieved 12 data sets from GEO and 13 data sets from ArrayExpress by a query with keywords of "Crohn's disease, CD, ulcerative colitis, UC, inflammatory bowel disease, IBD, homo sapiens". After investigation of individual data sets, only three of them (GSE6731 [36], GSE9452 [37], and E-TABM-118 [38]) consist of patients with CD and UC and include patient information on medicinal drugs the patients took. Consistent with our fundamental approach, we classified the patients prescribed aminosalicylate into "patients in mild state" (IBDmild) and the patients prescribed anti-TNF antibody into "patients in severe state" (IBDsevere). Since E-TABM-118 includes very few patients in the severe state, we focused on differential expressed gene analysis with GSE6731 and GSE9452 using the "samr" package in R [39] between IBDmild and IBDsevere patients and data sets. Genes of FDR < 0.05 were selected (2,549 genes from GSE6731 and 17 genes from GSE9452) and mapped the genes to our PPI network produced by STRING8 resulting in 28 genes from GSE6731 (listed in Additional file 2: Table S4) and no genes from GSE 9452 mapped to the PPI network.

Results
Determination of sibling diseases for the mild and severe states of IBD As in autism spectrum disorder [18], we assumed genes common among the IBD network of sibling diseases would represent a molecular mechanism essential to all sibling diseases including IBD. The sibling diseases were determined as follows.
The World Gastroenterology Organization defines three categories of malignancy in IBD, mild, moderate, and severe states in its global guidelines for IBD treatment [4]. There is also clear categorization in drug indication to IBD according to the malignancy: aminosalicylate drugs for mild state patients, immunomodulator drugs and corticosteroid drugs for moderate state patients, and anti-TNF antibody drugs for severe state patients (Table 1). We then assumed the drugs provided the key to determine known sibling diseases for IBD. Table 1 shows diseases to which the same drugs as IBD are indicated in US, UK, and Japan. CD, UC, and RA are representative diseases treated with the drug of the mild state (aminosalicylate), while PS, AS, and BD are representative diseases treated using the drug of the severe state (anti-TNF antibody). Although the severe state also includes CD, UC, and RA, the three diseases were eliminated from the sibling group for the severe state (Table 1) to investigate the net difference between the mild and severe states. We therefore determined these diseases as the sibling diseases of IBD for its mild: IBDmild = {CD, UC, RA} and severe: IBDsevere = {PS, AS, BD} states (as used above). The moderate state was neglected from this study because we aimed at a comparison between the most extreme malignancy IBD states. Figure 1 shows the number of genes related to the two sibling diseases, IBDmild and IBDsevere. Genes for each sibling disease (1,264 for IBDmild and 869 for IBDsevere, Figure 1) were collected from the public databases: HugeNavigator, GeneCards, PharmGKB, and GAD. For each gene, g, we calculated the ranking score RFg by the frequency of the gene in databases and the frequency of the gene in PubMed according to equation (5). RFg was calculated for all genes associated with each sibling disease. We then determined a threshold score for each sibling disease used to exclude genes without significant evidence supporting their association with the sibling disease. To define the thresholds, we identified all genes in the two sibling disease gene sets which have verified association with IBD by direct literature review. We then scored all verified genes and chose the gene with lowest score in each sibling gene set (TNF for IBDmild and CRP for IBDsevere). Applying these thresholds to the RFg ranked list of sibling genes resulted in 250 and 253 genes for IBDmild and IBDsevere respectively. The intersection of the two sibling disease sets was 94 genes.

Collection of disease genes from public databases
All genes in the IBDmild and IBDsevere gene sets were tested for Protein-Protein Interactions (PPIs) with STRING8 [40] and those with at least one published experimental result and information from the PPI databases support were included in subsequent HDN analysis. A total of 146 (IBDmild) and 151 (IBDsevere) genes had at least one validated PPI relationship (Figure 1). We define the collective PPI networks as the "HDN" for IBD.

Significant genes in IBDmild and IBDsevere
Our goal is to identify significant genes representing differences between IBDmild and IBDsevere. We focused on drugs that treat either IBDmild or IBDsevere. Aminosalylate was the drug specific to IBDmild, and anti-TNF antibody was the drug specific to IBDsevere (Table 1). In a cell, aminosalylate inhibits PTGS1, PTGS 2, ALOX5, and PPARG. These four genes were defined as drug targets for IBDmild in this study. On the other hand, anti-TNF antibody inhibits TNF receptor 1A and 1B. These two genes were defined as drug targets for IBDsevere. We consider a gene to be functionally similar to these drug target genes as a significant gene potentially representing differences between IBDmild and IBDsevere. Such a functional similarity can be measured by a score provided by Gene Ontology (GO). There have been reports on application of the functional similarity score to the analysis of PPI networks [33][34][35]. We employed the GO-based similarity score (SSM in equation (4)) to our HDNs to highlight functional similarity of a gene to the drug targets. Figure 2 shows the HDNs whose nodes were selected by the functional similarity of a gene to the drug targets. A node scored less than 0.5 (upper quartile of the random distribution of SSM) were eliminated from the HDNs. A triangular node indicates a gene belongs to both HDNs of IBDmild and IBDsevere, while a circular node represents a gene belongs to either HDN of IBDmild or IBDsevere. Additional file 3: Table S2 lists the genes consist of the HDNs. The similarity score in this study showed a mean score of 0.40050, with a standard deviation of 0.13942, a maximum score of 0.86640, a minimum score of 0.02972, and a upper quartile of 0.50760 against IBDmild drug targets, and a mean score of 0.37570, with a standard deviation of 0.17189, a maximum score of 0.91200, and a minimum score of 0.02417, and a upper quartile of 0.50940 against IBDsevere drug targets. Table 2 shows the functional classification of genes in the HDNs of IBDmild and IBDsevere. Both HDNs consist of similar kinds of functional groups: i.e., inflammation, innate and acquired immune response, apoptosis, tumorigenesis, and tissue remodeling. However, the HDN of IBDsevere includes genes of tumorigenesis and apoptosis larger in number than the HDN of IBDmild (encircled by a green area in Figure 2(A) and 2(B)). The HDN of IBDmild also includes some genes of tumorigenesis and apoptosis, but the genes stay peripherally around a central gene group of inflammation and immunoregulation (encircled by a rose area). In contrast, the HDN of IBDsevere includes close interconnection among genes from the tumorigenesis and apoptosis group. This feature was not observed in the HDN of IBDmild.

Differences between the HDNs of IBDmild and IBDsevere
Both in the HDNs of IBDmild and IBDsevere, some genes of inflammation and immunoregulation have highly condensed interactions with surrounding genes. Interestingly, numbers of them were detected by the GWAS studies [8,9] (Additional file 4: Table S3). These genes (IL2, IL12B, IL23R, IFNG, and JAK2) indicate significant functionalities in the IBD pathogenesis. Besides, differentially expressed gene between IBDmild and IBDsevere are also included in the condensed interconnections of genes. Step 1) The number of genes retrieved from each public database Step 2) The sibling diseases and the number of genes for each state Step 3) The number of genes (nodes) in HDNs for IBD and selected genes by SSM score Figure 1 The number of genes at each step of the analysis. They are IFNG, IL4, IL17A, NFKB1, and TGFB1 (indicated by either blue or red arrow in Figure 2), selected by SAM statistics (FDR < 0.05) on gene expression data of GSE6731. While the genes of inflammation and immunoregulation were up-regulated in IBDmild (IFNG, IL4, and IL17A, indicated by blue arrows), the genes of tumorigenesis and apoptosis were up-regulated in IBDsevere (NFKB1 and TGFB1, indicated by red arrows).
The molecular picture of the progression of IBD IBD is characterized by the progression from early stage into chronic and more malignant states. The characteristic is widely used in the clinical treatment of IBD, but its molecular processes remains unclear. This study provides a molecular picture for the progression of IBD. The molecular picture tells us a difference between the mild and severe states, two extreme states in the progression of IBD. The biological details are described in "Discussion".

Discussion
Our knowledge driven HDN gene and molecular database systems approach consists of the following steps: 1) Determination of sibling diseases for IBD based on drug information, 2) Collection of IBD and sibling disease genes from multiple databases, 3) Scoring the disease genes by evidence-based ranking weighted by "the frequency in databases" and "the frequency in PubMed", 4) Evaluation of the disease genes for Protein-Protein Interaction relations, and 5) Investigation of GO-based functional similarity of drug targets to the putative IBD genes. We summarized the criteria for our selection of genes specific to IBDmild and IBDsevere in Table 3. Our results may lead to an elucidation for IBD pathogenesis that remains largely unknown. Sibling diseases, closely aligned to a complex disease such as IBD, provide a novel opportunity to use comprehensive omics data to identify a core biochemical or treatment pathway not previously identified in medical biology. If correctly defined, a collection of sibling diseases can 'cover' the entire pathophysiological process of a target disease more completely than any one disease. Wall et al. determined autism sibling diseases based on commonly involved genes [18]. We defined IBD sibling diseases based on drug indications. Drugs act on changing pathogenic states of a disease. Accordingly, it is highly likely that the sibling diseases share a common molecular mechanism. This approach may be generalizable if drugs acting on sibling disease states, are available across the developmental progression of a given central disease. Figure 2 indicates that the HDNs for IBD change according to the state of the disease progression. Our results show that JAK2 and STAT3, which are known to be associated with IBD [41], are specific to the HDN of IBDmild (Figure 2(A)) [10]. These genes form the central  IBDmild  immunoregulation  ICAM1, IL2, IL23A, IL23R, IL4, CRP, IL11,  IL2RA, IL4R, IRAK3, TYK2   innate immunity  CD14, TLR2, TLR4, TLR1, TLR6, IFNG, IL8   inflammation and immunoregulation  CCL2, IL12B, IL17, IL1A, IL1RN, NFKB1,  NFKBIA, CCL11, CCL26, IL10RA, IL18RAP   inflammation and apoptosis  IL1B, IRF5, TNFSF1A, IL15, JAK2,  part of the HDN of IBDmild and have links to proinflammatory genes (IL1A/1B, IL12B, IL17A, and IL23A/ R), which are common in both HDNs. These suggest that JAK2 and STAT3 are key factors in the early stage of IBD pathology. On the other hand, the HDN of IBDsevere (Figure 2(B)) indicates a relation of IBD to more malignant diseases like cancer. Patients with long-standing IBD have an increased risk of developing colorectal cancer [42]. Such a transition in functional classes of genes was also observed with differential expressed genes between IBDmild and IBDsevere obtained by GSE6731 (nodes with arrows in Figure 2). The HDNs of IBDmild and IBDsevere include 13 genes obtained by GWAS reported in [8,9]. The 13 genes are highlighted by colored borders in orange in Figure 2 and listed in boldfaced in Additional file 3: Table S2. Most of the genes are interconnected with each other except ICAM1 and IRF5. All interconnected genes belong to the early response of inflammation, which includes cytokines, chemokines, receptors, and cellular signaling molecules. The other two genes, ICAM1 and IRF5, belong to the late response of inflammation, i.e., enhancement of immune response. Our HDNs illustrate what molecules intervene between the two sibling disease pathways. Both in the HNDs of IBDmild and IBDsevere, TP53 and NFKB1/ NFKBIA are common to both disease states, indicating that the transcriptional regulation intervenes between the early (IBDmild) and late (IBDsevere) responses. In this way, our HDN analysis and approach helps to clarify the molecular and therefore disease implications of GWAS candidate genes.
A recent comprehensive review of molecular pathways for IBD pathogenesis [10] supports characteristic genes in our HDNs of IBDmild and IBDsevere. The genes characteristic of the HDN of IBDmild (Jak2, Stat3, and IL23) belong to Th17-cell differentiation in [10], and the genes characteristic of the HDN of IBDsevere (IL12, IFN-gamma, IL18 and FASLG) belong to Th1-cell driven responses in [10,36]. The physiological balance between Th1 and Th17 may be deteriorated by environmental factors such as intestinal bacteria stress, which eventually leads to autoimmune responses composing IBD ("hygiene hypothesis" introduced by Strachan [43]). Our HDNs suggest a transition from Th17 to Th1 dominancy along with progression of malignancy [44]. Our approach stratifying the disease-related genes into IBDmild and IBDsevere enables us to infer a clinically significant transition of a state of a disease such as the Th1/ Th17 transition.

Conclusions
In this study, we employed a knowledge driven human disease systems approach to analyze IBD, whose pathogenesis remains largely unknown. Based on drug indications for IBD, we determined two sibling disease states of IBD (mild and severe). After ranking the genes by the frequency of the records, we obtained 250 and 253 genes for IBDmild and IBDsevere, respectively. We calculated functional similarities of these genes with IBD drug targets and drew their interactions as PPI networks we later defined as the two sibling disease HDNs. The HDNs revealed biological and clinical insights into the molecular differences between IBDmild and IBDsevere. The results demonstrated that knowledge annotation of sibling disease HDNs with focus on high similarity genes is an effective approach to identify common genes and pathways important to the complex disease network.

Additional material
Additional file 1: Pseudo-code for SSM calculation.
Additional file 2: Table S4. Differential expressed genes between IBDmild and IBDsevere. Resulted 28 genes from GSE6731: differentially expressed between IBDmild and IBDsevere (FDR < 0.05) and mapped to our PPI network produced by STRING8.
Additional file 3: Table S2. Genes consisting of the HDNs. Genes consisting of the HDNs of Figure 2 are listed in this Table. These genes were selected when being functionally similar to the drug targets of each state (SSM > 0.5), as well as when having PPI connections among themselves. Accordingly, the genes "common to both HDNs" are different between IBDmild and IBDsevere, because SSM scoring is different between the HDNs. Genes are boldfaced when they are susceptibility loci for IBD detected by GWAS [8,9]. Genes are underlined when they are selected as differential expressed genes by SAM statistics (FDR < 0.05) [39] with GSE6731. Abbreviations for gene names are lilsted in Additional file 5: Table S1.
Additional file 5: Table S1. Abbreviations for gene names. All the abbreviations for gene names used in this manuscript are listed. (2) Rank the genes by the frequency of the IBD-association in the databases (equation (5)). Cut off the genes falling short of the lowest frequency of a gene validated its disease association by literatures (TNF for IBDmild and CRP for IBDsevere).