Disentangling the genetics of sarcopenia: prioritization of NUDT3 and KLF5 as genes for lean mass & HLA-DQB1-AS1 for hand grip strength with the associated enhancing SNPs & a scoring system

Background Sarcopenia is a skeletal muscle disease of clinical importance that occurs commonly in old age and in various disease sub-categories. Widening the scope of knowledge of the genetics of muscle mass and strength is important because it may allow to identify patients with an increased risk to develop a specific musculoskeletal disease or condition such as sarcopenia based on genetic markers. Methods We used bioinformatics tools to identify gene loci responsible for regulating muscle strength and lean mass, which can then be a target for downstream lab experimentation validation. Single nuclear polymorphisms (SNPs) associated with various disease traits of muscles and specific genes were chosen according to their muscle phenotype association p-value, as traditionally done in Genome Wide Association Studies, GWAS. We’ve developed and applied a combination of expression quantitative trait loci (eQTLs) and GWAS summary information, to prioritize causative SNP and point out the unique genes associated in the tissues of interest (muscle). Results We found NUDT3 and KLF5 for lean mass and HLA-DQB1-AS1 for hand grip strength as candidate genes to target for these phenotypes. The associated regulatory SNPs are rs464553, rs1028883 and rs3129753 respectively. Conclusion Transcriptome Wide Association Studies, TWAS, approaches of combining GWAS and eQTL summary statistics proved helpful in statistically prioritizing genes and their associated SNPs for the disease phenotype of study, in this case, Sarcopenia. Potentially regulatory SNPs associated with these genes, and the genes further prioritized by a scoring system, can be then wet lab verified, depending on the phenotype it is hypothesized to affect.


Background
Many diseases known to man originate from more than one genetic locus. Sarcopenia for example is multifactorial [1] degenerative loss of skeletal muscle mass, a condition that might pose a great risk for the aging world population. Since 2006, GWAS have allowed us to trace the multiple genetic factors for various traits using statistical tools that can lead to a more effective research of specific loci of interest [2]. The data produced by these studies, which now rank in the thousands, is available online so further downstream research can be conducted, and new results can be incorporated. This is indeed valuable, since musculoskeletal diseases are one of the leading causes of disability in the world [3]; treatment of these diseases costs the world medical industry around 125 billion dollars annually [3].
In this paper, we present the combination of summary level data from GWAS and publicly available eQTLs such as those from studies by GTEx [4] and Westra et al. [5]. Based on available data and our approach of combining phenotype-associated SNPs (Single Nucleotide Polymorphism) and tissue-relevant gene-associated SNPs TADs (topologically associated domains) were plotted at the regions of interest. A TAD is a self-interacting genomic region, meaning that DNA sequences within a TAD physically interact with each other more frequently than with sequences outside the TAD [6].
We developed and applied a combination of expression quantitative trait loci (eQTLs) and GWAS summary information, to prioritize causative SNP and point out the unique genes associated in the tissues of interest (muscle).

Method
The results of GWAS for lean mass (LM) and hand grip strength (HG) were published in studies by Karasik et al. [7], Zilikens et al. [8], Willems et al. [9] and Tikkanen et al. [10], in various large human populations. According to consensus in the literature, we used the genome-wide significance threshold of 5*10 − 8 to consider SNPs to be associated for a follow-up. The summary of eQTL data was obtained by studies by Westra et al. [5] and by the GTEx [4] consortium. From the study of Westra et al. eQTLs of HSMM (Human skeletal muscle myoblasts) culture were obtains, while the GTEx consortium EQTLs were from human striated muscle samples. With these data sets, we executed "Summary-data-based Mendelian randomization" (SMR) analysis using the method as proposed by Zhu et al. [11] and utilizing the "SMR tool" program, version 0.710. We chose not to investigate the SNPs for further categorization of pleiotropy or causality, which can be done using the Heidi test [11]. For the case of GTEx eQTL summary data, the execution was done for all tissues, and then we observed for genes which were enriched specifically in skeletal muscle tissue or specifically compared to the aggregate of all other tissue types. For the genes of interest as described in the above method, we went on to plot and examine TADs at the relevant regions in corresponding skeletal muscle tissues such as the psoas (striated) and bladder (smooth muscle) as done by Schmitt et al. [12] Results GTEx tissue analysis found that for lean mass, two genes: NUDT3 and KLF5, were enriched in skeletal muscles (Figs. 1, 2), and they were also found in Fig. 1 Gene expression based on SMR of GTEx eQTLs and whole-body lean mass GWAS Westra et al. [5] eQTL analysis for appendicular lean mass (Table 1), although not in whole body in the case of Westra et al. [5] study (Table 2). Venn diagram in Fig. 1 is derived from Tables 3 and 4; Venn diagram in Fig. 2 is derived from Tables 5 and 6. In the GTEx tissue analysis for the hand grip trait, we found one gene, HLA-DQB1-AS1, which was specifically enriched in skeletal tissue compared to other tissues ( Fig. 3), with the associated SNP as rs3129753. The Venn diagram in Fig. 3 is derived from Tables 7 and 8. Many other genes found to be enriched in skeletal muscle tissues and other tissues in common intersection (Figs. 1, 2 and 3) were also found in Westra eQTL analysis with our GWAS summary dataset. The second priority should be given to the genes found to be enriched in skeletal muscle tissue as well as any other tissue. Clearly, NUDT3 and KLF5 are very strong candidate genes for lean mass study, and their associated regulating SNP are rs464553 and rs1028883 respectively. TAD plots for the psoas and bladder tissues (which are skeletal and smooth muscle types, respectively) were plotted (Figs. 4, 5, 6 and 7) where KLF5 is seen to be present within a FIRE [12] (frequently interacting region) within the TAD of chromosome 13 in bladder (M.Detrusor) muscle (Fig. 4).

Discussion
Apart from the combination of GWAS and eQTL summaries for the tissue of interest and searching for exclusivity of gene enhancement and the presence of SNP regulation near TAD boundaries, we also incorporated a novel scoring system to prioritize genes for functional validation of our results. Functional validation is a slow and costly process. Validation can take months and even years to complete without a promise of a positive result, hence a scoring system is vital   for scrutinizing and grading our results to asses which gene might have an effect on muscle health. The process of functional validation is vital for a few reasons. For one, relying on TADs has its limitations. Chromosomes separate active and inactive chromatin into compartments A and B, respectively where compartment A correlates with high gene expression, active histone marks, and early replication timing, whereas the compartment B replicates late, is enriched with repressive histone modifications and has low gene expression. Compartments can be further subdivided into megabase-sized genomic regions known as topologically associating domains (TADs) [13,14]. the function of TADs is not fully understood yet, although disrupting the TADs e.g. because of SNPs or InDels (insertions deletions) may result in the establishment of novel inter-TAD interactions. These have been shown to be associated with misexpression of Hox genes [15], up-regulation of protooncogenes [16], and developmental disorders [14].
Furthermore, functional validation might also allow us to identify drugs that affect muscle in ways unknown before and therefore to reposition existing drugs to other uses, in accordance to their newly found target such as targeted gene therapy as discussed in context of next generation sequencing in drug development [17]. This serves two purposes, first is validation of the scoring system itself as an algorithm for GWAS result validation, and the more important one is validation of new targets for further research and potentially, repositioning of existing drugs. Our approach has its limitations and requires validation, as one can observe from the results. In spite the fact that TADs that were plotted by our approach of combining phenotype-associated SNPs and tissuerelevant gene-associated SNPs show that the genes of    interest are located within a frequently interacting region, the rest of the data regarding these genes doesn't support our hypothesis that they in fact have an effect on muscle health. NUDT3 (Nudix Hydrolase 3) for example, codes for the Nudix protein which act as homeostatic checkpoints at important stages in nucleoside phosphate metabolic pathways, guarding against elevated levels of potentially dangerous intermediates [18]. GWAS associate RSP-NUDT3 readthrough to BMI with a p-value of 4*10 − 12 [19]. The Malacards database also associates NUDT3 with hyperinsulinism and obesity in   specific populations [20]. KLF5 (Kruppel Like Factor 5), encodes a member of the Kruppel-like factor subfamily of zinc finger proteins. The encoded protein is a transcriptional activator that binds directly to a specific recognition motif in the promoters of target genes. This protein acts downstream of multiple different signaling pathways and is regulated by post-translational modification [21]. GWAS catalogue doesn't relate this gene to muscle health phenotypes [19], the same is true for the Malacards [20] database. In contrast, the STRING database [22] finds relation between NUDT3 and ACTA2 (Actin Alpha 2) and GSK3B (Glycogen Synthase Kinase 3 Beta) which are related to actin production and energy metabolism respectively [21]. HLA-DQB1-AS1 (HLA-DQB1 Antisense RNA 1) is an RNA Gene and is affiliated with the lncRNA (Long non-coding RNA) class is related to malignant diseases and doesn't seem to be associated with muscle wasting disorders according to MalaCards [20]. The above information emphasizes that these genes are not directly related to muscle health, yet they may have some indirect regulatory role in defining it. Functional validation is vital in the process of confirming or debunking the hypothesis that these genes are associated with muscle health. We suggest that the above genes be scrutinized using a scoring system as described below, for prioritizing candidate genes for functional validation which will be done by knocking out the gene in C2C12 mouse myoblast cell, assessing gene expression using RT-qPCR and comparing cell morphology to the morphology wild type C2C12 cells. The scoring system has been briefly stated in the form of abstract [23] earlier and we mention it in detail here. The following constitutes the scoring system proposed for functional validation of our results: Potential genes were obtained from the work of Zillikens et al. [8], Karasik et al. [7] for LM and for HG, Willems et al. [9] and Tikkanen et al. [10]. Genes provided by Karasik   datasets were scrutinized, and genes in proximity of SNPs were scaled according to a specifically developed scoring system which utilized the following publicly available databases: Malacards [20], COXPRESdb [24] gene co-expression database, PubMed search engine, Ensembl database [25], the mouse genome informatics database [26], HaploReg [27] and the LDlink [28] database. The above functional validation method combined with our approach to gene prioritization might help in Identifying new loci responsible for LM or HG, and thus identifying new genetic markers for sarcopenia. This approach may also be used by the pharmaceutical industry to identify targets for new pharmaceutical products or reposition existing drugs in accordance to new data on the activity of these drugs. The greatest hurdle with drug repositioning today is lack of solid databases needed to produce good results [24]. Functional validation of the results presented in this study results can serve as a test to whether our approach to gene prioritization can resolve this problem.

Conclusions
The current work focused primarily on the combined bioinformatic approaches using GWAS and eQTLs for SMR. The results of exclusivity of the tissues of interest were further classified for their importance based on Venn diagrams and their corresponding TAD plots to look for the TAD boundaries where the associated regulating SNPs could be localized. NUDT3 and KLF5 for lean mass and HLA-DQB1-AS1 for hand grip strength and their associated SNPs (rs464553, rs1028883 and rs3129753) had the highest priority as candidate targets for further study.
One limitation of this study is that the eQTL analysis was not done on trans-association SNPs. Another is the limited knowledge on TAD function. We propose functional validation by a method of gene knock out in C2C12 mouse myoblast cells to either