Disentangling the Genetics of Sarcopenia: prioritization of NUDT3 and KLF5 as genes for lean mass and HLA-DQB1-AS1 for hand grip strength based on associated SNPs

Background: Sarcopenia is a skeletal muscle disease of clinical importance that occurs commonly in old age and in various disease sub-categories. Widening the scope of knowledge of the genetics of muscle mass and strength is important because may allows us to identify new genetic markers or identify patients with an increased risk to develop a specific musculoskeletal diseases or condition such as sarcopenia. It might also allow us to identify drugs that affect muscle in ways unknown before and therefore to reposition drugs to other uses, in accordance to their newly found target. We used bioinformatics tools to identify gene loci responsible for regulating muscle strength and lean muscle mass, which can then be a target for downstream lab experimentation validation. SNPs associated with various disease traits for the muscles and specific loci were chosen according to their muscle phenotype association p-value, as traditionally done in the GWASs. We developed and applied a combination of expression quantitative trait loci study (eQTLs) and GWAS summary information, to prioritize causative SNP and point out the unique genes associated in the tissues of interest (muscle). Results: We found NUDT3 and KLF5 for lean mass and HLA-DQB1-AS1 for hand grip strength as candidate genes to target for these phenotypes. The associated regulatory SNPs are rs464553, rs1028883 and rs3129753 respectively. Conclusion: TWAS approaches of combining GWAS and eQTL summary statistics proved helpful in statistically prioritizing genes and their associated SNPs for the disease phenotype of study, here, Sarcopenia. Potentially regulatory SNPs associated with these genes can be analyzed with respect to TADs and then targeted for knock out in either C2C12 mouse myoblast cells, adipocytes or any other relevant cell, depending on the phenotype it is hypothesized to affect, as a downstream experimental validation plan.


Background
Many diseases known to man originate from more than one genetic locus. A GWAS (genome-wide association study) of lean muscle mass (LM) or hand grip strength (HG) was done in the studies by Karasik et al. [1], Zilikens et al. [2], Willems et al. [3] and Tikkanen et al. [4], in various large human populations. Various sample GWAS plots for LM and HG are also provided at the author's webpage (www.tinyurl.com/abinarain and then navigate to "Educational Stuffs"). As such, genetic inheritance studies, although outlining genetic architecture, were unable to determine the specific loci responsible for these diseases [5].
Since 2006, GWASs have allowed us to trace the multiple genetic factors for such diseases using statistical tools that can lead to a more effective research of specific locus of interest [6]. The data produced by these studies, which now rank in the thousands, are available online so further downstream research can be conducted, and new results can be incorporated. This is indeed valuable, since, for example, musculoskeletal diseases are one of the leading causes of disability in the world [7]; treatment of these diseases costs the world medical industry around 125 billion dollars annually [7].
As shown in the scientific literature [8], muscle strength, hand grip (HG) in our case, and lean muscle mass (LM) are two of the most important factors that can predict muscle health. These are especially important in the context of sarcopenia, as they allow evaluating its progression. Furthermore, research has shown that with age these two factors tend to dissociate and thus need to be measured independently, as one cannot predict the other [8]. The discovery of new therapeutic targets which might increase LM or HG or both may be useful in the management or treatment of otherwise incurable diseases.
Previous studies have identified potential targets for research [9]. These targets were overlaid with eQTL and GWAS p values. These p values can later be added to data 4 collected from other reliable online databases and summed into a grading system (scale), which can allow us to choose the most relevant genes to be knocked out, for example, in the mouse C2C12 myoblast cell line using the siRNA technique. However, for this paper, we present only the combination of summary level data from GWASs and publicly available eQTLs such as those from studies by GTEx [21] and Westra et. al. [20], and then plot the topologically active domains (TADs), at the loci of interest generated by this combination approach of phenotype-associated SNPs (Single Nucleotide Polymorphism) and tissuerelevant gene-associated SNPs. A TAD is a self-interacting genomic region, meaning that DNA sequences within a TAD physically interact with each other more frequently than with sequences outside the TAD [22]. The function of TADs is not fully understood yet, although disrupting the TADs e.g. because of SNPs or InDels (insertions deletions) can lead to chromosomal structure disruption and could lead to diseases by affecting gene regulation.

Method
The results of GWAS for lean mass (LM) and hand grip strength (HG) was published in studies by Karasik et al,. [1], Zilikens et al. [2], Willems et al. [3] and Tikkanen et al. [4], in various large human populations. The summary of eQTL data was obtained by studies by Westra et. al. [20] and by the GTEx21 consortium for various tissues. With these data, we executed "Summary-data-based Mendelian randomization" (SMR) [23] analysis. We chose not to investigate the SNPs for further categorization of pleiotropy or causality, which can be done using the Heidi test. For the case of GTEx eQTL summary data, the execution was done for all tissues, and then we observed for genes which were enriched specifically in skeletal muscle tissue or specifically compared to the aggregate of all other tissue types.
For the genes of interest as described in the analysis section, we went on to examine the TAD plots for the corresponding skeletal muscle tissues such as the psoas and bladder from the study of Schmitt et. al. [24] Results Our GTEx tissue analysis found that for lean mass, two genes, NUDT3 and KLF5, were enriched in skeletal muscles ( Figure 1, 2), and they were also found in Westra et. al. [20] (Westra) eQTL analysis ( Figure 4). In the GTEx tissue analysis for the hand grip trait, we found one gene, HLA-DQB1-AS1, which was specifically enriched in skeletal tissue compared to other tissues ( Figure 12), with the associated SNP as rs3129753. Many other genes found to be enriched in skeletal muscle tissues and other tissues in common intersection as in Figures 1, 2, 12, were also found in Westra eQTL analysis with our GWAS summary dataset. We propose that those genes that are found to be exclusively enriched only in skeletal muscle tissues by both GTEx and Westra eQTL summary dataset combined with our GWAS summary dataset using the Mendelian randomization approach be given the first priority for further experimental validation, and least priority should be given to those genes which are found to be not enriched in skeletal muscle tissue in either the Westra or GTEx analysis. The second priority should be given to the genes found to be enriched in skeletal muscle tissue as well as any other tissue. Clearly, NUDT3 and KLF5 are very strong candidate genes for lean mass study, and their associated regulating SNP are rs464553 and rs1028883 respectively. Most importantly, the associative SNPs for these genes are also listed in figures 3-11, and RT-qPCR (Reverse Transcriptase quantitative Polymerase Chain Reaction) can be possibly performed on skeletal muscle samples from people with and without these SNP genotypes, to confirm the enrichment of these genes.
Chromosomes separate active and inactive chromatin into compartments A and B, respectively where compartment A correlates with high gene expression, active histone marks, and early replication timing, whereas the compartment B replicates late, is enriched with repressive histone modifications and has low gene expression. 6 Compartments can be further subdivided into megabase-sized genomic regions known as topologically associating domains (TADs) [25] [26], which act as regulatory scaffolds and are demarcated by binding sites of the architectural protein CTCF. Disruption of TAD boundaries results in the establishment of novel inter-TAD interactions. These have been shown to be associated with misexpression of Hox genes [27], up-regulation of protooncogenes [28], and developmental disorders [29]. TAD plots for the psoas and bladder tissues (which are skeletal and smooth muscle types, respectively) were also plotted and one of them is shown as Figure 13 for KLF5 bladder tissue. Clearly, these sites can be further investigated for their corresponding SNP positions and whether they are in TAD boundaries or not by knock down of the SNPs and observing for any disruption in cellular activity. TAD plots for other relevant genes using raw and normalized data can be found at the author's webpage www.tinyurl.com/abinarain and then navigating to the "Educational Section".

Discussion
Apart from the combination of GWAS and eQTL summaries for the tissue of interest and searching for exclusivity of enhancement of gene and the presence of SNP regulation near TAD boundaries, as discussed above, we also incorporated other factors, such as gene score for relevant diseases, phenotype also known in mice to be tested, coexpression of the gene in question in relevant databases, known epigenetic factors, and medical implications for known drugs. Though the results for this study will still need time to be substantiated for publication, we feel that it would make sense to state the method for the sake of completeness.
Potential genes were obtained from the work of Zillikens [2], Karasik [1] for LM and for HG, Willems [3] and Tikkanen [10]. Genes provided by Karasik, Zillikens and Willems were graded as first tier genes, while genes provided by Tikkanen were graded as second tier 7 genes. The reason behind this was that the Tikkanen research [10] was published at the end of the year 2018, while the database was already being collected. The list of SNPs was mined with cis Expression quantitative trait loci analysis (eQTLs) for transcripts within 2 Mb of the SNP position was carried out as described by Zillikens et al [2]. Other similar datasets were scrutinized, and genes in proximity of SNPs were scaled according to a specifically developed scoring system with the following components: GWASs, eQTLs, Malacards, OMIM, TAD and epigenetic data.
For GWASs: 1 point for each significant P value for a locus in any published muscle-related GWAS. For eQTLs, 1 point if the locus has a muscle-specific significant P value (is an eQTL), and 0 if not. Data were provided by three eQTL studies [2,3,10]. Scoring was as follows: 1 point for each time the gene is expressed in relevant tissues (muscle, tendon or the tibial nerve). Data were also extracted from the Ensembl data base (build 95) [11].
One point for was given for each relevant disease the gene was associated with according to "Malacards 1" version 4.9.0.20 and OMIM. The Malacards scoring was on a scale of 0 to infinity. For this study, each disease with a score of more than 3 received one point in the score. OMIM was used for verification of the Malacards data in mice. One point was given for each time the gene was associated with a muscle-related phenotype in mice, in accordance with the mice genome informatics database version 6.3.0.14. One point was subtracted each time the gene was shown to be co-expressed with another gene that has a relevant effect or phenotype according to the COXPRESdb gene co-expression database (version 7.1) [15] The Malacards relevance score algorithm can be found on this webpage: https://www.malacards.org/pages/searchguide available datasets for Homo sapiens). Co-expression reduces the chance that a specific gene is responsible for a given trait; thus a point was subtracted. Genes of interest were 8 also examined in relation to the active TAD to which they are mapped, to reveal whether a gene is in euchromatin and whether it interacts with other genes which might carry out a similar function or have a similar phenotype as the gene of interest. TADs were analyzed from samples of psoas muscle (striated) and bladder muscle (m. detrusor, smooth muscle) For epigenetic data: 1 point was awarded each time the SNP appeared to be an enhancer in a relevant tissue, according to HaploReg [16].
Proxies for every variant in the SNP table was extracted using Ldlink [17], the output was filtered according to the following criteria: LD R² = 0.8, location within the region of the gene of interest and a score of 5 or less according to RegulomDB [18], which means that the proxies had sufficient data validating their annotation.
In case two genes received the same score, biological data will be used to determine which one is more relevant to the study. Relevant loci will be then knocked out in the C2C12 mouse myoblast cell line. Identifying new loci responsible for LM or HG may be used by the pharmaceutical industry as new targets for pharmaceutical products or repositioning of existing drugs in accordance to new data on the activity of these drugs.
The greatest hurdle with drug repositioning today is lack of solid databases needed to conduct efficient repositioning [9]. This study of future can serve as a test to whether this approach to gene prioritization can resolve this problem.

Conclusions
The current work focused primarily on the combined bioinformatic approaches using GWASs and eQTLs for summary-data based Mendelian randomization (SMR). The results of exclusivity of the tissues of interest were further classified for their importance based on Venn diagrams and their corresponding TAD plots to look for the TAD boundaries where the associated regulating SNPs could be localized. NUDT3 and KLF5 for lean mass and HLA-DQB1-AS1 for hand grip strength and their associated SNPs (rs464553, rs1028883 and 9 rs3129753) had the highest priority as candidate targets for new or repositioned drugs.
We propose conducting RT-qPCR to further ascertain the association of enhancement of a gene for patients with a SNP genotype that was associated positively with gene enrichment in the current study. We also proposed further steps that can be taken to further prioritize candidate genes as targets for new drugs or for drug repurposing. One limitation of this study is that the eQTL analysis was not done on trans-association SNPs.  Venn-Diagram Genes found to be eQTL associated as well as GWAS associated (Lean Mass: meta-analysis of the whole body data) as by Summary based Mendelian Randomization approach for GTEx v7 skeletal muscular tissue and union set of other tissues. Clearly Genes NUDT3 and KLF5 seems to be exclusive enrichment targets for SNP association in skeletal muscle Venn-Diagram Genes found to be eQTL associated as well as GWAS associated (Lean Mass: meta-analysis of the appendicular data) as by Summary based Mendelian Randomization approach for GTEx v7 skeletal muscular tissue and union set of other tissues. Clearly Genes NUDT3 and KLF5 seem to be exclusive enrichment targets for SNP association in skeletal muscle tissues.       GTEx consortium, studies of eQTL compared with lean mass (appendicular) GWAS summary combined in Mendelian Randomization output to prioritize gene enrichment for corresponding SNPs for skeletal muscle only.  GTEx consortium, studies of eQTL compared with lean mass (whole body) GWAS summary combined in Mendelian Randomization output to prioritize gene enrichment for corresponding SNPs in skeletal muscle tissues (sample results).
19 Figure 10 GTEx consortium Handgrip GWAS data with eQTL from Skeletal muscle summary Mendelian randomization result