How to set the phenotype in GWAS analysis?

In my GWAS analysis, the phenotype of each animal is whether it died or survived. I also have the time of death for every animal that died. How should I set up the phenotype for the GWAS?

The simplest approach is to group the samples into surviving and dead animals, which gives you a binary phenotype.

A second approach is to use survival time (time to death) as an ordinal phenotype. If you want to include the surviving animals, you can assign them a large death-time value.

A third approach is to cluster the survival times (times to death) into the desired $n$ groups with a simple procedure such as $k$-means. This gives you a coarse-grained ordinal variable, and you can add the group of surviving animals at the end of the spectrum.
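The three encodings above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the death times are made-up numbers, survivors are marked with `None`, the placeholder value `100` for survivors is arbitrary, and the small one-dimensional $k$-means routine (`kmeans_1d`, a hypothetical helper) stands in for whatever clustering tool you prefer.

```python
def binary_phenotype(death_times):
    """1 = died, 0 = survived (death time is None for survivors)."""
    return [0 if t is None else 1 for t in death_times]

def time_phenotype(death_times, censor_value):
    """Time to death, with survivors assigned a large placeholder value."""
    return [censor_value if t is None else t for t in death_times]

def kmeans_1d(values, k, n_iter=100):
    """Plain Lloyd's k-means on scalars; returns the sorted cluster centres."""
    srt = sorted(values)
    # Start from evenly spaced order statistics (an arbitrary choice).
    centres = [srt[(2 * j + 1) * len(srt) // (2 * k)] for j in range(k)]
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centres[j]))
            groups[nearest].append(v)
        new = [sum(g) / len(g) if g else c for g, c in zip(groups, centres)]
        if new == centres:
            break
        centres = new
    return sorted(centres)

def ordinal_phenotype(death_times, k):
    """Classes 0..k-1 by survival-time cluster; survivors get class k."""
    observed = [t for t in death_times if t is not None]
    centres = kmeans_1d(observed, k)

    def label(t):
        if t is None:
            return k
        return min(range(k), key=lambda j: abs(t - centres[j]))

    return [label(t) for t in death_times]

# Made-up example: seven animals died, two survived.
death_times = [2, 3, 2.5, 10, 11, 30, 28, None, None]
print(binary_phenotype(death_times))
print(time_phenotype(death_times, censor_value=100))
print(ordinal_phenotype(death_times, k=3))
```

Note that the placeholder value only makes sense if the phenotype is then analysed by rank or class; its magnitude carries no information.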

12.9 - Statistical Testing in GWAS

In GWAS studies, a test is usually done for every SNP. Several tests are available.

In the simplest case, we have a categorical phenotype with two categories. Together with the 3 genotypes, this creates a 2x3 table. The counts in the table are the numbers of samples in the study with a particular genotype and phenotype combination.

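\[
\begin{array}{c|ccc|c}
 & \text{Genotype 1} & \text{Genotype 2} & \text{Genotype 3} & \text{Row total} \\ \hline
\text{Phenotype 1} & N_{11} & N_{12} & N_{13} & R_1 \\
\text{Phenotype 2} & N_{21} & N_{22} & N_{23} & R_2
\end{array}
\]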

Assuming that the samples are independent (e.g. they are not related), there is no population structure and no covariates, then Fisher's exact test or a chi-squared test can be done to determine if the phenotype is associated with the genotype.
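As a rough illustration of the chi-squared test on such a 2x3 table, here is a minimal sketch in plain Python. The counts are invented for the example, and the function name `chi2_2x3` is hypothetical. It relies on the fact that a chi-squared variable with 2 degrees of freedom has survival function $e^{-x/2}$, so no statistics library is needed for this table shape.

```python
import math

def chi2_2x3(table):
    """Pearson chi-squared test for a 2x3 phenotype-by-genotype count table.

    Returns (statistic, p_value). The p-value formula exp(-x / 2) is exact
    for 2 degrees of freedom, which is what a 2x3 table has.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat, math.exp(-stat / 2)

# Invented counts: rows are the two phenotypes, columns the three genotypes.
counts = [[30, 50, 20],
          [20, 50, 30]]
stat, p_value = chi2_2x3(counts)
print(stat, p_value)
```

The sketch assumes every expected count is positive; with small expected counts, Fisher's exact test is the usual alternative.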

Another commonly used test (again for independent samples and no population structure) is the Cochran-Armitage test, whose statistic is

\[ T = \sum_{i=1}^{3} t_i \left( N_{1i}R_2 - N_{2i}R_1 \right). \]

The term \(N_{1i}R_2 - N_{2i}R_1\) essentially takes the difference in counts between the rows, after reweighting to equalize the row totals. (To see this, note that \(\sum_{i=1}^{3} N_{1i}R_2 = \sum_{i=1}^{3} N_{2i}R_1\).) The weights \(t_i\) are selected depending on the pattern you want to test for. E.g. if you hypothesize that the A allele is dominant then the weights are \(t_1 = t_2 = 1, t_3 = 0\). If you hypothesize that the effects of A and a are additive, then the weights are \(t_1 = 1, t_2 = 2, t_3 = 3\). Other patterns are also possible and can be tested using different weights.
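The weighted sum of the terms \(N_{1i}R_2 - N_{2i}R_1\) can be computed directly. The sketch below is an illustration, not the notes' own code: the counts are invented, the function name is hypothetical, and the variance used to standardize the statistic is the usual textbook expression \(R_1R_2\,(N\sum t_i^2 C_i - (\sum t_i C_i)^2)/N\) with column totals \(C_i\), which is an addition not stated in the text above.

```python
import math

def cochran_armitage(table, weights):
    """Cochran-Armitage trend test for a 2xk table.

    table   : two rows of genotype counts (one row per phenotype)
    weights : the t_i, one per genotype column
    Returns (T, z, two_sided_p).
    """
    n1, n2 = table
    r1, r2 = sum(n1), sum(n2)
    n = r1 + r2
    cols = [a + b for a, b in zip(n1, n2)]
    # T = sum_i t_i * (N_1i * R_2 - N_2i * R_1)
    t_stat = sum(t * (a * r2 - b * r1) for t, a, b in zip(weights, n1, n2))
    s1 = sum(t * c for t, c in zip(weights, cols))
    s2 = sum(t * t * c for t, c in zip(weights, cols))
    var = r1 * r2 * (n * s2 - s1 * s1) / n  # zero if all weights are equal
    z = t_stat / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2))    # two-sided normal p-value
    return t_stat, z, p

counts = [[30, 50, 20],
          [20, 50, 30]]                      # invented counts
additive = cochran_armitage(counts, (1, 2, 3))
shifted = cochran_armitage(counts, (0, 1, 2))
dominant = cochran_armitage(counts, (1, 1, 0))
print(additive, shifted, dominant)
```

Because the terms \(N_{1i}R_2 - N_{2i}R_1\) sum to zero, adding a constant to every weight leaves the result unchanged, so the weights (1, 2, 3) and (0, 1, 2) give the same test.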

When the samples are related, or there is population structure or there are environmental covariates, regression models are more flexible than models for tables. For binary traits as in the table above, we can use logistic regression to model the probability of one of the phenotypes (compared to the other), which provides a very flexible framework similar to the linear model. When the trait is quantitative, ordinary linear models can be used. The genotype can be treated as categorical (using indicator variables as the predictors) or ordinal (using 0, 1, 2 as numerical values).

The best software I am aware of for GWAS studies is PLINK. Although PLINK is stand-alone software, the authors also provide a link to R called Rplinkseq. The authors state: "Rplinkseq is an R package that allows access to PLINK/Seq projects directly from R, so that R's rich set of statistical and visualisation tools can be utilised." PLINK can handle haplotyping, filtering and all of the currently popular models for GWAS analysis. However, data management such as filtering, selecting samples or features, etc. is probably best done in R.

One problem in GWAS studies is that multiple testing has not been entirely worked out. This is because the multiple testing methods that we know work require independence among the tests. However, because of LD, if you use a dense set of SNPs, the correlations among the tests can be high. Haplotyping can combine multiple SNPs into a smaller number of more complex genotypes (with possibly more than 2 alleles), which usually improves the analysis by giving higher association with the phenotype, fewer features to compare and reduced LD among features. In QTL studies, the genotypes are assumed to be markers of the causal loci, rather than being causal themselves. This takes advantage of LD, as markers more correlated with the causal regions should have stronger association with the phenotype.

Researchers take advantage of the correlation among the p-values and plot -log10(p-value) against the physical distance on the chromosome in a "Manhattan plot". The x-axis of this plot shows the chromosomal position of each feature within each chromosome, ordered by chromosome number (and usually color coded so that it is easy to see which features are in which chromosome). The y-axis shows the transformed p-values. Since the smallest p-values are of interest, the y-axis is usually -log10(p-value), which emphasizes the small values. "Real" QTLs are assumed to be indicated by a local peak of small p-values.
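To make the axes of a Manhattan plot concrete, here is a minimal sketch that turns per-SNP results into plotting coordinates. The input values are invented, the function name is hypothetical, and each chromosome's length is approximated by its largest observed position, which is a simplifying assumption rather than a property of any particular plotting package.

```python
import math

def manhattan_coordinates(results):
    """results: iterable of (chromosome_number, position_bp, p_value).

    Returns a list of (x, y, chromosome) ordered along the genome, where
    x is a cumulative genomic position and y is -log10(p_value).
    """
    results = sorted(results)
    # Approximate each chromosome's length by its largest observed position.
    chrom_len = {}
    for chrom, pos, _ in results:
        chrom_len[chrom] = max(chrom_len.get(chrom, 0), pos)
    # Offset each chromosome so that chromosomes sit side by side on the x-axis.
    offsets, running = {}, 0
    for chrom in sorted(chrom_len):
        offsets[chrom] = running
        running += chrom_len[chrom]
    return [(offsets[c] + pos, -math.log10(p), c) for c, pos, p in results]

# Invented results for four SNPs on two chromosomes.
snps = [(2, 50, 0.01), (1, 300, 1e-8), (1, 100, 0.5), (2, 200, 1e-3)]
points = manhattan_coordinates(snps)
for x, y, chrom in points:
    print(chrom, x, round(y, 3))
```

The third element of each tuple can be used to pick the color for each chromosome when the points are passed to a plotting library.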


Genome-wide association studies (GWAS) identify genetic variants associated with traits or diseases. GWAS never directly link variants to regulatory mechanisms. Instead, the functional annotation of variants is typically inferred by post hoc analyses. A specific class of deep learning-based methods allows for the prediction of regulatory effects per variant on several cell type-specific chromatin features. We here describe “DeepWAS”, a new approach that integrates these regulatory effect predictions of single variants into a multivariate GWAS setting. Thereby, single variants associated with a trait or disease are directly coupled to their impact on a chromatin feature in a cell type. Up to 61 regulatory SNPs, called dSNPs, were associated with multiple sclerosis (MS, 4,888 cases and 10,395 controls), major depressive disorder (MDD, 1,475 cases and 2,144 controls), and height (5,974 individuals). These variants were mainly non-coding and reached at least nominal significance in classical GWAS. The prediction accuracy was higher for DeepWAS than for classical GWAS models for 91% of the genome-wide significant, MS-specific dSNPs. DSNPs were enriched in public or cohort-matched expression and methylation quantitative trait loci and we demonstrated the potential of DeepWAS to generate testable functional hypotheses based on genotype data alone. DeepWAS is available at

1. Introduction

Unraveling the complex genetic patterns underlying complex phenotypes has previously been challenging. While individual Genome-Wide Association Studies (GWAS) can provide insight into the genetic underpinnings of measured phenotypes, they typically involve associations of genetic variants with only one or a few phenotypes. The field of phenomics involves the collection of high-dimensional phenotype data of an organism, with the aim of capturing the overall, comprehensive phenotype (the “Phenome”) of the organism (Houle et al., 2010). Association studies involving many measured phenotypes, for example Phenome-Wide Association Studies (PheWAS), present many advantages, in that they allow the complex interconnected networks between phenotypes and their genetic underpinnings to be elucidated, and also allow for the detection of pleiotropy (Pendergrass et al., 2011, 2013, 2015; Hall et al., 2014).

Pleiotropy is the phenomenon in which a gene affects multiple phenotypes (Tyler et al., 2009). One can also have a locus-centric view of pleiotropy involving a single SNP affecting multiple phenotypes (Solovieff et al., 2013). While pleiotropy used to be considered an exception to the rules of Mendelian genetics, it has since been proposed to be a common, central property inherent to biological systems (Tyler et al., 2009). Multi-phenotype associations (MPAs) can be detected in the results of Genome-Wide Association Studies (GWASs) as Single Nucleotide Polymorphisms (SNPs) within genes/functional regions having multiple significant phenotype associations. This can be considered to be a pleiotropic pattern when the two phenotypes are seemingly unrelated. Two main MPA patterns exist within GWAS results. Type 1 MPAs occur when a single SNP within a functional region (such as a gene) is associated with more than one phenotype, whereas Type 2 MPAs occur when two different SNPs within a single functional region have different phenotype associations (Solovieff et al., 2013; Hackinger and Zeggini, 2017) (Figures 1A,B).

Figure 1. MPA signatures. (A) Type 1 MPA: a gene is associated with more than one phenotype due to a single variant within the gene associating with multiple phenotypes. (B) Type 2 MPA: a gene is associated with more than one phenotype because of alternate SNPs within the gene having different phenotypic associations (figure created from information presented in Solovieff et al., 2013). (C) Complex combinations of Type 1 and Type 2 signatures.
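The Type 1 / Type 2 distinction can be illustrated with a small sketch that labels genes from a list of significant (gene, SNP, phenotype) hits. This is not the authors' method; it is a simple reading of the two definitions, in which "different phenotype associations" is assumed to mean that two SNPs in the same gene have non-identical sets of associated phenotypes. All gene, SNP and phenotype names are made up.

```python
from collections import defaultdict

def mpa_types(associations):
    """associations: iterable of (gene, snp, phenotype) significant hits.

    Returns {gene: set of labels}, where a label is 'type1' (one SNP with
    several phenotypes) and/or 'type2' (two SNPs with differing phenotype sets).
    """
    by_gene = defaultdict(lambda: defaultdict(set))
    for gene, snp, phenotype in associations:
        by_gene[gene][snp].add(phenotype)

    labels = {}
    for gene, snps in by_gene.items():
        found = set()
        pheno_sets = list(snps.values())
        if any(len(p) > 1 for p in pheno_sets):
            found.add("type1")
        if any(a != b for i, a in enumerate(pheno_sets)
               for b in pheno_sets[i + 1:]):
            found.add("type2")
        labels[gene] = found
    return labels

hits = [
    ("geneA", "snp1", "height"), ("geneA", "snp1", "weight"),   # Type 1
    ("geneB", "snp2", "height"), ("geneB", "snp3", "weight"),   # Type 2
    ("geneC", "snp4", "height"), ("geneC", "snp4", "weight"),
    ("geneC", "snp5", "lipids"),                                # both
]
signatures = mpa_types(hits)
print(signatures)
```

A gene such as `geneC` that receives both labels corresponds to the combined signatures in panel (C) of the figure.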

Multivariate analyses of the results of GWAS studies across many phenotypes have allowed for the investigation of complex relationships between genes and phenotypes, including pleiotropic relationships and the clustering of variants based on their phenotype associations. Many of these studies have involved the analysis of SNP associations with complex human disease traits. Some studies have considered pleiotropy as genes and SNPs associated with more than one phenotype, and found that pleiotropic genes tended to be longer, and that SNPs within pleiotropic genes were more likely to be exonic (Sivakumaran et al., 2011). Weighted Gene Co-expression Network Analysis (WGCNA) has been extended to cluster SNPs based on their phenotype associations using a matrix of beta coefficients, followed by hierarchical clustering of the Topological Overlap Matrix (Levine et al., 2017), and the resulting clusters were shown to be usable for producing polygenic scores. Gupta et al. (2011) introduced a biclustering algorithm, simultaneously clustering SNPs and phenotypes in a matrix of regression coefficients. Network-based approaches have been developed which construct bipartite networks of gene-disease phenotype associations from GWAS, and construct network projections of this bipartite network, resulting in disease-similarity and gene-similarity networks (Goh and Choi, 2012). Though these studies provide a baseline for the use of multivariate and network approaches in the analysis of GWAS results, there is, to our knowledge, no method which characterizes detailed MPA signatures of genes and no method which clusters genes based on these detailed signatures. Simply clustering genes based on their phenotype associations will not capture the vast number of combinatorial possibilities of type 1 and type 2 signatures any given gene can harbor (Figure 1C), especially when the multi-phenotype GWAS study involves millions of variants and hundreds of phenotypes.

Methods for multi-trait GWAS have also been developed, associating variants to groups of phenotypes (see for example Stephens, 2013; Furlotte and Eskin, 2015; Cichonska et al., 2016; Kaakinen et al., 2017a,b; Mägi et al., 2017; Porter and O'Reilly, 2017; Thoen et al., 2017). Mägi et al. (2017) and Kaakinen et al. (2017a) present interesting methods for identifying the association between SNPs/genes and multiple phenotypes by using the phenotypes as predictors in the modeling of the genotype. These are valuable methods for determining which phenotypes or sets of phenotypes a given gene or SNP is associated with, and they are more sophisticated than standard univariate GWAS approaches. These methods, however, do not focus on the ability to characterize and cluster genes based on the collection of topologies of SNP-phenotype associations within the gene.

We present MPA Decomposition and Signature Clustering, a network-based approach involving a constructed powerset space, in which clustering distinguishes between genes based on the detailed topology of their unique MPA signature. MPA decomposition is a post-GWAS/post-PheWAS approach which is designed to take the results of a multi-phenotype genome-wide association-type analysis (such as a standard, univariate GWAS run on several phenotypes or a multi-phenotype approach such as SCOPA (Mägi et al., 2017)) and provide a framework allowing the precise mathematical representation of the architecture of variant-phenotype associations within regions (MPA/pleiotropic signatures), thus allowing these regions (such as genes) to be clustered based on these complex signatures.

Results and discussion

Multiple phenotypic associations

Out of a total of 3,792,566 rules mined, 765,318 rules with lift ≥ 1 and confidence ≥ 0.5 were retained. 136,551 rules encoded TG and LDL levels. Of 19,837 rules related to high TG levels, 191 interesting rules represent low LDL-C and high TG, in contrast to 509 rules that showed high TG and high LDL-C levels. Table 2 shows representative association rules (see Additional file 3); for the interpretation of the rules, refer to the previous work [14].
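For readers unfamiliar with the rule filters used here, the following sketch shows the standard definitions of support, confidence and lift for a rule "body → head". It is only an illustration: the five samples are invented, each sample is represented as a set of discretized trait labels, and the function name is hypothetical.

```python
def rule_metrics(samples, body, head):
    """Support, confidence and lift of the rule body -> head.

    samples : list of sets of trait labels (one set per individual)
    body    : labels on the left-hand side of the rule
    head    : labels on the right-hand side of the rule
    Assumes at least one sample contains the body and one contains the head.
    """
    body, head = set(body), set(head)
    n = len(samples)
    n_body = sum(1 for s in samples if body <= s)
    n_head = sum(1 for s in samples if head <= s)
    n_both = sum(1 for s in samples if (body | head) <= s)
    support = n_both / n                 # fraction of samples matching the rule
    confidence = n_both / n_body         # P(head | body)
    lift = confidence / (n_head / n)     # P(head | body) / P(head)
    return support, confidence, lift

# Invented samples using labels in the style of Table 2.
samples = [
    {"TG5", "LDL5", "NONHDL5"},
    {"TG5", "LDL5", "NONHDL5"},
    {"TG5", "LDL1"},
    {"LDL5", "NONHDL5"},
    {"TG1", "LDL1"},
]
support, confidence, lift = rule_metrics(samples, {"TG5", "LDL5"}, {"NONHDL5"})
print(support, confidence, lift)
```

A lift above 1 means the head occurs more often among samples matching the body than in the sample as a whole, which is why rules with lift ≥ 1 are the ones worth keeping.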

Table 2. Representative association rules

| Rule # | Rule body | Rule head | Supp | Conf | Lift |
|---|---|---|---|---|---|
| Rules encoding high TG levels | | | | | |
| 1 | LDL5, BMI5, TG5, TCHL5 | NONHDL5 | 0.0157 | 1.0000 | 5.1732 |
| 2 | GLU1205, TCHL5, LDL5, TG5 | NONHDL5 | 0.0136 | 1.0000 | 5.1732 |
| 3 | LDL5, TG5, TCHL5, GLU05 | NONHDL5 | 0.0132 | 1.0000 | 5.1732 |
| 4 | TG5, PLAT5, LDL5, NONHDL5 | TCHL5 | 0.0127 | 1.0000 | 5.1465 |
| 5 | GLU605, TG5, TCHL5, LDL5 | NONHDL5 | 0.0126 | 1.0000 | 5.1732 |
| 6 | TG5, LDL5, TCHL5, DS1 | NONHDL5 | 0.0122 | 1.0000 | 5.1732 |
| 7 | LDL5, INS1205, TCHL5, TG5 | NONHDL5 | 0.0119 | 1.0000 | 5.1732 |
| 8 | DBP5, TG5, LDL5, TCHL5 | NONHDL5 | 0.0119 | 1.0000 | 5.1732 |
| 9 | INS605, TCHL5, TG5, LDL5 | NONHDL5 | 0.0116 | 1.0000 | 5.1732 |
| 10 | TG5, TCHL5, INS05, LDL5 | NONHDL5 | 0.0107 | 1.0000 | 5.1732 |
| 11 | TG5, LDL5, SBP4, TCHL5 | NONHDL5 | 0.0107 | 1.0000 | 5.1732 |
| 12 | TG5, TCHL5, WHR5, LDL5 | NONHDL5 | 0.0107 | 1.0000 | 5.1732 |
| 13 | T_HDL5, NONHDL5, LDL4, HDL2 | TG5 | 0.0102 | 0.8875 | 4.1105 |
| 14 | TCHL2, LDL1, NONHDL2, PH1 | TG5 | 0.0100 | 0.8333 | 3.8595 |
| Rules encoding high LDL levels | | | | | |
| 15 | TCHL5, NONHDL5, GLU605 | LDL5 | 0.0405 | 0.8324 | 4.2651 |
| 16 | DS1, NONHDL5, GLU605 | LDL5 | 0.0155 | 0.8182 | 4.1924 |
| 17 | NONHDL5, SONA4, TCHL5 | LDL5 | 0.0242 | 0.8450 | 4.3297 |
| 18 | TG3, BUN5, NONHDL5 | LDL5 | 0.0114 | 1.0000 | 5.1239 |

Definitions of trait names are given in Figure 1.

Associations between high TG levels and MS related traits

Association patterns of single traits extracted from 359 rules containing high TG levels were visualized as a connected graph (Figure 1). The high TG trait (TG5 in Figure 1) was connected to nodes representing 17 distinct traits: a bone mineral density (BMD) measure, distal radius SOS (DS); blood components (HB, WBC_B, PLAT, and HBA1C); and metabolic syndrome (MS) (Daskalopoulou et al., 2006) measures, namely obesity (BMI, WHR and SUP), lipids (LDL, HDL, TCHL, T_HDL, TG and NONHDL), hypertension (SBP and DBP), insulin resistance (GLU0, INS0, GLU60 and GLU120), and post-challenge insulin (INS60 and INS120). The abbreviations of the single traits are defined in Figure 1.

Visualization of phenotypic associations with connected graphs.

Associations between high TG levels and a cluster of 4 common traits (obesity, insulin resistance, hypertension, and hyperlipidemia) related to MS, were consistent with the fact that MS increases T2D and cardiovascular diseases (CVD) [20].

Associations between high TG levels and BMD

One of the noteworthy findings is the association between low DS, as a measure of BMD, and high TG levels. The associations between low DS and the cluster of four common traits defining MS (obesity, hypertension, hyperlipidemia and insulin resistance with high glucose levels), together with the lack of association with insulin levels (INS0, INS60, INS120), were in concordance with recently reported work [21] that examined the association between MS and bone health. Low DS was associated with high levels of lipids including TG, TCHL and LDL (a negative relation) and with low levels of HDL (a positive relation) [22]. An even more interesting finding is that low DS is associated with high glucose levels but not with insulin levels, although the association of high glucose levels or insulin resistance with BMD has been inconclusive. In contrast, hyperglycemia is a known predictor of bone loss and osteoporotic fractures [23]. Our finding suggests that obesity, hypertension and hyperlipidemia, among the MS-related traits, might be associated with osteoporosis.

Associations with high LDL

High levels of LDL showed positive relations with BMI, glucose levels and plasma lipids including TCHL, TG, and NONHDL, as well as negative relations with DS. We did not find associations between high LDL levels and insulin levels. Interestingly, high LDL has positive relations with single traits related to renal function, such as Blood Urea Nitrogen (BUN) and Sodium in Urine (SONA).

Pattern of multivariate phenotype highLDLhighTG

Among the multiple phenotypic associations with high TG, we considered those that subdivided the samples into feasible sizes of cases and controls for GWAS. We focused on the contradictory relationship between high TG levels (TG5 in Figure 1(a)) and low or high levels of LDL (LDL1 in Figure 1(a) and LDL5 in Figure 1(b)). That is, there are positive correlations between TG and LDL-C and TCHL; between LDL-C and HDL; as well as negative correlations between TG and LDL-C; between HDL and TG, LDL-C and TCHL. Both single traits, high TG levels and high LDL levels, shared common associated traits (BMI, PLAT, TCHL, and GLU0).

Combining the two single traits, high LDL and high TG, introduces the multivariate phenotype highLDLhighTG, which can amplify the association strength with correlated single traits through the additive effects of the single traits. Out of the 17 associated traits, four traits (DS, GLU0, INS0, and SONA) have more power to distinctively classify samples of highLDLhighTG into cases and controls (Figure 2).

Distribution of associated traits with multivariate and single traits. 1 and 2 stand for the groups of controls and cases in the samples of each trait, respectively. (a), (b) and (c) stand for highLDLhighTG, high LDL and high TG, respectively. Out of the 17 single traits associated with high LDL and high TG, 9 were selected for display to preserve image resolution.

The associations between the traits can be substantiated by the association rules encoding high TG levels (Rules 1-14) and high LDL-C levels (Rules 15-18). The rules were sorted and selected by their confidences.

As seen above, there are complicated associations among single traits. Selecting cases and controls based on single traits without considering those associations may increase confounding effects in the samples. Compared with selection of cases and controls based on single traits, a multivariate approach can have more power to distinguish cases from controls.

GWAS results of Plasma lipid levels

We identified a total of 50 variants associated with highLDLhighTG, of which 15 are located in six genes (PAK7, C20orf103, NRIP1, BCL2, TRPM3, and NAV1) (Table 3 and Figure 3). Interestingly, rs11700112 of PAK7 on 20p12.2 is a missense variant in which arginine (CGA) is replaced by proline (CCA). No clinical association has yet been found for this variant. It is located within an LD block (530 kb) with four other SNPs, of which two (rs6140956 and rs6133716) are in an intronic region of C20orf103. It is worth noting that C20orf103 contains a frameshift mutation at rs72238296, which is 755 bases upstream of rs6140956 in the same gene (Table 3 and Figure 4(a)). The frameshift mutation is a known cause of hypercholesterolemia [24].

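Table 3. Genetic variants associated with highLDLhighTG

| SNP | Chr | Base position | SNP type | Gene | Strand | Allele (+/-) | Freq (+) | P-value hLDLhTG | OR | mORP LDL | mORP TG | r² |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| rs11700112 | 20 | 9495018 | nonsyn | PAK7 | - | G/C | 0.31 | 6.2×10^-5 | 1.44 | 3.25 | 1.58 | 1.00 |
| rs6140956 | 20 | 9450080 | intronic | C20orf103 | + | C/T | 0.41 | 6.3×10^-5 | 1.40 | 2.79 | 1.85 | 0.56 |
| rs6133716 | 20 | 9455931 | intronic | C20orf103 | + | A/G | 0.40 | 1.0×10^-4 | 1.39 | 2.58 | 1.75 | 0.59 |
| rs9967942 | 20 | 9503781 | intronic | PAK7 | - | C/A | 0.34 | 9.6×10^-5 | 1.41 | 3.24 | 1.76 | 0.89 |
| rs11087847 | 20 | 9504159 | intronic | PAK7 | - | T/G | 0.31 | 8.6×10^-5 | 1.43 | 3.02 | 1.59 | 0.99 |
| rs2822994 | 21 | 15282928 | intronic | NRIP1 | - | A/G | 0.41 | 8.7×10^-5 | 1.40 | 2.46 | 2.59 | 1.00 |
| rs2822998 | 21 | 15285230 | intronic | NRIP1 | - | C/A | 0.41 | 4.1×10^-4 | 1.35 | 2.09 | 2.12 | 0.97 |
| rs1041404 | 21 | 15346738 | intronic | NRIP1 | - | A/G | 0.43 | 1.7×10^-4 | 1.37 | 2.38 | 2.33 | 0.64 |
| rs9959874 | 18 | 59045526 | intronic | BCL2 | - | A/G | 0.15 | 1.2×10^-4 | 1.60 | 3.40 | 2.20 | 1.00 |
| rs1893506 | 18 | 59044660 | intronic | BCL2 | - | G/A | 0.15 | 1.5×10^-4 | 1.59 | 3.31 | 2.08 | 1.00 |
| rs4744611 | 9 | 72551280 | intronic | TRPM3 | - | G/A | 0.49 | 1.8×10^-4 | 1.36 | 1.93 | 3.14 | 1.00 |
| rs7039780 | 9 | 72469777 | intronic | TRPM3 | - | G/A | 0.50 | 3.7×10^-4 | 1.34 | 1.57 | 2.98 | 0.90 |
| rs4744608 | 9 | 72470797 | intronic | TRPM3 | - | G/C | 0.50 | 3.7×10^-4 | 1.34 | 1.52 | 2.99 | 0.90 |
| rs665770 | 1 | 200014747 | intronic | NAV1 | + | A/G | 0.38 | 2.0×10^-4 | 1.38 | 2.65 | 1.54 | 1.00 |
| rs529581 | 1 | 200016143 | intronic | NAV1 | + | C/G | 0.37 | 2.0×10^-4 | 1.38 | 2.59 | 1.59 | 0.99 |
| rs693 | 2 | 21085700 | coding | APOB | - | A/G | 0.07 | 7.3×10^-4 | 1.83 | 0.04 | 2.21 | - |

SNP rs693 was reported in a previous study (Kathiresan et al., 2008) for associations with high TG and high LDL. rs693 was pruned because its effect was stronger for the single trait high LDL than for the multivariate highLDLhighTG, where it had borderline significance.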

A Manhattan plot for the association test of highLDLhighTG. Gene symbols in purple represent loci identified in a previous GWAS of lipids (Kathiresan et al., 2008). SNPs in blue were pruned.

P-value distributions of association tests for highLDLhighTG and single traits highTG and highLDL. Points in red are significantly identified SNPs in the association test of highLDLhighTG. highLDLhighTG is presented with hLDLhTG and single traits highTG and highLDL are presented in hTG and hLDL respectively.

A strong LD block (81 kb in length) with high r² values (r² ≥ 0.90) was detected across three SNPs (rs4744611, rs7039780 and rs4744608) of TRPM3 (Figure 5(b)) on chromosome 9 (9q21.11-q21.12), relatively close to regions linked to coronary artery disease [25]. Among the nine splice variants of TRPM3, splice variants 7 and 8 do not include the three identified SNPs (Figure 5(a)). This observation suggests that the SNPs can have different functional effects on different splice variants. Although no firm genetic linkage to disease has been established and little has been reported on the properties of TRPM3, the functional activity of TRPM3 is relevant to contractile and proliferating vascular smooth muscle cells. Recent work [26] investigated the relevance and regulation of TRPM3 in vascular biology and showed that elevated cholesterol can act as a negative regulator of TRPM3.

Genomic features for LD structures in HapMap populations.

Two SNPs of the BCL2 gene on chromosome 18 (18q21.33) were identified. BCL2, which is involved in a number of cancers including melanoma and breast carcinoma, has been recognized as an important modulator of cardiac myocyte apoptosis. Distinct support for the relevance of BCL2 to cardiovascular disease (CVD) comes from a recent finding [27] that PPARγ protected cardiac myocytes from oxidative stress and apoptosis by upregulating BCL2 expression.

NRIP1 was reported to have an association with HDL [28]. Recent studies identified a hepatocyte specific role for NRIP1 as a cofactor for LXR in different ways, namely serving as a coactivator in lipogenesis and as a corepressor in gluconeogenesis [29]. NAV1 on chromosome 1q32.1, a human homolog of a C. elegans gene, unc-53, is expressed in adult heart and the developing brain. Clinical association has not been established with it. Our results warrant that variants associated with highLDLhighTG should be evaluated for further study.

It is important to emphasize that the LD structures of the six genes are distinct across the three populations (YRI, CEU, JPT+CHB). The strongest LD pattern among the three was observed in JPT+CHB, whereas a weak LD pattern appeared in CEU (see Additional file 4).

In silico replication

In silico replication analysis was conducted for the 15 SNPs in two regional subcohorts as well as in gender groups (Table 4). Nine of the 15 SNPs associated with highLDLhighTG were well reproduced in the regional subcohorts (P < 0.05), while the p-values of the other six SNPs (p ≥ 0.05) were of borderline statistical significance. Five SNPs in NRIP1 (rs2822994, rs2822998 and rs1041404) and NAV1 (rs665770 and rs529581) were more reproducible in both regional subcohorts and gender groups.

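Table 4. Replication of GWAS of highLDLhighTG

| SNP | LDLTG n=205 | LDL-C n=919 | TG n=936 | LDLTG n=340 | LDL-C n=1352 | TG n=843 | LDLTG n=545 | LDLTG n=288 | LDL-C n=1044 | TG n=969 | LDLTG n=257 | LDL-C n=1227 | TG n=810 | LDLTG n=545 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| rs11700112 | 2.6×10^-2 | 0.1818 | 0.3849 | 3.1×10^-2 | 0.7172 | 0.0172 | 2.8×10^-2 | 1.8×10^-1 | 0.4161 | 0.6321 | 1.0×10^-3 | 0.0967 | 0.0072 | 9.2×10^-2 |
| rs6140956 | 8.8×10^-4 | 0.1770 | 0.1017 | 7.1×10^-2 | 0.7749 | 0.0332 | 3.6×10^-2 | 1.1×10^-1 | 0.6353 | 0.3125 | 8.4×10^-4 | 0.2317 | 0.0096 | 5.6×10^-2 |
| rs6133716 | 1.8×10^-3 | 0.2691 | 0.0748 | 9.3×10^-2 | 0.7603 | 0.0701 | 4.7×10^-2 | 1.1×10^-1 | 0.9550 | 0.2231 | 1.9×10^-3 | 0.1757 | 0.0259 | 5.8×10^-2 |
| rs9967942 | 2.6×10^-2 | 0.3476 | 0.4816 | 4.1×10^-2 | 0.6707 | 0.0198 | 3.3×10^-2 | 4.1×10^-1 | 0.2925 | 0.8325 | 6.9×10^-4 | 0.1545 | 0.0072 | 2.0×10^-1 |
| rs11087847 | 1.9×10^-2 | 0.1531 | 0.3200 | 3.5×10^-2 | 0.7218 | 0.0151 | 2.7×10^-2 | 1.5×10^-1 | 0.4459 | 0.5424 | 1.1×10^-3 | 0.0750 | 0.0062 | 7.4×10^-2 |
| rs2822994 | 1.1×10^-1 | 0.1582 | 0.2536 | 2.9×10^-5 | 0.0158 | 0.0019 | 5.4×10^-2 | 1.5×10^-2 | 0.4400 | 0.0060 | 2.4×10^-4 | 0.0045 | 0.1282 | 7.9×10^-3 |
| rs2822998 | 1.2×10^-1 | 0.1985 | 0.2456 | 7.4×10^-5 | 0.0172 | 0.0043 | 5.8×10^-2 | 3.2×10^-2 | 0.4875 | 0.0084 | 2.9×10^-4 | 0.0101 | 0.1361 | 1.6×10^-2 |
| rs1041404 | 1.6×10^-3 | 0.0668 | 0.0357 | 3.1×10^-3 | 0.0274 | 0.0338 | 2.4×10^-3 | 7.8×10^-3 | 0.3843 | 0.0049 | 4.4×10^-4 | 0.0094 | 0.0939 | 4.1×10^-3 |
| rs9959874 | 1.2×10^-1 | 0.4953 | 0.1942 | 1.1×10^-3 | 0.0540 | 0.2536 | 5.9×10^-2 | 4.3×10^-3 | 0.4620 | 0.0720 | 1.8×10^-1 | 0.2007 | 0.9606 | 9.3×10^-2 |
| rs1893506 | 1.2×10^-1 | 0.4953 | 0.1942 | 1.3×10^-3 | 0.0584 | 0.2616 | 5.9×10^-2 | 5.3×10^-3 | 0.4620 | 0.0771 | 1.8×10^-1 | 0.2018 | 0.9797 | 9.4×10^-2 |
| rs4744611 | 1.7×10^-2 | 0.9652 | 0.0078 | 4.3×10^-2 | 0.0090 | 0.8933 | 3.0×10^-2 | 1.1×10^-3 | 0.0113 | 0.1925 | 1.8×10^-1 | 0.6221 | 0.4880 | 9.1×10^-2 |
| rs7039780 | 7.8×10^-3 | 0.6334 | 0.0107 | 1.4×10^-1 | 0.0238 | 0.7387 | 7.6×10^-2 | 2.2×10^-3 | 0.0080 | 0.2893 | 2.2×10^-1 | 0.6317 | 0.4947 | 1.1×10^-1 |
| rs4744608 | 7.8×10^-3 | 0.5846 | 0.0122 | 1.6×10^-1 | 0.0253 | 0.6955 | 8.2×10^-2 | 1.8×10^-3 | 0.0074 | 0.2650 | 2.6×10^-1 | 0.6317 | 0.5936 | 1.3×10^-1 |
| rs665770 | 2.3×10^-2 | 0.2715 | 0.3219 | 4.7×10^-2 | 0.0747 | 0.3156 | 3.5×10^-2 | 1.6×10^-2 | 0.5529 | 0.1529 | 8.1×10^-2 | 0.0321 | 0.6182 | 4.9×10^-2 |
| rs529581 | 2.2×10^-2 | 0.2498 | 0.3413 | 4.5×10^-2 | 0.0524 | 0.3337 | 3.4×10^-2 | 2.4×10^-2 | 0.5012 | 0.1932 | 6.2×10^-2 | 0.0260 | 0.5976 | 4.3×10^-2 |
| rs693 | 8.0×10^-3 | 0.0889 | 0.2148 | 1.1×10^-2 | 0.0045 | 0.0396 | 9.6×10^-3 | 2.1×10^-3 | 0.0373 | 0.0147 | 2.8×10^-2 | 0.0056 | 0.6146 | 1.5×10^-2 |

n represents the number of cases.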

Gender differences in the reproducibility of the 15 SNPs were as follows: PAK7 and NRIP1 were more effective in women; BCL2, TRPM3 and NAV1 were more reproducible in men. highLDLhighTG was more detectable in women than in men (χ² �.9, p-value = 2.05 × 10^-11). PAK7 and NRIP1 may contribute to gender-specific susceptibility, in concordance with previous work [30] reporting more gender-specific effects for CVD in women than in men.

Comparison of general GWAS

Overall, the distribution of p-values for the association test appeared to be less significant than that for general GWAS. On the other hand, the p-values of the significant SNPs identified for the multivariate phenotype highLDLhighTG were clearly more significant than those for the single traits highLDL and highTG (Figure 6). It is noteworthy that the effect sizes of the significant SNPs, which ranged from modest (odds ratios = 1.38-1.60) to intermediate, were comparable to those for general GWAS, which range from low to modest.

Scatter plots for p-value for a multivariate trait versus single traits.

Pleiotropic patterns of quantitative trait loci

Pleiotropic patterns can be observed more precisely in quantitative trait loci (QTLs) or LD blocks than at single SNPs. We examined QTLs and their associated phenotypes for the six identified genes using the Phenotype and Disease Association track group in the UCSC genome browser. The QTLs and their associated phenotypes were extracted from rat and mouse QTLs from RAT DB and MGI (Mouse Genome Informatics) (Table 5).

Table 5

Phenotypes associated with QTLs mapped to 6 genes identified

Gene(s) (chromosome band)OMIM phenotype (OMIM number)Phenotypes for rat QTL from RGDPhenotypes for mouse QTL from MGI
C20orf103 (20p12.2)
Body mass index(608559),
Atopic dermatitis(605804),
Systemic lupus erythematosus(610065),
Alzheimer disease(607116)
Blood pressure, Body weight,
Cardiac mass, Stress response,
Non-insulin dependent diabetes mellitus, Renal disease susceptibility, Thymus enlargement suppressive
Blood glucose level,
Type 2 diabetes mellitus,
Bone mineral density, Crescentic glomerulonephritis, Modifier of retinal degeneration
NRIP1 (21q11.2)Myeloproliferative syndrome(159595)
Narcolepsy(609039), Autism(610838)
Testicular tumor resistance
BCL2 (18q21.33)Orthostatic hypotensive disorder
(143850), Insulin-dependent diabetes mellitus(601941), Amyotrophic lateral sclerosis(606640)
Blood pressure, Cardiac cell morphology, Insulin dependent diabetes mellitus, Renal functionBone mineral density
Hematocrit/hemoglobin quantitative trait(609320), Cataract(605749),
Pelvic organ prolapse(613088),
Deafness (chromosome 9q21.11 duplication syndrome)(613558),
Epilepsy(611631), paraplegia(607152), Otosclerosis(612096), Spastic
Amyotrophic lateral sclerosis(105550)
Blood pressure, Body weight,
Heart rate, Stress response, Cardiac mass, Glucose level, Lipid level,
Renal function, Kidney mass, Renin concentration,
Thyroid stimulating hormone level,
Abnormal inflammatory response,
Hepatocarcinoma susceptibility,
Bone mechanical trait,
Autoimmune aortitis,
Cataract severity
Inflammatory bowel disease(612381),
Parkinson disease(613164),
Blood pressure, Cardiac mass,
Stress response, Renal function,
Thymus enlargement,
Abnormal inflammatory response
Crescentic glomerulonephritis,
Bone mineral density

Phenotypes associated with QTLs were extracted from Phenotype and Disease Association tracks in UCSC genome browser. Phenotypes for OMIM, rat QTL, mouse QTL corresponded to OMIM phenotype loci, RGD RAT QTL and MGI Mouse QTL tracks respectively.

Except for NRIP1, the six genes share QTLs associated with common traits such as BMD and the cluster of common traits defining MS. The MS traits shared by these genes are blood pressure, non-insulin dependent diabetes mellitus, renal function, cardiac mass, and body weight. The phenotypic associations of high TG and high LDL levels with low BMD are supported by the rat and mouse QTL associations: all genes except NRIP1 mapped to regions of QTLs associated with BMD. Further support comes from the mapping of TRPM3 to OMIM phenotypes such as osteosclerosis (hardening of bone), epilepsy, and amyotrophic lateral sclerosis (ALS), whose association with CVD was reported in a recent work [31]. Different genetic markers share the same or similar OMIM phenotypes: BCL2 and TRPM3 are both associated with ALS; PAK7 and NAV1 share the similar phenotypes Alzheimer disease (AD) and Parkinson disease (PD), and cross-talk between MS and AD has been reported [32].

In summary, our results suggest that the genetic markers identified with multivariate phenotype highLDLhighTG have phenotypic associations with common traits in MS. The common traits in MS, particularly hyperlipidemia, may be linked to pathogenic associations with osteosclerosis and neurodegenerative disorders including AD and PD influenced by pleiotropic genetic factors. Thus, the genetic markers identified in our work can have pleiotropic effect on MS, BMD and neurodegenerative disorders.

Gene network analysis using protein-protein interactions

We explored possible functional relationships between five of the six genes associated with highLDLhighTG using STRING, a database of predicted protein-protein interactions (PPI). We obtained five different networks of genes interacting with each of the five genes, filtered by confidence of association evidence (≥ 0.5). Each of the gene networks (Figure 7) was mapped to KEGG pathways and examined for pathways in common. Four genes, i.e. BCL2, NAV1, NRIP1 and TRPM3, interact with genes (CASP7, BACE1, SDHB, TRPC6) in AD and PD pathways, while BCL2 and NRIP1 shared Huntington’s disease as well as AD. In particular, three genes, i.e. BCL2, PAK7, and NRIP1, shared pathways in cancer and other pathways, supporting our hypothesis that multivariate phenotypes have common etiological pathways when they are affected by pleiotropic genetic factors.

Gene networks constructed from interacting proteins. Solid lines in red stand for genes in pathways for AD, PD, HD, and ALS. Gene symbols in black are involved in chemokine, MAPK and Wnt signaling pathways. Dashed lines in red represent genes mapping to pathways in cancer from KEGG DB or specific cancer related pathways annotated by PANTHER and DAVID functional annotations.
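
The network-and-pathway comparison described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' pipeline: the gene names, partner names, confidence scores and pathway labels below are all invented placeholders, and no call to the STRING or KEGG services is made. The sketch keeps interactions at or above a confidence cutoff, collects the pathways reached by each query gene's partners, and reports pathways reached by two or more query genes.

```python
from collections import defaultdict

# Toy interaction records: (query gene, partner, confidence score).
# Every name and score here is invented for illustration only.
INTERACTIONS = [
    ("GENE_A", "P1", 0.92),
    ("GENE_A", "P2", 0.41),
    ("GENE_B", "P3", 0.77),
    ("GENE_B", "P1", 0.55),
    ("GENE_C", "P4", 0.30),
]

# Toy pathway annotation for the partner genes (also invented).
PATHWAYS = {
    "P1": {"pathway_X", "pathway_Y"},
    "P2": {"pathway_Z"},
    "P3": {"pathway_X"},
    "P4": {"pathway_Z"},
}


def build_networks(interactions, min_confidence=0.5):
    """Group partners by query gene, keeping edges at or above the cutoff."""
    networks = defaultdict(set)
    for query, partner, score in interactions:
        if score >= min_confidence:
            networks[query].add(partner)
    return dict(networks)


def pathways_per_network(networks, pathways):
    """Collect the union of pathways annotated to each query gene's partners."""
    return {
        query: set().union(*(pathways.get(p, set()) for p in partners))
        for query, partners in networks.items()
    }


def shared_pathways(network_pathways):
    """Map each pathway to the query genes reaching it (two or more genes only)."""
    hits = defaultdict(set)
    for query, pathway_set in network_pathways.items():
        for pathway in pathway_set:
            hits[pathway].add(query)
    return {pw: genes for pw, genes in hits.items() if len(genes) >= 2}


networks = build_networks(INTERACTIONS, min_confidence=0.5)
common = shared_pathways(pathways_per_network(networks, PATHWAYS))
```

In this toy input, GENE_C drops out because its only interaction falls below the cutoff, so it contributes no pathways to the comparison.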

Pillar 3: individual causal polymorphisms segregate at moderate-to-intermediate frequencies

The debate over ‘common mutation–common disease’ versus ‘rare mutations–common disease’ models has proven to be the source of a great number of manuscripts, including many recent reviews (for example, Manolio et al., 2009). We raise this issue only insofar as it relates to the prospects for GWAS. Recall Hill et al. (2008): if causal polymorphisms are at low frequency, they contribute to additive variation. However, the power to detect such polymorphisms is rather low, potentially requiring prohibitive sample sizes (Zuk et al., 2012). Curiously, as power drops with allele frequency, the resulting expected overestimation of effect size increases (Lynch and Walsh, 1998). Accordingly, less frequent causal polymorphisms will appear to have a stronger effect than is actually the case (Mackay et al., 2012), perhaps biasing the flow of resources from GWAS to the ‘mutational screen’ paradigms that have recently been gaining in popularity (Tennessen et al., 2012).
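
The link between low power and inflated effect estimates can be illustrated with a small simulation. This is a minimal sketch, not an analysis from any of the works cited above. It assumes an additive model in which the effect estimate is normally distributed around the true effect with standard error sigma / sqrt(n * 2p(1-p)), where p is the minor allele frequency, and it uses a nominal two-sided threshold of |z| > 1.96 (genome-wide thresholds are far stricter). The function names, parameter values and seed are all illustrative choices.

```python
import math
import random


def standard_error(n, maf, sigma=1.0):
    """Approximate SE of an additive effect estimate: sigma / sqrt(n * 2p(1-p))."""
    return sigma / math.sqrt(n * 2.0 * maf * (1.0 - maf))


def simulate(beta, n, maf, z_crit=1.96, reps=20000, seed=1):
    """Return (power, mean effect estimate among significant replicates)."""
    rng = random.Random(seed)
    se = standard_error(n, maf)
    significant = []
    for _ in range(reps):
        beta_hat = rng.gauss(beta, se)  # one simulated study's estimate
        if abs(beta_hat / se) > z_crit:  # keep only estimates that pass the test
            significant.append(beta_hat)
    power = len(significant) / reps
    mean_sig = sum(significant) / len(significant) if significant else float("nan")
    return power, mean_sig


TRUE_BETA = 0.2
power_common, mean_common = simulate(TRUE_BETA, n=1000, maf=0.30)
power_rare, mean_rare = simulate(TRUE_BETA, n=1000, maf=0.01)
print(f"common allele: power={power_common:.2f}, mean significant estimate={mean_common:.2f}")
print(f"rare allele:   power={power_rare:.2f}, mean significant estimate={mean_rare:.2f}")
```

With the same true effect and sample size, the rare allele is detected far less often, and the estimates that do reach significance overstate the true effect much more than those for the common allele. That is the selection effect described in the paragraph above.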

What proportion of phenotypic variation is, then, due to low-frequency (as compared with intermediate-frequency) alleles? Referring back to Mackay et al. (2012), Jordan et al. (2012) and Weber et al. (2012), low-frequency mutations matter. This is problematic for GWAS if effect sizes are relatively small. However, per the analysis of Ober et al. (2012), intermediate-frequency alleles matter as well. This in itself may pose problems for GWAS analyses that focus on detecting additive effects since, as discussed above, although apparent additivity is likely for rare variants, it is not a necessary consequence for alleles of intermediate frequency.

To summarize, GWAS will be most successful if (i) additive genetic variation is abundant, (ii) individual causal polymorphisms have sizable effects and (iii) they segregate at moderate-to-intermediate frequencies. So, is genetic variation mostly additive? In general, we do not know. Do individual causal polymorphisms have sizable effects? Again, in general we do not know. Do they segregate at moderate-to-intermediate frequencies? Once again, we do not really know. Overall, it appears that we are still some way short of certainty in affirming these requirements, at least in the Drosophila model we feature in this mini-review.

PSEA: Phenotype Set Enrichment Analysis—A New Method for Analysis of Multiple Phenotypes



Most genome-wide association studies (GWAS) are restricted to one phenotype, even if multiple related or unrelated phenotypes are available. However, an integrated analysis of multiple phenotypes can provide insight into their shared genetic basis and may improve the power of association studies. We present a new method, called “phenotype set enrichment analysis” (PSEA), which uses ideas of gene set enrichment analysis for the investigation of phenotype sets. PSEA combines statistics of univariate phenotype analyses and tests by permutation. It allows not only the analysis of predefined phenotype sets but also the identification of new phenotype sets. Apart from the application to situations where phenotypes and genotypes are available for each person, the method was adjusted to the analysis of GWAS summary statistics. PSEA was applied to data from the population-based cohort KORA F4 (N = 1,814) using iron-related and blood count traits. By confirming associations previously found in large meta-analyses on these traits, PSEA was shown to be a reliable tool. Many of these associations were not detectable by GWAS on single phenotypes in KORA F4. Therefore, the results suggest that PSEA can be more powerful than a single-phenotype GWAS for the identification of associations with multiple phenotypes. PSEA is a valuable method for the analysis of multiple phenotypes, which can help to understand phenotype networks. Its flexible design enables both the use of prior knowledge and the generation of new knowledge on the connection of multiple phenotypes. A software program for PSEA based on GWAS results is available upon request.
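
The general idea of combining univariate statistics and testing by permutation can be sketched as follows. This is not the PSEA algorithm itself, whose enrichment statistic is not described in this abstract. As a stand-in, the sketch sums the squared genotype-phenotype correlations over a phenotype set for a single SNP and builds the null distribution by shuffling the genotype vector. The statistic and every function name are assumptions made for illustration.

```python
import random


def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def set_statistic(genotype, phenotypes):
    """Sum of squared genotype-phenotype correlations over the phenotype set."""
    return sum(pearson_r(genotype, ph) ** 2 for ph in phenotypes)


def permutation_p_value(genotype, phenotypes, n_perm=999, seed=0):
    """Permutation p-value for one SNP against a set of phenotypes."""
    rng = random.Random(seed)
    observed = set_statistic(genotype, phenotypes)
    shuffled = list(genotype)
    hits = 0
    for _ in range(n_perm):
        # Shuffle genotypes only, so correlations among phenotypes are preserved.
        rng.shuffle(shuffled)
        if set_statistic(shuffled, phenotypes) >= observed:
            hits += 1
    # Add-one correction keeps the p-value strictly positive.
    return (hits + 1) / (n_perm + 1)
```

Shuffling the genotypes rather than each phenotype separately is a deliberate choice in this sketch: it breaks the genotype-phenotype link while leaving the correlation structure among the phenotypes intact, which matters when the traits in a set are related.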


Cariaso M, Lennon G: SNPedia: a wiki supporting personal genome annotation, interpretation and analysis. Nucleic Acids Res. 2012, 40: D1308-1312. 10.1093/nar/gkr798.

Eriksson N, Macpherson JM, Tung JY, Hon LS, Naughton B, Saxonov S, Avey L, Wojcicki A, Pe'er I, Mountain J: Web-based, participant-driven studies yield novel genetic associations for common traits. PLoS genetics. 2010, 6: e1000993-10.1371/journal.pgen.1000993.

Do CB, Tung JY, Dorfman E, Kiefer AK, Drabant EM, Francke U, Mountain JL, Goldman SM, Tanner CM, Langston JW: Web-based genome-wide association study identifies two novel loci and a substantial genetic component for Parkinson's disease. PLoS genetics. 2011, 7: e1002141-10.1371/journal.pgen.1002141.

Futreal PA, Liu Q, Shattuck-Eidens D, Cochran C, Harshman K, Tavtigian S, Bennett LM, Haugen-Strano A, Swensen J, Miki Y, et al.: BRCA1 mutations in primary breast and ovarian carcinomas. Science. 1994, 266: 120-122. 10.1126/science.7939630.

Lancaster JM, Wooster R, Mangion J, Phelan CM, Cochran C, Gumbs C, Seal S, Barfoot R, Collins N, Bignell G: BRCA2 mutations in primary breast and ovarian cancers. Nature genetics. 1996, 13: 238-240. 10.1038/ng0696-238.

Klein TE, Altman RB, Eriksson N, Gage BF, Kimmel SE, Lee MT, Limdi NA, Page D, Roden DM, Wagner MJ: Estimation of the warfarin dose with clinical and pharmacogenetic data. N Engl J Med. 2009, 360: 753-

Ashley EA, Butte AJ, Wheeler MT, Chen R, Klein TE, Dewey FE, Dudley JT, Ormond KE, Pavlovic A, Morgan AA: Clinical assessment incorporating a personal genome. Lancet. 2010, 375: 1525-1535. 10.1016/S0140-6736(10)60452-7.

van der Net JB, Janssens AC, Sijbrands EJ, Steyerberg EW: Value of genetic profiling for the prediction of coronary heart disease. Am Heart J. 2009, 158: 105-110. 10.1016/j.ahj.2009.04.022.

Mihaescu R, Meigs J, Sijbrands E, Janssens AC: Genetic risk profiling for prediction of type 2 diabetes. PLoS Curr. 2011, 3: RRN1208-

Wei Z, Wang K, Qu HQ, Zhang H, Bradfield J, Kim C, Frackleton E, Hou C, Glessner JT, Chiavacci R: From disease association to risk assessment: an optimistic view from genome-wide association studies on type 1 diabetes. PLoS Genet. 2009, 5: e1000678-10.1371/journal.pgen.1000678.

Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, McCarthy MI, Ramos EM, Cardon LR, Chakravarti A: Finding the missing heritability of complex diseases. Nature. 2009, 461: 747-753. 10.1038/nature08494.

Janssens AC, van Duijn CM: An epidemiological perspective on the future of direct-to-consumer personal genome testing. Investig Genet. 2010, 1: 10-10.1186/2041-2223-1-10.

Evans DM, Visscher PM, Wray NR: Harnessing the information contained within genome-wide association studies to improve individual prediction of complex disease risk. Hum Mol Genet. 2009, 18: 3525-3531. 10.1093/hmg/ddp295.

He Q, Lin DY: A variable selection method for genome-wide association studies. Bioinformatics. 2011, 27: 1-8. 10.1093/bioinformatics/btq600.

Kooperberg C, LeBlanc M, Obenchain V: Risk prediction using genome-wide association studies. Genet Epidemiol. 2010, 34: 643-652. 10.1002/gepi.20509.

Cho YS, Go MJ, Kim YJ, Heo JY, Oh JH, Ban HJ, Yoon D, Lee MH, Kim DJ, Park M: A large-scale genome-wide association study of Asian populations uncovers genetic factors influencing eight quantitative traits. Nature genetics. 2009, 41: 527-534. 10.1038/ng.357.

Li MD, Yoon D, Lee JY, Han BG, Niu T, Payne TJ, Ma JZ, Park T: Associations of variants in CHRNA5/A3/B4 gene cluster with smoking behaviors in a Korean population. PLoS One. 2010, 5: e12183-10.1371/journal.pone.0012183.

Yoon D, Kim YJ, Cui WY, Van der Vaart A, Cho YS, Lee JY, Ma JZ, Payne TJ, Li MD, Park T: Large-scale genome-wide association study of Asian population reveals genetic factors in FRMD4A and other loci influencing smoking initiation and nicotine dependence. Human genetics. 2012, 131: 1009-1021. 10.1007/s00439-011-1102-x.

Chen LS, Saccone NL, Culverhouse RC, Bracci PM, Chen CH, Dueker N, Han Y, Huang H, Jin G, Kohno T: Smoking and genetic risk variation across populations of European, Asian, and African American ancestry--a meta-analysis of chromosome 15q25. Genet Epidemiol. 2012, 36: 340-351. 10.1002/gepi.21627.

Scheet P, Stephens M: A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am J Hum Genet. 2006, 78: 629-644. 10.1086/502802.

Jakobsdottir J, Gorin MB, Conley YP, Ferrell RE, Weeks DE: Interpretation of genetic association studies: markers with replicated highly significant odds ratios may be poor classifiers. PLoS Genet. 2009, 5: e1000337-10.1371/journal.pgen.1000337.

Xu M, Tantisira KG, Wu A, Litonjua AA, Chu JH, Himes BE, Damask A, Weiss ST: Genome Wide Association Study to predict severe asthma exacerbations in children using random forests classifiers. BMC Med Genet. 2011, 12: 90-

Zou H, Hastie T: Regularization and variable selection via the elastic net. J Roy Statistical Society: Series B. 2005, 67: 301-320. 10.1111/j.1467-9868.2005.00503.x.

Cho S, Kim K, Kim YJ, Lee JK, Cho YS, Lee JY, Han BG, Kim H, Ott J, Park T: Joint identification of multiple genetic variants via elastic-net variable selection in a genome-wide association analysis. Ann Hum Genet. 2010, 74: 416-428. 10.1111/j.1469-1809.2010.00597.x.

Fang S, Fang X, Xiong M: Psoriasis prediction from genome-wide SNP profiles. BMC Dermatol. 2011, 11: 1-10.1186/1471-5945-11-1.

Ahdesmaki M, Strimmer K: Feature selection in omics prediction problems using cat scores and false nondiscovery rate control. Annals of Applied Statistics. 2010, 4: 503-519. 10.1214/09-AOAS277.

Burges C: A tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery. 1998, 2: 1-47.

Hastie T, Tibshirani R, Friedman JH: The elements of statistical learning: data mining, inference, and prediction. 2009, New York, NY: Springer, 2

Guyon I, Weston J, Barnhill S, Vapnik V: Gene Selection for Cancer Classification using Support Vector Machines. Mach Learn. 2002, 46: 389-422. 10.1023/A:1012487302797.

Rakotomamonjy A: Variable selection using svm based criteria. J Mach Learn Res. 2003, 3: 1357-1370.

Breiman L: Random Forests. Mach Learn. 2001, 45: 5-32. 10.1023/A:1010933404324.

Banfield RE, Hall LO, Bowyer KW, Kegelmeyer WP: A Comparison of Decision Tree Ensemble Creation Techniques. IEEE Trans Pattern Anal Mach Intell. 2007, 29: 173-180.

Jiang R, Tang W, Wu X, Fu W: A random forest approach to the detection of epistatic interactions in case-control studies. BMC Bioinformatics. 2009, 10 (Suppl 1): S65-10.1186/1471-2105-10-S1-S65.

DeLong ER, DeLong DM, Clarke-Pearson DL: Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics. 1988, 44: 837-845. 10.2307/2531595.

Lasko TA, Bhagwat JG, Zou KH, Ohno-Machado L: The use of receiver operating characteristic curves in biomedical informatics. J Biomed Inform. 2005, 38: 404-415. 10.1016/j.jbi.2005.02.008.

Kraft P, Hunter DJ: Genetic risk prediction--are we there yet?. N Engl J Med. 2009, 360: 1701-1703. 10.1056/NEJMp0810107.

Li MD, Cheng R, Ma JZ, Swan GE: A meta-analysis of estimated genetic and environmental effects on smoking behavior in male and female adult twins. Addiction. 2003, 98: 23-31. 10.1046/j.1360-0443.2003.00295.x.

Additional file 1: Description of traits. A table shows 52 traits and their description and measurement. (XLS 66 KB)


Additional file 2: Basic characteristics for traits. Baseline characteristics according to means and standard deviations. (XLS 50 KB)


Additional file 3: Association rules encoding high TG and high LDL levels. Representative association rules encoding high TG and high LDL. (XLS 8 MB)

Strategies for Pathway Analysis Using GWAS and WGS Data

Single-allele study designs, commonly used in genome-wide association studies (GWAS) as well as the more recently developed whole genome sequencing (WGS) studies, are a standard approach for investigating the relationship of common variation within the human genome to a given phenotype of interest. However, single-allele association results published for many GWAS studies represent only the tip of the iceberg for the information that can be extracted from these datasets. The primary analysis strategy for GWAS entails association analysis in which only the single nucleotide polymorphisms (SNPs) with the strongest p-values are declared statistically significant due to issues arising from multiple testing and type I errors. Factors such as locus heterogeneity, epistasis, and multiple genes conferring small effects contribute to the complexity of the genetic models underlying phenotype expression. Thus, many biologically meaningful associations having lower effect sizes at individual genes are overlooked, making it difficult to separate true associations from a sea of false-positive associations. Organizing these individual SNPs into biologically meaningful groups to look at the overall effects of minor perturbations to genes and pathways is desirable. This pathway-based approach provides researchers with insight into the functional foundations of the phenotype being studied and allows testing of various genetic scenarios. © 2018 by John Wiley & Sons, Inc.
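
One simple way to organize single-SNP results into pathway-level evidence is to combine the p-values of the SNPs assigned to each pathway. The sketch below uses Fisher's combined probability test for this, a technique chosen here for illustration and not necessarily the strategy the abstract refers to. Fisher's method assumes independent p-values, which SNPs in linkage disequilibrium violate, so the result should be read as a rough screen. The SNP identifiers, pathway names and p-values are invented.

```python
import math


def fisher_combined_p(p_values):
    """Combine independent p-values with Fisher's method.

    The statistic X = -2 * sum(ln p) follows a chi-squared distribution with
    2k degrees of freedom under the null; for even degrees of freedom the
    survival function has the closed form exp(-X/2) * sum_{i<k} (X/2)^i / i!.
    """
    if not p_values:
        raise ValueError("need at least one p-value")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(p_values)
    half_x = -sum(math.log(p) for p in p_values)  # X / 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_x / i  # (X/2)^i / i!
        total += term
    return math.exp(-half_x) * total


def pathway_p_values(snp_p, snp_to_pathways):
    """Group SNP p-values by pathway and combine each group."""
    grouped = {}
    for snp, p in snp_p.items():
        for pathway in snp_to_pathways.get(snp, ()):
            grouped.setdefault(pathway, []).append(p)
    return {pw: fisher_combined_p(ps) for pw, ps in grouped.items()}


# Invented example: two modest SNPs in one pathway, one weak SNP in another.
snp_p = {"snp1": 0.05, "snp2": 0.05, "snp3": 0.5}
snp_to_pathways = {"snp1": ["pathway_1"], "snp2": ["pathway_1"], "snp3": ["pathway_2"]}
combined = pathway_p_values(snp_p, snp_to_pathways)
```

Here two SNPs that are each only nominally significant combine into stronger pathway-level evidence than either provides alone, which is the motivation for grouping SNPs in the first place.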
