What software/approach to use to build a graph based on microarray gene expression correlation?

What software to use to build a graph based on microarray gene expression mutual correlation?

I have tried Cytoscape's Reactome FI and a recipe from the R Bioinformatics Cookbook, but I need more reliable, robust software, or an R/Python tutorial, for building a graph in which genes are nodes and edges represent significant (p < 0.05) positive or negative Pearson/Spearman correlations.
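One possible approach in Python can be sketched as follows. This is a minimal illustration that uses only the standard library, not a recommendation of a specific package. The function names (`correlation_graph`, `approx_p_value`) and the toy gene names are hypothetical. The p-value uses the Fisher z normal approximation, which is an assumption made here to avoid external dependencies; in practice, `scipy.stats.pearsonr` or `scipy.stats.spearmanr` would give exact p-values, and the resulting edge list can be loaded into a graph library or exported to Cytoscape.

```python
import math
from statistics import NormalDist


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant profile carries no correlation information
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def approx_p_value(r, n):
    """Two-sided p-value from the Fisher z approximation (an assumption)."""
    if n < 4:
        return 1.0  # too few samples for the approximation
    r = max(min(r, 0.999999), -0.999999)  # keep atanh finite
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def correlation_graph(expr, alpha=0.05, method="pearson"):
    """expr maps gene name -> expression values across the same samples.

    Returns edges as (gene_a, gene_b, r, sign) for pairs with p < alpha.
    Spearman correlation is computed as Pearson correlation of the ranks.
    """
    genes = sorted(expr)
    data = {g: ranks(expr[g]) if method == "spearman" else list(expr[g])
            for g in genes}
    edges = []
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            r = pearson(data[g], data[h])
            if approx_p_value(r, len(data[g])) < alpha:
                edges.append((g, h, r, "positive" if r > 0 else "negative"))
    return edges
```

With thousands of genes, an uncorrected p < 0.05 cutoff produces many false-positive edges, so lowering `alpha` (for example by a Bonferroni-style division by the number of gene pairs) is worth considering.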

A Clustering Approach for Feature Selection in Microarray Data Classification Using Random Forest

Abstract: Microarray data play an essential role in diagnosing and detecting cancer. Microarray analysis allows the examination of gene expression levels in specific cell samples, where thousands of genes can be analyzed simultaneously. However, microarray data sets have very few samples and very high dimensionality. Therefore, classifying microarray data requires a dimensionality reduction step. Dimensionality reduction can eliminate redundancy in the data, so that the features used in classification are only those that are highly correlated with their class. There are two types of dimensionality reduction, namely feature selection and feature extraction. In this paper, we use the k-means algorithm as the clustering approach for feature selection. The proposed approach groups features that have the same characteristics into one cluster, so that redundancy in the microarray data is removed. The features in each cluster are ranked using the Relief algorithm, so that the best-scoring element of each cluster is obtained. The best elements of all clusters are selected and used as features in the classification process, which is performed with the Random Forest algorithm. Based on the simulation, the proposed approach achieved accuracies of 85.87%, 98.9%, and 89% on the Colon, Lung Cancer, and Prostate Tumor datasets, respectively. The accuracy of the proposed approach is therefore higher than that of Random Forest without clustering.

Keywords: Classification, Clustering, Dimensional Reduction, Microarray, Random Forest


Several computational methods have been developed over the last two decades to infer interactions between genes based on their expression [1]. Early work utilized large compendiums of microarray data [2], while more recent work has focused on RNA-Seq and scRNA-Seq [3]. While the identification of pairwise interactions was the goal of several studies that relied on such methods, others used the results as features in a classification framework [4] or as pre-processing steps for the reconstruction of biological interaction networks [5]. Most work to date has focused on intracellular interactions and networks. In such studies, we look for interacting genes involved in a pathway or in the regulation of other genes within a specific cell. In contrast, studies of extracellular interactions (i.e., interactions of genes or proteins in different cells) mainly utilized small-scale experiments in which a number of ligand and receptor pairs were studied in the context of a cell line or tissue [6]. However, recently developed methods for spatial transcriptomics are now providing high-throughput information about both the expression of genes within a single cell and the spatial relationships between cells [7,8,9,10,11]. Such information opens the door to much larger-scale analysis of extracellular interactions.

Current methods for inferring extracellular interactions from spatial transcriptomics have mostly focused on unsupervised correlation-based analysis. For example, the Giotto method calculates the effect of neighboring cell types on gene expression [12]. While these approaches perform well in some cases, they may not identify interactions that are limited to a specific area or to specific cell types, or that are related to more complex patterns (for example, three-way interactions).

To overcome these issues, we present a new method that is based on graph convolutional neural networks (GCNs). GCNs were introduced in the machine learning literature a few years ago [13]. Their main advantage is that they can utilize the power of convolutional NNs even in cases where spatial relationships are not complete [14, 15]. Specifically, rather than encoding the data using a 2D matrix (or a 1D vector), GCNs use the graph structure to encode relationships between samples. The graph structure (represented as a normalized interaction matrix) is convolved together with the information for each of the nodes in the graph, leading to an NN that can utilize both the values encoded in each node (in our case, gene expression) and the relationships between the cells expressing these genes.

To apply GCNs to the task of predicting extracellular interactions from gene expression (GCNG), we first convert the spatial transcriptomics data to a graph representing the relationships between cells. Next, for each pair of genes, we encode their expression and use GCNG to convolve the graph data with the expression data. In this way, the NN can utilize not just first-order relationships, but also higher-order relationships in the graph structure. We discuss the specific transformation required to encode the graph and gene expression, how to learn parameters for the GCNG, and how to use it to predict new interactions.
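The idea of convolving a normalized cell graph with per-cell expression values can be illustrated with a single graph-convolution step. The sketch below is not the GCNG architecture, which is not specified in this text. It assumes the commonly used symmetric normalization with self-loops, D^(-1/2) (A + I) D^(-1/2), and a ReLU activation; all function names are hypothetical.

```python
import math


def normalized_adjacency(adj):
    """Symmetrically normalize a cell-cell adjacency matrix with self-loops.

    adj is a square list of lists (1.0 where two cells are neighbors).
    Assumed normalization: D^(-1/2) (A + I) D^(-1/2).
    """
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def gcn_layer(norm_adj, features, weights):
    """One graph-convolution step: ReLU(norm_adj @ features @ weights).

    features has one row per cell and one column per gene in the pair;
    weights maps those columns to hidden units.
    """
    z = matmul(matmul(norm_adj, features), weights)
    return [[max(0.0, v) for v in row] for row in z]
```

Stacking two such layers lets each cell's output depend on neighbors of neighbors, which is one way to read the "higher-order relationships" mentioned above.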

We test our approach on three datasets from the two spatial transcriptomics methods that currently profile the largest number of genes, SeqFISH+ [16] and MERFISH [17]. As we show, GCNG greatly improves upon correlation-based methods when trying to infer both autocrine and extracellular gene interactions involved in cell-cell interactions. We visually analyze some of the correctly predicted pairs and show that GCNG can overcome some of the limitations of unsupervised methods by focusing on only a relevant subset of the data. Analysis of the predicted genes shows that many are known to be involved in a similar functional pathway, supporting their top ranking.

Results and Discussion

Co-expression network construction and topological analysis

Constructing co-expression networks

Most of the existing co-expression analyses construct value-based networks. We believe that the value-based methods are significantly limited by their use of a homogeneous threshold for all the genes in the network. In reality, genes in different functional pathways may be regulated by different mechanisms, and therefore may exhibit different patterns of co-expression. In particular, genes in one functional pathway may be strongly mutually co-expressed, while genes in another functional pathway may be only weakly co-expressed. As a result, if we choose a stringent global threshold, many genes in the weakly co-expressed pathway may be disconnected. On the other hand, if we attempt to connect the weakly co-expressed genes into the network, the threshold may become so low that the genes in the strongly co-expressed pathway may have many links to genes in other pathways, making further analysis difficult. For example, as shown in Figure 1, to construct a co-expression network for the 3000 yeast genes that we will see in the next subsection, if we allow only 10% of the genes to have no connections, most genes will have more than 300 connections, while if we reduce the median degree to 10, more than one third of the genes will have no connection at all.

Median degree and number of singleton nodes in a value-based yeast co-expression network. Horizontal axis: the Pearson correlation coefficient threshold for the value-based network construction. Left vertical axis: median number of co-expression links per gene. Right vertical axis: number of genes without a co-expression link.
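The trade-off described above (a strict global threshold leaves weakly co-expressed genes disconnected, a loose one over-connects everything) can be reproduced on a toy similarity matrix. The sketch below assumes a symmetric list-of-lists similarity matrix; the function names and the toy values are hypothetical.

```python
from statistics import median


def value_based_degrees(sim, threshold):
    """Degree of each gene when edges are pairs with similarity >= threshold."""
    n = len(sim)
    return [sum(1 for j in range(n) if j != i and sim[i][j] >= threshold)
            for i in range(n)]


def summarize(sim, threshold):
    """Median degree and number of singleton genes at a global threshold."""
    deg = value_based_degrees(sim, threshold)
    return {"median_degree": median(deg),
            "singletons": sum(1 for d in deg if d == 0)}


def toy_similarity():
    """Six genes: a strongly co-expressed module (0-2, similarity 0.9),
    a weakly co-expressed module (3-5, similarity 0.4), and weak
    cross-module similarity (0.3)."""
    sim = [[0.3] * 6 for _ in range(6)]
    for module, value in (((0, 1, 2), 0.9), ((3, 4, 5), 0.4)):
        for i in module:
            for j in module:
                sim[i][j] = value
    for i in range(6):
        sim[i][i] = 1.0
    return sim
```

On this toy matrix, a threshold of 0.8 keeps only the strong module and leaves the three weakly co-expressed genes as singletons, while a threshold of 0.25 connects every gene to every other gene.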

To deal with this problem, we propose a simple rank-based method to construct co-expression networks. We first calculate the Pearson correlation coefficient (or some other similarity measure) between every pair of genes. For each gene g_i, we rank all other genes by their similarity to g_i. We then connect every gene to the d genes that are most similar to it. Compared to the value-based method, the rank-based method essentially uses a different local similarity threshold for each gene. It is important to mention that even with a fixed d, the number of connections is not the same for all genes. This is because of the asymmetric nature of the ranking: the rank of gene i with respect to gene j is not necessarily equal to the rank of gene j with respect to gene i. Therefore, although gene i has only d genes on its top-d list, other genes that are not on i's list may list i as one of their top-d genes. The mean degree is between d and 2d, the minimum degree is d, and the maximum degree can be as large as n - 1, with n being the number of genes in the network.
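The rank-based construction can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the similarity matrix is a symmetric list of lists, ties are broken by gene index, and the function names are hypothetical.

```python
def rank_based_network(sim, d):
    """Connect every gene to its d most similar genes.

    An undirected edge (i, j) is kept if j is in i's top-d list OR i is in
    j's top-d list, which is why degrees can exceed d.
    """
    n = len(sim)
    edges = set()
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: sim[i][j], reverse=True)
        for j in others[:d]:
            edges.add((min(i, j), max(i, j)))
    return edges


def degrees(edges, n):
    """Degree of each of the n nodes in an undirected edge set."""
    deg = [0] * n
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    return deg
```

For example, if gene 0 is the most similar gene to each of three other genes, then with d = 1 gene 0 ends up with degree 3 = n - 1 while the others have degree 1 = d, illustrating the asymmetry described above.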

The rank-based method may appear to be limited by a drawback similar to that of the value-based method: the former uses a global rank threshold and the latter uses a global value threshold for all genes. However, as we discussed above, in the rank-based network, different genes can have different numbers of connections, even though all genes have the same rank threshold, because of the asymmetric nature of ranking. More importantly, our objective is not to identify all co-expressed genes for each gene, but to construct a sparse network such that the modular structure of the system can be successfully identified. To achieve this, a good co-expression network needs to have the following two properties: (i) there are very few false-positive connections, and (ii) nodes within modules are well connected into a single component, while connections between modules are sparse. A value-based network can hardly provide the two properties simultaneously, for the reasons given above. In contrast, the key idea in our rank-based method is that by using a uniform small value of d, we ensure that (i) the network only contains highly reliable edges, and (ii) each module of the network is (almost) fully connected into a single component, both theoretically and empirically (see below). As in most clustering algorithms, we assume that gene expressions in different modules are generated by different distributions, while gene expressions in the same module are generated by a common (unknown) distribution. Therefore, the rank-based sub-network of genes from the same module is a nearest neighbor graph constructed on a set of random geometric points. Theoretically, it is known that a nearest neighbor graph on random geometric points has a high probability of being connected even with a very small number of neighbors (d) [31].
To empirically test this as well as to find the range of d for typical microarray data, we randomly generated data sets with 1000 genes and various dimensions (conditions) using a Gaussian distribution. We then constructed a rank-based co-expression network using different values of d and measured the number of disconnected components in the resulting network. Remarkably, we find that for data of dimension > 10, a nearest neighbor graph with 1000 nodes is almost always fully connected with d = 2 neighbors. Even for data of smaller dimensions, the graph can be connected with at most 4 neighbors (Figure 2). The results do not vary significantly when the number of genes or the type of distribution is changed. In the next subsection, we also show that the yeast gene co-expression network is connected with d = 2. In practice, we find that a value of d between 3 and 5 is sufficient for most cases. This simple network construction method can also be combined with other strategies that were developed for value-based networks. For example, the raw similarity values can be refined by considering local neighborhood or shortest path information before rank transformation [16]; when selecting edges according to ranks, a threshold based on raw similarity values may be imposed simultaneously to ensure confidence in the edges being created. Ideally, methods can also be developed to automatically select the optimal d, as in [8]. The rank-based method can also be applied to construct networks of other entities, as long as a similarity measure can be defined. One example is to construct a network of samples from microarray data, where the nodes are samples and the similarity between two samples can be measured by the Pearson correlation coefficient between their gene expression profiles. Later we will show an application of a sample co-expression network where each sample is a cell type.

Connectivity of rank-based co-expression networks on random data. Each data set contains 1000 random geometric points in a certain number of dimensions, generated using the standard Gaussian distribution. Y-axis shows the number of disconnected components in the co-expression network constructed by the rank-based approach.
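The random-data experiment can be approximated with a short simulation. The sketch below assumes Euclidean distance between points as the (dis)similarity measure, which is a choice made here for illustration, and counts connected components with a union-find structure. The function names and default seed are hypothetical.

```python
import random


def knn_edges(points, d):
    """Undirected edges linking every point to its d nearest neighbors
    (squared Euclidean distance, an assumption of this sketch)."""
    n = len(points)
    edges = set()
    for i in range(n):
        dist = sorted(
            (sum((a - b) ** 2 for a, b in zip(points[i], points[j])), j)
            for j in range(n) if j != i
        )
        for _, j in dist[:d]:
            edges.add((min(i, j), max(i, j)))
    return edges


def count_components(n, edges):
    """Number of connected components among n nodes, via union-find."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in edges:
        parent[find(i)] = find(j)
    return len({find(i) for i in range(n)})


def simulate(n_genes, n_dims, d, seed=0):
    """Components of a d-nearest-neighbor graph on standard Gaussian points."""
    rng = random.Random(seed)
    points = [[rng.gauss(0, 1) for _ in range(n_dims)]
              for _ in range(n_genes)]
    return count_components(n_genes, knn_edges(points, d))
```

Calling `simulate(1000, 20, 2)` mirrors one setting from the text; the pure-Python distance loop is quadratic in the number of genes, so smaller sizes are more practical for quick checks.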

Topology of yeast co-expression networks

Previous studies have analyzed the topologies of various networks, including biological and social networks, and suggested three common topological properties: scale-free, small-world, and hierarchically modular [26, 27, 32–34]. Although debate exists [35, 36], it is generally believed that these properties may be related to the robustness and stability of the underlying systems [26, 27, 32–34]. For example, a small-world network has a small diameter and a large clustering coefficient (see Methods), which is believed to be related to an efficient and controlled flow of information [26, 34]. In a scale-free network, the probability for a node to have k edges follows a power-law distribution, i.e. P(k) = c × k^(-γ). The implication of the scale-free property is that a few nodes in the network are highly connected, acting as hubs, while most nodes have low degrees. Scale-free networks are believed to be robust to random failures, but vulnerable to deliberate attacks [27, 32]. In comparison, in a random network (specifically, an Erdős–Rényi random network [26]), connections are spread almost uniformly across all nodes [26, 34]. Furthermore, although a random network may also have a small diameter, it usually has a near-zero clustering coefficient [26, 34]. Several studies have analyzed value-based gene co-expression networks and reported some interesting but controversial results [11–13, 25]. Here we analyze the topologies of both the rank-based and the value-based networks, and compare them with previous results.
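Two of the quantities used in this analysis, the clustering coefficient and the degree distribution P(k), can be computed directly from an edge list. The Methods section is not reproduced here, so the sketch assumes the standard local clustering coefficient (the fraction of a node's neighbor pairs that are themselves connected); function names are hypothetical.

```python
from collections import Counter


def adjacency(edges):
    """Build a node -> set-of-neighbors map from undirected edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def clustering_coefficient(adj, node):
    """Fraction of pairs of neighbors of `node` that are linked to each other."""
    nbrs = list(adj[node])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for a in range(k) for b in range(a + 1, k)
                if nbrs[b] in adj[nbrs[a]])
    return 2 * links / (k * (k - 1))


def degree_distribution(adj):
    """Empirical P(k): fraction of nodes having each degree k."""
    counts = Counter(len(nbrs) for nbrs in adj.values())
    total = len(adj)
    return {k: c / total for k, c in sorted(counts.items())}
```

Averaging `clustering_coefficient` over all nodes gives a network-level value that can be compared between a real network and its randomized counterparts.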

We obtained a set of yeast gene expression data measured in 173 different time points under various stress conditions [37], and selected 3000 genes that showed the highest variations. We constructed four gene co-expression networks using the rank-based method with d = 2, 3, 4 and 5, respectively. For each rank-based network, we constructed two random networks as follows. First, we randomly permuted the expression data of each gene independently, and constructed a rank-based network using the permuted data. Second, we randomly rewired the connections in a true rank-based network, but preserved the degree for every node [34]. For comparison, we also constructed four value-based networks, using the Pearson correlation coefficient as a similarity measure. The thresholds were chosen such that the average degrees are 10, 30, 50, and 100, respectively, in the resulting networks. Similar to the rank-based networks, we obtained two random networks for each true value-based network, one constructed from randomly permuted data and the other by randomly rewiring the true network.
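The two randomization schemes described above can be sketched as follows. The rewiring uses repeated double-edge swaps, a common way to preserve every node's degree; whether the authors used exactly this procedure is an assumption. Function names, the attempt limit, and the seeds are hypothetical.

```python
import random


def permute_each_gene(expr, seed=0):
    """Shuffle each gene's expression values independently across conditions."""
    rng = random.Random(seed)
    return [rng.sample(row, len(row)) for row in expr]


def rewire_preserving_degrees(edges, n_swaps, seed=0):
    """Randomly rewire an undirected network while keeping all degrees.

    Each accepted swap replaces edges (a, b) and (c, d) by (a, d) and (c, b),
    rejecting swaps that would create self-loops or duplicate edges.
    """
    rng = random.Random(seed)
    edge_list = sorted(tuple(sorted(e)) for e in edges)
    edge_set = set(edge_list)
    done = attempts = 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        (a, b), (c, d) = edge_list[i], edge_list[j]
        if rng.random() < 0.5:
            c, d = d, c  # consider both orientations of the second edge
        new1, new2 = tuple(sorted((a, d))), tuple(sorted((c, b)))
        if a == d or c == b or new1 in edge_set or new2 in edge_set:
            continue
        edge_set -= {edge_list[i], edge_list[j]}
        edge_set |= {new1, new2}
        edge_list[i], edge_list[j] = new1, new2
        done += 1
    return edge_set


def degree_sequence(edges, n):
    """Degree of each of the n nodes in an undirected edge set."""
    deg = [0] * n
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg
```

The permuted data would then be passed through the same network construction as the real data, while the rewired network is compared directly with the original one.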

Table 1 lists some statistics of these networks. In the rank-based networks, almost all genes are linked to the largest component with d as small as 2. Furthermore, compared to both the randomly rewired networks and the networks constructed from randomly permuted data, the true rank-based co-expression networks have slightly larger average path lengths and diameters, but much larger clustering coefficients, indicating that the rank-based co-expression networks have the small-world property. In contrast, the true value-based co-expression networks contain many singletons. For example, with a Pearson correlation coefficient threshold of 0.69, about 900 genes are singletons, even though the average node degree is much higher than in the rank-based networks. Furthermore, although the value-based networks have high clustering coefficients, their randomly rewired counterparts have almost equally high clustering coefficients. This observation suggests that the high clustering coefficient of the value-based networks arises partly because their non-singleton nodes are almost completely connected, in which case the structure cannot be destroyed by any random rewiring.

Figures 3(a) and 3(b) show the degree distributions of these networks. As indicated by a linear relationship in the log-log plot, the rank-based networks constructed from the real data exhibit a power-law degree distribution for all the d values considered. This suggests that an overall scale-free topology is a fairly robust feature of the co-expression networks. In contrast, the networks constructed from randomly permuted gene expression data contain significantly fewer high-degree nodes, and exhibit exponential degree distributions. The value-based networks appear to follow power-law degree distributions as well; however, they have a much larger number of high-degree nodes than the rank-based networks.

Topological properties of co-expression networks. (a) Degree distribution of rank-based co-expression networks. (b) Degree distribution of value-based co-expression networks. (c) Relationship between clustering coefficient and degree in rank-based and value-based co-expression networks.

To quantify the difference between the degree distributions of the value-based and rank-based networks, we fitted a power-law function for the degree distribution of each network to determine its γ parameter. The values of γ in the rank-based networks are consistently between two and three. This is typical in many biological networks such as PPI networks and metabolic networks, as well as in real-world social and technology networks [26, 34]. In comparison, the γ values in the value-based networks are below one (Figure 3b). Theoretically, it is known that a scale-free network with γ < 2 has no finite mean degree when its size grows to infinity, and is dominated by nodes with large degrees [26]. Therefore, small values of γ for co-expression networks were reported in several previous studies as a significant difference between the co-expression networks and other biological networks [15, 17]. Our results suggest that this difference may simply be an artifact of the network construction method. Consider that genes in some modules are strongly co-expressed with one another, while genes in some other modules are weakly co-expressed. Using the value-based method, when the similarity cutoff is gradually decreased, the genes within the strongly co-expressed modules will be first connected, up to a point that they are almost completely connected, before any gene in the weakly co-expressed modules can be connected to their within-module partners. As a result, the co-expression network will have many genes with large degrees, resulting in a small slope in the log-log plot. In contrast, with the rank-based method, genes in both strongly and weakly co-expressed modules can be connected, as essentially a different similarity threshold is used for each gene. Therefore, rank-based networks can usually capture the topology of both strongly and weakly co-expressed modules, while value-based networks are often dominated by the strongly co-expressed modules.
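One simple way to estimate γ from a degree distribution is a least-squares line fit in log-log space, since log P(k) = log c - γ log k. The text does not state which fitting procedure was used, so the sketch below is an assumption; the function name is hypothetical.

```python
import math


def fit_power_law_exponent(degree_probs):
    """Estimate gamma in P(k) = c * k^(-gamma) by linear regression of
    log P(k) on log k. degree_probs maps degree k -> probability P(k)."""
    pts = [(math.log(k), math.log(p))
           for k, p in degree_probs.items() if k > 0 and p > 0]
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in pts)
             / sum((x - mean_x) ** 2 for x, _ in pts))
    return -slope  # the slope of the log-log line is -gamma
```

A shallow slope (small γ) corresponds to the heavy excess of high-degree nodes described for the value-based networks.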

Moreover, previous studies have reported that gene co-expression networks lack the hierarchically modular property [11, 12]. This property is characterized by a reciprocal relationship between a node's degree and its clustering coefficient [33]. Again, we have found that this claim only applies to the value-based networks. As shown in Figure 3(c), there is a clear reciprocal relationship between the node degree and node clustering coefficient in the rank-based networks, but not in the value-based networks. This suggests that gene co-expression networks can also have hierarchical structures.

Together, these experiments show that the rank-based co-expression networks have all the common topological properties of many other biological networks, while the value-based networks seem to differ significantly. Although these results do not necessarily prove that rank-based networks are biologically more meaningful than value-based networks, the former seem to capture the underlying topological structures better.

Module discovery and analysis in gene co-expression networks

Gene co-expression networks with thousands of nodes are difficult to visualize and comprehend. A useful strategy for analyzing such a network is to partition it into subnetworks, where the nodes within each subnetwork are relatively densely connected to one another but have fewer connections to the other subnetworks. In gene co-expression networks, such subnetworks can be considered as candidate functional modules, as genes within each subnetwork are mutually co-expressed, while co-expression between genes in different subnetworks is sparse. Many graph partitioning algorithms have been developed in computer science [38]. As in clustering, one major difficulty in graph partitioning is determining the number of partitions. Some methods do not require this to be explicitly determined in advance, but require other parameters, which are also difficult to obtain. For example, MCL, one of the best graph partitioning algorithms, requires an inflation parameter, and setting the parameter to different values may produce very different results [29].

To address this difficulty, we introduce an algorithm that we have developed recently for identifying "communities" in arbitrary networks [28]. The main motivation for the algorithm is that each "community", or a subnetwork, must contain more intra-community edges than would be expected by chance if the connections were random. With this motivation, we developed an algorithm to optimize an objective function called modularity, which is precisely defined as the percentage of intra-community edges minus the random expectation (see Methods). The algorithm, named Qcut, has been shown to be effective in finding statistically significant and practically interesting graph partitions in many synthetic networks, social networks, and biological networks, without any user-tunable parameters, and has outperformed the existing algorithms based on similar motivations [28].
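The objective can be made concrete with a small function that scores a given partition. The Methods section is not reproduced here, so this sketch assumes the widely used Newman-Girvan form of modularity, Q = Σ_c [e_c / m − (deg_c / 2m)²], where e_c is the number of edges inside community c, deg_c is the total degree of its nodes, and m is the number of edges. It only evaluates a partition and does not implement the Qcut optimization itself; the function name is hypothetical.

```python
def modularity(edges, community):
    """Modularity of a partition of an undirected, unweighted network.

    edges: iterable of (u, v) pairs; community: dict node -> community label.
    Assumed definition: sum over communities of
    (fraction of intra-community edges) - (expected fraction at random).
    """
    edges = list(edges)
    m = len(edges)
    intra = {}
    degree_sum = {}
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            intra[cu] = intra.get(cu, 0) + 1
    return sum(intra.get(c, 0) / m - (total / (2 * m)) ** 2
               for c, total in degree_sum.items())
```

For two triangles joined by a single edge, splitting the network into the two triangles gives Q = 5/14, while putting all nodes into one community gives Q = 0.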

We evaluate the performance of Qcut on gene co-expression networks in several ways. We first use synthetic microarray data where the true modular structure is known, so that we can directly measure the accuracy. We then use two real microarray data sets to evaluate the overall biological significance of the identified gene modules, with two different metrics. The first metric is a commonly used approach based on the enrichment of specific Gene Ontology terms in the modules, which may be biased by the number of modules and module sizes. The second is a new metric that we introduced based on the idea of reference networks, which can be obtained from a variety of sources, such as gene annotations or protein-protein interaction networks (see Methods).

Evaluation using synthetic microarray data

To objectively evaluate the accuracy of the modules detected by Qcut, we tested it on a large collection of synthetic gene expression data. The data sets, available at, were used to evaluate many clustering algorithms in a previous study [39]. Each data set contains simulated expression data of approximately 600 genes under 50 conditions. Each gene was pre-assigned to one of fifteen clusters, and the genes in the same cluster had their expression profiles generated from a common log-normal distribution. Gaussian noise was then added to the data set to simulate experimental noise. A higher level of Gaussian noise generally makes the data more difficult to cluster. Since the correct clusters are known, we used a well-known metric called the adjusted Rand Index to measure the accuracy of Qcut (see Methods) [40].
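The adjusted Rand Index compares a predicted clustering with the known one by counting pairs of genes that are grouped together in both, corrected for chance agreement. A minimal sketch of the standard formula follows; the function name is hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two labelings of the same items.

    1.0 means identical partitions (up to label names); values near 0 mean
    agreement no better than chance.
    """
    n = len(labels_true)
    together_both = sum(comb(c, 2)
                        for c in Counter(zip(labels_true, labels_pred)).values())
    together_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    together_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = together_true * together_pred / comb(n, 2)
    maximum = (together_true + together_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. every item in a single cluster
    return (together_both - expected) / (maximum - expected)
```

Because only co-membership of pairs is counted, the index is unaffected by how the clusters are named or ordered.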

We first compared the accuracy of Qcut on co-expression networks constructed by three methods: value-based, rank-based, and CLR [19]. We used Euclidean distance as the basis to measure the dissimilarity between two genes. For the value-based method, we normalized the distance to be between 0 and 1, and constructed a series of co-expression networks for each data set using different threshold values. As shown in Figure 4(a), the threshold that results in the best clustering accuracy varies between data sets. Noisier data sets need a larger threshold value, which suggests that choosing the right threshold is critical for the value-based method. The CLR method, in contrast, by converting the raw distances to z-scores, effectively removes this dependency, and the best clustering accuracy is achieved at the same z-score of 2, corresponding to a p-value of 0.05, for all data sets (Figure 4b). Interestingly, for the rank-based method, the clustering accuracy is almost invariant for rank cutoffs between 2 and 8 (Figure 4c). Figure 4(d) shows the best accuracy that can be achieved on the three types of networks. As can be seen, the rank-based networks clearly have the highest accuracy for intermediate levels of noise (SD = 0.4 or 0.8). For data with lower noise, all three methods resulted in perfect accuracy, and for the data sets with the highest level of noise (SD = 1.2), all three methods converge to about the same accuracy. Next we compared the clustering accuracy of Qcut on rank-based networks with several widely used clustering algorithms, including k-means clustering, hierarchical clustering [1], and tight clustering [39], applied directly to the gene expression data without deriving co-expression networks. In this test, Qcut was applied to rank-based co-expression networks constructed with d equal to 4.
In addition, we also tested one of the best graph partitioning algorithms, the Markov Clustering algorithm (MCL) [29], which was applied to rank-based networks as well. Since the results of MCL depend heavily on the choice of an inflation parameter (I), we applied MCL to the rank-based networks constructed with d fixed at 4, but varied I from 1.3 to 1.7 in increments of 0.1, and took the best clustering accuracy resulting from these parameters. We used the MATLAB (The MathWorks Inc.) implementation of the k-means and hierarchical clustering algorithms. k-means clustering was run with k equal to 15, and was repeated 50 times for each data set to obtain the best results. The hierarchical clustering was performed using average linkage, and the final cluster tree was cut at an appropriate depth to generate 15 clusters. The accuracies of tight clustering were directly obtained from the original study that was done on the same data sets [39]. Our evaluation results show that, even without any parameter tuning, Qcut outperformed the competing algorithms in identifying the true modular structures embedded within the synthetic microarray data. As shown in Figure 5, the clustering accuracy of Qcut is clearly better than that of hierarchical clustering and tight clustering. The accuracy of Qcut is similar to that of k-means, except for the data sets with the highest level of noise. The synthetic data sets with the highest level of noise may represent an extreme case in practice, as many of the clusters in these data sets are not distinguishable visually (Figure S1 in Additional File 1). However, k-means achieved this accuracy with the number of clusters given explicitly, while Qcut did not have this information at all. In these synthetic data sets, the number of clusters is the single most important parameter, and k-means is expected to work well when that is known.
We tried to combine k-means with several popular methods to automatically determine the number of clusters, including the gap statistic [41], Silhouette [42], and Dunn's Index [43]. Our results suggest that if the values of k are automatically determined, k-means performs much worse than our method, especially for data sets with SD ≥ 0.4 (Figure 5). The results of MCL are two-fold. On one hand, when an appropriate inflation parameter is chosen (I = 1.5 in this experiment), MCL has an accuracy similar to that of Qcut, except for the data sets with the highest noise, indicating a superior performance of graph-based algorithms in general. On the other hand, the accuracy of MCL depends on the choice of the inflation parameter, and may be much lower than that of clustering algorithms if a suboptimal inflation parameter is used (data not shown).

Effects of network construction methods on the clustering accuracy of Qcut. (a) Clustering accuracy on value-based networks, as a function of the distance cutoffs. (b) Clustering accuracy on CLR co-expression networks, as a function of the Z-score cutoffs. (c) Clustering accuracy on rank-based networks, as a function of the rank cutoffs. (d) Best clustering accuracy on the three types of networks, constructed with the optimal cutoffs. In all four plots, each data point is an average over the results of 100 synthetic microarray data sets.


Genomic studies are producing large databases of molecular information on cancers and other cell and tissue types. As is universally recognized, these databases represent an unparalleled opportunity for pharmaceutical advance. The challenge is to link the data to the drug discovery and development processes. An 'information-intensive' approach [6] formulated several years ago (by one of the present authors and colleagues) provided a blueprint for one productive way to meet that challenge. It provided a way to organize and inter-relate potential therapeutic targets, molecular mechanisms of action of compounds tested and modulators of activity within cancer cell lines. It also suggested a way to project genomic information on the cells used for testing through the activity patterns of compounds to molecular structural characteristics of those compounds [6]. However, that suggestion was not pursued, and it was not converted into a fluent methodology for exploration or into a software package for doing so. Required was a way to couple the genomic (or proteomic) information with structure-based data mining to provide insights fruitful for follow-up in experimental structure-activity studies. Here we have presented such a method, based on the relational database system schematized in Figure 1. Included are gene expression levels for 3748 genes in 60 cell lines (T-matrix), activity values for 4463 compounds in 60 cell lines (A-matrix), and binary indices of occurrence of 27,000 structural features in 4463 compounds (S-matrix). As a proof-of-principle example of the approach, we have used it to identify subclasses of quinones well correlated with genes that are selectively expressed either in melanomas or in leukemias. A brief discussion of these agents and their genomic associations follows.

Of the 4463 compounds in the NCI set used in this analysis, 462 (10.4%) are quinones, quinoneimines or quinone methides. The mechanisms of quinone cytotoxicity [49,50,51] are complex and varied. However, two principal pathways are well established. First, quinones act as redox-active molecules that can undergo either 1- or 2-electron reductions, depending on the cellular environment. The mechanism for 1-electron reduction involves redox cycling between quinone and semiquinone radical states, leading to consumption of NADH and formation of hydroperoxy radicals. Depending on the cellular environment, other reactive oxygen species, including superoxides, hydrogen peroxides and hydroxyl radicals, can be formed. These reactive species can, in turn, cause peroxidation of lipids, oxidation and strand breaks in DNA, consumption of reducing equivalents (eg, NAD(P)H or glutathione), and oxidation of other macromolecules. In the second pathway, unhindered quinones act as Michael acceptors, causing cellular damage through alkylation of thiol or amino groups of glutathione, proteins and DNA. Mitomycin C and E09, for example, undergo reductive alkylation [53] by mechanisms that involve opening the aziridinyl ring.

In the present study, we found that several genes selectively over-expressed in melanomas have expression patterns that are well correlated with the activity patterns of a subclass of benzodithiophenedione compounds. This class shows a distinctive substituent effect: Benzodithiophenediones with strong electron-withdrawing substituents (eg, NSC 682991; see Figure 5) show low or negative correlation with many of the genes that are over-expressed in melanomas (see Table 4a), whereas members with electron-donating substituents (eg, NSC 656238) show high positive correlations with those genes. For example, NSC 656238 is 10 times more potent against the melanoma cell lines than is NSC 682991. Electron-withdrawing substituents such as nitro groups raise the reduction potential of the quinone moiety, making it a better oxidant than it is in compounds with electron-donating groups. A plausible hypothesis for the cytotoxicity of a benzodithiophenedione is that it may disrupt an essential cellular redox process. This hypothesis is consistent with the roles of genes over-expressed in melanomas. In particular, Rab7 31,32,33 is the gene most strongly correlated with the electron-donating benzodithiophenediones. For example, the correlation coefficient with NSC 656238 is 0.67. Genes in the Rab family encode small GTP-binding proteins that ensure specificity of the docking of transport vesicles. In particular, Rab7 has recently been identified as a key regulatory protein for aggregation and fusion of late endocytic lysosomes. Cells expressing a dominant–negative Rab7 mutant have been reported not to form lysosomal aggregates. 31 The dispersed lysosomes exhibit sharply higher pH, presumably due to disruption of the vacuolar proton pump. Interestingly, in this context, another gene highly correlated with NSC 656238 is ACP5 (r = 0.51). ACP5 (Clone ID 127821) is a unique lysosomal membrane ATPase responsible for maintaining the pH.
Several other lysosomal proteins are also well correlated with the electron-donating benzodithiophenediones. Two other ATPases, ATP6B2 (Clone ID 380399) and ATP6E (Clone ID 417475), have correlation coefficients of 0.40 and 0.46, respectively, with NSC 656238. Both of these ATPases are reported to be lysosomal H + transporters. Other lysosomal genes, ASAH (Clone ID 417819) and LAMP2 (357407), also show high correlations (0.50 and 0.40, respectively) with this quinone. Thus, genes well-correlated with this particular quinone class seem to be enriched in lysosomal proteins that are involved in vacuolar proton pump activity.

This substituent effect suggests a possible link between the oxidation potential of quinones, the proton pump, and the electron transport chain. A plausible hypothesis is that NSC656238 may act as a surrogate oxidizing agent in the electron transport chain. Ubiquinone-10 is the electron acceptor for mitochondrial oxidative phosphorylation. Menadione (2-methylnaphthoquinone), a compound known to compete with ubiquinone in the oxidative phosphorylation chain, also shows a reasonable correlation with Rab7 (r = 0.40). The reduction potentials of menadione and ubiquinone are known, 57,58 but the reduction potential of NSC656238 has not been reported. We speculate that its oxidizing potential allows it to compete successfully with ubiquinone in the electron transport chain, as does menadione. The oxidizing potential of the quinone moiety would be a key factor in such a mechanism. Although compound NSC 682991 is a better oxidant, it may be reduced by cellular protective agents such as glutathione. Thus, at low concentrations, it may not be available to compete with ubiquinone and therefore may be effective only at higher concentrations.

We have illustrated a way to couple information on differential gene expression with structure-based data mining. The approach provides insights that may allow selective targeting of cellular mechanisms preferentially operating in specific tissues. The benzodithiophenedione series that emerged from this study is a clear example. This is a well-defined and structurally homogeneous series of quinones, which are well correlated with the expression patterns of Rab7 and other melanoma genes. The substituent effect seen in this series suggests a relationship between the oxidation potential of a compound and its correlation with the expression patterns of specific genes. This relationship prompts new questions that can be pursued experimentally: Is there a quantitative relationship between the oxidation potential of the benzodithiophenedione series and melanoma cytotoxicity? If so, is there a direct relationship between the selective cytotoxicity of NSC 656238 and Rab7 or the ATPases that are over-expressed in melanomas? The data currently available do not permit answers to these questions, but the analyses described here do provide indirect evidence of connections that can be tested in experimental structure–activity studies.

In this article, we have described a general analytical method, designated SAT, for discovering relationships between compound classes and potential molecular targets. The method uses statistical techniques to select genes with characteristic expression patterns, then applies structure-based data mining software to identify compound substructural classes that are well-correlated with the expression patterns of those genes. Selected members of the class identified can then be used as molecular probes to identify additional compound–gene associations and thereby refine hypotheses or focus further experiments. This semi-empirical method projects genomic information from cells through compound activity patterns to molecular structural features of drugs or potential drugs. It can also do the reverse, identifying genes whose expression levels (or other characteristics) correlate strongly with structural features of a particular drug, or drug candidate. The SAT approach to pharmacogenomic analysis can shed light on molecular mechanisms and has the potential to accelerate the process of drug discovery in several ways: (i) it can be used to prioritize genes for follow-up studies as potential therapeutic targets; (ii) because the analysis projects genomic information to molecular substructure through the [S] matrix, it allows extraction of a preliminary structure–activity relationship (SAR) directly from the SAT correlations; (iii) the preliminary SAR can, in turn, be used for early pharmacophore development or to select new, untested drug candidates from an actual or virtual library of compounds; and (iv) it can be used to prioritize candidate compounds for detailed gene expression analysis or other biological studies.


Differential expression.

Identifying genes that are differentially expressed under two or more treatment conditions is a primary goal of most microarray studies. The two main issues in assessing differential expression are determining a method for assessing the extent of differential expression (e.g., fold change, t-test, ANOVA) and adjusting the method for the effects of multiple comparisons, since typically there are thousands of genes being studied. Differential expression is traditionally approached one gene at a time (e.g., fold change, t-test, ANOVA). One important point is the weakness of relying on fold change as the sole criterion, since fold change does not take into account the variability in the data. This can lead to two problems. First, genes with low expression levels yet large fold changes and high variability may be identified as differentially expressed. Second, genes that display small but reproducible (i.e., low variability) changes in gene expression may be missed. There have been some efforts to incorporate variability in methods that rely on fold change (41), but these still suffer from difficulties in assessing the error rates. Also, empirical Bayes methods that shrink individual estimates of variance toward a common value have been suggested for improving the behavior of t-statistics in the many gene settings (19). Recently, a number of high-dimensional methods have been proposed to use covariance structure to assist in identifying differentially expressed genes. These include elastic net (68), gradient-directed regularization (21), and multiple forward search (44). Shrunken centroid ordering by orthogonal projections (SCOOP) is a new method still under testing, with R code available from J. S. Verducci.
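The weakness of fold change as a sole criterion can be illustrated with a small sketch. The expression values below are hypothetical log2 intensities invented for illustration, and Welch's t statistic is used here only as one common way to account for variability; it is not a method prescribed by the text.

```python
import math
from statistics import mean, stdev


def log2_fold_change(control, treated):
    """Difference of mean log2 expression (treated minus control)."""
    return mean(treated) - mean(control)


def welch_t(control, treated):
    """Welch's t statistic: mean difference scaled by its standard error."""
    se = math.sqrt(stdev(control) ** 2 / len(control)
                   + stdev(treated) ** 2 / len(treated))
    return (mean(treated) - mean(control)) / se


# Hypothetical log2 expression values, 4 arrays per condition.
noisy_ctrl, noisy_trt = [2.0, 6.0, 1.0, 7.0], [8.0, 3.0, 9.0, 4.0]
steady_ctrl, steady_trt = [8.0, 8.1, 7.9, 8.0], [8.5, 8.6, 8.4, 8.5]

for name, c, t in [("noisy gene", noisy_ctrl, noisy_trt),
                   ("steady gene", steady_ctrl, steady_trt)]:
    print(name, "log2 fold change:", round(log2_fold_change(c, t), 2),
          "Welch t:", round(welch_t(c, t), 2))
```

The noisy gene has the larger fold change but a small t statistic, while the steady gene has a small but reproducible change and a large t statistic, which is the pair of problems described above.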

Multiple comparisons and false discovery rate.

The issue of multiple comparisons is more complex. Ideally, the probability of a false positive (a gene incorrectly identified as differentially expressed) should be small, and the probability of correctly identifying genes that are differentially expressed should be large. Standard statistical methods are set up to balance these goals in the context of only one comparison, i.e., if the microarray contained only one gene. Without adjustment, standard statistical methods give incorrect results in the context of microarray data. For example, consider a microarray study with m genes, and suppose none is differentially expressed. For various values of m, the probability that a standard statistical tool set to reject the null hypothesis if a P value is <0.05 will yield at least one false positive is given in Table 1. Because most microarrays contain thousands of genes, standard statistical methods are clearly unacceptable.

Table 1. Probability of at least one false positive increases rapidly as the no. m of hypotheses increases
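As a rough sketch of how such probabilities arise: if the m tests are assumed to be independent (an assumption made here for illustration, not stated in the text), the probability of at least one false positive at cutoff 0.05 is 1 - (1 - 0.05)^m. The values of m below are illustrative and are not necessarily those tabulated in Table 1.

```python
def prob_at_least_one_false_positive(m, alpha=0.05):
    """P(at least one false positive) for m independent true null hypotheses."""
    return 1.0 - (1.0 - alpha) ** m


# Illustrative numbers of hypotheses (assumed, not taken from Table 1).
for m in (1, 10, 100, 1000, 5000):
    print(m, prob_at_least_one_false_positive(m))
```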

The Bonferroni method is a simple method to correct for multiple testing that is still widely used in microarray data analysis (43). This method just divides the P value cutoff by the number of genes m. For example, if the probability of at least one false positive is to be limited to 0.01, and there are m = 5,000 genes on the array, the Bonferroni method would identify a gene as differentially expressed if its P value was <0.01/5,000 = 0.000002. Although this method is quite generally applicable, it is usually not a good choice for microarray studies because it has very low power, i.e., the probability of correctly identifying differentially expressed genes is very small, so many potentially interesting genes may be missed. For this and other reasons, different criteria than the probability of at least one false positive have been advocated. The most promising of these is the false discovery rate (FDR) (7, 65). FDR is the expected proportion of false positives among all rejected hypotheses. Instead of trying to avoid any false positives, the FDR controls the proportion of positive calls that are false positives. Designing procedures to control the FDR is challenging. The original technique of Benjamini and Hochberg (6), to control the FDR at level α, works as follows. First, P values are computed for each of the m genes, and the P values are ordered from smallest to largest. Second, the ordered P values are plotted vs. their rank along with the line with slope α/m and intercept zero. The last P value, say P*, that lies below the line is noted. This value (P*) is used to reject the hypotheses corresponding to all P values less than or equal to P*. The Benjamini-Hochberg procedure has been shown to control the FDR under certain assumptions on the dependence structure of the genes' expression levels (6). The procedure is in wide use and is recommended by the American Physiological Society (13). 
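The two procedures described above can be sketched in a few lines. This is a minimal illustration: the function names are hypothetical, and the Benjamini-Hochberg step is written as the usual rank comparison (P value at rank i against α·i/m) rather than as a plot.

```python
def bonferroni_reject(pvalues, alpha=0.01):
    """Reject a hypothesis when its P value is below alpha / m."""
    cutoff = alpha / len(pvalues)
    return [p < cutoff for p in pvalues]


def benjamini_hochberg_reject(pvalues, alpha=0.05):
    """Reject all hypotheses with P values up to the last one under the line alpha * rank / m."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices, smallest P first
    last_rank = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= alpha * rank / m:
            last_rank = rank  # keep the LAST rank that lies on or below the line
    reject = [False] * m
    for i in order[:last_rank]:
        reject[i] = True
    return reject


# Hypothetical P values for four genes.
pvals = [0.001, 0.03, 0.035, 0.9]
print(bonferroni_reject(pvals, alpha=0.05))
print(benjamini_hochberg_reject(pvals, alpha=0.05))
```

In this example the second P value is above its own line value but is still rejected by Benjamini-Hochberg, because a larger P value further down the ordering falls below the line.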
Unfortunately, there are many microarray studies not covered by the assumptions underlying the Benjamini-Hochberg algorithm. Thus there is much work in the statistical community aimed at developing a method of controlling the FDR that is more generally applicable than the original Benjamini-Hochberg method. A promising method that relies on the bootstrap technique has been recently analyzed (48, 60, 61). However, this method achieves the FDR asymptotically. Thus it is not suitable for studies involving small numbers (e.g., 4–5) of arrays.

Determining sample size needed to control FDR.

In planning an experiment, there are two major decisions to make about microarrays: 1) the total number of microarrays that should be used and 2) the proportion that will be used for biological vs. technical replication. The first decision is typically based on budget and the second on the reliability of the microarrays being used. The real question is whether a planned experiment has a realistic chance of detecting and identifying important biological processes. Recently, a decision theoretic procedure was introduced (46) where a typical loss function is a weighted sum of the FDR and its counterpart false negative rate (FNR). The idea is to plot the expected loss vs. sample size and judge whether a desired value can be achieved with a realistic sample size. The expected loss is estimated through simulating expression data and recording the behavior of the Benjamini-Hochberg method.
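The idea of estimating expected loss by simulation can be sketched as follows. This is not the procedure of reference (46); it is a simplified illustration under stated assumptions: unit-variance normal data summarized by a z-test, a fixed fraction of differentially expressed genes with a common effect size, FNR taken as the proportion of missed genes among non-rejected hypotheses, and hypothetical parameter names throughout.

```python
import math
import random


def two_sided_p(z):
    """Two-sided P value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def bh_reject(pvalues, alpha):
    """Benjamini-Hochberg step-up procedure; returns a list of booleans."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    last_rank = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= alpha * rank / m:
            last_rank = rank
    reject = [False] * m
    for i in order[:last_rank]:
        reject[i] = True
    return reject


def expected_loss(n_per_group, m=500, frac_de=0.1, effect=1.0,
                  alpha=0.05, weight=0.5, reps=20, seed=0):
    """Average of weight * FDP + (1 - weight) * FNP over simulated experiments."""
    rng = random.Random(seed)
    n_de = int(m * frac_de)              # the first n_de genes are truly changed
    se = math.sqrt(2.0 / n_per_group)    # SE of a difference of two group means
    total = 0.0
    for _ in range(reps):
        pvals = []
        for g in range(m):
            shift = effect if g < n_de else 0.0
            diff = rng.gauss(shift, se)  # simulated difference of group means
            pvals.append(two_sided_p(diff / se))
        reject = bh_reject(pvals, alpha)
        n_rej = sum(reject)
        false_pos = sum(reject[n_de:])
        false_neg = n_de - sum(reject[:n_de])
        fdp = false_pos / n_rej if n_rej else 0.0
        fnp = false_neg / (m - n_rej) if m - n_rej else 0.0
        total += weight * fdp + (1.0 - weight) * fnp
    return total / reps


# Estimated loss versus arrays per group.
for n in (2, 5, 10, 20, 40):
    print(n, round(expected_loss(n), 4))
```

Plotting the printed values against n gives the kind of curve described above, from which one can judge whether a target loss is reachable with a realistic number of arrays.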

ErmineJ: Tool for functional analysis of gene expression data sets

© 2005 Lee et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

It is common for the results of a microarray study to be analyzed in the context of biologically-motivated groups of genes such as pathways or Gene Ontology categories. The most common method for such analysis uses the hypergeometric distribution (or a related technique) to look for "over-representation" of groups among genes selected as being differentially expressed or otherwise of interest based on a gene-by-gene analysis. However, this method suffers from some limitations, and biologist-friendly tools that implement alternatives have not been reported.

We introduce ErmineJ, a multiplatform user-friendly stand-alone software tool for the analysis of functionally-relevant sets of genes in the context of microarray gene expression data. ErmineJ implements multiple algorithms for gene set analysis, including over-representation and resampling-based methods that focus on gene scores or correlation of gene expression profiles. In addition to a graphical user interface, ErmineJ has a command line interface and an application programming interface that can be used to automate analyses. The graphical user interface includes tools for creating and modifying gene sets, visualizing the Gene Ontology as a table or tree, and visualizing gene expression data. ErmineJ comes with a complete user manual, and is open-source software licensed under the Gnu Public License.

The availability of multiple analysis algorithms, together with a rich feature set and simple graphical interface, should make ErmineJ a useful addition to the biologist's informatics toolbox. ErmineJ is available from .

A difficulty experienced by many (if not all) users of gene expression microarrays is making sense of the complex results. After analyzing each gene in a data set, an experimenter is often left to the task of summarizing the results with little assistance. It is common for experimenters to ask questions at the level of molecular pathways or other functionally relevant groupings of genes. While "ad hoc" manual annotation of data sets is a common approach, there are numerous advantages to using a computational and statistical approach to analyze groups of genes.

The first version of ermineJ was made available in 2003. Recently we have completely revamped the user interface and updated the feature set, releasing ermineJ 2.0 in October 2004 and 2.1 in June 2005.

There are a number of parameters to set and decisions the user must make in order to run the software. The choice of analysis method is the most obvious, and each method has a few other settings that the user can choose to change. For example, for ORA analysis a threshold score must be defined. This is in contrast to most ORA software packages, which take as input a list of "genes of interest"; ermineJ instead takes as input all the gene scores for the experiment. This lets ermineJ avoid the problem of selecting the correct "null" gene set 3: it is defined strictly by the genes analyzed in the experiment but not meeting the user-defined score threshold.
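A thresholded over-representation test of this kind can be sketched with the hypergeometric distribution. This is an illustration of the general technique, not ermineJ's actual code; the function names are hypothetical, and it assumes that higher scores are "better" and that the background is every gene with a score.

```python
from math import comb


def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k) when n genes are drawn without replacement from N, of which K are in the set."""
    upper = min(K, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / comb(N, n)


def ora_pvalue(gene_scores, gene_set, threshold):
    """Over-representation P value for one gene set, given scores for ALL genes."""
    selected = {g for g, s in gene_scores.items() if s >= threshold}
    members = set(gene_set) & set(gene_scores)  # ignore genes not on the array
    overlap = len(selected & members)
    return hypergeom_upper_tail(overlap, len(gene_scores), len(members), len(selected))


# Hypothetical scores for ten genes; g0 has the highest score.
scores = {f"g{i}": 10 - i for i in range(10)}
print(ora_pvalue(scores, {"g0", "g1", "g2"}, threshold=8))
```

Because the non-selected genes come from the same score table, the "null" set is fixed by the experiment and the threshold, as described above.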

For GSR, the method used to compute the score for a gene set is a key parameter. The two options currently supported are the mean and the median. During the analysis, GSR uses the selected method to compute a summary of the gene scores for each resampled or real gene set, and this aggregate score is used to represent the gene set. Choosing the median will tend to yield slightly more conservative results, as individual genes with very high scores are not given as much weight as in the mean computation.
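The resampling idea can be sketched as below. This is a simplified illustration rather than ermineJ's implementation: it assumes higher scores are more interesting, draws random gene sets of the same size from all scored genes, and adds one to numerator and denominator so the estimated p-value is never zero (a choice made here, not taken from the text).

```python
import random
from statistics import mean, median


def gsr_pvalue(gene_scores, gene_set, summary=mean, trials=10000, seed=0):
    """Fraction of random same-size gene sets whose summary score is >= the observed one."""
    rng = random.Random(seed)
    all_scores = list(gene_scores.values())
    members = [g for g in gene_set if g in gene_scores]
    observed = summary([gene_scores[g] for g in members])
    hits = sum(
        summary(rng.sample(all_scores, len(members))) >= observed
        for _ in range(trials)
    )
    return (hits + 1) / (trials + 1)


# Hypothetical scores: gene g99 has the highest score.
scores = {f"g{i}": float(i) for i in range(100)}
top_set = {"g95", "g96", "g97", "g98", "g99"}
print(gsr_pvalue(scores, top_set, summary=mean, trials=2000))
print(gsr_pvalue(scores, top_set, summary=median, trials=2000))
```

Passing `summary=median` instead of `summary=mean` switches between the two options discussed above.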

Some settings are used for multiple methods. For example, when a gene is represented more than once in the data set, a decision has to be made as to how to treat these "replicates" (which might not be replicates per se but represent different transcripts). The options supported are to use the "best" score among the replicates to represent them as a group, to use the mean, or to treat them as separate entities. Use of the "best" option is somewhat anti-conservative, but is reasonable when most "replicates" are in fact assaying different biological entities. In contrast, treating replicates completely separately is not generally advised as it can lead to spurious positive findings in cases of true replicates, as the gene set gets "adulterated" with multiple copies of the same high-scoring gene. For this reason the last option is not available from the GUI, though it can be accessed from the other interfaces. Another important setting is the range of gene set sizes to analyze. Gene sets that are very small are unlikely to be very informative, because the goal of the analysis is to study genes in groups, while large gene sets may be too non-specific to provide useful information. In addition, analyzing too many gene sets reduces the power of the analysis due to multiple testing costs. In practice we often use ranges with a lower bound of 5 genes (for example, the range of 5 to 100 genes used in the performance tests described below).
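The "best" and "mean" replicate options can be sketched as a small helper. The function and its arguments are hypothetical, and the sketch assumes higher scores are better, so that "best" means the maximum.

```python
from collections import defaultdict
from statistics import mean


def collapse_replicates(probe_scores, probe_to_gene, mode="best"):
    """Reduce probe-level scores to one score per gene using 'best' or 'mean'."""
    if mode not in ("best", "mean"):
        raise ValueError("mode must be 'best' or 'mean'")
    by_gene = defaultdict(list)
    for probe, score in probe_scores.items():
        by_gene[probe_to_gene[probe]].append(score)
    pick = max if mode == "best" else mean
    return {gene: pick(scores) for gene, scores in by_gene.items()}


# Hypothetical probes: two map to GENE_A, one maps to GENE_B.
probe_scores = {"p1": 4.0, "p2": 1.0, "p3": 2.5}
probe_to_gene = {"p1": "GENE_A", "p2": "GENE_A", "p3": "GENE_B"}
print(collapse_replicates(probe_scores, probe_to_gene, mode="best"))
print(collapse_replicates(probe_scores, probe_to_gene, mode="mean"))
```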

In addition to the pre-defined gene sets as defined by the Gene Ontology, users are free to input their own gene sets. These are defined in simple text files that are placed in a directory that ermineJ checks at startup. These text files can be created "off-line" or within the ermineJ GUI. In addition, users can modify gene sets from within ermineJ. This functionality can be used to correct errors or omissions in the Gene Ontology annotations, though care must be exercised to avoid introducing biases into the results.

The ORA, GSR and ROC methods are closely related in that they are based on the gene-by-gene scores, with the goal of finding gene sets that are in some sense "enriched" in high-scoring genes (which typically might be "differentially expressed genes"). ORA is sometimes used to analyze genes which are selected by clustering, rather than a continuous score. In this situation, GSR and ROC are not appropriate. However, the correlation method is specifically designed to address this situation. GSR and ROC have the benefit of not requiring a threshold to divide genes into "selected" and "non-selected" genes. The choice of the threshold for ORA can have a substantial effect on the results obtained, because the "selected genes" change 4.
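To show how a threshold-free score can work, here is a sketch of an ROC-style measure: the area under the ROC curve, computed as the fraction of (member, non-member) gene pairs in which the member has the higher score. This illustrates the general idea only; the text does not specify how ermineJ computes its ROC score, and the function name is hypothetical.

```python
def gene_set_auc(gene_scores, gene_set):
    """Area under the ROC curve for ranking gene-set members above non-members."""
    members = [s for g, s in gene_scores.items() if g in gene_set]
    others = [s for g, s in gene_scores.items() if g not in gene_set]
    # Each pair counts 1 if the member scores higher, 0.5 for a tie, 0 otherwise.
    wins = sum((m > o) + 0.5 * (m == o) for m in members for o in others)
    return wins / (len(members) * len(others))


# Hypothetical scores: g9 has the highest score.
scores = {f"g{i}": float(i) for i in range(10)}
print(gene_set_auc(scores, {"g7", "g8", "g9"}))  # members all rank at the top
print(gene_set_auc(scores, {"g0", "g9"}))        # members split between the extremes
```

No cut between "selected" and "non-selected" genes is needed, since every gene contributes through its rank.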

Gene group correlation analysis (GCA) is based on the similarity of the expression profiles of genes in a gene set: loosely speaking, how well they "cluster together". Thus we propose that GCA can be used as an alternative to using ORA to analyze clusters. There are some differences to be noted between the typical application of ORA to clusters and the ermineJ correlation analysis. GCA is group-centric, not cluster-centric. Thus we ask whether the correlation among the members is higher than expected by chance, not whether a given set of correlated genes is enriched for the genes in the group; GCA does not involve clustering. This is not a trivial distinction, because while the highest scores will be obtained for gene groups that have uniform and high correlations among all the members, groups that have two or more "sub-clusters" can also obtain high scores. In the current implementation of GCA, the absolute value of the correlation is always used, so negative correlations contribute as much as positive ones. In future versions we may expose this as a user-settable option, as well as implement correlation metrics other than the current Pearson correlation.
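A group-centric correlation score of this kind can be sketched as the mean absolute pairwise Pearson correlation among the members of a gene set, compared with random gene sets of the same size. The aggregation by mean and the resampling details are assumptions made for illustration; the text does not give ermineJ's exact formula.

```python
import math
import random
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def gca_score(profiles, genes):
    """Mean absolute pairwise correlation among the given genes."""
    pairs = list(combinations(genes, 2))
    return sum(abs(pearson(profiles[a], profiles[b])) for a, b in pairs) / len(pairs)


def gca_pvalue(profiles, gene_set, trials=1000, seed=0):
    """Fraction of random same-size gene sets scoring at least as high as the real set."""
    rng = random.Random(seed)
    members = [g for g in gene_set if g in profiles]
    observed = gca_score(profiles, members)
    universe = list(profiles)
    hits = sum(
        gca_score(profiles, rng.sample(universe, len(members))) >= observed
        for _ in range(trials)
    )
    return (hits + 1) / (trials + 1)


# Hypothetical profiles: "c" is perfectly anti-correlated with "a" and "b".
demo = {"a": [1, 2, 3, 4, 5], "b": [2, 4, 6, 8, 10], "c": [5, 4, 3, 2, 1]}
print(gca_score(demo, ["a", "b", "c"]))
```

Because absolute values are used, the anti-correlated profile "c" raises the group score just as a positively correlated profile would.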

In all methods, for each gene set analyzed, ermineJ computes a score and, based on that score and the gene set's size, a p-value representing the "significance" of that gene set with respect to the null hypothesis. The definition of the raw score and the null hypothesis depends on the method being used. Note that the raw scores are of limited use because they cannot be evaluated in the absence of information about the gene set size. However, they can provide the user with a helpful indication of the strength of the result, not just its statistical significance.

Most users of ermineJ will access it through its graphical interface. The GUI of ermineJ was designed to be simple to use and provides "wizards" to guide users through common tasks such as running an analysis. Many settings made by the user during operation of the software are remembered between sessions, facilitating repeated analysis of the same data files and maintaining the user's preferred window sizes, for example. A complete manual is provided and is accessible via an on-line help function, as web pages on our web site, or in portable document format (PDF).

Some aspects of the ermineJ graphical user interface are illustrated in Figures 1, 2 and 3. The main panel of the software can be viewed either as a table of gene sets (Figure 1A) or in a hierarchical (tree) view (Figure 1B). These views are linked so changes in one are reflected in the other. To facilitate navigation of these displays, gene sets can be searched by the name of the gene set or by the names of genes they contain. User-defined gene sets are displayed in contrasting colors. Not shown in the figures is the initial startup screen in which the user chooses the gene annotation file to use for the session.

A: The main panel of ErmineJ after several analyses have been performed. Gene sets selected at low FDR levels are indicated in color. B: The tree-view panel of ErmineJ, illustrating the ability to browse gene sets in the GO hierarchy. The icons at each node have specific meanings. For example, the yellow "bull's-eye" icon indicates a gene set selected at an FDR of 0.05 or less. Purple diamonds indicate nodes that have "significant" sub-nodes.

A gene set details view. The controls at the top allow adjustment of the size and contrast of the heat map. The gene scores (in this case p-values) are shown in the second text column. The grey and blue graph, shown only for experiments using p-values, shows the expected (grey) and actual (blue) distribution of p-values in the gene set. This display is provided as an additional aid to evaluation of the results. The last two columns provide information about each gene. The targets of the hyperlinks are configurable by the user.

Examples of screens from ErmineJ Wizards. A: Analysis wizard. This illustrates options to set the range of gene set sizes to analyze, and the method of treating "replicates" of genes. See text for details of the latter. B: Gene set modification wizard. In this screen the user is selecting genes to delete from a gene set. The list of all probes available on the platform is shown in the left panel. A "find" function simplifies the location of genes and probes.

Double-clicking on a gene set in the main panel opens a new window that displays the genes in the gene set, along with the expression profiles in a "heat-map" view (if the user has provided the profile data; Figure 2). The appearance of the heat map is configurable through menus and toolbar controls. The data displayed in the table, as well as the image of the matrix, can be saved to disk using additional menu options. The hyperlinks to external web sites can be configured by the user to point to a web site of their choosing, again through a menu option. All of these capabilities are available even if the user has not performed any analysis, so ErmineJ can be used as a "gene set browser" as well as for analysis.

An important feature of the GUI is the capability to rapidly define and edit gene sets, which is accomplished in a "wizard" that takes the user through the process step-by-step. Alternatively, the user can simply populate the gene set directory with files they have obtained from other sources, for example created in bulk with a Python script or obtained from another user. As far as we know, no tool surveyed by 3 affords the user the ability to define or modify the categories. ErmineJ also allows the user to choose which of the GO aspects (Biological Process, etc.) to use in the analysis.

The GUI version of ermineJ can be installed on the user's computer or run via Java WebStart. The latter option simply involves clicking on a link in the user's web browser, and ensures that the users have the most up-to-date version of the software. The drawback of using WebStart is that the user must be connected to the internet to use the software. With a local installation, no internet connection is needed.

Running an analysis using the ErmineJ GUI involves using a "wizard" to set the parameters (Figure 3 ). The user is asked to choose an analysis method, select the data file to analyze, choose any user-defined gene sets to include in the analysis, and set the various parameters required for the particular analysis. All settings are documented via "tool tips" and in the manual.

Once an analysis is initiated, the user is informed of its progress via a status bar. An analysis can be cancelled at any time. On completion, the results are added to the tabular and tree views (Figure 1). Multiple results can be displayed simultaneously in the tabular view, allowing easy comparison of different runs. The tree view can display only a single analysis result set at a time, but offers a pull-down menu to select among the result sets to display. In the tree and tabular views, high-scoring (i.e., significant) gene sets are highlighted in color. The tree view uses a simple system of icons for each node to indicate whether a significant node is contained within a given higher level node. Finally, the results of an analysis can be saved to a tab-delimited file for use in other software or to be reloaded by ermineJ at a later time.

In addition to the GUI, ermineJ offers a command line interface (CLI) and a simple application programming interface (API). The CLI exposes some features of ermineJ that are not available in the GUI, such as different methods for multiple test correction. The CLI is suitable for scripting runs of ermineJ. For example, a simple Perl script can be used to automate runs of ermineJ with different settings or on different data sets. In contrast, the API was introduced to allow programmers to include the analyses available in ermineJ in their own software. The API currently provides more limited access to the functionality of the software than the command line version, but will be expanded in future versions.

We tested the performance of ermineJ using the HG-U133_Plus_2 Affymetrix array design. This is a particularly large array design with over 54,000 probe sets, and represents something of a worst-case scenario with respect to performance. With our current annotation set, 4844 different GO categories (gene sets) are available for analysis in this array design. We limited our analysis to gene sets with between 5 and 100 genes, leaving about 2700 gene sets. The times reported below are for analyzing the complete set of over 54,000 probe sets with respect to these 2700 gene sets on a 1.7 GHz Pentium laptop.

With this array, ermineJ has an initial startup phase that lasts 15–20 seconds, most of which is consumed by the time it takes for the gene annotation file to be read in and processed for analysis. The time for analysis once startup is completed depends on the method used. For ORA, a complete analysis is completed in 8 seconds (average of 3 runs; times are wall-clock seconds timed from within the software). While it is difficult to directly compare our benchmarks with previously published benchmarks, because the number of gene sets analyzed and the size of the "null" gene set were not reported and the times reported might in some cases include initial startup times 3, the fastest reported methods on the largest data sets tested completed ORA analyses in under 10 seconds. This indicates that ErmineJ is at least competitive with and possibly faster than the fastest previously reported tools.

GSR analysis took about 370 seconds if a full resampling is performed (100,000 resampling trials per gene set size in our tests). However, ermineJ implements an approximation, where limited resampling is used to estimate the parameters of a normal distribution. This normal is used to compute the p-values for each gene set. It also takes advantage of the fact that, especially for larger class sizes, the shape of the resampled distribution is very similar for similar class sizes, so not all of them need to be computed. In this mode the analysis takes approximately 80 seconds. ROC analysis, which does not involve resampling, took about 100 seconds. Correlation analysis is the most computationally intensive resampling method; even with the approximations enabled, it currently takes about 400 seconds to run on the test data set (which contained 12 microarrays). This is because computing correlations is computationally intensive compared to the methods which use pre-computed gene scores such as p-values.

ErmineJ is fairly memory-intensive, because it holds in memory a complex data structure describing the annotations, as well as the microarray data and information about the results for thousands of gene sets and tens of thousands of genes. For the large HG-U133_Plus_2 design, after startup ermineJ occupies approximately 85 Mb of RAM (determined using a Java heap profiler under Windows). After running the correlation analysis, this grew to 105 Mb, reflecting the loading of the complete expression profile set and the results. Therefore we recommend running ermineJ on machines that have at least 256 Mb of RAM.

At this writing, the current version of ermineJ is 2.1.6. New features planned for the software include expanding the API and allowing more flexible creation of user-defined gene sets, including allowing support of alternative nomenclatures such as the Plant Ontology 17 . We also plan to provide annotation files for more platforms and organisms.

ErmineJ is a fast, full-featured, user-friendly, multi-platform open source application for analysis of gene sets. It implements multiple algorithms for performing the analysis, and permits easy modification and creation of new gene sets. These features afford users considerable flexibility in testing different methods and parameters. Perhaps the greatest current limitation to its usability at this date is the availability of gene annotation files for non-Affymetrix array designs we have not encountered frequently. Users who wish to develop annotation files for their platform should contact us for assistance.

Availability and requirements

Project name: ErmineJ

Project home page:

Operating system(s): Platform independent

Programming language: Java

Other requirements: Java 1.4 or higher; 256 Mb RAM recommended.

License: GNU GPL, and LGPL for the helper library.

Any restrictions to use by non-academics: None

ORA: Over-representation analysis

GSR: Gene score resampling

ROC: Receiver operator characteristic

GCA: Gene group correlation analysis

GSEA: Gene Set Enrichment Analysis

GUI: Graphical User Interface

API: Application Programming Interface

CLI: Command Line Interface

PP was the project lead and chief architect of the software, and contributed to the source code. HKL, WB and KK all contributed to the source code.

We thank Shahmil Merchant and Edward Chen for contributions to an early version of ErmineJ, and William Noble for supporting the development of the methods, and Neil Segal for providing the microarray data used in the screen shots. We also thank testers and users who provided bug reports and suggestions for improvements.



Data preparation and integration

For the purposes of our analysis, we extracted expression data from many Affymetrix Human Genome U133 Plus 2.0 Array Chip samples, as follows:

Using a PHP script, Simple Omnibus Format in Text (SOFT) files for samples from GPL570 or alternative platforms in the GEO repository [15] were downloaded and parsed. The title and characteristics fields were searched for keywords such as “healthy”, “normal”, “tissue” or “control”, and the sample organism field for “Homo sapiens”. Additionally, all normal samples from the manually curated human clinical microarray database M²DB [35] were selected as follows: from the M²DB website, we selected only individual samples of the Human U133 Plus 2.0 platform, applying no further quality control filtering, and from these kept only the GEO samples whose “Disease State” clinical characteristic was “Normal”.

The annotations of all selected samples were manually read and only samples of healthy individuals or normal tissues adjacent to pathological ones were kept. Samples from cultured cell lines, pathological tissues or pharmacologically treated individuals were excluded. Each sample was manually classified according to the name of its tissue or organ.

After downloading the raw intensity files (CEL) of the chosen samples from GEO, a quality control was conducted using a PHP script which checked the raw files for errors in the probe intensity values. While the majority of files were in CEL version 3 ASCII format, a substantial number of them were in CEL version 4 binary format and had to be converted to the previous version format with apt-cel-converter, a program from Affymetrix Power Tools (apt-1.14.4) Software Package [36]. A PHP script parsed all ASCII CEL files and checked every intensity value of the 1164x1164 probes of each chip for being within the acceptable value range (0–65535). The script also concatenated all probe intensity values into a single string for each chip. This string was used as input for MD5 (RFC 1321), SHA-1 (RFC 3174) and CRC32 [37] hash algorithms and their outputs were concatenated as a single string, which then served as a characteristic signature of the probe intensities of each sample, in order to filter out duplicate GSMs. A list of unique GSMs was produced and a PHP script selected samples as evenly as possible, amongst all tissues/organs and GSE sample series.
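The duplicate filter described above can be sketched as follows. The original pipeline was written in PHP; this Python version is only an illustration, and the comma separator between intensity values, the function names and the toy GSM identifiers are assumptions.

```python
import hashlib
import zlib


def intensity_signature(intensities):
    """Concatenate MD5, SHA-1 and CRC32 digests of a chip's probe intensities."""
    # Assumption: values are joined with commas so that the string is unambiguous.
    payload = ",".join(str(v) for v in intensities).encode("ascii")
    md5 = hashlib.md5(payload).hexdigest()
    sha1 = hashlib.sha1(payload).hexdigest()
    crc = format(zlib.crc32(payload) & 0xFFFFFFFF, "08x")
    return md5 + sha1 + crc


def in_range(intensities, low=0, high=65535):
    """Check that every intensity lies within the acceptable value range."""
    return all(low <= v <= high for v in intensities)


def unique_samples(samples):
    """Keep the first GSM seen for each distinct intensity signature."""
    first_by_signature = {}
    for gsm, values in samples.items():
        first_by_signature.setdefault(intensity_signature(values), gsm)
    return sorted(first_by_signature.values())


# Hypothetical toy data: GSM2 duplicates GSM1.
samples = {"GSM1": [1.0, 2.5], "GSM2": [1.0, 2.5], "GSM3": [3.0, 2.5]}
kept = unique_samples(samples)
```

Combining three independent digests makes an accidental collision between different chips extremely unlikely, so two GSMs with the same signature can be treated as the same hybridisation.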

To generate a single value that reflects the amount of each transcript in solution which corresponds to a probe set, apt-mas5, the Affymetrix Power Tools implementation of MAS5.0 algorithm [38], was used with the default Affymetrix Chip Description File (CDF) (HG-U133_Plus_2.cdf). Apt-mas5 output files (CHP) were converted to ASCII with the use of apt-chp-to-txt converter from Affymetrix Power Tools suite. Then, the data were normalised to allow different samples to be comparable, as follows: AFFX-prefixed control probe sets were excluded from the analysis and the rest of the 54613 probe sets were trivially normalised using the Affymetrix standard procedure where all signal values were multiplied by a scaling factor which was calculated by removing the top and bottom 2% of signal values, then calculating a value that adjusts the mean of the remaining 96% to 500. Finally, all signal values were rounded to the nearest 0.5.
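The scaling step can be illustrated with a short sketch. It assumes the 2% trim is applied by count at each end of the sorted signal values; the function name and the toy input are illustrative, and Affymetrix's own implementation may handle edge cases differently.

```python
def scale_signals(signals, target=500.0, trim=0.02):
    """Scale signals so the trimmed mean equals `target`, then round to the nearest 0.5."""
    ordered = sorted(signals)
    k = int(len(ordered) * trim)            # number of values dropped at each end
    kept = ordered[k:len(ordered) - k]      # middle 96% of the signal values
    factor = target / (sum(kept) / len(kept))
    # Multiply every signal by the scaling factor and round to the nearest 0.5.
    return [round(s * factor * 2) / 2 for s in signals]


# Toy chip (hypothetical): signal values 1..100.
scaled = scale_signals([float(v) for v in range(1, 101)])
```

Because every chip is brought to the same trimmed mean, signal values from different samples become directly comparable before correlations are computed.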

Each probe set in our database was enriched with annotations collected from various data sources: Genomic data, such as HUGO Gene Nomenclature Committee gene symbols and descriptions, were collected from ENSEMBL [39], GO terms from the Gene Ontology Database [40], Enzyme Commission (EC) numbers and pathway information from KEGG [41], protein signature data from InterPro [42], genetic phenotypes from OMIM [43, 44] and predicted cis element information by combining TransFac [45] and ENSEMBL [39] data.

Promoter analysis

Regulatory sequences from 500 bps upstream of the Transcription Start Sites (TSSs) of all genes were collected from ENSEMBL [39] and were compared against the Transcription Factor Position-Weight Matrices (PWMs) from TransFac [45] using the MATCH algorithm [46] which is a weight matrix-based tool for searching putative transcription factor binding sites in DNA sequences. In our case, core and matrix similarity cut-offs were set to 0.95 and 0.90 respectively to increase stringency.

Statistical analysis

The Pearson correlation coefficient (r-value) between two probe sets is defined as the covariance of the two probe sets divided by the product of their standard deviations, and it is calculated as follows:

$$r_{x,y} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$

where $r_{x,y}$ is the Pearson correlation coefficient, $n$ is the number of microarray experiments, $x_i$ and $y_i$ are the signal intensities of probe sets x and y in the i-th experiment, and $\bar{x}$ and $\bar{y}$ are their means over the n experiments. r-values range between −1 and +1: positive r-values correspond to correlated probe sets, negative values to anti-correlated probe sets and values close to zero to non-correlated ones. A computationally efficient way to interpret Pearson correlation is to express it as the mean cross-product of the standardised variables [47]:

$$r_{x,y} = \frac{1}{n} \sum_{i=1}^{n} z_{x_i} z_{y_i}$$

where $z_{x_i}$ and $z_{y_i}$ are the standardised variables of the signal intensities of probe sets x and y in the i-th experiment (standardised with the population standard deviation, so that the two expressions for $r_{x,y}$ agree).
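The mean cross-product form can be sketched in a few lines. This is an illustration only: it assumes the variables are standardised with the population standard deviation (dividing by n), and the function names are hypothetical.

```python
import math


def zscores(values):
    """Standardise values using the population standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]


def pearson_r(x, y):
    """Pearson correlation as the mean cross-product of standardised variables."""
    zx, zy = zscores(x), zscores(y)
    return sum(a * b for a, b in zip(zx, zy)) / len(x)


r = pearson_r([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
```

The practical benefit is that each probe set is standardised once, after which every pairwise r-value is a single dot product, which matters when all pairs among tens of thousands of probe sets are compared.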

Assuming that the association between expression profiles is approximately linear, the statistic t, which under the null hypothesis (of no correlation) follows Student’s t-distribution with ν = n − 2 degrees of freedom, can be calculated as follows [48]:

$$t = r_{x,y} \sqrt{\frac{n - 2}{1 - r_{x,y}^2}}$$

Its two-sided significance level $p_{x,y}$ is given by Student’s Distribution Probability Function [49]:

$$p_{x,y} = I_{\frac{\nu}{\nu + t^2}}\left(\frac{\nu}{2}, \frac{1}{2}\right)$$

where $I_x(a, b)$ is the regularised incomplete beta function. To account for multiple sampling, p-values were Bonferroni corrected [50], as follows:

$$e_{x,y} = N \, p_{x,y}$$

where e-values are Bonferroni corrected p-values and N is the number of comparisons performed. The pairwise r- and e-values were stored in a MySQL database.
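The significance test for a single r-value can be sketched with the standard library only. The incomplete beta function is evaluated with the usual continued-fraction method; the function names, the iteration limits and the uncapped Bonferroni product are assumptions made for illustration, not the authors' code.

```python
import math


def _betacf(a, b, x, max_iter=200, eps=3e-14, tiny=1e-300):
    """Continued fraction for the incomplete beta function (modified Lentz method)."""
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        for aa in (m * (b - m) * x / ((qam + m2) * (a + m2)),
                   -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))):
            d = 1.0 + aa * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + aa / c
            c = c if abs(c) > tiny else tiny
            delta = d * c
            h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def betai(a, b, x):
    """Regularised incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    bt = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                  + a * math.log(x) + b * math.log(1.0 - x))
    if x < (a + 1.0) / (a + b + 2.0):
        return bt * _betacf(a, b, x) / a
    return 1.0 - bt * _betacf(b, a, 1.0 - x) / b


def correlation_pvalue(r, n):
    """Return (t, two-sided p) for a Pearson r computed over n experiments (|r| < 1)."""
    nu = n - 2
    t = r * math.sqrt(nu / (1.0 - r * r))
    return t, betai(nu / 2.0, 0.5, nu / (nu + t * t))


def bonferroni(p, n_tests):
    """Bonferroni-corrected value: p multiplied by the number of comparisons."""
    return p * n_tests


t_stat, p_value = correlation_pvalue(0.8, 5)
```

For r = 0.8 over only five experiments the p-value is about 0.10, which shows why a high r-value alone is not enough when the number of experiments is small.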

Clustering analysis

We created a symmetric correlation matrix R(x,y) between all probe sets stored in the database. The all-against-all correlation matrix has size m × m, where m = 54613 is the number of probe sets. We expressed the network as a distance matrix D(x,y), where each value is calculated as D(x,y) = 1 − R(x,y). The distance matrix was stored as a Phylip format file [51] and we applied the Neighbour Joining (NJ) algorithm [52] to cluster the data. The algorithm takes the Phylip file as input and constructs a rooted hierarchical tree in Newick format. The NJ algorithm is computationally efficient because of its polynomial-time complexity [53] and thus can be applied to very large data sets. We chose the Quick Join [54] implementation, which uses heuristics to speed up the NJ algorithm while still constructing the same tree as the original algorithm.
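Converting a correlation matrix into a Phylip distance file can be sketched as follows. The sketch assumes a square distance layout with labels padded to 10 characters, so longer probe set identifiers would first have to be mapped to short unique labels; the function name and the toy matrix are illustrative.

```python
def to_phylip(names, corr):
    """Render D = 1 - R as a square Phylip distance matrix."""
    lines = [f"{len(names):5d}"]                 # first line: number of taxa
    for name, row in zip(names, corr):
        distances = " ".join(f"{1.0 - r:.4f}" for r in row)
        # Phylip labels occupy 10 characters; longer names are truncated here.
        lines.append(f"{name[:10]:<10} {distances}")
    return "\n".join(lines) + "\n"


# Toy example (hypothetical labels and r-values).
names = ["p1", "p2", "p3"]
corr = [
    [1.0, 0.9, -0.5],
    [0.9, 1.0, 0.2],
    [-0.5, 0.2, 1.0],
]
phylip_text = to_phylip(names, corr)
```

With this distance, perfectly correlated probe sets are at distance 0 and perfectly anti-correlated ones at distance 2, so anti-correlated probe sets end up far apart in the tree.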


We set up a PHP-based web site for HGCA, which allows interactive searches for gene names, probe sets or annotation terms. The interface supports two complementary queries. Users interested in a particular probe set can retrieve: a) an r-value-ranked list of the most closely correlated probe sets, b) a tree-based list of the most closely clustered probe sets.

The web interface was kept as simple as possible so that navigation is easy and results can be interpreted by any experimentalist. A colour scheme helps the user read the results: lines coloured pink highlight the probe sets that belong to the query gene, whereas green lines highlight the probe sets co-expressed with the query gene. Grey lines indicate a co-expressed probe set whose gene has already been highlighted earlier in the list by another co-expressed probe set.

The tree-based clustering organises the probe sets most closely correlated to the driver gene in a tree hierarchy. The trees can be either visualised within the HTML web page or downloaded as Newick files to be visualised by external applications [55]. The interactive interface allows adjusting the height of the tree by enlarging or shrinking the neighbourhood of the co-expressed probe sets. A Java application was implemented that parses the Newick format produced by the NJ algorithm and exports the tree in HTML format. As in the first case, r-value-ranked lists of the most closely correlated probe sets are also produced according to the tree hierarchy.

Over-representation analysis

After a probe set list of the genes most correlated to a driver gene is produced by either method, users can view the annotations regarding Gene Names, Gene Descriptions, Biological Processes, Cellular Components, Molecular Functions, EC Numbers, OMIM entries, Pathways, InterPro, or TransFac data. To highlight the overrepresented annotation terms, users can also perform a text-based analysis. HGCA produces summary tables showing the most prominent overrepresented terms of the list, which are trimmed by applying a p-value cut-off of 0.05, where the statistical significance of term over-representation is a Benjamini-Hochberg [56] corrected p-value which is based on the Hypergeometric Distribution [57]:

$$p = \sum_{i=k}^{\min(c,\,m)} \frac{\binom{m}{i}\binom{n-m}{c-i}}{\binom{n}{c}}$$

where n is the total number of probe sets, m the total number of probe sets that contain the term, c the number of probe sets in the list and k the number of probe sets in the list that contain the term.
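The over-representation test can be sketched with exact integer arithmetic. The symbols n, m, c and k follow the definitions above; the function names and the step-up form of the Benjamini-Hochberg adjustment are illustrative assumptions rather than HGCA's actual code.

```python
from math import comb


def hypergeom_pvalue(n, m, c, k):
    """Probability of at least k term-carrying probe sets in a list of c drawn from n."""
    total = comb(n, c)
    upper = min(c, m)
    favourable = sum(comb(m, i) * comb(n - m, c - i) for i in range(k, upper + 1))
    return favourable / total


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    count = len(pvalues)
    order = sorted(range(count), key=lambda i: pvalues[i])
    adjusted = [0.0] * count
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(count, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * count / rank)
        adjusted[idx] = running_min
    return adjusted


p_term = hypergeom_pvalue(n=10, m=5, c=5, k=5)
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
significant = [q <= 0.05 for q in adjusted]
```

Terms would then be kept in the summary table only when their adjusted p-value is at or below the 0.05 cut-off.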

Inputs to GSEA.

Expression data set D with N genes and k samples.

Ranking procedure to produce Gene List L. Includes a correlation (or other ranking metric) and a phenotype or profile of interest C. We use only one probe per gene to prevent overestimation of the enrichment statistic (Supporting Text; see also Table 8, which is published as supporting information on the PNAS web site).

An exponent p to control the weight of the step.

Independently derived Gene Set S of N_H genes (e.g., a pathway, a cytogenetic band, or a GO category). In the analyses above, we used only gene sets with at least 15 members to focus on robust signals (78% of MSigDB) (Table 3).

Enrichment Score ES(S).

Evaluate the fraction of genes in S (“hits”) weighted by their correlation and the fraction of genes not in S (“misses”) present up to a given position i in L.

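$$P_{hit}(S, i) = \sum_{g_j \in S,\ j \le i} \frac{|r_j|^p}{N_R}, \quad \text{where } N_R = \sum_{g_j \in S} |r_j|^p$$

$$P_{miss}(S, i) = \sum_{g_j \notin S,\ j \le i} \frac{1}{N - N_H} \qquad [1]$$

Here g_j is the gene at position j of L and r_j is its correlation with C.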

The ES is the maximum deviation from zero of P_hit − P_miss. For a randomly distributed S, ES(S) will be relatively small, but if it is concentrated at the top or bottom of the list, or otherwise nonrandomly distributed, then ES(S) will be correspondingly high. When p = 0, ES(S) reduces to the standard Kolmogorov–Smirnov statistic; when p = 1, we are weighting the genes in S by their correlation with C normalized by the sum of the correlations over all of the genes in S. We set p = 1 for the examples in this paper. (See Fig. 7, which is published as supporting information on the PNAS web site.)
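The running-sum calculation of the enrichment score can be sketched as below. It assumes the gene list is already ranked and that N_H counts the members of S present in L; the function name and the toy genes are hypothetical, and this is an illustration rather than the GSEA reference implementation.

```python
def enrichment_score(ranked_genes, correlations, gene_set, p=1.0):
    """Maximum deviation from zero of P_hit - P_miss along the ranked list."""
    members = set(gene_set)
    # N_R: sum of |r_j|^p over the genes of the list that belong to S.
    n_r = sum(abs(r) ** p for g, r in zip(ranked_genes, correlations) if g in members)
    # N - N_H: number of genes in the list that are not in S.
    n_miss = sum(1 for g in ranked_genes if g not in members)
    p_hit = p_miss = 0.0
    best = 0.0
    for g, r in zip(ranked_genes, correlations):
        if g in members:
            p_hit += abs(r) ** p / n_r
        else:
            p_miss += 1.0 / n_miss
        if abs(p_hit - p_miss) > abs(best):
            best = p_hit - p_miss
    return best


# Toy ranked list (hypothetical genes), sorted by correlation with the phenotype.
genes = ["a", "b", "c", "d"]
corrs = [0.9, 0.5, -0.2, -0.8]
es_top = enrichment_score(genes, corrs, {"a", "b"})
es_bottom = enrichment_score(genes, corrs, {"c", "d"})
```

A set concentrated at the top of the list yields a score near +1 and a set concentrated at the bottom a score near −1, while a set spread evenly through the list stays close to zero.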

Estimating Significance. We assess the significance of an observed ES by comparing it with the set of scores ES_NULL computed with randomly assigned phenotypes.

Randomly assign the original phenotype labels to samples, reorder genes, and re-compute ES(S).

Repeat step 1 for 1,000 permutations, and create a histogram of the corresponding enrichment scores ES_NULL.

Estimate nominal P value for S from ES_NULL by using the positive or negative portion of the distribution corresponding to the sign of the observed ES(S).
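The three steps above can be sketched as follows. The callback `es_for_labels` is a hypothetical stand-in for "re-rank the genes under the permuted labels and recompute ES(S)"; it and the other names are assumptions for illustration only.

```python
import random


def null_enrichment_scores(labels, es_for_labels, permutations=1000, seed=0):
    """Recompute the enrichment score under randomly reassigned phenotype labels."""
    rng = random.Random(seed)
    shuffled = list(labels)
    scores = []
    for _ in range(permutations):
        rng.shuffle(shuffled)
        # es_for_labels (hypothetical) re-ranks genes for these labels and returns ES(S).
        scores.append(es_for_labels(list(shuffled)))
    return scores


def nominal_p_value(observed_es, null_scores):
    """Fraction of same-sign null scores at least as extreme as the observed ES."""
    if observed_es >= 0:
        same_sign = [s for s in null_scores if s >= 0]
        extreme = [s for s in same_sign if s >= observed_es]
    else:
        same_sign = [s for s in null_scores if s < 0]
        extreme = [s for s in same_sign if s <= observed_es]
    if not same_sign:
        return float("nan")   # no null scores share the sign of the observed ES
    return len(extreme) / len(same_sign)


p_positive = nominal_p_value(0.6, [0.1, 0.7, -0.3, 0.5, 0.9])
p_negative = nominal_p_value(-0.2, [0.1, 0.7, -0.3, 0.5, 0.9])
```

Using only the portion of the null distribution with the same sign as the observed score keeps positive and negative enrichment from being compared against each other.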

Multiple Hypothesis Testing.

Determine ES(S) for each gene set in the collection or database.

For each S and 1000 fixed permutations π of the phenotype labels, reorder the genes in L and determine ES(S, π).

Adjust for variation in gene set size. Normalize the ES(S, π) and the observed ES(S), separately rescaling the positive and negative scores by dividing by the mean of the ES(S, π) to yield the normalized scores NES(S, π) and NES(S) (see Supporting Text).

Compute FDR. Control the ratio of false positives to the total number of gene sets attaining a fixed level of significance separately for positive (negative) NES(S) and NES(S, π).

Create a histogram of all NES(S, π) over all S and π. Use this null distribution to compute an FDR q value, for a given NES(S) = NES* ≥ 0. The FDR is the ratio of the percentage of all (S, π) with NES(S, π) ≥ 0, whose NES(S, π) ≥ NES*, divided by the percentage of observed S with NES(S) ≥ 0, whose NES(S) ≥ NES*, and similarly if NES(S) = NES* ≤ 0.
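The normalisation and FDR steps can be sketched as below. The function names, the cap of the q value at 1 and the toy numbers are assumptions, and the sketch assumes NES* is one of the observed scores so that the denominator is never zero.

```python
def normalized_es(es, null_es):
    """Divide ES by the mean magnitude of the null scores that share its sign."""
    same_sign = [abs(s) for s in null_es if (s >= 0) == (es >= 0)]
    return es / (sum(same_sign) / len(same_sign))


def fdr_q_value(nes_star, observed_nes, null_nes):
    """Ratio of the null tail fraction to the observed tail fraction at NES*."""
    if nes_star >= 0:
        null_side = [v for v in null_nes if v >= 0]
        obs_side = [v for v in observed_nes if v >= 0]
        null_frac = sum(v >= nes_star for v in null_side) / len(null_side)
        obs_frac = sum(v >= nes_star for v in obs_side) / len(obs_side)
    else:
        null_side = [v for v in null_nes if v < 0]
        obs_side = [v for v in observed_nes if v < 0]
        null_frac = sum(v <= nes_star for v in null_side) / len(null_side)
        obs_frac = sum(v <= nes_star for v in obs_side) / len(obs_side)
    # Assumption: the estimate is capped at 1.
    return min(1.0, null_frac / obs_frac)


# Toy values (hypothetical): observed NES for four gene sets and pooled null NES.
observed = [2.0, 1.0, 0.5, -1.5]
pooled_null = [0.5, 1.0, 1.5, 2.5, -0.5, -1.0, 0.2, 0.8, 0.3, 1.2]
q_top = fdr_q_value(2.0, observed, pooled_null)
```

In this toy case one of eight positive null scores and one of three positive observed scores reach NES* = 2.0, giving a q value of 0.375.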

GenomeStudio Software

Visualize and analyze data generated on Illumina array platforms with GenomeStudio Software. This powerful solution supports the genotyping analysis of microarray data. Performance-optimized tools and a user-friendly graphical interface enable you to convert data into meaningful results quickly and easily.

GenomeStudio Software Modules

Genotyping Module

The graphical display of genotypes in GenomeStudio is a Genoplot, with data points color coded for the call (red = AA, purple = AB, blue = BB). Genotypes are called for each sample (dot) by their signal intensity (norm R) and Allele Frequency (Norm Theta) relative to canonical cluster positions (dark shading) for a given SNP marker.

The GenomeStudio Genotyping (GT) Module supports the analysis of Infinium and GoldenGate genotyping array data. This module enables efficient genotyping data normalization, genotype calling, clustering, data intensity analysis, loss of heterozygosity (LOH) calculation, and copy number variation (CNV) analysis. Fully integrated with the Infinium LIMS server, the GT Module allows you to access data and manage projects directly from within GenomeStudio.

As in all GenomeStudio modules, the GenomeStudio Framework displays data output in tabular form and enables you to visualize your results quickly and easily using the Illumina Genome Viewer and Illumina Chromosome Browser graphical tools.

GT Module Highlights
  • Analyze SNP and CNV data across 5 million markers
  • Estimate Log R ratio and B-allele frequency for copy number analysis
  • Call genotypes, normalize and cluster data, and generate SNP statistics
  • Export genotype data to various third party applications to access multiple CNV algorithms and copy number variation analysis tools
  • Generate a chromosomal heat map for examining copy number aberrations across the entire genome for multiple samples
  • Analyze data from two different product versions within the same project


Gene Expression Module
  • Analyze differentially expressed genes across different genomes
  • Profile miRNA expression
  • Combine mRNA and microRNA data in a single project

This GenomeStudio heat map dendrogram clusters rows (Target ID) and columns (Differential Scores). Using the Heat Map tools in the GenomeStudio Gene Expression Module enables easy visualization and analysis of large amounts of data.

The GenomeStudio Gene Expression (GX) Module supports the analysis of Direct Hyb and DASL expression array data. It enables the visualization of differential mRNA and microRNA expression analysis as line plots, histograms, dendrograms, box plots, heat maps, scatter plots, samples tables, and gene clustering diagrams. Simplified data management tools in GenomeStudio Software include hierarchical organization of samples, groups, group sets, and all associated project analysis.

As in all GenomeStudio modules, the GenomeStudio Framework displays data output in tabular form and enables you to visualize your results quickly and easily using the Illumina Genome Viewer and Illumina Chromosome Browser graphical tools.

GX Module Highlights
  • Analyze differential expression using gene-level statistical analysis tools
  • Visualize results as line plots, histograms, dendrograms, box plots, heat maps, scatter plots, samples tables, and gene clustering diagrams
  • Simplify data management for hierarchical organization of samples, groups, group sets, and project analysis
  • Identify fold-level changes, perform T-test and ANOVA, and compare results across different sample group sets
  • Combine and merge gene expression data with DNA methylation and miRNA profiling data within the same project
  • Export whole-genome expression and genotyping data to various third party tools for eQTL analysis


Methylation Module
  • Detect cytosine methylation at single-base resolution
  • Identify methylation signatures across the entire genome

The GenomeStudio Methylation (M) module supports the analysis of Infinium and GoldenGate methylation array data. This module calculates methylation levels (beta values) and analyzes differential methylation levels between experimental groups. It enables you to view CpG island methylation status across the genome with the Illumina Genome Browser and Illumina Chromosome Browser.

Single-site resolution data can be visualized as line plots, bar graphs, scatter plots, histograms, dendrograms, box plots, or heat maps. This module also enables you to combine methylation data with gene expression profiling experiments within the same GenomeStudio project for correlation between levels of methylated sites (beta values) and differential gene expression levels (p values).
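For readers unfamiliar with beta values, the sketch below uses the definition commonly given for Illumina methylation arrays: the methylated signal divided by the sum of the methylated and unmethylated signals plus a constant offset. The offset of 100 and the function name are assumptions here, and GenomeStudio's internal calculation may differ in detail.

```python
def beta_value(methylated, unmethylated, offset=100.0):
    """Methylation level between 0 (unmethylated) and just under 1 (fully methylated)."""
    m = max(methylated, 0.0)      # negative background-corrected signals are set to 0
    u = max(unmethylated, 0.0)
    return m / (m + u + offset)


half_methylated = beta_value(4950.0, 4950.0)
```

The offset stabilises the ratio when both signal intensities are low, at the cost of pulling values slightly towards zero.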