Using RNA-seq to compare gene expression across patients instead of between Control and Experimental conditions

I am working with RNA-seq data from The Cancer Genome Atlas (TCGA) and I have been reading about how people have compared gene expression levels measured by RNA-seq. Many of the papers I have read talk about "differential expression" for comparing each gene's expression levels in the Experimental and Control conditions.

In TCGA data, I typically have a patient cohort who have had the mRNA in their tumors sequenced just once so there is no Experimental-vs-Control dynamic. I am interested in finding which patients' tumors show gene-expression that is significantly higher than the rest of the cohort but I have not had any luck finding literature describing this kind of comparison. I'm thinking maybe I can apply existing differential expression techniques to my situation but that seems cumbersome and not-necessarily appropriate so thought I'd ask the community here if there's a better way of finding which members of a cohort are outliers for specific genes.

Also: all of my RNA-seq data has already been RPKM normalized for me. I have been advised that using RSEM instead would be better for comparing gene expression across multiple samples but, for logistical reasons, I'm probably stuck with my RPKM-normalized expression levels.

Fundamentally, I'm looking for the best way to compare gene expression across samples to determine which samples have outlying high/low gene expression. Intuitively, I figure I could just compute median z-scores for each gene's expression levels within my cohort and consider anyone with a |z-score| greater than 2 to be an "outlier" but I haven't found any literature to support this kind of approach either.
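To make the z-score idea concrete, here is a minimal sketch of a median-based (robust) z-score per gene. It assumes a genes x samples matrix of RPKM values and applies a log2 transform with a pseudocount of 1. The function names, the pseudocount, and the MAD scaling are illustrative choices and do not come from any published TCGA workflow.

```python
import numpy as np

def robust_z_scores(expr, pseudocount=1.0):
    """Per-gene robust z-scores for a genes x samples matrix of RPKM values.

    Uses the median and the median absolute deviation (MAD) instead of the
    mean and standard deviation, so a few extreme samples do not mask
    themselves by inflating the spread estimate.
    """
    log_expr = np.log2(np.asarray(expr, dtype=float) + pseudocount)
    med = np.median(log_expr, axis=1, keepdims=True)
    mad = np.median(np.abs(log_expr - med), axis=1, keepdims=True)
    # 1.4826 rescales the MAD so it estimates the standard deviation
    # when the data are normally distributed.
    scale = 1.4826 * mad
    with np.errstate(divide="ignore", invalid="ignore"):
        z = (log_expr - med) / scale
    # Genes whose MAD is zero have no usable spread estimate: mark as NaN.
    z[np.broadcast_to(scale == 0, z.shape)] = np.nan
    return z

def flag_outliers(expr, threshold=2.0):
    """Boolean genes x samples mask of |robust z| > threshold."""
    z = robust_z_scores(expr)
    with np.errstate(invalid="ignore"):
        return np.abs(z) > threshold  # NaN compares as False

# Toy example: gene 0 is much higher in the last sample, gene 1 is flat.
rpkm = [[10, 11, 9, 10, 12, 200],
        [5, 5, 5, 5, 5, 5]]
outliers = flag_outliers(rpkm)
```

With thousands of genes, a cutoff of 2 will flag many samples by chance alone, so a stricter threshold or some multiple-testing control is probably worth considering.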

Any suggestions, papers, or advice will be greatly appreciated.

When you say RPKM do you mean crude RPKM or the estimates that you get using expectation maximization methods like cufflinks and eXpress?

It is better if you get your RPKM or FPKM values from one of these programs because you can differentiate between transcript variants.

I have mostly used cufflinks and eXpress. The Cufflinks package is better for multiple datasets. You can use cuffquant (which takes SAM/BAM files) to compute FPKM. Cuffquant will also need a reference GTF file. Cuffquant gives a binary .cxb file which you cannot read directly. Once you have generated the .cxb files for all your cohort samples, pass all these files to cuffnorm. It will normalize the data and give you FPKM values for each gene in each sample in the form of a huge table.

Next point is which genes do you wish to compare. Do you want to compare known oncogenes which show consistent upregulation in all cancers? Actually there is a paper in which they have done this (I'll let you know the reference when I find it. Not able to recollect now).

You can then see how many of these genes show consistent expression in your cohort. Basically you need to identify a set of genes before you go on to study which patient shows anomalous expression.

Seems like you want to have a general approach for comparing gene expression signatures.

A recent paper, Clark et al., takes a geometric approach which is elegant. The idea is to do dimensionality reduction (singular value decomposition) of the expression data, and then calculate the cosine distance between the gene expression signatures in the reduced space.

If you apply this methodology you will be able to group together patients with very similar signatures (small distances) and identify outliers (larger distances). Moreover, based on the loadings from the singular value decomposition, you will be able to identify which genes are driving the differences in the measured distances, and thus identify 'relative differentially expressed genes'.
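As a rough illustration of that idea (not Clark et al.'s actual implementation), the sketch below centers each gene, takes a truncated SVD, and computes pairwise cosine distances between samples in the reduced space. The function name, the choice to center genes, and the default of two components are assumptions made for the example.

```python
import numpy as np

def cosine_distances_in_svd_space(expr, n_components=2):
    """Pairwise cosine distances between samples after truncated SVD.

    expr: genes x samples matrix (ideally log-transformed expression).
    Returns (dist, loadings): a samples x samples distance matrix and the
    genes x k matrix of gene loadings for the retained components.
    """
    x = np.asarray(expr, dtype=float)
    # Center each gene so the SVD captures variation around the cohort mean.
    centered = x - x.mean(axis=1, keepdims=True)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    k = min(n_components, len(s))
    # Sample coordinates in the reduced space (one row per sample).
    coords = vt[:k].T * s[:k]
    norms = np.linalg.norm(coords, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid dividing by zero for all-zero rows
    unit = coords / norms
    dist = 1.0 - unit @ unit.T
    return dist, u[:, :k]

# Toy example: samples 0 and 1 share one signature, samples 2 and 3 another.
toy = [[1.0, 1.1, 5.0, 5.2],
       [2.0, 2.1, 0.5, 0.4],
       [3.0, 3.0, 3.0, 3.0],
       [0.0, 0.1, 4.0, 4.1]]
dist, loadings = cosine_distances_in_svd_space(toy)
```

Genes with large absolute loadings on a component are the ones that contribute most to the distances measured along it.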

Generating RNA-seq Libraries from RNA

One of the most powerful methods of modern cellular biology is creating and analyzing RNA libraries via RNA-sequencing (RNA-seq). This technique, also called whole transcriptome shotgun sequencing, gives you a snapshot of the transcriptome in question, and can be used to examine alternatively spliced transcripts, post-transcriptional modifications, and changes in gene expression, amongst other applications.

Unlike microarrays, RNA-seq does not depend on prior knowledge of the genome sequence. Therefore, researchers avoid needing any preconceived notions about what to detect (via probes or primers), and this decreases overall bias. RNA-seq is a type of Next Generation Sequencing (NGS) that provides an overview of the transcriptome, which includes mRNA as well as:

  • alternatively spliced gene transcripts
  • post-transcriptional modifications
  • gene fusions
  • mutations
  • SNPs
  • small RNAs, such as snoRNA, miRNA, rRNA, and so on
  • ribosomal profiling
  • changes in gene expression over time in one culture
  • comparison of gene expression in control and experimental conditions.


Deconvolution refers to a process that separates a heterogeneous mixture signal into its constituent components. In the biomedical field, researchers have used deconvolution methods to derive cell type-specific signals [1,2,3] from heterogeneous mixture data. Cellular composition information is crucial for developing sophisticated diagnostic techniques because it enables researchers to track each cellular component’s contribution during disease progressions [4]. Although some experimental approaches like fluorescence-activated cell sorting (FACS), immunohistochemistry (IHC), and single-cell RNA-seq can derive cellular composition information [3], all these approaches are either restricted by their throughput or remain too costly and laborious for large-scale clinical applications. Deconvolution, which computationally decomposes mixture signals, provides a cost-effective way to derive cellular composition information and has the potential to bring considerable improvements in the speed and scale of relevant applications.

By January 2018, approximately 50 deconvolution methods had been developed [2]. While the speed of method development is increasing, researchers now face the new challenge of selecting appropriate methods for their analysis. In methodological papers, authors often use small benchmarks to illustrate the improvements of their methods. These benchmarks only contain a limited number of deconvolution methods and samples. Moreover, different research groups applied inconsistent testing frameworks with different simulation strategies, evaluation metrics, and cell type annotations, making it difficult for researchers to reach a solid conclusion on a method’s performance. Therefore, independent benchmarks are usually needed for rigorous and comprehensive comparisons [5]. Previously, Sturm et al. [3] and Cobos et al. [6] have generated independent benchmarks of reference-based and marker-based deconvolution methods on RNA-seq data. Focusing on spill-over effects, minimal detection fraction, and background predictions, Sturm et al. [3] suggested refining signature gene lists to improve deconvolution accuracy. On the other hand, Cobos et al. [6] focused on the impact of different normalization strategies, sequencing platforms of reference data, marker gene selection strategies, and missing cellular components in the reference. Compared with previous benchmarks, which mainly focused on the influence of reference profile and feature selection, our study focused on factors directly related to the mixture samples such as mixture noise level, quantification unit, cellular component number, weight matrix property, and unknown cellular contents. In addition to these factors, we also studied factors related to the testing framework construction, such as simulation model selection, evaluation metric selection, and measurement scale selection.

There are three types of benchmarking frameworks for the evaluation of deconvolution methods: in vivo framework [7], in vitro framework [8], and in silico framework [9, 10] (Additional file 2: Table S1). The in vivo testing framework mainly relies on indirect performance assessments and usually cannot derive a definite conclusion of the method’s performance. Only a few in vivo benchmarking datasets [3, 10] have coupled FACS results for direct performance assessments. Nevertheless, these benchmarking datasets only contain limited sample numbers and cannot provide a comprehensive performance assessment [3, 10]. The in vitro testing framework where mixtures are generated in the tube with predefined mixing compositions also suffers from limited sample numbers. Moreover, most benchmarks generated from the in vitro testing framework used “orthogonal” weights [8] during the mixing process, which would potentially result in over-optimistic conclusions. The in silico testing framework synthesizes heterogeneous mixture data by simulation. The primary goal of this study is to systematically investigate the impact of different biological and technical factors, where numerous finely tuned conditions need to be created [11]. A few biologically relevant cases cannot reveal the systematic biases since all technical and biological factors are confounded. Therefore, both in vivo and in vitro frameworks are not feasible for this type of systematic comparison due to the limitation in sample number and confounding factors. Careful consideration of these issues led us to select the in silico testing framework to systematically examine the impact of different biological and technical factors, which require large amounts of benchmarking datasets under controlled and finely tuned multi-factor testing environments (Fig. 1a and Table 1).
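The core of an in silico framework can be sketched as mixing cell type-specific reference profiles with known weights and then adding noise. The example below is a minimal illustration under stated assumptions: Dirichlet-distributed weights and multiplicative log-normal noise are choices made here for simplicity, and the function name and parameters are hypothetical rather than the simulation models used in the study.

```python
import numpy as np

def simulate_mixtures(reference, n_samples, noise_sd=0.1, seed=0):
    """Simulate bulk mixtures from a genes x cell_types reference matrix.

    Returns (mixtures, weights): a genes x samples expression matrix and the
    cell_types x samples matrix of true proportions (columns sum to 1).
    """
    rng = np.random.default_rng(seed)
    reference = np.asarray(reference, dtype=float)
    n_types = reference.shape[1]
    # Assumption: proportions drawn from a flat Dirichlet distribution.
    weights = rng.dirichlet(np.ones(n_types), size=n_samples).T
    clean = reference @ weights  # noise-free linear mixture
    # Assumption: multiplicative log-normal noise with log-scale sd noise_sd.
    noise = rng.lognormal(mean=0.0, sigma=noise_sd, size=clean.shape)
    return clean * noise, weights

# Toy reference: 4 genes, 3 cell types.
ref = [[100.0, 5.0, 1.0],
       [2.0, 80.0, 3.0],
       [1.0, 4.0, 60.0],
       [20.0, 20.0, 20.0]]
mixtures, true_weights = simulate_mixtures(ref, n_samples=5, noise_sd=0.2)
```

Raising `noise_sd` or changing how the weights are drawn gives the kind of finely tuned conditions that are hard to obtain from in vivo or in vitro data.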

Overview of three in silico testing frameworks. a Three benchmarking frameworks were constructed to investigate the impact of seven factors that affect deconvolution analysis: noise level, noise structure, other noise sources, quantification unit, unknown content, component number, and weight matrix. b Eleven deconvolution methods are tested and have been categorized based on the required reference input: marker-based, reference-based, and reference-free. c Performance of the methods is assessed through Pearson’s correlation coefficient (R) and mean absolute deviation (mAD). Evaluation results are illustrated by heatmaps and scatter plots. When unknown content is involved, we derive evaluation metrics in both relative and absolute measurement scales

To provide a reliable reference for the application and development of deconvolution methods, we compared 11 deconvolution methods (Fig. 1b) that cover three categories: marker-based, reference-based, and reference-free. To establish sophisticated benchmarking frameworks that mimic application scenarios of diverse biological systems, we designed three sets of benchmarking frameworks that simulated up to 1766 biological conditions with varying noise levels, library sizes, cellular component numbers, weight matrix properties, simulation models, and proportions of unknown contents (Fig. 1a and Additional file 2: Table S2). These simulated conditions will enable us to investigate the tipping point where each method deteriorates. To determine the impact of evaluation frameworks, we performed comparisons under different simulation models and measurement scales with two sets of evaluation metrics: correlation (Pearson’s correlation coefficient) and mean absolute deviation (mAD) (Fig. 1c, the “Methods” section). Moreover, we studied the impact of commonly applied simulation strategies, and by comparison to the real mixture data, we derived improved simulation strategies that can generate more complex and yet authentic in silico mixtures. Our results provide a dynamic testing landscape that allows the user to select the right method under the targeted experimental condition.
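The two evaluation metrics can be sketched as follows for a matrix of true proportions and a matrix of estimated proportions. This version pools all cell types and samples into a single comparison, which is an assumption; the excerpt does not say whether the metrics are computed per sample, per cell type, or pooled. The function names are illustrative.

```python
import numpy as np

def pearson_r(true_props, est_props):
    """Pearson correlation between true and estimated proportions (pooled)."""
    t = np.ravel(np.asarray(true_props, dtype=float))
    e = np.ravel(np.asarray(est_props, dtype=float))
    return float(np.corrcoef(t, e)[0, 1])

def mean_absolute_deviation(true_props, est_props):
    """Mean absolute difference between true and estimated proportions."""
    t = np.asarray(true_props, dtype=float)
    e = np.asarray(est_props, dtype=float)
    return float(np.mean(np.abs(t - e)))

# Toy example: 2 cell types x 2 samples.
true_props = [[0.2, 0.5],
              [0.8, 0.5]]
est_props = [[0.3, 0.4],
             [0.7, 0.6]]
r = pearson_r(true_props, est_props)
mad = mean_absolute_deviation(true_props, est_props)
```

The two metrics answer different questions: a method can rank proportions well (high R) while still being biased in absolute terms (high mAD).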


Identification of candidate reference genes by an unbiased integrative analysis of pooled cancer mRNA-Seq datasets

To identify reference genes with stable expression in serum exosomes, we interrogated RNA-Seq data from 47 serum exosome samples of patients with PAAD, CRC and HCC as well as of 32 healthy control individuals (HC) and applied DESeq2 to evaluate expression levels across samples. Only genes with high expression in both serum exosomes of cancer patients and of healthy individuals (measured as transcripts per million (TPM)) compared to the average gene expression level (pooled-transcriptome) were considered as potential reference candidates. Our analysis first identified 112, 117, and 85 stably expressed genes in serum exosomes of PAAD, CRC and HCC, respectively (p value > 0.1), by comparing patients with healthy control individuals using DESeq2 analysis. Then 48 genes were found to be universally stably expressed in serum exosomes of all cancers. By sorting these genes by their expression level, we identified ten highly expressed candidate reference genes: ADP-ribosylation factor 1 (ARF1), beta-2-microglobulin (B2M), H3 histone pseudogene 6 (H3F3AP4), integral membrane protein 2B (ITM2B), membrane palmitoylated protein 1 (MPP1), ornithine decarboxylase antizyme 1 (OAZ1), protein-L-isoaspartate (D-aspartate) O-methyltransferase domain containing 1 (PCMTD1), superoxide dismutase 2 (SOD2), small EDRK-rich factor 2 (SERF2), and WAS/WASL Interacting Protein Family Member 1 (WIPF1) (Fig. 1a, indicated by red dots, and Table 1). The diagonal scatterplot distribution of candidate reference genes indicates consistent expression abundance between exosomes of cancer patients and of healthy control individuals (Fig. 1a), with a correlation coefficient of R = 0.995. Furthermore, expression patterns of candidate reference genes identified by the pooled cancer analysis (including PAAD, CRC and HCC) were recapitulated in each cancer subtype as well (Fig. 1b-d).

Scatterplots of predicted candidate reference genes for serum exosomes using RNA-Seq data. Expression levels of candidate reference genes in serum exosomes are depicted for pooled cancer samples (PAAD, CRC, HCC) (a), for pancreatic adenocarcinoma (PAAD) (b), colorectal cancer (CRC) (c) and hepatocellular carcinoma (HCC) (d) samples and compared to serum exosomes of healthy control individuals. Expression values are shown as the logarithm of transcripts per million (TPM) (log2(TPM + 1)). Red dots represent candidate reference genes and grey dots genome-wide genes

Evaluation of expression levels and stability of candidate reference genes

To further validate our predicted candidate reference genes for exosomes, we compared their respective expression levels and stabilities with those of nine classical housekeeping genes: beta-actin (ACTB), beta-2-microglobulin (B2M), ribosomal protein L13A (RPL13A), tyrosine 3-monooxygenase/tryptophan 5-monooxygenase activation protein zeta (YWHAZ), glyceraldehyde-3-phosphate dehydrogenase (GAPDH), vimentin (VIM), peptidylprolyl isomerase A (PPIA), aldolase A (ALDOA), and ubiquitin C (UBC). Overall, the abundance of exosomal candidate reference genes (Fig. 2a) was similar to that of classical housekeeping genes (Fig. 2b). B2M had by far the highest overall expression abundance of all candidate reference genes (Fig. 2a), which was only surpassed by the classical housekeeping gene ACTB (Fig. 2b). We then assessed the expression stability across samples and tumor types by two measures: 1) the coefficient of variation “CV”, defined as the standard deviation divided by the mean of the expression levels (transcripts per million, TPM), and 2) the average expression stability “M” determined by the geNorm algorithm. “CV” values for the exosomal candidate reference genes (0.405 to 0.723) (Fig. 2c) were significantly lower than those for classical housekeeping genes (p = 8.10e-04, Wilcoxon rank-sum test) (Fig. 2d) with “M” values below 1.0, thus indicating higher expression stability across samples and tumor types (Fig. 2e). The “M” values were also significantly lower in candidate reference genes compared to those for classical housekeeping genes (p = 0.0279, Wilcoxon rank-sum test) (Fig. 2f). The candidate reference genes were then sorted according to their expression stability from highest to lowest, and both the “CV” and “M” criteria achieved similar ranks for most candidates. OAZ1 was identified as the gene with the highest expression stability across samples and tumor types (Table 1).
We also identified and validated ten candidate reference genes respectively for each cancer subtype including PAAD (FTL, OAZ1, FYB1, SERF2, SOD2, PCMTD1, ARPC2, NCOA4, HCLS1 and TYROBP), CRC (B2M, RPL41, SNCA, RPS9, BTF3, ADIPOR1, HEMGN, SOD2, PCMTD1 and NCOA4), and HCC (FTL, OAZ1, CD74, DDX5, PCMTD1, HCLS1, LSP1, RPL9, WIPF1 and H3F3AP4) as well (Suppl. Fig. 1).
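The two stability measures can be sketched as below. The “CV” follows the definition given in the text. The “M” value follows the usual geNorm definition: for each gene, the average over all other genes of the standard deviation across samples of the pairwise log2 expression ratio. Using the sample standard deviation (ddof=1) and requiring strictly positive input values are assumptions of this sketch, and the function names are illustrative.

```python
import numpy as np

def coefficient_of_variation(expr):
    """CV per gene: standard deviation divided by mean across samples."""
    x = np.asarray(expr, dtype=float)  # genes x samples, e.g. TPM
    return x.std(axis=1, ddof=1) / x.mean(axis=1)

def genorm_m(expr):
    """geNorm-style stability value M per gene (lower means more stable).

    expr: genes x samples matrix of strictly positive expression values.
    """
    log_x = np.log2(np.asarray(expr, dtype=float))
    n_genes = log_x.shape[0]
    m = np.empty(n_genes)
    for j in range(n_genes):
        others = np.delete(log_x, j, axis=0)
        # log2 ratios of gene j to every other gene, one row per partner gene
        log_ratios = log_x[j] - others
        # pairwise variation = sd of the log ratio across samples
        m[j] = log_ratios.std(axis=1, ddof=1).mean()
    return m

# Toy example: genes 0 and 1 vary proportionally, gene 2 is constant.
tpm = [[10.0, 20.0, 40.0],
       [5.0, 10.0, 20.0],
       [10.0, 10.0, 10.0]]
cv = coefficient_of_variation(tpm)
m_values = genorm_m(tpm)
```

The toy data show why the two measures can rank genes differently: gene 2 has a CV of zero, yet it gets the highest M because its ratio to the other two genes changes across samples.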

Gene expression levels and stability of candidate reference genes for exosomes predicted with RNA-Seq data. Expression levels of ten candidate genes sorted by their respective expression levels (a). Expression levels of ten candidate reference genes (blue bars) compared with those of nine commonly used housekeeping genes (green bars) (b). Expression stability of candidate reference genes as measured by the coefficient of variation (“CV”) (c). Comparison of “CV” values between candidate reference genes and classical housekeeping genes (p = 8.10e-04, Wilcoxon rank-sum test) (d). Expression stability of candidate reference genes as measured by the average expression stability value (“M”) (e). Comparison of “M” values between candidate reference genes and classical housekeeping genes (p = 0.0279, Wilcoxon rank-sum test) (f)

Validation of candidate reference genes in exosomes of an independent cohort of ovarian cancer patients

Based on the promising results from the pooled analysis of serum exosomes of patients with different tumour types, we expected our predicted candidate reference genes to be applicable to serum exosomes from patients with various other cancer types as well. Therefore, we next sought to validate the candidate reference genes in a “real-life setting” in samples of serum exosomes of ten ovarian cancer patients and of ten healthy control individuals. The qRT-PCR results showed that, as expected from the RNA-Seq data, B2M had the highest expression abundance among all candidates (Fig. 3a). Moreover, the absolute abundance of SOD2, H3F3AP4, OAZ1, and SERF2 was comparable to the expression level of 18S rRNA, whereas the abundance of the remaining five genes (ITM2B, ARF1, PCMTD1, WIPF1, MPP1) was lower (Fig. 3a). Interestingly, the abundance of the reference candidate genes in serum exosomes of healthy control individuals and of ovarian cancer patients was highly consistent (Fig. 3a). Most candidate genes also exhibited high expression stability in ovarian cancer and healthy control individuals with “M” and “CV” values lower than 1.0 (Fig. 3b-e), even though some variation occurred regarding the gene order between both stability indicators. Whereas MPP1, WIPF1, SOD2 and OAZ1 exhibited lower “CV” values in exosomes of healthy individuals (Fig. 3c), in both exosome groups, OAZ1 had the lowest “M” value (Fig. 3d-e). The “M” values for OAZ1, ITM2B, SERF2, MPP1, H3F3AP4, and ARF1 were advantageous over 18S rRNA, whereas WIPF1, B2M, SOD2 and PCMTD1 in part had clearly higher “M” values indicating reduced expression stability (Fig. 3d). The expression stability of 18S rRNA was lower (indicated by a higher “M” value) compared to many of the identified candidate reference genes, especially in exosomes of healthy control individuals (Fig. 3d-e).

Experimental validation of candidate reference genes in exosomes of patients with ovarian cancer and healthy control individuals. Expression levels (Ct values) of candidate reference genes in exosomes of ovarian cancer patients (red bars) and healthy control individuals (blue bars) relative to 18S rRNA (a). Expression stability of the candidate reference genes in serum exosomes of ovarian cancer patients (b) and healthy control individuals (c) as measured by the “CV” indicator. Expression stability of the candidate reference genes in serum exosomes of ovarian cancer patients (d) and healthy control individuals (e) as measured by the “M” indicator

To quantify gene expression levels more accurately, multiple reference genes can be used [46]. Therefore, we also determined the expression stability of respective combinations of candidate reference genes by determining the average gene-specific variation with the geNorm algorithm for RNA-Seq datasets in exosomes of the pooled cancer populations and for qRT-PCR data of exosomes from ovarian cancer patients. Overall, three combinations according to their expression stability ranks (Table 1) were evaluated: 1) genes 1–3 (OAZ1, SERF2, MPP1) 2) genes 4–6 (H3F3AP4, WIPF1, PCMTD1) and 3) genes 8–10 (SOD2, B2M, ITM2B). The first group with a combination of OAZ1, SERF2 and MPP1 had the lowest average gene-specific variations in exosomes of the pooled patient group including PAAD, HCC and CRC (RNA-Seq, Suppl. Fig. 2A) as well as in ovarian cancer patients (qRT-PCR, Suppl. Fig. 2B) indicating the highest expression stability.

Identification and validation of candidate reference miRNAs in cancer exosomes

In addition to mRNA, exosomes also contain miRNA. To identify reliable miRNAs for normalization in exosomes, we analyzed miRNA-Seq data of 72 serum exosome samples of patients with HCC, HNSCC, LCA, NBL, OVA, and THCA and 31 serum exosome samples of healthy control individuals. We identified six candidate reference miRNAs with high and stable expression: hsa-miR-125a-5p, hsa-miR-192-3p, hsa-miR-4468, hsa-miR-4469, hsa-miR-6731-5p, and hsa-miR-6835-3p (Fig. 4a). Expression levels and stability of the candidate reference miRNAs were evaluated in the exosomes of pooled cancer and further validated in the exosomes of ovarian cancer and healthy control individuals (Fig. 4b-j). Across the pooled exosomes of six cancer types, but also for each individual cancer type, these candidate miRNAs show high expression and similar abundance compared to exosomes of healthy control individuals (depicted as counts per million (CPM)) (Fig. 4b, Suppl. Fig. 3). Among all candidate miRNAs, hsa-miR-6835-3p had the highest expression level across samples and tumor types (Table 2). hsa-miR-4468 had the highest and hsa-miR-6731-5p the lowest expression stability across samples and cancer types, as indicated by low and high “CV” and “M” values, respectively (Fig. 4e, h). Overall, “M” values for all candidate miRNAs were low (< 1.5), indicating their general expression stability and potential utility as candidate reference miRNAs for exosomes. By integrating both stability indicators “CV” and “M”, candidate reference miRNAs were ranked and hsa-miR-4468 showed the highest overall expression stability across samples and tumor types (Table 2). Finally, hsa-miR-6835-3p with high expression level and stability was identified as the best reference miRNA.

Identification and validation of candidate reference miRNAs predicted in exosomes of ovarian cancer patients. Scatterplot of candidate reference miRNA expression levels in pooled cancer samples (HCC, HNSCC, LCA, NBL, OVA, and THCA) and healthy control individuals. Expression values are shown as the logarithm of counts per million (CPM) (log2(CPM + 1)). The red dots represent candidate reference miRNAs, grey dots genome-wide miRNAs (a). Expression levels of six candidate reference miRNAs in exosomes of pooled cancer (b), ovarian cancer patients (relative to ce-miR-39-1, n = 10) (c) and healthy control individuals (relative to ce-miR-39-1, n = 10) (d). Expression stability of candidate reference miRNAs in exosomes of pooled cancer (e), ovarian cancer patients (f) and healthy control individuals (g) as measured by the “CV”. Expression stability of six candidate reference miRNAs in exosomes of pooled cancer (h), ovarian cancer patients (i) and healthy control individuals (j) as measured by the “M” indicator

To further validate the predicted reference miRNA candidates, we measured their expression levels by qRT-PCR in serum exosomes of patients with ovarian cancer (n = 10) and of healthy control individuals (n = 10). miRNA abundance was calculated as cycle threshold numbers (Ct) relative to ce-miR-39-1, a frequently used miRNA for normalization (Fig. 4c-d). These results showed the highest expression for hsa-miR-4469 in exosomes of ovarian cancer patients even though all miRNAs were less abundant than ce-miR-39-1 (Fig. 4c-d). In exosomes of ovarian cancer patients, hsa-miR-4469 and hsa-miR-4468 were the miRNAs with the highest and lowest expression levels, reproducing the results for exosomes of healthy control individuals (Fig. 4c-d). Compared to the miRNA-Seq analysis (Fig. 4e, h), hsa-miR-6731-5p, hsa-miR-4468, hsa-miR-192-3p and hsa-miR-6835-3p exhibited lower “CV” and “M” values indicating even higher expression stability in a “real-life” setting (Fig. 4f, g, i, j). Overall, all candidate reference miRNAs in exosomes of ovarian cancer patients and healthy control individuals exhibited “M” values smaller than 1.5, indicating high expression stability (Fig. 4i-j).

Furthermore, the expression stability of combinations of multiple reference miRNAs was determined by the average gene-specific variation. We generated three combinations of two candidate reference miRNAs each according to their expression stability ranks (Table 2): 1) miRNAs 1–2 (hsa-miR-4468 and hsa-miR-6835-3p), 2) miRNAs 3–4 (hsa-miR-192-3p and hsa-miR-125a-5p), and 3) miRNAs 5–6 (hsa-miR-4469 and hsa-miR-6731-5p). The combination of hsa-miR-6835-3p and hsa-miR-4468 had the highest expression stability in exosomes of pooled groups of patients affected by PAAD, HCC and CRC (miRNA-Seq data, Suppl. Fig. 4A) or by ovarian cancer (qRT-PCR data, Suppl. Fig. 4B).


To our knowledge, we have presented the first statistical method to detect differences in scRNA-seq experiments that explicitly accounts for potential multi-modality of the distribution of expressed cells in each condition. Such multi-modal expression patterns are pervasive in scRNA-seq data and are of great interest, since they represent biological heterogeneity within otherwise homogeneous cell populations; differences across conditions imply differential regulation or response in the two groups. We have introduced a set of five interesting patterns to summarize the key features that can differ between two conditions. Using simulation studies, we have shown that our method has comparable performance to existing methods when differences (mean shifts) exist between unimodal distributions across conditions, and it outperforms existing approaches when there are more complex differences.


We report the largest systematic transcriptome study of MDD yet conducted. Gene expression differences between MDD cases and controls were found to be numerous but tend to be small. We found evidence that gene expression associated with current MDD may implicate many genes or gene networks but several with effect sizes too small to be detected by our sample size. This is supported by the enrichment of the MDD associated genes with natural killer (NK) cell and IL-6 pathways, and by the network analysis that identified two clusters of genes that were significantly associated with current MDD. Interestingly, we found DVL3 to be upregulated in current MDD, and note that DVL3 had the strongest (but not genome-wide significant) associations in a mega GWAS for MDD. 5 The strongest effects were found in the comparison between current MDD and controls, and not in the comparison between the control and remitted MDD groups, which indicates that MDD state effects in gene expression are stronger than trait effects.

Down-regulation of NK cell, upregulation of IL-6 pathways

A bidirectional reinforcing communication between the brain and the immune system has been proposed as part of the etiology of MDD. 16, 37, 38 The most robust markers for immune suppression and immune activation in MDD are decreased NK cell numbers and cytotoxicity (NKCC) 39, 40 and elevated IL-6, tumor necrosis factor-α (TNF-α) and CRP protein levels. 19, 41, 42 Meta-analyses of longitudinal studies suggest a causal pathway from inflammation to MDD. 33, 43 Small gene expression studies (N<100) have shown an increased expression of IL-6 and TNF-α in MDD. 44, 45 We did not observe differential expression of IL-6, CRP or TNF-α in MDD, which may be due to low expression of these genes in blood. In addition, CRP is synthesized in the liver, and whole-blood CRP gene expression probably does not reflect blood CRP protein levels. However, the genes upregulated in MDD include one of the receptors of TNF-α (TNFRSF10C), MAPK14, the IL-6 receptor and STAT3, and are therefore significantly enriched with genes in the PID IL-6 signaling pathway. Network analysis shows that the latter two genes are part of a large network (447 genes) significantly upregulated in MDD, which is enriched with 19 genes from the REACTOME pathway 'Signaling by interleukins', and 11 genes from the PID IL-6 signaling pathway. Thus, in line with the known increased levels of some inflammatory blood protein markers in MDD, we identified the upregulation of a large gene expression network enriched with inflammatory interleukin genes. This suggests that IL-6, CRP and TNF-α may be part of a larger protein network upregulated in MDD, which may not have been identified due to the limited capacity of measuring large amounts of proteins in MDD cases.

NK cells are a subset of lymphocytes that are important in innate immunity, defending against viruses and tumors. Serotonin increases NKCC 46 and long-term treatment with drugs affecting serotonergic mechanisms (selective serotonin reuptake inhibitors) augments NKCC 47 and may protect NK cells from oxidative damage. 48, 49 NKCC increases with the resolution of depressive symptoms, 50 which suggests a state rather than a trait effect. To our knowledge, this is the first report of differential NKCC-related gene expression in MDD cases versus controls. These findings may reflect differences in NK cell counts, NKCC or other NK cell function. Six genes involved in NKCC were downregulated in MDD (GZMB, KLRK1, PRF1, SH2D1B, KLRD1, NFATC2) and are part of a larger downregulated cluster (64 genes) that contains nine genes associated with NKCC. Thus, our study confirms the immune suppression and immune activation hypothesis of MDD. This hypothesis was mainly based on observations at the protein and cell level; here we show that this hypothesis extends to the transcriptome level, most likely involving large clusters of genes. Dysregulation of these genes may be the cause or the consequence of MDD status, or be part of a bidirectional relationship between immune system and MDD. 41 The immune system genes we identified may serve as targets to further investigate this immune-MDD relationship.

DVL3 associated with MDD in genomic and transcriptomic studies

Gene-based comparisons of gene expression results to the largest published GWAS for MDD 5 to date revealed a suggestive overlap with DVL3 (rs1969253 GWAS P=4.8 × 10⁻⁶). DVL3 is a part of the Wnt (Wingless-related integration site) signaling pathway, which is crucial in the regulation of hippocampal neurogenesis. 51 The role of the Wnt signaling pathway in mood disorders was recently reviewed. 52, 53 DVL3 transcripts have been found to be decreased in individuals with MDD in the nucleus accumbens 54 and frontal regions 55 and are upregulated in leukocytes of individuals reporting social isolation. 56 Although the GWAS finding for DVL3 SNPs did not reach genome-wide significance, the parallel findings in expression studies do suggest a tentative role for involvement of DVL3 in the etiology of MDD. Although rs1969253 is not associated with DVL3 expression in whole blood, 8 it is associated with DVL3 average brain expression 57 (P=0.0028, N=134) and hippocampal expression 57 (P=0.009, N=122). We may thus hypothesize that rs1969253 is influencing MDD status via intermediate DVL3 brain expression and the Wnt pathway. From this point of view our findings in blood may be caused by correlated blood and brain gene expression. The parallel discovery of DVL3 in GWAS and our gene expression study shows how, in addition to GWAS, the exploration of additional molecular layers can contribute to understanding the genetics of MDD, similar to what was reported earlier in methylation 58 and RNA-seq 22 studies.

Meta-analysis of gene expression associations with MDD

Comparison of our results with the recent RNA-seq study in MDD 22 revealed 12 genes with consistent effects. We note in particular the findings for SRSF5, CALM1 and NMUR1. Serine/arginine-rich splicing factor 5 (SRSF5) was highlighted in a recent RNA-seq study in bipolar disorder 59 and is involved in circadian regulation. Calmodulin 1 (CALM1) has a role in neurotransmission, and calmodulin-related gene expression is altered in the lateral habenula 60 and frontal cortex 61, 62 of MDD cases. NMUR1 is a receptor of Neuromedin-U, a neuropeptide associated with anxiety and depression-like behavior in mice. 63, 64 Follow-up work will be needed to further confirm the role of these genes in MDD.

Longitudinal analyses indicate reversibility of MDD effects in gene expression

For the 129 genes identified in cross-sectional analyses, a substantial proportion (15%) showed significant reversibility in expression after 2 years in the MDD group who remitted from their earlier episode. However, for several other genes we could not demonstrate such a change over time. This could be because the power of our longitudinal analyses was limited by the rather small sample size. In addition, gene expression patterns may not change to a very large extent, as remitted patients still experience more (subthreshold) symptoms than healthy controls. Also, MDD trait effects in gene expression will not reverse at all. The fact that not all baseline gene expression differences may revert when an MDD patient remits is in line with the cross-sectional analyses, in which the remitted MDD group had gene expression intermediate between the control and current MDD groups for most genes associated with current MDD. These findings, together with the lack of DNA variants underlying the gene expression–MDD associations, indicate that MDD state effects in gene expression are more prominent than trait effects, as suggested earlier. 21 Our results indicate that gene expression studies seem most informative if conducted in currently symptomatic patients. An important implication of the observed reversibility of gene expression profiles is that it points to potential modification by interventions. Consequently, it seems worthwhile for future studies to examine gene expression patterns as outcome parameters and to examine the reversibility of gene expression patterns after remission.

Strengths and limitations

We report the largest study of gene expression in MDD yet conducted, controlling for various potential confounders and demographic covariates. The distinction between remitted and current MDD groups allowed us to distinguish state and trait effects. Longitudinal analyses enabled us to confirm some of the observed state effects, by indicating that for several identified genes, expression patterns indeed reversed over 2 years in the group of remitted MDD patients. Despite the large sample size, our findings were not highly significant, especially in longitudinal analyses. Also, due to the strong correlation between MDD status and antidepressant use, confounding effects of antidepressants may be present, although our analysis suggests that antidepressant use has no additional impact on gene expression and does not perturb associations between gene expression and MDD. Although meta-analysis with a recent RNA-seq study identified several significant genes, most genes were only significant in one study, and findings from other studies in MDD cases were not replicated. Larger sample sizes are needed in the future to confirm our findings. Nevertheless, we identified multiple associations between gene expression (clusters) and MDD status. Furthermore, as the results of pathway analysis of the MDD-related genes and gene clusters are in line with the immune suppression and immune activation hypothesis of MDD, the identified immune genes and immune gene clusters should be targets for future research.

Time Series Expression Analyses Using RNA-seq: A Statistical Approach

RNA-seq is becoming the de facto standard approach for transcriptome analysis with ever-reducing cost. It has considerable advantages over conventional technologies (microarrays) because it allows for direct identification and quantification of transcripts. Many time series RNA-seq datasets have been collected to study the dynamic regulations of transcripts. However, statistically rigorous and computationally efficient methods are needed to explore the time-dependent changes of gene expression in biological systems. These methods should explicitly account for the dependencies of expression patterns across time points. Here, we discuss several methods that can be applied to model timecourse RNA-seq data, including statistical evolutionary trajectory index (SETI), autoregressive time-lagged regression (AR(1)), and hidden Markov model (HMM) approaches. We use three real datasets and simulation studies to demonstrate the utility of these dynamic methods in temporal analysis.

1. Introduction

RNA-sequencing (RNA-seq) has become the method of choice for transcriptome studies [1–6]. As with earlier technologies, from microarrays to the first digital sequencing approach, SAGE [7], a significant hurdle in transcriptome analysis is the limited number of samples, specifically when identifying temporal patterns of gene expression measured at a series of discrete time points. Several data-mining techniques and statistical methodologies have proven useful for finding temporal gene expression patterns in microarrays [3, 8–34], and some groups have already started to adapt these microarray approaches to RNA-seq data. The main drawback, however, is that such adaptations lose the discrete nature of read counts at the transcript level, although counts offer no additional advantages from an analytical standpoint. Given an experimental design with sufficient replicates, time points, and sequencing depth [4, 5, 34], RNA-seq-specific methodologies that preserve the count property in time course data will contribute to future development and application in this area. The last four years have seen a remarkable number of statistical methodology studies on identifying differential expression between two or among multiple groups. Nonetheless, most analysis tools remain tied to a static model that ignores time, even though ultrahigh-throughput sequencing now provides time series gene expression profiles. As a first step towards understanding temporal dynamics in RNA-seq data, temporal analyses often rely on simple pairwise comparisons [35–47] to infer differentially expressed genes/isoforms at a specific time point versus a reference time point. Differential expression results are then combined to characterize the dynamics over time.
Commonly used microarray data analysis methods, such as limma [40], log linear models [39], and ANOVA [26], applied after a variance-stabilizing transformation, have also been used as an alternative for temporal data analysis in RNA-seq. However, the very few replicates available for such data limit the power of these methods. Statistical inference from such high-dimensional data, with a large number of variables and very few observations, presents a substantial challenge. More importantly, the pairwise approaches fail to account for the strong temporal dependencies; indeed, higher correlation between neighboring time points is clearly revealed in published gene expression profiles [48] and in our real data applications (see Figure 2). These pairwise approaches are therefore suboptimal: they do not explicitly model the expression dynamics over time, nor can they identify the time points that contribute most to the trajectory of a gene’s expression. More descriptive methods, such as clustering methods, have also been applied to identify coexpressed gene sets using RNA-seq data [49–51]. Such unsupervised clustering methods implicitly assume that data collected at different time points are independent, ignoring the sequential structure in time series data. Potentially useful information on gene regulation and dynamics may be lost with these suboptimal methods, and there is a need for statistical methods that can appropriately model and analyze time series RNA-seq data. In this paper, we discuss several methods that explicitly model the time-dependent nature of time series data. We describe the identification of temporal differential expression (TDE) as well as the ranking of genes by the statistical significance of their temporal trajectories. We also discuss the application of time-lagged autoregressive AR(1) models to identify TDE genes, as well as hidden Markov models (HMM) to classify different expression patterns by the posterior probabilities of latent states. These methods can be applied to complex factorial designs that interrogate multiple biological conditions simultaneously, where multiple time points are studied under two or more biological conditions. Multivariate approaches are presented to identify temporal patterns in coexpressed gene groups and to quantify the coupled relationship of two distinct trajectories. Here we report an in-depth analysis of temporal patterns based on nonparametric and Bayesian approaches that incorporate the inherent time dependence of gene expression. When these methods were applied to published real datasets, both static and dynamic methods performed well for most temporal genes; however, the dynamic methods had a slight edge at low and moderate expression levels. This may be particularly advantageous in years to come for data with relatively low signals, such as depression and aging data, whose effects on expression are weak compared with those of tumors in disease tissues.

2. Statistical Methods

2.1. Time Series Data Structure

Suppose that a gene expression profile matrix contains measurements for a set of genes at a number of different time stages. The expression profile vector of each gene corresponds to a sequence of time points with biological replicates within each time point; that is, it is a concatenation of sub-vectors, each composed of the expression measurements from the biological replicates at one time point. We consider a sequence of observations on the gene expression profile dataset made at different time points; accordingly, the gene expression vector of each gene, with observed read counts over time, is used hereafter.

In a factorial time course experiment, the expression profile is further indexed by biological condition: for each gene and time point, a vector represents the expression measurements of each biological condition within that time point, with entries for the individual biological replicates. If there are no treated biological conditions, the gene expression time series simplifies to the single-condition form above.

2.2. Statistical Evolutionary Trajectory Index (SETI)

Existing static methods for testing the significance of TDE genes in time series RNA-seq data do not consider the temporal ordering and dependency that distinguish time series from typical gene expression profile data; instead, all static methods assume that samples are independently distributed and unrelated to each other. However, it is well known that a considerable number of genes related to developmental processes or disease progression are temporally differentially expressed, and that the current expression level is affected by the previous one through the inherent Markovian property of time series. In settings with large numbers of variables and few observations, distribution-free approaches, or Bayesian approaches that use informative prior information, are more suitable for RNA-seq. To circumvent these limitations and cope with the variety of patterns in time course data, we present a statistical framework that enables more precise temporal expression profiling by incorporating an autocorrelation measurement to determine the relationship between consecutive expression profiles. Residuals in one period (or time point) are correlated with those in previous periods (or time points), and individual SETIs are ranked based on a nonparametric regression fit in a gene-by-gene approach. As above, the gene expression level at each time point, biological condition, and replicate is fitted by smoothing spline regression. The autocorrelations of the residuals are computed by sliding over all possible lags of the original time series, and are referred to as the trajectory index for a given gene. An unbiased estimate of the autocorrelation is computed for each gene from the vector of its expression measurements over the observed time series.

P values for assessing statistical significance are calculated using a permutation test, assuming the absence of temporal differential expression. The confidence interval and trimmed mean of the trajectory index are derived by bootstrap analysis. The method is based on computing autocorrelations, that is, the cross-correlation of a gene expression profile with itself across time points, to represent the temporal pattern. It can be applied to a variety of RNA-seq time series data, including factorial time course experiments.
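As a rough illustration of the trajectory index idea, the sketch below computes a lagged autocorrelation of a residual series and a permutation P value. It assumes the residuals from the spline fit are already available. The function names `autocorrelation` and `permutation_p_value`, the averaging over n − lag terms, and the default of 999 permutations are illustrative assumptions, not the authors' implementation.

```python
import random


def autocorrelation(residuals, lag=1):
    """Lagged autocorrelation, averaging the lagged products over n - lag terms."""
    n = len(residuals)
    if not 0 <= lag < n:
        raise ValueError("lag must satisfy 0 <= lag < len(residuals)")
    mean = sum(residuals) / n
    centered = [r - mean for r in residuals]
    variance = sum(c * c for c in centered) / n
    if variance == 0:
        raise ValueError("residuals are constant; autocorrelation is undefined")
    lagged = sum(centered[t] * centered[t + lag] for t in range(n - lag))
    return lagged / ((n - lag) * variance)


def permutation_p_value(residuals, lag=1, n_perm=999, seed=0):
    """Two-sided permutation P value under the null of no temporal ordering."""
    rng = random.Random(seed)
    observed = abs(autocorrelation(residuals, lag))
    shuffled = list(residuals)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # destroys any time ordering
        if abs(autocorrelation(shuffled, lag)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)
```

Shuffling the time order is what encodes the null hypothesis here: if a gene has no temporal structure, its observed autocorrelation should look like one drawn from the shuffled series.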

2.3. Autoregressive Time-Lagged AR(1) Model

We propose to use an autoregressive time-lagged AR(1) model for the identification of temporal differential gene expression. Hay and Pettitt [54] demonstrated a first-order time-lag model in an application to infectious disease control with count data over time. In that work, the time series observations were examined to identify significant associations between counts and explanatory variables: the counts were the incidence of ESBL-producing Klebsiella pneumoniae in an Australian hospital, and the explanatory variable was the number of grams of third-generation cephalosporin antibiotics used over the same period. To propose a general dynamic method based on the AR(1) model for RNA-seq, we consider models without covariates, which allows flexibility, rather than adopting their initial approach. The details of our AR(1) model for gene-by-gene TDE identification from read count expression profiles over time are discussed below. In the Bayesian framework, the read counts are assumed to be independently distributed under a Poisson model. We employ their model for RNA-seq read count expression data.

2.3.1. Poisson Model in AR(1)

From the time series data structure (Section 2.1), we have time points, biological conditions, and replicates. Both the maize and zebrafish data, which have single measurements within a time point, are analyzed with this method.

To identify altered gene expression across the time series, the AR(1) model is applied to each gene. Inference on the model parameters is obtained from noninformative priors, and time series random effects are assumed for the sequential expression profile. This autoregressive model was originally developed for large-scale longitudinal repeated-measurement data. In our study, using modified assumptions, RNA-seq time series with a short time span (4–8 time points) and single observations are analyzed gene by gene to compare the performance of the AR(1) model with static methods in identifying differential expression. The posterior probabilities of the model parameters are estimated through MCMC simulations with

,000 iterations and a 1,000-iteration burn-in. We provide detailed notation and equations for the three dynamic approaches in the Supplementary data available online. In the results, we are most interested in the autoregressive coefficients, which represent the sequential random effects of the time series in the model, and we implemented a classification of genes into TDE (temporally differentially expressed over time) and EE (equally expressed over time) sets. Similar to statistical differential expression testing for each gene in a classical approach, our test in the AR(1) model is based on a Bayesian interval estimate, the 95% credible interval of the autoregressive coefficient: a gene is considered temporally differentially expressed (TDE) if the 95% credible interval does not include 0; otherwise, it is considered equally expressed (EE). We also obtain, using MCMC, the posterior tail probability of the coefficient for each gene, that is, the probability that it lies above or below 0. This indicates the significance of differential expression for each gene.
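The decision rule described above can be sketched as follows, assuming posterior draws of a gene's autoregressive coefficient have already been produced by some MCMC sampler. The function name `classify_gene` and the use of an equal-tailed interval are assumptions made for illustration; the sampler itself is not shown.

```python
import statistics


def classify_gene(posterior_samples):
    """Return (label, (lower, upper), tail_probability) from posterior draws."""
    # With n=40 there are 39 cut points; the first and last are the
    # 2.5% and 97.5% quantiles, giving an equal-tailed 95% interval.
    cuts = statistics.quantiles(posterior_samples, n=40, method="inclusive")
    lower, upper = cuts[0], cuts[-1]
    label = "EE" if lower <= 0.0 <= upper else "TDE"
    n = len(posterior_samples)
    frac_positive = sum(1 for s in posterior_samples if s > 0) / n
    frac_negative = sum(1 for s in posterior_samples if s < 0) / n
    # The smaller tail: posterior mass on the "wrong" side of zero.
    tail_probability = min(frac_positive, frac_negative)
    return label, (lower, upper), tail_probability
```

A small tail probability and an interval that excludes 0 point the same way, so the two summaries can be reported side by side for each gene.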

2.3.2. Negative Binomial in AR(1)

A more compelling methodological goal is to infer temporal dynamics when replicates are available within a time point. In that case, it is straightforward to establish a negative binomial model with AR(1), in which the Poisson distribution for the read counts is replaced by a negative binomial distribution. Other parts of the model remain identical to those in (3). The negative binomial distribution is specified by its mean and an additional parameter, called the dispersion parameter.
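For reference, one common mean-dispersion parameterization of the negative binomial distribution is written out below. The symbols $y$, $\mu$, and $\phi$ are assumed here for illustration, and the parameterization used in the source may differ.

```latex
\[
P(Y = y) = \frac{\Gamma\!\left(y + \phi^{-1}\right)}{\Gamma\!\left(\phi^{-1}\right)\, y!}
\left(\frac{1}{1 + \phi\mu}\right)^{\phi^{-1}}
\left(\frac{\phi\mu}{1 + \phi\mu}\right)^{y},
\qquad y = 0, 1, 2, \ldots
\]
\[
\mathrm{E}[Y] = \mu, \qquad \mathrm{Var}(Y) = \mu + \phi\mu^{2}.
\]
```

In this form the variance exceeds the mean by $\phi\mu^{2}$, so the Poisson model is recovered in the limit $\phi \to 0$.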

2.4. Hidden Markov Model (HMM)

We consider a Bayesian HMM to analyze factorial time course RNA-seq data. Our model follows the seminal work of Yuan and Kendziorski [48], which characterizes all possible temporal differential expression patterns in time series microarray data with two or more biological conditions. Although this early study was encouraging, the HMM was restricted to representing timing differences between biological conditions, with binary EE/DE states or multiple latent hidden states depending on the number of conditions at each time point. The extent of temporal change between one time point and the next was not made explicit. Taking an HMM approach, we seek to incorporate SETI and multivariate coupled relationships among distinct trajectories into HMMs for each condition, in order to investigate biological trajectories. The approach can be applied to a comprehensive set of RNA-seq time series data to make probabilistic predictions of how differential expression will occur over time under different biological conditions. Count-specific underlying distributions for RNA-seq time series data are also used. First, we introduce a mechanism for inferring temporally differentially expressed genes in time series RNA-seq gene expression profiles with multiple biological conditions at a given time point. This is achieved by incorporating GP and NBD models with corresponding prior information into the HMM for each gene, allowing samples with either multiple replicates or single observations. We investigate properties of the HMM technique, such as how it benefits from incorporating hidden variables when predicting temporal patterns of differential expression for different biological conditions, and how the number of latent variables varies with the conditions within a stage over a time period.
As per Section 2.1, we express the hidden states in the models with subindices for time points, biological conditions (e.g., drug treatments or tissues), and replicates. As RNA-seq experiments generally have small sample sizes, the identification of statistically significant temporally differentially expressed (TDE) genes may have limited power. Some studies also stress the importance of replication in microarray studies, which have inherent variability [4, 5, 33, 34, 55] regardless of how well-constructed the DE methods applied are. Thus, without replicates, no statistical significance test is reliable or powerful in detecting TDE. With the reduction in sequencing costs, well-designed, balanced RNA-seq experiments with proper sample sizes and time points will facilitate the use of temporal dynamic methods, including the AR(1) model. Here the HMM is used with data from 4 biological conditions (different tissues). The gene expression dataset is indexed by genes, time points, conditions, and replicates. The algorithm makes the Markovian assumption that the expression level at the current time depends only on that at the most recent time. We use hidden states to represent a change in expression levels between different biological conditions. This framework thus allows us to detect TDE genes and to calculate the posterior probabilities of all possible TDE patterns. For instance, with three time points, the method can estimate the posterior probability of the pattern EE-DE-EE, where EE stands for equally expressed and DE for differentially expressed. The main interest is to identify the relationship among the latent class mean values of the expression level for each gene g at each time point. Hence, the primary goal of the HMM in a time course experiment with multiple conditions is to infer all potential relationships among the conditions. In the simplest case, with two biological conditions, the outcome is binary (EE/DE). For a more complicated experimental design with more than two biological conditions, suppose that the conditions correspond to different tissues, hereafter tissues A, B, C, and D. Correspondingly, there are 4 expression profiles, one per tissue, and 15 possible expression pattern states.

More generally, the number of potential patterns as a function of the number of tissues equals the Bell number, that is, the number of possible set partitions. Each state is not observed and needs to be estimated from the data; we therefore refer to such states as hidden. For each gene g at each time point, we want to estimate the probability of each hidden state. We then associate an observation model with each state and compute the most likely sequence of states over time to derive timing differences for a given gene g. Fitting a hidden Markov model involves estimating the transition probability matrix, the initial probability distribution, and the unobserved hidden state at each time point. Estimation is done by the EM algorithm, as described and implemented in the original HMM paper. The parametric empirical models (PEM) based on GP and NBD are considered here.
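The count of 15 pattern states for four tissues can be checked by enumerating set partitions directly. The sketch below is only an illustration: the function names and the "A=B, C=D" labelling (tissues joined by "=" share a latent mean) are assumptions, not notation from the paper.

```python
def set_partitions(items):
    """Yield every partition of `items` into non-empty groups."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Put `first` into each existing group in turn...
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        # ...or start a new group containing only `first`.
        yield [[first]] + partition


def describe(partition):
    """Render a partition such as [['A', 'B'], ['C', 'D']] as 'A=B, C=D'."""
    return ", ".join("=".join(sorted(group)) for group in sorted(partition))


patterns = [describe(p) for p in set_partitions(["A", "B", "C", "D"])]
print(len(patterns))  # number of hidden pattern states for four tissues
for pattern in sorted(patterns):
    print(pattern)
```

The same function gives 2 states for two conditions (the binary EE/DE case) and 5 for three, which shows how quickly the hidden state space grows as conditions are added.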

In the GP model, for two biological conditions at each time point, the two marginal distributions of the hidden states are given as shown in Yuan et al. for the microarray application. The underlying distributions and the joint predictive density (JPD) for discrete count data are incorporated to infer posterior probability distributions. Given the proportion of TDE genes at a time point, the marginal distribution of the data is a mixture of the state-specific distributions weighted by that proportion. The Poisson mean follows a conjugate gamma prior with a shape parameter and a rate parameter. Thus, three parameters need to be estimated for a given gene. For the GP model, the Markov chain is assumed to be homogeneous, and the marginal distribution of the data is a finite mixture. We assume a first-order (one-step) time series correlation structure, so that the HMM has Poisson state-dependent distributions. The goal of this algorithm is to identify the set of genes that are TDE in a combination of time series and four different biological conditions, for example, distinct tissue types. To assess the utility of HMMs in time course RNA-seq experiments with multiple tissues, we use a parametric hierarchical empirical Bayes model with GP (data without replication) and NBD (data with replication) with a beta prior, as a modified Bayesian approach [42, 56, 57]. The Newton et al. [57] approach identifies differentially expressed genes for microarray experiments with multiple biological conditions at a static time point; similarly, Hardcastle and Kelly [42] identify differentially expressed genes either for pairwise comparisons or for multiple group comparisons in an RNA-seq experiment at a static time point. For microarray data, Yuan and Kendziorski [48] proposed an HMM for dynamic time course experiments with multiple conditions, using Gamma-Gamma (GG) and Log Normal Normal (LNN) models to identify genes of interest whose temporal profiles differ across two or more biological conditions. Here, we adapted and extended that approach to a general RNA-seq framework, with GP and NBD as more flexible models. The earlier studies are limited to detecting temporal patterns and do not rank or order genes by their temporal dynamics across developmental stages, which is what biologists are more interested in examining. We assume two common underlying distributions for RNA-seq read counts. In reality, violations of the GP assumptions are very common; to account for overdispersion, NBD with a beta prior is applied as an alternative.
The above inference method provides a continuous trajectory regression that captures timing features, allowing temporal genes to be ranked statistically for a given pattern, together with those genes’ temporal differential expression patterns among conditions. In addition, we examined multivariate identification of temporal expression using the following metrics.

2.5. Coupled Multivariate Identification of SETIs
2.5.1. Granger Causality

The concept of Granger causality between two distinct SETIs assumes that the data at the current time point affect the data at the succeeding time point [58]. To determine Granger causality for each pair of trajectories, we employ standard F-statistics to test whether the residual values derived from the fitted smoother for gene A are incorporated into the equation for another gene B. If all the coefficients for the measurements of gene B are zero under the null hypothesis, then there is no statistically significant Granger causality between the trajectories for genes A and B.
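In its textbook form, a test of this kind compares a restricted regression of one series on its own past with an unrestricted regression that adds the lagged values of the other series. The notation below ($\mathrm{RSS}_r$, $\mathrm{RSS}_u$, $q$, $n$, $k$) is assumed for illustration and is not taken from the source.

```latex
\[
F = \frac{\left(\mathrm{RSS}_r - \mathrm{RSS}_u\right) / q}{\mathrm{RSS}_u / (n - k)}
\;\sim\; F_{q,\, n-k} \quad \text{under } H_0,
\]
```

Here $\mathrm{RSS}_r$ and $\mathrm{RSS}_u$ are the residual sums of squares of the restricted and unrestricted fits, $q$ is the number of added lagged terms set to zero under the null hypothesis, $n$ is the number of usable observations, and $k$ is the number of parameters in the unrestricted model. A large $F$ indicates that the added lagged terms improve the fit more than chance would explain.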

2.5.2. Cotrajectory with Glass-d-Score

Similarly, each pair of trajectories, corresponding to the expression levels of two genes, is explored with another dependency metric. For a given pair of gene expression profiles, the score is based on the correlation coefficient between the two profiles, evaluated among all possible pairs. The null distribution is assumed to have the mean and standard deviation of the correlation coefficients between the gene and all other genes.
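One plausible reading of this description, stated here as an assumption because the original formula did not survive, is a standardized score in the style of Glass's effect size. The symbols $d_{AB}$, $r_{AB}$, $\bar{r}_{A}$, and $s_{A}$ are introduced only for illustration.

```latex
\[
d_{AB} = \frac{r_{AB} - \bar{r}_{A}}{s_{A}},
\]
```

where $r_{AB}$ is the correlation coefficient between the profiles of genes $A$ and $B$, and $\bar{r}_{A}$ and $s_{A}$ are the mean and standard deviation of the correlation coefficients between gene $A$ and all other genes. Under this reading, a large $|d_{AB}|$ marks a pair whose correlation stands out from gene $A$'s typical correlation with the rest of the genome.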

2.5.3. Correlation Approach

Following Ma et al. and Barker et al., we use a biologically motivated approach to measure the relationship between two different genes based on their temporal expression profiles in RNA-seq. Ma et al. proposed lagged coexpression analysis to capture the scenario in which gene B responds to gene A with a delay, so that the profile of gene B is correlated with the time-delayed profile of gene A.
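The lagged coexpression idea can be sketched with a plain Pearson correlation between gene A at time t and gene B at time t + lag. This is a minimal sketch, not the procedure of Ma et al.: the function names `lagged_correlation` and `best_lag` are hypothetical, and choosing the lag with the largest absolute correlation is an assumption.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return sxy / math.sqrt(sxx * syy)


def lagged_correlation(profile_a, profile_b, lag):
    """Correlate gene A at time t with gene B at time t + lag (lag >= 0)."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must have the same number of time points")
    if lag < 0 or lag > len(profile_a) - 2:
        raise ValueError("lag leaves fewer than two overlapping time points")
    if lag == 0:
        return pearson(profile_a, profile_b)
    return pearson(profile_a[:-lag], profile_b[lag:])


def best_lag(profile_a, profile_b, max_lag):
    """Lag in 0..max_lag with the largest absolute lagged correlation."""
    return max(
        range(max_lag + 1),
        key=lambda k: abs(lagged_correlation(profile_a, profile_b, k)),
    )
```

With the short series typical of RNA-seq time courses (4 to 8 points), each extra lag removes one overlapping time point, so large lags leave very little data and their correlations should be read with caution.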

2.6. Pairwise Methods

In this section, we describe the pairwise methods that we compare with the methods discussed above, which explicitly model the time dependencies in the data. For these comparisons, we examined several popular static methods, including Fisher’s exact test for simple two-sample comparisons and the log linear model for multigroup comparisons. These can also be applied to RNA-seq time series data for temporal analysis; they are intuitive but limited.

DE analyses: we first employed pairwise condition comparison methods on the digital measures at a given static state, without respect to time. Taking the union of all possible pairwise comparisons from these static techniques is a common way to identify temporal dynamics in relatively small experiments, where the design contains a single sample per time point and very few time points. (i) Fisher’s exact test: from Table 1, the 2-sided P value for TDE of each gene is computed with (12) [39].

(ii) Audic-Claverie statistics.

The Audic-Claverie statistics [59] are based on the distribution of the read count in one sample, informed by the read count in another, under the null hypothesis that the read counts are generated identically and independently from an unknown Poisson distribution. This distribution is computed as an infinite mixture of all possible Poisson distributions, with mixing proportions equal to the posteriors under a flat prior over the Poisson mean. When the two libraries in a given Solexa/Illumina RNA-seq experiment are of the same size, the statistic takes a simple closed form for the given pair of read counts [59].
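For two libraries of equal size, the Audic-Claverie conditional probability of observing count y given count x has the closed form (x + y)! / (x! y! 2^(x + y + 1)). The sketch below evaluates it with exact integer arithmetic; the function names and the upper-tail helper are illustrative assumptions, and the unequal-library case is not covered.

```python
from math import comb


def audic_claverie_pmf(y, x):
    """P(y | x) for equal library sizes: (x + y)! / (x! y! 2^(x + y + 1))."""
    if x < 0 or y < 0:
        raise ValueError("read counts must be non-negative")
    # comb(x + y, y) equals (x + y)! / (x! y!) and stays an exact integer.
    return comb(x + y, y) / 2 ** (x + y + 1)


def audic_claverie_upper_tail(y, x):
    """P(Y >= y | x): chance of a count at least as large as y, given x."""
    return 1.0 - sum(audic_claverie_pmf(k, x) for k in range(y))


# Example: a gene with 5 reads at one time point and 20 at another.
print(audic_claverie_pmf(20, 5))
print(audic_claverie_upper_tail(20, 5))
```

The probabilities over all y sum to 1 for any fixed x, which is a quick sanity check on the formula before applying it gene by gene.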

Pooling methods: as with ANOVA for microarrays, the log linear model and linear models for microarray data (LIMMA), applied after a variance-stabilizing transformation to allow multigroup and multifactor comparisons, can be used by including a time variable as the main factor in the model [40]. (i) Log linear model with the Poisson link function (or negative binomial when replicates are available) and a likelihood ratio test. The model includes the time factor, the biological condition factors, and their interaction terms. (ii) LIMMA (linear models for microarray data), with its test statistics under the linear model setting as implemented in the R package, is also applied to time series RNA-seq read count data after a variance-stabilizing transformation. Although such static algorithms have, to some degree, successfully identified temporally expressed genes over the past four years and in our study, static methods can introduce false discoveries in temporal dynamic analysis because they ignore the Markovian dependence frequently seen in time series expression profiles. As the cost of sequencing continues to decline, there is an urgent need for more powerful statistical methodologies for identifying temporal expression, and for characterizing temporal dynamics to assess isoform diversity within a gene in future time series RNA-seq investigations. It is critical to have a model that represents the observed data well, since interpreting a model that does not capture the valuable information in the data is useless. For this purpose, our dynamic methods are compared with these static methods by evaluating the overlap in the number of differentially expressed genes in real datasets.

3. RNA-seq Time Series Data

3.1. Three Different Types of Time Series

There are mainly three types of time series in RNA-seq. The first is factorial time series data, which include at least two biological conditions to be compared at a given time point and therefore have as many developmental patterns over time as there are conditions. The second type has a single condition and a corresponding developmental time course. The third type is periodic data, with two subtypes: circadian rhythmic data and cell cycle data. In this study, we formulate the statistical framework for identifying temporal changes in RNA-seq time series for the first two types of data; the periodic datasets are reviewed separately and in depth in “another review manuscript”, with discrete Fourier transformation and other methods.

3.2. RNA-seq Real Time Series Datasets
3.2.1. Factorial Time Course Experiment: A Sheep Model for Delayed Bone Healing

We consider published RNA-seq time series data from a sheep model for delayed bone healing. In Jager et al., surgery was conducted as described in [52, 53], and the newly generated tissues were harvested 7, 11, 14, and 21 days after surgery. For each time point there are 6 biological replicates in both groups, except day 21 (group I, , group II, ), where the two groups are defined by the standard healing system and the delayed healing system. Thus, the authors considered two treatments: the standard healing system and treatment with an unstable external fixator leading to delayed bone healing. Standard bone healing was investigated in a 3 mm tibial osteotomy model stabilized with a medially mounted rigid external fixator, whereas delayed healing was investigated in a 3 mm tibial osteotomy model stabilized with a medially mounted, rotationally unstable external fixator. For each treatment, RNA-seq data were collected at 4 time points (7, 11, 14, and 21 days), with 5-6 individuals’ DNA samples pooled together at each time point. In their differential expression analysis, they used the pooled samples from 5-6 lanes of animal samples at each time point, and Audic-Claverie statistics were computed on 4 samples over 4 time points by taking the union set of all possible pairwise comparisons using static methods. We reanalyzed their sheep time series data using three dynamic methods to identify TDE genes.
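The Audic-Claverie statistic used for the pairwise comparisons above gives the probability of seeing y reads for a gene in one library given x reads in another. The following is a minimal sketch, assuming the commonly cited form of that probability; the function names, the library sizes n1 and n2, and the two-sided p-value convention are illustrative assumptions, not the original authors' code.

```python
import math

def audic_claverie_prob(x, y, n1, n2):
    """P(y | x): y reads in library 2 (size n2) given x reads in library 1 (size n1)."""
    r = n2 / n1
    # Work in log space so that large counts do not overflow the factorials.
    log_p = (y * math.log(r)
             + math.lgamma(x + y + 1) - math.lgamma(x + 1) - math.lgamma(y + 1)
             - (x + y + 1) * math.log(1 + r))
    return math.exp(log_p)

def audic_claverie_pvalue(x, y, n1, n2):
    """Two-sided p-value (assumed convention): twice the smaller tail, clamped to [0, 1]."""
    lower = sum(audic_claverie_prob(x, k, n1, n2) for k in range(y + 1))
    upper = 1.0 - lower + audic_claverie_prob(x, y, n1, n2)
    return max(0.0, min(1.0, 2.0 * min(lower, upper)))

# Hypothetical counts for one gene in two libraries of equal size.
p_same = audic_claverie_pvalue(10, 10, 1_000_000, 1_000_000)
p_diff = audic_claverie_pvalue(10, 100, 1_000_000, 1_000_000)
```

Running this test for every pair of time points and taking the union of the significant genes gives a static gene list of the kind described above; the ordering of the time points plays no role in it.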

3.2.2. Single Transient Time Course Experiment-I

We also used two time series datasets with a single biological condition, where the interest lies in developmental transient patterns over a time period rather than in timing differences between multiple conditions, as in the Section 3.2.1 example. The maize leaf transcriptome with four different developmental zones and two replicates at each time point [50] was employed. This is a representative time course experiment with a single transient expression profile. Tissues were collected from leaf 3 at 9 days after planting 3 hours into the period from four segments: basal (1 cm above the leaf three ligule), transitional (1 cm below the leaf two ligule), maturing (4 cm above the leaf two ligule), and mature (1 cm below the leaf three tip). Thus, maize leaf data with different developmental stages are generated from mRNA isolated from four developmental zones: basal zone, transitional zone, maturing zone, and mature zone. In the differential expression analysis, they simply applied a chi-squared static method and the k-means clustering method, neither of which takes time dependency into account; all samples are assumed to be independent. These maize leaf time series data are reanalyzed with the methods proposed in this study.

3.2.3. Single Transient Time Course Experiment-II

This time series experiment comprises eight stages of early zebrafish development (embryogenesis) [51]. In that study, wild-type zebrafish embryos (TLAB) were staged according to standard procedures, and about 1,000 embryos were collected per stage (two to four cells, 1,000 cells, dome, shield, bud, 28 hpf, 48 hpf, and 120 hpf) within a tight time window of 10 min. The collection procedure ensured that all embryos were at the same developmental stage. The identification of long noncoding RNAs (lncRNAs) expressed during zebrafish embryogenesis was explored to assess the diversity of transcripts that are structurally similar to mRNAs but noncoding. The analyses of RNA-seq time series expression profiles focused on the temporal dynamics of lncRNAs using the Cuffdiff method in its time series mode with upper quantile normalization, which is also limited to pairwise comparisons between each time point and the next. Here, we reanalyzed the transcriptomic gene expression profile data with 28,520 annotated protein coding genes. To examine the similarities and differences between static and dynamic methods for time series RNA-seq data, we systematically compared both kinds of method on these data.

3.3. Results in Differential Expression Analysis on Static and Dynamic Methods

For the sheep data, the authors applied the Audic-Claverie method to the normalized expression values (RPKM) to compare later time points to the reference time point (day 7) in both groups. After all pairwise comparisons, they combined the sets of differentially expressed genes, with 884 genes detected in total from 24,325 mappable genes. Based on these 884 genes, they performed hierarchical clustering to identify gene clusters. Each cluster was then subjected to gene ontology analysis to find significant biological functions. The differential analysis performed in the original paper is therefore based on a static differential analysis method.

We reanalyzed their sheep factorial time course experiment data to identify TDE genes over time through dynamic methods (HMM, SETI, and the AR model) that account for the correlated time-dependency structure. The HMM identifies temporal patterns by classifying each time point as DE or EE using posterior probabilities, whereas SETI, with statistical significance from permutation resampling procedures, and the AR model, with a gamma Poisson Bayesian assumption on count data, are applied within each single biological condition separately. Results obtained by these dynamic methods were compared with those of static methods: the simple pairwise methods (Audic-Claverie statistics and Fisher’s exact test) and the pooling static methods (glmFit in edgeR, LIMMA, and the log linear model), as shown in Figure 3. To identify temporal dynamics under the assumption of a correlated data structure, we performed HMM modeling with Poisson-gamma since there were no replicates. AR model and SETI significance tests were also done within each biological condition. The temporally differentially expressed gene sets detected by these dynamic methods were compared with the results of the simple pairwise tests and pooling methods. From the HMM, 646 temporal dynamics of DE calls are identified as DE in at least one time point.
The HMM only explores different temporal patterns of DE/EE states and does not rank the genes by statistical significance; instead, it classifies each gene expression profile into one of a number of temporal patterns by the posterior distribution of latent states. Because of this limitation, we employed the SETI and AR models to discover developmental transient patterns in each condition. The trimmed mean time evolution trajectory index is presented for the top three candidate temporal genes in each bone healing system. The 95% bootstrap confidence interval and the FDR from permutation resampling are shown in Figure 4 and Table 2. To determine temporal dynamics and meaningful biological functions, only HMM-specific TDE genes, which are not found by the static methods, are further explored in gene clustering and biological functional network analysis, as shown in Figures 5(a) and 5(b), respectively. These results clearly show temporally differential expression, implying that the information lost by ignoring the stochastic time-dependent structure might lead to false discoveries and lower detection power. To discover temporal transient patterns of differential gene expression within each healing system, we applied the SETI and AR model approaches to each condition. SETI results are given in Figure 4, showing the top candidate TDE genes; some of these, such as gi|119921123 and B6DXC7, have low expression levels and could not be detected by the static methods.

For the second dataset in our study, we reanalyzed the maize leaf transcriptome data to identify TDE genes with static and dynamic methods and compared the two.
In their paper, they investigated the leaf development gradient in time series gene expression data at successive stages (4 developmental zones from base to tip: basal, transitional, maturing, and mature) and identified a gradient of gene expression from base to tip: basal (23,354) > transitional (22,663) > maturing (22,036) > mature (21,332) from a total of 25,800 annotated genes. For the differential analysis of the time series RNA-seq data, they used the method proposed in Marioni et al. [40] for pairwise analysis. A total of 16,502 genes were found to be differentially expressed in at least one of the comparisons. They then performed k-means clustering and showed eighteen clusters along the four developmental zones (Base, -1 cm, 4 cm, Tip). To compare the gene sets detected by our dynamic methods with their gene lists, the dynamic methods (SETI and the AR model) are applied again in this study, and all temporally differentially expressed genes are presented in Supplementary Tables 1 and 3; the filtered gene set tested for differential expression has 5,273 and 12,322 temporal dynamic transcripts out of 42,399 transcripts through SETI and the AR model, respectively. On the basis of significant temporal expression, we compared the dynamic methods to the static methods used in the original paper, which do not account for the correlated data structure.

As the third real data application, we reanalyzed the zebrafish embryonic transcriptome, focusing specifically on the identification and characterization of temporally differential expression using the statistical evolutionary trajectory index and the autoregressive time-lagged AR model. We furthermore used both methods to rank temporal genes by statistical significance. As a consequence of the resampling-based procedures and the posterior probabilities of autocorrelation, the gene-by-gene approach made it possible to order temporal genes by the two dynamic methods and to identify genes associated with cotemporal dynamics.
To investigate such paired temporal dynamics, we examined the relationships between genes using bivariate identification methods. The Glass's d score is reported in Supplementary Table 5. Likewise, the statistical evolutionary trajectory index with statistical significance for the zebrafish data is given in Supplementary Table 2, where we filtered genes by a coefficient of variation (CV) criterion, leaving 12,034 genes. Overall, both methods are more robust at low and moderate expression levels than existing parametric static methods, indicating that our methods achieve relative improvements in identifying temporal genes; the AR model gives more sensitive TDE calls than the SETI resampling procedure in two real data applications.

Here, we examined how the results obtained by dynamic time series methods differ. For simple pairwise static methods, we employed Audic-Claverie statistics and Fisher’s exact test, as these two methods have been widely used in previous studies. They showed highly concordant results on other RNA-seq datasets compared to DEGseq, DESeq, edgeR, and baySeq (data not shown). In the differential analysis with simple pairwise methods, we took the union set after all pairwise comparisons across the time period and among the different biological conditions, since these methods only test one pair at a time, and confirmed that the results match those of the original papers. For pooling static methods, LIMMA, the log linear model, and the edgeR R package with glmFit were used to identify TDE genes. To compare with the above static methods, we employed the three dynamic methods described in the previous sections. The results are shown in Figures 1 and 3. Figure 2 shows the dependence structure in the patterns identified across time points, 36(23), 300(277), and 186(134): a percentage of the genes in the previous TDE gene set are re-identified as differentially expressed at the very next time point, implying that there is a temporally dependent structure in the sheep healing system RNA-seq time series data.
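The temporal dependence described above can be checked by asking what fraction of the genes called DE at one time point are called DE again at the next. The following is a minimal sketch of that bookkeeping, assuming DE calls are available as one set of gene IDs per time point; the gene IDs and the function name are made up for illustration and are not taken from the sheep data.

```python
def reidentified_fraction(de_by_time):
    """For each consecutive pair of time points, return the fraction of genes
    called DE at time t that are called DE again at time t+1."""
    fractions = []
    for prev, nxt in zip(de_by_time, de_by_time[1:]):
        if prev:
            fractions.append(len(prev & nxt) / len(prev))
        else:
            fractions.append(float("nan"))  # nothing was DE at time t
    return fractions

# Hypothetical DE calls at four consecutive time points.
de_calls = [
    {"g1", "g2", "g3", "g4"},
    {"g2", "g3", "g4", "g5"},
    {"g3", "g5"},
    {"g3", "g5", "g6"},
]
fractions = reidentified_fraction(de_calls)
```

Fractions well above what independent calls would give suggest that adjacent time points are correlated, which is the situation the dynamic methods are meant to model.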


We examined the 3 currently available RNA-seq data sets with the largest number of samples (Pickrell and others, 2010; Montgomery and others, 2010; Cheung and others, 2010). In all 3 studies, the samples are lymphoblastoid cell lines from unrelated individuals in the HapMap project (International HapMap Consortium, 2003). Montgomery and others (2010) sequenced 60 individuals from the Utah residents with ancestry from northern and western Europe collected by Centre d'Etude du Polymorphisme Humain (CEPH). Cheung and others (2010) sequenced 41 individuals from the same population, 29 of them in common with Montgomery and others (2010). Pickrell and others (2010) sequenced 69 individuals from Yoruba in Ibadan, Nigeria. All 3 studies, hereafter referred to as Montgomery, Pickrell, and Cheung, were designed to study the effect of genetics on gene expression, and subjects were considered interchangeable. We therefore used these data to assess improvements in precision. The samples that were done in replicate across 2 studies were particularly useful for this purpose.

We also examined 6 human samples from Pai and others (2011). This study sequenced 3 male and 3 female livers and compared the results to other primate species. The samples were from primary tissues, as opposed to cell lines which are generally associated with more stable RNA data. This data set, which we refer to as Pai, served as an example of a small study based on a controlled experiment.

To assess accuracy, we used samples from Bullard and others (2010), in which 2 samples from the microarray quality control study (MAQC Consortium, 2006) were sequenced. These 2 samples are Stratagene's universal human reference RNA (UHR), which is a commercial pool of RNA from 15 different cell lines, and Ambion's human brain reference RNA. The same samples have been assayed extensively on microarrays, and we used data from the MAQC Consortium (2006), in which each of the 2 samples was hybridized to 5 different Affymetrix U133 Plus 2.0 arrays (Affymetrix, Santa Clara, CA). The microarrays served as an independent measurement that permitted an assessment of accuracy. This data set has no biological replicates, and the technical replicates are based on commercially available RNA, making the technical noise smaller than what would be expected from tissue samples.

For all data sets, the original reads were downloaded and mapped, and the gene expression count matrix was created as follows. Reads were aligned to the human reference genome sequence (version hg19) using Bowtie (Langmead and others, 2009), allowing up to 2 mismatches. All reads were trimmed from the 3′ end to be 35 bp long, and for the Montgomery data, we used the first read of the paired-end reads. To assign reads to genes, we followed essentially the same procedure used by Bullard and others (2010), except that (a) we determined overlap between a read and a genomic region based on the center base of the trimmed read and not the 5′ end and (b) we used union gene representations instead of the union–intersection gene representation discussed in Bullard and others (2010). Sequencing depth was determined as the number of reads mapped to the genome.
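The center-base assignment rule described above can be sketched as follows. This is an illustration only, assuming 0-based half-open coordinates, the leftmost aligned position as the read start, and a gene model stored as a list of exon intervals; the gene names, coordinates, and the choice to discard reads whose center falls in more than one gene are assumptions, not details from the study.

```python
READ_LEN = 35  # reads were trimmed to 35 bp

def center_base(start, length=READ_LEN):
    """Coordinate of the middle base of a read whose leftmost base is at `start`."""
    return start + length // 2

def assign_read(chrom, start, genes):
    """Return the gene whose union of exons contains the read's center base.
    Returns None if no gene, or more than one gene, contains it (assumed policy)."""
    pos = center_base(start)
    hits = [
        name
        for name, (g_chrom, intervals) in genes.items()
        if g_chrom == chrom and any(lo <= pos < hi for lo, hi in intervals)
    ]
    return hits[0] if len(hits) == 1 else None

# Hypothetical union gene models: every exon of every isoform of a gene.
genes = {
    "geneA": ("chr1", [(100, 200), (300, 400)]),
    "geneB": ("chr1", [(390, 500)]),
}
assigned = [assign_read("chr1", s, genes) for s in (90, 190, 375, 400)]
```

The read starting at 190 begins inside geneA but its center base (207) lies past the exon, so it is not counted; a rule based on the read start would have assigned it to geneA.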


The use of RNA-Seq experiments to study organisms’ genomes is becoming ubiquitous, and the explosion in the use of sequencing technology has led to a related explosion in the development of statistical methods for processing and analyzing RNA-Seq data. As previous research has demonstrated [10], proper normalization is an essential step in the analysis pipeline. We have seen that incorrect normalization can result in downstream errors such as inflated false positives. The need for normalization arises from the inherent variability in the collection of RNA-Seq data, and a variety of normalization methods have been devised to combat this variability. As we have seen, the literature has not reached a consensus on which normalization method to use.

Both the simulations and the real data allow us to understand the effects of symmetric versus asymmetric differential expression and the effects of differing amounts of mRNA/cell. The simulations isolated all other conditions and allowed for a direct comparison between methods. The real data told the same story as the simulated data with respect to the (a)symmetry of the differential expression, validating the more complete simulation results. In particular, it is worth noting that the performance of Total Count normalization depends on the amount of mRNA/cell and not differential expression symmetry. Indeed, Total Count normalization outperforms the other normalization methods when the data are asymmetric with same mRNA/cell, though we do not know how often such conditions occur in real, full data.
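To make the dependence of Total Count normalization on mRNA/cell concrete, here is a small sketch with made-up numbers. It assumes the raw counts are proportional to true per-cell abundance and that the treated sample has more total mRNA per cell because one gene is strongly induced; the gene names and values are hypothetical.

```python
def total_count_normalize(counts, scale=1_000_000):
    """Scale each gene's count by the sample total (counts per `scale` reads)."""
    total = sum(counts.values())
    return {gene: c * scale / total for gene, c in counts.items()}

# Hypothetical samples: only geneD truly changes between conditions.
control = {"geneA": 100, "geneB": 100, "geneC": 100, "geneD": 100}
treated = {"geneA": 100, "geneB": 100, "geneC": 100, "geneD": 700}

norm_control = total_count_normalize(control)
norm_treated = total_count_normalize(treated)

# geneA is unchanged in truth, but its normalized value drops in the treated sample.
ratio_geneA = norm_treated["geneA"] / norm_control["geneA"]
```

Here the unchanged genes appear 2.5-fold down after normalization, because the method assumes that total mRNA is the same in both samples. When that assumption holds, the same calculation leaves unchanged genes at a ratio of 1.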

Each normalization procedure relies on assumptions, and when violated, the procedures lead to incorrect results. For each assumption, there is evidence that it may not hold in some experiments. Part of an analysis of RNA-Seq data requires choosing a normalization procedure, and keeping the assumptions of each method in mind can help to make the appropriate choice for the experiment at hand. However, there may be many situations in which the validity of any assumption is unknown for the given experiment. In such cases, normalization with external controls would be the appropriate choice if the external controls can be trusted. Unfortunately, several authors have found problems with spike-ins and so propose additional methods to handle these issues. It is clear that spike-ins are necessary in some circumstances, and we hope that as research progresses their performance will improve.

To the best of our knowledge, there does not exist an extensive analysis of published data, which evaluates the assumptions of normalization methods. Given the potential violations to each normalization assumption, knowledge of the extent to which each assumption holds in a given experiment would be instrumental in helping to choose a normalization method for RNA-Seq analysis. There is no clear way to perform such an evaluation, however, considering that violations of assumptions (such as a global shift) may go undetected without additional information, and the requisite information may not be present in the original experiment.

Assumptions allow normalization to translate raw read counts into meaningful measures of expression.

The correct normalization method to use depends on which assumptions are valid for the biological experiment.

Incorrect normalization leads to problems in downstream analysis, such as inflated false positives, that mean results cannot be trusted.

No normalization method is perfect, and for every method there exist cases in which the assumptions are violated. There are examples of global shifts in expression that violate the assumptions of conventional normalization methods and therefore require controls.

An understanding of assumptions can help pick the most suitable normalization method for a given experiment.


Now that we have performed the differential expression analysis, we can explore our results for a particular comparison. To denote our comparison of interest, we need to specify the contrast and perform shrinkage of the log2 fold changes.

Let’s compare the stimulated group relative to the control:

We will output our significant genes and perform a few different visualization techniques to explore our results:

  • Table of results for all genes
  • Table of results for significant genes (padj < 0.05)
  • Scatterplot of normalized expression of top 20 most significant genes
  • Heatmap of all significant genes
  • Volcano plot of results

Table of results for all genes

First let’s generate the results table for all of our results:

Table of results for significant genes

Next, we can filter our table for only the significant genes using a p-adjusted threshold of 0.05
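The filtering step can be sketched in plain Python, assuming the results table has been exported as a list of rows with `gene`, `log2FoldChange`, and `padj` fields. The rows below are invented, and this is not the code from the original lesson, which is missing from this copy.

```python
PADJ_CUTOFF = 0.05

# Hypothetical rows from a differential expression results table.
results = [
    {"gene": "geneA", "log2FoldChange": 2.1, "padj": 0.001},
    {"gene": "geneB", "log2FoldChange": -0.3, "padj": 0.40},
    {"gene": "geneC", "log2FoldChange": -1.8, "padj": 0.049},
    {"gene": "geneD", "log2FoldChange": 0.9, "padj": None},  # adjusted p-value missing
]

def significant(rows, cutoff=PADJ_CUTOFF):
    """Keep rows whose adjusted p-value is present and below the cutoff."""
    return [r for r in rows if r["padj"] is not None and r["padj"] < cutoff]

sig_genes = [r["gene"] for r in significant(results)]
```

Rows with a missing adjusted p-value are dropped explicitly rather than compared against the cutoff, since a missing value is neither significant nor non-significant.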

Scatterplot of normalized expression of top 20 most significant genes

Now that we have identified the significant genes, we can plot a scatterplot of the top 20 significant genes. This plot is a good check to make sure that we are interpreting our fold change values correctly, as well.
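As a rough sketch of the two ideas in this step, the snippet below picks the top genes by adjusted p-value and recomputes a simple log2 fold change from group means to check its direction. It assumes normalized expression values per group and uses a plain ratio of means with a pseudocount, which is not the shrunken estimate mentioned earlier; all names and values are hypothetical.

```python
import math

def top_n_by_padj(rows, n=20):
    """Return the n rows with the smallest non-missing adjusted p-values."""
    usable = [r for r in rows if r["padj"] is not None]
    return sorted(usable, key=lambda r: r["padj"])[:n]

def log2_fold_change(stim_values, ctrl_values, pseudocount=1.0):
    """log2 of stimulated mean over control mean; positive means higher in stimulated."""
    mean_stim = sum(stim_values) / len(stim_values)
    mean_ctrl = sum(ctrl_values) / len(ctrl_values)
    return math.log2((mean_stim + pseudocount) / (mean_ctrl + pseudocount))

# Hypothetical results and normalized expression values.
rows = [
    {"gene": "geneA", "padj": 0.03},
    {"gene": "geneB", "padj": None},
    {"gene": "geneC", "padj": 0.0004},
    {"gene": "geneD", "padj": 0.01},
]
top_two = [r["gene"] for r in top_n_by_padj(rows, n=2)]
lfc_up = log2_fold_change([31, 31], [7, 7])
lfc_down = log2_fold_change([7, 7], [31, 31])
```

If a gene plotted higher in the stimulated group has a negative fold change in the table, the contrast was most likely specified in the opposite direction.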

Heatmap of all significant genes

We can also explore the clustering of the significant genes using the heatmap.

Volcano plot of results

