Bulk Transcriptomics

A bulk RNA pipeline, implemented in Nextflow and part of the Online Pipelines Platform (OP²).

Pipeline overview

Our OP² bulk transcriptomics pipeline is a bioinformatics workflow for analyzing bulk RNA sequencing data. It lets you analyze your RNA sequencing data with a gold-standard analysis pipeline, giving you insight into the quality of your data, differential expression of genes between sample groups, and gene enrichment analysis.

The workflow processes raw data from FastQ inputs, aligns the reads, generates gene-level counts and performs extensive quality control on the results. These results are made available to you via two interactive reports and a data package containing all essential intermediate files for more in-depth analysis. The pre-processing workflow takes your raw sequence data through to QC-approved aligned data. The post-processing workflow then lets you review the biological meaning of your data through statistical analysis.

This pipeline uses a standardised DESeq2 analysis script to get an idea of the reproducibility across samples within the experiment. Please note that this script will not suit every experimental design, and other problems with the experiment may also cause it to work less well than expected.

See the pipeline page for a more detailed overview.

Do you have any questions about these results? Just email us at helpdesk@biscglobal.com.

Report info

Generated on: 2021-06-25 21:06
Experiment: 87b199c0-1976-4723-b839-cadda5ff5a04
Pipeline: Bulk Transcriptomics
Report: Post-processing Report
Species: mus_musculus
Species Build: mm10

Metadata

The metadata is collected from the samplesheet.

Sample ID    Group ID
A            A
B            A
C            B
D            B

Sample-level Quality control

High-level quality control of the samples and their behavior based on the count data matrix.

Sample correlation

MDS Plot

A multidimensional scaling (MDS) plot is generated to inspect how samples cluster based on their relative normalization factors.

Mean-variance trend

Dispersion is a measure of spread or variability in the data. The DESeq2 dispersion estimates are inversely related to the mean and directly related to the variance. Based on this relationship, the dispersion is higher for small mean counts and lower for large mean counts. The dispersion estimates for genes with the same mean differ only based on their variance. Therefore, the dispersion estimates reflect the variance in gene expression for a given mean value.

The plot of mean versus variance in count data below shows the variance in gene expression increases with the mean expression (each black dot is a gene). Notice that the relationship between mean and variance is linear on the log scale, and for higher means, we could predict the variance relatively accurately given the mean. However, for low mean counts, the variance estimates have a much larger spread; therefore, the dispersion estimates will differ much more between genes with small means.
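The relationship between mean, variance and dispersion described above can be sketched with the negative binomial variance formula, variance = mean + dispersion × mean². The snippet below is a minimal illustration only: it uses a simple method-of-moments estimate, not DESeq2's actual dispersion fitting, and the gene labels and numbers are hypothetical.

```python
def nb_variance(mean, dispersion):
    """Variance of a negative binomial count: mean + dispersion * mean^2."""
    return mean + dispersion * mean ** 2


def moment_dispersion(mean, variance):
    """Method-of-moments dispersion (a simplification, not DESeq2's estimator).

    Solves variance = mean + dispersion * mean^2 for the dispersion and
    clips negative values to zero.
    """
    if mean <= 0:
        raise ValueError("mean must be positive")
    return max((variance - mean) / mean ** 2, 0.0)


# Two hypothetical genes with the same mean: the one with the larger
# variance gets the larger dispersion.
for label, mean, variance in [("gene_low_var", 10, 60), ("gene_high_var", 10, 110)]:
    print(label, moment_dispersion(mean, variance))

# The same variance at a higher mean gives a much smaller dispersion.
print(moment_dispersion(100, 110))
```

For a fixed variance, the estimate shrinks as the mean grows, which is the inverse relationship to the mean described in the text.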

Differential Gene Expression (DGE) analysis

Differential expression analysis is performed using DESeq2 and is based on the Negative Binomial (a.k.a. Gamma-Poisson) distribution.

Gene-level Quality control

The DESeq2 package also omits genes that have little or no chance of being detected as differentially expressed. This increases the power to detect differentially expressed genes.

The genes omitted fall into three categories:

  • Genes with zero counts in all samples
  • Genes with an extreme count outlier
  • Genes with a low mean normalized count
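As a rough illustration of the first and third categories above, the sketch below flags genes with zero counts everywhere and genes whose mean normalized count falls below a threshold. The gene names, the counts and the fixed `min_mean` threshold are assumptions made for this example; DESeq2 chooses its filtering threshold automatically, and its outlier detection (the second category) is not reproduced here.

```python
def filter_genes(counts, min_mean=1.0):
    """Split genes into kept and dropped sets.

    counts   -- dict mapping gene name to a list of normalized counts,
                one value per sample
    min_mean -- hypothetical fixed threshold on the mean normalized count
    Returns (kept, dropped); dropped maps each gene to the reason.
    """
    kept, dropped = {}, {}
    for gene, values in counts.items():
        if all(v == 0 for v in values):
            dropped[gene] = "zero counts in all samples"
        elif sum(values) / len(values) < min_mean:
            dropped[gene] = "low mean normalized count"
        else:
            kept[gene] = values
    return kept, dropped


# Hypothetical normalized counts for samples A, B, C, D.
example = {
    "geneA": [0, 0, 0, 0],
    "geneB": [1, 0, 0, 1],
    "geneC": [120, 98, 300, 280],
}
kept, dropped = filter_genes(example)
print(sorted(kept))
print(dropped)
```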

This plot shows per-gene dispersion estimates together with the fitted mean-dispersion relationship.

PCA

Principal Component Analysis (PCA) is a technique used to emphasize variation and bring out strong patterns in a dataset (dimensionality reduction). DESeq2 uses a regularized log transform (rlog) of the normalized counts for sample-level QC as it moderates the variance across the mean, improving the clustering. Another technique is the variance stabilizing transformation (vst). Both techniques aim to remove the dependence of the variance on the mean. In particular, genes with a low expression level, and therefore low read counts, tend to have high variance, which is not removed efficiently by the ordinary logarithmic transformation.

The chosen technique to transform normalized counts here: rlog
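To see why an ordinary log transform leaves low-count genes noisy, a first-order (delta method) approximation helps: for a negative binomial count with mean μ and dispersion α, Var(log2 count) ≈ (1/μ + α) / ln(2)². The sketch below evaluates this approximation. It does not implement rlog or vst, the dispersion value is an assumption for illustration, and the approximation itself is crude at very low means.

```python
import math


def log2_count_variance(mean, dispersion):
    """Delta-method approximation to the variance of log2(count).

    Uses Var(X) = mean + dispersion * mean^2 and
    Var(log2 X) ~ Var(X) / (mean * ln 2)^2.
    """
    if mean <= 0:
        raise ValueError("mean must be positive")
    return (1.0 / mean + dispersion) / math.log(2) ** 2


# Hypothetical dispersion of 0.05 shared by all genes.
for mean in (2, 20, 200, 2000):
    print(mean, round(log2_count_variance(mean, 0.05), 4))
```

The approximate variance is dominated by the 1/μ term at low means and levels off near α / ln(2)² at high means, which is the mean dependence that rlog and vst aim to remove.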

Hierarchical clustering/Heatmap

The hierarchical clustering, which uses the same transformation as the PCA, shows the correlation between samples. A high overall correlation suggests that there are no outlying samples. As in the PCA plot, samples are expected to cluster by group ID.
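The sample-to-sample correlations behind such a heatmap can be sketched as follows. This is a minimal example that assumes Pearson correlation on transformed expression values; the report does not state which correlation measure the pipeline uses, and the values below are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(samples):
    """Pairwise correlations; samples maps sample ID to per-gene values."""
    names = list(samples)
    return {a: {b: pearson(samples[a], samples[b]) for b in names} for a in names}


# Hypothetical transformed expression values for five genes.
samples = {
    "A": [5.1, 7.9, 2.2, 9.4, 6.0],
    "B": [5.3, 8.1, 2.0, 9.1, 6.2],
    "C": [6.8, 6.0, 4.5, 7.7, 3.1],
    "D": [6.5, 6.2, 4.8, 7.9, 2.9],
}
for name, row in correlation_matrix(samples).items():
    print(name, {other: round(r, 3) for other, r in row.items()})
```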

Pairwise comparisons

The order of the names determines the direction of the fold change that is reported. The name provided in the second element is the level that is used as the baseline. For example, in a comparison of treatment (first element) vs control (second element), a log2 fold change of -2 means that gene expression is lower in the treatment (first element) than in the control (second element).

The lfc.cutoff is set to 0.58, which corresponds to a fold change of approximately 1.5 (log2(1.5) ≈ 0.585).
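The direction convention and the cutoff can be checked with a few lines of arithmetic. The sketch below computes a plain log2 ratio of two mean expression values; DESeq2 estimates fold changes from its statistical model, so this is only an illustration, and the function names and input values are hypothetical.

```python
import math

LFC_CUTOFF = 0.58  # value of lfc.cutoff stated in this report


def log2_fold_change(first_mean, baseline_mean):
    """log2 of first element over baseline (second element)."""
    return math.log2(first_mean / baseline_mean)


def passes_cutoff(lfc, cutoff=LFC_CUTOFF):
    """True when the absolute log2 fold change exceeds the cutoff."""
    return abs(lfc) > cutoff


# Expression of 25 in the first element vs 100 in the baseline:
# four-fold lower, so the log2 fold change is negative.
print(log2_fold_change(25, 100))

# The cutoff on the linear scale.
print(round(2 ** LFC_CUTOFF, 3))
```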

(Pairwise) UP, DOWN and TOTAL-regulated significant genes

Below is a summary of the number of up-regulated, down-regulated and total significant genes per pairwise comparison.

Comparison    Low    High    Total
A-B           12     6       18

The full table is written to DEG_all.csv.

Top selection of Significant Differentially Expressed Genes

Top selection of significant differentially expressed genes based on their normalized counts.

Volcano plots

Volcano plots of differentially expressed genes in pairwise comparisons, with thresholds of adjusted p-value < 0.05 and fold change > 1.5.
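How these two thresholds divide genes into regions of a volcano plot can be sketched as below. The function names, labels and example values are assumptions made for illustration, not the pipeline's own code; the cutoffs are the ones stated in this report (adjusted p-value < 0.05, absolute log2 fold change > 0.58).

```python
import math


def classify(log2fc, padj, lfc_cutoff=0.58, padj_cutoff=0.05):
    """Label a gene as 'up', 'down' or 'not significant'.

    A missing adjusted p-value (None) is treated as not significant.
    """
    if padj is None or padj >= padj_cutoff or abs(log2fc) <= lfc_cutoff:
        return "not significant"
    return "up" if log2fc > 0 else "down"


def volcano_point(log2fc, padj):
    """Coordinates of a gene on a volcano plot: (log2FC, -log10(padj))."""
    return (log2fc, -math.log10(padj))


# Hypothetical results: gene -> (log2 fold change, adjusted p-value).
results = {
    "geneA": (2.1, 0.001),
    "geneB": (-1.3, 0.01),
    "geneC": (2.1, 0.2),
    "geneD": (0.3, 0.001),
}
for gene, (lfc, padj) in results.items():
    print(gene, classify(lfc, padj), volcano_point(lfc, padj))
```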

Gene Ontology Analysis

Gene ontology analysis is performed using the clusterProfiler package together with the org.Mm.eg.db annotation library.

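Column description:
Ontology: BP for Biological Process, MF for Molecular Function, and CC for Cellular Component
ID: Gene ontology ID
Description: Description of the gene ontology term
Gene ratio: Number of input genes annotated to the term, divided by the number of input genes annotated in the ontology
Background ratio: Number of background genes annotated to the term, divided by the total number of background genes annotated in the ontology
P-value: Enrichment p-value before multiple-testing adjustment
P-value adjusted: P-value adjusted with the Benjamini-Hochberg method (p-value < 0.05)
qvalue: Q-value
geneID: All genes that occur in this ontology (hidden below, available in the csv file)
Count: Number of genes found
Contrast: Contrast in which the genes were found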
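The gene ratio, background ratio, p-value and adjusted p-value columns fit together as in the sketch below, which assumes a standard over-representation (hypergeometric) test followed by Benjamini-Hochberg adjustment. It is a simplified stand-in for what clusterProfiler computes, and the parameter names k, n, M and N are introduced here only for the example (gene ratio = k/n, background ratio = M/N).

```python
from math import comb


def enrichment_pvalue(k, n, M, N):
    """Hypergeometric upper-tail probability P(X >= k).

    k -- input genes annotated to the term
    n -- input genes annotated in the ontology
    M -- background genes annotated to the term
    N -- background genes annotated in the ontology
    """
    total = comb(N, n)
    upper = min(n, M)
    tail = sum(comb(M, i) * comb(N - M, n - i) for i in range(k, upper + 1))
    return tail / total


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical term: 6 of 18 input genes hit a term that covers
# 200 of 15000 background genes.
print(enrichment_pvalue(6, 18, 200, 15000))
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.5]))
```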

Software Versions

## R version 4.0.3 (2020-10-10)
        ## Platform: x86_64-pc-linux-gnu (64-bit)
        ## Running under: Ubuntu 18.04.5 LTS
        ##
        ## Matrix products: default
        ## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
        ## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
        ##
        ## locale:
        ## [1] C
        ##
        ## attached base packages:
        ## [1] parallel  stats4    stats     graphics  grDevices utils     datasets
        ## [8] methods   base
        ##
        ## other attached packages:
        ##  [1] org.Mm.eg.db_3.12.0         org.Hs.eg.db_3.12.0
        ##  [3] AnnotationDbi_1.52.0        reshape_0.8.8
        ##  [5] DT_0.17                     ggrepel_0.9.0
        ##  [7] clusterProfiler_3.18.0      gplots_3.1.1
        ##  [9] DEGreport_1.26.0            pheatmap_1.0.12
        ## [11] RColorBrewer_1.1-2          forcats_0.5.0
        ## [13] stringr_1.4.0               dplyr_1.0.2
        ## [15] purrr_0.3.4                 readr_1.4.0
        ## [17] tidyr_1.1.2                 tibble_3.0.4
        ## [19] ggplot2_3.3.3               tidyverse_1.3.0
        ## [21] DESeq2_1.30.0               SummarizedExperiment_1.20.0
        ## [23] Biobase_2.50.0              MatrixGenerics_1.2.0
        ## [25] matrixStats_0.57.0          GenomicRanges_1.42.0
        ## [27] GenomeInfoDb_1.26.2         IRanges_2.24.1
        ## [29] S4Vectors_0.28.1            BiocGenerics_0.36.0
        ## [31] knitr_1.30                  edgeR_3.32.0
        ## [33] limma_3.46.0                optparse_1.6.6
        ## [35] rmarkdown_2.6
        ##
        ## loaded via a namespace (and not attached):
        ##   [1] tidyselect_1.1.0            RSQLite_2.2.2
        ##   [3] htmlwidgets_1.5.3           grid_4.0.3
        ##   [5] BiocParallel_1.24.1         scatterpie_0.1.5
        ##   [7] munsell_0.5.0               withr_2.3.0
        ##   [9] colorspace_2.0-0            GOSemSim_2.16.1
        ##  [11] highr_0.8                   rstudioapi_0.13
        ##  [13] DOSE_3.16.0                 labeling_0.4.2
        ##  [15] lasso2_1.2-21.1             GenomeInfoDbData_1.2.4
        ##  [17] mixsqp_0.3-43               mnormt_2.0.2
        ##  [19] polyclip_1.10-0             bit64_4.0.5
        ##  [21] farver_2.0.3                downloader_0.4
        ##  [23] vctrs_0.3.6                 generics_0.1.0
        ##  [25] xfun_0.20                   R6_2.5.0
        ##  [27] clue_0.3-58                 graphlayouts_0.7.1
        ##  [29] invgamma_1.1                locfit_1.5-9.4
        ##  [31] bitops_1.0-6                fgsea_1.16.0
        ##  [33] DelayedArray_0.16.0         assertthat_0.2.1
        ##  [35] scales_1.1.1                ggraph_2.0.4
        ##  [37] enrichplot_1.10.1           gtable_0.3.0
        ##  [39] Cairo_1.5-12.2              tidygraph_1.2.0
        ##  [41] rlang_0.4.10                genefilter_1.72.0
        ##  [43] GlobalOptions_0.1.2         splines_4.0.3
        ##  [45] broom_0.7.3                 BiocManager_1.30.10
        ##  [47] yaml_2.2.1                  reshape2_1.4.4
        ##  [49] modelr_0.1.8                crosstalk_1.1.0.1
        ##  [51] backports_1.2.1             qvalue_2.22.0
        ##  [53] tools_4.0.3                 psych_2.0.12
        ##  [55] logging_0.10-108            ellipsis_0.3.1
        ##  [57] ggdendro_0.1.22             Rcpp_1.0.5
        ##  [59] plyr_1.8.6                  zlibbioc_1.36.0
        ##  [61] RCurl_1.98-1.2              ps_1.5.0
        ##  [63] GetoptLong_1.0.5            viridis_0.5.1
        ##  [65] ashr_2.2-47                 cowplot_1.1.1
        ##  [67] haven_2.3.1                 cluster_2.1.0
        ##  [69] fs_1.5.0                    magrittr_2.0.1
        ##  [71] data.table_1.13.6           DO.db_2.9
        ##  [73] circlize_0.4.12             reprex_0.3.0
        ##  [75] truncnorm_1.0-8             tmvnsim_1.0-2
        ##  [77] SQUAREM_2020.5              hms_0.5.3
        ##  [79] evaluate_0.14               xtable_1.8-4
        ##  [81] XML_3.99-0.5                readxl_1.3.1
        ##  [83] gridExtra_2.3               shape_1.4.5
        ##  [85] compiler_4.0.3              KernSmooth_2.23-18
        ##  [87] crayon_1.3.4                shadowtext_0.0.7
        ##  [89] htmltools_0.5.0             geneplotter_1.68.0
        ##  [91] lubridate_1.7.9.2           DBI_1.1.0
        ##  [93] tweenr_1.0.1                dbplyr_2.0.0
        ##  [95] ComplexHeatmap_2.6.2        MASS_7.3-53
        ##  [97] Matrix_1.3-2                getopt_1.20.3
        ##  [99] cli_2.2.0                   igraph_1.2.6
        ## [101] pkgconfig_2.0.3             rvcheck_0.1.8
        ## [103] xml2_1.3.2                  annotate_1.68.0
        ## [105] XVector_0.30.0              rvest_0.3.6
        ## [107] digest_0.6.27               ConsensusClusterPlus_1.54.0
        ## [109] cellranger_1.1.0            fastmatch_1.1-0
        ## [111] gtools_3.8.2                rjson_0.2.20
        ## [113] lifecycle_0.2.0             nlme_3.1-151
        ## [115] jsonlite_1.7.2              viridisLite_0.3.0
        ## [117] fansi_0.4.1                 pillar_1.4.7
        ## [119] lattice_0.20-41             Nozzle.R1_1.1-1
        ## [121] httr_1.4.2                  survival_3.2-7
        ## [123] GO.db_3.12.1                glue_1.4.2
        ## [125] png_0.1-7                   bit_4.0.4
        ## [127] ggforce_0.3.2               stringi_1.5.3
        ## [129] blob_1.2.1                  caTools_1.18.0
        ## [131] memoise_1.1.0               irlba_2.3.3