Metadata

Below is an overview of the metadata of the count tables and samples. The high-quality count tables are produced and filtered by STARsolo. The Filename column gives the prefix of the uploaded sample files; it is included in the table because it may differ from the sample name.

Sample	Group ID	Filename
pbmc10k	healthy	pbmc_10k_v3_S1

Quality Control

Low-quality libraries from different cells can cluster together because of similarities in their damage-induced expression profiles. STARsolo does not remove these low-quality libraries from the filtered dataset, so quality control relies on three main metrics:
- Number of UMIs: check for cells with low total counts
- Number of genes: check for cells with few detected genes
- Percentage of mitochondrial counts: cells with a high percentage can form their own distinct clusters
QC plots with cell density are created instead of the usual violin plots from the Seurat package, as these are more intuitive to understand. To identify cells that are outliers for the various QC metrics, the pipeline uses the median absolute deviation (MAD) from the median value of each metric across all cells. Specifically, a value is considered an outlier if it is more than 3 MADs from the median in the “problematic” direction. This is loosely motivated by the fact that such a filter retains 99% of non-outlier values that follow a normal distribution.
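The MAD rule described above can be sketched in a few lines of plain Python. This is only an illustration, not the pipeline's own code: the function name `mad_outliers`, the example values, and the 1.4826 scaling constant (which makes the MAD comparable to a standard deviation for normally distributed data) are assumptions.

```python
from statistics import median

def mad_outliers(values, nmads=3.0, direction="lower"):
    """Flag values more than `nmads` MADs from the median in one direction."""
    med = median(values)
    # Assumption of this sketch: scale the raw MAD by 1.4826 so that it
    # estimates the standard deviation under a normal distribution.
    mad = 1.4826 * median(abs(v - med) for v in values)
    if direction == "lower":
        return [v < med - nmads * mad for v in values]
    if direction == "higher":
        return [v > med + nmads * mad for v in values]
    raise ValueError("direction must be 'lower' or 'higher'")

# Hypothetical log10 UMI counts per cell; the last cell has very few counts.
log10_numi = [3.9, 4.0, 4.1, 4.0, 3.8, 4.2, 4.0, 2.1]
low_quality = mad_outliers(log10_numi, direction="lower")
```

For UMI and gene counts the “problematic” direction is `"lower"`; for the mitochondrial percentage it would be `"higher"`.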

The count tables of each sample are converted to a Seurat object [1]. Filtering is based on the MAD (median absolute deviation), with a default threshold of 3 MADs.

QC metrics (log10nUMI, log10nGene, percentage mito) have been calculated and added to the metadata; plots for up to 5 samples are shown here. The remaining plots can be found in the QC folder within the results folder.

The UMI count per cell should generally be above 500, which is the low end of what we expect. Cells with between 500 and 1000 UMI counts are usable, but probably should have been sequenced more deeply.

Note: Applying the MAD assumes that all batches are of high quality. Combining samples from multiple batches can distort the MAD: if sequencing coverage is lower in one batch, it drags down the median and inflates the MAD, which makes the adaptive threshold less suitable for the other batches.

Cell Cluster Annotation and Identity

Normalization focuses on removing technical biases, so that any differences between expression profiles are due to biological differences. For scRNA-seq, library size normalization is sufficient when the aim is to identify clusters and the top markers that define each cluster.
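As a rough illustration of library size normalization, the sketch below divides each gene count by the cell's total count, multiplies by a scale factor, and log-transforms the result. The function name `log_normalize`, the scale factor of 10,000, and the example counts are assumptions made for this sketch; the report does not state which settings the pipeline uses.

```python
import math

def log_normalize(counts, scale_factor=10_000):
    """Library-size normalize one cell's gene counts, then apply log(1 + x)."""
    total = sum(counts)
    if total == 0:
        # A cell without any counts carries no information.
        return [0.0 for _ in counts]
    return [math.log1p(c / total * scale_factor) for c in counts]

# Two hypothetical cells with the same composition but a 10x depth difference.
cell_a = [10, 0, 30, 60]
cell_b = [100, 0, 300, 600]
norm_a = log_normalize(cell_a)
norm_b = log_normalize(cell_b)
```

After normalization the two cells have matching profiles, which shows that the difference in sequencing depth has been removed.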

During feature selection we want to select genes that contain useful information about the biology of the system while removing genes that contain noise.

The simplest approach to feature selection is to select the most variable genes based on their expression across the population.
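A minimal sketch of this simplest approach ranks genes by the variance of their expression across cells and keeps the top ones. The gene names, the values, and the helper `top_variable_genes` are hypothetical, and methods used in practice often also model the relationship between mean and variance, which this sketch does not do.

```python
from statistics import variance

def top_variable_genes(expr, n_top=2):
    """Return the `n_top` genes with the highest variance across cells.

    `expr` maps a gene name to its normalized expression value in each cell.
    """
    ranked = sorted(expr, key=lambda gene: variance(expr[gene]), reverse=True)
    return ranked[:n_top]

# Hypothetical normalized expression of four genes in four cells.
expr = {
    "GeneA": [0.0, 0.1, 0.0, 0.1],
    "GeneB": [0.0, 3.0, 0.2, 2.8],
    "GeneC": [1.0, 1.0, 1.1, 0.9],
    "GeneD": [0.0, 0.0, 2.0, 0.1],
}
selected = top_variable_genes(expr, n_top=2)
```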

Integration of Seurat objects applies only when there are multiple datasets, because it has the following goals:
- Identify cell types that are present in all datasets
- Obtain cell type markers that are conserved across the conditions indicated in the samplesheet
- Compare the datasets to find cell-type specific responses to treated conditions

Next, we apply a linear transformation (‘scaling’) that is a standard pre-processing step prior to dimensional reduction techniques like PCA. The ScaleData function shifts the expression of each gene, so that the mean expression across cells is 0. It also scales the expression of each gene, so that the variance across cells is 1. This step gives each gene equal weight in downstream analyses, so that highly expressed genes do not dominate.
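The effect of this transformation on a single gene can be sketched as a z-score. This is an illustration of the idea in plain Python, not the ScaleData implementation; the helper name `scale_gene` and the example values are assumptions.

```python
from statistics import mean, stdev

def scale_gene(values):
    """Center one gene to mean 0 and scale it to variance 1 across cells."""
    mu = mean(values)
    sd = stdev(values)
    if sd == 0:
        # A gene with constant expression has nothing to scale.
        return [0.0 for _ in values]
    return [(v - mu) / sd for v in values]

# Hypothetical expression of a highly and a lowly expressed gene.
high_gene = scale_gene([100.0, 120.0, 80.0, 100.0])
low_gene = scale_gene([1.0, 1.2, 0.8, 1.0])
```

Both genes end up on the same scale, although their raw expression levels differ by a factor of 100.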

Principal components analysis (PCA) discovers axes in high-dimensional space that capture the largest amount of variation. When applying PCA to scRNA-seq data, our assumption is that biological processes affect multiple genes in a coordinated manner. This means that the earlier PCs are likely to represent biological structure as more variation can be captured by considering the correlated behavior of many genes. By comparison, random technical or biological noise is expected to affect each gene independently. There is unlikely to be an axis that can capture random variation across many genes, meaning that noise should mostly be concentrated in the later PCs. This motivates the use of the earlier PCs in our downstream analyses, which concentrates the biological signal to simultaneously reduce computational work and remove noise.

To overcome the extensive technical noise in any single feature for scRNA-seq data, Seurat clusters cells based on their PCA scores, with each PC essentially representing a ‘metafeature’ that combines information across a correlated feature set. The top principal components therefore represent a robust compression of the dataset. The 'dimensionality' of the dataset can be determined with an ElbowPlot, where the cutoff is the point at which the curve flattens out. Because the pipeline is automated, the number of dimensions is pre-determined rather than chosen per dataset, and clustering uses a fixed resolution (resolution = 0.8).
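Reading an elbow plot is normally done by eye. One possible way to automate it, shown here purely as a hypothetical heuristic and not as what the pipeline does, is to keep PCs until the drop in standard deviation to the next PC becomes negligible. The function `elbow_cutoff`, the `min_drop` tolerance, and the example values are all assumptions.

```python
def elbow_cutoff(stdevs, min_drop=0.05):
    """Return how many leading PCs to keep.

    `stdevs` holds the standard deviation of each PC in decreasing order.
    PCs are kept until the drop to the next PC, relative to the first PC,
    falls below `min_drop`.
    """
    for i in range(len(stdevs) - 1):
        drop = (stdevs[i] - stdevs[i + 1]) / stdevs[0]
        if drop < min_drop:
            return i + 1
    return len(stdevs)

# Hypothetical PC standard deviations that flatten out after the fourth PC.
pc_stdevs = [10.0, 7.0, 5.0, 3.5, 3.4, 3.35, 3.3]
n_pcs = elbow_cutoff(pc_stdevs)
```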

Seurat offers several non-linear dimensional reduction techniques, such as tSNE and UMAP, to visualize and explore these datasets. The goal of these algorithms is to learn the underlying manifold of the data in order to place similar cells together in low-dimensional space. Cells within the graph-based clusters determined above should co-localize on these dimension reduction plots. As input to the UMAP and tSNE, we suggest using the same PCs as input to the clustering analysis. With UMAP, it should be possible to interpret both the positions of points and clusters and the distances between them. Therefore, UMAP is plotted.

UMAP is run on the first 10 PCs. The SNN (Shared Nearest Neighbor) graph is built on a dimensionally reduced form (first 20 PCs) of the data. The default setting (resolution = 0.8) was used to determine clusters. This resulted in 20 clusters.

After celltype annotation, there are 6 clusters: CD4+ T-cells, Monocytes, B-cells, CD8+ T-cells, NK cells, and HSC. Annotations were automatically assigned using SingleR [2] and AnnotationHub [3].

Differential Analysis

If there are more than three cells in each celltype per condition (group_id), then differential testing is performed.

First, the combinations for pairwise comparison are made. These are based on the column group_id if there are 2 or more conditions: for every celltype, each pair of conditions is compared, e.g. control vs treatment in celltype A. If there is only one condition, the celltypes themselves are compared, e.g. celltype A vs celltype B, celltype B vs celltype C, and celltype A vs celltype C.
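The logic for building the comparisons can be sketched as follows. The function name `build_comparisons` and the tuple layout are assumptions made for illustration; only the column name group_id comes from the report.

```python
from itertools import combinations

def build_comparisons(group_ids, celltypes):
    """Build the pairwise comparisons as (celltype, first, second) tuples.

    With two or more conditions, every pair of conditions is compared
    within each celltype. With a single condition, celltypes are compared
    with each other and the first tuple element is None.
    """
    conditions = sorted(set(group_ids))
    types = sorted(set(celltypes))
    if len(conditions) >= 2:
        return [(ct, a, b) for ct in types for a, b in combinations(conditions, 2)]
    return [(None, a, b) for a, b in combinations(types, 2)]

# One condition: celltypes are compared with each other.
single = build_comparisons(["healthy"], ["A", "B", "C"])
# Two conditions: control vs treatment within every celltype.
paired = build_comparisons(["control", "treatment"], ["A", "B"])
```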

FindMarkers finds markers between two different identity groups, which are specified in the combinations. This is useful for comparing the differences between two specific groups. Default settings are used (log fold-change threshold = 0.25, minimum percentage = 0.1), with the Wilcoxon rank-sum test (wilcox) as the test.
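The two thresholds act as a pre-filter that decides which genes are tested at all. The sketch below illustrates the idea in plain Python and is not the FindMarkers implementation: the helper `passes_prefilter`, the pseudocount of 1, and the use of log2 are assumptions, and the exact fold-change calculation in Seurat differs between versions.

```python
import math

def passes_prefilter(group1, group2, logfc_threshold=0.25, min_pct=0.1):
    """Decide whether a gene is worth testing between two groups of cells."""
    # Fraction of cells in each group in which the gene is detected.
    pct1 = sum(v > 0 for v in group1) / len(group1)
    pct2 = sum(v > 0 for v in group2) / len(group2)
    # Assumed fold-change definition: log2 of the group means plus 1.
    mean1 = sum(group1) / len(group1)
    mean2 = sum(group2) / len(group2)
    log_fc = math.log2(mean1 + 1) - math.log2(mean2 + 1)
    detected = max(pct1, pct2) >= min_pct
    return detected and abs(log_fc) >= logfc_threshold

# Hypothetical gene that is clearly higher in the first group.
marker_like = passes_prefilter([0.0, 2.0, 3.0, 4.0], [0.0, 0.0, 1.0, 0.0])
# Hypothetical gene detected in only 1 of 20 cells in either group.
rarely_detected = passes_prefilter([0.0] * 19 + [5.0], [0.0] * 20)
```

Genes that pass would then go on to the statistical test; the others are skipped, which saves computation.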

Differentially Expressed Genes are annotated using the species gene database annotation package.

Differential Analysis Table

Below are all the differentially expressed genes found. A filtered subset can be created by entering the preferred values in the filters at the top of the table, and the result can be saved in Excel or CSV format.

Volcano Plot

To show statistical significance (P value) versus magnitude of change (fold change), volcano plots are created per pairwise comparison or celltype. Up to 5 comparisons are plotted here. The remaining plots can be found in the Volcano_plots folder. Only plots that pass the significance filter (adjusted p-value < 0.05) are created here.
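The coordinates of a single gene in a volcano plot can be sketched as below. The helper `volcano_point`, the labels, and the floor applied to very small p-values are assumptions for illustration; only the 0.05 cutoff on the adjusted p-value comes from the report.

```python
import math

def volcano_point(log_fc, p_adj, p_cutoff=0.05):
    """Return (x, y, label) for one gene in a volcano plot."""
    # Assumption: floor the p-value so that -log10 stays finite for p = 0.
    y = -math.log10(max(p_adj, 1e-300))
    if p_adj < p_cutoff:
        label = "up" if log_fc > 0 else "down"
    else:
        label = "not significant"
    return log_fc, y, label

# Hypothetical genes: strongly up, mildly down, and not significant.
points = [
    volcano_point(1.5, 0.001),
    volcano_point(-0.8, 0.01),
    volcano_point(0.3, 0.4),
]
labels = [label for _, _, label in points]
```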

Gene Ontology Table

Gene ontology analysis is performed using clusterProfiler [4] with the annotation library org.Hs.eg.db.

Column description:

Ontology: BP for Biological Process, MF for Molecular Function, and CC for Cellular Component
ID: Gene ontology ID
Description: Description of gene ontology
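Gene ratio: Number of input genes annotated to this ontology term, divided by the number of input genes with any annotation
Background ratio: Number of background genes annotated to this ontology term, divided by the total number of background genes
P-value: P-value of the enrichment test
P-value adjusted: P-value adjusted with the Benjamini-Hochberg method (cutoff < 0.05)
qvalue: Q-value
geneID: All genes that occur in this ontology
Count: Number of genes found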
Comparison: In which comparison (group id) it occurs
Celltype: Celltype indication
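To make the ratio and p-value columns more concrete, the sketch below computes an over-representation p-value with a hypergeometric test and adjusts a list of p-values with the Benjamini-Hochberg method. It is a plain-Python illustration under stated assumptions, not the clusterProfiler code: the function names and the example numbers are hypothetical. Here `k` and `n` are the numerator and denominator of the gene ratio, and `M` and `N` those of the background ratio.

```python
from math import comb

def enrichment_pvalue(k, n, M, N):
    """Probability of seeing at least k term genes among n input genes,
    when the term annotates M of the N background genes."""
    hits = sum(comb(M, i) * comb(N - M, n - i) for i in range(k, min(n, M) + 1))
    return hits / comb(N, n)

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical term: 5 of 50 input genes hit a term covering 100 of 10000 genes.
p_term = enrichment_pvalue(k=5, n=50, M=100, N=10000)
# Hypothetical raw p-values for four terms.
p_adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

Terms whose adjusted p-value stays below 0.05 would be the ones reported in the table.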

Software Versions

References

1. Hao Y, Hao S, Andersen-Nissen E, Mauck WM III, Zheng S, Butler A, et al. Integrated analysis of multimodal single-cell data. bioRxiv. 2020;2020.10.12.335331. doi:10.1101/2020.10.12.335331.

2. Aran D. SingleR: Reference-based single-cell RNA-seq annotation. 2020. https://github.com/LTLA/SingleR.

3. Morgan M, Shepherd L. AnnotationHub: Client to access AnnotationHub resources. 2020.

4. Yu G. clusterProfiler: Statistical analysis and visualization of functional profiles for genes and gene clusters. 2020. https://guangchuangyu.github.io/software/clusterProfiler.

Single-cell Transcriptomics

A Single Cell RNA pipeline, implemented in Nextflow and part of the Online Pipelines Platform (OP²).
