KC-example - scCancer

G-Lab@THU, 2020-03-02 18:55:17

1 Cell statistics

The input of scCancer pipeline is the matrix generated by Cell Ranger V3.
Here is the summary report from Cell Ranger.

1.1 Cell calling

The number of droplets containing UMI (nUMI > 0) is 434012.
Using the supplied cell calling results (filtered data), 10227 cells are identified (min.nUMI = 500).
The following two plots show the distribution of nUMI for the identified cells and empty droplets.

plot of chunk nUMI

1.2 The number of UMIs and detected genes in cells

After cell calling by Cell Ranger V3, we perform further quality control to filter out droplets containing low-quality cells according to nUMI (total number of UMIs) and nGene (total number of detected genes).

For nUMI :

Suggested threshold to filter cells with extremely large nUMI : 12496.
- Using this threshold, 351 cells will be filtered.

For nGene :

Suggested threshold to filter cells with extremely large nGene : 3268.
- Using this threshold, 128 cells will be filtered.
Suggested threshold to filter cells with extremely small nGene : 200.
- Using this threshold, 48 cells will be filtered.

Comment: The suggested thresholds (except the lower bound of nGene, which is set by convention) are determined from the distributions of the corresponding metrics. Cells identified as outliers by these thresholds will be filtered. The same applies to the sections below.
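
The report does not state the exact rule used to derive the thresholds from the distributions. As an illustration only, the sketch below uses one common outlier rule (median plus a multiple of the scaled median absolute deviation); the function name, the multiplier, and the toy values are assumptions, not scCancer's implementation.

```python
from statistics import median


def upper_outlier_threshold(values, n_mads=3.0):
    """Return median + n_mads * scaled MAD as an upper cutoff (illustrative rule)."""
    values = list(values)
    med = median(values)
    mad = median(abs(v - med) for v in values)
    # 1.4826 rescales the MAD so it is comparable to a standard deviation
    # when the data are roughly normal.
    return med + n_mads * 1.4826 * mad


# Toy nUMI values (hypothetical, not taken from this sample).
n_umi = [900, 1000, 1100, 1200, 1300, 9000]
cutoff = upper_outlier_threshold(n_umi)
kept = [v for v in n_umi if v < cutoff]  # the 9000-UMI droplet is dropped
```

A robust statistic such as the MAD is a natural choice here because the outliers being removed would otherwise inflate a mean or standard deviation.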

plot of chunk filter

2 Gene statistics

The number of genes expressed in at least one cell : 21199.

2.1 Mitochondrial genes

Summary of mitochondrial genes percentage (mito.percent) in cells:

##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## "0.000721" "0.049074" "0.084407" "0.095748" "0.120959" "0.970637"

Suggested threshold to filter cells with high mitochondrial genes percentage : 0.229.
- Using this threshold, 334 cells will be filtered.
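
As a minimal sketch of how a per-cell gene-set fraction such as mito.percent can be computed: the fraction is the number of UMIs from the gene set divided by the cell's total UMIs. The helper name, the toy counts, and the "MT-" prefix convention for human mitochondrial genes are assumptions made for illustration.

```python
def gene_set_fraction(cell_counts, gene_set):
    """Fraction of a cell's UMIs that come from genes in gene_set."""
    total = sum(cell_counts.values())
    if total == 0:
        return 0.0
    in_set = sum(c for g, c in cell_counts.items() if g in gene_set)
    return in_set / total


# Toy cell (hypothetical counts): 20 of 100 UMIs are mitochondrial.
cell = {"MT-CO1": 12, "MT-ND1": 8, "ACTB": 50, "GAPDH": 30}
# Assumption: mitochondrial genes are recognised by the "MT-" prefix.
mito_genes = {g for g in cell if g.startswith("MT-")}
mito_percent = gene_set_fraction(cell, mito_genes)
flagged = mito_percent > 0.229  # suggested threshold from this report
```

The same helper would apply to ribo.percent and diss.percent with a different gene set.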

plot of chunk mito

2.2 Ribosome genes

Summary of ribosome genes percentage (ribo.percent) in cells:

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## "0.00506" "0.18533" "0.22266" "0.22026" "0.25968" "0.52744"

Suggested threshold to filter cells with high ribosome genes percentage : 0.371.
- Using this threshold, 58 cells will be filtered.

plot of chunk ribo

2.3 Dissociation associated genes

Summary of dissociation associated genes percentage (diss.percent) in cells:

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## "0.0000" "0.0244" "0.0294" "0.0318" "0.0367" "0.1760"

Suggested threshold to filter cells with high dissociation genes percentage : 0.055.
- Using this threshold, 486 cells will be filtered.

plot of chunk diss

2.4 Ambient RNAs

2.4.1 Highly-expressed genes

In order to analyze the gene expression profiles in detail and identify
highly-expressed genes in background mRNAs from lysed cells, we calculate some metrics as shown below.

bg.percent : the expression proportion for each gene in background distribution (all droplets with nUMI <= 10).
prop.median : the median of expression proportions for a gene in each cell.
detect.rate : the detected (#UMI > 0) rate for a gene in all cells.
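
The three metrics defined above can be sketched as follows. This is a toy illustration on plain dictionaries; the function name, the data layout, and the example counts are assumptions, and it expects at least one cell, each with nUMI > 0.

```python
from statistics import median


def gene_metrics(droplets, cell_ids, bg_max_umi=10):
    """Compute bg.percent, prop.median and detect.rate for every gene.

    droplets: dict mapping droplet id -> {gene: UMI count}
    cell_ids: ids of droplets called as cells
    """
    genes = sorted({g for counts in droplets.values() for g in counts})
    # Background = all droplets with nUMI <= bg_max_umi, pooled together.
    background = [c for c in droplets.values() if sum(c.values()) <= bg_max_umi]
    bg_total = sum(sum(c.values()) for c in background)
    cells = [droplets[i] for i in cell_ids]

    metrics = {}
    for g in genes:
        bg_count = sum(c.get(g, 0) for c in background)
        # Proportion of each cell's UMIs assigned to this gene.
        props = [c.get(g, 0) / sum(c.values()) for c in cells]
        metrics[g] = {
            "bg.percent": bg_count / bg_total if bg_total else 0.0,
            "prop.median": median(props),
            "detect.rate": sum(p > 0 for p in props) / len(props),
        }
    return metrics


# Toy data (hypothetical): two near-empty droplets and three cells.
droplets = {
    "d1": {"HBB": 3, "ACTB": 1},
    "d2": {"HBB": 4, "MALAT1": 2},
    "c1": {"HBB": 10, "ACTB": 40, "MALAT1": 50},
    "c2": {"ACTB": 30, "MALAT1": 70},
    "c3": {"HBB": 20, "ACTB": 80},
}
metrics = gene_metrics(droplets, ["c1", "c2", "c3"])
```

In this toy example HBB makes up 70% of the background but has a median proportion of only 10% in cells, which is the kind of pattern that suggests a gene originates largely from ambient RNA.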

Here is a plot showing the distributions of gene proportions in cells for the first 100 genes (ordered by their proportion in the background, bg.percent). The points (genes) are colored according to whether they belong to the mitochondrial, ribosome, or dissociation associated genes. The red stars mark each gene's proportion in the background.

plot of chunk genePropPlot

The plots below show the relationship between bg.percent and prop.median, and between bg.percent and detect.rate.

plot of chunk gene.plot

2.4.2 Ambient RNAs contamination fraction estimation

We refer to the algorithm of SoupX to estimate the contamination fraction of ambient RNAs from lysed cells.

Here is the plot from SoupX, which visualises the log10 ratio of the observed expression counts to the counts expected if the cell were pure background. The algorithm guesses which cells definitely express each gene and estimates the contamination fraction (red lines) using just that gene (i.e., assuming the same contamination for all cells).

plot of chunk soupX

Note: SoupX emphasizes that the genes in the plot are chosen heuristically and are only meant to help develop biological intuition. The list absolutely must not be used to automatically select the top N genes, which may over-estimate the contamination fraction!

By default, we provide three gene sets (immunoglobulin, haemoglobin, and MHC genes) according to the characteristics of the cancer microenvironment.

Using the user's input or the default gene sets, the following genes are used to estimate the contamination fraction.

## $igGenes
## [1] "IGLL5"
## 
## $HLAGenes
## [1] "HLA-DRA"  "HLA-DRB1" "HLA-DQA1" "HLA-DQB1" "HLA-DPA1" "HLA-DPB1"

The estimated contamination fraction is 8.04%. Picking the right genes, which must be specific to a single cell population, is absolutely vital for the accuracy of the estimated contamination fraction. The fraction calculated here is therefore for reference only, especially when only the default gene sets are used without considering sample-specific features.
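
The idea behind this estimate can be sketched in a few lines: for cells assumed not to express a gene set, any counts observed for that set are attributed to ambient RNA, so the contamination fraction is the ratio of observed counts to the counts expected if those cells were pure background. This is a simplified illustration, not SoupX's actual code or API; the function name, the background fractions, and the toy cells are all hypothetical.

```python
def estimate_contamination(cells, soup_fraction, non_expressing):
    """Ratio of observed to expected-if-pure-background counts.

    cells: dict mapping cell id -> {gene: UMI count}
    soup_fraction: {gene: fraction of background UMIs from that gene}
    non_expressing: ids of cells assumed NOT to express these genes
    """
    set_fraction = sum(soup_fraction.values())
    observed = 0.0
    expected = 0.0
    for cell_id in non_expressing:
        counts = cells[cell_id]
        n_umi = sum(counts.values())
        observed += sum(counts.get(g, 0) for g in soup_fraction)
        # Counts expected if every UMI in this cell came from the background.
        expected += n_umi * set_fraction
    return observed / expected


# Toy data (hypothetical): two T cells that should not express these genes.
soup_fraction = {"HLA-DRA": 0.02, "IGLL5": 0.03}
cells = {
    "t1": {"HLA-DRA": 2, "IGLL5": 2, "CD3E": 996},
    "t2": {"HLA-DRA": 1, "IGLL5": 3, "CD3E": 996},
}
rho = estimate_contamination(cells, soup_fraction, ["t1", "t2"])
```

The sketch makes the sensitivity to gene choice concrete: if some of the "non-expressing" cells do in fact express the genes, the observed counts rise and the contamination fraction is over-estimated.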

3 Output

3.1 Thresholds to filter droplets

According to the statistics and visualizations above, we propose the following thresholds to filter cells:

Index	Low.threshold	High.threshold
nUMI	0	12496.000
nGene	200	3268.000
mito.percent	-Inf	0.229
ribo.percent	-Inf	0.371
diss.percent	-Inf	0.055

Hint: In general, Cell Ranger can filter out droplets with low nUMI, so here we set the Low.threshold for nUMI to 0. Users need to either use the cell calling results of Cell Ranger or first set a suitable threshold to filter out possible empty droplets with fewer UMIs.

Using these thresholds, the number of cells changes as follows:

Raw : 434012 -> cellranger3 : 10227 -> nUMI<12496 : 9876 -> nGene>=200 : 9828 -> nGene<3268 : 9806 -> mito.percent<0.229 : 9507 -> ribo.percent<0.371 : 9454 -> diss.percent<0.055 : 8980
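
The sequential filtering above can be sketched as follows. The thresholds are the ones from the table in this report; the function name, the data layout, and the toy cells are assumptions. Unlike the chain above, the sketch applies the lower and upper nGene bounds in a single step.

```python
# (index, Low.threshold, High.threshold) as listed in the table above.
THRESHOLDS = [
    ("nUMI", 0, 12496.0),
    ("nGene", 200, 3268.0),
    ("mito.percent", float("-inf"), 0.229),
    ("ribo.percent", float("-inf"), 0.371),
    ("diss.percent", float("-inf"), 0.055),
]


def filter_steps(cells, thresholds=THRESHOLDS):
    """Apply each threshold in turn and record how many cells remain."""
    steps = [("input", len(cells))]
    kept = list(cells)
    for index, low, high in thresholds:
        # Lower bound inclusive, upper bound exclusive (e.g. nGene>=200, nGene<3268).
        kept = [c for c in kept if low <= c[index] < high]
        steps.append((index, len(kept)))
    return steps


def make_cell(n_umi, n_gene, mito, ribo, diss):
    return {"nUMI": n_umi, "nGene": n_gene, "mito.percent": mito,
            "ribo.percent": ribo, "diss.percent": diss}


# Toy cells (hypothetical values, not from this sample).
toy_cells = [
    make_cell(5000, 1500, 0.08, 0.22, 0.03),   # passes every filter
    make_cell(20000, 3000, 0.08, 0.22, 0.03),  # nUMI too large
    make_cell(800, 150, 0.08, 0.22, 0.03),     # too few detected genes
    make_cell(3000, 1000, 0.50, 0.22, 0.03),   # high mitochondrial fraction
    make_cell(4000, 1200, 0.08, 0.20, 0.09),   # high dissociation fraction
]
steps = filter_steps(toy_cells)
```

Because the filters are applied in sequence, each step only removes cells that survived the earlier ones. This is why the upper nGene bound removes 22 cells in the chain above (9828 -> 9806) rather than the 128 it would remove on its own.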

3.2 Output files

Running this script generates the following files:

HTML report : report-scStat.html.
Markdown report : report-scStat.md.
Figure files : figures/.
Figures used in the report: report-figures/.
Text file with cell manifest : cellManifest-all.txt.
Text file with suggested thresholds as above : cell.QC.thres.txt.
Text file with gene manifest : geneManifest.txt.
Text file with the results of SoupX : ambientRNA-SoupX.txt.
Cell Ranger HTML report (copied from the source data folder) : report-cellRanger.html.