G-Lab@THU , 2020-03-02 18:55:17
The input of the scCancer pipeline is the matrix generated by Cell Ranger V3.
Here is the summary report from Cell Ranger.
The number of droplets containing UMIs (nUMI > 0) is 434012.
Using the supplied cell calling results (filtered data), 10227 cells are identified (min.nUMI = 500).
The following two plots show the distribution of nUMI for the cells and for the droplets identified as empty.
After the cell calling by Cell Ranger V3, we further perform quality control to filter out droplets containing low-quality cells according to nUMI (total number of UMIs) and nGene (total number of detected genes).
For nUMI: the suggested upper threshold is 12496. 351 cells will be filtered.
For nGene: the suggested upper threshold is 3268 (128 cells will be filtered) and the lower threshold is 200 (48 cells will be filtered).
Comment: The suggested thresholds (except the lower bound of nGene, which is set by convention) are determined based on their distributions. Cells identified as outliers under these thresholds will be filtered. The same applies below.
The number of genes expressed in at least one cell: 21199.
Summary of the mitochondrial gene percentage (mito.percent) in cells:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## "0.000721" "0.049074" "0.084407" "0.095748" "0.120959" "0.970637"
The suggested upper threshold for mito.percent is 0.229. 334 cells will be filtered.
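As an illustration of how a per-cell gene-set proportion such as mito.percent can be obtained, here is a minimal Python sketch. It is not the scCancer implementation: the toy counts, the `gene_set_fraction` helper, and the rule that mitochondrial genes are those whose symbol starts with "MT-" are all assumptions made for the example. Only the 0.229 cut-off comes from this report.

```python
def gene_set_fraction(cell_counts, gene_set):
    """Return the fraction of a cell's UMIs that come from genes in gene_set."""
    total = sum(cell_counts.values())
    if total == 0:
        return 0.0
    hits = sum(n for gene, n in cell_counts.items() if gene in gene_set)
    return hits / total


# Toy data (hypothetical): per-cell UMI counts keyed by gene symbol.
cells = {
    "cell_1": {"MT-CO1": 5, "MT-ND1": 5, "ACTB": 90},
    "cell_2": {"MT-CO1": 30, "ACTB": 70},
    "cell_3": {"MT-ND1": 1, "ACTB": 49, "GAPDH": 50},
}

# Assumption: mitochondrial genes are identified by the "MT-" symbol prefix.
mito_genes = {g for counts in cells.values() for g in counts if g.startswith("MT-")}

mito_percent = {cid: gene_set_fraction(c, mito_genes) for cid, c in cells.items()}

# 0.229 is the mito.percent threshold suggested in this report;
# cells at or above it are flagged for removal.
high_threshold = 0.229
flagged = sorted(cid for cid, p in mito_percent.items() if p >= high_threshold)
print(mito_percent, flagged)
```

The same helper would apply to ribo.percent and diss.percent by swapping in the corresponding gene set.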
Summary of the ribosomal gene percentage (ribo.percent) in cells:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## "0.00506" "0.18533" "0.22266" "0.22026" "0.25968" "0.52744"
The suggested upper threshold for ribo.percent is 0.371. 58 cells will be filtered.
Summary of the dissociation-associated gene percentage (diss.percent) in cells:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## "0.0000" "0.0244" "0.0294" "0.0318" "0.0367" "0.1760"
The suggested upper threshold for diss.percent is 0.055. 486 cells will be filtered.
In order to analyze the gene expression profiles in detail and identify
highly-expressed genes in background mRNAs from lysed cells,
we calculate some metrics as shown below.
bg.percent: the expression proportion of each gene in the background distribution (all droplets with nUMI <= 10).
prop.median: the median of a gene's expression proportions across cells.
detect.rate: the detection rate (#UMI > 0) of a gene across all cells.
Here is a plot showing the distributions of gene proportions in cells for the first 100 genes (ordered by their proportion in the background, bg.percent). The points (genes) are colored according to whether they belong to the mitochondrial, ribosomal, or dissociation-associated genes.
The red stars mark the genes' proportions in the background.
The plot below shows the relationship between bg.percent and prop.median, and between bg.percent and detect.rate.
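The three metrics defined above can be sketched in a few lines of Python. This is an illustrative reading of the definitions, not the pipeline's own code: the function names, the dictionary-based count representation, and the toy droplets are assumptions, and every cell is assumed to have nUMI > 0.

```python
from statistics import median


def background_percent(droplets, max_numi=10):
    """bg.percent: each gene's share of the UMIs pooled over droplets with nUMI <= max_numi."""
    pooled = {}
    for counts in droplets:
        if sum(counts.values()) <= max_numi:
            for gene, n in counts.items():
                pooled[gene] = pooled.get(gene, 0) + n
    total = sum(pooled.values())
    return {g: n / total for g, n in pooled.items()} if total else {}


def prop_median(cells, gene):
    """prop.median: median over cells of the gene's proportion of each cell's UMIs."""
    return median(c.get(gene, 0) / sum(c.values()) for c in cells)


def detect_rate(cells, gene):
    """detect.rate: fraction of cells in which the gene has at least one UMI."""
    return sum(1 for c in cells if c.get(gene, 0) > 0) / len(cells)


# Toy data (hypothetical): two near-empty droplets and three called cells.
empty = [{"HBB": 2, "ACTB": 2}, {"HBB": 4, "IGKC": 2}]
cells = [{"ACTB": 50, "IGKC": 50}, {"ACTB": 80, "HBB": 20}, {"ACTB": 100}]

bg = background_percent(empty + cells)  # only the two droplets with nUMI <= 10 contribute
print(bg["HBB"], prop_median(cells, "HBB"), detect_rate(cells, "HBB"))
```

In this toy example HBB dominates the background while being absent from most cells, which is the kind of pattern the two scatter plots are meant to expose.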
We refer to the SoupX algorithm to estimate the contamination fraction of ambient RNAs from lysed cells.
Here is the plot from SoupX, which visualises the log10 ratios of observed expression counts to the counts expected if the cell were pure background. The algorithm guesses which cells definitely express each gene and estimates the contamination fraction (red lines) using each gene alone (i.e., assuming the same contamination for all cells).
Note: SoupX emphasizes that the genes in the plot are heuristic and are only meant to help develop biological intuition.
The list absolutely must not be used to automatically select the top N genes, which may over-estimate the contamination fraction!
By default, we set three gene sets (immunoglobulin, haemoglobin, and MHC genes) according to the characteristics of the cancer microenvironment.
Using the user's input or the default gene sets, the following genes are used to estimate the contamination fraction.
## $igGenes
## [1] "IGLL5"
##
## $HLAGenes
## [1] "HLA-DRA" "HLA-DRB1" "HLA-DQA1" "HLA-DQB1" "HLA-DPA1" "HLA-DPB1"
The estimated contamination fraction is 8.04%.
Picking the right genes, those specific to one cell population, is absolutely vital for the accuracy of the estimated contamination fraction.
The fraction calculated here is therefore for reference only, especially when only the default gene sets are used without considering sample-specific features.
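The idea behind this kind of estimate can be sketched as follows: in cells assumed not to express the chosen genes, any counts of those genes are attributed to ambient RNA, so the ratio of observed counts to the counts expected from pure background gives a global contamination fraction. The Python below is a simplified sketch of that ratio only, not the SoupX implementation. The function name, the toy counts, the background fractions, and the hand-picked list of non-expressing cells are all hypothetical.

```python
def contamination_fraction(cells, bg_fraction, gene_set, non_expressing):
    """Estimate a single contamination fraction shared by all cells.

    cells          : {cell_id: {gene: UMI count}}
    bg_fraction    : {gene: proportion of the gene in the background profile}
    gene_set       : genes assumed to be absent from the non-expressing cells
    non_expressing : cell ids assumed not to express any gene in gene_set
    """
    observed = 0
    expected = 0.0
    for cid in non_expressing:
        counts = cells[cid]
        n_umi = sum(counts.values())
        for gene in gene_set:
            observed += counts.get(gene, 0)
            # Counts this cell would show for the gene if it were pure background.
            expected += n_umi * bg_fraction.get(gene, 0.0)
    if expected == 0:
        raise ValueError("the gene set has no background expression in the chosen cells")
    return observed / expected


# Toy data (hypothetical): two cells assumed not to express the marker genes.
cells = {
    "tumor_1": {"EPCAM": 196, "IGLL5": 2, "HLA-DRA": 2},
    "tumor_2": {"EPCAM": 98, "IGLL5": 1, "HLA-DRA": 1},
}
bg_fraction = {"IGLL5": 0.05, "HLA-DRA": 0.05}

rho = contamination_fraction(cells, bg_fraction, ["IGLL5", "HLA-DRA"], ["tumor_1", "tumor_2"])
print(rho)
```

The sketch makes the dependence on gene choice explicit: if a cell in `non_expressing` truly expresses one of the genes, `observed` is inflated and the fraction is over-estimated.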
According to the statistics and visualizations above, we propose the following thresholds to filter cells:
Index | Low.threshold | High.threshold |
---|---|---|
nUMI | 0 | 12496.000 |
nGene | 200 | 3268.000 |
mito.percent | -Inf | 0.229 |
ribo.percent | -Inf | 0.371 |
diss.percent | -Inf | 0.055 |
Cell Ranger can filter the droplets with low nUMI, so here we set Low.threshold for nUMI to 0.
Users need to use the cell identification results of Cell Ranger, or set a suitable threshold first, to filter out possible empty droplets with fewer UMIs.
Using these thresholds, the number of cells varies as follows:
Raw: 434012
-> cellranger3: 10227
-> nUMI<12496: 9876
-> nGene>=200: 9828
-> nGene<3268: 9806
-> mito.percent<0.229: 9507
-> ribo.percent<0.371: 9454
-> diss.percent<0.055: 8980
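The step-wise counts above correspond to applying the thresholds one after another. A minimal Python sketch of such a cascade is shown below. The `filter_cascade` helper and the toy cells are hypothetical and are not part of scCancer. The threshold values are the ones listed in the table of this report.

```python
def filter_cascade(cells, steps):
    """Apply filters in order and record how many cells remain after each step."""
    remaining = list(cells)
    trace = [("input", len(remaining))]
    for label, keep in steps:
        remaining = [c for c in remaining if keep(c)]
        trace.append((label, len(remaining)))
    return remaining, trace


# Thresholds taken from the table in this report.
steps = [
    ("nUMI<12496", lambda c: c["nUMI"] < 12496),
    ("nGene>=200", lambda c: c["nGene"] >= 200),
    ("nGene<3268", lambda c: c["nGene"] < 3268),
    ("mito.percent<0.229", lambda c: c["mito.percent"] < 0.229),
    ("ribo.percent<0.371", lambda c: c["ribo.percent"] < 0.371),
    ("diss.percent<0.055", lambda c: c["diss.percent"] < 0.055),
]

# Toy cells (hypothetical values).
toy_cells = [
    {"nUMI": 5000, "nGene": 1500, "mito.percent": 0.08, "ribo.percent": 0.22, "diss.percent": 0.03},
    {"nUMI": 20000, "nGene": 4000, "mito.percent": 0.05, "ribo.percent": 0.20, "diss.percent": 0.02},
    {"nUMI": 600, "nGene": 150, "mito.percent": 0.10, "ribo.percent": 0.20, "diss.percent": 0.03},
    {"nUMI": 3000, "nGene": 900, "mito.percent": 0.40, "ribo.percent": 0.20, "diss.percent": 0.03},
    {"nUMI": 4000, "nGene": 1200, "mito.percent": 0.10, "ribo.percent": 0.20, "diss.percent": 0.09},
]

kept, trace = filter_cascade(toy_cells, steps)
for label, n in trace:
    print(f"{label}: {n}")
```

Because a cell failing several criteria is removed only once, the drop at a given step can be smaller than the per-metric count reported earlier. In this report, 128 cells exceed the nGene upper threshold, yet the nGene<3268 step removes only 22 (9828 -> 9806), since the others were already removed by the nUMI filter.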
Running this script generates the following files: