KC-example - scCancer


G-Lab@THU, 2020-03-02 18:55:17

1 Cell statistics

1.1 Cell calling

plot of chunk nUMI


1.2 The number of UMIs and detected genes in cells

After cell calling by Cell Ranger V3, we perform further quality control to filter out droplets containing low-quality cells, according to nUMI (total number of UMIs) and nGene (total number of detected genes).

For nUMI :

For nGene :

Comment: The suggested thresholds (except the lower bound of nGene, which is set by convention) are determined from the distributions of the respective metrics. Cells identified as outliers by these thresholds will be filtered out. The same applies to the sections below.

plot of chunk filter
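
The report does not spell out how the distribution-based thresholds are computed. As a purely illustrative sketch, one common outlier rule places the upper bound at the median plus a multiple of the scaled median absolute deviation (MAD). The function names `mad_upper_bound` and `split_outliers`, the multiplier `k = 3`, and the toy values are assumptions made for this example; they are not taken from scCancer and may differ from what the package actually does.

```python
from statistics import median

def mad_upper_bound(values, k=3.0):
    """Upper outlier bound: median + k * scaled MAD.

    Assumed rule for illustration only, not necessarily scCancer's method.
    The factor 1.4826 makes the MAD comparable to a standard deviation
    for normally distributed data.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    return med + k * 1.4826 * mad

def split_outliers(values, k=3.0):
    """Return (bound, kept, outliers) for a list of per-cell values."""
    bound = mad_upper_bound(values, k)
    kept = [v for v in values if v <= bound]
    outliers = [v for v in values if v > bound]
    return bound, kept, outliers

if __name__ == "__main__":
    # Hypothetical nUMI-like values; the last one is an obvious outlier.
    toy_values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
    print(split_outliers(toy_values))
```

A robust rule of this kind is less sensitive to the extreme droplets it is meant to detect than a mean and standard deviation would be.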


2 Gene statistics

The number of genes expressed in at least one cell : 21199.

2.1 Mitochondrial genes

Summary of the mitochondrial gene percentage (mito.percent, reported as a fraction between 0 and 1) in cells:

##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## "0.000721" "0.049074" "0.084407" "0.095748" "0.120959" "0.970637"

plot of chunk mito
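
A summary like the one above can be sketched as follows, assuming that mito.percent is the share of a cell's UMI counts that come from mitochondrial genes. The function names, the toy count matrix, and the selection of mitochondrial genes by the `MT-` prefix (a human gene symbol convention) are assumptions for illustration, not details confirmed by this report.

```python
from statistics import mean, quantiles

def gene_set_fraction(counts, genes, gene_set):
    """Per-cell fraction of UMI counts that fall into `gene_set`.

    counts: one list of UMI counts per cell, aligned with `genes`.
    """
    idx = [i for i, g in enumerate(genes) if g in gene_set]
    fractions = []
    for cell in counts:
        total = sum(cell)
        fractions.append(sum(cell[i] for i in idx) / total if total else 0.0)
    return fractions

def six_number_summary(values):
    """Min, quartiles, mean and max (needs at least two values).

    Quartiles use linear interpolation between order statistics.
    """
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "Min.": min(values), "1st Qu.": q1, "Median": med,
        "Mean": mean(values), "3rd Qu.": q3, "Max.": max(values),
    }

if __name__ == "__main__":
    # Hypothetical toy data: 4 cells x 4 genes.
    genes = ["MT-CO1", "MT-ND1", "ACTB", "GAPDH"]
    counts = [[1, 1, 4, 4], [0, 0, 5, 5], [5, 0, 3, 2], [2, 2, 3, 3]]
    mito_genes = {g for g in genes if g.startswith("MT-")}  # assumed convention
    print(six_number_summary(gene_set_fraction(counts, genes, mito_genes)))
```

The same two functions would apply to the ribosomal and dissociation-associated gene sets by passing a different `gene_set`.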


2.2 Ribosome genes

Summary of the ribosomal gene percentage (ribo.percent, reported as a fraction between 0 and 1) in cells:

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## "0.00506" "0.18533" "0.22266" "0.22026" "0.25968" "0.52744"

plot of chunk ribo


2.3 Dissociation associated genes

Summary of the dissociation-associated gene percentage (diss.percent, reported as a fraction between 0 and 1) in cells:

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## "0.0000" "0.0244" "0.0294" "0.0318" "0.0367" "0.1760"

plot of chunk diss


2.4 Ambient RNAs

2.4.1 Highly-expressed genes

In order to analyze the gene expression profiles in detail and identify
highly-expressed genes in background mRNAs from lysed cells, we calculate some metrics as shown below.

Here is a plot showing the distributions of gene proportion in cells for the first 100 genes (ordered by their proportion in the background, bg.percent). The points (genes) are colored according to whether they belong to the mitochondrial, ribosomal, or dissociation-associated genes. The red stars mark each gene's proportion in the background.

plot of chunk genePropPlot


The plot below shows the relationship between bg.percent and prop.median, bg.percent and detect.rate.

plot of chunk gene.plot
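
The three per-gene metrics compared in this plot can be sketched as follows. The definitions used here are inferred from the metric names and are assumptions: bg.percent as the gene's share of all counts in background droplets, prop.median as the median over cells of the gene's within-cell proportion, and detect.rate as the fraction of cells with a non-zero count. The function name `gene_metrics` and the toy matrices are hypothetical.

```python
from statistics import median

def gene_metrics(cell_counts, background_counts, genes):
    """Compute assumed versions of bg.percent, prop.median and detect.rate.

    cell_counts / background_counts: one list of UMI counts per droplet,
    aligned with `genes`. `cell_counts` must not be empty.
    """
    n_genes = len(genes)
    bg_per_gene = [sum(d[i] for d in background_counts) for i in range(n_genes)]
    bg_total = sum(bg_per_gene)
    cell_totals = [sum(c) for c in cell_counts]

    metrics = {}
    for i, gene in enumerate(genes):
        # Within-cell proportion of this gene, skipping empty cells.
        props = [c[i] / t for c, t in zip(cell_counts, cell_totals) if t > 0]
        detected = sum(1 for c in cell_counts if c[i] > 0)
        metrics[gene] = {
            "bg_percent": bg_per_gene[i] / bg_total if bg_total else 0.0,
            "prop_median": median(props) if props else 0.0,
            "detect_rate": detected / len(cell_counts),
        }
    return metrics

if __name__ == "__main__":
    # Hypothetical toy data: 3 cells and 2 background droplets, 2 genes.
    cells = [[1, 3], [0, 4], [2, 2]]
    background = [[1, 0], [1, 2]]
    print(gene_metrics(cells, background, ["A", "B"]))
```

A gene with a high bg_percent but a low detect_rate in cells would be a candidate for expression that mainly originates from the background.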


2.4.2 Ambient RNA contamination fraction estimation

We refer to the algorithm of SoupX to estimate the contamination fraction of ambient RNAs from lysed cells.

Here is the plot from SoupX, which visualises the log10 ratios of observed expression counts to those expected if the cell were pure background. The algorithm guesses which cells definitely express each gene and estimates the contamination fraction (red lines) using just that gene (i.e., assuming the same contamination for all cells).

plot of chunk soupX


Note: The SoupX authors emphasize that the genes shown in the plot are heuristic and are only meant to help develop biological intuition. The list absolutely must not be used to automatically select the top N genes, which may over-estimate the contamination fraction!

By default, we use three gene sets (immunoglobulin, haemoglobin, and MHC genes), chosen according to the characteristics of the cancer microenvironment.

Based on the users' input or the default gene sets, the following genes are used to estimate the contamination fraction.

## $igGenes
## [1] "IGLL5"
## 
## $HLAGenes
## [1] "HLA-DRA"  "HLA-DRB1" "HLA-DQA1" "HLA-DQB1" "HLA-DPA1" "HLA-DPB1"

The estimated contamination fraction is 8.04%. Picking genes that are specific to a single cell population is absolutely vital for the accuracy of the estimated contamination fraction. The fraction calculated here is therefore for reference only, especially when only the default gene sets are used without considering sample-specific features.
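
The idea behind this kind of estimate can be illustrated with a heavily simplified sketch. It assumes that, in cells known not to express a marker gene set, every count of those genes comes from the background, so the contamination fraction is the observed marker counts divided by the counts expected if the cells were pure background. The function name `estimate_contamination`, its parameters, and the toy numbers are hypothetical; this is not SoupX's actual implementation, which also has to decide which cells are non-expressing.

```python
def estimate_contamination(cell_counts, soup_profile, gene_idx, non_expressing):
    """Simplified contamination estimate from a marker gene set.

    cell_counts:    one list of UMI counts per cell.
    soup_profile:   per-gene fraction of background (ambient) expression.
    gene_idx:       indices of the marker genes used for the estimate.
    non_expressing: indices of cells assumed NOT to express those genes.
    """
    soup_fraction = sum(soup_profile[i] for i in gene_idx)
    observed = 0.0
    expected = 0.0
    for c in non_expressing:
        cell = cell_counts[c]
        observed += sum(cell[i] for i in gene_idx)
        # Marker counts expected if the whole cell were background.
        expected += sum(cell) * soup_fraction
    return observed / expected if expected else float("nan")

if __name__ == "__main__":
    # Hypothetical toy data: gene 0 is the marker; cell 1 truly expresses it.
    soup_profile = [0.2, 0.8]
    cells = [[2, 98], [50, 50], [1, 99]]
    print(estimate_contamination(cells, soup_profile, [0], [0, 2]))
```

If a cell that truly expresses the marker were wrongly listed as non-expressing, its counts would inflate the numerator, which is why the choice of genes and cells matters so much for the estimate.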

3 Output

3.1 Thresholds to filter droplets

According to the results of the statistics and visualization, we propose the following thresholds to filter cells:

Index         Low.threshold  High.threshold
nUMI          0              12496.000
nGene         200            3268.000
mito.percent  -Inf           0.229
ribo.percent  -Inf           0.371
diss.percent  -Inf           0.055

Using these thresholds, the number of cells varies as follows:

Raw : 434012 -> cellranger3 : 10227 -> nUMI<12496 : 9876 -> nGene>=200 : 9828 -> nGene<3268 : 9806 -> mito.percent<0.229 : 9507 -> ribo.percent<0.371 : 9454 -> diss.percent<0.055 : 8980
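
A chain of counts like this one can be reproduced by applying the filters one after another and recording how many cells survive each step. The sketch below uses the threshold values from the table in this section; the function name `filter_chain`, the dictionary layout of a cell, and the toy cells are assumptions for illustration.

```python
# Filters in the order used in the chain; values taken from the threshold table.
FILTERS = [
    ("nUMI<12496", lambda c: c["nUMI"] < 12496),
    ("nGene>=200", lambda c: c["nGene"] >= 200),
    ("nGene<3268", lambda c: c["nGene"] < 3268),
    ("mito.percent<0.229", lambda c: c["mito.percent"] < 0.229),
    ("ribo.percent<0.371", lambda c: c["ribo.percent"] < 0.371),
    ("diss.percent<0.055", lambda c: c["diss.percent"] < 0.055),
]

def filter_chain(cells, filters=FILTERS, start_label="cellranger3"):
    """Apply filters sequentially; return surviving cells and per-step counts."""
    kept = list(cells)
    steps = [(start_label, len(kept))]
    for label, keep in filters:
        kept = [c for c in kept if keep(c)]
        steps.append((label, len(kept)))
    return kept, steps

def make_cell(numi, ngene, mito, ribo, diss):
    """Build a hypothetical cell record with the five QC metrics."""
    return {"nUMI": numi, "nGene": ngene, "mito.percent": mito,
            "ribo.percent": ribo, "diss.percent": diss}

if __name__ == "__main__":
    toy_cells = [
        make_cell(5000, 1500, 0.05, 0.20, 0.03),   # passes every filter
        make_cell(20000, 3000, 0.05, 0.20, 0.03),  # too many UMIs
        make_cell(300, 150, 0.05, 0.20, 0.03),     # too few genes
    ]
    _, steps = filter_chain(toy_cells)
    print(" -> ".join(f"{label} : {n}" for label, n in steps))
```

Because the filters are applied in sequence, each count reflects only the cells that survived all earlier steps, so the per-step drops depend on the filter order while the final count does not.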

3.2 Output files

Running this script generates the following files:

  1. HTML report : report-scStat.html.
  2. Markdown report : report-scStat.md.
  3. Figure files : figures/.
  4. Figures used in the report: report-figures/.
  5. Text file with cell manifest : cellManifest-all.txt.
  6. Text file with suggested thresholds as above : cell.QC.thres.txt.
  7. Text file with gene manifest : geneManifest.txt.
  8. Text file with the results of SoupX : ambientRNA-SoupX.txt.
  9. Cell Ranger HTML report (copied from the source data folder): report-cellRanger.html.



© G-Lab, Tsinghua University