Processes


1. Preparing data

Data on this website are downloaded from various sources. We require the following types of data and process them to fit our database.

1.1 Clinical tables

We prepare three tables. The first describes the cases (patients) with breast cancer. For each case we collect basic characteristics (age, sex, height, weight, race, etc.), clinical diagnosis (receptor status, tumor type, grade, stage, etc.), treatment programs (hormone therapy, radiotherapy, surgery, etc.) and prognostic indicators (various survival information, etc.). For features shown on the webpage, missing numeric values are set to "NA" and missing categorical values are left empty (blank). A case may have several samples that correspond to different sampling sites or stages. Because genomic data are measured on samples, we collect sample descriptions, or split case descriptions, to match the genomic data. This is the second table.
The third table records the correspondence between clinical samples and genomic data.

1.2 Simple somatic mutation

Simple somatic mutations are inferred from DNA-seq or microarray data. We keep them in a MAF-formatted file (https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/). The first 32 columns are required. Genome coordinates can be Ensembl GRCh37 or GRCh38, or UCSC hg19 or hg38. A Hugo symbol is corrected if it cannot be mapped to an Entrez ID but its Entrez ID can be mapped to a symbol. Entrez IDs are corrected in a similar way. Some datasets supply no Entrez IDs. For these we map the symbols to hgnc_symbol, external_synonym, entrezgene_accession and ensembl_gene_id (attributes in biomaRt) in that order and take the first successful mapping as the standard symbol.
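
The sequential lookup for datasets without Entrez IDs can be sketched in Python as follows. This is a minimal illustration, not the code used for this website: the function name `standardize_symbol` and the in-memory lookup tables are hypothetical stand-ins for the corresponding biomaRt attribute tables.

```python
from typing import Optional

# Attribute order taken from the text above. The tables are toy stand-ins
# (assumption) for the biomaRt attributes of the same names.
ATTRIBUTE_ORDER = ["hgnc_symbol", "external_synonym",
                   "entrezgene_accession", "ensembl_gene_id"]

def standardize_symbol(symbol: str, tables: dict) -> Optional[str]:
    """Return the standard symbol from the first attribute table that maps `symbol`."""
    for attribute in ATTRIBUTE_ORDER:
        hit = tables.get(attribute, {}).get(symbol)
        if hit is not None:
            return hit
    return None  # no table maps this symbol

# Toy lookup tables for illustration only.
toy_tables = {
    "hgnc_symbol": {"BRCA1": "BRCA1"},
    "external_synonym": {"RNF53": "BRCA1"},
    "ensembl_gene_id": {"ENSG00000012048": "BRCA1"},
}
```

Because the attributes are tried in a fixed order, a symbol that is both an official symbol and a synonym of another gene resolves to the official symbol.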

1.3 Gene fusion

Gene fusions are inferred from DNA-seq or RNA-seq data. We keep them in a tab-separated text file whose first three columns are "Hugo_Symbol", "Tumor_Sample_Barcode" and "Fusion". Each fusion is named in the format "Left Gene Symbol-Right Gene Symbol". Gene symbols are processed as described in section "1.2".

1.4 Gene-level copynumber and copynumber segment

Both are inferred from microarray or DNA-seq data. The copynumber segment format is the widely used output of CBS (circular binary segmentation) analysis in the DNAcopy R package. Genome coordinates can be Ensembl GRCh37 or GRCh38, or UCSC hg19 or hg38. Segment means may or may not be log-transformed. Gene-level copynumbers are stored in a sample-gene matrix. The five values -2, -1, 0, 1 and 2 in the matrix mean homozygous deletion (HOMDEL), hemizygous deletion (HETLOSS), neutral or no change (DIPLOID), gain (GAIN) and high-level amplification (AMP), respectively. Missing values are kept as "NA". Gene symbols are processed as described in section "1.2". In addition, some symbols carry a "|chr*" suffix (usually output by GISTIC2). We keep these suffixes while correcting the symbols. For duplicated symbols, the rounded mean value is calculated.
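
The suffix handling and the rounded mean for duplicated symbols can be sketched as below. This is a minimal sketch under stated assumptions: the function names and the `corrections` dictionary are hypothetical, missing values ("NA") are represented as `None`, and Python's built-in `round` (which rounds halves to the nearest even integer) stands in for whatever rounding the actual pipeline uses.

```python
from collections import defaultdict

def correct_keeping_suffix(symbol, corrections):
    """Correct the gene part of a symbol and keep any "|chr*" suffix unchanged."""
    base, sep, suffix = symbol.partition("|")
    return corrections.get(base, base) + sep + suffix

def merge_duplicate_genes(rows):
    """rows: list of (symbol, [value per sample]); None marks a missing value.
    Returns {symbol: [rounded mean per sample]}."""
    grouped = defaultdict(list)
    for symbol, values in rows:
        grouped[symbol].append(values)
    merged = {}
    for symbol, value_lists in grouped.items():
        merged_values = []
        for column in zip(*value_lists):          # one column per sample
            present = [v for v in column if v is not None]
            merged_values.append(round(sum(present) / len(present)) if present else None)
        merged[symbol] = merged_values
    return merged
```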

1.5 Gene expression and gene expression of normal samples

Gene expression is inferred from microarray or RNA-seq data and kept in a gene-sample matrix. Values can be log2 intensities (microarray), FPKM (Cufflinks output, etc.), TPM (RSEM output, etc.), or raw counts, log2 raw counts or log2(raw count + 1) (STAR output, etc.). Missing values are kept as "NA". A Hugo symbol is corrected if it cannot be mapped to an Entrez ID or Ensembl ID but its Entrez ID or Ensembl ID can be mapped to a symbol. Some datasets supply neither Entrez IDs nor Ensembl IDs. For these we map the symbols to hgnc_symbol, external_synonym, entrezgene_accession and ensembl_gene_id (attributes in biomaRt) in that order and take the first successful mapping as the standard symbol. Expression values are summarized for duplicated symbols. The main matrix holds the tumor samples. Some datasets also have a corresponding normal group, which we keep in a separate matrix named "gene expression of normal samples".

1.6 Gene expression z-score

Gene expression z-scores are typically computed for each gene across all samples, including normal samples. When such data or gene-level copynumbers are supplied, we prefer z-scores computed relative to the expression distribution of the gene in tumors that are diploid for that gene. Missing values are kept as "NA". Gene symbols are processed as described in section "1.5".
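
A z-score relative to a reference group (for example, the tumors that are diploid for the gene) can be sketched as follows. This is an illustrative sketch only: the function name and the boolean `reference_mask` are assumptions, missing values are represented as `None`, and the sample standard deviation is used, which may differ from the estimator used in the source data.

```python
from statistics import mean, stdev

def zscores_vs_reference(values, reference_mask):
    """values: expression of one gene per sample (None = NA).
    reference_mask: True for samples in the reference group,
    e.g. samples whose gene-level copynumber for this gene is 0 (DIPLOID)."""
    reference = [v for v, is_ref in zip(values, reference_mask)
                 if is_ref and v is not None]
    if len(reference) < 2:
        return [None] * len(values)      # not enough data to estimate spread
    mu, sd = mean(reference), stdev(reference)
    if sd == 0:
        return [None] * len(values)      # constant reference: z-score undefined
    return [None if v is None else (v - mu) / sd for v in values]
```

Setting every entry of `reference_mask` to `True` gives the plain z-score across all samples.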

1.7 miRNA expression and miRNA expression of normal samples

miRNA expression is inferred from microarray or RNA-seq data and kept in a miRNA-sample matrix. Values can be log2 intensities, FPKM, TPM, or raw counts, log2 raw counts or log2(raw count + 1). Missing values are kept as "NA". miRBase IDs are used to identify miRNAs. Expression values are summarized for duplicated IDs. The main matrix holds the tumor samples. Some datasets also have a corresponding normal group, which we keep in a separate matrix named "miRNA expression of normal samples".

1.8 miRNA expression z-score

miRNA expression z-scores are typically computed for each miRNA across normal samples (or across tumor samples if normal samples are not supplied). Missing values are kept as "NA". miRBase IDs are used to identify miRNAs.

1.9 Gene-level methylation means and gene-level methylation means of normal samples

Gene-level methylation means are inferred from microarray, RRBS or WGBS data and kept in a gene-sample matrix. For each gene in each sample, we calculate the mean beta value over the region from 1500bp to -500bp relative to the TSS. Values are normalized to sum to 1,000,000 for each sample. Missing values are kept as "NA". Gene symbols are processed as described in section "1.5". Methylation values are averaged for duplicated symbols. The main matrix holds the tumor samples. Some datasets also have a corresponding normal group, which we keep in a separate matrix named "gene-level methylation means of normal samples".
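
The window average can be sketched as below. This is a sketch under explicit assumptions, not the pipeline's code: the window is read here as 1500 bp upstream to 500 bp downstream of the TSS in the direction of transcription, the window bounds are inclusive, and the function name and the `(position, beta)` input layout are hypothetical.

```python
def promoter_mean_beta(probes, tss, strand, upstream=1500, downstream=500):
    """probes: list of (position, beta) on the gene's chromosome (beta may be None).
    Returns the mean beta of probes inside the TSS window, or None if there are none."""
    if strand == "+":
        start, end = tss - upstream, tss + downstream
    else:
        # On the minus strand, "upstream" lies at larger genomic coordinates.
        start, end = tss - downstream, tss + upstream
    betas = [beta for position, beta in probes
             if start <= position <= end and beta is not None]
    return sum(betas) / len(betas) if betas else None
```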

1.10 Gene-level methylation z-score

Gene-level methylation z-scores are typically computed for each gene across normal samples (or across tumor samples if normal samples are not supplied). Missing values are kept as "NA". Gene symbols are processed as described in section "1.5".


2. Counting mutations for a dataset

For simple somatic mutations, we count the number of mutations in each sample to get a mutation count distribution. In each sample, a gene is called mutated if it carries at least one mutation. We count sample frequencies for all genes and list the most frequently mutated genes. We also count sample frequencies for each mutation. To identify the same mutation across all datasets, we annotate mutations against GRCh38 (Ensembl release 106) with VEP to get HGVS identifiers. Mutations that cannot be lifted to hg38 (by UCSC liftOver chains) or that do not have HGVSc identifiers are therefore filtered out. Mutations found in at least 2 samples are listed in a table with some important VEP annotations, and the most frequent mutations (with non-empty HGVSp) are shown in a figure.
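
The three counts described above can be sketched as follows. This is a minimal sketch: the function name and the `(sample, gene, mutation_id)` record layout are assumptions, with the fields standing for the MAF columns Tumor_Sample_Barcode and Hugo_Symbol plus an HGVS identifier.

```python
from collections import Counter, defaultdict

def count_mutations(records):
    """records: iterable of (sample, gene, mutation_id) tuples.
    Returns (mutations per sample, samples per gene, samples per recurrent mutation)."""
    per_sample = Counter()
    gene_samples = defaultdict(set)
    mutation_samples = defaultdict(set)
    for sample, gene, mutation in records:
        per_sample[sample] += 1
        gene_samples[gene].add(sample)          # a gene counts once per sample
        mutation_samples[mutation].add(sample)
    gene_freq = Counter({gene: len(s) for gene, s in gene_samples.items()})
    # Only mutations seen in at least 2 samples are listed in the table.
    recurrent = {m: len(s) for m, s in mutation_samples.items() if len(s) >= 2}
    return per_sample, gene_freq, recurrent
```

`gene_freq.most_common(n)` then gives the n most frequently mutated genes.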


3. Counting fusions for a dataset

Two gene fusions are treated as the same fusion if their left genes are the same and their right genes are the same. We count sample frequencies for all fusions, list fusions that affect at least 2 samples in a table, and show the most frequent fusions in a figure.
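
Counting fusions from the tab-separated file of section "1.3" can be sketched as below. This is an illustrative sketch: the function name is hypothetical, and it assumes that two rows with the same "Fusion" value name the same left and right genes, so the "Fusion" string itself is used as the key.

```python
import csv
import io
from collections import defaultdict

def fusion_sample_frequencies(tsv_text):
    """tsv_text: file content with columns Hugo_Symbol, Tumor_Sample_Barcode, Fusion.
    Returns {fusion: number of distinct samples carrying it}."""
    samples_by_fusion = defaultdict(set)
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        # A set keeps a sample from being counted twice when the same fusion
        # appears in several rows (e.g. once per partner gene).
        samples_by_fusion[row["Fusion"]].add(row["Tumor_Sample_Barcode"])
    return {fusion: len(samples) for fusion, samples in samples_by_fusion.items()}

# Toy input with placeholder gene names.
toy_tsv = (
    "Hugo_Symbol\tTumor_Sample_Barcode\tFusion\n"
    "GENEA\tS1\tGENEA-GENEB\n"
    "GENEB\tS1\tGENEA-GENEB\n"
    "GENEA\tS2\tGENEA-GENEB\n"
    "GENEC\tS2\tGENEC-GENED\n"
)
```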


4. Counting CNAs for a dataset

For gene-level copynumbers, we count sample frequencies of the 4 types of CNAs (HOMDEL, HETLOSS, GAIN, AMP) for each gene and list HOMDEL/AMP/total samples in a table. The figure shows the most frequently altered genes with either AMP or HOMDEL CNAs. We also plot copynumber segments here with the R package copynumber.
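
The per-gene CNA table can be sketched as follows. This is a sketch with assumptions: the function name and the `{gene: [value per sample]}` layout are hypothetical, `None` stands for "NA", and "total" is taken here to mean the number of samples with a non-missing value for the gene, which may not match the definition used on the website.

```python
from collections import Counter

# Value-to-label mapping from section "1.4"; 0 (DIPLOID) is not a CNA.
CNA_LABELS = {-2: "HOMDEL", -1: "HETLOSS", 1: "GAIN", 2: "AMP"}

def count_cna(matrix):
    """matrix: {gene: [copynumber value per sample]}, with None for missing values.
    Returns {gene: {"HOMDEL": n, "AMP": n, "total": n}}."""
    table = {}
    for gene, values in matrix.items():
        counts = Counter(CNA_LABELS[v] for v in values if v in CNA_LABELS)
        total = sum(v is not None for v in values)
        table[gene] = {"HOMDEL": counts["HOMDEL"], "AMP": counts["AMP"], "total": total}
    return table
```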


5. Summarizing genomic events for a dataset

We use an OncoGrid to present all of the DNA events above. For simple somatic mutations, we convert the Variant_Classification (a column in the MAF file) to the following consequence markers: Frameshift, Inframe, Missense, Nonsense, Tss, Nonstop, Splice, RNA, Others. "Others" is not shown in the figure. Gene fusions are marked FUSION. For gene-level copynumbers, we mark HOMDEL and AMP and filter out the other types. Some important clinical traits are also shown in the figure.
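
The conversion from Variant_Classification to consequence markers can be sketched as below. The exact mapping used on the website is not stated in this document, so the dictionary is an assumption based on the standard MAF Variant_Classification values; the function names are hypothetical.

```python
# Assumed mapping from MAF Variant_Classification values to consequence markers.
CONSEQUENCE = {
    "Frame_Shift_Del": "Frameshift",
    "Frame_Shift_Ins": "Frameshift",
    "In_Frame_Del": "Inframe",
    "In_Frame_Ins": "Inframe",
    "Missense_Mutation": "Missense",
    "Nonsense_Mutation": "Nonsense",
    "Translation_Start_Site": "Tss",
    "Nonstop_Mutation": "Nonstop",
    "Splice_Site": "Splice",
    "RNA": "RNA",
}

def consequence_marker(variant_classification):
    """Anything not in the mapping (e.g. Silent, Intron) becomes "Others"."""
    return CONSEQUENCE.get(variant_classification, "Others")

def oncogrid_markers(classifications):
    """Markers kept for the figure: "Others" is dropped."""
    return [m for m in map(consequence_marker, classifications) if m != "Others"]
```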


6. Calculating differentially expressed genes for a dataset

We use the R package edgeR to identify differentially expressed genes. Genes expressed in at least one sample are included. We then calculate the fold change, p-value and BH FDR (Benjamini-Hochberg false discovery rate) for each gene between tumor and normal samples. Genes with log2FC>1 and FDR<0.05 are marked as up-regulated, genes with log2FC<-1 and FDR<0.05 are marked as down-regulated, and both are listed in a table. We use the R package clusterProfiler (with Ensembl 106 in biomaRt) to calculate KEGG and GO enrichments for the differentially expressed genes (both up- and down-regulated). The top 30 enriched KEGG pathways are shown in a bubble plot. The top 10 GO terms (for BP, CC and MF respectively) are shown in a bar plot.
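
The testing itself is done by edgeR. The sketch below only illustrates the two steps that follow it, the Benjamini-Hochberg adjustment and the threshold rule, in plain Python. The function names are hypothetical.

```python
def bh_fdr(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])   # indices by ascending p
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

def classify(log2fc, fdr, fc_cut=1.0, fdr_cut=0.05):
    """Apply the thresholds from the text: |log2FC| > 1 and FDR < 0.05."""
    if fdr < fdr_cut and log2fc > fc_cut:
        return "up"
    if fdr < fdr_cut and log2fc < -fc_cut:
        return "down"
    return "not significant"
```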


7. Calculating differentially expressed miRNAs for a dataset

We use the R package edgeR to identify differentially expressed miRNAs. miRNAs expressed in at least one sample are included. We then calculate the fold change, p-value and BH FDR for each miRNA between tumor and normal samples. miRNAs with log2FC>1 and FDR<0.05 are marked as up-regulated, miRNAs with log2FC<-1 and FDR<0.05 are marked as down-regulated, and both are listed in a table.


8. Calculating differentially methylated genes for a dataset

We use the R package edgeR to identify differentially methylated genes. The mean beta value over the region from 1500bp to -500bp relative to the TSS indicates the degree of methylation of each gene. We calculate the fold change, p-value and BH FDR for each gene between tumor and normal samples. Genes with log2FC>1 and FDR<0.05 are marked as up-regulated, genes with log2FC<-1 and FDR<0.05 are marked as down-regulated, and both are listed in a table. We use the R package clusterProfiler (with Ensembl 106 in biomaRt) to calculate KEGG and GO enrichments for the differentially methylated genes (both up- and down-regulated). The top 30 enriched KEGG pathways are shown in a bubble plot. The top 10 GO terms (for BP, CC and MF respectively) are shown in a bar plot.


9. Performing survival analysis

Survival analysis is done with the R packages survival and survminer. We use several types of genomic data as strata. For simple somatic mutations, we use "mutated gene" as the stratum. If a gene is mutated in 10%-90% of samples, we calculate the survival significance for this stratum and give the plot. For gene fusions, we use "fusion gene" as the stratum and include genes that have fusions in 10%-90% of samples. For gene-level copynumbers, we use "CNA gene" as the stratum. CNA types can be HOMDEL, HETLOSS, GAIN or AMP. If a gene has at least 2 types of CNAs that affect 10%-90% of samples, we use it as a stratum. For gene expression z-scores, a gene is classified as "high" in a sample when its z-score>0 and "low" when its z-score<0. If a gene is marked "high" in 10%-90% of samples and "low" in 10%-90% of samples, we use it as a stratum. miRNAs (by miRNA expression z-score) and genes (by gene-level methylation z-score) are classified as "high" or "low" and selected as strata by the same rule.
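
The selection of strata (not the survival test itself, which is done in R) can be sketched as follows. This is a minimal sketch with assumptions: the function names are hypothetical, the 10% and 90% bounds are treated as inclusive, and fractions are taken over all samples passed in, including samples with a missing z-score (`None`).

```python
def fraction_in_range(flags, lo=0.10, hi=0.90):
    """flags: one boolean per sample (e.g. gene mutated or not).
    True if the affected fraction lies within [lo, hi]."""
    return lo <= sum(flags) / len(flags) <= hi

def zscore_stratum(zscores, lo=0.10, hi=0.90):
    """Label each sample "high" (z > 0) or "low" (z < 0) and report whether
    both groups cover 10%-90% of samples."""
    labels = []
    for z in zscores:
        if z is None or z == 0:
            labels.append(None)                 # neither high nor low
        else:
            labels.append("high" if z > 0 else "low")
    n = len(labels)
    usable = (lo <= labels.count("high") / n <= hi
              and lo <= labels.count("low") / n <= hi)
    return labels, usable
```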


10. Generating genomic tracks for a dataset

Simple somatic mutations in the MAF file are lifted to hg38 (by UCSC liftOver chains). Copynumber segment files are first converted to BED, then lifted to hg38 (by UCSC liftOver chains), and finally converted back. Records that cannot be lifted are discarded. The two resulting files are available as tracks for the IGV genome browser.
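
The round trip between a segment record and a BED record can be sketched as below; the lift-over itself is done by the UCSC tool and is not shown. This is a sketch under assumptions: segment rows follow the DNAcopy column order (ID, chrom, loc.start, loc.end, num.mark, seg.mean) with 1-based inclusive coordinates, chromosome names get a "chr" prefix for the UCSC chains, and packing the remaining fields into the BED name column is a hypothetical choice.

```python
def seg_to_bed(seg_row):
    """(ID, chrom, loc.start, loc.end, num.mark, seg.mean) -> (chrom, start, end, name).
    BED coordinates are 0-based and half-open, so only the start shifts by one."""
    sample, chrom, start, end, num_mark, seg_mean = seg_row
    chrom = str(chrom)
    if not chrom.startswith("chr"):
        chrom = "chr" + chrom
    name = f"{sample}|{num_mark}|{seg_mean}"    # carry the other fields through liftOver
    return (chrom, int(start) - 1, int(end), name)

def bed_to_seg(bed_row):
    """Inverse of seg_to_bed, applied to the lifted BED records."""
    chrom, start, end, name = bed_row
    # Split from the right so a "|" inside the sample ID is preserved.
    sample, num_mark, seg_mean = name.rsplit("|", 2)
    return (sample, chrom, int(start) + 1, int(end), int(num_mark), float(seg_mean))
```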


11. Counting affected genes across datasets

A gene is treated as affected in a dataset if it has simple somatic mutations, fusions or at least 1 of the 4 types of CNAs in that dataset. The affected genes of all datasets are merged (set union) to get a final number. We also count sample frequencies of mutated genes (genes that have at least 1 simple somatic mutation in a sample of a dataset) across datasets. The top mutated genes across datasets are shown in a stacked bar plot. The top mutated genes in Chinese cohorts are listed in the same way, counting only datasets with Chinese cohorts.


12. Counting mutations/fusions/CNAs across datasets

We combine the sample frequencies of mutations from the step "Counting mutations for a dataset" over all datasets to get sample frequencies across datasets. Mutations found in only 1 sample across all datasets are not listed in tables but are counted in the total number of mutations. We combine the results of the step "Counting fusions for a dataset" to get sample frequencies of gene fusions across datasets. Gene fusions that affect only 1 sample across all datasets are not listed in tables. We combine the results of the step "Counting CNAs for a dataset" to get sample frequencies of CNAs across datasets. GAIN and HETLOSS are not listed in tables.
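
Combining per-dataset frequencies can be sketched as follows. This is a minimal sketch: the function name is hypothetical, and it assumes each dataset contributes a `{event: sample count}` dictionary keyed by a shared identifier (for mutations, the HGVS identifier).

```python
from collections import Counter

def combine_frequencies(per_dataset, min_samples=2):
    """per_dataset: list of {event: sample count} dictionaries, one per dataset.
    Returns (all combined counts, counts of events listed in the table)."""
    total = Counter()
    for frequencies in per_dataset:
        total.update(frequencies)               # adds counts for shared keys
    # Events seen in only 1 sample overall are counted but not listed.
    listed = {event: n for event, n in total.items() if n >= min_samples}
    return total, listed
```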


13. Making gene/miRNA expression/methylation patterns on all datasets

We compute z-scores across genes/miRNAs in each sample for all datasets. For a user-specified gene/miRNA, a violin plot shows the difference in expression/methylation means between the case (tumor) and control (normal) groups in all datasets. Note that patterns in the figure can be misleading, since the set of observed genes and the observations themselves carry different biases under different techniques.


14. Generating co-expression network on all datasets

For each dataset, we use gene expression z-scores to calculate pairwise Spearman correlations between genes. The genes are restricted to those in the STRING protein network data (incl. distinction: direct vs. interologs; 9606.protein.links.full.v11.0.txt.gz). We then correct the p-values by their ranks. We use Fisher’s method to combine the p-values over all datasets. The combined p-values are also corrected by their ranks among genes. These corrected p-values are used as thresholds for generating networks. For user-specified genes, direct edges link to genes with p≤0.001. The number of direct edges is restricted to 1-30. Indirect edges are generated in the same way as direct edges, starting from the genes that directly link to the user-specified genes. This two-step network is plotted on the webpage for the user-specified genes.
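
Fisher's method combines k independent p-values through the statistic X = -2 * sum(ln p_i), which follows a chi-square distribution with 2k degrees of freedom under the null hypothesis. The sketch below implements it in plain Python using the closed form of the chi-square survival function for even degrees of freedom. The function name is hypothetical, and the rank-based correction steps are not shown.

```python
import math

def fisher_combine(pvalues):
    """Combined p-value of Fisher's method for p-values in (0, 1].

    With X = -2 * sum(ln p_i) and 2k degrees of freedom, the survival function is
    exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!.
    """
    k = len(pvalues)
    half_stat = -sum(math.log(p) for p in pvalues)   # this is X / 2
    term, series = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i                        # (X/2)^i / i!
        series += term
    return math.exp(-half_stat) * series
```

For a single dataset the combined value equals the input p-value, which is a quick sanity check on the formula.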




Programmatic APIs


BCMA APIs are a set of RESTful endpoints (programmable interfaces over the Web) that allow third-party developers to build automation scripts and apps. Please see the following API endpoint documentation for detailed information on the API endpoints, their representations and how the API responds to different requests.

1.Downloading Dataset Files
URL(Request Method: GET)
http://lifeome.net:9001/download?f=<datasetId>/<datasetId>_download.tar.gz
Examples
GET: http://lifeome.net:9001/download?f=fuscctnbc/_download.tar.gz
GET: http://lifeome.net:9001/download?f=brca_tcga_pan_can_atlas_2018/_download.tar.gz

"<datasetId>" can be any valid dataset ID, which can be found in the URL of the dataset page.

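A download can be scripted with the Python standard library as sketched below. The helper names are hypothetical, and the value of the `f` parameter is passed in as-is (for example, the path shown in the examples above) rather than being built from the dataset ID.

```python
import shutil
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "http://lifeome.net:9001"

def download_url(file_path):
    """file_path: value of the `f` query parameter,
    e.g. "fuscctnbc/_download.tar.gz" as in the examples."""
    return f"{BASE_URL}/download?f={quote(file_path, safe='/')}"

def download(file_path, destination):
    """Stream the archive to a local file (performs a GET request)."""
    with urlopen(download_url(file_path)) as response, open(destination, "wb") as out:
        shutil.copyfileobj(response, out)
```

For example, `download("fuscctnbc/_download.tar.gz", "fuscctnbc.tar.gz")` would save the first example archive to the current directory.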

2.Querying Gene Merged Network
URL(Request Method: GET)
http://lifeome.net:9001/f?f=_web_search_single_gene/<geneSymbol>/network.json
Examples
GET: http://lifeome.net:9001/f?f=_web_search_single_gene/BRCA1/network.json
GET: http://lifeome.net:9001/f?f=_web_search_single_gene/TIMELESS/network.json
GET: http://lifeome.net:9001/f?f=_web_search_single_gene/TOP2A/network.json

"<geneSymbol>" can be any valid HGNC gene symbol.


3.Querying Gene Expression
URL(Request Method: GET)
http://lifeome.net:9001/f?f=_web_search_single_gene/<geneSymbol>/exp.csv
Examples
GET: http://lifeome.net:9001/f?f=_web_search_single_gene/BRCA1/exp.csv
GET: http://lifeome.net:9001/f?f=_web_search_single_gene/TIMELESS/exp.csv
GET: http://lifeome.net:9001/f?f=_web_search_single_gene/TOP2A/exp.csv

"<geneSymbol>" can be any valid HGNC gene symbol.

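The two single-gene endpoints (merged network and expression) share one URL pattern, so one helper can serve both, as sketched below with the Python standard library. The function names are hypothetical, and the structure of the returned JSON and the CSV columns are not documented here, so the loaders return the parsed content without interpreting it. UTF-8 encoding of the CSV is an assumption.

```python
import csv
import io
import json
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "http://lifeome.net:9001"

def single_gene_url(gene_symbol, file_name):
    """file_name is "network.json" or "exp.csv", as in the endpoints above."""
    return f"{BASE_URL}/f?f=_web_search_single_gene/{quote(gene_symbol, safe='')}/{file_name}"

def load_network(gene_symbol):
    """GET the merged network of a gene and parse it as JSON."""
    with urlopen(single_gene_url(gene_symbol, "network.json")) as response:
        return json.load(response)

def load_expression(gene_symbol):
    """GET the expression table of a gene and return it as a list of CSV rows."""
    with urlopen(single_gene_url(gene_symbol, "exp.csv")) as response:
        text = response.read().decode("utf-8")   # assumed encoding
    return list(csv.reader(io.StringIO(text)))
```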

4.Querying Dataset Survival
URL(Request Method: POST)
http://lifeome.net:9001/datasetSurvival/list
{
datasetId:<datasetId>,
limit:<limit>,
page:<page>
}
Examples
POST: http://lifeome.net:9001/datasetSurvival/list
{
datasetId:"brca_tcga_pan_can_atlas_2018",
limit:10,
page:1
}

"<limit>" is the number of records to return on the current "<page>".

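The POST request can be prepared with the Python standard library as sketched below. Sending the parameters as a JSON body with a "Content-Type: application/json" header is an assumption, since the request encoding is not stated above; the function name is hypothetical, and the response format is not documented here.

```python
import json
from urllib.request import Request, urlopen

SURVIVAL_URL = "http://lifeome.net:9001/datasetSurvival/list"

def survival_request(dataset_id, limit=10, page=1):
    """Build (but do not send) the POST request for one page of survival records."""
    body = json.dumps({"datasetId": dataset_id, "limit": limit, "page": page})
    return Request(
        SURVIVAL_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},   # assumed request encoding
        method="POST",
    )

def fetch_survival(dataset_id, limit=10, page=1):
    """Send the request and return the raw response text."""
    with urlopen(survival_request(dataset_id, limit, page)) as response:
        return response.read().decode("utf-8")
```

Increasing `page` while keeping `limit` fixed walks through the records one page at a time.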




Publications


Please access this website to learn about the publications of our lab.