Processes


1. Preparing data

Data on this website are downloaded from various sources. We require the following types of data and process them to fit our database.

1.1 Clinical tables

We prepare three tables here. The first describes the cases (patients with breast cancer). We collect their basic information (age, sex, height, weight, race, etc.), clinical diagnosis (receptor status, tumor type, grade, stage, etc.), treatment programs (hormone therapy, radiotherapy, surgery, etc.) and prognostic indicators (various survival information, etc.). For features shown on the webpage, missing numeric values are set to "NA" and missing categorical values are left blank. Some cases have several samples, corresponding to different sampling sites or stages. Because genomic data are measured on samples, we collect sample descriptions, or split case descriptions, to match the genomic data. This is the second table.
We use the third table to record the correspondence between clinical samples and genomic data.

1.2 Simple somatic mutation

Simple somatic mutations are inferred from DNA-seq or microarray data. We keep them in a MAF-formatted file (https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/). The first 32 columns are required. The genome coordinates can be Ensembl GRCh37 or GRCh38, or UCSC hg19 or hg38. A Hugo symbol is corrected if it cannot be mapped to an Entrez ID but the supplied Entrez ID can be mapped to a symbol. Entrez IDs are corrected in a similar way. Some datasets supply no Entrez IDs. For these, we map the symbols to hgnc_symbol, external_synonym, entrezgene_accession and ensembl_gene_id (biomaRt attributes) in that order and take the first successful mapping as the standard symbol.
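
The sequential lookup described above can be sketched as follows. This is only an illustration in Python, not the site's pipeline: the lookup tables are tiny hypothetical dictionaries standing in for tables exported from biomaRt, and the function name standardize_symbol is an assumption.

```python
# Hypothetical lookup tables, one per biomaRt attribute, each mapping an
# input identifier to a standard symbol. Real tables would be exported from
# biomaRt; these few entries are for illustration only.
LOOKUPS = [
    ("hgnc_symbol", {"TP53": "TP53", "BRCA1": "BRCA1"}),
    ("external_synonym", {"P53": "TP53", "BRCC1": "BRCA1"}),
    ("entrezgene_accession", {"ERBB2": "ERBB2"}),
    ("ensembl_gene_id", {"ENSG00000141510": "TP53"}),
]


def standardize_symbol(symbol):
    """Try each table in order and return (standard_symbol, attribute).

    Returns (None, None) when no table contains the symbol.
    """
    for attribute, table in LOOKUPS:
        if symbol in table:
            return table[symbol], attribute
    return None, None
```

Because the tables are tried in a fixed order, an exact hgnc_symbol match always wins over a synonym match, which is the behaviour the text describes as taking the first mapping result.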

1.3 Gene fusion

Gene fusions are inferred from DNA-seq or RNA-seq data. We keep them in a tab-separated text file whose first three columns are "Hugo_Symbol", "Tumor_Sample_Barcode" and "Fusion", where each fusion is named in "Left Gene Symbol-Right Gene Symbol" format. Gene symbols are processed as described in section "1.2".

1.4 Gene level copynumber and copynumber segment

They are inferred from microarray or DNA-seq data. The copynumber segment file is a widely used output format of CBS (circular binary segmentation) analysis in the DNAcopy R package. The genome coordinates can be Ensembl GRCh37 or GRCh38, or UCSC hg19 or hg38. Segment means may or may not be log transformed. Gene level copynumbers are stored in a sample-gene matrix. The five values -2, -1, 0, 1 and 2 in the matrix mean homozygous deletion (HOMDEL), hemizygous deletion (HETLOSS), neutral or no change (DIPLOID), gain (GAIN) and high level amplification (AMP), respectively. Missing values are kept as "NA". Gene symbols are processed as described in section "1.2". In addition, some symbols carry "|chr*" suffixes (usually output by GISTIC2). We keep these suffixes when correcting symbols. For duplicated symbols, the rounded mean value is calculated.
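
As a minimal sketch of the value coding and the handling of duplicated symbols, the Python below collapses duplicated rows into rounded per-sample means. The function name and the input layout are assumptions made for illustration. Note that Python's round() sends exact halves to the nearest even integer, so tie handling may differ from the actual pipeline.

```python
from collections import defaultdict

# Meaning of the five gene level copynumber codes, as listed in the text.
CNA_LABELS = {-2: "HOMDEL", -1: "HETLOSS", 0: "DIPLOID", 1: "GAIN", 2: "AMP"}


def collapse_duplicates(rows):
    """Collapse rows that share a symbol into one row of rounded means.

    rows: list of (symbol, [value per sample]); None marks a missing
    value ("NA") and is ignored when averaging.
    """
    grouped = defaultdict(list)
    for symbol, values in rows:
        grouped[symbol].append(values)

    collapsed = {}
    for symbol, group in grouped.items():
        merged = []
        for column in zip(*group):  # one column per sample
            observed = [v for v in column if v is not None]
            if observed:
                merged.append(round(sum(observed) / len(observed)))
            else:
                merged.append(None)  # stays "NA"
        collapsed[symbol] = merged
    return collapsed
```

Symbols are used as opaque keys here, so a "|chr*" suffix is preserved and rows are only merged when the full symbol, suffix included, is identical.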

1.5 Gene expression and gene expression of normals

Gene expression values are inferred from microarray or RNA-seq data. We keep them in a gene-sample matrix. Values can be log2 intensities (microarray), FPKM (Cufflinks output, etc.), TPM (RSEM output, etc.), or raw counts, log2 raw counts or log2(raw count + 1) (STAR output, etc.). Missing values are kept as "NA". A Hugo symbol is corrected if it cannot be mapped to an Entrez ID/Ensembl ID but the supplied Entrez ID/Ensembl ID can be mapped to a symbol. Some datasets supply neither Entrez IDs nor Ensembl IDs. For these, we map the symbols to hgnc_symbol, external_synonym, entrezgene_accession and ensembl_gene_id (biomaRt attributes) in that order and take the first successful mapping as the standard symbol. Expression values are summarized for duplicated symbols. The main matrix is for tumor samples. Some datasets have a corresponding normal group. We keep it in a separate matrix named gene expression of normals.

1.6 Gene expression z-score

The gene expression z-score is typically computed for each gene across all samples, including normals. When such data or gene level copynumbers are supplied, we prefer z-scores relative to the expression distribution of each gene in tumors that are diploid for that gene. Missing values are kept as "NA". Gene symbols are processed as described in section "1.5".
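
A z-score relative to a reference group (for example, tumors that are diploid for the gene) can be sketched as below. This is an illustration under stated assumptions, not the site's code: the function name is hypothetical, the sample standard deviation is assumed, and genes with fewer than two reference values or zero variance are returned as missing.

```python
from statistics import mean, stdev


def zscores_vs_reference(expression, reference_samples):
    """z-score every sample of one gene against a reference group.

    expression: dict sample -> value (None for "NA").
    reference_samples: samples that define the mean and SD, e.g. tumors
    that are diploid for this gene.
    """
    ref = [expression[s] for s in reference_samples
           if expression.get(s) is not None]
    if len(ref) < 2:
        return {s: None for s in expression}
    mu, sd = mean(ref), stdev(ref)
    if sd == 0:
        return {s: None for s in expression}
    return {s: None if v is None else (v - mu) / sd
            for s, v in expression.items()}
```

Passing every sample as the reference group gives the plain z-score across all samples.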

1.7 miRNA expression and miRNA expression of normals

miRNA expression values are inferred from microarray or RNA-seq data. We keep them in a miRNA-sample matrix. Values can be log2 intensities, FPKM, TPM, or raw counts, log2 raw counts or log2(raw count + 1). Missing values are kept as "NA". miRBase IDs are used to identify miRNAs. Expression values are summarized for duplicated IDs. The main matrix is for tumor samples. Some datasets have a corresponding normal group. We keep it in a separate matrix named miRNA expression of normals.

1.8 miRNA expression z-score

The miRNA expression z-score is typically computed for each miRNA across normal samples (or tumor samples if normal samples are not supplied). Missing values are kept as "NA". miRBase IDs are used to identify miRNAs.

1.9 Gene level methylation means and gene level methylation means of normals

Gene level methylation means are inferred from microarray, RRBS or WGBS data. We keep them in a gene-sample matrix. For each gene in each sample, we calculate the mean beta value over the region from 1500 bp to -500 bp relative to the TSS. Values are normalized to sum to 1,000,000 for each sample. Missing values are kept as "NA". Gene symbols are processed as described in section "1.5". Methylation values are averaged for duplicated symbols. The main matrix is for tumor samples. Some datasets have a corresponding normal group. We keep it in a separate matrix named gene level methylation means of normals.

1.10 Gene level methylation z-score

The gene level methylation z-score is typically computed for each gene across normal samples (or tumor samples if normal samples are not supplied). Missing values are kept as "NA". Gene symbols are processed as described in section "1.5".


2. Counting mutations for a dataset

For simple somatic mutations, we count the number of mutations in each sample to get a mutation count distribution. In each sample, a gene is called mutated if it carries at least one mutation. We count sample frequencies for all genes and list the most frequently mutated genes. We also count sample frequencies for each mutation. To identify the same mutation across datasets, we annotate mutations against GRCh38 (Ensembl release 106) with VEP to obtain HGVS identifiers. Mutations that cannot be lifted to hg38 (with UCSC liftOver chains) or that lack HGVSc identifiers are therefore filtered out. We list mutations found in at least 2 samples in a table with selected VEP annotations, and show the most frequent mutations (with non-empty HGVSp) in a figure.
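
The two counts described above can be sketched in a few lines of Python. The input is assumed to be MAF rows already parsed into dictionaries keyed by the standard MAF column names; the function names are illustrative.

```python
from collections import Counter


def mutation_counts_per_sample(maf_rows):
    """Number of mutation records per sample (the count distribution)."""
    return Counter(row["Tumor_Sample_Barcode"] for row in maf_rows)


def mutated_gene_frequencies(maf_rows):
    """Number of distinct samples in which each gene is mutated.

    A gene counts once per sample, however many mutations it carries.
    """
    pairs = {(row["Hugo_Symbol"], row["Tumor_Sample_Barcode"])
             for row in maf_rows}
    return Counter(gene for gene, _sample in pairs)
```

Calling mutated_gene_frequencies(rows).most_common(20) would then give the 20 most frequently mutated genes; the cut-off of 20 is an arbitrary example.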


3. Counting fusions for a dataset

For gene fusions, two fusions are treated as the same if their left genes match and their right genes match. We count sample frequencies for all fusions, list fusions that affect at least 2 samples in a table, and show the most frequent fusions in a figure.


4. Counting CNAs for a dataset

For gene level copynumbers, we count sample frequencies of the 4 types of CNAs (HOMDEL, HETLOSS, GAIN, AMP) for each gene and list HOMDEL/AMP/total sample counts in a table. The figure shows the genes most frequently affected by AMP or HOMDEL CNAs. We also plot copynumber segments here with the R package copynumber.


5. Summarizing genomic events for a dataset

We use an OncoGrid to present all of the DNA events above. For simple somatic mutations, we convert Variant_Classification (a column in the MAF file) to one of the following consequence markers: Frameshift, Inframe, Missense, Nonsense, Tss, Nonstop, Splice, RNA, Others. "Others" is not shown in the figure. For gene fusions, we mark them FUSION. For gene level copynumbers, we mark HOMDEL and AMP and filter out the other types. Some important clinical traits are also shown in the figure.
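
One plausible way to convert Variant_Classification values to the consequence markers is sketched below. The keys are standard MAF Variant_Classification values, but the assignment of each value to a marker is an assumption made for illustration; the exact mapping used by the site is not documented here.

```python
# Assumed mapping from MAF Variant_Classification values to the
# consequence markers named in the text. Anything not listed is "Others".
CONSEQUENCE = {
    "Frame_Shift_Del": "Frameshift",
    "Frame_Shift_Ins": "Frameshift",
    "In_Frame_Del": "Inframe",
    "In_Frame_Ins": "Inframe",
    "Missense_Mutation": "Missense",
    "Nonsense_Mutation": "Nonsense",
    "Translation_Start_Site": "Tss",
    "Nonstop_Mutation": "Nonstop",
    "Splice_Site": "Splice",
    "RNA": "RNA",
}


def consequence_marker(variant_classification):
    """Return the consequence marker for one Variant_Classification value."""
    return CONSEQUENCE.get(variant_classification, "Others")


def oncogrid_mutation_events(maf_rows):
    """Return (gene, sample, marker) tuples, dropping "Others"."""
    events = []
    for row in maf_rows:
        marker = consequence_marker(row["Variant_Classification"])
        if marker != "Others":
            events.append(
                (row["Hugo_Symbol"], row["Tumor_Sample_Barcode"], marker)
            )
    return events
```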


6. Calculating differentially expressed genes for a dataset

We use the R package edgeR to calculate differentially expressed genes. Genes expressed in at least one sample are included. We then calculate the fold change, p-value and BH FDR for each gene between tumor and normal samples. We mark genes with log2FC>1 and FDR<0.05 as up regulated, and genes with log2FC<-1 and FDR<0.05 as down regulated, and list them in a table. We use the R package clusterProfiler (with Ensembl 106 in biomaRt) to calculate KEGG and GO enrichments for the differentially expressed genes (both up and down regulated). The top 30 enriched KEGG pathways are shown in a bubble plot. The top 10 GO terms (for BP, CC and MF respectively) are shown in a bar plot.
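
The thresholds above can be illustrated with a short Python sketch. It does not reproduce edgeR's statistical test; it only shows the Benjamini-Hochberg (BH) adjustment and the up/down labelling applied to fold changes and p-values that are assumed to be already available. The function names are illustrative.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values (FDR) in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values are
    # monotone in the rank of the raw p-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def classify(log2fc, fdr, lfc_cut=1.0, fdr_cut=0.05):
    """Label a gene "up", "down" or "ns" using the thresholds in the text."""
    if fdr < fdr_cut and log2fc > lfc_cut:
        return "up"
    if fdr < fdr_cut and log2fc < -lfc_cut:
        return "down"
    return "ns"
```

For example, a gene with log2FC = 2.3 and FDR = 0.01 is labelled "up", while the same fold change with FDR = 0.2 is labelled "ns".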


7. Calculating differentially expressed miRNAs for a dataset

We use the R package edgeR to calculate differentially expressed miRNAs. miRNAs expressed in at least one sample are included. We then calculate the fold change, p-value and BH FDR for each miRNA between tumor and normal samples. We mark miRNAs with log2FC>1 and FDR<0.05 as up regulated, and miRNAs with log2FC<-1 and FDR<0.05 as down regulated, and list them in a table.


8. Calculating differentially methylated genes for a dataset

We use the R package edgeR to calculate differentially methylated genes. The mean beta value over the region from 1500 bp to -500 bp relative to the TSS indicates the degree of methylation of each gene. We calculate the fold change, p-value and BH FDR for each gene between tumor and normal samples. We mark genes with log2FC>1 and FDR<0.05 as up regulated, and genes with log2FC<-1 and FDR<0.05 as down regulated, and list them in a table. We use the R package clusterProfiler (with Ensembl 106 in biomaRt) to calculate KEGG and GO enrichments for the differentially methylated genes (both up and down regulated). The top 30 enriched KEGG pathways are shown in a bubble plot. The top 10 GO terms (for BP, CC and MF respectively) are shown in a bar plot.


9. Performing survival analysis

Survival analysis is done with the R packages survival and survminer. We use several types of genomic data as strata. For simple somatic mutations, the stratum is "mutated gene": if a gene is mutated in 10%-90% of samples, we calculate the survival significance for this stratum and show the plot. For gene fusions, the stratum is "fusion gene": we include genes that have fusions in 10%-90% of samples. For gene level copynumbers, the stratum is "CNA gene": a gene is used if it has at least 2 types of CNAs that affect 10%-90% of samples. CNA types can be HOMDEL, HETLOSS, GAIN or AMP. For gene expression z-scores, miRNA expression z-scores and gene level methylation z-scores, we classify a gene or miRNA as "high" when its z-score>0 and "low" when its z-score<0. It is used as a stratum if it is marked "high" in 10%-90% of samples and "low" in 10%-90% of samples.
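
The 10%-90% eligibility rule can be sketched as follows. The function names are illustrative, and two details are assumptions not stated in the text: the bounds are treated as inclusive, and samples with missing values are left out of the denominator.

```python
def usable_as_stratum(flags, low=0.10, high=0.90):
    """Check whether the fraction of True flags lies within [low, high].

    flags: one value per sample, True/False (e.g. gene mutated or not),
    or None when the sample has no data.
    """
    observed = [f for f in flags if f is not None]
    if not observed:
        return False
    fraction = sum(observed) / len(observed)
    return low <= fraction <= high


def zscore_stratum_usable(zscores):
    """Apply the high (z > 0) / low (z < 0) rule for z-score strata."""
    observed = [z for z in zscores if z is not None]
    is_high = [z > 0 for z in observed]
    is_low = [z < 0 for z in observed]
    return usable_as_stratum(is_high) and usable_as_stratum(is_low)
```

A gene mutated in 3 of 10 samples passes the check, while a gene mutated in 1 of 20 samples (5%) does not.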


10. Generating genomic tracks for a dataset

Simple somatic mutations in a MAF file are lifted to hg38 (with UCSC liftOver chains). Copynumber segment files are first converted to BED, then lifted to hg38 (with UCSC liftOver chains), and finally converted back. Records that cannot be lifted are discarded. The two resulting files are available as tracks for the IGV genome browser.


11. Counting affected genes across datasets

A gene is treated as affected in a dataset if it has simple somatic mutations, fusions or at least 1 of the 4 types of CNAs in that dataset. The affected genes of all datasets are merged (as a set union) to get a final number. We also count sample frequencies for mutated genes (genes with at least 1 simple somatic mutation in a sample of a dataset) across datasets. The top mutated genes across datasets are shown in a stacked bar plot. The top mutated genes in Chinese cohorts are obtained in the same way, counting only datasets with Chinese cohorts.


12. Counting mutations/fusions/CNAs across datasets

We combine the sample frequencies of mutations from step "Counting mutations for a dataset" for all datasets to get sample frequencies across datasets. Mutations found in only 1 sample across all datasets are not listed in tables, but are counted in the total mutation number. We combine the results of step "Counting fusions for a dataset" to get sample frequencies of gene fusions across datasets. Gene fusions that affect only 1 sample across all datasets are not listed in tables. We combine the results of step "Counting CNAs for a dataset" to get sample frequencies of CNAs across datasets. GAIN and HETLOSS are not listed in tables.


13. Making gene/miRNA expression/methylation patterns on all datasets

We compute z-scores across genes/miRNAs within each sample for all datasets. For a user specified gene/miRNA, a violin plot shows the differences in expression/methylation means between case (tumor) and control (normal) groups on all datasets. Note that patterns in the figure can be misleading, since the set of observed genes and the observations themselves carry different biases under different techniques.


14. Generating co-expression network on all datasets

For each dataset, we use gene expression z-scores to calculate pairwise Spearman correlations between genes. The genes are restricted to those in the STRING protein network data (incl. distinction: direct vs. interologs; 9606.protein.links.full.v11.0.txt.gz). We then correct the p-values by their ranks. We use Fisher's method to combine the p-values across all datasets. The combined p-values are also corrected by their ranks among genes. These corrected p-values are used as thresholds for generating networks. For user specified genes, direct edges link to genes with p≤0.001. The number of direct edges is restricted to 1-30. Indirect edges are generated in the same way for the genes that directly link to the user specified genes. This two-step network is plotted on the webpage for the user specified genes.
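
Fisher's method, used above to combine p-values across datasets, can be written with the standard library alone. For k p-values the statistic X = -2 * sum(ln p_i) follows a chi-square distribution with 2k degrees of freedom under the null hypothesis, and for even degrees of freedom the tail probability has a closed form. The sketch below assumes independent p-values that are strictly greater than 0; the rank-based correction mentioned in the text is not specified in enough detail to reproduce and is left out.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable with even df.

    Uses the closed form exp(-x/2) * sum_{i < df/2} (x/2)^i / i!.
    """
    if df <= 0 or df % 2:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


def fisher_combine(pvalues):
    """Combine independent p-values (each > 0) with Fisher's method."""
    if not pvalues:
        raise ValueError("at least one p-value is required")
    statistic = -2.0 * sum(math.log(p) for p in pvalues)
    return chi2_sf_even_df(statistic, 2 * len(pvalues))
```

A single p-value is returned unchanged, and two datasets that each give p = 0.01 for a gene pair combine to roughly 0.001.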




Programmatic APIs


The BCMA APIs are a set of RESTful endpoints (programmable interfaces over the Web) that allow third-party developers to build automation scripts and apps. Please see the following API Endpoint Documentation for detailed information on the API endpoints, representations and how the API responds to different requests.

1. Downloading Dataset Files
URL (Request Method: GET)
http://lifeome.net:9001/download?f=<datasetId>/<datasetId>_download.tar.gz
Examples
GET: http://lifeome.net:9001/download?f=fuscctnbc/_download.tar.gz
GET: http://lifeome.net:9001/download?f=brca_tcga_pan_can_atlas_2018/_download.tar.gz

"<datasetId>" can be any valid dataset ID, which can be obtained from the URL of the dataset.

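As a hedged sketch, a download URL can be built and fetched from Python with the standard library. The helper names are illustrative. The URL follows the template given above; the listed example URLs omit the second dataset ID, so the exact file name is an assumption that should be checked against the server.

```python
from urllib.parse import quote
from urllib.request import urlretrieve

BASE_URL = "http://lifeome.net:9001"


def dataset_download_url(dataset_id):
    """Build the download URL following the documented template."""
    path = f"{dataset_id}/{dataset_id}_download.tar.gz"
    return f"{BASE_URL}/download?f={quote(path)}"


def download_dataset(dataset_id, destination):
    """Save the dataset archive to a local file (performs a network call)."""
    urlretrieve(dataset_download_url(dataset_id), destination)
```

For example, download_dataset("fuscctnbc", "fuscctnbc.tar.gz") would save the archive in the current directory.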

2. Querying Gene Merged Network
URL (Request Method: GET)
http://lifeome.net:9001/f?f=_web_search_single_gene/<geneSymbol>/network.json
Examples
GET: http://lifeome.net:9001/f?f=_web_search_single_gene/BRCA1/network.json
GET: http://lifeome.net:9001/f?f=_web_search_single_gene/TIMELESS/network.json
GET: http://lifeome.net:9001/f?f=_web_search_single_gene/TOP2A/network.json

"<geneSymbol>" can be any valid HGNC gene symbol.


3. Querying Gene Expression
URL (Request Method: GET)
http://lifeome.net:9001/f?f=_web_search_single_gene/<geneSymbol>/exp.csv
Examples
GET: http://lifeome.net:9001/f?f=_web_search_single_gene/BRCA1/exp.csv
GET: http://lifeome.net:9001/f?f=_web_search_single_gene/TIMELESS/exp.csv
GET: http://lifeome.net:9001/f?f=_web_search_single_gene/TOP2A/exp.csv

"<geneSymbol>" can be any valid HGNC gene symbol.

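The two per-gene endpoints (merged network and expression) share one URL pattern, so a single helper can build both. The sketch below uses only the standard library; the helper names are illustrative, and the structure of the returned JSON and CSV is not documented here, so the code returns the parsed content without interpreting it.

```python
import csv
import io
import json
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "http://lifeome.net:9001"


def gene_file_url(gene_symbol, filename):
    """Build the URL for network.json or exp.csv of one gene."""
    if filename not in ("network.json", "exp.csv"):
        raise ValueError("filename must be network.json or exp.csv")
    symbol = quote(gene_symbol, safe="")
    return f"{BASE_URL}/f?f=_web_search_single_gene/{symbol}/{filename}"


def fetch_network(gene_symbol, timeout=30):
    """Return the parsed merged network JSON (performs a network call)."""
    url = gene_file_url(gene_symbol, "network.json")
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def fetch_expression_rows(gene_symbol, timeout=30):
    """Return exp.csv as a list of rows (performs a network call)."""
    url = gene_file_url(gene_symbol, "exp.csv")
    with urlopen(url, timeout=timeout) as response:
        text = response.read().decode("utf-8")
    return list(csv.reader(io.StringIO(text)))
```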

4. Querying Dataset Survival
URL (Request Method: POST)
http://lifeome.net:9001/datasetSurvival/list
{
datasetId:<datasetId>,
limit:<limit>,
page:<page>
}
Examples
POST: http://lifeome.net:9001/datasetSurvival/list
{
datasetId:"brca_tcga_pan_can_atlas_2018",
limit:10,
page:1
}

"<limit>" is the number of records to retrieve on the current "<page>".

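A request for this endpoint might be prepared as shown below. The documentation above does not state how the body is encoded, so sending it as JSON with a Content-Type of application/json is an assumption; a form-encoded body may be required instead. The helper name is illustrative.

```python
import json
from urllib.request import Request

SURVIVAL_URL = "http://lifeome.net:9001/datasetSurvival/list"


def survival_list_request(dataset_id, limit=10, page=1):
    """Prepare (but do not send) the POST request for one result page."""
    payload = {"datasetId": dataset_id, "limit": limit, "page": page}
    return Request(
        SURVIVAL_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},  # assumed encoding
        method="POST",
    )
```

The prepared request can be sent with urllib.request.urlopen(survival_list_request("brca_tcga_pan_can_atlas_2018")), and later pages are requested by increasing page.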

Software/Database


This section features key databases and software tools developed by the project for tumor microenvironment exploration, single-cell multi-omics, and spatial transcriptomics.

I. Optimization Models and Algorithms for Combinatorial Problems in Multi-omics Data Integration and Analysis

Software/Database | Access | Key Publication
ARBic | Source Code | Liu et al., NAR Genom. Bioinform. (2023)
NoiBic | Source Code |
MarsGT | Source Code | Wang et al., Nat. Commun. (2024)
SemiLT | Source Code | Chen et al., Adv. Sci. (2025)
DeepGFT | Source Code | Sun et al., Genome Biol. (2025)
TransMeta | Source Code | Yu et al., Genome Res. (2022)
DESSO-DB | Web Database | Wang et al., Comput. Struct. Biotechnol. J. (2022)
TESA | Source Code | Li et al., Patterns (2024)
CEMIG | Source Code | Wang et al., Brief. Bioinform. (2024)
MPCHG | Not available | Wu et al., Front. Genet. (2024)
HycDemux | Source Code | Han et al., Genome Biol. (2023)
TDFPS-Designer | Source Code | Qi et al., Genome Biol. (2024)
PRO | Source Code | Yu et al., Fundam. Res. (2024)
DriverMP | Web Server | Liu et al., GigaScience (2023)
Spatom | Web Server | Wu et al., Brief. Bioinform. (2023)
TriNet | Web Server | Zhou et al., Patterns (2023)
GraphRBF | Web Server | Zhang et al., GigaScience (2024)
GeoNet | Web Server | Han et al., Structure (2024)
SpatConv | Web Server | Guan et al., Research (2025)
PepNet | Web Server | Han et al., Commun. Biol. (2024)
SpatPPI | Web Server | Xu et al., Genome Biol. (2025)
nTChap | Source Code | Gao et al., BMC Genomics (2026)

II. Graph Models and Algorithms for Cancer Regulatory Network Construction and Pathway Analysis

Software/Database | Access | Key Publication
CoxReg | Source Code | Li et al., J. Transl. Med. (2021)
ComCovEx | Source Code | Gao et al., Global Chall. (2021)
MOLI | Source Code | Yu et al., Physica A (2022)
RE-GOA | Source Code | Lu et al., Bioinformatics (2022)
IPJGL | Source Code | Leng et al., Bioinformatics (2022)
mRank | Source Code | Shang et al., Comput. Struct. Biotechnol. J. (2022)
LncPNet | Source Code | Zhao et al., Front. Genet. (2022)
PST-PRNA | Web Server | Li et al., Bioinformatics (2022)
CNet-RLR | Source Code | Li et al., Appl. Intell. (2022)
GENELink | Source Code | Chen et al., Bioinformatics (2022)
GreyNet | Source Code | Chen et al., Front. Bioeng. Biotechnol. (2022)
NAE | Source Code | Wang et al., IEEE JBHI (2022)
IBTA | Source Code | Leng et al., Brief. Bioinform. (2022)
NR | Source Code | Yu et al., Physica A (2023)
PENCIL | Source Code | Ren et al., Nat. Mach. Intell. (2023)
PathExpSurv | Source Code | Hou et al., BMC Bioinformatics (2023)
LogBTF | Source Code | Li et al., Bioinformatics (2023)
GeoBind | Source Code | Li et al., Nucleic Acids Res. (2023)
PKI | Source Code | Wang et al., BBA-Gene Regul. Mech. (2023)
CNet-SVM | Source Code | Li et al., Expert Syst. Appl. (2023)
EFSmarker | Source Code | Li et al., Curr. Bioinform. (2023)
CNEReg | Web Tool | Pan et al., Genomics Proteomics Bioinformatics (2023)
RENDOR | Source Code | Yu et al., Bioinformatics (2024)
GeneLink+ | Source Code | Zhang et al., Brief. Bioinform. (2025)
RegGAIN | Source Code | Guan et al., Adv. Sci. (2025)
NetWalkRank | Source Code | Keikha et al., IEEE TCBB (2025)
MOFNet | Source Code | Zhang et al., Mol. Omics (2025)
MTPrior | Source Code | Keikha & Liu, IEEE JBHI (2025)
SCILD | Source Code | Yu et al., Commun. Biol. (2026)
HyperNetWalk | Source Code | Xu et al., arXiv (2026)

III. Breast Cancer Multi-omics Integration Analysis Platform and Intelligent Algorithms

Software/Database | Access | Key Publication
scCancer2 | Web Server | Chen et al., Bioinformatics (2024)
OWAS | Source Code | Song et al., Bioinformatics (2021)
CITEdb | Web Database | Shan et al., Bioinformatics (2022)
MAT2 | Source Code | Zhang et al., Bioinformatics (2021)
metaMIC | Source Code | Lai et al., Genome Biol. (2022)
stSurvTrans | Source Code |
HighlyRegionalGenes | Web Server | Wu et al., J. Genet. Genomics (2022)
scStateDynamics | Web Server | Guo et al., Genome Biol. (2024)
TIST | Web Server | Shan et al., GPB (2022)
scLT-kit | Source Code | Guo et al., Front. Comput. Sci. (2025)

Publications


Please access this website to learn about the publications of our lab.