BCMA: breast cancer molecular atlas

Processes

1. Preparing data

Data on this website are downloaded from various sources. We require the following types of data and process them as described below to fit our database.

1.1 Clinical tables

We prepare three tables. The first describes the breast cancer cases. For each case we collect basic characteristics (age, sex, height, weight, race, etc.), clinical diagnosis (receptor status, tumor type, grade, stage, etc.), treatment (hormone therapy, radiotherapy, surgery, etc.) and prognostic indicators (survival information, etc.). For features shown on the webpage, missing numeric values are set to "NA" and missing categorical values are left blank. A case may have several samples that correspond to different sampling sites or stages. Because genomic data are measured on samples, we obtain sample descriptions, or split the case descriptions, so that they match the genomic data. This is the second table.
The third table records the correspondence between clinical samples and genomic data.

1.2 Simple somatic mutation

Simple somatic mutations are inferred from DNA-seq or microarray data. We keep them in a MAF-formatted file (https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/). The first 32 columns are required. Genome coordinates can be Ensembl GRCh37 or GRCh38, or UCSC hg19 or hg38. A Hugo symbol is corrected if it cannot be mapped to an Entrez ID but the accompanying Entrez ID can be mapped to a symbol. Entrez IDs are corrected in a similar way. Some datasets supply no Entrez IDs. For these, we map the symbols to hgnc_symbol, external_synonym, entrezgene_accession and ensembl_gene_id (biomaRt attributes) in that order and treat the first mapping result as the standard symbol.
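The sequential lookup described above can be sketched as follows. This is a minimal illustration, not the project's code: the function name and the tiny lookup tables are hypothetical stand-ins for tables that would be exported from biomaRt.

```python
# Attribute names come from the text; they are searched in this order.
LOOKUP_ORDER = ["hgnc_symbol", "external_synonym", "entrezgene_accession", "ensembl_gene_id"]

def standardize_symbol(symbol, tables, order=LOOKUP_ORDER):
    """Return the first mapping found when searching the tables in order, else None."""
    for attribute in order:
        hit = tables.get(attribute, {}).get(symbol)
        if hit is not None:
            return hit
    return None

# Illustrative tables only: each maps an input identifier to a standard symbol.
tables = {
    "hgnc_symbol": {"TP53": "TP53"},
    "external_synonym": {"P53": "TP53", "HER2": "ERBB2"},
    "entrezgene_accession": {},
    "ensembl_gene_id": {"ENSG00000012048": "BRCA1"},
}

print(standardize_symbol("HER2", tables))
print(standardize_symbol("ENSG00000012048", tables))
```

Symbols that match none of the tables return None here; how such symbols are handled in the real pipeline is not stated in this document.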

1.3 Gene fusion

Gene fusions are inferred from DNA-seq or RNA-seq data. We keep them in a tab-separated text file whose first column is "Hugo_Symbol", second is "Tumor_Sample_Barcode" and third is "Fusion", where each fusion is named in "Left Gene Symbol-Right Gene Symbol" format. Gene symbols are processed as described in section "1.2".

1.4 Gene level copynumber and copynumber segment

They are inferred from microarray or DNA-seq data. The copynumber segment format is a widely used output of CBS analysis in the DNAcopy R package. Genome coordinates can be Ensembl GRCh37 or GRCh38, or UCSC hg19 or hg38. Segment means may or may not be log-transformed. Gene level copynumbers are stored in a sample-gene matrix. The five values -2, -1, 0, 1 and 2 in the matrix mean homozygous deletion (HOMDEL), hemizygous deletion (HETLOSS), neutral or no change (DIPLOID), gain (GAIN) and high-level amplification (AMP), respectively. Missing values are kept as "NA". Gene symbols are processed as described in section "1.2". In addition, some symbols carry "|chr*" suffixes (usually output by GISTIC2). We keep these suffixes while correcting the symbols. For duplicated symbols, the rounded mean value is calculated.
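Collapsing duplicated symbols to a rounded mean could look like the sketch below. Several details are assumptions, since this document does not specify them: missing values (None here) are ignored when averaging, and ties are rounded with Python's built-in round(). The function name and the example rows are hypothetical.

```python
from collections import defaultdict

def collapse_duplicates(rows):
    """rows: list of (symbol, [code per sample]); None marks a missing value ("NA").
    Returns {symbol: [rounded mean per sample]}, ignoring missing values (assumption)."""
    grouped = defaultdict(list)
    for symbol, values in rows:
        grouped[symbol].append(values)
    collapsed = {}
    for symbol, value_rows in grouped.items():
        merged = []
        # zip(*value_rows) walks the duplicated rows one sample column at a time.
        for column in zip(*value_rows):
            present = [v for v in column if v is not None]
            merged.append(round(sum(present) / len(present)) if present else None)
        collapsed[symbol] = merged
    return collapsed

# Illustrative input: the "|chr17" suffix is kept as part of the symbol.
rows = [
    ("ERBB2|chr17", [2, 1, None]),
    ("ERBB2|chr17", [2, 2, None]),
    ("ERBB2|chr17", [2, 2, None]),
    ("TP53", [-1, 0, -2]),
]
result = collapse_duplicates(rows)
print(result)
```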

1.5 Gene expression and gene expression of normals

Gene expression values are inferred from microarray or RNA-seq data. We keep them in a gene-sample matrix. Values can be log2 intensities (microarray), FPKM (Cufflinks output, etc.), TPM (RSEM output, etc.), or raw count, log2 raw count or log2(raw count + 1) (STAR output, etc.). Missing values are kept as "NA". A Hugo symbol is corrected if it cannot be mapped to an Entrez ID or Ensembl ID but the accompanying Entrez ID or Ensembl ID can be mapped to a symbol. Some datasets supply neither Entrez IDs nor Ensembl IDs. For these, we map the symbols to hgnc_symbol, external_synonym, entrezgene_accession and ensembl_gene_id (biomaRt attributes) in that order and treat the first mapping result as the standard symbol. Expression values are summarized for duplicated symbols. The main matrix holds tumor samples. Some datasets also have a corresponding normal group, which we keep in a separate matrix named "gene expression of normals".

1.6 Gene expression z-score

The gene expression z-score is typically computed for each gene across all samples, including normals. When such data or gene level copynumbers are supplied, we prefer z-scores computed relative to the expression distribution of each gene in tumors that are diploid for that gene. Missing values are kept as "NA". Gene symbols are processed as described in section "1.5".
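A z-score relative to diploid tumors can be sketched as below for a single gene. This is an illustration under assumptions not stated in the document: the sample standard deviation is used, and the result is left missing when fewer than two diploid samples exist or when their values do not vary. The function name is hypothetical.

```python
from statistics import mean, stdev

def diploid_zscores(expression, copy_number):
    """expression: per-sample values for one gene (None = "NA").
    copy_number: per-sample discrete codes for the same gene (0 = DIPLOID).
    Returns per-sample z-scores against the diploid reference, None where undefined."""
    reference = [e for e, c in zip(expression, copy_number) if e is not None and c == 0]
    if len(reference) < 2:
        return [None] * len(expression)
    mu, sigma = mean(reference), stdev(reference)
    if sigma == 0:
        return [None] * len(expression)
    return [None if e is None else (e - mu) / sigma for e in expression]

# Samples 1 and 2 are diploid for the gene, sample 3 is amplified, sample 4 is missing.
z = diploid_zscores([1.0, 3.0, 5.0, None], [0, 0, 2, 0])
print(z)
```

Every sample, diploid or not, is scored against the same reference mean and standard deviation, so an amplified tumor with raised expression gets a large positive z-score.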

1.7 miRNA expression and miRNA expression of normals

miRNA expression values are inferred from microarray or RNA-seq data. We keep them in a miRNA-sample matrix. Values can be log2 intensities, FPKM, TPM, or raw count, log2 raw count or log2(raw count + 1). Missing values are kept as "NA". miRBase IDs are used to identify miRNAs. Expression values are summarized for duplicated IDs. The main matrix holds tumor samples. Some datasets also have a corresponding normal group, which we keep in a separate matrix named "miRNA expression of normals".

1.8 miRNA expression z-score

The miRNA expression z-score is typically computed for each miRNA across normal samples (or across tumor samples if normal samples are not supplied). Missing values are kept as "NA". miRBase IDs are used to identify miRNAs.

1.9 Gene level methylation means and gene level methylation means of normals

Gene level methylation means are inferred from microarray, RRBS or WGBS data. We keep them in a gene-sample matrix. For each gene in each sample, we calculate the mean beta value over the region from 1500bp to -500bp relative to the TSS. Values are normalized to sum to 1,000,000 for each sample. Missing values are kept as "NA". Gene symbols are processed as described in section "1.5". Methylation values are averaged for duplicated symbols. The main matrix holds tumor samples. Some datasets also have a corresponding normal group, which we keep in a separate matrix named "gene level methylation means of normals".

1.10 Gene level methylation z-score

The gene level methylation z-score is typically computed for each gene across normal samples (or across tumor samples if normal samples are not supplied). Missing values are kept as "NA". Gene symbols are processed as described in section "1.5".

2. Counting mutations for a dataset

For simple somatic mutations, we count the number of mutations in each sample to get a mutation count distribution. In each sample, a gene is called mutated if it carries at least one mutation. We count sample frequencies for all genes and list the most frequently mutated genes. We also count sample frequencies for each mutation. To identify the same mutation across datasets, we annotate mutations against GRCh38 (Ensembl release 106) with VEP to obtain HGVS identifiers. Consequently, mutations that cannot be lifted to hg38 (with UCSC liftOver chains) or that lack HGVSc identifiers are filtered out. Mutations found in at least 2 samples are listed in a table with selected VEP annotations, and the most frequent mutations (with non-empty HGVSp) are shown in a figure.
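The three counts described above can be sketched with plain Python collections. The function name, the record layout and the mutation identifiers are hypothetical; in practice the identifier would be the HGVS string produced by VEP.

```python
from collections import Counter

def summarize_mutations(records):
    """records: iterable of (Tumor_Sample_Barcode, Hugo_Symbol, mutation_id) tuples.
    Returns (mutations per sample, samples per mutated gene, mutations seen in >= 2 samples)."""
    per_sample = Counter()
    gene_samples = {}
    mutation_samples = {}
    for sample, gene, mutation in records:
        per_sample[sample] += 1
        # Sets make sure a sample is counted once per gene and once per mutation.
        gene_samples.setdefault(gene, set()).add(sample)
        mutation_samples.setdefault(mutation, set()).add(sample)
    gene_freq = Counter({g: len(s) for g, s in gene_samples.items()})
    recurrent = {m: len(s) for m, s in mutation_samples.items() if len(s) >= 2}
    return per_sample, gene_freq, recurrent

# Illustrative records only.
records = [
    ("S1", "TP53", "TP53:c.524G>A"),
    ("S2", "TP53", "TP53:c.524G>A"),
    ("S1", "PIK3CA", "PIK3CA:c.3140A>G"),
    ("S1", "TP53", "TP53:c.743G>A"),
]
per_sample, gene_freq, recurrent = summarize_mutations(records)
print(per_sample.most_common(), gene_freq.most_common(), recurrent)
```

A sample with two different mutations in the same gene adds 2 to its mutation count but only 1 to the sample frequency of that gene.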

3. Counting fusions for a dataset

Two gene fusions are treated as the same fusion if their left genes are the same and their right genes are the same. We count sample frequencies for all fusions, list fusions that affect at least 2 samples in a table, and show the most frequent fusions in a figure.

4. Counting CNAs for a dataset

For gene level copynumbers, we count sample frequencies of the 4 CNA types (HOMDEL, HETLOSS, GAIN, AMP) for each gene and list HOMDEL/AMP/total samples in a table. The figure shows the most frequently altered genes with either AMP or HOMDEL CNAs. We also plot copynumber segments here with the R package copynumber.
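Counting the CNA types per gene from the discrete sample-gene matrix could be sketched as follows. The function names and the nested-dictionary layout are hypothetical, and ranking genes by the sum of AMP and HOMDEL samples is an assumption about how "top frequent" is defined.

```python
CNA_LABELS = {-2: "HOMDEL", -1: "HETLOSS", 1: "GAIN", 2: "AMP"}

def count_cnas(matrix):
    """matrix: {sample: {gene: code}} with codes -2..2, or None for "NA".
    Returns {gene: {"HOMDEL": n, "HETLOSS": n, "GAIN": n, "AMP": n}}."""
    counts = {}
    for calls in matrix.values():
        for gene, code in calls.items():
            gene_counts = counts.setdefault(gene, dict.fromkeys(CNA_LABELS.values(), 0))
            label = CNA_LABELS.get(code)  # DIPLOID (0) and missing values have no label
            if label is not None:
                gene_counts[label] += 1
    return counts

def top_altered(counts, n=10):
    """Genes ranked by the number of samples with AMP or HOMDEL (assumed ranking)."""
    return sorted(counts, key=lambda g: counts[g]["AMP"] + counts[g]["HOMDEL"], reverse=True)[:n]

matrix = {
    "S1": {"ERBB2": 2, "PTEN": -2, "TP53": -1},
    "S2": {"ERBB2": 2, "PTEN": 0, "TP53": None},
    "S3": {"ERBB2": 1, "PTEN": -2, "TP53": 0},
}
counts = count_cnas(matrix)
print(counts)
print(top_altered(counts, 2))
```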

5. Summarizing genomic events for a dataset

We use an OncoGrid to present all of the above DNA events. For simple somatic mutations, we map the Variant_Classification column of the MAF file to the following consequence markers: Frameshift, Inframe, Missense, Nonsense, Tss, Nonstop, Splice, RNA, Others. "Others" is not shown in the figure. Gene fusions are marked FUSION. For gene level copynumbers, we mark HOMDEL and AMP and filter out the other types. Some important clinical traits are also listed in the figure.
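One plausible way to express the mapping is a lookup table with "Others" as the fallback. The marker names come from the text, and the keys are standard MAF Variant_Classification values, but the exact assignment below is an assumption: this document does not list which classification maps to which marker. The function names are hypothetical.

```python
# Assumed assignment of MAF Variant_Classification values to consequence markers.
MARKERS = {
    "Frame_Shift_Del": "Frameshift",
    "Frame_Shift_Ins": "Frameshift",
    "In_Frame_Del": "Inframe",
    "In_Frame_Ins": "Inframe",
    "Missense_Mutation": "Missense",
    "Nonsense_Mutation": "Nonsense",
    "Translation_Start_Site": "Tss",
    "Nonstop_Mutation": "Nonstop",
    "Splice_Site": "Splice",
    "RNA": "RNA",
}

def consequence_marker(variant_classification):
    """Anything not listed above (e.g. Silent, Intron, 3'UTR) becomes "Others"."""
    return MARKERS.get(variant_classification, "Others")

def oncogrid_events(records):
    """records: (sample, gene, Variant_Classification) tuples. "Others" is dropped."""
    events = []
    for sample, gene, classification in records:
        marker = consequence_marker(classification)
        if marker != "Others":
            events.append((sample, gene, marker))
    return events

events = oncogrid_events([
    ("S1", "TP53", "Missense_Mutation"),
    ("S1", "TTN", "Silent"),
    ("S2", "GATA3", "Frame_Shift_Ins"),
])
print(events)
```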

6. Calculating differentially expressed genes for a dataset

We use the R package edgeR to calculate differentially expressed genes. Genes expressed in at least one sample are included. We then calculate the fold change, p-value and BH FDR for each gene between tumor and normal samples. We mark genes with log2FC>1 and FDR<0.05 as up-regulated and genes with log2FC<-1 and FDR<0.05 as down-regulated, and list them in a table. We use the R package clusterProfiler (with Ensembl 106 in biomaRt) to calculate KEGG and GO enrichment for the differentially expressed genes (both up- and down-regulated). The top 30 enriched KEGG pathways are shown in a bubble plot. The top 10 GO terms (for BP, CC and MF respectively) are shown in a bar plot.
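The thresholding step can be illustrated independently of edgeR. The sketch below takes p-values and log2 fold changes as given (in the pipeline they come from edgeR), applies the Benjamini-Hochberg adjustment and labels each gene. It does not reproduce edgeR's statistical test, and the function names are hypothetical.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value is monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

def regulation(log2fc, fdr):
    """Thresholds from the text: |log2FC| > 1 and FDR < 0.05."""
    if fdr < 0.05 and log2fc > 1:
        return "up"
    if fdr < 0.05 and log2fc < -1:
        return "down"
    return None

pvalues = [0.01, 0.04, 0.03, 0.005]
log2fcs = [2.3, 0.4, -1.8, -3.1]
fdrs = bh_adjust(pvalues)
labels = [regulation(fc, q) for fc, q in zip(log2fcs, fdrs)]
print(fdrs, labels)
```

The second gene passes the FDR cutoff but not the fold-change cutoff, so it receives no label.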

7. Calculating differentially expressed miRNAs for a dataset

We use the R package edgeR to calculate differentially expressed miRNAs. miRNAs expressed in at least one sample are included. We then calculate the fold change, p-value and BH FDR for each miRNA between tumor and normal samples. We mark miRNAs with log2FC>1 and FDR<0.05 as up-regulated and miRNAs with log2FC<-1 and FDR<0.05 as down-regulated, and list them in a table.

8. Calculating differentially methylated genes for a dataset

We use the R package edgeR to calculate differentially methylated genes. The mean beta value over the region from 1500bp to -500bp relative to the TSS indicates the degree of methylation of each gene. We calculate the fold change, p-value and BH FDR for each gene between tumor and normal samples. We mark genes with log2FC>1 and FDR<0.05 as up-regulated and genes with log2FC<-1 and FDR<0.05 as down-regulated, and list them in a table. We use the R package clusterProfiler (with Ensembl 106 in biomaRt) to calculate KEGG and GO enrichment for the differentially methylated genes (both up- and down-regulated). The top 30 enriched KEGG pathways are shown in a bubble plot. The top 10 GO terms (for BP, CC and MF respectively) are shown in a bar plot.

9. Calculating survival analysis

Survival analysis is done with the R packages survival and survminer. We use several types of genomic data as strata.
For simple somatic mutations, we use "mutated gene" as the stratum. If a gene is mutated in 10%-90% of samples, we calculate the survival significance for this stratum and give the plot.
For gene fusions, we use "fusion gene" as the stratum. We include genes that have fusions in 10%-90% of samples.
For gene level copynumber, we use "CNA gene" as the stratum. If a gene has at least 2 types of CNAs that affect 10%-90% of samples, we use it as a stratum. CNA types can be HOMDEL, HETLOSS, GAIN or AMP.
For gene expression z-scores, miRNA expression z-scores and gene level methylation z-scores, we classify a gene or miRNA as "high" when its z-score>0 and "low" when its z-score<0. If it is marked "high" in 10%-90% of samples and "low" in 10%-90% of samples, we use it as a stratum.
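The eligibility rule for z-score strata can be sketched as a small filter applied before any survival model is fitted. Three details are assumptions: the percentage is taken over all samples including missing ones, the 10% and 90% bounds are inclusive, and a z-score of exactly 0 is assigned to neither group. The function name is hypothetical.

```python
def zscore_strata(zscores, lower=0.10, upper=0.90):
    """Return per-sample "high"/"low" labels for one gene or miRNA if both groups
    cover 10%-90% of samples, otherwise None (feature not used as a stratum)."""
    labels = []
    for z in zscores:
        if z is None or z == 0:
            labels.append(None)
        else:
            labels.append("high" if z > 0 else "low")
    n = len(zscores)
    if n == 0:
        return None
    for group in ("high", "low"):
        fraction = labels.count(group) / n
        if not lower <= fraction <= upper:
            return None
    return labels

balanced = zscore_strata([1.2, -0.5, 0.3, -2.0, 0.8])
skewed = zscore_strata([1.0] * 19 + [-1.0])
print(balanced, skewed)
```

In the second call only 1 of 20 samples (5%) is "low", so the feature is rejected.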

10. Generating genomic tracks for a dataset

Simple somatic mutations in a MAF file are lifted to hg38 (with UCSC liftOver chains). Copynumber segment files are first converted to BED, then lifted to hg38 (with UCSC liftOver chains), and finally converted back. Records that cannot be lifted are discarded. The two resulting files are available as tracks for the IGV genome browser.

11. Counting affected genes across datasets

A gene is treated as affected in a dataset if it has simple somatic mutations, fusions or at least 1 of the 4 types of CNAs in that dataset. The affected genes are merged across datasets to get the final number. We also count the sample frequencies of mutated genes (a gene with at least 1 simple somatic mutation in a sample of a dataset) across datasets. The top mutated genes across datasets are shown in a stacked bar plot. The top mutated genes in Chinese cohorts are listed in the same way, counting only datasets with Chinese cohorts.

12. Counting mutations/fusions/CNAs across datasets

We combine the sample frequencies of mutations from step "Counting mutations for a dataset" over all datasets to get sample frequencies across datasets. Mutations found in only 1 sample across all datasets are not listed in tables but are counted in the total mutation number. We combine the results of step "Counting fusions for a dataset" to get sample frequencies of gene fusions across datasets. Gene fusions that affect only 1 sample across all datasets are not listed in tables. We combine the results of step "Counting CNAs for a dataset" to get sample frequencies of CNAs across datasets. GAIN and HETLOSS are not listed in tables.

13. Making gene/miRNA expression/methylation patterns on all datasets

We compute z-scores across genes/miRNAs within each sample for all datasets. For a user-specified gene/miRNA, a violin plot shows the difference in expression or methylation means between the case (tumor) and control (normal) groups in all datasets. Note that patterns in the figure can be misleading, since the set of observed genes and the observations themselves carry different biases in different techniques.

14. Generating co-expression network on all datasets

For each dataset, we use gene expression z-scores to calculate pairwise Spearman correlations between genes. The genes are restricted to those in the STRING protein network data (incl. distinction: direct vs. interologs; 9606.protein.links.full.v11.0.txt.gz). We then correct the p-values by their ranks. We use Fisher’s method to combine these p-values over all datasets. The combined p-values are also corrected by their ranks among genes. These corrected p-values are used as thresholds for generating networks. For user-specified genes, direct edges link to genes with p≤0.001. The number of direct edges is restricted to 1-30. Indirect edges are generated in the same way for the genes directly linked to the user-specified genes. The resulting two-step network is plotted on the webpage.
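Fisher's method combines k p-values through the statistic X = -2 * sum(ln p_i), which follows a chi-square distribution with 2k degrees of freedom when the p-values are independent. The sketch below uses the closed form of the chi-square survival function for even degrees of freedom, so it needs only the standard library. It covers the combination step only: the rank-based corrections mentioned above are not reproduced, and the function name is hypothetical.

```python
import math

def fisher_combine(pvalues):
    """Combine p-values (each in (0, 1]) with Fisher's method.
    Returns (statistic, combined p-value)."""
    k = len(pvalues)
    statistic = -2.0 * sum(math.log(p) for p in pvalues)
    half = statistic / 2.0
    # Chi-square survival function with 2k degrees of freedom:
    # exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return statistic, math.exp(-half) * total

# One gene pair with a p-value from each of two datasets.
statistic, combined = fisher_combine([0.05, 0.05])
print(statistic, combined)
```

With a single p-value the method returns that p-value unchanged, which is a convenient sanity check.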

Programmatic APIs

The BCMA APIs are a set of RESTful endpoints (programmable interfaces over the Web) that allow third-party developers to build automation scripts and apps. Please see the following API endpoint documentation for detailed information on the endpoints, their representations and how the API responds to different requests.

1. Downloading Dataset Files

URL (Request Method: GET)

http://lifeome.net:9001/download?f=<datasetId>/<datasetId>_download.tar.gz

Examples

GET: http://lifeome.net:9001/download?f=fuscctnbc/_download.tar.gz
GET: http://lifeome.net:9001/download?f=brca_tcga_pan_can_atlas_2018/_download.tar.gz

"<datasetId>" can be any valid dataset ID, which can be found in the URL of the dataset page.

2. Querying Gene Merged Network

URL (Request Method: GET)

http://lifeome.net:9001/f?f=_web_search_single_gene/<geneSymbol>/network.json

Examples

GET: http://lifeome.net:9001/f?f=_web_search_single_gene/BRCA1/network.json
GET: http://lifeome.net:9001/f?f=_web_search_single_gene/TIMELESS/network.json
GET: http://lifeome.net:9001/f?f=_web_search_single_gene/TOP2A/network.json

"<geneSymbol>" can be any valid HGNC gene symbol.

3. Querying Gene Expression

URL (Request Method: GET)

http://lifeome.net:9001/f?f=_web_search_single_gene/<geneSymbol>/exp.csv

Examples

GET: http://lifeome.net:9001/f?f=_web_search_single_gene/BRCA1/exp.csv
GET: http://lifeome.net:9001/f?f=_web_search_single_gene/TIMELESS/exp.csv
GET: http://lifeome.net:9001/f?f=_web_search_single_gene/TOP2A/exp.csv

"<geneSymbol>" can be any valid HGNC gene symbol.

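The two single-gene endpoints (merged network and gene expression) share one URL pattern, so a small client can build both. The helper names below are hypothetical, and the response format is assumed from the file extensions only (.json and .csv). This is a sketch, not an official client.

```python
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "http://lifeome.net:9001"

def gene_file_url(gene_symbol, filename):
    """Build the URL for a per-gene file such as "network.json" or "exp.csv"."""
    return f"{BASE_URL}/f?f=_web_search_single_gene/{quote(gene_symbol, safe='')}/{filename}"

def fetch_text(url, timeout=30):
    """Send a GET request and return the body decoded as UTF-8 (assumed encoding)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")

network_url = gene_file_url("BRCA1", "network.json")
expression_url = gene_file_url("TOP2A", "exp.csv")
print(network_url)
print(expression_url)
# To download, call for example: fetch_text(expression_url)
```
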
4. Querying Dataset Survival

URL (Request Method: POST)

http://lifeome.net:9001/datasetSurvival/list
{
datasetId:<datasetId>,
limit:<limit>,
page:<page>
}

Examples

POST: http://lifeome.net:9001/datasetSurvival/list
{
datasetId:"brca_tcga_pan_can_atlas_2018",
limit:10,
page:1
}

"<limit>" is the number of records to retrieve for the current "<page>".

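A POST request for the survival endpoint could be prepared as below. The body encoding is an assumption: this document shows the three fields but does not say whether the server expects JSON or form data, so JSON with an application/json content type is used here. The function name is hypothetical.

```python
import json
from urllib.request import Request, urlopen

SURVIVAL_URL = "http://lifeome.net:9001/datasetSurvival/list"

def survival_request(dataset_id, limit=10, page=1):
    """Prepare (but do not send) the paged survival query for one dataset."""
    payload = {"datasetId": dataset_id, "limit": limit, "page": page}
    return Request(
        SURVIVAL_URL,
        data=json.dumps(payload).encode("utf-8"),  # assumed JSON body
        headers={"Content-Type": "application/json"},
        method="POST",
    )

request = survival_request("brca_tcga_pan_can_atlas_2018", limit=10, page=1)
print(request.get_method(), request.full_url, request.data)
# To send it, call for example: urlopen(request, timeout=30).read()
```

To walk through all records, increase "page" while keeping "limit" fixed until the server returns no more records. How the server signals the last page is not described here.
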
Software/Database

This section features key databases and software tools developed by the project for tumor microenvironment exploration, single-cell multi-omics, and spatial transcriptomics.

I. Optimization Models and Algorithms for Combinatorial Problems in Multi-omics Data Integration and Analysis

Software/Database	Access	Key Publication
ARBic	Source Code	Liu et al., NAR Genom. Bioinform. (2023)
NoiBic	Source Code
MarsGT	Source Code	Wang et al., Nat. Commun. (2024)
SemiLT	Source Code	Chen et al., Adv. Sci. (2025)
DeepGFT	Source Code	Sun et al., Genome Biol. (2025)
TransMeta	Source Code	Yu et al., Genome Res. (2022)
DESSO-DB	Web Database	Wang et al., Comput. Struct. Biotechnol. J. (2022)
TESA	Source Code	Li et al., Patterns (2024)
CEMIG	Source Code	Wang et al., Brief. Bioinform. (2024)
MPCHG	Not available	Wu et al., Front. Genet. (2024)
HycDemux	Source Code	Han et al., Genome Biol. (2023)
TDFPS-Designer	Source Code	Qi et al., Genome Biol. (2024)
PRO	Source Code	Yu et al., Fundam. Res. (2024)
DriverMP	Web Server	Liu et al., GigaScience (2023)
Spatom	Web Server	Wu et al., Brief. Bioinform. (2023)
TriNet	Web Server	Zhou et al., Patterns (2023)
GraphRBF	Web Server	Zhang et al., GigaScience (2024)
GeoNet	Web Server	Han et al., Structure (2024)
SpatConv	Web Server	Guan et al., Research (2025)
PepNet	Web Server	Han et al., Commun. Biol. (2024)
SpatPPI	Web Server	Xu et al., Genome Biol. (2025)
nTChap	Source Code	Gao et al., BMC Genomics (2026)

II. Graph Models and Algorithms for Cancer Regulatory Network Construction and Pathway Analysis

Software/Database	Access	Key Publication
CoxReg	Source Code	Li et al., J. Transl. Med. (2021)
ComCovEx	Source Code	Gao et al., Global Chall. (2021)
MOLI	Source Code	Yu et al., Physica A (2022)
RE-GOA	Source Code	Lu et al., Bioinformatics (2022)
IPJGL	Source Code	Leng et al., Bioinformatics (2022)
mRank	Source Code	Shang et al., Comput. Struct. Biotechnol. J. (2022)
LncPNet	Source Code	Zhao et al., Front. Genet. (2022)
PST-PRNA	Web Server	Li et al., Bioinformatics (2022)
CNet-RLR	Source Code	Li et al., Appl. Intell. (2022)
GENELink	Source Code	Chen et al., Bioinformatics (2022)
GreyNet	Source Code	Chen et al., Front. Bioeng. Biotechnol. (2022)
NAE	Source Code	Wang et al., IEEE JBHI (2022)
IBTA	Source Code	Leng et al., Brief. Bioinform. (2022)
NR	Source Code	Yu et al., Physica A (2023)
PENCIL	Source Code	Ren et al., Nat. Mach. Intell. (2023)
PathExpSurv	Source Code	Hou et al., BMC Bioinformatics (2023)
LogBTF	Source Code	Li et al., Bioinformatics (2023)
GeoBind	Source Code	Li et al., Nucleic Acids Res. (2023)
PKI	Source Code	Wang et al., BBA-Gene Regul. Mech. (2023)
CNet-SVM	Source Code	Li et al., Expert Syst. Appl. (2023)
EFSmarker	Source Code	Li et al., Curr. Bioinform. (2023)
CNEReg	Web Tool	Pan et al., Genomics Proteomics Bioinformatics (2023)
RENDOR	Source Code	Yu et al., Bioinformatics (2024)
GeneLink+	Source Code	Zhang et al., Brief. Bioinform. (2025)
RegGAIN	Source Code	Guan et al., Adv. Sci. (2025)
NetWalkRank	Source Code	Keikha et al., IEEE TCBB (2025)
MOFNet	Source Code	Zhang et al., Mol. Omics (2025)
MTPrior	Source Code	Keikha & Liu, IEEE JBHI (2025)
SCILD	Source Code	Yu et al., Commun. Biol. (2026)
HyperNetWalk	Source Code	Xu et al., arXiv (2026)

III. Breast Cancer Multi-omics Integration Analysis Platform and Intelligent Algorithms

Software/Database	Access	Key Publication
scCancer2	Web Server	Chen et al., Bioinformatics (2024)
OWAS	Source Code	Song et al., Bioinformatics (2021)
CITEdb	Web Database	Shan et al., Bioinformatics (2022)
MAT2	Source Code	Zhang et al., Bioinformatics (2021)
metaMIC	Source Code	Lai et al., Genome Biol. (2022)
stSurvTrans	Source Code
HighlyRegionalGenes	Web Server	Wu et al., J. Genet. Genomics (2022)
scStateDynamics	Web Server	Guo et al., Genome Biol. (2024)
TIST	Web Server	Shan et al., GPB (2022)
scLT-kit	Source Code	Guo et al., Front. Comput. Sci. (2025)

Publications

Please access this website to learn about the publications of our lab.
