This is the JEBIN algorithm developed for learning the consensus representations of genes by joint embedding of multiple bipitite networks.
Before running JEBIN on Linux system, packages GSL and Eigen are required to be installed. Users must modify the package paths before using the makefile to compile the code.
./JEBIN_linux -consensus_nodes_file consensus_nodes_list.txt -network_filenames_file network_filenames_list.txt -num_network 2 -output_directory output/ -binary 0 -size 200 -negative 5 -samples 200 -threads 10 -gamma 1 -rho 0.025
- -consensus_nodes_file: the consensus gene set
- -network_filenames_file: the file contains all the bipartite network filenames (using absolute paths)
- -num_network: the number of networks
- -output_directory: the output directory of all kinds of the nodes' vectors
- -binary: save the resulting vectors in binary mode; default is 0 (off)
- -size: the dimension of the embedding vectors; default is 200
- -negative: the number of negative examples used in negative sampling; default is 5
- -samples: the total number of training samples (*Million)
- -threads: the total number of threads used; the default is 1
- -gamma: the regularizing coefficient; the default is 1.0
- -rho: the starting value of the learning rate; the default is 0.025
This file contains the union of genes of all the datasets to be integrated (Gene Entrez ID is recommended). An example is shown bellow:
100
1000
10001
10006
10007
10008
10009
100093630
This file contains the absolute paths of all the bipartite network files with each constructed from one gene expression dataset. An example is shown below:
/home/gywu/multiset/data/bulkHCC/edgelist_Dataset1_HCCDB1.txt
/home/gywu/multiset/data/bulkHCC/edgelist_Dataset2_HCCDB13.txt
/home/gywu/multiset/data/bulkHCC/edgelist_Dataset3_HCCDB15.txt
/home/gywu/multiset/data/bulkHCC/edgelist_Dataset4_HCCDB17.txt
/home/gywu/multiset/data/bulkHCC/edgelist_Dataset5_HCCDB18.txt
/home/gywu/multiset/data/bulkHCC/edgelist_Dataset6_HCCDB3.txt
/home/gywu/multiset/data/bulkHCC/edgelist_Dataset7_HCCDB4.txt
/home/gywu/multiset/data/bulkHCC/edgelist_Dataset8_HCCDB6.txt
Each bipartite network file contains the edges between genes (first column) and samples (second column), the last column is the normalized gene expression value. An example is shown below:
100 HCCDB-1.S3 7.5599
100 HCCDB-1.S5 8.3189
100 HCCDB-1.S6 7.4908
100 HCCDB-1.S8 7.1945
100 HCCDB-1.S10 8.104
100 HCCDB-1.S12 7.9868
100 HCCDB-1.S14 8.5352
100 HCCDB-1.S16 7.6303
The /data folder contains an example of the input data.
The /scHCC folder contains the single-cell RNA-seq data of HCC (gene filtered), which is in "rds" format.
The /output folder contains an example of the output results of JEBIN.
- "output_u_consensus.txt": the consensus representation vectors for genes across all networks.
- "output_u_net1.txt": the dataset-specific representation vectors for genes in the first input network.
- "output_v_net1.txt": the dataset-specific representation vectors for samples in the first input network.
Guiying Wu (email: wuguiying_start@163.com)
@article{wu2022jebin,
title={JEBIN: analyzing gene co-expressions across multiple datasets by joint network embedding},
author={Wu, Guiying and Li, Xiangyu and Guo, Wenbo and Wei, Zheng and Hu, Tao and Shan, Yiran and Gu, Jin},
journal={Briefings in Bioinformatics},
volume={23},
number={2},
pages={bbab603},
year={2022},
publisher={Oxford University Press}
}