ageas.Launch()
This notebook demonstrate how to use ageas.Launch() to launch AGEAS in extracting key genetic regulatory elements from RNA-seq based Gene Expression Matrices (GEMs) in differentiating sample classes.
[ ]:
import ageas
Currently, AGEAS support input data under two different formats:
‘gem_files’: Gene expression of all classes are presented as dataframes under CSV or TXT format with rows representing genes and columns representing samples.
Example:
SRR1039509
SRR1039512
SRR1039513
SRR1039516
SRR1039508
ENSG00000000003
679
448
873
408
1138
ENSG00000000005
0
0
0
0
0
ENSG00000000419
467
515
621
365
587
ENSG00000000457
260
211
263
164
245
ENSG00000000460
60
55
40
35
78
ENSG00000000938
0
0
2
0
1
Genes must either be named with official gene symbols or Ensembl gene IDs(ENS***).
There is no requirement for sample name type. Barcodes, numbers, any artificial names can work.
‘mex_folders’: Each folder containing Market Exchange Format (MEX) files output by cellranger pipeline representing samples of same class. For more information: https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/output/matrices
Mouse Embryonic Fibroblast(MEF) vs Embryonic Stem Cells(ESC) example:
This example represent how to use ageas.Launch() with ‘gem_files’.
Here, we attempt to extract key genetic factors to perform cell reprogramming from MEF into Induced Pluripotent Stem Cell(iPSC), one of the most well known cell reprogramming case, with AGEAS.
The scRNA-seq based gene expression data for both MEF and ESC are retrieved from GSE103221.
Either raw data in GSE103221_RAW.tar or normalized counts in GSE103221_normalized_counts.csv.gz can be processed with AGEAS.
For using raw data
With all setting remain default, we can launch AGEAS as:
[ ]:
ageas.Launch(
# since 'gem_files' is the default setting, this line can be deleted
database_type = 'gem_files',
# ageas.Data_Preprocess() args
class1_path = 'GSE103221_RAW/GSM3629847_10x_osk_mef.csv.gz',
class2_path = 'GSE103221_RAW/GSM3629848_10x_osk_esc.csv.gz',
)
However, section above could be too computational expensive for PC!
Few adjustments can be made:
[ ]:
test_raw = ageas.Launch(
protocol = 'multi',
unit_num = 4,
# ageas.Data_Preprocess() args
class1_path = 'GSE103221_RAW/GSM3629847_10x_osk_mef.csv.gz',
class2_path = 'GSE103221_RAW/GSM3629848_10x_osk_esc.csv.gz',
std_value_thread = 3.0,
)
With std_value_thread = 3.0, more genes with relatively low expression variability will be ruled out during meta-processing and, thus, limit amount of Gene Regulatory Pathways(GRPs) in meta level processed Gene Regulatory Network(GRN) and pseudo-sample GRNs.
Instead of using only one AGEAS extractor units by default, four units are used with unit_num = 4. Result now is generalized from GRPs extracted with every unit.
To reduce running time, protocol = ‘multi’ set units to run parallelly with multithreading.
For more API information, please visit documentaion page.
Extraction reports can be saved as files with:
[ ]:
test_raw.save_reports(
folder_path = 'report_files/',
save_unit_reports = True
)
Within folder report_files, there should have following files:
report_files/
│
├─ no_1/
│ ├─ grps_importances.txt
│ ├─ outlier_grps.js
│
├─ no_2/
| ├─ ...
│
├─ no_3/
| ├─ ...
│
├─ no_4/
| ├─ ...
│
├─ full_atlas.js
├─ grp_scores.csv
├─ key_atlas.js
├─ meta_GRN.js
├─ meta_report.csv
├─ pseudo_sample_GRNs.js
├─ report.csv
Folders no_1, no_2, no_3, no_4 contain GRP importance scores returned by each extractor unit as grps_importances.csv which has GRPs ranked with importance scores and outlier_grps.js which has GRPs once removed during feature selection due to extremly high importance score. If these information not needed, keep save_unit_reports as False by default.
full_atlas.js: networks reconstructed with every important GRP extracted by every extractor unit.
grp_scores.csv: GRPs with max importance score each can achieve after extracted by every extractor unit.
key_atlas.js: pruned networks reconstructed only with genes being capable to regulate other genes in full atlas.
meta_GRN.js: meta-level processed GRN cast with all samples.
meta_report.csv: summary of every gene in meta-processed GRN. By default, records are ranked by Degree. Top few rows should look like:
ID |
Gene Symbol |
Type |
Degree |
Log2FC |
---|---|---|---|---|
Pou5f1 |
Pou5f1 |
TF |
786 |
18.0654266535883 |
Trim28 |
Trim28 |
TF |
727 |
16.7739633684336 |
Trp53 |
Trp53 |
TF |
725 |
15.9708922902521 |
Rest |
Rest |
TF |
695 |
15.0240141813129 |
Sox2 |
Sox2 |
TF |
687 |
15.4459524481466 |
Junb |
Junb |
TF |
683 |
14.1196706687477 |
Cebpb |
Cebpb |
TF |
682 |
13.8599229728618 |
Type can either be Gene or TF(Transcription Factor).
Degree here stands for degree in graph theory.
Log2FC is calcualted with all expression values in sample classes.
pseudo_sample_GRNs.js: GRNs cast for each pseudo-sample generated with Sliding Window Algorithm.
report.csv: information about key regulatory-source genes in full_atlas.js. By default, records are ranked by Log2FC. Top few rows should look like:
ID |
Gene Symbol |
Network |
Type |
Source_Num |
Target_Num |
Meta_Degree |
Log2FC |
---|---|---|---|---|---|---|---|
Pou5f1 |
Pou5f1 |
network_0 |
TF |
5 |
68 |
786 |
18.065426653588275 |
Klf2 |
Klf2 |
network_0 |
TF |
0 |
71 |
592 |
17.77230812962195 |
Trim28 |
Trim28 |
network_0 |
TF |
8 |
136 |
727 |
16.77396336843356 |
Trp53 |
Trp53 |
network_0 |
TF |
7 |
133 |
724 |
15.97089229025213 |
Sox2 |
Sox2 |
network_0 |
TF |
3 |
83 |
687 |
15.44595244814661 |
Nanog |
Nanog |
network_0 |
TF |
4 |
81 |
656 |
15.417885470810674 |
Klf4 |
Klf4 |
network_0 |
TF |
3 |
62 |
428 |
15.348589852448784 |
Type can either be Gene or TF(Transcription Factor).
Source-Num and Target_Num indicate amount of directly related regulatory source and target in full atlas.
Meta_Degree and Log2FC here are same with Degree and Log2FC in meta_report.csv.
For using normalized gene counts
We will need to make new gene expression matrices with all MEF samples and ESC samples respectively.
[ ]:
import re
import pandas as pd
data = pd.read_csv('GSE103221_normalized_counts.csv', index_col = 0)
mef_samples = [x for x in data if re.search(r'mef', x)]
esc_samples = [x for x in data if re.search(r'esc', x)]
data[mef_samples].to_csv('mef.csv.gz')
data[esc_samples].to_csv('esc.csv.gz')
Considering the sample size of raw data for each class no less than few thousands, generating pseudo-samples, which abstracts gene expressions from several distinct samples as continuous expression data in order to calculate gene expression correlations, with every distinct 100 samples by default setting should be acceptable.
However, in this normalized data, only dozens of samples clearly labeled as MEF or ESC. To make at least dozens of pseudo-samples, we can adjust few arguments for Sliding Window Algorithm.
Furthermore, with normalized expression value, GRP filters shall also be adjusted to keep total amount of GRP after meta-processing reasonable. log2fc_thread can rule out genes and corresponding GRPs based on expression difference.
[ ]:
test_normalized = ageas.Launch(
unit_num = 4,
# ageas.Data_Preprocess() args
class1_path = 'mef.csv.gz',
class2_path = 'esc.csv.gz',
log2fc_thread = 3,
std_value_thread = 100,
sliding_window_size = 10,
sliding_window_stride = 1,
)
test_normalized.save_reports(
folder_path = 'report_files/'
)
Files within folder report_files are under same structure and formats described above.
Carbon Tetrachloride(CCl4) induced liver fibrosis example:
This example represent how to use ageas.Launch() with ‘mex_folders’.
Stimulating portal fibroblasts(PFs) as activated hepatic stellate cells(HSCs), we can try to find key genetic factors in liver fibrosis through extracting key genetic differences among HSCs and PFs.
The scRNA-seq based gene expression data for PF is retrieved from GSM4085627 and data for HSC is retrieved from GSM4085625.
Files retrieved are managed as:
a6w_pf/
├─ GSM4085627_10x_5_barcodes.tsv.gz
├─ GSM4085627_10x_5_genes.tsv.gz
├─ GSM4085627_10x_5_matrix.mtx.gz
a6w_hsc/
├─ GSM4085625_10x_3_barcodes.tsv.gz
├─ GSM4085625_10x_3_genes.tsv.gz
├─ GSM4085625_10x_3_matrix.mtx.gz
Then, we can launch AGEAS with:
[ ]:
test_raw = ageas.Launch(
unit_num = 4,
# ageas.Data_Preprocess() args
database_type = 'mex_folders',
class1_path = 'a6w_hsc/',
class2_path = 'a6w_pf/',
)
test_normalized.save_reports(
folder_path = 'report_files/'
)
Files within folder report_files are under same structure and formats described in example above.
And then…
We can visualize each network as a graph with ageas.Plot()