ageas.Launch()

This notebook demonstrate how to use ageas.Launch() to launch AGEAS in extracting key genetic regulatory elements from RNA-seq based Gene Expression Matrices (GEMs) in differentiating sample classes.

[ ]:

import ageas

Currently, AGEAS support input data under two different formats:

‘gem_files’: Gene expression of all classes are presented as dataframes under CSV or TXT format with rows representing genes and columns representing samples.

Example:

	SRR1039509	SRR1039512	SRR1039513	SRR1039516	SRR1039508
ENSG00000000003	679	448	873	408	1138
ENSG00000000005	0	0	0	0	0
ENSG00000000419	467	515	621	365	587
ENSG00000000457	260	211	263	164	245
ENSG00000000460	60	55	40	35	78
ENSG00000000938	0	0	2	0	1

Genes must either be named with official gene symbols or Ensembl gene IDs(ENS***).

There is no requirement for sample name type. Barcodes, numbers, any artificial names can work.

‘mex_folders’: Each folder containing Market Exchange Format (MEX) files output by cellranger pipeline representing samples of same class. For more information: https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/output/matrices

Mouse Embryonic Fibroblast(MEF) vs Embryonic Stem Cells(ESC) example:

This example represent how to use ageas.Launch() with ‘gem_files’.

Here, we attempt to extract key genetic factors to perform cell reprogramming from MEF into Induced Pluripotent Stem Cell(iPSC), one of the most well known cell reprogramming case, with AGEAS.

The scRNA-seq based gene expression data for both MEF and ESC are retrieved from GSE103221.

Either raw data in GSE103221_RAW.tar or normalized counts in GSE103221_normalized_counts.csv.gz can be processed with AGEAS.

For using raw data

With all setting remain default, we can launch AGEAS as:

[ ]:

ageas.Launch(
    # since 'gem_files' is the default setting, this line can be deleted
    database_type = 'gem_files',

    # ageas.Data_Preprocess() args
    class1_path = 'GSE103221_RAW/GSM3629847_10x_osk_mef.csv.gz',
    class2_path = 'GSE103221_RAW/GSM3629848_10x_osk_esc.csv.gz',
)

However, section above could be too computational expensive for PC!

Few adjustments can be made:

[ ]:

test_raw = ageas.Launch(
    protocol = 'multi',
    unit_num = 4,

    # ageas.Data_Preprocess() args
    class1_path = 'GSE103221_RAW/GSM3629847_10x_osk_mef.csv.gz',
    class2_path = 'GSE103221_RAW/GSM3629848_10x_osk_esc.csv.gz',
    std_value_thread = 3.0,
)

With std_value_thread = 3.0, more genes with relatively low expression variability will be ruled out during meta-processing and, thus, limit amount of Gene Regulatory Pathways(GRPs) in meta level processed Gene Regulatory Network(GRN) and pseudo-sample GRNs.

Instead of using only one AGEAS extractor units by default, four units are used with unit_num = 4. Result now is generalized from GRPs extracted with every unit.

To reduce running time, protocol = ‘multi’ set units to run parallelly with multithreading.

For more API information, please visit documentaion page.

Extraction reports can be saved as files with:

[ ]:

test_raw.save_reports(
    folder_path = 'report_files/',
    save_unit_reports = True
)

Within folder report_files, there should have following files:

report_files/
    │
    ├─ no_1/
    │  ├─ grps_importances.txt
    │  ├─ outlier_grps.js
    │
    ├─ no_2/
    |  ├─ ...
    │
    ├─ no_3/
    |  ├─ ...
    │
    ├─ no_4/
    |  ├─ ...
    │
    ├─ full_atlas.js
    ├─ grp_scores.csv
    ├─ key_atlas.js
    ├─ meta_GRN.js
    ├─ meta_report.csv
    ├─ pseudo_sample_GRNs.js
    ├─ report.csv

Folders no_1, no_2, no_3, no_4 contain GRP importance scores returned by each extractor unit as grps_importances.csv which has GRPs ranked with importance scores and outlier_grps.js which has GRPs once removed during feature selection due to extremly high importance score. If these information not needed, keep save_unit_reports as False by default.

full_atlas.js: networks reconstructed with every important GRP extracted by every extractor unit.

grp_scores.csv: GRPs with max importance score each can achieve after extracted by every extractor unit.

key_atlas.js: pruned networks reconstructed only with genes being capable to regulate other genes in full atlas.

meta_GRN.js: meta-level processed GRN cast with all samples.

meta_report.csv: summary of every gene in meta-processed GRN. By default, records are ranked by Degree. Top few rows should look like:

ID	Gene Symbol	Type	Degree	Log2FC
Pou5f1	Pou5f1	TF	786	18.0654266535883
Trim28	Trim28	TF	727	16.7739633684336
Trp53	Trp53	TF	725	15.9708922902521
Rest	Rest	TF	695	15.0240141813129
Sox2	Sox2	TF	687	15.4459524481466
Junb	Junb	TF	683	14.1196706687477
Cebpb	Cebpb	TF	682	13.8599229728618

Type can either be Gene or TF(Transcription Factor).
Degree here stands for degree in graph theory.
Log2FC is calcualted with all expression values in sample classes.

pseudo_sample_GRNs.js: GRNs cast for each pseudo-sample generated with Sliding Window Algorithm.

report.csv: information about key regulatory-source genes in full_atlas.js. By default, records are ranked by Log2FC. Top few rows should look like:

ID	Gene Symbol	Network	Type	Source_Num	Target_Num	Meta_Degree	Log2FC
Pou5f1	Pou5f1	network_0	TF	5	68	786	18.065426653588275
Klf2	Klf2	network_0	TF	0	71	592	17.77230812962195
Trim28	Trim28	network_0	TF	8	136	727	16.77396336843356
Trp53	Trp53	network_0	TF	7	133	724	15.97089229025213
Sox2	Sox2	network_0	TF	3	83	687	15.44595244814661
Nanog	Nanog	network_0	TF	4	81	656	15.417885470810674
Klf4	Klf4	network_0	TF	3	62	428	15.348589852448784

Type can either be Gene or TF(Transcription Factor).
Source-Num and Target_Num indicate amount of directly related regulatory source and target in full atlas.
Meta_Degree and Log2FC here are same with Degree and Log2FC in meta_report.csv.

For using normalized gene counts

We will need to make new gene expression matrices with all MEF samples and ESC samples respectively.

[ ]:

import re
import pandas as pd

data = pd.read_csv('GSE103221_normalized_counts.csv', index_col = 0)

mef_samples = [x for x in data if re.search(r'mef', x)]
esc_samples = [x for x in data if re.search(r'esc', x)]

data[mef_samples].to_csv('mef.csv.gz')
data[esc_samples].to_csv('esc.csv.gz')

Considering the sample size of raw data for each class no less than few thousands, generating pseudo-samples, which abstracts gene expressions from several distinct samples as continuous expression data in order to calculate gene expression correlations, with every distinct 100 samples by default setting should be acceptable.

However, in this normalized data, only dozens of samples clearly labeled as MEF or ESC. To make at least dozens of pseudo-samples, we can adjust few arguments for Sliding Window Algorithm.

Furthermore, with normalized expression value, GRP filters shall also be adjusted to keep total amount of GRP after meta-processing reasonable. log2fc_thread can rule out genes and corresponding GRPs based on expression difference.

[ ]:

test_normalized = ageas.Launch(
    unit_num = 4,

    # ageas.Data_Preprocess() args
    class1_path = 'mef.csv.gz',
    class2_path = 'esc.csv.gz',
    log2fc_thread = 3,
    std_value_thread = 100,
    sliding_window_size = 10,
    sliding_window_stride = 1,
)

test_normalized.save_reports(
    folder_path = 'report_files/'
)

Files within folder report_files are under same structure and formats described above.

Carbon Tetrachloride(CCl4) induced liver fibrosis example:

This example represent how to use ageas.Launch() with ‘mex_folders’.

Stimulating portal fibroblasts(PFs) as activated hepatic stellate cells(HSCs), we can try to find key genetic factors in liver fibrosis through extracting key genetic differences among HSCs and PFs.

The scRNA-seq based gene expression data for PF is retrieved from GSM4085627 and data for HSC is retrieved from GSM4085625.

Files retrieved are managed as:

a6w_pf/
    ├─ GSM4085627_10x_5_barcodes.tsv.gz
    ├─ GSM4085627_10x_5_genes.tsv.gz
    ├─ GSM4085627_10x_5_matrix.mtx.gz

a6w_hsc/
    ├─ GSM4085625_10x_3_barcodes.tsv.gz
    ├─ GSM4085625_10x_3_genes.tsv.gz
    ├─ GSM4085625_10x_3_matrix.mtx.gz

Then, we can launch AGEAS with:

[ ]:

test_raw = ageas.Launch(
    unit_num = 4,

    # ageas.Data_Preprocess() args
    database_type = 'mex_folders',
    class1_path = 'a6w_hsc/',
    class2_path = 'a6w_pf/',
)

test_normalized.save_reports(
    folder_path = 'report_files/'
)

Files within folder report_files are under same structure and formats described in example above.

And then…

We can visualize each network as a graph with ageas.Plot()