ageas.Launch()

This notebook demonstrate how to use ageas.Launch() to launch AGEAS in extracting key genetic regulatory elements from RNA-seq based Gene Expression Matrices (GEMs) in differentiating sample classes.

[ ]:
import ageas

Currently, AGEAS support input data under two different formats:

  1. ‘gem_files’: Gene expression of all classes are presented as dataframes under CSV or TXT format with rows representing genes and columns representing samples.

    Example:

    SRR1039509

    SRR1039512

    SRR1039513

    SRR1039516

    SRR1039508

    ENSG00000000003

    679

    448

    873

    408

    1138

    ENSG00000000005

    0

    0

    0

    0

    0

    ENSG00000000419

    467

    515

    621

    365

    587

    ENSG00000000457

    260

    211

    263

    164

    245

    ENSG00000000460

    60

    55

    40

    35

    78

    ENSG00000000938

    0

    0

    2

    0

    1

    Genes must either be named with official gene symbols or Ensembl gene IDs(ENS***).

    There is no requirement for sample name type. Barcodes, numbers, any artificial names can work.

  2. ‘mex_folders’: Each folder containing Market Exchange Format (MEX) files output by cellranger pipeline representing samples of same class. For more information: https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/output/matrices

Mouse Embryonic Fibroblast(MEF) vs Embryonic Stem Cells(ESC) example:

This example represent how to use ageas.Launch() with ‘gem_files’.

Here, we attempt to extract key genetic factors to perform cell reprogramming from MEF into Induced Pluripotent Stem Cell(iPSC), one of the most well known cell reprogramming case, with AGEAS.

The scRNA-seq based gene expression data for both MEF and ESC are retrieved from GSE103221.

Either raw data in GSE103221_RAW.tar or normalized counts in GSE103221_normalized_counts.csv.gz can be processed with AGEAS.

For using raw data

With all setting remain default, we can launch AGEAS as:

[ ]:
ageas.Launch(
    # since 'gem_files' is the default setting, this line can be deleted
    database_type = 'gem_files',

    # ageas.Data_Preprocess() args
    class1_path = 'GSE103221_RAW/GSM3629847_10x_osk_mef.csv.gz',
    class2_path = 'GSE103221_RAW/GSM3629848_10x_osk_esc.csv.gz',
)

However, section above could be too computational expensive for PC!

Few adjustments can be made:

[ ]:
test_raw = ageas.Launch(
    protocol = 'multi',
    unit_num = 4,

    # ageas.Data_Preprocess() args
    class1_path = 'GSE103221_RAW/GSM3629847_10x_osk_mef.csv.gz',
    class2_path = 'GSE103221_RAW/GSM3629848_10x_osk_esc.csv.gz',
    std_value_thread = 3.0,
)

With std_value_thread = 3.0, more genes with relatively low expression variability will be ruled out during meta-processing and, thus, limit amount of Gene Regulatory Pathways(GRPs) in meta level processed Gene Regulatory Network(GRN) and pseudo-sample GRNs.

Instead of using only one AGEAS extractor units by default, four units are used with unit_num = 4. Result now is generalized from GRPs extracted with every unit.

To reduce running time, protocol = ‘multi’ set units to run parallelly with multithreading.

For more API information, please visit documentaion page.

Extraction reports can be saved as files with:

[ ]:
test_raw.save_reports(
    folder_path = 'report_files/',
    save_unit_reports = True
)

Within folder report_files, there should have following files:

report_files/
    │
    ├─ no_1/
    │  ├─ grps_importances.txt
    │  ├─ outlier_grps.js
    │
    ├─ no_2/
    |  ├─ ...
    │
    ├─ no_3/
    |  ├─ ...
    │
    ├─ no_4/
    |  ├─ ...
    │
    ├─ full_atlas.js
    ├─ grp_scores.csv
    ├─ key_atlas.js
    ├─ meta_GRN.js
    ├─ meta_report.csv
    ├─ pseudo_sample_GRNs.js
    ├─ report.csv

Folders no_1, no_2, no_3, no_4 contain GRP importance scores returned by each extractor unit as grps_importances.csv which has GRPs ranked with importance scores and outlier_grps.js which has GRPs once removed during feature selection due to extremly high importance score. If these information not needed, keep save_unit_reports as False by default.

full_atlas.js: networks reconstructed with every important GRP extracted by every extractor unit.

grp_scores.csv: GRPs with max importance score each can achieve after extracted by every extractor unit.

key_atlas.js: pruned networks reconstructed only with genes being capable to regulate other genes in full atlas.

meta_GRN.js: meta-level processed GRN cast with all samples.

meta_report.csv: summary of every gene in meta-processed GRN. By default, records are ranked by Degree. Top few rows should look like:

ID

Gene Symbol

Type

Degree

Log2FC

Pou5f1

Pou5f1

TF

786

18.0654266535883

Trim28

Trim28

TF

727

16.7739633684336

Trp53

Trp53

TF

725

15.9708922902521

Rest

Rest

TF

695

15.0240141813129

Sox2

Sox2

TF

687

15.4459524481466

Junb

Junb

TF

683

14.1196706687477

Cebpb

Cebpb

TF

682

13.8599229728618

  • Type can either be Gene or TF(Transcription Factor).

  • Degree here stands for degree in graph theory.

  • Log2FC is calcualted with all expression values in sample classes.

pseudo_sample_GRNs.js: GRNs cast for each pseudo-sample generated with Sliding Window Algorithm.

report.csv: information about key regulatory-source genes in full_atlas.js. By default, records are ranked by Log2FC. Top few rows should look like:

ID

Gene Symbol

Network

Type

Source_Num

Target_Num

Meta_Degree

Log2FC

Pou5f1

Pou5f1

network_0

TF

5

68

786

18.065426653588275

Klf2

Klf2

network_0

TF

0

71

592

17.77230812962195

Trim28

Trim28

network_0

TF

8

136

727

16.77396336843356

Trp53

Trp53

network_0

TF

7

133

724

15.97089229025213

Sox2

Sox2

network_0

TF

3

83

687

15.44595244814661

Nanog

Nanog

network_0

TF

4

81

656

15.417885470810674

Klf4

Klf4

network_0

TF

3

62

428

15.348589852448784

  • Type can either be Gene or TF(Transcription Factor).

  • Source-Num and Target_Num indicate amount of directly related regulatory source and target in full atlas.

  • Meta_Degree and Log2FC here are same with Degree and Log2FC in meta_report.csv.

For using normalized gene counts

We will need to make new gene expression matrices with all MEF samples and ESC samples respectively.

[ ]:
import re
import pandas as pd

data = pd.read_csv('GSE103221_normalized_counts.csv', index_col = 0)

mef_samples = [x for x in data if re.search(r'mef', x)]
esc_samples = [x for x in data if re.search(r'esc', x)]

data[mef_samples].to_csv('mef.csv.gz')
data[esc_samples].to_csv('esc.csv.gz')

Considering the sample size of raw data for each class no less than few thousands, generating pseudo-samples, which abstracts gene expressions from several distinct samples as continuous expression data in order to calculate gene expression correlations, with every distinct 100 samples by default setting should be acceptable.

However, in this normalized data, only dozens of samples clearly labeled as MEF or ESC. To make at least dozens of pseudo-samples, we can adjust few arguments for Sliding Window Algorithm.

Furthermore, with normalized expression value, GRP filters shall also be adjusted to keep total amount of GRP after meta-processing reasonable. log2fc_thread can rule out genes and corresponding GRPs based on expression difference.

[ ]:
test_normalized = ageas.Launch(
    unit_num = 4,

    # ageas.Data_Preprocess() args
    class1_path = 'mef.csv.gz',
    class2_path = 'esc.csv.gz',
    log2fc_thread = 3,
    std_value_thread = 100,
    sliding_window_size = 10,
    sliding_window_stride = 1,
)

test_normalized.save_reports(
    folder_path = 'report_files/'
)

Files within folder report_files are under same structure and formats described above.

Carbon Tetrachloride(CCl4) induced liver fibrosis example:

This example represent how to use ageas.Launch() with ‘mex_folders’.

Stimulating portal fibroblasts(PFs) as activated hepatic stellate cells(HSCs), we can try to find key genetic factors in liver fibrosis through extracting key genetic differences among HSCs and PFs.

The scRNA-seq based gene expression data for PF is retrieved from GSM4085627 and data for HSC is retrieved from GSM4085625.

Files retrieved are managed as:

a6w_pf/
    ├─ GSM4085627_10x_5_barcodes.tsv.gz
    ├─ GSM4085627_10x_5_genes.tsv.gz
    ├─ GSM4085627_10x_5_matrix.mtx.gz

a6w_hsc/
    ├─ GSM4085625_10x_3_barcodes.tsv.gz
    ├─ GSM4085625_10x_3_genes.tsv.gz
    ├─ GSM4085625_10x_3_matrix.mtx.gz

Then, we can launch AGEAS with:

[ ]:
test_raw = ageas.Launch(
    unit_num = 4,

    # ageas.Data_Preprocess() args
    database_type = 'mex_folders',
    class1_path = 'a6w_hsc/',
    class2_path = 'a6w_pf/',
)

test_normalized.save_reports(
    folder_path = 'report_files/'
)

Files within folder report_files are under same structure and formats described in example above.

And then…

We can visualize each network as a graph with ageas.Plot()