{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# ageas.Launch()\n", "\n", "This notebook demonstrates how to use ageas.Launch() to launch AGEAS, which extracts the key genetic regulatory elements that differentiate sample classes from RNA-seq based Gene Expression Matrices (GEMs).\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import ageas" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Currently, AGEAS supports input data in two formats:\n", "\n", "1. **'gem_files'**: Gene expression data for each class is provided as a dataframe in CSV or TXT format, with rows representing genes and columns representing samples.\n", "\n", " Example:\n", "\n", " | | SRR1039509 | SRR1039512 | SRR1039513 | SRR1039516 | SRR1039508 |\n", " |-----------------|------------|------------|------------|------------|------------|\n", " | ENSG00000000003 | 679 | 448 | 873 | 408 | 1138 |\n", " | ENSG00000000005 | 0 | 0 | 0 | 0 | 0 |\n", " | ENSG00000000419 | 467 | 515 | 621 | 365 | 587 |\n", " | ENSG00000000457 | 260 | 211 | 263 | 164 | 245 |\n", " | ENSG00000000460 | 60 | 55 | 40 | 35 | 78 |\n", " | ENSG00000000938 | 0 | 0 | 2 | 0 | 1 |\n", "\n", " \n", "\n", " Genes must be named with either official gene symbols or Ensembl gene IDs (ENS***).\n", "\n", " There is no requirement on sample names. Barcodes, numbers, or any arbitrary names will work.\n", "\n", "\n", "2. **'mex_folders'**: Each folder contains the Market Exchange Format (MEX) files output by the [cellranger](https://github.com/10XGenomics/cellranger) pipeline, representing samples of the same class. 
For more information: \n", " https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/output/matrices" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Mouse Embryonic Fibroblast (MEF) vs Embryonic Stem Cell (ESC) example:\n", "\n", "This example shows how to use ageas.Launch() with ***'gem_files'***.\n", "\n", "Here, we use AGEAS to extract the key genetic factors for reprogramming **MEF** into **Induced Pluripotent Stem Cells (iPSCs)**, one of the best-known cell reprogramming cases.\n", "\n", "The scRNA-seq based gene expression data for both **MEF** and **ESC** are retrieved from [GSE103221](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE103221).\n", "\n", "Either the raw data in **GSE103221_RAW.tar** or the normalized counts in **GSE103221_normalized_counts.csv.gz** can be processed with AGEAS.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### For using raw data\n", "\n", "With all settings left at their defaults, we can launch AGEAS as:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ageas.Launch(\n", "\t# since 'gem_files' is the default setting, this line can be deleted\n", "\tdatabase_type = 'gem_files',\n", "\n", "\t# ageas.Data_Preprocess() args\n", "\tclass1_path = 'GSE103221_RAW/GSM3629847_10x_osk_mef.csv.gz',\n", "\tclass2_path = 'GSE103221_RAW/GSM3629848_10x_osk_esc.csv.gz',\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**However, the call above could be too computationally expensive for a PC!**\n", "\n", "A few adjustments can be made:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "test_raw = ageas.Launch(\n", "\tprotocol = 'multi',\n", "\tunit_num = 4,\n", "\n", "\t# ageas.Data_Preprocess() args\n", "\tclass1_path = 'GSE103221_RAW/GSM3629847_10x_osk_mef.csv.gz',\n", "\tclass2_path = 'GSE103221_RAW/GSM3629848_10x_osk_esc.csv.gz',\n", "\tstd_value_thread = 
3.0,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With **_std_value_thread = 3.0_**, more genes with relatively low expression variability are ruled out during meta-processing, which limits the number of Gene Regulatory Pathways (GRPs) in the meta-level processed Gene Regulatory Network (GRN) and in the pseudo-sample GRNs.\n", "\n", "Instead of the single AGEAS extractor unit used by default, four units are used with **_unit_num = 4_**. The result is now generalized from the GRPs extracted by every unit.\n", "\n", "To reduce running time, **_protocol = 'multi'_** sets the units to run in parallel with multithreading.\n", "\n", "For more API information, please visit the [documentation page](https://JackSSK.github.io/Ageas/html/generated/ageas.Launch.html#ageas.Launch).\n", "\n", "Extraction reports can be saved as files with:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "test_raw.save_reports(\n", "\tfolder_path = 'report_files/',\n", "\tsave_unit_reports = True\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Within the folder *report_files*, there should be the following files:\n", "```bash\n", "report_files/\n", " │\n", " ├─ no_1/\n", " │ ├─ grps_importances.txt\n", " │ ├─ outlier_grps.js\n", " │\n", " ├─ no_2/\n", " │ ├─ ...\n", " │\n", " ├─ no_3/\n", " │ ├─ ...\n", " │\n", " ├─ no_4/\n", " │ ├─ ...\n", " │\n", " ├─ full_atlas.js\n", " ├─ grp_scores.csv\n", " ├─ key_atlas.js\n", " ├─ meta_GRN.js\n", " ├─ meta_report.csv\n", " ├─ pseudo_sample_GRNs.js\n", " ├─ report.csv\n", "```\n", "\n", "Folders ***no_1***, ***no_2***, ***no_3***, ***no_4*** contain the GRP importance scores returned by each extractor unit: ***grps_importances.csv***, which has GRPs ranked by importance score, and ***outlier_grps.js***, which has GRPs that were removed during feature selection due to extremely high importance scores. 
If this information is not needed, leave **_save_unit_reports_** at its default of **_False_**.\n", "\n", "***full_atlas.js***: networks reconstructed from every important GRP extracted by every extractor unit.\n", "\n", "***grp_scores.csv***: GRPs with the maximum importance score each achieved across all extractor units.\n", "\n", "***key_atlas.js***: pruned networks reconstructed only from genes in the full atlas that are capable of regulating other genes.\n", "\n", "***meta_GRN.js***: meta-level processed GRN cast from all samples.\n", "\n", "***meta_report.csv***: summary of every gene in the meta-processed GRN. By default, records are ranked by **_Degree_**. The top few rows should look like:\n", "\n", "\n", "| ID | Gene Symbol | Type | Degree | Log2FC |\n", "|--------|-------------|------|--------|------------------|\n", "| Pou5f1 | Pou5f1 | TF | 786 | 18.0654266535883 |\n", "| Trim28 | Trim28 | TF | 727 | 16.7739633684336 |\n", "| Trp53 | Trp53 | TF | 725 | 15.9708922902521 |\n", "| Rest | Rest | TF | 695 | 15.0240141813129 |\n", "| Sox2 | Sox2 | TF | 687 | 15.4459524481466 |\n", "| Junb | Junb | TF | 683 | 14.1196706687477 |\n", "| Cebpb | Cebpb | TF | 682 | 13.8599229728618 |\n", "\n", "\n", "- Type can be either Gene or TF (Transcription Factor).\n", "\n", "- Degree here stands for [degree in graph theory](https://en.wikipedia.org/wiki/Degree_(graph_theory)).\n", "\n", "- Log2FC is calculated from all expression values in the sample classes.\n", "\n", "\n", "***pseudo_sample_GRNs.js***: GRNs cast for each pseudo-sample generated with the [Sliding Window Algorithm](https://stackoverflow.com/questions/8269916/what-is-sliding-window-algorithm-examples).\n", "\n", "***report.csv***: information about key regulatory-source genes in ***full_atlas.js***. By default, records are ranked by **_Log2FC_**. 
The top few rows should look like:\n", "\n", "\n", "| ID | Gene Symbol | Network | Type | Source_Num | Target_Num | Meta_Degree | Log2FC |\n", "|--------|-------------|-----------|------|------------|------------|--------------|--------------------|\n", "| Pou5f1 | Pou5f1 | network_0 | TF | 5 | 68 | 786 | 18.065426653588275 |\n", "| Klf2 | Klf2 | network_0 | TF | 0 | 71 | 592 | 17.77230812962195 |\n", "| Trim28 | Trim28 | network_0 | TF | 8 | 136 | 727 | 16.77396336843356 |\n", "| Trp53 | Trp53 | network_0 | TF | 7 | 133 | 724 | 15.97089229025213 |\n", "| Sox2 | Sox2 | network_0 | TF | 3 | 83 | 687 | 15.44595244814661 |\n", "| Nanog | Nanog | network_0 | TF | 4 | 81 | 656 | 15.417885470810674 |\n", "| Klf4 | Klf4 | network_0 | TF | 3 | 62 | 428 | 15.348589852448784 |\n", "\n", "- Type can be either Gene or TF (Transcription Factor).\n", "\n", "- Source_Num and Target_Num indicate the number of directly related regulatory sources and targets in the full atlas. \n", "\n", "- Meta_Degree and Log2FC here are the same as Degree and Log2FC in ***meta_report.csv***.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### For using normalized gene counts\n", "\n", "We will need to make new gene expression matrices containing all MEF samples and all ESC samples, respectively." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import re\n", "import pandas as pd\n", "\n", "# load the normalized counts with gene names as the index;\n", "# this path assumes the downloaded .csv.gz has been decompressed\n", "data = pd.read_csv('GSE103221_normalized_counts.csv', index_col = 0)\n", "\n", "# pick sample columns by the class label in their names\n", "mef_samples = [x for x in data if re.search(r'mef', x)]\n", "esc_samples = [x for x in data if re.search(r'esc', x)]\n", "\n", "# write one gene expression matrix per class\n", "data[mef_samples].to_csv('mef.csv.gz')\n", "data[esc_samples].to_csv('esc.csv.gz')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since the raw data contains at least a few thousand samples per class, generating pseudo-samples from every 100 distinct samples, as the default setting does, should be acceptable. A pseudo-sample abstracts the gene expression of several distinct samples into continuous expression data in order to calculate gene expression correlations.\n", "\n", "However, in this normalized data, only dozens of samples are clearly labeled as MEF or ESC. To make at least dozens of pseudo-samples, we can adjust a few arguments of the Sliding Window Algorithm. \n", "\n", "Furthermore, with normalized expression values, GRP filters should also be adjusted to keep the total number of GRPs after meta-processing reasonable. ***log2fc_thread*** can rule out genes and corresponding GRPs based on expression difference." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "test_normalized = ageas.Launch(\n", "\tunit_num = 4,\n", "\n", "\t# ageas.Data_Preprocess() args\n", "\tclass1_path = 'mef.csv.gz',\n", "\tclass2_path = 'esc.csv.gz',\n", "\tlog2fc_thread = 3,\n", "\tstd_value_thread = 100,\n", "\tsliding_window_size = 10,\n", "\tsliding_window_stride = 1,\n", ")\n", "\n", "test_normalized.save_reports(\n", "\tfolder_path = 'report_files/'\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Files within the folder *report_files* follow the same structure and formats described above." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Carbon Tetrachloride (CCl4) induced liver fibrosis example:\n", "\n", "This example shows how to use ageas.Launch() with ***'mex_folders'***.\n", "\n", "Stimulating portal fibroblasts (PFs) as activated hepatic stellate cells (HSCs), we can try to find key genetic factors in liver fibrosis by extracting the key genetic differences between HSCs and PFs.\n", "\n", "The scRNA-seq based gene expression data for PF is retrieved from [GSM4085627](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM4085627) and the data for HSC is retrieved from [GSM4085625](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM4085625).\n", "\n", "The retrieved files are organized as:\n", "```bash\n", "a6w_pf/\n", " ├─ GSM4085627_10x_5_barcodes.tsv.gz\n", " ├─ GSM4085627_10x_5_genes.tsv.gz\n", " ├─ GSM4085627_10x_5_matrix.mtx.gz\n", "\n", "a6w_hsc/\n", " ├─ GSM4085625_10x_3_barcodes.tsv.gz\n", " ├─ GSM4085625_10x_3_genes.tsv.gz\n", " ├─ GSM4085625_10x_3_matrix.mtx.gz\n", "```\n", "\n", "Then, we can launch AGEAS with:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "test_raw = ageas.Launch(\n", "\tunit_num = 4,\n", "\n", "\t# ageas.Data_Preprocess() args\n", "\tdatabase_type = 'mex_folders',\n", "\tclass1_path = 'a6w_hsc/',\n", "\tclass2_path = 'a6w_pf/',\n", ")\n", "\n", "# save reports from the run launched above (test_raw, not test_normalized)\n", "test_raw.save_reports(\n", "\tfolder_path = 'report_files/'\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Files within the folder *report_files* follow the same structure and formats described in the example above." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## And then...\n", "\n", "We can visualize each network as a graph with [ageas.Plot()](https://JackSSK.github.io/Ageas/html/generated/ageas.Plot.html)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.9.13 64-bit (windows store)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13" }, "orig_nbformat": 4, "vscode": { "interpreter": { "hash": "2a9e64e061ab733f2e33056f37bf3f62a8dd02da99810729dff6b17cfb3a5e9f" } } }, "nbformat": 4, "nbformat_minor": 2 }