ageas.Data_Preprocess

ageas.Data_Preprocess(correlation_thread: float = 0.2, database_path: Optional[str] = None, database_type: str = 'gem_files', class1_path: Optional[str] = None, class2_path: Optional[str] = None, interaction_database: str = 'gtrd', log2fc_thread: Optional[float] = None, meta_load_path: Optional[str] = None, mww_p_val_thread: float = 0.05, normalize: Optional[str] = None, prediction_thread='auto', psgrn_load_path: Optional[str] = None, specie: str = 'mouse', sliding_window_size: int = 100, sliding_window_stride: Optional[int] = None, std_value_thread: float = 1.0, std_ratio_thread: Optional[float] = None)

Function to integrate database information and get pseudo-sample GRNs from gene expression data.
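A minimal sketch of how a call might be assembled, using only keyword names from the signature above. The helper names `build_preprocess_kwargs` and `run_preprocess` and the file names are hypothetical, and the return value of `ageas.Data_Preprocess` is not described here, so the sketch does not inspect it.

```python
def build_preprocess_kwargs(class1_file, class2_file, root=None):
    """Collect keyword arguments named as in the signature above."""
    return {
        "database_path": root,
        "database_type": "gem_files",
        "class1_path": class1_file,
        "class2_path": class2_file,
        "interaction_database": "gtrd",
        "specie": "mouse",
        "sliding_window_size": 100,
        "std_value_thread": 1.0,
    }


def run_preprocess(kwargs):
    # Imported lazily so this sketch loads even where ageas is not installed.
    import ageas
    return ageas.Data_Preprocess(**kwargs)


# Hypothetical file names; replace them with your own GEM files.
kwargs = build_preprocess_kwargs("class1.csv", "class2.csv", root="data/")
```

Passing `kwargs` to `run_preprocess` would then perform the actual call on a machine where ageas and the input files are available.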

Parameters
  • correlation_thread

    <float Default = 0.2> Gene expression correlation threshold for GRPs.

    Potential GRPs that fail to reach this value are dropped.
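The correlation filter can be pictured with a small sketch. It assumes Pearson correlation and assumes that the absolute value is compared against the threshold; neither detail is stated above, and the function names are illustrative rather than part of ageas.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no correlation signal
    return cov / math.sqrt(var_x * var_y)


def keep_grp(source_expr, target_expr, correlation_thread=0.2):
    """Assumption: a GRP is kept when |r| reaches the threshold."""
    return abs(pearson(source_expr, target_expr)) >= correlation_thread


tf_expr = [1.0, 2.0, 3.0, 4.0]
tracking_target = [2.1, 3.9, 6.2, 8.0]  # rises with the TF
constant_target = [5.0, 5.0, 5.0, 5.0]  # never changes

kept = keep_grp(tf_expr, tracking_target)
dropped = not keep_grp(tf_expr, constant_target)
```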

  • database_path

    <str Default = None> Root path of the database.

    If specified, class1_path and class2_path are resolved relative to this path.

  • database_type

    <str Default = 'gem_files'> Type of data that class1_path and class2_path point to. Supported values:

    'gem_files': Each path points to a GEM file. Pseudo-samples are generated with the sliding window algorithm.

    'gem_folders': Each path points to a folder of GEM files. The files in each folder are used to generate pseudo-samples.

    'mex_folders': Each path points to a folder containing MEX files (matrix.mtx, genes.tsv or features.tsv, and barcodes.tsv).

    Pseudo-sample GRNs are generated with the sliding window method.

  • class1_path – <str Default = None> Path to file or folder of class 1 samples data

  • class2_path – <str Default = None> Path to file or folder of class 2 samples data

  • interaction_database

    <str Default = 'gtrd'> Interaction database used to confirm that a GRP has a high likelihood of existing. Supported values:

    None: No database is used. Any GRP that passes all related filters is kept.

    'gtrd': Use GTRD as the regulatory pathway reference. https://gtrd.biouml.org/

    'biogrid': Use BioGRID as the regulatory pathway reference. Gene symbols must be given as the index of the GEM matrix or in the MEX feature file. https://thebiogrid.org/

  • log2fc_thread

    <float Default = None> Log2 fold change threshold used to filter out non-differentially expressed genes.

    Setting this filter is generally discouraged, since it can drop key TFs whose overall expression level changes little but whose expression pattern changes.

    If local computational power is limited, setting this threshold can help keep the program runnable.
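As a rough illustration of a log2 fold change filter, the sketch below compares class means. The pseudocount, the use of means, and the comparison on the absolute value are assumptions made for the example and may differ from what ageas computes.

```python
import math


def log2_fold_change(class1_values, class2_values, pseudocount=1.0):
    """Log2 ratio of class means; the pseudocount avoids division by zero."""
    mean1 = sum(class1_values) / len(class1_values)
    mean2 = sum(class2_values) / len(class2_values)
    return math.log2((mean1 + pseudocount) / (mean2 + pseudocount))


def passes_log2fc(class1_values, class2_values, log2fc_thread=None):
    """With no threshold set (None), every gene passes."""
    if log2fc_thread is None:
        return True
    change = log2_fold_change(class1_values, class2_values)
    return abs(change) >= log2fc_thread


changed_gene = passes_log2fc([7, 7], [1, 1], log2fc_thread=1.0)
flat_gene = passes_log2fc([3, 3], [3, 3], log2fc_thread=1.0)
flat_gene_no_filter = passes_log2fc([3, 3], [3, 3])
```

The last call shows why the default of None is the safer choice for pattern-level changes: a gene with identical class means is kept unless a threshold is set.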

  • meta_load_path – <str Default = None> Path to load meta_GRN

  • mww_p_val_thread – <float Default = 0.05> P-value threshold for the Mann–Whitney–Wilcoxon test on gene expression. Ensures that a gene's expression profile is not constant across the different classes.
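To make the test concrete, here is a simplified, standard-library-only sketch of a two-sided Mann–Whitney–Wilcoxon p-value using the normal approximation, without tie or continuity correction. It is an illustration, not the routine ageas uses, and it assumes a gene is kept when its p-value is at or below the threshold.

```python
import math


def average_ranks(values):
    """Rank values from 1, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def mww_p_value(class1_values, class2_values):
    """Two-sided p-value from the normal approximation of the U statistic."""
    n1, n2 = len(class1_values), len(class2_values)
    ranks = average_ranks(list(class1_values) + list(class2_values))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mean_u) / sd_u
    return math.erfc(abs(z) / math.sqrt(2))


p_shifted = mww_p_value(range(1, 9), range(11, 19))   # classes clearly differ
p_mixed = mww_p_value([1, 3, 5, 7], [2, 4, 6, 8])     # classes interleave
keep_shifted = p_shifted <= 0.05
keep_mixed = p_mixed <= 0.05
```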

  • normalize

    <str Default = None> Normalization method applied to input GEMs. Supported values:

    None: No normalization is done.

    'CPM': Counts Per Million (CPM).

    'Min_Max_1000': Values multiplied by 100 after Min-Max Normalization.
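The two normalization options can be sketched for a single sample as follows. The per-sample scope and the function names are assumptions for illustration, and the Min-Max scaling factor is left as an argument rather than fixed.

```python
def cpm(sample_counts):
    """Counts Per Million: scale each count by the sample total."""
    total = sum(sample_counts.values())
    if total == 0:
        return {gene: 0.0 for gene in sample_counts}
    return {gene: count / total * 1_000_000
            for gene, count in sample_counts.items()}


def min_max_scaled(sample_values, scale):
    """Min-Max normalization to [0, 1], then multiplication by `scale`."""
    low = min(sample_values.values())
    high = max(sample_values.values())
    if high == low:
        return {gene: 0.0 for gene in sample_values}
    return {gene: (value - low) / (high - low) * scale
            for gene, value in sample_values.items()}


cpm_result = cpm({"GeneA": 1, "GeneB": 3})
scaled_result = min_max_scaled({"GeneA": 2, "GeneB": 4, "GeneC": 6}, scale=100)
```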

  • prediction_thread

    <str or float Default = 'auto'> Importance threshold that a GRP predicted with the GRNBoost2-like algorithm must reach to be included. Supported values:

    'auto': Set the threshold automatically to the minimum importance value among the interaction-database-recorded GRPs of the TF with the most GRPs. If no interaction database is used, it is set to (1 / number of genes).

    float: The value is used directly as the threshold.
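One possible reading of the 'auto' rule is sketched below. It assumes that "the TF with the most GRPs" is decided by counting database-recorded GRPs per TF, and that predicted importances are available for every recorded pair; both are interpretations, and the function and argument names are hypothetical.

```python
def auto_prediction_thread(importances, recorded_grps, n_genes):
    """Pick an importance threshold following one reading of 'auto'.

    importances:   {(tf, target): predicted importance}
    recorded_grps: list of (tf, target) pairs found in the interaction database
    n_genes:       number of genes, used when no database is available
    """
    if not recorded_grps:
        return 1 / n_genes
    # Count recorded GRPs per TF and pick the TF with the most.
    counts = {}
    for tf, _target in recorded_grps:
        counts[tf] = counts.get(tf, 0) + 1
    top_tf = max(counts, key=counts.get)
    # Threshold is the weakest recorded GRP of that TF.
    return min(importances[pair] for pair in recorded_grps if pair[0] == top_tf)


importances = {("TF1", "G1"): 0.5, ("TF1", "G2"): 0.3, ("TF2", "G1"): 0.1}
recorded = [("TF1", "G1"), ("TF1", "G2"), ("TF2", "G1")]

with_database = auto_prediction_thread(importances, recorded, n_genes=200)
without_database = auto_prediction_thread({}, [], n_genes=200)
```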

  • psgrn_load_path – <str Default = None> Path to load pseudo-sample GRNs.

  • specie

    <str Default = 'mouse'> Species whose interaction database is used. Supported values:

    'mouse': Mus musculus.

    'human': Homo sapiens.

  • sliding_window_size – <int Default = 100> Number of samples contained in each pseudo-sample generated with the sliding window technique.

  • sliding_window_stride – <int Default = None> Stride of sliding window when generating pseudo-samples.
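The window size and stride interact as in the sketch below, which groups sample identifiers into pseudo-samples. The fallback of a stride equal to the window size when the stride is None is an assumption made for the example; the actual default behavior is not stated above.

```python
def sliding_window_groups(sample_ids, window_size=100, stride=None):
    """Split an ordered list of samples into windows of `window_size`."""
    if stride is None:
        stride = window_size  # assumption: non-overlapping windows
    groups = []
    # Only full windows are produced; a trailing partial window is skipped.
    for start in range(0, len(sample_ids) - window_size + 1, stride):
        groups.append(sample_ids[start:start + window_size])
    return groups


samples = list(range(10))
overlapping = sliding_window_groups(samples, window_size=4, stride=3)
non_overlapping = sliding_window_groups(samples, window_size=4)
```

A smaller stride yields more, more strongly overlapping pseudo-samples from the same data, at the cost of more computation downstream.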

  • std_value_thread – <float Default = 1.0> Threshold on gene expression standard deviation, given as a value. Filters out genes with relatively constant expression.

  • std_ratio_thread – <float Default = None> Threshold on gene expression standard deviation, given as a proportion. Only genes within the top proportion are kept.
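The two standard deviation filters can be illustrated together. The sketch assumes population standard deviation, applies the value filter first and the proportion filter second, and rounds the kept count down; these choices and the function name are assumptions, not documented behavior.

```python
import statistics


def filter_by_std(expression, std_value_thread=1.0, std_ratio_thread=None):
    """Return gene names that survive the standard deviation filters.

    expression: {gene: list of expression values across samples}
    """
    stds = {gene: statistics.pstdev(values)
            for gene, values in expression.items()}
    # Value filter: keep genes whose deviation reaches the threshold.
    kept = [gene for gene, std in stds.items() if std >= std_value_thread]
    if std_ratio_thread is not None:
        # Proportion filter: keep only the most variable share of genes.
        kept.sort(key=lambda gene: stds[gene], reverse=True)
        kept = kept[:int(len(kept) * std_ratio_thread)]
    return kept


expression = {
    "GeneA": [0, 10, 0, 10],  # standard deviation 5
    "GeneB": [5, 5, 5, 5],    # constant expression
    "GeneC": [1, 3, 1, 3],    # standard deviation 1
    "GeneD": [0, 4, 0, 4],    # standard deviation 2
}

by_value = filter_by_std(expression)
by_ratio = filter_by_std(expression, std_value_thread=0.0, std_ratio_thread=0.5)
```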