GeneSNAKE is a Python-based toolbox for gene regulatory network (GRN) data simulation and benchmarking. It offers generation of graphs with GRN-like properties, a variety of different experimental perturbation designs, and simulation of gene expression data based on an ODE model. Further details can be found in the article.

Working with GeneSNAKE

Here we provide a brief introduction on how to perform the most common tasks in GeneSNAKE, loading or creating a network, preparing a model and simulating data, exporting the data as well as saving the current model for later use and finally how to perform a benchmark on a completed run. A detailed description of each function can be found in the appropriate section below.

Importing to Python

The first step for using GeneSNAKE once it is installed is to import it into your python environment using:

Loading or creating a network

The first step in starting a GeneSNAKE simulation is obtaining a GRN. This can be done in a few different ways. First and simplest is to use the built in algorithm in GeneSNAKE to generate a network with desired properties. Here we use the FFLatt algorithm but there are others built in to GeneSNAKE. To generate a network use:

Alternatively if you wish to use a topology not supported by GeneSNAKE to generate a network you can use the python package networkx which supports a large number of network topologies. These networks are likely less biologically realistic than the FFLatt algorithm but offers a great flexibility for network generation.

Note that if you use a non-directed network generator you must use the option undirected=TRUE when moving it to GeneSNAKE. The software will then add a direction randomly which will allow it to be used in simulation, though the simulation may not entirely work depending on the network topology in question.

Finally if you already have a GRN you wish to use you can do so using the gs.grn.load function to load a network from a file. The function can handle either an edge list or an adjacency matrix. The network can be with or without weights but must contain an indicator of sign (+/-) if unweighted. Note that the edge list must be formatted in following order “regulator,target,weight/sign”.

Preparing the simulation model

Once a GRN has been obtained the next step is to create a simulation model and the corresponding parameters for the model to use. Along with this a perturbation design should also be created for the experiments. If you wish to use default values this is as easy as:

Note that setting the perturbation is a separate step as this allows you to reuse the same model for different experiments should you wish.

If instead you want to specify some parameters the command becomes a bit more complex as you will need to feed the information to the right parameter. For all parameter options see the create a GeneSNAKE model section below. In principle this is done by:

Run the simulation and extract the data

Once the network and model is ready it is time to run the simulation, this can be done in two ways, or rather with two output types. Steady-state (ss) output provides only the control and final step of the simulation modeling an experiment with only two measurements. Alternatively time series (ts) provides several steps between the control and final step, modeling an experiment with multiple measurements over time.

A key setting when generating data is the noise level, here we will only work with Gaussian noise (normal) which is controlled by the SNR parameter where a higher value indicates less noise. It is defined as a ratio of mean and standard deviation:

\[ SNR = \frac{\mu}{\sigma} \] To run steady state use:

For running time series a few more parameters are available

The end_at_ss option for time_series ends the simulation at the first found steady state and ensures that all timepoints between the first and last falls in the time period when the expression is still changing.

Once data is generated you can easily export it either to a file using the GRN model export function.

This will create a csv file of the data and network in the format “file_name_data.csv”, “file_name_nonoise_data.csv and “file_name_GRN.csv” that you can continue to work with as you like.

If at this point you are exiting python either by the script ending or by closing the interpreter we recommend that you save the current model using the command:

This will allow you to easily load up the model, network and data in the future as well as sharing it with others.

Benchmarking

Finally once you have run GRN prediction on the data it is time to perform benchmarking. To do so the first step is to load the data into a pandas DataFrame. If you work outside of python this is easiest done using:

Once this is done, load up the GeneSNAKE model you used to generate the data or load the network as described above. To load the model use:

If you run your tool in python and keep the model loaded during the run you obviously do not need to load anything and are ready to go. Regardless once the predicted and true GRN is available benchmarking can be done via:

To get the results out you then simply do for all metrics you wish to see:

Network generation

Generating networks are central step of any GRN simulator a such GeneSNAKE offers several methods for doing this. The main method is the FFLatt algorithm which creates biologically feasible GRN like networks. For details on FFLatt see the publication.

NOTE: In GeneSNAKE, link(edge) direction is defined as going from ROW to COLUMN !! (unlike GeneSPIDER)

A set of controllable topological parameters is available for FFLatt network generation (grn/config.json): Further options for the FFLatt tool are:

  • FFL_ENRICHED - Enrichment of FFL motif (1 or 0);
  • NO_CYCLES - Depletion of 3-node cycle motif (1 or 0);
  • OTHER_MOTIF_DEPLETION - Depletion of cascades and uplinks (1 or 0);
  • SPARSITY - Network sparsity (int);
  • TEST_NETWORK_SIZE - Network size (int);

Other options include RANDOM_SEED - random_seed (int), a reproducibility parameter; SHUFFLED - whether network motifs should be shuffled with a preserved in/out-degree for all nodes until at least 80% of edges in the original network are swapped (1 or 0); N_CORES_TO_USE - number of cores to use for network generation (int); OUTPUT - the output representation of the network (adj_list or adj_matrix).

GeneSNAKE supplies two additional network generation algorithms a DAG and a random topology generator. These take the same input but are called using make_DAG or make_random. To generate a network use:

The options for both of the generators are:

  • genes - int - the number of nodes in the network
  • degree - float (>1)- the number of edges each node will on average have
  • Self_loops - bool - if True the network will contain a negative selfloop for each node, increasing stability

Additionally Make_DAG also has the option:

  • single_comp - bool - if True the algorithm will ensure that the generated network is a single component. NOTE only for make_DAG.

In addition to the built in generators GeneSNAKE offers loading of custom networks from csv or tsv files. The loaded network can be in edge list or adjacency matrix form and is used via:

The options for loading network are:

  • file - the file to read, should be in csv or tsv format.
  • net_type - str - the typ of network to read in, either an edge list or an adjacency matrix.
  • sep - str - the character to use for separating each column.
  • header - bool - if True use the first row in the file as column header.
  • index - bool - if True use the first column in the file as row names.

In addition to the built in generators you can also use the networkx package to generate any graph supported by them and load that into GeneSNAKE. To do so use the function:

Create a GeneSNAKE model

Once a network is generated the second key step is to translate this in to an GRNmodel object. This will contain all parameters needed to run the simulation of the data as well as help store various data objects needed to run the simulation. Due to the number of parameters available the model can seem complex to use but at its core all that is needed is a network:

This will create an ODE model using the default optimized values set by us to ensure a biologically feasible data generation. If for any reason you want to create your own parameters you can use the option:

If you want to generate some parameters you can provide a list of the parameters to generate using:

If you want to modify any of the parameter generation steps this can be done by feeding a dictionary with the setting to the make_model ensuring that you can determine the parameters generated fully. NOTE: We cannot guarantee that the generation will be biologically realistic or solvable at all if the parameters are changed too much.

All parameters their function and options

  • steady_state_RNA/steady_state_prot - random floats used as starting point. This should probably never be changed. Options:
    • Genes - the number of values to generate
    • High_exp - float - the maximum value
    • low_exp - float - the minimum value
  • time - the number and size of time points and time steps. Options:
    • time_points - int - the number of time steps
    • end_time - int - the final number in the time series, will not affect the simulation much
  • max_rna_exp/max_prot_exp - controls the m parameter and thus sets the total expression level allowed for RNA or protein expression. Options:
    • genes - the number of values to generate.
    • Emax - float - float in range 0-1 which is the maximum value generated.
    • Emin - float - float in range 0-1 which is the minimum value generated. NOTE. Emax and Emin is often the same value, generally 1.0.
  • hill_coef - the hill coefficient used for the ODE equation in the simulation. Options:
    • genes - the number of values to generate.
    • Emax - intt - float in range 2-inf which is the maximum value generated.
    • Emin - int - float in range 2-inf which is the minimum value generated.
  • combinations - the combinatory behaviour of genes that regulates the same target. Can either have AND or OR logicgate like behavior. Options:
    • network - the network the model builds on, used to determine co-regulation.
    • max_comb - int - the maximum number of genes that can form a co-regulation unit. All collaborations with more members than this will be discarded.
    • comb_prob - float - Float in range 0-1 that determine the probability of a co-regulatory relationship being an AND gate meaning both regulators are needed.
  • alphas - a value that control the regulatory strength the regulator has on the target gene.
    Options:
    • Network - the network to get information about the regulator - target relationship.
    • combinations - a dict of combinatory regulations from the combinations function.
    • genes - a list of gene names used to make sure each gene gets an alpha value for each of it’s regulators and combinations.
    • maxa - float - the maximum value alpha can take.
    • mina - float - the minimum value alpha can take.
  • dis_const - the dissociation constant, how likely it is that a bound regulator will detach and thus not have any regulatory effect. Options:
    • Network - the network to get information about the regulator - target relationship.
    • combinations - a dict of combinatory regulations from the combinations function.
    • genes - a list of gene names used to make sure each gene gets an alpha value for each of its regulators and combinations.
    • maxa - float - the maximum value the dis_constant can take.
    • mina - float - the minimum value the dis_constant can take.
  • lambda_RNA/lambda_prot - The degradation rate for RNA and protein respectively. Controls how fast molecules will degrade after creation following the equation lambdamolecule. E.g. lambda_RNARNA is the degradation of RNA.
    NOTE: Increasing or decreasing this too much will prevent perturbations due to all new RNA/proteins either degrading instantly or proteins never degrading. Options:
    • genes - the number of lambda values to generate.
    • Emax - the maximum degradation rate.
    • Emin - the minimum degradation rate.

Perturbation generation

Experimental or perturbation design is a key feature in inferring gene regulatory networks, to make sure that any experiment you may be interested in can be represented in GeneSNAKE we offer a number of predefined generators. Generating a perturbation scheme is straightforward and relies only on an existing model object, if you do not have a gs.GRNmodel yet see the section “Create a GeneSNAKE model”. To create a perturbation for your model use:

The options for the set_pert is:

  • design - the design to use, see below for details.
  • effect - tuple - (A, B) the perturbation effect will be drawn in the range A-B. For a single strength perturbation use (A,A).
  • noise - float - a float number that will be used to add additional fluctuation to the perturbation. The perturbation will then be +/- noise changing the effect at random between replicates.
  • sign - +/-1 determines if the perturbation increases or decreases gene expression.

Design options for perturbation is:

  • Single gene - Each gene in the system is perturbed one by one so that all genes in the system are perturbed once.
  • Combinatory - Multiple genes are perturbed in each experiment in such a way that each gene is perturbed roughly an equal amount of times across all experiments. The number of perturbed genes per experiment is determined by the user.
  • Targeted combinations - Multiple genes are perturbed in each experiment and combinations are selected from the GRN so that genes regulating the same target are perturbed in the combinations. The number of perturbed genes per experiment is determined by the user.
  • All genes - All genes in the system are perturbed with a random but small effect, either positive or negative, to simulate global perturbation experiments such as cell medium or temperature changes.
  • Specific genes - A single or multiple genes are perturbed in each experiment based on an input list of gene names and gene combinations to perturb. Only those genes in the list will be perturbed.

Running the simulation

Once a GRNmodel and a perturbation design have been created it is time to generate gene expression data. The data may vary depending on which model you have selected but in general it should be raw gene expression. The data will be stored both with and without noise and can be accessed under the GRNmodel. Should you want to change something, the same model, parameters and perturbation can be used to generate data repeatedly. Note however that this will overwrite the saved data in the GNRmodel so we recommend saving this somewhere else first if you wish to use it.

The simulator can be set up either to return time-series or steady-state data, for the time-series you can also choose to have a “better” time-series by ensuring that all selected timepoints fall between the perturbation and first steady-state. Ensuring that most of the measurements are while the system is changing. To run the simulation use the command:

The options for the simulation tool are:

  • restart_control - bool - if true the initial steady state will be calculated from a random start.
  • time_points - int - the number of timepoints to return. Note more timepoints are generated and that number is controlled by the GRNmodel.parameters.time option.
  • exp_type - str - selects what type of experiment to run (steady-state, ss, time-series,ts).
  • draw_pattern - str - selects the distribution of the time points, accepts linear and log. Where linear are timepoints selected evenly across * all timepoints and log has more timepoints in the first half of the timepoints.
  • noise_model - str - select the noise model to use. For details on these see Noise Models below.
  • SNR - str/float - select the ratio of noise to signal that the noisy data will have.
  • end_at_ss - bool - if True the timepoints for time-series option will be drawn between perturbation and the first steady state rather than the last time point generated.

Noise models

GeneSNAKE supports a number of different noise models, a Gaussian noise model that adds random noise on top of the data by addition. A microarray like that adds a multiplicative log-norm noise. Finally a sequencing like noise model that mimics RNA-sequencing noise with or without dropout.

For the Gaussian and microarray like noise model the input are the same.

The RNA-seq like is more complex and uses a set of parameters to model the noise:

The noise settings depend on several parameters reflecting the sequencing procedure:

  • genes_ind - indices of genes, set “random” to get a random set of indices
  • c - coverage
  • L - read length [bp]
  • techrep - number of technical replicates of the sequencing
  • ermin - minimum sequencing error rate, 0-1
  • ermax - maximum sequencing error rate, 0-1
  • prob_zi - probability of zero-inflation, 0-1, for bulk set to 0
  • prob_nb - probability of noiseless data from negative binomial distribution, e.g. value around 1 is lack of noise
  • sf - scaling factor that adjusts variance in data to variance in noise, similar to SNR

Data analysis

In addition to data generation GeneSNAKE offers a set of basic tools for data analysis to determine the properties of data, both simulated and real. Currently we only offer a tool for a general data exploration analysis and a Umap tool called via:

Snakeda options:

  • data - pandas.DataFrame or file of type csv or json. input should be a matrix of genesXexperiments or experimentsXgenes (see option direct). NOTE for json file data must be named "data" or "expression".
  • perts - optional pertrubation matrix in csv or pandas.DataFrame format, either in matrix format or memory efficent format with experiment name followed by each relevant column in the data file. E.g. rep1,1,3,52,2,4,6.that if perts are not used SNAKED will try to estimate a perturbation design from row or column name, these should follow naming practis rep1.1,rep1.2..rep1.n..json files if pert is not given a perturbation design can be stored in the json file under the name P or perturbation, if this does not exist a similar automatic estimation as with .csv files will be performed using the experiment key.
  • out - name of file to save the created html file to. If set to None (default) analysis results will instead be printed stdout.
  • direct - direction the replicates in the matrix is expected to be found in in e.g. row for experiments⋅genes [default] or col for genes⋅experiments
  • noreps - option to turn of replicates, will disable most metrics but allows for analysis of data with only a single replicate per experiment.
  • ff - option to force the file type to be either csv or json. Does nothing if data is a pandas.DataFrame.

Umap options are:

  • data - the data to be plotted either a single csv file or a directory with csv files
  • nneighbors - number of neighbours for umap to use
  • output - name of file to save the created html file to.
  • groupby - option to set grouping to columns or file. Default column for single file and file for directory
  • savedata - option to save the data used to create the plot. Default False

Benchmarking example

GeneSNAKE offers also benchmarking of predictions, i.e. comparing predicted networks to the true network used to generate the model. Below you can find an example script that simulates data with GeneSNAKE, performs GRN inference and benchmarks the GRNs against gold standard created during simulations. Finally, a graphic illustration for selected metrics is displayed.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import string
import genesnake as gs

### simulation part
genes = 50 # number of genes
spars = 4 # sparsity

network = gs.grn.make_DAG(genes, spars, self_loops=True)
M = gs.GRNmodel.make_model(network)
M.set_pert('diag', effect=(0.9, 1), noise=0.01, reps=3, sign=-0.1)
M.simulate_data(exp_type='ss', SNR=10, noise_model='microarray')

### GRN inference part
Y = M.data # simulated expression data
P = M.perturbation # perturbation (P) matrix

# an example GRN inference with normalized least squares regression (LSCON)
# Hillerton et al. 2022
aest = np.matmul(-P, np.linalg.pinv(Y))
aest = pd.DataFrame(aest)
for column in aest:
    aest[column] = np.divide(aest[column], sum(abs(aest[column]))) * Y.shape[0]

### benchmarking part
inets = gs.benchmarking.cutoff_handler(aest) # create a span of networks from full to empty
out = gs.benchmarking.compare_networks(inets, M.network, exclude_diag=False, include_sign=False)
# print all metrics as data frame
outm = out['metrics']
# print(outm)

### plotting part
# plot a boxplot with selected statistics
figure, axis = plt.subplots(1, 2)
pal1 = ['#f62681','#00ffff','#fef250','#ffa500','#c8a2c8']
bp = axis[0].boxplot(outm[["mcc","f1","ppv","fdr","tpr"]], labels=['MCC', 'F1', 'PPV', 'FDR', 'TPR'], patch_artist=True)
axis[0].grid(axis='y')

for patch, color in zip(bp['boxes'], pal1):
    patch.set_facecolor(color)

for median in bp['medians']:
    median.set(color='black',
               linewidth=3)

data = {'AUROC': out['AUROC'], 'AUPR': out['AUPR']}
courses = list(data.keys())
values = list(data.values())

# creating the bar plot of AUPR and AUC
axis[1].bar(courses, values, color=['#9bbb59','#38761d'], width=0.4)

for n, ax in enumerate(axis):
    ax.text(-0.05, 1.05, string.ascii_uppercase[n], transform=ax.transAxes,
            size=18, weight='bold')

plt.show()

The options for compare_networks are:

  • nets - a list of inferred networks that span from full to empty (list of pandas data frames)
  • net_gs - gold standard network for comparison (pandas data frame)
  • exclude_diag - logical option if to exclude diagonal for comparison (True), or not (False)
  • include_sign - logical option if sign should be considered for comparison (True), or not (False)

Results are various performance metrics (based on confusion matrix). All individual metrics are listed below:

  • tp - true positive
  • fp - false positive
  • tn - true negative
  • fn - false negative
  • tpr - true positive rate
  • tnr -true negative rate
  • ppv - precision
  • npv - negative predictive value
  • fnr - false negative rate
  • fpr - false positive rate
  • fdr - false discovery rate
  • for - false omission rate
  • lrp - positive likelihood ratio
  • lrm - negative likelihood ratio
  • pt - prevalence threshold
  • ts - threat score
  • f1 - F1-score (harmonic mean of precision and sensitivity)
  • mcc - Matthews correlation coefficient
  • fm - Fowlkes-Mallows index
  • bm - informedness
  • mk - markedness
  • dor - diagnostic odds ratio
  • s - sparsity