GeneSNAKE is a Python-based toolbox for gene regulatory network (GRN) data simulation and benchmarking. It offers generation of graphs with GRN-like properties, a variety of different experimental perturbation designs, and simulation of gene expression data based on an ODE model. Further details can be found in the article.
Here we provide a brief introduction on how to perform the most common tasks in GeneSNAKE, loading or creating a network, preparing a model and simulating data, exporting the data as well as saving the current model for later use and finally how to perform a benchmark on a completed run. A detailed description of each function can be found in the appropriate section below.
The first step for using GeneSNAKE once it is installed is to import it into your python environment using:
The first step in starting a GeneSNAKE simulation is obtaining a GRN. This can be done in a few different ways. First and simplest is to use the built in algorithm in GeneSNAKE to generate a network with desired properties. Here we use the FFLatt algorithm but there are others built in to GeneSNAKE. To generate a network use:
Alternatively if you wish to use a topology not supported by GeneSNAKE to generate a network you can use the python package networkx which supports a large number of network topologies. These networks are likely less biologically realistic than the FFLatt algorithm but offers a great flexibility for network generation.
Note that if you use a non-directed network generator you must use the option undirected=TRUE when moving it to GeneSNAKE. The software will then add a direction randomly which will allow it to be used in simulation, though the simulation may not entirely work depending on the network topology in question.
Finally if you already have a GRN you wish to use you can do so using the gs.grn.load function to load a network from a file. The function can handle either an edge list or an adjacency matrix. The network can be with or without weights but must contain an indicator of sign (+/-) if unweighted. Note that the edge list must be formatted in following order “regulator,target,weight/sign”.
Once a GRN has been obtained the next step is to create a simulation model and the corresponding parameters for the model to use. Along with this a perturbation design should also be created for the experiments. If you wish to use default values this is as easy as:
Note that setting the perturbation is a separate step as this allows you to reuse the same model for different experiments should you wish.
If instead you want to specify some parameters the command becomes a bit more complex as you will need to feed the information to the right parameter. For all parameter options see the create a GeneSNAKE model section below. In principle this is done by:
Once the network and model is ready it is time to run the simulation, this can be done in two ways, or rather with two output types. Steady-state (ss) output provides only the control and final step of the simulation modeling an experiment with only two measurements. Alternatively time series (ts) provides several steps between the control and final step, modeling an experiment with multiple measurements over time.
A key setting when generating data is the noise level, here we will only work with Gaussian noise (normal) which is controlled by the SNR parameter where a higher value indicates less noise. It is defined as a ratio of mean and standard deviation:
\[ SNR = \frac{\mu}{\sigma} \] To run steady state use:
For running time series a few more parameters are available
M.simulate_data(exp_type='ts', SNR=5, noise_model='normal', time_points=10,draw_pattern='log', end_at_ss=TRUE)
The end_at_ss option for time_series ends the simulation at the first found steady state and ensures that all timepoints between the first and last falls in the time period when the expression is still changing.
Once data is generated you can easily export it either to a file using the GRN model export function.
This will create a csv file of the data and network in the format “file_name_data.csv”, “file_name_nonoise_data.csv and “file_name_GRN.csv” that you can continue to work with as you like.
If at this point you are exiting python either by the script ending or by closing the interpreter we recommend that you save the current model using the command:
This will allow you to easily load up the model, network and data in the future as well as sharing it with others.
Finally once you have run GRN prediction on the data it is time to perform benchmarking. To do so the first step is to load the data into a pandas DataFrame. If you work outside of python this is easiest done using:
Once this is done, load up the GeneSNAKE model you used to generate the data or load the network as described above. To load the model use:
If you run your tool in python and keep the model loaded during the run you obviously do not need to load anything and are ready to go. Regardless once the predicted and true GRN is available benchmarking can be done via:
# if your "results" variable is a single pandas data frame
# use cutoff_handler prior (see Benchmarking section)
bench = gs.benchmarking.compare_networks(results, M.network)
To get the results out you then simply do for all metrics you wish to see:
Below is the example code for how to perform each step above with default settings, but condensed into a single code box to make it easier to get an overview of what is done.
import genesnake as gs
# network generation
genes = 50
spars = 3
network = gs.grn.make_FFLATTnetwork(NETWORK_SIZE=genes, SPARSITY=spars)
# prepare the simulation model
scheme = 'diag'
M = grn.GRNmodel.make_model(network)
M.set_pert(scheme)
# run the simulation with and SNR of 5
M.simulate_data(exp_type='ss', SNR=5, noise_model='normal')
# finally export the data and save the GRN model for future use
M.export('geneSNAKE_steadystate_sim_1.csv',sep=',')
M.save('geneSNAKE_model_1.json')
Generating networks are central step of any GRN simulator a such GeneSNAKE offers several methods for doing this. The main method is the FFLatt algorithm which creates biologically feasible GRN like networks. For details on FFLatt see the publication.
# To generate network with FFLatt use the:
genes = 10
network = gs.grn.make_FFLATTnetwork(NETWORK_SIZE=genes)
A set of controllable topological parameters is available for FFLatt network generation (grn/config.json): Further options for the FFLatt tool are:
Other options include RANDOM_SEED - random_seed (int), a reproducibility parameter; SHUFFLED - whether network motifs should be shuffled with a preserved in/out-degree for all nodes until at least 80% of edges in the original network are swapped (1 or 0); N_CORES_TO_USE - number of cores to use for network generation (int); OUTPUT - the output representation of the network (adj_list or adj_matrix).
GeneSNAKE supplies two additional network generation algorithms a DAG and a random topology generator. These take the same input but are called using make_DAG or make_random. To generate a network use:
The options for both of the generators are:
Additionally Make_DAG also has the option:
In addition to the built in generators GeneSNAKE offers loading of custom networks from csv or tsv files. The loaded network can be in edge list or adjacency matrix form and is used via:
The options for loading network are:
In addition to the built in generators you can also use the networkx package to generate any graph supported by them and load that into GeneSNAKE. To do so use the function:
Once a network is generated the second key step is to translate this in to an GRNmodel object. This will contain all parameters needed to run the simulation of the data as well as help store various data objects needed to run the simulation. Due to the number of parameters available the model can seem complex to use but at its core all that is needed is a network:
This will create an ODE model using the default optimized values set by us to ensure a biologically feasible data generation. If for any reason you want to create your own parameters you can use the option:
If you want to generate some parameters you can provide a list of the parameters to generate using:
If you want to modify any of the parameter generation steps this can be done by feeding a dictionary with the setting to the make_model ensuring that you can determine the parameters generated fully. NOTE: We cannot guarantee that the generation will be biologically realistic or solvable at all if the parameters are changed too much.
All parameters their function and options
Experimental or perturbation design is a key feature in inferring gene regulatory networks, to make sure that any experiment you may be interested in can be represented in GeneSNAKE we offer a number of predefined generators. Generating a perturbation scheme is straightforward and relies only on an existing model object, if you do not have a gs.GRNmodel yet see the section “Create a GeneSNAKE model”. To create a perturbation for your model use:
The options for the set_pert is:
Design options for perturbation is:
Once a GRNmodel and a perturbation design have been created it is time to generate gene expression data. The data may vary depending on which model you have selected but in general it should be raw gene expression. The data will be stored both with and without noise and can be accessed under the GRNmodel. Should you want to change something, the same model, parameters and perturbation can be used to generate data repeatedly. Note however that this will overwrite the saved data in the GNRmodel so we recommend saving this somewhere else first if you wish to use it.
The simulator can be set up either to return time-series or steady-state data, for the time-series you can also choose to have a “better” time-series by ensuring that all selected timepoints fall between the perturbation and first steady-state. Ensuring that most of the measurements are while the system is changing. To run the simulation use the command:
The options for the simulation tool are:
GeneSNAKE supports a number of different noise models, a Gaussian noise model that adds random noise on top of the data by addition. A microarray like that adds a multiplicative log-norm noise. Finally a sequencing like noise model that mimics RNA-sequencing noise with or without dropout.
For the Gaussian and microarray like noise model the input are the same.
Model.simulate_data(exp_type='steady-state', SNR=5, noise_model='normal')
Model.simulate_data(exp_type='steady-state', SNR=5, noise_model='microarray')
The RNA-seq like is more complex and uses a set of parameters to model the noise:
The noise settings depend on several parameters reflecting the sequencing procedure:
In addition to data generation GeneSNAKE offers a set of basic tools for data analysis to determine the properties of data, both simulated and real. Currently we only offer a tool for a general data exploration analysis and a Umap tool called via:
gs.analyse.snakeda(data, pert) # calls the exploratory data analysis tool.
gs.analyse.umap(data, pert) # calls the Umap tool.
Snakeda options:
Umap options are:
GeneSNAKE offers also benchmarking of predictions, i.e. comparing predicted networks to the true network used to generate the model. Below you can find an example script that simulates data with GeneSNAKE, performs GRN inference and benchmarks the GRNs against gold standard created during simulations. Finally, a graphic illustration for selected metrics is displayed.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import string
import genesnake as gs
### simulation part
genes = 50 # number of genes
spars = 4 # sparsity
network = gs.grn.make_DAG(genes, spars, self_loops=True)
M = gs.GRNmodel.make_model(network)
M.set_pert('diag', effect=(0.9, 1), noise=0.01, reps=3, sign=-0.1)
M.simulate_data(exp_type='ss', SNR=10, noise_model='microarray')
### GRN inference part
Y = M.data # simulated expression data
P = M.perturbation # perturbation (P) matrix
# an example GRN inference with normalized least squares regression (LSCON)
# Hillerton et al. 2022
aest = np.matmul(-P, np.linalg.pinv(Y))
aest = pd.DataFrame(aest)
for column in aest:
aest[column] = np.divide(aest[column], sum(abs(aest[column]))) * Y.shape[0]
### benchmarking part
inets = gs.benchmarking.cutoff_handler(aest) # create a span of networks from full to empty
out = gs.benchmarking.compare_networks(inets, M.network, exclude_diag=False, include_sign=False)
# print all metrics as data frame
outm = out['metrics']
# print(outm)
### plotting part
# plot a boxplot with selected statistics
figure, axis = plt.subplots(1, 2)
pal1 = ['#f62681','#00ffff','#fef250','#ffa500','#c8a2c8']
bp = axis[0].boxplot(outm[["mcc","f1","ppv","fdr","tpr"]], labels=['MCC', 'F1', 'PPV', 'FDR', 'TPR'], patch_artist=True)
axis[0].grid(axis='y')
for patch, color in zip(bp['boxes'], pal1):
patch.set_facecolor(color)
for median in bp['medians']:
median.set(color='black',
linewidth=3)
data = {'AUROC': out['AUROC'], 'AUPR': out['AUPR']}
courses = list(data.keys())
values = list(data.values())
# creating the bar plot of AUPR and AUC
axis[1].bar(courses, values, color=['#9bbb59','#38761d'], width=0.4)
for n, ax in enumerate(axis):
ax.text(-0.05, 1.05, string.ascii_uppercase[n], transform=ax.transAxes,
size=18, weight='bold')
plt.show()
The options for compare_networks are:
Results are various performance metrics (based on confusion matrix). All individual metrics are listed below: