GeneSNAKE is a Python-based toolbox for gene regulatory network (GRN) data simulation and benchmarking. It offers generation of graphs with GRN-like properties, a variety of different experimental perturbation designs, and simulation of gene expression data based on an ODE model. Further details can be found in the article.
Ensure Python and pip are installed. Use of virtual environments (venv) is also often a good idea. Good instructions for this can be found on the internet. Then, on a Linux command line, install using:
Here we provide a brief introduction on how to perform the most common tasks in GeneSNAKE, loading or creating a network, preparing a model and simulating data, exporting the data as well as saving the current model for later use and finally how to perform a benchmark on a completed run. A detailed description of each function can be found in the appropriate section below.
Inside Python, import using:
The first step in starting a GeneSNAKE simulation is obtaining a GRN. This can be done in a few different ways. First and simplest is to use the built in algorithm in GeneSNAKE to generate a network with desired properties. Here we use the FFLatt algorithm but there are others built in to GeneSNAKE. To generate a network use:
Alternatively if you wish to use a topology not supported by GeneSNAKE to generate a network you can use the python package networkx which supports a large number of network topologies. These networks are likely less biologically realistic than the FFLatt algorithm but offers a great flexibility for network generation.
Note that if you use a non-directed network generator you must use the option undirected=TRUE when moving it to GeneSNAKE. The software will then add a direction randomly which will allow it to be used in simulation, though the simulation may not entirely work depending on the network topology in question.
Finally if you already have a GRN you wish to use you can do so using the gs.grn.load function to load a network from a file. The function can handle either an edge list or an adjacency matrix. The network can be with or without weights but must contain an indicator of sign (+/-) if unweighted. Note that the edge list must be formatted in following order “regulator,target,weight/sign”. For demonstration purposes, saving and then reloading a networkx network can be done with:
Once a GRN has been obtained the next step is to create a simulation model and the corresponding parameters for the model to use. Along with this a perturbation design should also be created for the experiments. If you wish to use default values this is as easy as:
Note that setting the perturbation is a separate step as this allows you to reuse the same model for different experiments should you wish.
If instead you want to specify some parameters the command becomes a bit more complex as you will need to feed the information to the right parameter. For all parameter options see the create a GeneSNAKE model section below. In principle this is done by:
Once the network and model is ready it is time to run the simulation, this can be done in two ways, or rather with two output types. Steady-state (ss) output provides only the control and final step of the simulation modeling an experiment with only two measurements. Alternatively time series (ts) provides several steps between the control and final step, modeling an experiment with multiple measurements over time.
A key setting when generating data is the noise level, here we will only work with Gaussian noise (normal) which is controlled by the SNR parameter where a higher value indicates less noise. It is defined as a ratio of mean and standard deviation:
\[ SNR = \frac{\mu}{\sigma} \]
To run steady state use:
For running time series a few more parameters are available
M.simulate_data(
exp_type='ts', SNR=5, noise_model='normal',
time_points=10,draw_pattern='log', end_at_ss=True
)
The end_at_ss option for time_series ends the simulation at the first found steady state and ensures that all timepoints between the first and last falls in the time period when the expression is still changing.
You can save the current model and the data associated to it using the command:
This will allow you to easily load up the model, network and data in the future as well as sharing it with others. To reload the model, use:
Below is the example code for how to perform each step above with default settings, but condensed into a single code box to make it easier to get an overview of what is done.
import genesnake as gs
# network generation
genes = 50
spars = 3
network = gs.grn.make_FFLATTnetwork(NETWORK_SIZE=genes, SPARSITY=spars)
# prepare the simulation model
scheme = 'diag'
M = gs.GRNmodel.make_model(network)
M.set_pert(scheme)
# run the simulation with and SNR of 5
M.simulate_data(exp_type='ss', SNR=5, noise_model='normal')
# finally export the data and save the GRN model for future use
M.save('tutorial_model.pickle')
Generating networks are central step of any GRN simulator a such GeneSNAKE offers several methods for doing this. The main method is the FFLatt algorithm which creates biologically feasible GRN like networks. For details on FFLatt see the publication.
NOTE: In GeneSNAKE, link / edge direction is defined as going from row to column. In other words, the regulators are on the rows, and the targets on the columns. This is opposite to the GeneSPIDER Matlab package.
# To generate network with FFLatt use the:
genes = 30
network = gs.grn.make_FFLATTnetwork(NETWORK_SIZE=genes)
A set of controllable topological parameters is available for FFLatt network generation (grn/config.json): Further options for the FFLatt tool are:
Other options include RANDOM_SEED - random_seed (int), a reproducibility parameter; SHUFFLED - whether network motifs should be shuffled with a preserved in/out-degree for all nodes until at least 80% of edges in the original network are swapped (1 or 0); N_CORES_TO_USE - number of cores to use for network generation (int); OUTPUT - the output representation of the network (adj_list or adj_matrix).
GeneSNAKE supplies three additional network generation algorithms; Directed Acyclic Graphs (DAGs), scale free networks, and random networks. For example, a DAG can be generated using:
All options for the different network generators are available in the function docstrings.
Once a network is generated the second key step is to translate this in to an GRNmodel object. This will contain all parameters needed to run the simulation of the data as well as help store various data objects needed to run the simulation. Due to the number of parameters available the model can seem complex to use but at its core all that is needed is a network:
This will create an ODE model using the default optimized values set by us to ensure a biologically feasible data generation.
If you want to modify any of the parameter generation steps this can be done by feeding a dictionary with the setting to the make_model ensuring that you can determine the parameters generated fully. An example is shown below. Note that we cannot guarantee that the generation will be biologically realistic or solvable at all.
Model = gs.GRNmodel.make_model(
network,
alphas={'maxa':1,'mina':0.8},
combinations={'comb_prob':0.5}
)
All parameters their function and options
Experimental or perturbation design is a key feature in inferring gene regulatory networks, to make sure that any experiment you may be interested in can be represented in GeneSNAKE we offer a number of predefined generators. Generating a perturbation scheme is straightforward and relies only on an existing model object, if you do not have a gs.GRNmodel yet see the section “Create a GeneSNAKE model”. To create a perturbation for your model use:
The options for the set_pert is:
Design options for perturbation is:
Once a GRNmodel and a perturbation design have been created it is time to generate gene expression data. The data may vary depending on which model you have selected but in general it should be raw gene expression. The data will be stored both with and without noise and can be accessed under the GRNmodel. Should you want to change something, the same model, parameters and perturbation can be used to generate data repeatedly. Note however that this will overwrite the saved data in the GNRmodel so we recommend saving this somewhere else first if you wish to use it.
The simulator can be set up either to return time-series or steady-state data, for the time-series you can also choose to have a “better” time-series by ensuring that all selected timepoints fall between the perturbation and first steady-state. Ensuring that most of the measurements are while the system is changing. To run the simulation use the command:
The options for the simulation tool are:
GeneSNAKE supports three noise models. The normal noise model adds Gaussian noise multiplicatively. The log_normal log-normal noise mutliplicatively. Finally, the RNA sequencing-like one adds noise with simulated gene length coverage effects, negative bionomial noise, and zero inflation, also multiplicatively. They are chosen by:
GeneSNAKE also offers benchmarking of predictions, i.e. comparing predicted networks to the network used to generate the model. Below is an example script that simulates data with GeneSNAKE, performs GRN inference and benchmarks the inferred GRNs against the network used for simulations.
import numpy as np
import pandas as pd
import genesnake as gs
model_type = 'lsco'
genes = 100
self_loops = True
cell_replicates = 3
snr = 3
noise_model = 'normal'
using_fold_change_data = True
network = gs.grn.make_FFLATTnetwork(
network_size = genes, sparsity = 3,
self_loops = self_loops
)
print('network generated')
M = gs.GRNmodel.make_model(network)
M.set_pert(
'diag', effect = (0.9, 1), noise = 0.01,
reps = cell_replicates, sign = -0.1
)
print('model generated')
M.simulate_data(exp_type = 'ss', SNR = snr, noise_model = noise_model)
noise = M.noise
data = M.data
noise_free_data = M.noise_free_data
reference_network = M.network
print('data simulated')
if using_fold_change_data:
fold_change_data = np.zeros((genes,genes*cell_replicates))
nsf_fold_change_data = np.zeros((genes,genes*cell_replicates))
for i in range(genes*cell_replicates):
for j in range(genes):
fold_change_data[j,i] = np.log2(M.data.values[j,i]/M.steady_state_RNA[j])
nsf_fold_change_data[j,i] = np.log2(M.noise_free_data.values[j,i]/M.noise_free_steady_state_RNA[j])
fold_change_data = pd.DataFrame(fold_change_data,index=M.data.index,columns=M.data.columns)
nsf_fold_change_data = pd.DataFrame(nsf_fold_change_data,index=M.data.index,columns=M.data.columns)
expression_data = fold_change_data
else:
expression_data = M.data
print('fold changes calculated')
estimated_network = gs.inference.infer_networks(
Y = expression_data.T,
P = M.perturbation.T,
method = model_type)
print('network inferred')
all_stats = gs.benchmarking.benchmark(
estimated_network = estimated_network,
reference_network = reference_network.astype(bool),
plot_dir = 'example_outputs',
method_name = 'example',
exclude_diagonal = True
)
print('inference benchmarked')
print(f'{all_stats = }')
print('output figures are in ./example_outputs')