GeneSNAKE is a Python-based toolbox for gene regulatory network (GRN) data simulation and benchmarking. It offers generation of graphs with GRN-like properties, a variety of different experimental perturbation designs, and simulation of gene expression data based on an ODE model. Further details can be found in the article.

Installation

Ensure Python and pip are installed. Use of virtual environments (venv) is also often a good idea. Good instructions for this can be found on the internet. Then, on a Linux command line, install using:

git clone https://bitbucket.org/sonnhammergrni/genesnake.git
cd genesnake
pip3 install .

Working with GeneSNAKE

Here we provide a brief introduction on how to perform the most common tasks in GeneSNAKE, loading or creating a network, preparing a model and simulating data, exporting the data as well as saving the current model for later use and finally how to perform a benchmark on a completed run. A detailed description of each function can be found in the appropriate section below.

Example

There is an example script in the repository root. To run it:

# cd genesnake  # If you are not already in the repository
python3 example.py

Importing to Python

Inside Python, import using:

import genesnake as gs  

Loading or creating a network

The first step in starting a GeneSNAKE simulation is obtaining a GRN. This can be done in a few different ways. First and simplest is to use the built in algorithm in GeneSNAKE to generate a network with desired properties. Here we use the FFLatt algorithm but there are others built in to GeneSNAKE. To generate a network use:

genes = 50
spars = 3 
network = gs.grn.make_FFLATTnetwork(NETWORK_SIZE=genes, SPARSITY=spars)

Alternatively if you wish to use a topology not supported by GeneSNAKE to generate a network you can use the python package networkx which supports a large number of network topologies. These networks are likely less biologically realistic than the FFLatt algorithm but offers a great flexibility for network generation.

import networkx as nx 
G = nx.gn_graph(10)
network = gs.grn.from_networkx(G, undirected=False)

Note that if you use a non-directed network generator you must use the option undirected=TRUE when moving it to GeneSNAKE. The software will then add a direction randomly which will allow it to be used in simulation, though the simulation may not entirely work depending on the network topology in question.

Finally if you already have a GRN you wish to use you can do so using the gs.grn.load function to load a network from a file. The function can handle either an edge list or an adjacency matrix. The network can be with or without weights but must contain an indicator of sign (+/-) if unweighted. Note that the edge list must be formatted in following order “regulator,target,weight/sign”. For demonstration purposes, saving and then reloading a networkx network can be done with:

gs.grn.to_edgelist(network, 'tutorial_networkx_network.csv', sep=",")
network2 = gs.grn.load('tutorial_networkx_network.csv', sep=',', kind='edgelist')

Preparing the simulation model

Once a GRN has been obtained the next step is to create a simulation model and the corresponding parameters for the model to use. Along with this a perturbation design should also be created for the experiments. If you wish to use default values this is as easy as:

scheme = 'diag'
M = gs.GRNmodel.make_model(network) 
M.set_pert(scheme)

Note that setting the perturbation is a separate step as this allows you to reuse the same model for different experiments should you wish.

If instead you want to specify some parameters the command becomes a bit more complex as you will need to feed the information to the right parameter. For all parameter options see the create a GeneSNAKE model section below. In principle this is done by:

M = gs.GRNmodel.make_model(network, alphas={'maxa':1, 'mina':0.8})
M.set_pert(scheme,effect=(0.7, 0.9), reps=3, noise=0.1) 

Run the simulation and extract the data

Once the network and model is ready it is time to run the simulation, this can be done in two ways, or rather with two output types. Steady-state (ss) output provides only the control and final step of the simulation modeling an experiment with only two measurements. Alternatively time series (ts) provides several steps between the control and final step, modeling an experiment with multiple measurements over time.

A key setting when generating data is the noise level, here we will only work with Gaussian noise (normal) which is controlled by the SNR parameter where a higher value indicates less noise. It is defined as a ratio of mean and standard deviation:

\[ SNR = \frac{\mu}{\sigma} \]

To run steady state use:

M.simulate_data(exp_type='ss', SNR=5, noise_model='normal')

For running time series a few more parameters are available

M.simulate_data(
    exp_type='ts', SNR=5, noise_model='normal', 
    time_points=10,draw_pattern='log', end_at_ss=True
    )

The end_at_ss option for time_series ends the simulation at the first found steady state and ensures that all timepoints between the first and last falls in the time period when the expression is still changing.

You can save the current model and the data associated to it using the command:

M.save('tutorial_model.pickle')

This will allow you to easily load up the model, network and data in the future as well as sharing it with others. To reload the model, use:

M_reloaded = gs.GRNmodel.load('tutorial_model.pickle')

Code example

Below is the example code for how to perform each step above with default settings, but condensed into a single code box to make it easier to get an overview of what is done.

import genesnake as gs

# network generation 
genes = 50 
spars = 3 
network = gs.grn.make_FFLATTnetwork(NETWORK_SIZE=genes, SPARSITY=spars)

# prepare the simulation model
scheme = 'diag'
M = gs.GRNmodel.make_model(network) 
M.set_pert(scheme)

# run the simulation with and SNR of 5 
M.simulate_data(exp_type='ss', SNR=5, noise_model='normal')

# finally export the data and save the GRN model for future use
M.save('tutorial_model.pickle')

Network generation

Generating networks are central step of any GRN simulator a such GeneSNAKE offers several methods for doing this. The main method is the FFLatt algorithm which creates biologically feasible GRN like networks. For details on FFLatt see the publication.

NOTE: In GeneSNAKE, link / edge direction is defined as going from row to column. In other words, the regulators are on the rows, and the targets on the columns. This is opposite to the GeneSPIDER Matlab package.

# To generate network with FFLatt use the:
genes = 30
network = gs.grn.make_FFLATTnetwork(NETWORK_SIZE=genes) 

A set of controllable topological parameters is available for FFLatt network generation (grn/config.json): Further options for the FFLatt tool are:

  • FFL_ENRICHED - Enrichment of FFL motif (1 or 0);
  • NO_CYCLES - Depletion of 3-node cycle motif (1 or 0);
  • OTHER_MOTIF_DEPLETION - Depletion of cascades and uplinks (1 or 0);
  • SPARSITY - Network sparsity (int);
  • TEST_NETWORK_SIZE - Network size (int);

Other options include RANDOM_SEED - random_seed (int), a reproducibility parameter; SHUFFLED - whether network motifs should be shuffled with a preserved in/out-degree for all nodes until at least 80% of edges in the original network are swapped (1 or 0); N_CORES_TO_USE - number of cores to use for network generation (int); OUTPUT - the output representation of the network (adj_list or adj_matrix).

GeneSNAKE supplies three additional network generation algorithms; Directed Acyclic Graphs (DAGs), scale free networks, and random networks. For example, a DAG can be generated using:

network = gs.grn.make_DAG(30, degree = 3, sign_bias=0.6)

All options for the different network generators are available in the function docstrings.

Create a GeneSNAKE model

Once a network is generated the second key step is to translate this in to an GRNmodel object. This will contain all parameters needed to run the simulation of the data as well as help store various data objects needed to run the simulation. Due to the number of parameters available the model can seem complex to use but at its core all that is needed is a network:

Model = gs.GRNmodel.make_model(network)

This will create an ODE model using the default optimized values set by us to ensure a biologically feasible data generation.

If you want to modify any of the parameter generation steps this can be done by feeding a dictionary with the setting to the make_model ensuring that you can determine the parameters generated fully. An example is shown below. Note that we cannot guarantee that the generation will be biologically realistic or solvable at all.

Model = gs.GRNmodel.make_model(
    network,
    alphas={'maxa':1,'mina':0.8},
    combinations={'comb_prob':0.5}
    )

All parameters their function and options

  • steady_state_RNA/steady_state_prot - random floats used as starting point. This should probably never be changed. Options:
    • Genes - the number of values to generate
    • High_exp - float - the maximum value
    • low_exp - float - the minimum value
  • time - the number and size of time points and time steps. Options:
    • time_points - int - the number of time steps
    • end_time - int - the final number in the time series, will not affect the simulation much
  • max_rna_exp/max_prot_exp - controls the m parameter and thus sets the total expression level allowed for RNA or protein expression. Options:
    • genes - the number of values to generate.
    • Emax - float - float in range 0-1 which is the maximum value generated.
    • Emin - float - float in range 0-1 which is the minimum value generated. NOTE. Emax and Emin is often the same value, generally 1.0.
  • hill_coef - the hill coefficient used for the ODE equation in the simulation. Options:
    • genes - the number of values to generate.
    • Emax - intt - float in range 2-inf which is the maximum value generated.
    • Emin - int - float in range 2-inf which is the minimum value generated.
  • combinations - the combinatory behaviour of genes that regulates the same target. Can either have AND or OR logicgate like behavior. Options:
    • network - the network the model builds on, used to determine co-regulation.
    • max_comb - int - the maximum number of genes that can form a co-regulation unit. All collaborations with more members than this will be discarded.
    • comb_prob - float - Float in range 0-1 that determine the probability of a co-regulatory relationship being an AND gate meaning both regulators are needed.
  • alphas - a value that control the regulatory strength the regulator has on the target gene.
    Options:
    • Network - the network to get information about the regulator - target relationship.
    • combinations - a dict of combinatory regulations from the combinations function.
    • genes - a list of gene names used to make sure each gene gets an alpha value for each of it’s regulators and combinations.
    • maxa - float - the maximum value alpha can take.
    • mina - float - the minimum value alpha can take.
  • dis_const - the dissociation constant, how likely it is that a bound regulator will detach and thus not have any regulatory effect. Options:
    • Network - the network to get information about the regulator - target relationship.
    • combinations - a dict of combinatory regulations from the combinations function.
    • genes - a list of gene names used to make sure each gene gets an alpha value for each of its regulators and combinations.
    • maxa - float - the maximum value the dis_constant can take.
    • mina - float - the minimum value the dis_constant can take.
  • lambda_RNA/lambda_prot - The degradation rate for RNA and protein respectively. Controls how fast molecules will degrade after creation following the equation lambdamolecule. E.g. lambda_RNARNA is the degradation of RNA.
    NOTE: Increasing or decreasing this too much will prevent perturbations due to all new RNA/proteins either degrading instantly or proteins never degrading. Options:
    • genes - the number of lambda values to generate.
    • Emax - the maximum degradation rate.
    • Emin - the minimum degradation rate.

Perturbation generation

Experimental or perturbation design is a key feature in inferring gene regulatory networks, to make sure that any experiment you may be interested in can be represented in GeneSNAKE we offer a number of predefined generators. Generating a perturbation scheme is straightforward and relies only on an existing model object, if you do not have a gs.GRNmodel yet see the section “Create a GeneSNAKE model”. To create a perturbation for your model use:

Model.set_pert('design', effect=(0.8,0.9), noise=0, reps=3, sign=-1)

The options for the set_pert is:

  • design - the design to use, see below for details.
  • effect - tuple - (A, B) the perturbation effect will be drawn in the range A-B. For a single strength perturbation use (A,A).
  • noise - float - a float number that will be used to add additional fluctuation to the perturbation. The perturbation will then be +/- noise changing the effect at random between replicates.
  • sign - +/-1 determines if the perturbation increases or decreases gene expression.

Design options for perturbation is:

  • Single gene - Each gene in the system is perturbed one by one so that all genes in the system are perturbed once.
  • Combinatory - Multiple genes are perturbed in each experiment in such a way that each gene is perturbed roughly an equal amount of times across all experiments. The number of perturbed genes per experiment is determined by the user.
  • Targeted combinations - Multiple genes are perturbed in each experiment and combinations are selected from the GRN so that genes regulating the same target are perturbed in the combinations. The number of perturbed genes per experiment is determined by the user.
  • All genes - All genes in the system are perturbed with a random but small effect, either positive or negative, to simulate global perturbation experiments such as cell medium or temperature changes.
  • Specific genes - A single or multiple genes are perturbed in each experiment based on an input list of gene names and gene combinations to perturb. Only those genes in the list will be perturbed.

Running the simulation

Once a GRNmodel and a perturbation design have been created it is time to generate gene expression data. The data may vary depending on which model you have selected but in general it should be raw gene expression. The data will be stored both with and without noise and can be accessed under the GRNmodel. Should you want to change something, the same model, parameters and perturbation can be used to generate data repeatedly. Note however that this will overwrite the saved data in the GNRmodel so we recommend saving this somewhere else first if you wish to use it.

The simulator can be set up either to return time-series or steady-state data, for the time-series you can also choose to have a “better” time-series by ensuring that all selected timepoints fall between the perturbation and first steady-state. Ensuring that most of the measurements are while the system is changing. To run the simulation use the command:

Model.set_pert('diag')
Model.simulate_data(exp_type='steady-state', SNR=5, noise_model='normal')

The options for the simulation tool are:

  • restart_control - bool - if true the initial steady state will be calculated from a random start.
  • time_points - int - the number of timepoints to return. Note more timepoints are generated and that number is controlled by the GRNmodel.parameters.time option.
  • exp_type - str - selects what type of experiment to run (steady-state, ss, time-series,ts).
  • draw_pattern - str - selects the distribution of the time points, accepts linear and log. Where linear are timepoints selected evenly across * all timepoints and log has more timepoints in the first half of the timepoints.
  • noise_model - str - select the noise model to use. For details on these see Noise Models below.
  • SNR - str/float - select the ratio of noise to signal that the noisy data will have.
  • end_at_ss - bool - if True the timepoints for time-series option will be drawn between perturbation and the first steady state rather than the last time point generated.

Noise models

GeneSNAKE supports three noise models. The normal noise model adds Gaussian noise multiplicatively. The log_normal log-normal noise mutliplicatively. Finally, the RNA sequencing-like one adds noise with simulated gene length coverage effects, negative bionomial noise, and zero inflation, also multiplicatively. They are chosen by:

Model.simulate_data(exp_type='steady-state', SNR=5, noise_model='normal')
Model.simulate_data(exp_type='steady-state', SNR=5, noise_model='log_normal')
Model.simulate_data(exp_type='steady-state', SNR='medium', noise_model='rnaseq')

Benchmarking example

GeneSNAKE also offers benchmarking of predictions, i.e. comparing predicted networks to the network used to generate the model. Below is an example script that simulates data with GeneSNAKE, performs GRN inference and benchmarks the inferred GRNs against the network used for simulations.

import numpy as np 
import pandas as pd
import genesnake as gs

model_type = 'lsco'
genes = 100
self_loops = True
cell_replicates = 3
snr = 3
noise_model = 'normal'
using_fold_change_data = True

network = gs.grn.make_FFLATTnetwork(
    network_size = genes, sparsity = 3, 
    self_loops = self_loops
    )
print('network generated')

M = gs.GRNmodel.make_model(network)
M.set_pert(
    'diag', effect = (0.9, 1), noise = 0.01, 
    reps = cell_replicates, sign = -0.1
    )
print('model generated')

M.simulate_data(exp_type = 'ss', SNR = snr, noise_model = noise_model)
noise = M.noise
data = M.data
noise_free_data = M.noise_free_data
reference_network = M.network
print('data simulated')

if using_fold_change_data:
    fold_change_data = np.zeros((genes,genes*cell_replicates))
    nsf_fold_change_data = np.zeros((genes,genes*cell_replicates))
    for i in range(genes*cell_replicates):
        for j in range(genes):
            fold_change_data[j,i] = np.log2(M.data.values[j,i]/M.steady_state_RNA[j])
            nsf_fold_change_data[j,i] = np.log2(M.noise_free_data.values[j,i]/M.noise_free_steady_state_RNA[j])
    fold_change_data = pd.DataFrame(fold_change_data,index=M.data.index,columns=M.data.columns)
    nsf_fold_change_data = pd.DataFrame(nsf_fold_change_data,index=M.data.index,columns=M.data.columns)
    expression_data = fold_change_data
else:
    expression_data = M.data
print('fold changes calculated')

estimated_network = gs.inference.infer_networks(
    Y = expression_data.T, 
    P = M.perturbation.T,
    method = model_type)
print('network inferred')

all_stats = gs.benchmarking.benchmark(
    estimated_network = estimated_network,
    reference_network = reference_network.astype(bool),
    plot_dir = 'example_outputs',
    method_name = 'example',
    exclude_diagonal = True
    )
print('inference benchmarked')

print(f'{all_stats = }')
print('output figures are in ./example_outputs')