Getting started

GeneSPIDER(v2) (Generation and Simulation Package for Informative Data ExploRation) is a Matlab toolbox for gene regulatory network inference and benchmarking with controlled network and data properties. Recent version of GeneSPIDER (v2), allows to simulate perturbed single-cell data.

Download and install GeneSPIDER from: https://bitbucket.org/sonnhammergrni/genespider GeneSPIDER is supported by external tools included in the package:

Why GeneSPIDER?

Inference of gene regulatory networks (GRNs) is a central goal in systems biology. It is therefore important to evaluate the accuracy of GRN inference methods in the light of network and data properties. Although several packages are available for modelling, simulate, and analyse GRN inference, they offer limited control of network topology together with system dynamics, experimental design, data properties, and noise characteristics. Independent control of these properties in simulations is key to drawing conclusions about which inference method to use in a given condition and what performance to expect from it, as well as to obtain properties representative of real biological systems.

What is GeneSPIDER?

We here present the Matlab toolbox GeneSPIDER for generation and analysis of networks and data in a dynamical systems framework with focus on the ability to vary properties on data to mimic plausible real world settings. It supplies essential components that have been missing to existing network inference methods in common use and wrappers for a selected number of inference methods. GeneSPIDER contains tools for controlling and evaluating network topology (random, small-world, scale-free), stability of linear time-invariant systems, signal to noise ratio (SNR), and network Interampatteness, properties that has been shown to play a major role in the ability to infer a good network that can explain the data. Procedures for design of perturbation experiments, bootstrapping, analysis of linear dependence, sample selection, scaling of SNR, and performance evaluation are included. The ability of GeneSPIDER to independently control network and data properties in simulations, together with its tools to analyse these properties and the quality of inferred GRNs enables much more informative analysis of GRN inference performance than was previously possible.
Schematic workflow of network and data generation followed by analysis in GeneSPIDER
Schematic workflow of network and data generation followed by analysis in GeneSPIDER

Citing GeneSPIDER

Tjärnberg A, Morgan DC, Studham M, Nordling TEM, Sonnhammer ELL. GeneSPIDER - gene regulatory network inference benchmarking with controlled network and data properties. Mol Biosyst. 2017 Jun 27;13(7):1304-1312. doi: 10.1039/c7mb00058h.PMID:28485748

Theory behind GeneSPIDER

In this section, you will find examples of data structures and formats used by GeneSPIDER

From GRN to data

General workflow of the GeneSPIDER simulations

Format of data matrix Y

Table.1 Example format of Y matrix for two replicates
G1_1 G2_1 G3_1 …_1 Gn_1 G1_2 G2_2 G3_2 …_2 Gn_2
G1 -0.91 -0.19 0.34 0.85 -0.98 0.62 0.25 -0.48
G2 0.12 -0.96 -0.77 -0.87 -0.95 -0.95 0.97 0.03
G3 0.94 -0.15 -0.99 0.55 0.7 -0.97 -0.99 0.06
Gn -0.09 -0.19 0.81 -0.94 0.79 -0.83 -0.54 -0.93

Format of perturbation design matrix P

Table.2 Example format of P matrix for two replicates
G1_1 G2_1 G3_1 …_1 Gn_1 G1_2 G2_2 G3_2 …_2 Gn_2
G1 -1 0 0 0 -1 0 0 0
G2 0 -1 0 0 0 -1 0 0
G3 0 0 -1 0 0 0 -1 0
Gn 0 0 0 -1 0 0 0 -1

GRN Inference methods in GeneSPIDER

Table.3 A list of inference methods available in GeneSPIDER. First column inidcates name of the method in GeneSPIDER
Algorithm/tool P matrix
adaboost Adaptive boosting
bag Bootstrap Aggregation (Bagging) and Random Forest
CART Classification And Regression Tree
elnet Elastic Net
fcls Constrained least squares with Newton
gentleboost Gentle Adaptive Boosting
GLasso Graphical lasso
gplin Gaussian process regression for linear function
julius LASSO based convex programming
lars Least Angle Regression
lasso Lasso
lassoglmp Lasso for generalized linear model with Poisson distribution
lassolog Logistic learner with lasso regularization
logitboost Adaptive Logistic Regression
lsco Least Squares Cut Off
LSCON Least Squares Cut Off with Normalization
neunet Neural network model-based testing
neunetcv Neural network training and testing in cross-validation process
ridgeco Ridge regression with Cut Off
RNI Robust network inference
svmc Support Vector Machine classification
tlsco Total Least Squares Cut Off
Zscore Z-score based inference
aracne Mutual information with ARACNE
BC3NET Bayesian bootstrapping based on mutual information
CLR Context Likelihood of Relatedness
GENIE3 Regression trees with Genie3
PLSNET Partial least squares
svmr Support Vector Machine regression
TIGRESS Trustful Inference of Gene REgulation using Stability Selection

Example GRN

Below is an example visualization of directed, signed and weighted network.

Table.4 An example of adjacency matrix reflecting inferred network.
G1 G2 G3 G4 G5
G1 0.0 0 0.00 0.00 0.9
G2 -0.5 0 0.25 0.00 -0.7
G3 -0.9 0 0.00 0.00 0.0
G4 0.0 0 -0.80 0.00 0.0
G5 0.0 0 0.56 0.77 0.0

Functions guide

datastruct.large_scalefree(N, S)
  • Creates a large scale-free network in efficient time
    • N: number of genes in the network (size)
    • S: average degree of edges in the network (sparsity)
    • min_sub_grn: min. number of genes for subnetwork (default 10)
    • max_sub_grn: max. number of genes for subnetwork (default 30)
    • min_alpha: min. power parameter for scale-free overlap (default 1.5)
    • max_alpha: max. power parameter for scale-free overlap (default 2.5)
    • min_ov: min. number of overlapping genes (default 1)
    • max_ov: max. number of overlapping genes (default 2)
  • Returns array of size N*N

datastruct.scalefree(N, S)
  • Creates a scale-free network with probability estimated to edges (GS version 1)
    • N: number of genes in the network (size)
    • S: average degree of edges in the network (sparsity)
  • Returns array of size N*N

datastruct.scalefree2(N, S)
  • Creates a scale-free network with probability estimated to edges (GS version 2)
    • N: number of genes in the network (size)
    • S: average degree of edges in the network (sparsity)
    • alpha: the power value in the power law distribution
    • pasign: probability of activation sign, (default is 0.62 based on TRRUST for h. sapiens)
  • Returns array of size N*N

datastruct.stabilize(A, iaa)
  • Applies stabilization to the network
    • A: network array
    • iaa: interampatteness level, select between ‘low’ or ‘high’ (numeric)
    • sign: if signed structure is to be kept (logical) (default false)
  • Returns array of the size of the A

[E, stdE] = datastruct.noise(X,P,...);
  • Creates additive noise for the data
    • X: an array with noise-free data
    • SNR: value of signal-to-noise ratio
    • SNR_model: signal-to-noise ratio model, choose between “SNR_L” (default), “SNR_var”, “SNR_wiki”, “SNR_wiki2”, “SNR_cov” and “SNR_manual”
  • Returns
    • E - an array of noise
    • stdE - a standard deviation of noise

datastruct.Network(A, name)
  • Creates a Network object
    • A: network array
    • name: name of the network (character)

datastruct.Dataset(D, Net)
  • Creates a Dataset object
    • D: dataset structure object
    • Net: a Network object

[Y, X, Ed, Eg, SCC] = datastruct.scdata(A, P,...);
  • Creates a single-cell data set with defined noise
    • A: network array
    • P: perturbation matrix
    • SNR: signal-to-noise ratio (default 0.1)
    • SNR_model: signal-to-noise ratio model, choose between “SNR_L” (default), “SNR_var”, “SNR_wiki”, “SNR_wiki2”, “SNR_cov” and “SNR_manual”
    • raw_counts: if true, raw counts are outputted, otherwise fold-change (FC)
    • left_tail: power of left tail in Pk distribution (>1 makes left tail longer), (default 1)
    • right_tail power of right tail in Pk distribution (>1 makes right tail longer), (default 2)
    • negbin_prob: mean for negative binomial used to simulate pseudo control, (default 0.5)
    • disper: dispersion parameter in Pk (Svensson et al.), (default 1)
    • n_clust: number of clusters to be created (default 5)
    • logbase: log base to use in reversing FC, (default 10)
    • ds_min: minimum dissimilarity between clusters, (default 0.3)
    • ds_max: maximum dissimilarity between clusters, (default 0.6)
  • Returns
    • Y: array with fold-changes
    • X: noise-free array with fold-changes
    • Ed: binary array with dropouts
    • Eg: array with Gaussian noise
    • SCC: array with simulated control counts

Methods.(method)(Data, zeta)
  • Infers gene regulatory networks from data
    • Data: data set object
    • zeta: a vector of zeta values
    • method: a string of the method name
  • Returns a 3D array of networks of the size of NNz where N is number of genes and z is length of zeta vector

analyse.CompareModels(Net, Aest)
  • Compares networks to gold standard and provides performance metrics
    • Net: network object (gold standard)
    • Aest: a list of estimated networks
  • Returns information including various performance metrics

GeneSPIDER tutorials

Below you can find tutorials for synthetic, as well as real gene expression data simulation and GRN inference

Synthetic bulk data

clear
addpath(genpath('path/to/GeneSPIDER'));

% define the size of your data and network, i.e. number of genes
N = 50;
% define desired sparsity degree, e.g. 3 edges per node on average
S = 3;

% create scale-free network that gives probability to edges
A = datastruct.scalefree2(N, S); % A is the network matrix

% stabilize networks
A = datastruct.stabilize(A,'iaa','low');
% as random weights cannot guarantee that the network is stable
% hence we stabilize, i.e. tune IAA degree

% create Network object
Net = datastruct.Network(A, 'myNetwork');

% define perturbation matrix for the experiment
% for two replicates
P = -[eye(N) eye(N)];
% create data
X = Net.G*P; % where G - static gain model, A - network

% add noise to data
% define signal-to-noise ratio
SNR = 0.1; % usually 1 is low, 0.1 is medium, 0.01 is high noise
s = svd(X);
stdE = s(N)/(SNR*sqrt(chi2inv(1-analyse.Data.alpha,numel(P))));
% estimate noise matrix
E = stdE*randn(size(P));
% input noise matrix
F = zeros(size(P));

% prepare data structure obejct
% assign scale-free network
D(1).network = Net.network;
% define zero noise
D(1).E = E;
D(1).F = F;
D(1).Y = X+E; % noise-free gene expression + noise
D(1).P = P;
D(1).lambda = [stdE^2,0];
D(1).cvY = D.lambda(1)*eye(N);
D(1).cvP = zeros(N);
D(1).sdY = stdE*ones(size(D.P));
D(1).sdP = zeros(size(D.P));

% create data object with data "D" and scale-free network "Net"
Data = datastruct.Dataset(D, Net);

% now we can run inference and benchmark it against our gold standard
zeta = logspace(-6,0,30); % return 30 networks of full sparsity spectrum
infMethod = 'lsco';
[Aest0, z0] = Methods.(infMethod)(Data,zeta);
Aest01 = Aest0(:,:,25); % select network 25
% compare models
M = analyse.CompareModels(Net, Aest0);

% display AUROC
M.AUROC
% display max F1 score
max(M.F1)

Real bulk data

clear
addpath(genpath('path/to/GeneSPIDER'));

% load Y and P matrices
load('data/Y_bulk_k562.mat')
load('data/P_bulk_k562.mat')
Y = Y_bulk_k562;
P = P_bulk_k562;

% fetch size of the data, i.e. number of genes
N = size(Y,1);

% start with setting up an empty network
A = zeros(N);
% create a Network object
Net = datastruct.Network(A, 'myNetwork');

% define stdE
stdE = std(Y(:));
% define zero noise matrix
E = [zeros(N) zeros(N)];
% define input noise matrix as zeros
F = zeros(size(P));

% prepare data structure obejct
% assign an empty network
D(1).network = Net.network;
% define zero noise
D(1).E = E;
D(1).F = F;
D(1).Y = Y; % here is where your data is assigned
D(1).P = P;
D(1).lambda = [stdE^2,0];
D(1).cvY = D.lambda(1)*eye(N);
D(1).cvP = zeros(N);
D(1).sdY = stdE*ones(size(D.P));
D(1).sdP = zeros(size(D.P));

% create a data object with your data and scale-free network
Data = datastruct.Dataset(D, Net);

% infer a set of networks
zeta = logspace(-6,0,30); % return 30 networks of full sparsity spectrum
infMethod = 'lsco';
[Aest, z] = Methods.(infMethod)(Data,zeta);

Synthetic single-cell data

clear
addpath(genpath('path/to/GeneSPIDER'));

%% CREATE A LARGE SCALE-FREE GRN
m = 500; % number of genes
Savg = 3; % average sparsity, i.e. the average node degree in GRN
A = datastruct.large_scalefree(m, Savg);

%% CREATE SINGLE-CELL DATA
cn = 20; % assuming 20 cells were perturbed for each gene
P = -repmat(eye(m),1,cn); % example perturbation design matrix
% gives in total 1000 cells
SNRv = 0.1; % noise level

% raw counts data are set to default:
% [Y, X, Ed, Eg, SCC] = datastruct.scdata(A, P, "SNR", SNRv);

% to obtain fold-change data, set raw_counts to false
[Y, X, Ed, Eg, SCC] = datastruct.scdata(A, P, "SNR", SNRv, 'raw_counts', false);

The datastruct.scdata() main parameters:

  • A - gene regulatory network
  • P - perturbation matrix

The datastruct.scdata() other parameters:

  • SNR - numeric, default 0.05, signal-to-noise ratio
  • SNR_model - text, default “SNR_L”, SNR model choose between “SNR_L”, “SNR_vov”, “SNR_movd”, “SNR_movd2”, “SNR_cov” and “SNR_manual”
  • raw_counts - logical, default true, raw counts are outputted, if false fold-change
  • left_tail - numeric, default 1, power of left tail in Pk distribution (>1 makes left tail longer)
  • right_tail - numeric, default 2, power of right tail in Pk distribution (>1 makes right tail longer)
  • negbin_prob - numeric, default 0.5, probability(P) for negative binomial used to simulate pseudo control, for R = 1
  • disper - numeric, default 1, dispersion parameter in theoretical dropout distribution (Svensson et al.)
  • n_clusts - numeric, default 5, theoretical number of clusters
  • logbase - numeric, default 10, logaithm base to use in reversing FC, also controls perturbation strength
  • ds_min - numeric, default 0.2, minimum dissimilarity between clusters
  • ds_max - numeric, default 0.6, maximum dissimilarity between clusters

The datastruct.scdata() outputs:

  • Y - output data, either counts or fold-change
  • X - original fold-change
  • Ed - dropout noise (0 or 1)
  • Eg - Gaussian noise
  • SCC - single-cell counts of control
%% GRN INFERENCE OF SINGLE-CELL DATA
D(1).network = A;
D(1).E = Ed.*Eg; % total noise, dropouts and gaussian
D(1).F = zeros(size(P)); 
D(1).Y = Y; % response matrix + noise
D(1).P = P; % perturbation design matrix
stdE = sqrt(var(X(:))/SNRv);
D(1).lambda = [stdE^2,0]; 
D(1).cvY = D.lambda(1)*eye(m); 
D(1).cvP = zeros(m); 
D(1).sdY = stdE*ones(size(D.P)); 
D(1).sdP = zeros(size(D.P)); 

zeta = logspace(-6,0,30);
Net = datastruct.Network(A, 'scnet');
data = datastruct.Dataset(D, Net);
[Aest, zv] = Methods.('LSCON')(data, zeta);
M = analyse.CompareModels(Net.A, Aest);
disp(M.F1) % display all F1-scores

Real single-cell data

clear
addpath(genpath('path/to/GeneSPIDER'));

% load Y and P matrices
load('data/Y_sc_k562.mat')
load('data/P_sc_k562.mat')
Y = Y_sc_k562;
P = P_sc_k562;

% fetch size of the data, i.e. number of genes
N = size(Y,1);

% start with setting up an empty network
A = zeros(N);
% create a Network object
Net = datastruct.Network(A, 'myNetwork');

% define stdE
stdE = std(Y(:));
% define zero noise matrix
E = [zeros(size(Y))];
% define input noise matrix as zeros
F = zeros(size(P));

% prepare data structure obejct
% assign an empty network
D(1).network = Net.network;
% define zero noise
D(1).E = E;
D(1).F = F;
D(1).Y = Y; % here is where your data is assigned
D(1).P = P;
D(1).lambda = [stdE^2,0];
D(1).cvY = D.lambda(1)*eye(N);
D(1).cvP = zeros(N);
D(1).sdY = stdE*ones(size(D.P));
D(1).sdP = zeros(size(D.P));

% create a data object with your data and scale-free network
Data = datastruct.Dataset(D,Net);

% infer a set of networks
zeta = logspace(-6,0,30); % return 30 networks of full sparsity spectrum
infMethod = 'lsco';
[Aest, z] = Methods.(infMethod)(Data,zeta);

GRN benchmark

What is GRN benchmark?

A web server for benchmarking directed gene regulatory network inference methods. Available at https://grnbenchmark.org/

From GeneSPIDER to GRN benchmark

Below you can find a script that loads data files from GRN benchmark and saves GRN in the required input

clear
addpath(genpath('path/to/GeneSPIDER'));

tool = "GeneNetWeaver"; % select between GeneSPIDER and GeneNetWeaever
nlev = "LowNoise"; % select between LowNoise, HighNoise, MediumNoise

pathin = 'path/to/GRNBenchmark/Data';
pathout = 'path/to/your/folder';

reps = 5; % repetitions, in GRN benchmark it is 5

for j = 1:reps

    Y = readtable(pathin+tool+"_"+nlev+"_Network"+j+"_GeneExpression.csv", "ReadRowNames", true);
    gnsnms = string(cell2mat(Y.Row));
    P = readtable(pathin+tool+"_"+nlev+"_Network"+j+"_Perturbations.csv", "ReadRowNames", true);
    Y = table2array(Y);
    P = table2array(P);

    N = size(Y,1);
    A = zeros(N);
    Net = datastruct.Network(A, 'myNetwork');
    D(1).network = [];
    % define zero noise
    D(1).E = [zeros(N) zeros(N) zeros(N)];
    D(1).F = zeros(N);
    D(1).Y = Y; % here is where your data is assigned
    D(1).P = P;
    D(1).lambda = [std(Y(:))^2,0];
    D(1).cvY = D.lambda(1)*eye(N);
    D(1).cvP = zeros(N);
    D(1).sdY = std(Y(:))*ones(size(D.P));
    D(1).sdP = zeros(size(D.P));

    % create data object with data "D" and scale-free network "Net"
    Data = datastruct.Dataset(D, Net);

    % now we can run inference
    zeta = 0; % return full network as GRN benchmark do cutoff internally
    infMethod = 'LSCON';
    [Aest, z] = Methods.(infMethod)(Data,zeta);

    inet = Aest; % network to save
    wedges = compose("%9.5f",round(inet(:),5)); % keep weights

    inet(inet<0) = -1; % convert to signed edges without weights
    inet(inet>0) = 1;

    edges = inet(:); % from left to right, columns are merged to one vec
    s = size(inet,1);
    nams_edges = [repmat(1:s, 1, s); repelem(1:s, s)]';
    edges_from = gnsnms(nams_edges(:,2));
    edges_to = gnsnms(nams_edges(:,1));
    nrid = string((1:length(edges_from))');
    edge_list = table(nrid, edges_from, edges_to, wedges, string(edges));

    allVars = 1:width(edge_list);
    newNames = ["ID","Regulator","Target","Weight","Sign"]; % define names in the file
    edge_list = renamevars(edge_list,allVars,newNames);
    Var1 = "";
    Var2 = "Regulator";
    Var3 = "Target";
    Var4 = "Weight";
    Var5 = "Sign";
    newNamesTab = table(Var1,Var2,Var3,Var4,Var5);
    newNamesTab = renamevars(newNamesTab,allVars,newNames);

    edge_list(edges==0,:) = [];
    edge_list2 = [newNamesTab; edge_list];
    writetable(edge_list2,pathout+tool+"_"+nlev+"_Network"+j+"_grn.csv",'QuoteStrings',true,"WriteVariableNames",false) % save as csv

end

NestBoot

What is NestBoot?

The method is based on nested bootstrapping, which applies a bootstrapping protocol to GRN inference, and by repeated bootstrap runs assesses the stability of the estimated support values.
FDR estimation via NestBoot algorithm for a given sparsity
FDR estimation via NestBoot algorithm for a given sparsity

Updates to previous version

  • NestBoot is now a part of GeneSPIDER
  • code was optimized and parallelized
  • new GRN methods were added
  • outdated functions were replaced

Methods available for NestBoot

Table.10 NestBoot inference methods
lsco lscon tlsco genie3 lasso elnet ridgeco clr rni zscore
Description Least Squares Cut Off Least Squares Cut-Off with Normalization Total Least Squares Cut Off Genie3 Lasso Elastic net Ridge regression with cut off cLR RNI Zscore-based inference

Running NestBoot via GeneSPIDER

clear
addpath(genpath('path/to/GeneSPIDER'));

% define the size of your data and network, i.e. number of genes
N = 50;
% define desired sparsity degree, e.g. 3 edges per node on average
S = 3;

% create scale-free network that gives probability to edges
A = datastruct.scalefree2(N, S); % A is the network matrix

% stabilize networks
A = datastruct.stabilize(A,'iaa','low');
% as random weights cannot guarantee that the network is stable
% hence we stabilize, i.e. tune IAA degree

% creating Network object with 
net = datastruct.Network(A, 'yourNetwork');
SNR = 0.1; % set a signal-to-noise ratio
P = -[eye(N) eye(N)]; %sample perturbation matrix
X = net.G*P; % corresponding response Y, G is the static gain matrix (inverse of A (network matrix))
s = svd(X);
stdE = s(N)/(SNR*sqrt(chi2inv(1-analyse.Data.alpha,numel(P))));
E = stdE*randn(size(P)); % noise matrix
F = zeros(size(P)); % input noise matrix

D(1).network = net.network;
D(1).E = E; % noise
D(1).F = F;
D(1).Y = X + E; %response matrix + noise
D(1).P = P; % perturbation matrix
D(1).lambda = [stdE^2,0];
D(1).cvY = D.lambda(1)*eye(N);
D(1).cvP = zeros(N);
D(1).sdY = stdE*ones(size(D.P));
D(1).sdP = zeros(size(D.P));
  
data = datastruct.Dataset(D,net);
zetavec = logspace(-6,0,30);
method="lsco"; % see Table.1
nest = 10; % outer iterations
boot = 10; % inner iterations
fdr = 0.05; % fdr thershold
paral = true; % turn on/off parallelization
cornr = 4; % number of cores to use
direc = '~/';
nbout = Methods.NestBoot(data, method, nest, boot, zetavec, fdr, direc, paral, cornr);

NestBoot output

Table.2 NestBoot output values
Description
binary_networks bootstrapped intersection at a given support cutoff
signed_networks signed network
summed_networks sum of signum function support to be sign
minab_networks minimum absolute value of sign support
bin_cutoff crossing of shufled and plain data
accumulated_frequency accumulated structure support
binned_frequency frequency of bins
FP_rate_cross frequency rate for crossing
support_at_FP_cross support for crossing (bin_cutoff)
area_shuffled area under bins from trapeozid for shuffled networks
area_measured area under bins from trapeozid for measured networks

Citing NestBoot

Morgan D, Tjärnberg A, Nordling TEM, Sonnhammer ELL. A generalized framework for controlling FDR in gene regulatory network inference. Bioinformatics. 2019 Mar 15;35(6):1026-1032. doi: 10.1093/bioinformatics/bty764. PMID: 30169550

Processing GeneSPIDER output in R

Below you can find a script that allows to load GeneSPIDER network and display it with igraph package ## Loading GeneSPIDER networks

library(R.matlab)
# load file with networks from Matlab
# in this example we use one of the random scale-free network inferred in GeneSPIDER
nets <- readMat(path_to_the_file.mat)
nets <- nets[[1]]

Visualizing random network in igraph

library(igraph)
library(tidyverse)
# select 7th network from all estimated networks
selected_net <- nets[,,7]
# convert to graph format
# directed and weighted network
net0 <- graph_from_adjacency_matrix(
  selected_net %>% t,
  mode = c("directed"),
  weighted = T)

# define edges colors based on weights
clr <- E(graph_sn)$weight
npos <- clr>=0
nneg <- clr<0
clr[npos] <- "#FF3807" # positive regulation
clr[nneg] <- "#0796FF" # negative regulation
E(graph_sn)$color <- clr
  
plot(graph_sn, edge.arrow.size=.2, vertex.color="#D9D9D9", vertex.size=12, 
      vertex.frame.color="gray", vertex.label.color="black", 
      vertex.label.cex=0.6, edge.curved=0.2)