GeneSPIDER
GeneSPIDER(v2) (Generation and Simulation Package for Informative Data ExploRation) is a Matlab toolbox for gene regulatory network inference and benchmarking with controlled network and data properties. The most recent version of GeneSPIDER (v2), includes simulation of perturbed single-cell data.
Download and install GeneSPIDER from: https://bitbucket.org/sonnhammergrni/genespider GeneSPIDER is supported by external tools included in the package:
Inference of gene regulatory networks (GRNs) is a central goal in systems biology. It is therefore important to evaluate the accuracy of GRN inference methods in the light of network and data properties. Although several packages are available for modelling, simulate, and analyse GRN inference, they offer limited control of network topology together with system dynamics, experimental design, data properties, and noise characteristics. Independent control of these properties in simulations is key to drawing conclusions about which inference method to use in a given condition and what performance to expect from it, as well as to obtain properties representative of real biological systems.
Garbulowski M, Hillerton T, Morgan D, Secilmis D, Sonnhammer L, Tjärnberg A, Nordling TEM, Sonnhammer ELL. GeneSPIDER2: large scale GRN simulation and benchmarking with perturbed single-cell data NAR Genomics and Bioinformatics, 6:lqae121 (2024) https://doi.org/10.1093/nargab/lqae121
Tjärnberg A, Morgan DC, Studham M, Nordling TEM, Sonnhammer ELL. GeneSPIDER - gene regulatory network inference benchmarking with controlled network and data properties. Mol. Biosystems, 13:1304-1312 (2017) https://doi.org/10.1039/c7mb00058h
In this section, you will find examples of data structures and formats used by GeneSPIDER
G1_1 | G2_1 | G3_1 | …_1 | Gn_1 | G1_2 | G2_2 | G3_2 | …_2 | Gn_2 | |
---|---|---|---|---|---|---|---|---|---|---|
G1 | -0.91 | -0.94 | 0.38 | … | 0.65 | -0.98 | 0.72 | -0.74 | … | 0.13 |
G2 | -0.08 | -0.96 | -0.12 | … | -0.91 | -0.02 | -0.95 | 1 | … | -0.73 |
G3 | -0.27 | 0.06 | -0.99 | … | 0.13 | 0.73 | -0.62 | -0.99 | … | 0.88 |
… | … | … | … | … | … | … | … | … | … | … |
Gn | 0.58 | 0.85 | -0.17 | … | -0.94 | -0.74 | 0.93 | 0.68 | … | -0.93 |
G1_1 | G2_1 | G3_1 | …_1 | Gn_1 | G1_2 | G2_2 | G3_2 | …_2 | Gn_2 | |
---|---|---|---|---|---|---|---|---|---|---|
G1 | -1 | 0 | 0 | … | 0 | -1 | 0 | 0 | … | 0 |
G2 | 0 | -1 | 0 | … | 0 | 0 | -1 | 0 | … | 0 |
G3 | 0 | 0 | -1 | … | 0 | 0 | 0 | -1 | … | 0 |
… | … | … | … | … | … | … | … | … | … | … |
Gn | 0 | 0 | 0 | … | -1 | 0 | 0 | 0 | … | -1 |
Algorithm/tool | P matrix | |
---|---|---|
adaboost | Adaptive boosting | <U+2714> |
bag | Bootstrap Aggregation (Bagging) and Random Forest | <U+2714> |
CART | Classification And Regression Tree | <U+2714> |
elnet | Elastic Net | <U+2714> |
fcls | Constrained least squares with Newton | <U+2714> |
gentleboost | Gentle Adaptive Boosting | <U+2714> |
GLasso | Graphical lasso | <U+2714> |
gplin | Gaussian process regression for linear function | <U+2714> |
julius | LASSO based convex programming | <U+2714> |
lars | Least Angle Regression | <U+2714> |
lasso | Lasso | <U+2714> |
lassoglmp | Lasso for generalized linear model with Poisson distribution | <U+2714> |
lassolog | Logistic learner with lasso regularization | <U+2714> |
logitboost | Adaptive Logistic Regression | <U+2714> |
lsco | Least Squares Cut Off | <U+2714> |
LSCON | Least Squares Cut Off with Normalization | <U+2714> |
neunet | Neural network model-based testing | <U+2714> |
neunetcv | Neural network training and testing in cross-validation process | <U+2714> |
ridgeco | Ridge regression with Cut Off | <U+2714> |
RNI | Robust network inference | <U+2714> |
svmc | Support Vector Machine classification | <U+2714> |
tlsco | Total Least Squares Cut Off | <U+2714> |
Zscore | Z-score based inference | <U+2714> |
aracne | Mutual information with ARACNE | <U+2718> |
BC3NET | Bayesian bootstrapping based on mutual information | <U+2718> |
CLR | Context Likelihood of Relatedness | <U+2718> |
GENIE3 | Regression trees with Genie3 | <U+2718> |
PLSNET | Partial least squares | <U+2718> |
svmr | Support Vector Machine regression | <U+2718> |
TIGRESS | Trustful Inference of Gene REgulation using Stability Selection | <U+2718> |
Below is an example visualization of a directed, signed and weighted GRN.
NOTE: In GeneSPIDER, link(edge) direction is defined as going from COLUMN to ROW !! (unlike GeneSNAKE)
G1 | G2 | G3 | G4 | G5 | |
---|---|---|---|---|---|
G1 | 0.0 | 0 | 0.00 | 0.00 | 0.9 |
G2 | -0.5 | 0 | 0.25 | 0.00 | -0.7 |
G3 | -0.9 | 0 | 0.00 | 0.00 | 0.0 |
G4 | 0.0 | 0 | -0.80 | 0.00 | 0.0 |
G5 | 0.0 | 0 | 0.56 | 0.77 | 0.0 |
, S) datastruct.large_scalefree(N
, S) datastruct.scalefree(N
, S) datastruct.scalefree2(N
, iaa) datastruct.stabilize(A
, stdE] = datastruct.noise(X,P,...); [E
, name) datastruct.Network(A
, Net) datastruct.Dataset(D
, X, Ed, Eg, SCC] = datastruct.scdata(A, P,...); [Y
, zeta) Methods.(method)(Data
, Aest) analyse.CompareModels(Net
Below you can find tutorials for synthetic, as well as real gene expression data simulation and GRN inference
clear
addpath(genpath('path/to/GeneSPIDER'));
% define the size of your data and network, i.e. number of genes
= 50;
N % define desired sparsity degree, e.g. 3 edges per node on average
= 3;
S
% create scale-free network that gives probability to edges
= datastruct.scalefree2(N, S); % A is the network matrix
A
% stabilize networks
= datastruct.stabilize(A,'iaa','low');
A % as random weights cannot guarantee that the network is stable
% hence we stabilize, i.e. tune IAA degree
% create Network object
= datastruct.Network(A, 'myNetwork');
Net
% define perturbation matrix for the experiment
% for two replicates
= -[eye(N) eye(N)];
P % create data
= Net.G*P; % where G - static gain model, A - network
X
% add noise to data
% define signal-to-noise ratio
= 0.1; % usually 1 is low, 0.1 is medium, 0.01 is high noise
SNR = svd(X);
s = s(N)/(SNR*sqrt(chi2inv(1-analyse.Data.alpha,numel(P))));
stdE % estimate noise matrix
= stdE*randn(size(P));
E % input noise matrix
= zeros(size(P));
F
% prepare data structure obejct
% assign scale-free network
1).network = Net.network;
D(% define zero noise
1).E = E;
D(1).F = F;
D(1).Y = X+E; % noise-free gene expression + noise
D(1).P = P;
D(1).lambda = [stdE^2,0];
D(1).cvY = D.lambda(1)*eye(N);
D(1).cvP = zeros(N);
D(1).sdY = stdE*ones(size(D.P));
D(1).sdP = zeros(size(D.P));
D(
% create data object with data "D" and scale-free network "Net"
= datastruct.Dataset(D, Net);
Data
% now we can run inference and benchmark it against our gold standard
zeta = logspace(-6,0,30); % return 30 networks of full sparsity spectrum
= 'LSCON';
infMethod , z0] = Methods.(infMethod)(Data,zeta);
[Aest0= Aest0(:,:,25); % select network 25
Aest01 % compare models
= analyse.CompareModels(Net, Aest0);
M
% display AUROC
M.AUROC% display max F1 score
max(M.F1)
clear
addpath(genpath('path/to/GeneSPIDER'));
% load Y and P matrices
load('data/Y_bulk_k562.mat')
load('data/P_bulk_k562.mat')
= Y_bulk_k562;
Y = P_bulk_k562;
P
% fetch size of the data, i.e. number of genes
= size(Y,1);
N
% start with setting up an empty network
= zeros(N);
A % create a Network object
= datastruct.Network(A, 'myNetwork');
Net
% define stdE
= std(Y(:));
stdE % define zero noise matrix
= [zeros(N) zeros(N)];
E % define input noise matrix as zeros
= zeros(size(P));
F
% prepare data structure obejct
% assign an empty network
1).network = Net.network;
D(% define zero noise
1).E = E;
D(1).F = F;
D(1).Y = Y; % here is where your data is assigned
D(1).P = P;
D(1).lambda = [stdE^2,0];
D(1).cvY = D.lambda(1)*eye(N);
D(1).cvP = zeros(N);
D(1).sdY = stdE*ones(size(D.P));
D(1).sdP = zeros(size(D.P));
D(
% create a data object with your data and scale-free network
= datastruct.Dataset(D, Net);
Data
% infer a set of networks
zeta = logspace(-6,0,30); % return 30 networks of full sparsity spectrum
= 'LSCON';
infMethod , z] = Methods.(infMethod)(Data,zeta); [Aest
clear
addpath(genpath('path/to/GeneSPIDER'));
%% CREATE A LARGE SCALE-FREE GRN
= 500; % number of genes
m = 3; % average sparsity, i.e. the average node degree in GRN
Savg = datastruct.large_scalefree(m, Savg);
A
%% CREATE SINGLE-CELL DATA
= 20; % assuming 20 cells were perturbed for each gene
cn = -repmat(eye(m),1,cn); % example perturbation design matrix
P % gives in total 1000 cells
= 0.1; % noise level
SNRv
% raw counts data are set to default:
% [Y, X, Ed, Eg, SCC] = datastruct.scdata(A, P, "SNR", SNRv);
% to obtain fold-change data, set raw_counts to false
, X, Ed, Eg, SCC] = datastruct.scdata(A, P, "SNR", SNRv, 'raw_counts', false); [Y
The datastruct.scdata() main parameters:
The datastruct.scdata() other parameters:
The datastruct.scdata() outputs:
%% GRN INFERENCE OF SINGLE-CELL DATA
1).network = A;
D(1).E = Ed.*Eg; % total noise, dropouts and gaussian
D(1).F = zeros(size(P));
D(1).Y = Y; % response matrix + noise
D(1).P = P; % perturbation design matrix
D(= sqrt(var(X(:))/SNRv);
stdE 1).lambda = [stdE^2,0];
D(1).cvY = D.lambda(1)*eye(m);
D(1).cvP = zeros(m);
D(1).sdY = stdE*ones(size(D.P));
D(1).sdP = zeros(size(D.P));
D(
zeta = logspace(-6,0,30);
= datastruct.Network(A, 'scnet');
Net = datastruct.Dataset(D, Net);
data , zv] = Methods.('LSCON')(data, zeta);
[Aest= analyse.CompareModels(Net.A, Aest);
M disp(M.F1) % display all F1-scores
clear
addpath(genpath('path/to/GeneSPIDER'));
% load Y and P matrices
load('data/Y_sc_k562.mat')
load('data/P_sc_k562.mat')
= Y_sc_k562;
Y = P_sc_k562;
P
% fetch size of the data, i.e. number of genes
= size(Y,1);
N
% start with setting up an empty network
= zeros(N);
A % create a Network object
= datastruct.Network(A, 'myNetwork');
Net
% define stdE
= std(Y(:));
stdE % define zero noise matrix
= [zeros(size(Y))];
E % define input noise matrix as zeros
= zeros(size(P));
F
% prepare data structure obejct
% assign an empty network
1).network = Net.network;
D(% define zero noise
1).E = E;
D(1).F = F;
D(1).Y = Y; % here is where your data is assigned
D(1).P = P;
D(1).lambda = [stdE^2,0];
D(1).cvY = D.lambda(1)*eye(N);
D(1).cvP = zeros(N);
D(1).sdY = stdE*ones(size(D.P));
D(1).sdP = zeros(size(D.P));
D(
% create a data object with your data and scale-free network
= datastruct.Dataset(D,Net);
Data
% infer a set of networks
zeta = logspace(-6,0,30); % return 30 networks of full sparsity spectrum
= 'LSCON';
infMethod , z] = Methods.(infMethod)(Data,zeta); [Aest
A web server for benchmarking directed gene regulatory network inference methods. Available at https://grnbenchmark.org/
Below you can find a script that loads data files from GRN benchmark and saves GRN in the required input
clear
addpath(genpath('path/to/GeneSPIDER'));
= "GeneNetWeaver"; % select between GeneSPIDER and GeneNetWeaever
tool = "LowNoise"; % select between LowNoise, HighNoise, MediumNoise
nlev
= 'path/to/GRNBenchmark/Data';
pathin = 'path/to/your/folder';
pathout
= 5; % repetitions, in GRN benchmark it is 5
reps
for j = 1:reps
= readtable(pathin+tool+"_"+nlev+"_Network"+j+"_GeneExpression.csv", "ReadRowNames", true);
Y = string(cell2mat(Y.Row));
gnsnms = readtable(pathin+tool+"_"+nlev+"_Network"+j+"_Perturbations.csv", "ReadRowNames", true);
P = table2array(Y);
Y = table2array(P);
P
= size(Y,1);
N = zeros(N);
A = datastruct.Network(A, 'myNetwork');
Net 1).network = [];
D(% define zero noise
1).E = [zeros(N) zeros(N) zeros(N)];
D(1).F = zeros(N);
D(1).Y = Y; % here is where your data is assigned
D(1).P = P;
D(1).lambda = [std(Y(:))^2,0];
D(1).cvY = D.lambda(1)*eye(N);
D(1).cvP = zeros(N);
D(1).sdY = std(Y(:))*ones(size(D.P));
D(1).sdP = zeros(size(D.P));
D(
% create data object with data "D" and scale-free network "Net"
= datastruct.Dataset(D, Net);
Data
% now we can run inference
zeta = 0; % return full network as GRN benchmark do cutoff internally
= 'LSCON';
infMethod , z] = Methods.(infMethod)(Data,zeta);
[Aest
= Aest; % network to save
inet = compose("%9.5f",round(inet(:),5)); % keep weights
wedges
<0) = -1; % convert to signed edges without weights
inet(inet>0) = 1;
inet(inet
= inet(:); % from left to right, columns are merged to one vec
edges = size(inet,1);
s = [repmat(1:s, 1, s); repelem(1:s, s)]';
nams_edges = gnsnms(nams_edges(:,2));
edges_from = gnsnms(nams_edges(:,1));
edges_to = string((1:length(edges_from))');
nrid = table(nrid, edges_from, edges_to, wedges, string(edges));
edge_list
= 1:width(edge_list);
allVars = ["ID","Regulator","Target","Weight","Sign"]; % define names in the file
newNames = renamevars(edge_list,allVars,newNames);
edge_list = "";
Var1 = "Regulator";
Var2 = "Target";
Var3 = "Weight";
Var4 = "Sign";
Var5 = table(Var1,Var2,Var3,Var4,Var5);
newNamesTab = renamevars(newNamesTab,allVars,newNames);
newNamesTab
==0,:) = [];
edge_list(edges= [newNamesTab; edge_list];
edge_list2 ,pathout+tool+"_"+nlev+"_Network"+j+"_grn.csv",'QuoteStrings',true,"WriteVariableNames",false) % save as csv
writetable(edge_list2
end
lsco | lscon | tlsco | genie3 | lasso | elnet | ridgeco | clr | rni | zscore | |
---|---|---|---|---|---|---|---|---|---|---|
Description | Least Squares Cut Off | Least Squares Cut-Off with Normalization | Total Least Squares Cut Off | Genie3 | Lasso | Elastic net | Ridge regression with cut off | cLR | RNI | Zscore-based inference |
clear
addpath(genpath('path/to/GeneSPIDER'));
% define the size of your data and network, i.e. number of genes
= 50;
N % define desired sparsity degree, e.g. 3 edges per node on average
= 3;
S
% create scale-free network that gives probability to edges
= datastruct.scalefree2(N, S); % A is the network matrix
A
% stabilize networks
= datastruct.stabilize(A,'iaa','low');
A % as random weights cannot guarantee that the network is stable
% hence we stabilize, i.e. tune IAA degree
% creating Network object with
= datastruct.Network(A, 'yourNetwork');
net = 0.1; % set a signal-to-noise ratio
SNR = -[eye(N) eye(N)]; %sample perturbation matrix
P = net.G*P; % corresponding response Y, G is the static gain matrix (inverse of A (network matrix))
X = svd(X);
s = s(N)/(SNR*sqrt(chi2inv(1-analyse.Data.alpha,numel(P))));
stdE = stdE*randn(size(P)); % noise matrix
E = zeros(size(P)); % input noise matrix
F
1).network = net.network;
D(1).E = E; % noise
D(1).F = F;
D(1).Y = X + E; %response matrix + noise
D(1).P = P; % perturbation matrix
D(1).lambda = [stdE^2,0];
D(1).cvY = D.lambda(1)*eye(N);
D(1).cvP = zeros(N);
D(1).sdY = stdE*ones(size(D.P));
D(1).sdP = zeros(size(D.P));
D(
= datastruct.Dataset(D,net);
data = logspace(-6,0,30);
zetavec ="lscon";
method= 10; % outer iterations
nest = 10; % inner iterations
boot = 0.05; % fdr thershold
fdr = true; % turn on/off parallelization
paral = 4; % number of cores to use
cornr = '~/';
direc = Methods.NestBoot(data, method, nest, boot, zetavec, fdr, direc, paral, cornr); nbout
Description | |
---|---|
binary_networks | bootstrapped intersection at a given support cutoff |
signed_networks | signed network |
summed_networks | sum of signum function support to be sign |
minab_networks | minimum absolute value of sign support |
bin_cutoff | crossing of shufled and plain data |
accumulated_frequency | accumulated structure support |
binned_frequency | frequency of bins |
FP_rate_cross | frequency rate for crossing |
support_at_FP_cross | support for crossing (bin_cutoff) |
area_shuffled | area under bins from trapeozid for shuffled networks |
area_measured | area under bins from trapeozid for measured networks |
Morgan D, Tjärnberg A, Nordling TEM, Sonnhammer ELL. A generalized framework for controlling FDR in gene regulatory network inference. Bioinformatics. 2019 Mar 15;35(6):1026-1032. doi: 10.1093/bioinformatics/bty764. PMID: 30169550
Below you can find a script that allows to load GeneSPIDER network and display it with igraph package ## Loading GeneSPIDER networks
library(R.matlab)
# load file with networks from Matlab
# in this example we use one of the random scale-free network inferred in GeneSPIDER
readMat(path_to_the_file.mat)
nets <- nets[[1]] nets <-
library(igraph)
library(tidyverse)
# select 7th network from all estimated networks
nets[,,7]
selected_net <-# convert to graph format
# directed and weighted network
graph_from_adjacency_matrix(
net0 <-%>% t,
selected_net mode = c("directed"),
weighted = T)
# define edges colors based on weights
E(graph_sn)$weight
clr <- clr>=0
npos <- clr<0
nneg <- "#FF3807" # positive regulation
clr[npos] <- "#0796FF" # negative regulation
clr[nneg] <-E(graph_sn)$color <- clr
plot(graph_sn, edge.arrow.size=.2, vertex.color="#D9D9D9", vertex.size=12,
vertex.frame.color="gray", vertex.label.color="black",
vertex.label.cex=0.6, edge.curved=0.2)