GeneSPIDER
GeneSPIDER(v2) (Generation and Simulation Package for Informative Data ExploRation) is a Matlab toolbox for gene regulatory network inference and benchmarking with controlled network and data properties. Recent version of GeneSPIDER (v2), allows to simulate perturbed single-cell data.
Download and install GeneSPIDER from: https://bitbucket.org/sonnhammergrni/genespider GeneSPIDER is supported by external tools included in the package:
Inference of gene regulatory networks (GRNs) is a central goal in systems biology. It is therefore important to evaluate the accuracy of GRN inference methods in the light of network and data properties. Although several packages are available for modelling, simulate, and analyse GRN inference, they offer limited control of network topology together with system dynamics, experimental design, data properties, and noise characteristics. Independent control of these properties in simulations is key to drawing conclusions about which inference method to use in a given condition and what performance to expect from it, as well as to obtain properties representative of real biological systems.
Tjärnberg A, Morgan DC, Studham M, Nordling TEM, Sonnhammer ELL. GeneSPIDER - gene regulatory network inference benchmarking with controlled network and data properties. Mol Biosyst. 2017 Jun 27;13(7):1304-1312. doi: 10.1039/c7mb00058h.PMID:28485748
In this section, you will find examples of data structures and formats used by GeneSPIDER
G1_1 | G2_1 | G3_1 | …_1 | Gn_1 | G1_2 | G2_2 | G3_2 | …_2 | Gn_2 | |
---|---|---|---|---|---|---|---|---|---|---|
G1 | -0.91 | -0.19 | 0.34 | … | 0.85 | -0.98 | 0.62 | 0.25 | … | -0.48 |
G2 | 0.12 | -0.96 | -0.77 | … | -0.87 | -0.95 | -0.95 | 0.97 | … | 0.03 |
G3 | 0.94 | -0.15 | -0.99 | … | 0.55 | 0.7 | -0.97 | -0.99 | … | 0.06 |
… | … | … | … | … | … | … | … | … | … | … |
Gn | -0.09 | -0.19 | 0.81 | … | -0.94 | 0.79 | -0.83 | -0.54 | … | -0.93 |
G1_1 | G2_1 | G3_1 | …_1 | Gn_1 | G1_2 | G2_2 | G3_2 | …_2 | Gn_2 | |
---|---|---|---|---|---|---|---|---|---|---|
G1 | -1 | 0 | 0 | … | 0 | -1 | 0 | 0 | … | 0 |
G2 | 0 | -1 | 0 | … | 0 | 0 | -1 | 0 | … | 0 |
G3 | 0 | 0 | -1 | … | 0 | 0 | 0 | -1 | … | 0 |
… | … | … | … | … | … | … | … | … | … | … |
Gn | 0 | 0 | 0 | … | -1 | 0 | 0 | 0 | … | -1 |
Algorithm/tool | P matrix | |
---|---|---|
adaboost | Adaptive boosting | ✔ |
bag | Bootstrap Aggregation (Bagging) and Random Forest | ✔ |
CART | Classification And Regression Tree | ✔ |
elnet | Elastic Net | ✔ |
fcls | Constrained least squares with Newton | ✔ |
gentleboost | Gentle Adaptive Boosting | ✔ |
GLasso | Graphical lasso | ✔ |
gplin | Gaussian process regression for linear function | ✔ |
julius | LASSO based convex programming | ✔ |
lars | Least Angle Regression | ✔ |
lasso | Lasso | ✔ |
lassoglmp | Lasso for generalized linear model with Poisson distribution | ✔ |
lassolog | Logistic learner with lasso regularization | ✔ |
logitboost | Adaptive Logistic Regression | ✔ |
lsco | Least Squares Cut Off | ✔ |
LSCON | Least Squares Cut Off with Normalization | ✔ |
neunet | Neural network model-based testing | ✔ |
neunetcv | Neural network training and testing in cross-validation process | ✔ |
ridgeco | Ridge regression with Cut Off | ✔ |
RNI | Robust network inference | ✔ |
svmc | Support Vector Machine classification | ✔ |
tlsco | Total Least Squares Cut Off | ✔ |
Zscore | Z-score based inference | ✔ |
aracne | Mutual information with ARACNE | ✘ |
BC3NET | Bayesian bootstrapping based on mutual information | ✘ |
CLR | Context Likelihood of Relatedness | ✘ |
GENIE3 | Regression trees with Genie3 | ✘ |
PLSNET | Partial least squares | ✘ |
svmr | Support Vector Machine regression | ✘ |
TIGRESS | Trustful Inference of Gene REgulation using Stability Selection | ✘ |
Below is an example visualization of directed, signed and weighted network.
G1 | G2 | G3 | G4 | G5 | |
---|---|---|---|---|---|
G1 | 0.0 | 0 | 0.00 | 0.00 | 0.9 |
G2 | -0.5 | 0 | 0.25 | 0.00 | -0.7 |
G3 | -0.9 | 0 | 0.00 | 0.00 | 0.0 |
G4 | 0.0 | 0 | -0.80 | 0.00 | 0.0 |
G5 | 0.0 | 0 | 0.56 | 0.77 | 0.0 |
Below you can find tutorials for synthetic, as well as real gene expression data simulation and GRN inference
clear
addpath(genpath('path/to/GeneSPIDER'));
% define the size of your data and network, i.e. number of genes
N = 50;
% define desired sparsity degree, e.g. 3 edges per node on average
S = 3;
% create scale-free network that gives probability to edges
A = datastruct.scalefree2(N, S); % A is the network matrix
% stabilize networks
A = datastruct.stabilize(A,'iaa','low');
% as random weights cannot guarantee that the network is stable
% hence we stabilize, i.e. tune IAA degree
% create Network object
Net = datastruct.Network(A, 'myNetwork');
% define perturbation matrix for the experiment
% for two replicates
P = -[eye(N) eye(N)];
% create data
X = Net.G*P; % where G - static gain model, A - network
% add noise to data
% define signal-to-noise ratio
SNR = 0.1; % usually 1 is low, 0.1 is medium, 0.01 is high noise
s = svd(X);
stdE = s(N)/(SNR*sqrt(chi2inv(1-analyse.Data.alpha,numel(P))));
% estimate noise matrix
E = stdE*randn(size(P));
% input noise matrix
F = zeros(size(P));
% prepare data structure obejct
% assign scale-free network
D(1).network = Net.network;
% define zero noise
D(1).E = E;
D(1).F = F;
D(1).Y = X+E; % noise-free gene expression + noise
D(1).P = P;
D(1).lambda = [stdE^2,0];
D(1).cvY = D.lambda(1)*eye(N);
D(1).cvP = zeros(N);
D(1).sdY = stdE*ones(size(D.P));
D(1).sdP = zeros(size(D.P));
% create data object with data "D" and scale-free network "Net"
Data = datastruct.Dataset(D, Net);
% now we can run inference and benchmark it against our gold standard
zeta = logspace(-6,0,30); % return 30 networks of full sparsity spectrum
infMethod = 'lsco';
[Aest0, z0] = Methods.(infMethod)(Data,zeta);
Aest01 = Aest0(:,:,25); % select network 25
% compare models
M = analyse.CompareModels(Net, Aest0);
% display AUROC
M.AUROC
% display max F1 score
max(M.F1)
clear
addpath(genpath('path/to/GeneSPIDER'));
% load Y and P matrices
load('data/Y_bulk_k562.mat')
load('data/P_bulk_k562.mat')
Y = Y_bulk_k562;
P = P_bulk_k562;
% fetch size of the data, i.e. number of genes
N = size(Y,1);
% start with setting up an empty network
A = zeros(N);
% create a Network object
Net = datastruct.Network(A, 'myNetwork');
% define stdE
stdE = std(Y(:));
% define zero noise matrix
E = [zeros(N) zeros(N)];
% define input noise matrix as zeros
F = zeros(size(P));
% prepare data structure obejct
% assign an empty network
D(1).network = Net.network;
% define zero noise
D(1).E = E;
D(1).F = F;
D(1).Y = Y; % here is where your data is assigned
D(1).P = P;
D(1).lambda = [stdE^2,0];
D(1).cvY = D.lambda(1)*eye(N);
D(1).cvP = zeros(N);
D(1).sdY = stdE*ones(size(D.P));
D(1).sdP = zeros(size(D.P));
% create a data object with your data and scale-free network
Data = datastruct.Dataset(D, Net);
% infer a set of networks
zeta = logspace(-6,0,30); % return 30 networks of full sparsity spectrum
infMethod = 'lsco';
[Aest, z] = Methods.(infMethod)(Data,zeta);
clear
addpath(genpath('path/to/GeneSPIDER'));
%% CREATE A LARGE SCALE-FREE GRN
m = 500; % number of genes
Savg = 3; % average sparsity, i.e. the average node degree in GRN
A = datastruct.large_scalefree(m, Savg);
%% CREATE SINGLE-CELL DATA
cn = 20; % assuming 20 cells were perturbed for each gene
P = -repmat(eye(m),1,cn); % example perturbation design matrix
% gives in total 1000 cells
SNRv = 0.1; % noise level
% raw counts data are set to default:
% [Y, X, Ed, Eg, SCC] = datastruct.scdata(A, P, "SNR", SNRv);
% to obtain fold-change data, set raw_counts to false
[Y, X, Ed, Eg, SCC] = datastruct.scdata(A, P, "SNR", SNRv, 'raw_counts', false);
The datastruct.scdata() main parameters:
The datastruct.scdata() other parameters:
The datastruct.scdata() outputs:
%% GRN INFERENCE OF SINGLE-CELL DATA
D(1).network = A;
D(1).E = Ed.*Eg; % total noise, dropouts and gaussian
D(1).F = zeros(size(P));
D(1).Y = Y; % response matrix + noise
D(1).P = P; % perturbation design matrix
stdE = sqrt(var(X(:))/SNRv);
D(1).lambda = [stdE^2,0];
D(1).cvY = D.lambda(1)*eye(m);
D(1).cvP = zeros(m);
D(1).sdY = stdE*ones(size(D.P));
D(1).sdP = zeros(size(D.P));
zeta = logspace(-6,0,30);
Net = datastruct.Network(A, 'scnet');
data = datastruct.Dataset(D, Net);
[Aest, zv] = Methods.('LSCON')(data, zeta);
M = analyse.CompareModels(Net.A, Aest);
disp(M.F1) % display all F1-scores
clear
addpath(genpath('path/to/GeneSPIDER'));
% load Y and P matrices
load('data/Y_sc_k562.mat')
load('data/P_sc_k562.mat')
Y = Y_sc_k562;
P = P_sc_k562;
% fetch size of the data, i.e. number of genes
N = size(Y,1);
% start with setting up an empty network
A = zeros(N);
% create a Network object
Net = datastruct.Network(A, 'myNetwork');
% define stdE
stdE = std(Y(:));
% define zero noise matrix
E = [zeros(size(Y))];
% define input noise matrix as zeros
F = zeros(size(P));
% prepare data structure obejct
% assign an empty network
D(1).network = Net.network;
% define zero noise
D(1).E = E;
D(1).F = F;
D(1).Y = Y; % here is where your data is assigned
D(1).P = P;
D(1).lambda = [stdE^2,0];
D(1).cvY = D.lambda(1)*eye(N);
D(1).cvP = zeros(N);
D(1).sdY = stdE*ones(size(D.P));
D(1).sdP = zeros(size(D.P));
% create a data object with your data and scale-free network
Data = datastruct.Dataset(D,Net);
% infer a set of networks
zeta = logspace(-6,0,30); % return 30 networks of full sparsity spectrum
infMethod = 'lsco';
[Aest, z] = Methods.(infMethod)(Data,zeta);
A web server for benchmarking directed gene regulatory network inference methods. Available at https://grnbenchmark.org/
Below you can find a script that loads data files from GRN benchmark and saves GRN in the required input
clear
addpath(genpath('path/to/GeneSPIDER'));
tool = "GeneNetWeaver"; % select between GeneSPIDER and GeneNetWeaever
nlev = "LowNoise"; % select between LowNoise, HighNoise, MediumNoise
pathin = 'path/to/GRNBenchmark/Data';
pathout = 'path/to/your/folder';
reps = 5; % repetitions, in GRN benchmark it is 5
for j = 1:reps
Y = readtable(pathin+tool+"_"+nlev+"_Network"+j+"_GeneExpression.csv", "ReadRowNames", true);
gnsnms = string(cell2mat(Y.Row));
P = readtable(pathin+tool+"_"+nlev+"_Network"+j+"_Perturbations.csv", "ReadRowNames", true);
Y = table2array(Y);
P = table2array(P);
N = size(Y,1);
A = zeros(N);
Net = datastruct.Network(A, 'myNetwork');
D(1).network = [];
% define zero noise
D(1).E = [zeros(N) zeros(N) zeros(N)];
D(1).F = zeros(N);
D(1).Y = Y; % here is where your data is assigned
D(1).P = P;
D(1).lambda = [std(Y(:))^2,0];
D(1).cvY = D.lambda(1)*eye(N);
D(1).cvP = zeros(N);
D(1).sdY = std(Y(:))*ones(size(D.P));
D(1).sdP = zeros(size(D.P));
% create data object with data "D" and scale-free network "Net"
Data = datastruct.Dataset(D, Net);
% now we can run inference
zeta = 0; % return full network as GRN benchmark do cutoff internally
infMethod = 'LSCON';
[Aest, z] = Methods.(infMethod)(Data,zeta);
inet = Aest; % network to save
wedges = compose("%9.5f",round(inet(:),5)); % keep weights
inet(inet<0) = -1; % convert to signed edges without weights
inet(inet>0) = 1;
edges = inet(:); % from left to right, columns are merged to one vec
s = size(inet,1);
nams_edges = [repmat(1:s, 1, s); repelem(1:s, s)]';
edges_from = gnsnms(nams_edges(:,2));
edges_to = gnsnms(nams_edges(:,1));
nrid = string((1:length(edges_from))');
edge_list = table(nrid, edges_from, edges_to, wedges, string(edges));
allVars = 1:width(edge_list);
newNames = ["ID","Regulator","Target","Weight","Sign"]; % define names in the file
edge_list = renamevars(edge_list,allVars,newNames);
Var1 = "";
Var2 = "Regulator";
Var3 = "Target";
Var4 = "Weight";
Var5 = "Sign";
newNamesTab = table(Var1,Var2,Var3,Var4,Var5);
newNamesTab = renamevars(newNamesTab,allVars,newNames);
edge_list(edges==0,:) = [];
edge_list2 = [newNamesTab; edge_list];
writetable(edge_list2,pathout+tool+"_"+nlev+"_Network"+j+"_grn.csv",'QuoteStrings',true,"WriteVariableNames",false) % save as csv
end
lsco | lscon | tlsco | genie3 | lasso | elnet | ridgeco | clr | rni | zscore | |
---|---|---|---|---|---|---|---|---|---|---|
Description | Least Squares Cut Off | Least Squares Cut-Off with Normalization | Total Least Squares Cut Off | Genie3 | Lasso | Elastic net | Ridge regression with cut off | cLR | RNI | Zscore-based inference |
clear
addpath(genpath('path/to/GeneSPIDER'));
% define the size of your data and network, i.e. number of genes
N = 50;
% define desired sparsity degree, e.g. 3 edges per node on average
S = 3;
% create scale-free network that gives probability to edges
A = datastruct.scalefree2(N, S); % A is the network matrix
% stabilize networks
A = datastruct.stabilize(A,'iaa','low');
% as random weights cannot guarantee that the network is stable
% hence we stabilize, i.e. tune IAA degree
% creating Network object with
net = datastruct.Network(A, 'yourNetwork');
SNR = 0.1; % set a signal-to-noise ratio
P = -[eye(N) eye(N)]; %sample perturbation matrix
X = net.G*P; % corresponding response Y, G is the static gain matrix (inverse of A (network matrix))
s = svd(X);
stdE = s(N)/(SNR*sqrt(chi2inv(1-analyse.Data.alpha,numel(P))));
E = stdE*randn(size(P)); % noise matrix
F = zeros(size(P)); % input noise matrix
D(1).network = net.network;
D(1).E = E; % noise
D(1).F = F;
D(1).Y = X + E; %response matrix + noise
D(1).P = P; % perturbation matrix
D(1).lambda = [stdE^2,0];
D(1).cvY = D.lambda(1)*eye(N);
D(1).cvP = zeros(N);
D(1).sdY = stdE*ones(size(D.P));
D(1).sdP = zeros(size(D.P));
data = datastruct.Dataset(D,net);
zetavec = logspace(-6,0,30);
method="lsco"; % see Table.1
nest = 10; % outer iterations
boot = 10; % inner iterations
fdr = 0.05; % fdr thershold
paral = true; % turn on/off parallelization
cornr = 4; % number of cores to use
direc = '~/';
nbout = Methods.NestBoot(data, method, nest, boot, zetavec, fdr, direc, paral, cornr);
Description | |
---|---|
binary_networks | bootstrapped intersection at a given support cutoff |
signed_networks | signed network |
summed_networks | sum of signum function support to be sign |
minab_networks | minimum absolute value of sign support |
bin_cutoff | crossing of shufled and plain data |
accumulated_frequency | accumulated structure support |
binned_frequency | frequency of bins |
FP_rate_cross | frequency rate for crossing |
support_at_FP_cross | support for crossing (bin_cutoff) |
area_shuffled | area under bins from trapeozid for shuffled networks |
area_measured | area under bins from trapeozid for measured networks |
Morgan D, Tjärnberg A, Nordling TEM, Sonnhammer ELL. A generalized framework for controlling FDR in gene regulatory network inference. Bioinformatics. 2019 Mar 15;35(6):1026-1032. doi: 10.1093/bioinformatics/bty764. PMID: 30169550
Below you can find a script that allows to load GeneSPIDER network and display it with igraph package ## Loading GeneSPIDER networks
library(R.matlab)
# load file with networks from Matlab
# in this example we use one of the random scale-free network inferred in GeneSPIDER
nets <- readMat(path_to_the_file.mat)
nets <- nets[[1]]
library(igraph)
library(tidyverse)
# select 7th network from all estimated networks
selected_net <- nets[,,7]
# convert to graph format
# directed and weighted network
net0 <- graph_from_adjacency_matrix(
selected_net %>% t,
mode = c("directed"),
weighted = T)
# define edges colors based on weights
clr <- E(graph_sn)$weight
npos <- clr>=0
nneg <- clr<0
clr[npos] <- "#FF3807" # positive regulation
clr[nneg] <- "#0796FF" # negative regulation
E(graph_sn)$color <- clr
plot(graph_sn, edge.arrow.size=.2, vertex.color="#D9D9D9", vertex.size=12,
vertex.frame.color="gray", vertex.label.color="black",
vertex.label.cex=0.6, edge.curved=0.2)