With this function, the user can simulate realistic read counts for genes and spike-ins across two or more groups of samples (cells).

simulateCounts(n=c(20,100,30,25,500), ngenes=10000,
               p.DE=0.1, pLFC,
               p.B=NULL, bLFC=NULL, bPattern="uncorrelated",
               p.M=NULL, mLFC=NULL,
               params, size.factors='equal',
               spike=NULL, spikeIns=FALSE,
               downsample=FALSE, geneset=FALSE,
               sim.seed=NULL, verbose=TRUE)

Arguments

n

A vector giving the number of samples in each group, e.g. c(10,50,20,16).

ngenes

The total number of genes to simulate. Default is 10000.

p.DE

Numeric vector between 0 and 1 representing the proportion of genes that are differentially expressed due to phenotype, i.e. biological signal. Default is 0.1.

pLFC

The log2 phenotypic fold changes for DE genes. (1) For two-group simulations, this can be: (a) a constant, e.g. 2; (b) a vector of values whose length is the number of DE genes. If the vector length differs from the number of DE genes, it is sampled with replacement to generate the log2 fold changes; (c) a univariate function that takes an integer n and generates a vector of length n, e.g. function(x) rnorm(x, mean=0, sd=1.5). (2) For multigroup simulations, this can be: (a) a list with as many elements as there are groups to simulate. The elements of the list are vectors of log2 fold changes; (b) a multivariate function that takes an integer n and generates a data frame with as many columns as there are groups, e.g. function(x) mvtnorm::rmvnorm(x, mean=c(4,2,1), sigma = matrix(c(4,2,2,2,3,2, 2, 2, 5), ncol=3)).
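The accepted forms of pLFC can be sketched in base R as follows. This is only an illustration of the inputs described above, not the package's internal code: the variable names are hypothetical, the number of DE genes is assumed to be round(ngenes * p.DE), and the multigroup function uses independent normal draws as a stand-in for mvtnorm::rmvnorm.

```r
set.seed(1)
ngenes <- 10000
p.DE   <- 0.1
nDE    <- round(ngenes * p.DE)  # assumed number of DE genes

## Two groups, (a): a constant applied to every DE gene
lfc_const <- rep(2, nDE)

## Two groups, (b): a vector shorter than nDE, resampled with replacement
plfc_vec <- c(-2, -1, 1, 2)
lfc_vec  <- sample(plfc_vec, size = nDE, replace = TRUE)

## Two groups, (c): a univariate function of the number of DE genes
plfc_fun <- function(x) rnorm(x, mean = 0, sd = 1.5)
lfc_fun  <- plfc_fun(nDE)

## Multigroup, (a): a list with one vector of log2 fold changes per group
plfc_list <- list(c(-1, 1), c(-2, 2), c(-0.5, 0.5))

## Multigroup, (b): a function returning one column per group
## (independent normals here; a stand-in for a multivariate normal)
plfc_multi <- function(x) {
  data.frame(g1 = rnorm(x, mean = 4, sd = 2),
             g2 = rnorm(x, mean = 2, sd = sqrt(3)),
             g3 = rnorm(x, mean = 1, sd = sqrt(5)))
}
lfc_multi <- plfc_multi(nDE)
```

Any of plfc_vec, plfc_fun, plfc_list or plfc_multi could then be passed as the pLFC argument, provided the number of list elements or columns matches the number of groups in n.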

p.B

Numeric vector between 0 and 1 representing the proportion of genes that are differentially expressed between batches. Default is NULL, i.e. no batch effect.

bLFC

The log2 batch fold change for all genes. This can be: (1) a constant, e.g. 2; (2) a vector of values whose length is the total number of genes. If the vector length differs from the total number of genes, it is sampled with replacement to generate the log2 fold changes; (3) a univariate function that takes an integer n and generates a vector of length n, e.g. function(x) rnorm(x, mean=0, sd=1.5). Note that only two batches will be simulated, irrespective of the number of phenotypic groups defined in pLFC.

bPattern

Character vector for the batch effect pattern. Possible options are "uncorrelated", "orthogonal" and "correlated". Default is "uncorrelated".

p.M

Numeric vector between 0 and 1 representing the proportion of genes that are differentially expressed exclusively in one group, i.e. marker genes. Default is NULL.

mLFC

The log2 fold change for marker genes. This can be: (1) a constant, e.g. 2; (2) a vector of values whose length is the number of marker genes. If the vector length differs from the number of marker genes, it is sampled with replacement to generate the log2 fold changes; (3) a function that takes an integer n and generates a vector of length n, e.g. function(x) rnorm(x, mean=0, sd=1.5).

params

The distributional parameters for simulations of genes, i.e. the output of estimateParam.

size.factors

Size factors representing sample-specific differences/biases in the expected mean values of the negative binomial (NB) distribution: "equal" or "given". The default is "equal", i.e. every sample has a size factor of 1. If set to "given", the size factors are sampled from the size factors ("sf") provided in the output of estimateParam.
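The role of size factors can be illustrated with a small base R sketch. It assumes that a size factor scales the gene-wise mean multiplicatively and that counts follow a negative binomial with size = 1/dispersion. The means, dispersions and the pool of size factors are made-up values standing in for the output of estimateParam, and none of the names below come from the package.

```r
set.seed(7)
ngenes   <- 5
nsamples <- 4

## Hypothetical gene-wise means and dispersions
mu   <- c(5, 20, 100, 500, 2000)
disp <- rep(0.2, ngenes)

## "equal": every sample gets a size factor of 1
sf_equal <- rep(1, nsamples)

## "given": size factors resampled from an estimated pool (made-up pool here)
sf_pool  <- c(0.6, 0.9, 1.1, 1.4)
sf_given <- sample(sf_pool, size = nsamples, replace = TRUE)

## Effective mean per gene (rows) and sample (columns)
eff_mu <- outer(mu, sf_given)

## Negative binomial draws; mu and size are recycled column by column,
## so each row keeps its own dispersion
counts <- matrix(rnbinom(ngenes * nsamples, mu = eff_mu, size = 1 / disp),
                 nrow = ngenes, ncol = nsamples)
```

With sf_equal in place of sf_given, every column of eff_mu equals mu, which corresponds to the default setting.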

spike

The distributional parameters for simulations of spike-ins, i.e. the output of estimateSpike.

spikeIns

Logical value to indicate whether to simulate spike-ins. Default is FALSE.

downsample

Logical value to indicate whether the associated dispersions are drawn after determining effective mean expressions by size factors. Default is FALSE, i.e. using the true mean expression values.

geneset

Logical value that determines how the count table is filled when the estimated mean expression vector is shorter than the number of genes to be simulated: either by sampling with replacement or by filling with low-magnitude Poisson counts. Default is FALSE, i.e. random sampling of mean expression values with replacement.

sim.seed

Simulation seed.

verbose

Logical value to indicate whether to show progress report of simulation. Default is TRUE.

Value

A list with the following elements:

GeneCounts

The simulated read count matrix for genes, with rows = genes and columns = samples.

SpikeCounts

The simulated read count matrix for spike-ins, with rows = spike-ins and columns = samples.

DEid

A vector (length=ngenes*p.DE) for the IDs of phenotypic DE genes.

Bid

A vector (length=ngenes*p.B) for the IDs of batch DE genes.

Mid

A vector (length=ngenes*p.M) for the IDs of marker genes.

pLFC

A vector / matrix (columns = length(n); rows = ngenes) of the phenotypic log fold changes of all genes, i.e. nonDE=0 and DE=pLFC.
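The layout "nonDE=0 and DE=pLFC" can be sketched for the two-group case as below. This is an illustration only: the gene IDs are drawn at random here, the number of DE genes is assumed to be round(ngenes * p.DE), and the object names are hypothetical rather than taken from the package.

```r
set.seed(42)
ngenes <- 10000
p.DE   <- 0.05
nDE    <- round(ngenes * p.DE)  # 500 DE genes under this assumption

## Hypothetical IDs (row indices) of the DE genes
DEid <- sample(ngenes, size = nDE)

## Log2 fold changes for all genes: zero everywhere except at the DE genes
plfc_all <- rep(0, ngenes)
plfc_all[DEid] <- rnorm(nDE, mean = 0, sd = 1.5)
```

Indexing the returned vector by DEid should then recover the non-zero fold changes, which is a convenient consistency check on simulated data.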

bLFC

A vector / matrix (columns = length(n); rows = ngenes) of the batch log fold changes of all genes, i.e. genes without a batch effect = 0 and batch DE genes = bLFC.

mLFC

A vector / matrix (columns = length(n); rows = ngenes) of the marker log fold changes of all genes, i.e. non-marker genes = 0 and marker genes = mLFC.

ngenes, nsims, p.DE, p.B, p.M, sim.seed, n, k

Input parameters.

Examples

# NOT RUN {
## define log2 fold changes
p.foo <- function(x) mvtnorm::rmvnorm(x, mean=c(4,2,1),
                                      sigma = matrix(c(4,2,2,2,3,2, 2, 2, 5), ncol=3))
b.foo <- function(x) rnorm(x, mean=0, sd=1.5)
## simulate 3 groups of cells
simcounts <- simulateCounts(n=c(100,110,90), ngenes=10000,
                            p.DE=0.05, pLFC = p.foo,
                            p.B=0.1, bLFC=b.foo, bPattern="uncorrelated",
                            p.M=NULL, mLFC=NULL,
                            params=kolodziejczk_param,
                            size.factors="equal",
                            spike=NULL, spikeIns=FALSE,
                            downsample=FALSE, geneset=FALSE,
                            sim.seed=34628, verbose=TRUE)
# }