With this function, the user can simulate realistic read counts for genes and spike-ins across two or more groups of samples (cells).
simulateCounts(n=c(20,100,30,25,500), ngenes=10000, p.DE=0.1, pLFC, p.B=NULL, bLFC=NULL, bPattern="uncorrelated", p.M=NULL, mLFC=NULL, params, size.factors='equal', spike=NULL, spikeIns=FALSE, downsample=FALSE, geneset=FALSE, sim.seed=NULL, verbose=TRUE)
| n | The vector of sample groups with n samples, |
|---|---|
| ngenes | The total number of genes to simulate. Default is 10000. |
| p.DE | Numeric vector between 0 and 1 representing the proportion of genes being differentially expressed due to phenotype, i.e. biological signal. Default is 0.1. |
| pLFC | The log2 phenotypic fold changes for DE genes. (1) For two-group simulations, this can be: (a) a constant, e.g. 2; (b) a vector of values whose length equals the number of DE genes. If the input is a vector and its length is not the number of DE genes, it will be sampled with replacement to generate the log2 fold changes; (c) a univariate function that takes an integer n and generates a vector of length n, e.g. function(x) rnorm(x, mean=0, sd=1.5). (2) For multigroup simulations, this can be: (a) a list with as many elements as there are groups to simulate. Each element of the list is a vector of log2 fold changes; (b) a multivariate function that takes an integer n and generates a data frame with as many columns as there are groups, e.g. function(x) mvtnorm::rmvnorm(x, mean=c(4,2,1), sigma = matrix(c(4,2,2,2,3,2,2,2,5), ncol=3)). |
| p.B | Numeric vector between 0 and 1 representing the proportion of genes being differentially expressed between batches. Default is NULL. |
| bLFC | The log2 batch fold change for all genes. This can be: (1) a constant, e.g. 2; (2) a vector of values whose length equals the total number of genes. If the input is a vector and its length is not the total number of genes, it will be sampled with replacement to generate the log2 fold changes; (3) a univariate function that takes an integer n and generates a vector of length n, e.g. function(x) rnorm(x, mean=0, sd=1.5). Note that only two batches will be simulated irrespective of the number of phenotypic groups defined in pLFC. |
| bPattern | Character vector for the batch effect pattern. Possible options are "uncorrelated", "orthogonal" and "correlated". Default is "uncorrelated". |
| p.M | Numeric vector between 0 and 1 representing the proportion of genes being differentially expressed exclusively in one group, i.e. marker genes. Default is NULL. |
| mLFC | The log2 fold change for marker genes. This can be: (1) a constant, e.g. 2; (2) a vector of values whose length equals the number of marker genes. If the input is a vector and its length is not the number of marker genes, it will be sampled with replacement to generate the log2 fold changes; (3) a function that takes an integer n and generates a vector of length n, e.g. function(x) rnorm(x, mean=0, sd=1.5). |
| params | The distributional parameters for simulations of genes, i.e. the output of |
| size.factors | Size factors representing sample-specific differences/biases in expected mean values of the NB distribution: "equal" or "given". The default is "equal". |
| spike | The distributional parameters for simulations of spike-ins, i.e. the output of |
| spikeIns | Logical value to indicate whether to simulate spike-ins. Default is FALSE. |
| downsample | Logical value to indicate whether to draw the associated dispersions after determining the effective mean expressions by size factors. Default is FALSE. |
| geneset | Whether to sample with replacement or to fill count tables with low-magnitude Poisson counts when the estimated mean expression vector is shorter than the number of genes to be simulated. Default is FALSE. |
| sim.seed | Simulation seed. |
| verbose | Logical value to indicate whether to show a progress report of the simulation. Default is TRUE. |
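The accepted forms of pLFC, bLFC and mLFC can be illustrated with a short base R sketch. The helper resolve_lfc below is hypothetical and is not part of the package; it only mimics the rules stated in the table (constant, vector resampled with replacement when its length does not match, or a function of n). The use of round() for the number of DE genes and the column names g1 to g3 are likewise assumptions made for illustration.

```r
# Hypothetical helper (not a package function): expand a two-group
# log2 fold change specification into one value per affected gene,
# following the rules described for pLFC, bLFC and mLFC.
resolve_lfc <- function(lfc, n_genes) {
  if (is.function(lfc)) {
    lfc(n_genes)                                  # function of an integer n
  } else if (length(lfc) == 1) {
    rep(lfc, n_genes)                             # constant
  } else if (length(lfc) == n_genes) {
    lfc                                           # vector of matching length
  } else {
    sample(lfc, size = n_genes, replace = TRUE)   # resampled with replacement
  }
}

set.seed(1)
ngenes <- 10000
p.DE   <- 0.1
n_de   <- round(ngenes * p.DE)   # assumption: number of DE genes is rounded

lfc_const <- resolve_lfc(2, n_de)
lfc_vec   <- resolve_lfc(c(-2, -1, 1, 2), n_de)
lfc_fun   <- resolve_lfc(function(x) rnorm(x, mean = 0, sd = 1.5), n_de)

# Multigroup forms for three groups:
# (a) a list with one vector of log2 fold changes per group
lfc_list <- list(rnorm(n_de, mean = 4), rnorm(n_de, mean = 2), rnorm(n_de, mean = 1))

# (b) a function of n returning a data frame with one column per group
#     (independent columns here; the documented example uses mvtnorm::rmvnorm
#     to draw correlated fold changes instead)
lfc_multi <- function(x) {
  data.frame(g1 = rnorm(x, mean = 4), g2 = rnorm(x, mean = 2), g3 = rnorm(x, mean = 1))
}
lfc_df <- lfc_multi(n_de)
```

Passing a function is the most flexible choice, since the same specification then works for any ngenes and p.DE without adjusting vector lengths by hand.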
List with the following vectors:
The simulated read count matrix for genes with rows = genes and columns = samples.
The simulated read count matrix for spike-ins with rows = spike-ins and columns = samples.
A vector (length=ngenes*p.DE) for the IDs of phenotypic DE genes.
A vector (length=ngenes*p.B) for the IDs of batch DE genes.
A vector (length=ngenes*p.M) for the IDs of marker genes.
A vector / matrix (columns = length(n); rows = ngenes) for phenotypic log fold change of all genes, i.e. non-DE = 0 and DE = pLFC.
A vector / matrix (columns = length(n); rows = ngenes) for batch log fold change of all genes, i.e. non-DE = 0 and DE = bLFC.
A vector / matrix (columns = length(n); rows = ngenes) for marker log fold change of all genes, i.e. non-DE = 0 and DE = mLFC.
Input parameters.
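To make the shapes of the returned ID and fold change vectors concrete, here is a minimal base R sketch for the two-group case. It does not call the package and does not reproduce its internals; the variable names, the random choice of DE gene IDs and the use of round() are assumptions for illustration only.

```r
set.seed(42)
ngenes <- 10000
p.DE   <- 0.05

# Assumption: the number of DE genes is ngenes * p.DE, rounded
n_de <- round(ngenes * p.DE)

# IDs of the DE genes: a vector of length ngenes * p.DE
de_id <- sample(ngenes, n_de)

# Log2 fold change for all genes: 0 for non-DE genes,
# the drawn fold change for DE genes
lfc_all <- numeric(ngenes)
lfc_all[de_id] <- rnorm(n_de, mean = 0, sd = 1.5)

length(de_id)      # number of DE gene IDs
length(lfc_all)    # one entry per simulated gene
```

For multigroup simulations the fold change object would be a matrix with one column per group rather than a vector, as described in the list above.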
# NOT RUN {
## define log2 fold changes
p.foo <- function(x) mvtnorm::rmvnorm(x, mean=c(4,2,1),
                                      sigma = matrix(c(4,2,2,2,3,2,2,2,5), ncol=3))
b.foo <- function(x) rnorm(x, mean=0, sd=1.5)
## simulate 3 groups of cells
simcounts <- simulateCounts(n=c(100,110,90), ngenes=10000,
                            p.DE=0.05, pLFC = p.foo,
                            p.B=0.1, bLFC=b.foo, bPattern="uncorrelated",
                            p.M=NULL, mLFC=NULL,
                            params=kolodziejczk_param,
                            size.factors="equal",
                            spike=NULL, spikeIns=FALSE,
                            downsample=FALSE, geneset=FALSE,
                            sim.seed=34628, verbose=TRUE)
# }