Hi,
I’m having trouble with parallelisation (via BiocParallel) in some of the mixOmics functions.
It might be a user mistake or a bug, I’m not sure, but the ‘BPPARAM’ argument doesn’t seem to have any effect on running time, at least in the perf() function.
Here is a fully reproducible example:
library(mixOmics)
library(dplyr)
library(BiocParallel)
library(microbenchmark) # provides microbenchmark(), used in the benchmarks below
## -------------------------------------------------------------------------------------------------------------------
data(breast.TCGA) # load in the data
data = list(miRNA = breast.TCGA$data.train$mirna, # set a list of all the X dataframes
            mRNA = breast.TCGA$data.train$mrna,
            proteomics = breast.TCGA$data.train$protein)
Y = breast.TCGA$data.train$subtype # set the response variable as the Y dataframe
## -------------------------------------------------------------------------------------------------------------------
design = matrix(0.1, ncol = length(data),
                nrow = length(data), # for square matrix filled with 0.1s
                dimnames = list(names(data), names(data)))
diag(design) = 0 # set diagonal to 0s
basic.diablo.model = block.splsda(X = data, Y = Y, ncomp = 5, design = design) # form basic DIABLO
## -------------------------------------------------------------------------------------------------------------------
# Benchmark
n_rep = 1
res <- list(
  "MulticoreParam(10)" = microbenchmark(
    perf(basic.diablo.model, validation = 'Mfold',
         folds = 10, nrepeat = 10,
         progressBar = FALSE,
         BPPARAM = MulticoreParam(workers = 10)),
    times = n_rep),
  "MulticoreParam(5)" = microbenchmark(
    perf(basic.diablo.model, validation = 'Mfold',
         folds = 10, nrepeat = 10,
         progressBar = FALSE,
         BPPARAM = MulticoreParam(workers = 5)),
    times = n_rep),
  "MulticoreParam(2)" = microbenchmark(
    perf(basic.diablo.model, validation = 'Mfold',
         folds = 10, nrepeat = 10,
         progressBar = FALSE,
         BPPARAM = MulticoreParam(workers = 2)),
    times = n_rep),
  "SnowParam(10)" = microbenchmark(
    perf(basic.diablo.model, validation = 'Mfold',
         folds = 10, nrepeat = 10,
         progressBar = FALSE,
         BPPARAM = BiocParallel::SnowParam(workers = 10)),
    times = n_rep),
  "SnowParam(5)" = microbenchmark(
    perf(basic.diablo.model, validation = 'Mfold',
         folds = 10, nrepeat = 10,
         progressBar = FALSE,
         BPPARAM = BiocParallel::SnowParam(workers = 5)),
    times = n_rep),
  "SnowParam(2)" = microbenchmark(
    perf(basic.diablo.model, validation = 'Mfold',
         folds = 10, nrepeat = 10,
         progressBar = FALSE,
         BPPARAM = BiocParallel::SnowParam(workers = 2)),
    times = n_rep),
  "SerialParam(1)" = microbenchmark(
    perf(basic.diablo.model, validation = 'Mfold',
         folds = 10, nrepeat = 10,
         progressBar = FALSE,
         BPPARAM = SerialParam()),
    times = n_rep))
bind_rows(res)
The table below shows the results.
Unit: seconds
expr                                     min       lq        mean      median    uq
BPPARAM = MulticoreParam(workers = 10)   25.17865  25.17865  25.17865  25.17865  25.17865
BPPARAM = MulticoreParam(workers = 5)    25.37876  25.37876  25.37876  25.37876  25.37876
BPPARAM = MulticoreParam(workers = 2)    25.19722  25.19722  25.19722  25.19722  25.19722
BPPARAM = SnowParam(workers = 10)        25.45244  25.45244  25.45244  25.45244  25.45244
BPPARAM = SnowParam(workers = 5)         25.81489  25.81489  25.81489  25.81489  25.81489
BPPARAM = SnowParam(workers = 2)         25.91184  25.91184  25.91184  25.91184  25.91184
BPPARAM = SerialParam()                  25.55273  25.55273  25.55273  25.55273  25.55273
Regardless of the number of workers (10, 5, 2, or serial (1)), the running time is essentially the same, between about 25.2 and 25.9 seconds. MulticoreParam and SnowParam give similar results.
This was tested on a Mac (table above) and on a Linux cluster (results not shown here, but they were similar).
The problem doesn’t seem to come from BiocParallel itself:
# Test on a simple function
FUN <- function(x) { round(sqrt(x), 4) }
n_rep = 10
resb <- list(
  "MulticoreParam(10)" = microbenchmark(
    BiocParallel::bplapply(1:10, FUN,
                           BPPARAM = MulticoreParam(workers = 10)),
    times = n_rep),
  "MulticoreParam(5)" = microbenchmark(
    BiocParallel::bplapply(1:10, FUN,
                           BPPARAM = MulticoreParam(workers = 5)),
    times = n_rep),
  "MulticoreParam(2)" = microbenchmark(
    BiocParallel::bplapply(1:10, FUN,
                           BPPARAM = MulticoreParam(workers = 2)),
    times = n_rep),
  "SerialParam(1)" = microbenchmark(
    BiocParallel::bplapply(1:10, FUN,
                           BPPARAM = SerialParam()),
    times = n_rep))
bind_rows(resb)
Unit: milliseconds
expr                           min         lq          mean       median      uq
MulticoreParam(workers = 10)   109.917966  112.847457  117.48494  117.635478  121.685909
MulticoreParam(workers = 5)    105.232978  108.726998  111.34625  110.138341  111.077569
MulticoreParam(workers = 2)    184.162119  184.523493  186.24020  185.903594  187.473689
SerialParam()                  2.200429    2.234254    2.32217    2.266336    2.274926
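→ With a regular R function, BPPARAM clearly changes the running time (the parallel backends are slower than serial here, presumably because the task is tiny and worker overhead dominates), so BiocParallel itself seems to respond to the setting as expected.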
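Since the sqrt() test is too cheap to show an actual speed-up, a heavier check may be more convincing. The following is only a sketch: slow_task is a hypothetical stand-in for an expensive computation (it simply sleeps for 0.5 s), and the workers are started with bpstart() before timing so that start-up cost is not counted. If BiocParallel is working, the two-worker run would be expected to take roughly half as long as the serial one.

```r
library(BiocParallel)

# Hypothetical stand-in for an expensive task: wait 0.5 s, then return sqrt(x).
slow_task <- function(x) {
  Sys.sleep(0.5)
  sqrt(x)
}

# Serial baseline: 8 tasks run one after another.
t_serial <- system.time(
  res_serial <- bplapply(1:8, slow_task, BPPARAM = SerialParam())
)[["elapsed"]]

# Two workers, started up front so that start-up time is excluded from the timing.
param <- SnowParam(workers = 2)
bpstart(param)
t_par <- system.time(
  res_par <- bplapply(1:8, slow_task, BPPARAM = param)
)[["elapsed"]]
bpstop(param)

# Compare elapsed times (in seconds) for the two backends.
c(serial = t_serial, two_workers = t_par)
```

SnowParam is used here only because it runs on every platform; MulticoreParam could be swapped in on Mac or Linux.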
Would you have any idea why BPPARAM has no effect in the perf() function?
Thank you!