Hi Kim-Anh,
I went back to the DIABLO paper, downloaded the "Supervised Multivariate Analyses with mixOmics" note, and ran tune() and perf() to find ncomp and keepX for my data. I used 5-fold CV and just one CPU. The design is the full weighted design.
Here is the code:
data = list(mRNA = mRNA.D,protein = protein.D,metabolite = metabol.D)
# check dimension
lapply(data, dim)
$mRNA
[1] 10 5559
$protein
[1] 10 1365
$metabolite
[1] 10 471
design = matrix(0.1, ncol = length(data), nrow = length(data),dimnames = list(names(data), names(data)))
diag(design) = 0
design
           mRNA protein metabolite
mRNA        0.0     0.1        0.1
protein     0.1     0.0        0.1
metabolite  0.1     0.1        0.0
sgccda.res = block.splsda(X = data, Y = Y, ncomp = 5,design = design)
set.seed(123) # for reproducibility, only when the `cpus' argument is not used
t1 = proc.time()
perf.diablo = perf(sgccda.res, validation = 'Mfold', folds = 5, nrepeat = 5)
t2 = proc.time()
running_time = t2 - t1; running_time
plot(perf.diablo)
perf.diablo$choice.ncomp$WeightedVote
ncomp = perf.diablo$choice.ncomp$WeightedVote["Overall.BER", "centroids.dist"]
test.keepX = list(mRNA = c(5:9, seq(10, 18, 2), seq(20, 30, 5)),
                  protein = c(5:9, seq(10, 18, 2), seq(20, 30, 5)),
                  metabolite = c(5:9, seq(10, 18, 2), seq(20, 30, 5)))
t1 = proc.time()
tune.BBM = tune.block.splsda(X = data, Y = Y, ncomp = ncomp,
                             test.keepX = test.keepX, design = design,
                             validation = 'Mfold', folds = 5, nrepeat = 1,
                             dist = "centroids.dist", cpus = 1)
t2 = proc.time()
running_time = t2 - t1; running_time
list.keepX = tune.BBM$choice.keepX
list.keepX
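As an aside, the test.keepX grid above repeats the same vector for every block, so it could probably be built from one shared vector. This is only a minimal base-R sketch: the names keepX.grid and block.names are illustrative and not part of mixOmics, and the block names are typed out so the snippet runs on its own.

```r
# Candidate numbers of variables to keep per component (same values as above)
keepX.grid <- c(5:9, seq(10, 18, 2), seq(20, 30, 5))

# Block names; assumed to match names(data) in the code above
block.names <- c("mRNA", "protein", "metabolite")

# One copy of the grid per block, returned as a named list
test.keepX <- setNames(rep(list(keepX.grid), length(block.names)), block.names)

# Number of candidate values per block
lengths(test.keepX)

# Size of the full grid if every combination across blocks is evaluated
prod(lengths(test.keepX))
```

With the data list already in memory, block.names could simply be replaced by names(data). The last line gives a rough idea of how many keepX combinations the tuning step may have to go through.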
I got the warning: The SGCCA algorithm did not converge
The list.keepX output is below. My ncomp was also 1.
$mRNA
[1] 5
$protein
[1] 5
$metabolite
[1] 8
Do you think the above is a better way to obtain list.keepX for my data and then feed it to the supervised DIABLO model? My intention now is to find the optimal list.keepX for my data and use it in supervised DIABLO for N-integration, rather than supplying an arbitrary list.keepX to the model. I reckon my initial analysis was very arbitrary. Any advice?
Kind regards,
VD