Dear all,
Firstly, thank you so much for developing such an interesting algorithm. I would like to use DIABLO to integrate lipidomics, metabolomics, and transcriptomics data, combining the multi-omics data for better discriminative power. However, I failed to get acceptable results. I will show the process of my data analysis, the code, and the main findings below.
First, I performed data preprocessing (filtering, normalization, and scaling) before using DIABLO. I then split my dataset (150 samples) with an 8:2 ratio and applied DIABLO to the training set. However, I encountered some issues that make my results (confusion matrix and AUC) less convincing.
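For reference, the stratified 80/20 split was done along these lines (a minimal sketch; `outcome` and `X_all` are placeholder names for my actual outcome factor and the list of omics matrices):

```r
set.seed(42)
# stratified 80/20 split: sample row indices within each outcome class
# 'outcome' is a factor of length 150 (placeholder name)
train_idx <- unlist(lapply(levels(outcome), function(cl) {
  idx <- which(outcome == cl)
  sample(idx, size = round(0.8 * length(idx)))
}))
# subset every omics block by the same sample indices so blocks stay matched
X_train <- lapply(X_all, function(block) block[train_idx, , drop = FALSE])
X_test  <- lapply(X_all, function(block) block[-train_idx, , drop = FALSE])
```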
tune_diablo_ncomp <- function(data, classes, design, ncomp = 5,
                              BPPARAM = BiocParallel::SerialParam(),
                              near_zero_var = FALSE) {
  # First, we fit a DIABLO model without variable selection to assess the global
  # performance and choose the number of components for the final DIABLO model.
  # perf() is run with 5-fold cross-validation repeated 5 times.
  print("Finding optimal number of components for DIABLO...")
  sgccda_res <- block.splsda(X = data, Y = classes, ncomp = ncomp,
                             design = design, near.zero.var = near_zero_var)
  # this code takes a couple of minutes to run
  perf_diablo <- perf(sgccda_res, validation = 'Mfold', folds = 5,
                      nrepeat = 5, BPPARAM = BPPARAM)
  #plot(perf_diablo, main = "DIABLO optimal components")
  print(perf_diablo$error.rate)
  ncomp <- perf_diablo$choice.ncomp$WeightedVote["Overall.BER", "centroids.dist"]
  output <- list(ncomp = ncomp, perf_diablo = perf_diablo)
  return(output)
}
In my results, the tuned parameters are not appropriate for further analyses. First, the classification error is higher than 0.40, so I could not extract a reliable number of components.
Nevertheless, I built the model based on the optimized parameters (ncomp and keepX). Next, I obtained the AUC from the perf() function and applied the same model to the test dataset.
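The keepX tuning step is not shown above; it was done roughly as follows (a sketch with an assumed grid of feature counts — adjust to your block sizes):

```r
# candidate numbers of features to keep per block and per component (assumed grid)
test_keepX <- list(meta = c(5, 10, 15, 20),
                   lip  = c(5, 10, 15, 20),
                   rna  = c(5, 10, 15, 20))
tune_res <- tune.block.splsda(X = X.train, Y = Y.train, ncomp = ncomp,
                              design = design, test.keepX = test_keepX,
                              validation = 'Mfold', folds = 5, nrepeat = 10,
                              dist = 'centroids.dist')
# optimal number of features per block, passed as keepX to the final model
keepx <- tune_res$choice.keepX
```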
auc <- function(X.train, Y.train, ncomp, design, testdata, outcome.test, keepx) {
  # fit the final DIABLO model with the tuned ncomp and keepX
  diablo.tcga.final <- block.splsda(X.train, Y.train, ncomp = ncomp,
                                    design = design, keepX = keepx)
  perf.diablo.tcga.final.mfold <- perf(diablo.tcga.final, validation = 'Mfold',
                                       folds = 5, nrepeat = 100, auc = TRUE)
  #perf.diablo.tcga.final.loo <- perf(diablo.tcga.final, validation = 'loo',
  #                                   nrepeat = 100, auc = TRUE)
  # AUC on the external test set, per block and per component
  auroc.sgccda.result <- auroc(diablo.tcga.final, plot = FALSE,
                               newdata = testdata, outcome.test = outcome.test)
  train.auc.mfold <- perf.diablo.tcga.final.mfold$auc
  comps <- 1:diablo.tcga.final$ncomp
  blocks <- 1:length(diablo.tcga.final$keepX)
  # average the test AUC across blocks for each component
  auc.rep.comp <- lapply(comps, function(comp) {
    Reduce("+", lapply(blocks, function(block) {
      auroc.sgccda.result[[block]][[comp]][1]
    })) / length(blocks)
  })
  print("Predicting train on an external test set...")
  predict.diablo <- predict(diablo.tcga.final, newdata = testdata)
  # a warning message will inform us if one block is missing
  #predict.diablo  # list the different outputs
  print("Getting confusion matrix...")
  # note: [, 1] uses the prediction from the first component only
  confusion_matrix <- get.confusion_matrix(
    truth = outcome.test,
    predicted = predict.diablo$WeightedVote$max.dist[, 1])
  out <- list(Auc.mfold = train.auc.mfold,
              Auc.test = auc.rep.comp,
              confusion_matrix = confusion_matrix)
  return(out)
}
X.pre.m1.b.new <- list(meta = omics.pre.m1.b.new$data.train$met,
lip = omics.pre.m1.b.new$data.train$lip,
rna = omics.pre.m1.b.new$data.train$rna)
# Outcome
#Y <- breast.TCGA$data.train$subtype
Y.pre.m1.b.new <- factor(omics.pre.m1.b.new$data.train$outcome$M1_binary,
                         levels = c("0", "1"),
                         labels = c("Low response", "High response"))
summary(Y.pre.m1.b.new)
X.pre.m1.b.new.test.raw <- list(meta = omics.pre.m1.b.new$data.test$met,
lip = omics.pre.m1.b.new$data.test$lip,
rna = omics.pre.m1.b.new$data.test$rna)
Y.pre.m1.b.new.test <- factor(omics.pre.m1.b.new$data.test$outcome$M1_binary,
                              levels = c("0", "1"),
                              labels = c("Low response", "High response"))
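For completeness, the two helpers above were invoked roughly like this (the design matrix value of 0.1 and the `keepx` object from the tuning step are assumptions on my side), which produced the output below:

```r
# full design matrix with off-diagonal 0.1, a common DIABLO starting choice
n_blocks <- length(X.pre.m1.b.new)
design <- matrix(0.1, nrow = n_blocks, ncol = n_blocks,
                 dimnames = list(names(X.pre.m1.b.new), names(X.pre.m1.b.new)))
diag(design) <- 0

tuned <- tune_diablo_ncomp(X.pre.m1.b.new, Y.pre.m1.b.new, design)
res <- auc(X.pre.m1.b.new, Y.pre.m1.b.new, ncomp = tuned$ncomp, design = design,
           testdata = X.pre.m1.b.new.test.raw,
           outcome.test = Y.pre.m1.b.new.test, keepx = keepx)
res
```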
$Auc.mfold
$Auc.mfold$comp1
AUC p-value
Low response vs High response 0.8382438 1.190397e-05
$Auc.mfold$comp2
AUC p-value
Low response vs High response 0.8557513 2.886073e-06
$Auc.mfold$comp3
AUC p-value
Low response vs High response 0.856822 2.706252e-06
$Auc.mfold$comp4
AUC p-value
Low response vs High response 0.8580873 2.374651e-06
$Auc.mfold$comp5
AUC p-value
Low response vs High response 0.8585538 2.191657e-06
$Auc.test
$Auc.test[[1]]
[1] 0.6628667
$Auc.test[[2]]
[1] 0.6003667
$Auc.test[[3]]
[1] 0.5928
$Auc.test[[4]]
[1] 0.5984667
$Auc.test[[5]]
[1] 0.6060333
$confusion_matrix
predicted.as.Low response predicted.as.High response
Low response 9 2
High response 5 11
I understand that the AUC shown here does not reflect strong predictive power, which is why I also produced the confusion matrix. Since I cannot tell whether the low predictive power results from the code or from underlying biological reasons, could you please help me check the code?
Thank you so much in advance and I hope to hear back from you.
Shilin