Analytical issues using DIABLO

Dear all,

Firstly, thank you so much for developing such an interesting algorithm. I would like to use DIABLO to integrate lipidomics, metabolomics, and transcriptomics data, using the combined multi-omics data to achieve good discriminative power. However, I have failed to obtain acceptable results. I describe my data-analysis workflow, code, and main findings below.

First, I preprocessed the data (filtering, normalization, and scaling) before running DIABLO. I then split my dataset (150 samples) at an 8:2 ratio and applied DIABLO to the training set. However, I encountered some issues that make my results (confusion matrix and AUC) less convincing.
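For reference, the 8:2 split can be sketched as follows (placeholder names X and outcome; the class-stratified sampling shown here is just one reasonable way to do it, not necessarily exactly what I ran):

# Stratified 80/20 split so both sets keep the class balance
# X is a named list of omics matrices (samples x features); outcome is a factor
set.seed(123)
train_idx <- unlist(lapply(levels(outcome), function(lv) {
  idx <- which(outcome == lv)
  sample(idx, size = round(0.8 * length(idx)))
}))
X.train <- lapply(X, function(block) block[train_idx, , drop = FALSE])
X.test <- lapply(X, function(block) block[-train_idx, , drop = FALSE])
Y.train <- outcome[train_idx]
Y.test <- outcome[-train_idx]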


tune_diablo_ncomp <- function(data, classes, design, ncomp = 5,
                              BPPARAM = BiocParallel::SerialParam(),
                              near_zero_var = FALSE) {
  # First, fit a DIABLO model without variable selection to assess the global
  # performance and choose the number of components for the final DIABLO model.
  # perf() is run with 5-fold cross-validation repeated 5 times.
  print("Finding optimal number of components for DIABLO...")
  sgccda_res <- block.splsda(X = data, Y = classes, ncomp = ncomp,
                             design = design, near.zero.var = near_zero_var)
  # this step takes a couple of minutes to run
  perf_diablo <- perf(sgccda_res, validation = "Mfold", folds = 5, nrepeat = 5,
                      BPPARAM = BPPARAM)
  print(perf_diablo$error.rate)
  #plot(perf_diablo, main = "DIABLO optimal components")
  ncomp <- perf_diablo$choice.ncomp$WeightedVote["Overall.BER", "centroids.dist"]
  output <- list(ncomp = ncomp, perf_diablo = perf_diablo)
  return(output)
}
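For completeness, this helper is called along these lines (a sketch: the 0.1 off-diagonal design matrix is the usual mixOmics starting point, the parallel backend is just an example, and X.pre.m1.b.new / Y.pre.m1.b.new are the objects defined further below):

library(mixOmics)
library(BiocParallel)

# Full design matrix with 0.1 off-diagonal, a common starting point for DIABLO
block_names <- c("meta", "lip", "rna")
design <- matrix(0.1, nrow = length(block_names), ncol = length(block_names),
                 dimnames = list(block_names, block_names))
diag(design) <- 0

tuned <- tune_diablo_ncomp(data = X.pre.m1.b.new, classes = Y.pre.m1.b.new,
                           design = design, ncomp = 5,
                           BPPARAM = SnowParam(workers = 4))
ncomp <- tuned$ncomp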

In my results, the tuned parameters do not look appropriate for further analyses. First, the classification error rate is higher than 0.40, so I cannot confidently extract the number of components.

Nevertheless, I built the model with the optimised parameters (ncomp and keepX). Next, I obtained the AUC from the perf() function and applied the same model to the test dataset.
auc <- function(X.train, Y.train, ncomp, design, testdata, outcome.test, keepx) {
  # Fit the final DIABLO model with the tuned ncomp and keepX
  diablo.tcga.final <- block.splsda(X.train, Y.train, ncomp = ncomp,
                                    design = design, keepX = keepx)
  # Cross-validated performance (and AUC) on the training data
  perf.diablo.tcga.final.mfold <- perf(diablo.tcga.final, validation = "Mfold",
                                       folds = 5, nrepeat = 100, auc = TRUE)
  #perf.diablo.tcga.final.loo <- perf(diablo.tcga.final, validation = "loo", auc = TRUE)
  # AUC on the external test set, per block and per component
  auroc.sgccda.result <- auroc(diablo.tcga.final, plot = FALSE,
                               newdata = testdata, outcome.test = outcome.test)
  train.auc.mfold <- perf.diablo.tcga.final.mfold$auc
  comps <- 1:diablo.tcga.final$ncomp
  blocks <- 1:length(diablo.tcga.final$keepX)
  # Average the test-set AUC across blocks for each component
  auc.rep.comp <- lapply(comps, function(comp) {
    Reduce("+", lapply(blocks, function(block) {
      auroc.sgccda.result[[block]][[comp]][1]
    })) / length(blocks)
  })
  print("Predicting on the external test set...")
  predict.diablo <- predict(diablo.tcga.final, newdata = testdata)
  print("Getting confusion matrix...")
  confusion_matrix <- get.confusion_matrix(
    truth = outcome.test,
    predicted = predict.diablo$WeightedVote$max.dist[, 1])
  out <- list(Auc.mfold = train.auc.mfold, Auc.test = auc.rep.comp,
              confusion_matrix = confusion_matrix)
  return(out)
}

X.pre.m1.b.new <- list(meta = omics.pre.m1.b.new$data.train$met,
                       lip = omics.pre.m1.b.new$data.train$lip,
                       rna = omics.pre.m1.b.new$data.train$rna)

# Outcome
Y.pre.m1.b.new <- factor(omics.pre.m1.b.new$data.train$outcome$M1_binary,
                         levels = c("0", "1"),
                         labels = c("Low response", "High response"))

summary(Y.pre.m1.b.new)

X.pre.m1.b.new.test.raw <- list(meta = omics.pre.m1.b.new$data.test$met,
                                lip = omics.pre.m1.b.new$data.test$lip,
                                rna = omics.pre.m1.b.new$data.test$rna)
Y.pre.m1.b.new.test <- factor(omics.pre.m1.b.new$data.test$outcome$M1_binary,
                              levels = c("0", "1"),
                              labels = c("Low response", "High response"))
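
For reference, the output shown below was produced by a call along these lines (the keepX list here is a placeholder, not the actual tuned values):

results <- auc(X.train = X.pre.m1.b.new, Y.train = Y.pre.m1.b.new,
               ncomp = 5, design = design,
               testdata = X.pre.m1.b.new.test.raw,
               outcome.test = Y.pre.m1.b.new.test,
               keepx = list(meta = rep(10, 5),  # placeholder keepX values,
                            lip = rep(10, 5),   # not the actual tuned ones
                            rna = rep(10, 5)))
results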

            $Auc.mfold
            $Auc.mfold$comp1
                                                AUC      p-value
            Low response vs High response 0.8382438 1.190397e-05
            
            $Auc.mfold$comp2
                                                AUC      p-value
            Low response vs High response 0.8557513 2.886073e-06
            
            $Auc.mfold$comp3
                                               AUC      p-value
            Low response vs High response 0.856822 2.706252e-06
            
            $Auc.mfold$comp4
                                                AUC      p-value
            Low response vs High response 0.8580873 2.374651e-06
            
            $Auc.mfold$comp5
                                                AUC      p-value
            Low response vs High response 0.8585538 2.191657e-06
            
            
            $Auc.test
            $Auc.test[[1]]
            [1] 0.6628667
            
            $Auc.test[[2]]
            [1] 0.6003667
            
            $Auc.test[[3]]
            [1] 0.5928
            
            $Auc.test[[4]]
            [1] 0.5984667
            
            $Auc.test[[5]]
            [1] 0.6060333
            
            
            $confusion_matrix
                          predicted.as.Low response predicted.as.High response
            Low response                          9                          2
            High response                         5                         11

I understand that the AUCs shown here do not demonstrate strong predictive power, and the confusion matrix produced afterwards is consistent with that. Since I cannot tell whether the low predictive power stems from the code or from underlying biological reasons, could you please help me check the code?

Thank you so much in advance and I hope to hear back from you.

Shilin

Hi @shilin

Based purely on your code, I can't really provide much direction. It seems that the methodology you have followed is valid. I cannot see where you are specifying the keepX range - this may have an impact. Look here for more info. If you are partitioning your data into training and testing prior to the use of tune_diablo_ncomp(), then you are only tuning on the training data, which is incorrect. The perf() function partitions the data for you within each repeat.
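For example, keepX is usually tuned with tune.block.splsda(), along these lines (the test.keepX grid values here are arbitrary examples):

# Tune keepX on the full data; tune.block.splsda() does its own CV partitioning
test.keepX <- list(meta = c(5, 10, 15, 20),
                   lip = c(5, 10, 15, 20),
                   rna = c(5, 10, 15, 20))
tune.res <- tune.block.splsda(X = X.pre.m1.b.new, Y = Y.pre.m1.b.new,
                              ncomp = ncomp, design = design,
                              test.keepX = test.keepX,
                              validation = "Mfold", folds = 5, nrepeat = 10,
                              dist = "centroids.dist")
keepx <- tune.res$choice.keepX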

Apart from these suggestions, there isn't much else I can say. It may just be that your data cannot be discriminated on, or that your sample is not representative of the population you are examining.

Hi @MaxBladen

Thanks for your kind reply and helpful suggestions.

I have corrected the code so that the number of components is now chosen before the data are split.

No significant difference was observed in the distribution of key demographic factors (age and gender) between the training and test sets. I suspect the low discriminant power may stem from the phenotype data. I will try a new approach to categorising my samples and hope the omics data can discriminate well between the new classes.

Thanks again for your time and assistance.