Model validation / analysis

Hello everyone!

I want to study the interaction between the host (transcriptomics data) and the bacteria (microbiome data), so I found this amazing package with a lot of documentation to perform the analysis (thank you again!). But after a month of “playing” with mixOmics I have several questions, because I think I’m obtaining poor model metrics and I don’t know whether it’s the data or the analysis.

Once I had studied most of the documentation, I realized that what I want to do is an N-integration with the DIABLO framework. I followed the steps described here: 6 N-Integration | mixOmics vignette

########### Here are the steps I followed and below the questions I have:

1. Clean my data (23 A samples vs 33 B samples, 60 OTUs and 123 genes).

2. Define X (microbiome + transcriptome) and Y (outcome labels, A and B).
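For example, this is roughly how I build them (a sketch only; otu_table and expr_table are placeholders for my cleaned matrices, with the 56 samples in the same row order in both blocks and in Y):

X_all <- list(microbiome = otu_table, transcriptome = expr_table)
####### 56 samples x 60 OTUs and 56 samples x 123 genes, samples ordered A first, then B
Y_all <- factor(c(rep("A", 23), rep("B", 33)))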

3. Define the design matrix. In my case, I’m still testing which design is best for my analysis, fitting the same models with values of 0, 0.5 and 1 (I don’t know how correlated my datasets are).

####### For example, design = 0.5
design_all <- matrix(0.5, ncol = length(X_all), nrow = length(X_all), dimnames = list(names(X_all), names(X_all)))
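Following the vignette, I also set the diagonal of the design to 0 so that a block is not modelled against itself:

diag(design_all) <- 0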

4. Choose the number of components: run perf() with all distances and take the ncomp suggested by the Overall.BER. I’m also still testing which distance is best for my analysis (max.dist, centroids.dist or mahalanobis.dist).

diablo.all <- block.plsda(X_all, Y_all, ncomp = 10, design = design_all)
perf.diablo.all <- perf(diablo.all, dist = "all", validation = 'Mfold', folds = 6, nrepeat = 5)
plot(perf.diablo.all)

perf.diablo.all$choice.ncomp$WeightedVote
####### I tried with max.dist in this example
ncomp <- perf.diablo.all$choice.ncomp$WeightedVote["Overall.BER", "max.dist"]
ncomp

5. Tune the number of variables to keep per block with tune.block.splsda:

set.seed(123)
####### microbiome data: 60 OTUs, transcriptome data: 123 genes.
test.keepX_all <- list(microbiome = seq(3, 60, 3), transcriptome = seq(3, 123, 3))

tune.diablo.all <- tune.block.splsda(X_all, Y_all, ncomp = ncomp,
                                     test.keepX = test.keepX_all, design = design_all,
                                     validation = 'Mfold', folds = 10, nrepeat = 100,
                                     BPPARAM = BiocParallel::SnowParam(workers = 34),
                                     dist = "max.dist", progressBar = TRUE)

list.keepX_all <- tune.diablo.all$choice.keepX

6. Obtain the final model:

diablo.all <- block.splsda(X_all, Y_all, ncomp = ncomp, keepX = list.keepX_all, design = design_all)
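To check which features end up selected in the final model, I use selectVar(), for example:

####### OTUs selected on component 1
selectVar(diablo.all, block = 'microbiome', comp = 1)$microbiome$name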

7. Make several plots: plotIndiv(), circosPlot(), network(), etc.
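For reference, these are the kinds of calls I make (the 0.7 cutoffs are arbitrary choices on my side):

####### correlation between blocks on component 1
plotDiablo(diablo.all, ncomp = 1)
plotIndiv(diablo.all, ind.names = FALSE, legend = TRUE)
####### only show correlations above the cutoff
circosPlot(diablo.all, cutoff = 0.7)
network(diablo.all, blocks = c(1, 2), cutoff = 0.7)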

8. Validate the model? (I’m not sure if this part is correct.)

set.seed(123)
perf.diablo.tcga <- perf(diablo.all, validation = 'Mfold', folds = 10, nrepeat = 10, dist = 'max.dist')

perf.diablo.tcga # Lists the different outputs

perf.diablo.tcga$MajorityVote.error.rate
$max.dist
comp1 comp2 comp3 comp4
A 0.9173913 0.8652174 0.8043478 0.7695652
B 0.4121212 0.4969697 0.5030303 0.5272727
Overall.ER 0.6196429 0.6482143 0.6267857 0.6267857
Overall.BER 0.6647563 0.6810935 0.6536891 0.6484190
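
For comparison, I also check the weighted vote, which weights each block’s prediction instead of counting the votes equally:

perf.diablo.tcga$WeightedVote.error.rate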

auc.diablo.tcga <- auroc(diablo.all, roc.block = "microbiome", roc.comp = c(1, 2, 3, 4), print = FALSE)
####### The AUC in this plot is 0.8959

auc.diablo.tcga <- auroc(diablo.all, roc.block = "transcriptome", roc.comp = c(1, 2, 3, 4), print = FALSE)
####### The AUC in this plot is 0.747

########### Questions:

1. My way of testing which design / distance to use is based on the metrics I get in perf.diablo.tcga$MajorityVote.error.rate and the AUC obtained for each block (microbiome / transcriptome). Is this the right way to decide which design / distance is best?

2. Am I doing the model validation right? Or should I split my data 80/20 and test the model on blind data (see the sketch below of what I mean)? I understand that cross-validation (Mfold) is enough, but I wanted to ask because I am not sure.
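The hold-out check I had in mind would look roughly like this (a sketch only; train_idx is a placeholder for a stratified index of ~80% of my samples):

X_train <- lapply(X_all, function(x) x[train_idx, ])
X_test <- lapply(X_all, function(x) x[-train_idx, ])
Y_train <- Y_all[train_idx]
Y_test <- Y_all[-train_idx]

####### refit with the tuned parameters and predict the held-out samples
diablo.train <- block.splsda(X_train, Y_train, ncomp = ncomp, keepX = list.keepX_all, design = design_all)
pred <- predict(diablo.train, newdata = X_test)

####### confusion matrix and BER on the blind samples
conf.mat <- get.confusion_matrix(truth = Y_test, predicted = pred$WeightedVote$max.dist[, ncomp])
get.BER(conf.mat)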

3. I thought the model would return a single ROC curve for the overall model, but instead it returns one for the microbiome and one for the transcriptome. Would there be any way to get a ROC curve for the model as a whole? (The only workaround I can think of is sketched below.)
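My idea was to build a consensus ROC myself from the block-averaged predicted scores returned by predict(), using the pROC package. I am not sure this is the intended approach, and I am assuming the AveragedPredict slot averages the dummy-coded predictions across blocks:

library(pROC)
####### in-sample here for illustration; ideally run on held-out data
pred <- predict(diablo.all, newdata = X_all)
####### averaged predicted score for class "A" on the last component (assuming class names in the dimnames)
score_A <- pred$AveragedPredict[, "A", ncomp]
roc(response = Y_all, predictor = score_A)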

4. Do you know what metric values would make a model acceptable? Obviously the AUC should be as close to 1 as possible and the ER and BER as close to 0, but I would like to know, just based on your experience, where the cut-off between a bad model and an acceptable one might lie.

Sorry for the long query; I assure you that I have searched a lot on the internet for the answers, but I am a bit stuck on this.

If you see any other errors that I might have missed in my code, or something that I am not understanding, I will be very grateful to read your comments.

Thank you very much in advance and for your great work!
Marta

Hi @Marta,
thank you for taking the time to explain your analysis; it makes my life easier :slight_smile:

[Reply posted as a screenshot: Screen Shot 2023-06-02 at 09.35.23]

Kim-Anh


Thank you very much for your reply; now I am more confident in my analysis.

Thank you for your work!
Best regards,
Marta