Hello everyone!

I want to study the interaction between the host (transcriptomics data) and the bacteria (microbiome data), so I found this amazing package, with a lot of documentation, to perform the analysis (thank you again!). But after a month of “playing” with mixOmics I have several doubts, because I think I’m obtaining bad model metrics and I don’t know if it is the data or the analysis.

After studying most of the documentation, I realized that what I want to do is an N-integration with the DIABLO framework. I followed the steps described here: 6 N-Integration | mixOmics vignette

########### Here are **the steps I followed**, with my questions below:

**1. Clean my data** (23 A samples vs 33 B samples, 60 OTUs and 123 genes).

**2. Define X** (microbiome + transcriptome) and Y (outcome labels, A and B).
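####### For context, this is roughly how I build X_all and Y_all (otu_table, expr_table and metadata are placeholder names for my cleaned objects):

X_all <- list(microbiome = as.matrix(otu_table),       # 56 samples x 60 OTUs
              transcriptome = as.matrix(expr_table))   # 56 samples x 123 genes
Y_all <- factor(metadata$group, levels = c("A", "B"))  # 23 A vs 33 B
summary(Y_all)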

**3. Define the design matrix**. In my case, I’m still testing which design is best for my analysis, fitting the same models with 0, 0.5 and 1 (I don’t know how related my datasets are); a small loop sketch comparing the three values is below the example.

####### For example, design = 0.5

design_all <- matrix(0.5, ncol = length(X_all), nrow = length(X_all), dimnames = list(names(X_all), names(X_all)))
diag(design_all) <- 0  # no block is linked to itself, as in the vignette
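####### To compare the three candidate designs, I fit one model per value with the same perf() settings as in step 4 (just a sketch of what I do, not vignette code):

for (w in c(0, 0.5, 1)) {
  design_w <- matrix(w, ncol = length(X_all), nrow = length(X_all),
                     dimnames = list(names(X_all), names(X_all)))
  diag(design_w) <- 0
  fit_w <- block.plsda(X_all, Y_all, ncomp = 10, design = design_w)
  perf_w <- perf(fit_w, dist = "all", validation = 'Mfold', folds = 6, nrepeat = 5)
  print(perf_w$choice.ncomp$WeightedVote)  # compare error rates across designs
}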

**4. Calculate the number of components** with perf() across all distances, selecting ncomp from the Overall.BER. I’m also still testing which distance is best for my analysis (max.dist, centroids.dist or mahalanobis.dist).

diablo.all <- block.plsda(X_all, Y_all, ncomp = 10, design = design_all)

perf.diablo.all <- perf(diablo.all, dist = "all", validation = 'Mfold', folds = 6, nrepeat = 5)

plot(perf.diablo.all)

perf.diablo.all$choice.ncomp$WeightedVote

####### I tried with max.dist in this example

ncomp <- perf.diablo.all$choice.ncomp$WeightedVote["Overall.BER", "max.dist"]

ncomp

**5. Tune the number of variables** to select with tune.block.splsda:

set.seed(123)

####### microbiome data: 60 OTUs, transcriptome data: 123 genes.

test.keepX_all <- list(microbiome = seq(3, 60, 3), transcriptome = seq(3, 123, 3))

tune.diablo.all <- tune.block.splsda(X_all, Y_all, ncomp = ncomp,
                                     test.keepX = test.keepX_all, design = design_all,
                                     validation = 'Mfold', folds = 10, nrepeat = 100,
                                     BPPARAM = BiocParallel::SnowParam(workers = 34),
                                     dist = "max.dist", progressBar = TRUE)

list.keepX_all <- tune.diablo.all$choice.keepX

**6. Obtain the final model**:

diablo.all <- block.splsda(X_all, Y_all, ncomp = ncomp, keepX = list.keepX_all, design = design_all)

**7. Make several plots**: circosPlot(), network(), plotIndiv(), etc.
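####### Concretely, the calls I use look like this (the cutoff values are just the ones I have been trying):

plotIndiv(diablo.all, ind.names = FALSE, legend = TRUE)  # sample plot per block
plotVar(diablo.all, var.names = FALSE, legend = TRUE)    # correlation circle
circosPlot(diablo.all, cutoff = 0.7)                     # cross-block correlations
network(diablo.all, blocks = c(1, 2), cutoff = 0.7)      # relevance network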

**8. Validate the model?** (I’m not sure if this part is correct)

set.seed(123)

perf.diablo.tcga <- perf(diablo.all, validation = 'Mfold', folds = 10, nrepeat = 10, dist = 'max.dist')

perf.diablo.tcga # Lists the different outputs

perf.diablo.tcga$MajorityVote.error.rate

$max.dist
                comp1     comp2     comp3     comp4
A           0.9173913 0.8652174 0.8043478 0.7695652
B           0.4121212 0.4969697 0.5030303 0.5272727
Overall.ER  0.6196429 0.6482143 0.6267857 0.6267857
Overall.BER 0.6647563 0.6810935 0.6536891 0.6484190

auc.diablo.tcga <- auroc(diablo.all, roc.block = "microbiome", roc.comp = c(1, 2, 3, 4), print = FALSE)

####### The AUC in this plot is 0.8959

auc.diablo.tcga <- auroc(diablo.all, roc.block = "transcriptome", roc.comp = c(1, 2, 3, 4), print = FALSE)

####### The AUC in this plot is 0.747

########### **Questions**:

**1.** My way of choosing the design / distance is based on the metrics I get in perf.diablo.tcga$MajorityVote.error.rate and on the AUC obtained for each block (microbiome / transcriptome). Is this the right way to decide which design / distance is best?

**2.** Am I doing the model validation right? Or should I split my data 80/20 and test the model on blind data (a sketch of what I mean is below)? I understand that cross-validation (Mfold) should be enough, but I wanted to ask because I am not sure.
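####### This is a sketch of what I mean by an 80/20 split (the stratified split is my own code; predict() comes from mixOmics):

set.seed(123)
# stratified 80/20 split, keeping the A/B proportions
train_idx <- unlist(lapply(split(seq_along(Y_all), Y_all),
                           function(i) sample(i, floor(0.8 * length(i)))))
X_train <- lapply(X_all, function(b) b[train_idx, , drop = FALSE])
X_test <- lapply(X_all, function(b) b[-train_idx, , drop = FALSE])
Y_train <- Y_all[train_idx]
Y_test <- Y_all[-train_idx]

# refit with the tuned parameters on the training samples only
diablo.train <- block.splsda(X_train, Y_train, ncomp = ncomp,
                             keepX = list.keepX_all, design = design_all)

# predict the held-out samples and check the confusion matrix
pred <- predict(diablo.train, newdata = X_test)
table(predicted = pred$WeightedVote$max.dist[, ncomp], truth = Y_test)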

**3.** I thought the model would return a single ROC curve for the overall model, but instead it returns one for the microbiome block and one for the transcriptome block. Is there any way to obtain a ROC curve for the model as a whole?

**4.** In your experience, what metric values would be good enough to accept a model? Obviously the AUC should be as close to 1 as possible and the ER/BER as close to 0, but I would like to know where the cut-off between a bad model and an acceptable one might lie.

Sorry for the long query; I assure you that I have searched a lot online for answers, but I am a bit stuck on this.

If you see any other errors that I might have missed in my code, or something that I am not understanding, I will be very grateful to read your comments.

Thank you very much in advance and for your great work!

Marta