Hello Team,

I had a few questions regarding DIABLO analysis and would really appreciate if I could get some clarification.

- I started with normalizing my two datasets where: (i) microbiome data was transformed using centered log ratios and low counts were removed; and metabolomic data were log-transformed). Could you please confirm if this was done properly since the length of the microbiome data reduced to 299 from 3453 (before removing low counts)?

```
#Microbiome
data.offset <- asvfin1+1
sum(which(data.offset == 0))
low.count.removal = function(
data, # OTU count data frame of size n (sample) x p (OTU)
percent=0.01 # cutoff chosen
){
keep.otu = which(colSums(data)*100/(sum(colSums(data))) > percent)
data.filter = data[,keep.otu]
return(list(data.filter = data.filter, keep.otu = keep.otu))
}
result.filter <- low.count.removal(data.offset, percent=0.01)
data.filter <- result.filter$data.filter
data.clr <- logratio.transfo(as.matrix(data.filter), logratio = 'CLR', offset = 0)
#Metabolome
metablog <- log(metabfin1)
```

- Next, I tried to calculate total number of components for the DIABLO model and the number of variables per dataset and per component using with the tune function in the DIABLO package. I am recommended 2 components so I assume this is okay as there is a decrease in error rate from component 1 to 2. I was wondering why I was only shown 2 components in my results whereas the results in the tcga study example provided 5 components? Is it to do with the outcome variable since my outcome is binary?

```
sgccda.res <- block.splsda(X = X, Y = Y, ncomp = 5,
design = design)
set.seed(124)
perf.diablo <- perf(sgccda.res, validation = 'Mfold', folds = 10, nrepeat = 10)
plot(perf.diablo)
```

```
perf.diablo$choice.ncomp$WeightedVote
max.dist centroids.dist mahalanobis.dist
Overall.ER 1 2 2
Overall.BER 1 2 1
ncomp <- perf.diablo$choice.ncomp$WeightedVote["Overall.BER", "centroids.dist"]
```

Here is the code I used to calculate the number of variables to keep from each dataset, and it suggested 6 variables from each dataset. I am confused as to why these variables change in numbers when I change the test.keepX. I know this may be answered before but I am having difficulty with understanding how to use the test.keepX…I randomly selected 1:72 but I don’t think I am doing this correctly.

```
test.keepX <- list (asvclr = c(1:72),
metablog = c(1:72))
tune.TCGA <- tune.block.splsda(X = X, Y = Y, ncomp = ncomp,
test.keepX = test.keepX, design = design,
validation = 'Mfold', folds = 10, nrepeat = 2,
dist = "centroids.dist", scale = FALSE)
list.keepX <- tune.TCGA$choice.keepX
list.keepX
$metablog
[1] 6 2
$asvclr
[1] 6 3
```

Finally, I also noticed that changing the nrepeat may also influence the results. Is there a standard/acceptable value to use for nrepeat?

Thank you so much for your help!!