DIABLO data transformation and tuning

Hello Team,

I have a few questions regarding DIABLO analysis and would really appreciate some clarification.

  1. I started by normalizing my two datasets: (i) the microbiome data were filtered to remove low counts and then transformed using centered log ratios (CLR); and (ii) the metabolomic data were log-transformed. Could you please confirm whether this was done properly, since the number of microbiome variables dropped from 3453 to 299 after removing low counts?
data.offset <- asvfin1 + 1  # offset of 1 so that no zeros remain before the log-ratio transform
sum(data.offset == 0)  # number of zeros left after the offset; should be 0

low.count.removal = function(
  data, # OTU count data frame of size n (sample) x p (OTU)
  percent=0.01 # cutoff chosen
){
  # keep features whose share of the total counts (in percent) exceeds the cutoff
  keep.otu = which(colSums(data)*100/(sum(colSums(data))) > percent)
  data.filter = data[,keep.otu]
  return(list(data.filter = data.filter, keep.otu = keep.otu))
}

result.filter <- low.count.removal(data.offset, percent=0.01)
data.filter <- result.filter$data.filter

data.clr <- logratio.transfo(as.matrix(data.filter), logratio = 'CLR', offset = 0)

metablog <- log(metabfin1)
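
To make the size of the reduction easier to judge, here is a minimal sketch of the check I have in mind, run on made-up toy counts rather than my data. The object `toy.counts` and the Poisson rates are assumptions for illustration only; the cutoff is the same 0.01% used above. It reports how many features pass the filter and what share of the total counts they retain.

```r
# Toy count table: 10 samples x 50 features (hypothetical data, not asvfin1).
# The first 10 features are abundant, the remaining 40 are very rare.
set.seed(42)
lambda <- rep(c(5000, 0.05), times = c(100, 400))
toy.counts <- matrix(rpois(10 * 50, lambda = lambda), nrow = 10, ncol = 50)

# Same steps as above: offset of 1, then percentage-based filter
toy.offset <- toy.counts + 1
col.percent <- colSums(toy.offset) * 100 / sum(toy.offset)
keep <- which(col.percent > 0.01)

# Share of the total counts that the kept features account for
retained <- sum(toy.offset[, keep]) / sum(toy.offset)

cat("features before:", ncol(toy.offset), "\n")
cat("features after: ", length(keep), "\n")
cat("share of total counts retained:", round(retained, 4), "\n")
```

Running the same three summary lines on `data.offset` and `result.filter$keep.otu` would show whether the 299 remaining features still carry most of the counts.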
  2. Next, I tried to determine the total number of components for the DIABLO model, and the number of variables to keep per dataset and per component, using the tune function for DIABLO. Two components were recommended, which I assume is fine since the error rate decreases from component 1 to 2. Why are only 2 components shown in my results, whereas the results in the TCGA case study example showed 5 components? Is it related to the outcome variable, since my outcome is binary?
sgccda.res <- block.splsda(X = X, Y = Y, ncomp = 5, 
                          design = design)

perf.diablo <- perf(sgccda.res, validation = 'Mfold', folds = 10, nrepeat = 10)

            max.dist centroids.dist mahalanobis.dist
Overall.ER         1              2                2
Overall.BER        1              2                1

ncomp <- perf.diablo$choice.ncomp$WeightedVote["Overall.BER", "centroids.dist"]

Here is the code I used to choose the number of variables to keep from each dataset; it suggested 6 variables from each dataset. I am confused as to why the suggested numbers change when I change test.keepX. I know this may have been answered before, but I am having difficulty understanding how to use test.keepX. I arbitrarily chose 1:72, but I don’t think I am doing this correctly.

test.keepX <- list(asvclr = 1:72,
                   metablog = 1:72)

tune.TCGA <- tune.block.splsda(X = X, Y = Y, ncomp = ncomp, 
                              test.keepX = test.keepX, design = design,
                              validation = 'Mfold', folds = 10, nrepeat = 2,
                              dist = "centroids.dist", scale = FALSE)

list.keepX <- tune.TCGA$choice.keepX

[1] 6 2

[1] 6 3
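
For context on how large my grid is, here is a small sketch in base R that counts the keepX combinations, on the assumption (my understanding, not confirmed) that the tuning evaluates every combination of values across the two blocks for each component. The coarser grid is only an arbitrary illustration of an alternative, not a recommendation; `full.grid`, `coarse.values` and `coarse.grid` are names made up for this example.

```r
# Grid from my code above: every value from 1 to 72 for both blocks
full.grid <- list(asvclr = 1:72, metablog = 1:72)

# A coarser grid (values chosen arbitrarily for illustration)
coarse.values <- c(5:9, seq(10, 18, 2), seq(20, 30, 5))
coarse.grid <- list(asvclr = coarse.values, metablog = coarse.values)

# Number of keepX combinations across the two blocks, per component
n.full <- nrow(expand.grid(full.grid))
n.coarse <- nrow(expand.grid(coarse.grid))

cat("full grid:  ", n.full, "combinations per component\n")
cat("coarse grid:", n.coarse, "combinations per component\n")
```

With only 2 repeats per combination, I wonder whether such a dense grid makes the selected values unstable from run to run.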

Finally, I also noticed that changing nrepeat influences the results. Is there a standard or generally accepted value to use for nrepeat?

Thank you so much for your help!!