DIABLO data transformation and tuning

Hello Team,

I have a few questions regarding a DIABLO analysis and would really appreciate some clarification.

  1. I started by normalizing my two datasets: (i) the microbiome data were offset by 1, filtered for low counts, and then transformed using centered log ratios (CLR); and (ii) the metabolomic data were log-transformed. Could you please confirm whether this was done properly, since the number of microbiome features dropped from 3453 (before removing low counts) to 299?
#Microbiome
data.offset <- asvfin1 + 1    # add a pseudocount of 1 so the CLR can be applied
sum(data.offset == 0)         # check: should be 0 (no zeros remain after the offset)

low.count.removal = function(
  data, # OTU count data frame of size n (sample) x p (OTU)
  percent=0.01 # cutoff chosen
){
  keep.otu = which(colSums(data)*100/(sum(colSums(data))) > percent)
  data.filter = data[,keep.otu]
  return(list(data.filter = data.filter, keep.otu = keep.otu))
}

result.filter <- low.count.removal(data.offset, percent = 0.01)
data.filter <- result.filter$data.filter   # 299 ASVs retained out of 3453

data.clr <- logratio.transfo(as.matrix(data.filter), logratio = 'CLR', offset = 0)   # offset already added above

#Metabolome
metablog <- log(metabfin1)   # log-transform the metabolite data
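
For reference, a quick way to confirm the filtering and transformation (assuming asvfin1 and metabfin1 are the raw sample-by-feature tables) is to compare dimensions at each step:

# Quick dimension checks on the filtered and transformed data
dim(asvfin1)       # raw counts: samples x 3453 ASVs
dim(data.filter)   # after low-count filtering: samples x 299 ASVs
dim(data.clr)      # CLR-transformed matrix, same dimensions as data.filter
dim(metablog)      # log-transformed metabolite matrix
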
  2. Next, I tried to determine the total number of components for the DIABLO model, and the number of variables to keep per dataset and per component, using the perf and tune functions in the mixOmics package. I was recommended 2 components, so I assume this is okay since there is a decrease in error rate from component 1 to 2. I was wondering why I was only shown 2 components in my results, whereas the TCGA case study example showed 5 components? Does this have to do with the outcome variable, since my outcome is binary?
# Initial DIABLO model with a generous number of components, to be tuned by cross-validation
sgccda.res <- block.splsda(X = X, Y = Y, ncomp = 5, 
                          design = design)

set.seed(124)
perf.diablo <- perf(sgccda.res, validation = 'Mfold', folds = 10, nrepeat = 10)

plot(perf.diablo)

perf.diablo$choice.ncomp$WeightedVote

            max.dist centroids.dist mahalanobis.dist
Overall.ER         1              2                2
Overall.BER        1              2                1

ncomp <- perf.diablo$choice.ncomp$WeightedVote["Overall.BER", "centroids.dist"]   # 2 components
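
The numbers behind plot(perf.diablo) can also be printed directly; the slot names below are the ones I understand the perf() output to use in the mixOmics case studies:

# Cross-validated error rates underlying plot(perf.diablo)
perf.diablo$error.rate                # overall error rate per block, component and distance
perf.diablo$WeightedVote.error.rate   # error rate of the weighted vote across blocks
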
  3. Here is the code I used to choose the number of variables to keep from each dataset; it suggested 6 variables from each dataset. I am confused as to why these numbers change when I change test.keepX. I know this may have been answered before, but I am having difficulty understanding how to use test.keepX…I arbitrarily selected 1:72, but I don’t think I am doing this correctly.
# Grid of candidate keepX values to test for each block: every value from 1 to 72
test.keepX <- list(asvclr   = 1:72,
                   metablog = 1:72)

tune.TCGA <- tune.block.splsda(X = X, Y = Y, ncomp = ncomp, 
                              test.keepX = test.keepX, design = design,
                              validation = 'Mfold', folds = 10, nrepeat = 2,
                              dist = "centroids.dist", scale = FALSE)

list.keepX <- tune.TCGA$choice.keepX
list.keepX

$metablog
[1] 6 2

$asvclr
[1] 6 3
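
For completeness, my understanding from the mixOmics case studies is that these tuned values then feed into the final DIABLO model along these lines:

# Final DIABLO model using the tuned number of components and variables per block
final.diablo <- block.splsda(X = X, Y = Y, ncomp = ncomp,
                             keepX = list.keepX, design = design)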

  4. Finally, I also noticed that changing nrepeat can influence the results. Is there a standard/acceptable value to use for nrepeat?

Thank you so much for your help!!

G’day @DJT,

Sorry that this hasn’t had a reply yet. Just working my way through the backlog now.

  1. Looking at the code, it seems that this has been done correctly. A steep decline in the number of OTUs is fairly normal; microbiome data are notoriously sparse. Leaving all 3453 OTUs in would not only slow the whole process down, but would likely result in spurious relationships being reported as significant.

  2. Looking at the DIABLO TCGA Case Study, the number of components selected as optimal was actually 2. However, even if 5 had been selected, you must consider the differences between the data. Microbial and genomic/proteomic expression data are extremely different in their characteristics, so a model built on one type can’t really be equated to a model built on the other.

  3. If you change the test.keepX range, only values within that range will be tested. More importantly though, within the tune() functions (including tune.block.splsda()), random sampling is used to form the cross-validation folds at each repeat. Unless you set a seed (via set.seed()) each time you run a tune() function, the sampling of each fold will be different. This means that each model used in the tuning process differs from those in any previous run of that function. See the sketch after this list for how this fits together.

  4. When it comes to early exploration and simple model building, nrepeat = 5 or 10 is fine. However, once you want to finalise your model parameters to yield your final model, the standard is around nrepeat = 100. At the end of the day, the more repeats the better; it mostly depends on how many cores your computer has (and how much time you’re willing to run it for!), but around 100 would be the minimum for a final model.
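
To put points 3 and 4 together, a final tuning call might look something like the sketch below. The keepX grid values here are purely illustrative, so pick a range that suits your data:

# Sketch of a final tuning run: a coarser keepX grid, a fixed seed for reproducible
# fold sampling, and a higher nrepeat
test.keepX <- list(asvclr   = c(5:9, seq(10, 30, 5)),
                   metablog = c(5:9, seq(10, 30, 5)))

set.seed(124)   # fixes the random fold sampling between runs
tune.final <- tune.block.splsda(X = X, Y = Y, ncomp = ncomp,
                                test.keepX = test.keepX, design = design,
                                validation = 'Mfold', folds = 10, nrepeat = 100,
                                dist = "centroids.dist", scale = FALSE)
tune.final$choice.keepX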

Hope this helped.

Cheers,
Max.