Generic questions about DIABLO: perf, keepX and no variable selection

I am hoping to use the mixOmics package developed by you and your team. I am trying to learn how to implement mixOmics, specifically the DIABLO method, for trial datasets my lab has produced (~5000 proteins, ~800 metabolites). I do have a few questions which I hope will not be much of a bother to you. Apologies in advance for any banal questions, I am still quite new to bioinformatic analysis.

  1. May I please ask for clarification on what is meant by a ‘component’ and what is meant by a ‘dimension’ on your bookdown page?
  2. I have tried to run perf.diablo multiple times with the same dataset but noticed that the plot and the optimal number of components sometimes change. Am I right in assuming that this may be because of the error rate during each run?
  3. During the keepX step where we are supposed to determine the optimal number of components for the final DIABLO model, how do we decide the length of the keepX grid or what numbers to include?
  4. Is there a way to run DIABLO on the entire dataset without selecting for specific variables? I have tried the block.plsda method but was unable to produce a circos plot and the circle correlation plot appeared similar to the one I obtained from the initial PLS analysis.

Thank you again for your help!

Kind regards

Afternoon! Absolutely no apologies required. Took me quite a while to get my head around everything. I’ll answer all these questions in relation to DIABLO specifically to keep this focused.

  1. In the vocabulary we use within the team, “component” and “dimension” are mostly synonymous. When you provide your data to the DIABLO model, it constructs “components” as part of its dimension reduction and variation extraction procedure. These components are just linear combinations of your input features (also referred to as variables). More specifically, if you have three input features, x1, x2 and x3, component 1 will be equal to:
    a1*x1 + a2*x2 + a3*x3.
    a1, a2 and a3 are called loading values and together form a loading vector. You can construct as many components as there are input features (each with its own loading vector). However, one of the main points of mixOmics is to reduce the number of features, so we’re always aiming for the minimum number of components that still accounts for as much of the variation within the data as possible. When you see “dimension”, you can think of this as the number of output components of the DIABLO model. The resulting dataframe will be of dimensions N x C, where N is the number of input samples and C is the number of components, hence the term “dimension”. The Glossary may be of some use for you.

  2. The perf() function uses repeated cross-validation. Cross-validation involves taking a random section (or “fold”) of the data and setting it aside (referred to as “test” or “validation” data). All the remaining (“training”) data is then used to generate a model. The model is then applied to the test data so it can predict the class labels of these unseen samples. These predictions are compared to the ground truth of these samples, yielding the error rate. As this is repeated and the cross-validation selects samples randomly each time, the models produced and the samples they are tested on will be slightly different. Hence, repeated runs of perf() will generally not give exactly the same output, so the difference you see comes from the random sampling rather than from the error rate itself.
    The one exception is when the set.seed() function is used (you’ll see this in the case studies on our site). By setting a “seed”, the same selections will be made on each run, meaning you can reproduce our results.

  3. This is a very context (and data) specific question. If you have thousands of features (let’s say, 3000), you may want to try multiples of 10 or 20 up to a few hundred:
    seq(10, 400, 10) = [10, 20, 30, ..., 380, 390, 400]
    If you have only a few hundred, or fewer than a hundred, features to start with, a smaller range (and increased “resolution”) is better:
    seq(10, 50, 5) = [10, 15, 20, ..., 45, 50]
    The range of values is ultimately dictated by how many input variables you have.
    Given the number of features you have, keeping the number of components far below the variable count will be extremely important. I would suggest reading my comment on this post here for some further advice on picking the keepX range and resolution.

  4. Yes, you aren’t required to produce a sparse model (i.e. one that selects only some features). However, not doing so means that there is an abundance of features present, making some visualisations (e.g. the circosPlot()) illegible and somewhat useless. Given your data, I would definitely recommend exploring the sparse models at the very least.
    The DIABLO framework has some logical and mathematical similarity to the PLS methodology, sometimes resulting in similar plots.
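
To make point 1 a bit more concrete, here is a minimal base R sketch of a component as a linear combination of features. The data and the loading values are made up purely for illustration; a fitted model estimates the loadings for you, and this is not how mixOmics computes them internally.

```r
# Toy data: 4 samples (rows) x 3 features (columns). Values are made up.
X <- matrix(c(1, 2, 3,
              4, 5, 6,
              7, 8, 9,
              2, 0, 1),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("sample", 1:4), c("x1", "x2", "x3")))

# Hypothetical loading vector (a1, a2, a3); a real model estimates these.
a <- c(x1 = 0.5, x2 = -0.25, x3 = 1)

# Component 1 = a1*x1 + a2*x2 + a3*x3, computed for every sample at once.
comp1 <- as.vector(X %*% a)
names(comp1) <- rownames(X)

# A second hypothetical loading vector gives a second component.
b <- c(x1 = 0, x2 = 1, x3 = -1)

# Stacking the loading vectors as columns yields an N x C matrix of components
# (here N = 4 samples, C = 2 components).
components <- X %*% cbind(comp1 = a, comp2 = b)
dim(components)
```

Each row of `components` is one sample described by 2 values instead of 3 features, which is the “dimension reduction” referred to above.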
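
For point 2, the effect of set.seed() on random fold assignment can be sketched in base R. The helper `assign_folds()` and the sample and fold counts are hypothetical, and this only illustrates the idea rather than the code perf() runs internally.

```r
n_samples <- 20   # hypothetical sample count
n_folds   <- 5    # hypothetical number of folds

# Randomly assign each sample to one of k folds of (near) equal size.
assign_folds <- function(n, k) {
  sample(rep(seq_len(k), length.out = n))
}

# Without a seed, two calls will usually give different assignments.
folds_a <- assign_folds(n_samples, n_folds)
folds_b <- assign_folds(n_samples, n_folds)

# With the same seed set before each call, the assignments are identical.
set.seed(123)
folds_c <- assign_folds(n_samples, n_folds)
set.seed(123)
folds_d <- assign_folds(n_samples, n_folds)

identical(folds_c, folds_d)  # TRUE
```

In practice this means calling set.seed() with a number of your choice immediately before perf() if you want the same folds, and therefore the same error rates, on every run.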
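
For point 3, a keepX grid for two blocks could be built roughly as below. The block names and feature counts are assumptions based on the numbers in the question, and the grid values are only a starting point. As far as I know, a named list like this is what the test.keepX argument of tune.block.splsda() expects, with names matching the blocks in X.

```r
# Hypothetical block names and sizes, loosely matching the question
# (~5000 proteins, ~800 metabolites).
n_features <- c(proteins = 5000, metabolites = 800)

# Coarse grid for the large block, finer grid for the smaller one.
test_keepX <- list(
  proteins    = seq(10, 400, 10),  # 10, 20, ..., 400
  metabolites = seq(10, 50, 5)     # 10, 15, ..., 50
)

# Sanity check: no grid value may exceed the number of features in its block.
stopifnot(all(mapply(function(grid, p) max(grid) <= p,
                     test_keepX, n_features[names(test_keepX)])))

# Number of candidate values per block.
lengths(test_keepX)
```

Keep in mind that longer grids mean more models to fit during tuning, so it is usually worth starting coarse and refining around the values that perform well.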

I hope these answers cleared up your queries and feel free to post here with any more questions.

Cheers,

Max.

Max, good morning. Shouldn’t it be possible to display a circosPlot of a block.plsda by setting a high cutoff? That way maybe it is viewable. I tried it and the R package doesn’t allow me to do it. It is strange because in the book it is clear that it does. Greetings and thanks!

circosPlot does not currently support the usage of block.plsda objects. Just use block.splsda but don’t supply keepX at all, meaning all features will be used - essentially mimicking block.plsda.

library(mixOmics)

data(breast.TCGA)
X = list(miRNA = breast.TCGA$data.train$mirna,
         mRNA = breast.TCGA$data.train$mrna,
         proteomics = breast.TCGA$data.train$protein)
Y = breast.TCGA$data.train$subtype

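# No keepX supplied, so every feature in each block is retained
obj <- block.splsda(X, Y)

# Only links with a correlation above the cutoff are shown
circosPlot(obj, cutoff=0.7)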

Created on 2022-06-15 by the reprex package (v2.0.1)
