Generic questions about DIABLO: perf, keepX and no variable selection

Afternoon! Absolutely no apologies required. It took me quite a while to get my head around everything too. I’ll answer all these questions in relation to DIABLO specifically, to keep this focused.

  1. In the vocabulary we use within the team, “component” and “dimension” are mostly synonymous. When you provide your data to the DIABLO model, it constructs “components” as part of its dimension reduction and variation extraction procedure. These components are just linear combinations of your input features (also referred to as variables). More specifically, if you have three input features, x1, x2 and x3, component 1 will be equal to:
    a1*x1 + a2*x2 + a3*x3.
    a1, a2 and a3 are called loading values and together form a loading vector. You can construct as many components as there are input features (and each will have a different associated loading vector). However, one of the main points of mixOmics is to reduce the number of features, so we’re always aiming for the minimum number of components that still accounts for as much variation within the data as possible. When you see “dimension”, you can think of it as the number of output components of the DIABLO model. The resulting dataframe will be of dimensions N x C, where N is the number of input samples and C is the number of components – hence the terminology “dimension”. The Glossary may be of some use to you, and the first sketch at the end of this post shows where the loading vectors and components live in a fitted model.

  2. So, the perf() function uses repeated cross-validation. Cross-validation involves taking a random section (or “fold”) of the data and setting it aside (referred to as “test” or “validation” data). All the remaining (“training”) data is then used to generate a model. The model is then applied to the test data to predict the class labels of these novel samples. These predictions are compared to the ground truth of these samples, yielding the error rate. As this is repeated and the cross-validation selects samples randomly each time, the models produced and the samples they are tested on will be slightly different. Hence, you will never get exactly the same output from the perf() function.
    This comes with one exception: when the set.seed() function is used (you’ll see this in the case studies on our site). By setting a “seed”, the same random selections will be made on each run, meaning you can reproduce our results; the second sketch at the end of this post shows the pattern.

  3. This is a very context- (and data-) specific question. If you have thousands of features (let’s say 3000), you may want to try multiples of 10 or 20 up to a few hundred:
    seq(10, 400, 10) = [10, 20, 30, ..., 380, 390, 400].
    If you have only a few hundred, or fewer than a hundred, features to start with, a smaller range (and increased “resolution”) is better:
    seq(10, 50, 5) = [10, 15, 20, ..., 45, 50].
    The range of values is ultimately dictated by how many input variables you have.
    Given the number of features you have, selecting a keepX that is significantly smaller than the total variable count will be extremely important. I would advise you to read my comment on this post here for some further advice on picking the keepX range and resolution; the third sketch at the end of this post shows how such grids feed into the tuning step.

  4. Yes, you aren’t required to produce a sparse model (i.e. one selecting only some features). However, not doing so means that there is an abundance of features present, making some visualisations (e.g. the circosPlot()) illegible and somewhat useless. Given your data, I would definitely recommend exploring the sparse models at the very least; the last sketch at the end of this post contrasts the two.
    The DIABLO framework has some logical and mathematical similarity to the PLS methodology, sometimes resulting in similar plots.
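
Below are four small R sketches matching the numbered answers above. They use toy random data and made-up block names (mRNA, protein); none of this is your data, just an illustration of the mixOmics calls involved.

For question 1, a minimal sketch of fitting a DIABLO model and pulling out the loading vectors and components:

```r
library(mixOmics)

# Toy data: two blocks measured on the same 20 samples
# (random values, illustrative only)
set.seed(42)
X <- list(mRNA    = matrix(rnorm(20 * 3000), nrow = 20,
                           dimnames = list(NULL, paste0("gene_", 1:3000))),
          protein = matrix(rnorm(20 * 80), nrow = 20,
                           dimnames = list(NULL, paste0("prot_", 1:80))))
Y <- factor(rep(c("A", "B"), each = 10))  # class labels

# Design matrix linking the blocks (the 0.1 off-diagonal weighting
# follows the case studies on our site)
design <- matrix(0.1, nrow = length(X), ncol = length(X),
                 dimnames = list(names(X), names(X)))
diag(design) <- 0

# Fit a DIABLO model with 2 components
model <- block.splsda(X, Y, ncomp = 2, design = design)

# Loading vectors: the a1, a2, a3, ... weights, per block and component
head(model$loadings$mRNA[, 1])

# The components themselves: the N x C matrix described above (here 20 x 2)
dim(model$variates$mRNA)
```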
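
For question 2, the reproducibility pattern with set.seed() ahead of perf(); the seed and CV settings here are arbitrary examples:

```r
# Fixing the seed fixes the random fold assignment, so repeated runs
# of this block return identical error rates
set.seed(123)  # any fixed integer works
perf.diablo <- perf(model,                 # the fitted model from above
                    validation = "Mfold",  # M-fold cross-validation
                    folds = 5,             # 5 random folds per repeat
                    nrepeat = 10)          # repeat the whole CV 10 times

perf.diablo$error.rate  # error rates per block and component
# Without the set.seed() call, the folds (and hence the error rates)
# would differ slightly on every run.
```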
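
For question 3, how those seq() grids are typically handed to tune.block.splsda(); the grid endpoints are only examples and must not exceed the number of features in each block:

```r
# One candidate grid per block: coarse for the ~3000-feature block,
# finer for the ~80-feature block
test.keepX <- list(mRNA    = seq(10, 400, 10),  # 10, 20, ..., 400
                   protein = seq(10, 50, 5))    # 10, 15, ..., 50

set.seed(123)  # again, for reproducible fold selection
tune.diablo <- tune.block.splsda(X, Y,
                                 ncomp = 2,
                                 test.keepX = test.keepX,
                                 design = design,
                                 validation = "Mfold",
                                 folds = 5,
                                 nrepeat = 2)  # a grid this size can take a while

tune.diablo$choice.keepX  # tuned number of features per block and component
```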
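
For question 4, the contrast between a full (non-sparse) and a sparse model; the keepX values and correlation cutoff are arbitrary:

```r
# Full model: no variable selection, every feature loads on every component
full.model <- block.plsda(X, Y, ncomp = 2, design = design)

# Sparse model: keep only a handful of features per block and component
sparse.model <- block.splsda(X, Y, ncomp = 2, design = design,
                             keepX = list(mRNA    = c(10, 10),
                                          protein = c(5, 5)))

# circosPlot() stays legible on the sparse model; with all ~3000 features
# retained in the full model it becomes unreadable
circosPlot(sparse.model, cutoff = 0.7)  # show only correlations above 0.7
```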

I hope these answers cleared up your queries. Feel free to post here with any more questions.

Cheers,

Max.