I am currently working on a project with 138 samples (2 classes, slightly unbalanced) and ~300k features. I have been following the sPLS-DA vignette and this case study to build my model, but I came across something that I think may be an issue. When running perf() like so:
```r
plsdaPerf <- perf(plsdaRes,
                  validation = "Mfold",
                  folds = 3,
                  progressBar = FALSE,
                  nrepeat = 50)
```
6 is selected as the optimal number of components according to plsdaPerf$choice.ncomp, and the plot of classification error rate against the number of components looks normal (i.e., similar to the case study: BER and the other distance metrics start decreasing significantly after the third component). However, when I use tune.splsda as follows:
```r
tune.splsda.srbct <- tune.splsda(x, y,
                                 ncomp = 8,
                                 validation = "Mfold",
                                 folds = 3,
                                 dist = "max.dist",
                                 progressBar = FALSE,
                                 measure = "BER",
                                 test.keepX = keepXlist,
                                 nrepeat = 50)
```
the optimal number of components is instead 1, according to tune.splsda.srbct$choice.ncomp$ncomp. I understand that to build the final model it would be more appropriate to choose 1 component; however, I wonder if I should be concerned that the optimal number of components is so different between the PLS-DA tuning and the sPLS-DA tuning.
The choice.ncomp parameter of both the perf() and tune.splsda() functions utilises a t-test to determine whether the addition of another component reduces the BER to a statistically significant degree. However, these functions are not using the same models at each step. You provided a standard PLS-DA model to perf(), meaning all features were used for each component. In comparison, within tune.splsda() the number of features used for each component is restricted (to the values in test.keepX) when compared to the perf() run.
It’s likely that in this case a difference in component composition between the models created by perf() and tune() (due to different sets of features being considered) is causing the mismatch between outputs. For instance, the decrease in BER between 1 and 2 components seen via perf() reflects a model using all features and is hence statistically significant. By using only a subset of these features via the tune() function, the statistical significance of the BER change between 1 and 2 components is reduced. Hence, tune() suggests keeping just 1 component, whereas perf(), with access to all features, suggests using 6 components.
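To make the contrast concrete, here is a minimal, self-contained sketch of the two tuning routes on synthetic data (the data, grid, and nrepeat values are placeholders; with real data you would substitute your own x, y, and keepX grid):

```r
# Sketch only: synthetic data stand in for the real ~300k-feature matrix.
library(mixOmics)
set.seed(42)
x <- matrix(rnorm(60 * 500), nrow = 60)        # 60 samples, 500 features
y <- factor(rep(c("ctrl", "case"), each = 30))

# Route 1: perf() on a full PLS-DA model -- every component uses ALL features.
plsdaRes  <- plsda(x, y, ncomp = 4)
plsdaPerf <- perf(plsdaRes, validation = "Mfold", folds = 3, nrepeat = 10)
plsdaPerf$choice.ncomp            # suggested ncomp per distance/measure

# Route 2: tune.splsda() -- each component is restricted to the best keepX
# from the grid, so the models whose BERs are t-tested are much sparser.
keepXlist <- c(5, 10, 20)
tuneRes <- tune.splsda(x, y, ncomp = 4, validation = "Mfold", folds = 3,
                       dist = "max.dist", measure = "BER",
                       test.keepX = keepXlist, nrepeat = 10)
tuneRes$choice.ncomp$ncomp        # may well differ from Route 1
```

Because the two routes t-test the BER of different underlying models, their choice.ncomp values need not agree.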
As an added note, one may take from this that perf() should be used in preference, as it takes in “the most information” to make its decision. This is not the case. There are many downsides to using all (in your case, ~300,000) features, such as bloated runtime, potential overfitting, and a lack of interpretability. The general pipeline I always suggest is:
- Use the perf() function on a standard (non-sparse) model with an arbitrarily high number of components
- The output of this will indicate the maximum ncomp you need to worry about
- Determine roughly how much you want to simplify your model (i.e. how many features to consider per component)
- Run the tune() function iteratively, using the maximum ncomp indicated by perf(). From this, determine the exact number of features and components to yield the optimal model. I’d suggest following my advice on this post for determining the test.keepX values
- Use your final, optimised model
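The steps above can be sketched as follows (a hedged outline, not a definitive implementation; x, y, the grid, and the slot names follow current mixOmics conventions):

```r
# Hedged sketch of the suggested pipeline; x and y are placeholders for
# the real feature matrix and class factor.
library(mixOmics)

# Step 1: perf() on a deliberately generous number of components.
base <- splsda(x, y, ncomp = 10)          # no keepX => all features used
p    <- perf(base, validation = "Mfold", folds = 3, nrepeat = 50)

# Step 2: upper bound on components (BER, maximum-distance prediction).
ncomp.max <- p$choice.ncomp["BER", "max.dist"]

# Step 3: a keepX grid reflecting how sparse you want the model to be.
grid <- c(5:10, seq(20, 100, 10))

# Step 4: tune features per component up to that ncomp.
t <- tune.splsda(x, y, ncomp = ncomp.max, test.keepX = grid,
                 validation = "Mfold", folds = 3,
                 dist = "max.dist", measure = "BER", nrepeat = 50)

# Step 5: the final, optimised model.
final <- splsda(x, y, ncomp = t$choice.ncomp$ncomp,
                keepX = t$choice.keepX[1:t$choice.ncomp$ncomp])
```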
Hope this was of some help.
Thank you so much for your detailed response, that was very helpful. I see my mistake now in providing the standard plsda() model, rather than a sparse one, to perf(). I built a model arbitrarily choosing 10 components, using the splsda() function instead; after running perf() on that model, the optimal number of components (using max.dist and BER) is 4. When I run tune() with ncomp = 4, it suggests I use 3 components, which I think makes sense.
I read through your post to better understand how to determine the test.keepX values, but I am still a bit confused. After running tune() with the test.keepX grid c(5:10, seq(10, 100, 10)), 3 components were selected as optimal according to tune.splsda$choice.ncomp$ncomp. From tune.splsda$choice.keepX, I can see that the optimal numbers of variables per component are 50, 100, and 80, respectively. I built a tentatively final model with these parameters and tested it on my test set, as suggested in the post you sent. The model performed very well (89% sensitivity, which is a metric I am really interested in given what I am trying to achieve with my data). When I plot the samples that I trained the model on (see below), I also see very good discrimination.
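For reference, a sketch of how such a final model and its test-set sensitivity might be computed (object names and the "positive" class label are placeholders, not from the thread):

```r
# Hedged sketch: fit the tuned model and score sensitivity on a held-out set.
library(mixOmics)
final <- splsda(x_train, y_train, ncomp = 3, keepX = c(50, 100, 80))

pred <- predict(final, x_test, dist = "max.dist")
yhat <- pred$class$max.dist[, 3]     # class calls using all 3 components

# Sensitivity for the class of interest ("positive" is a placeholder label).
sens <- sum(yhat == "positive" & y_test == "positive") /
        sum(y_test == "positive")
```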
However, I think that 250 features may be excessive, particularly since my features are likely highly correlated (the data are from a DNA methylation array), but I am not sure how to determine which features to keep. My question is whether I should run tune() again with a different test.keepX grid, or whether it makes sense to arbitrarily reduce the features. I have also checked the stability of the features chosen in different CV folds by running perf() on my final splsda model with 3 components, and component 1 in particular seems to have the most stable variables across folds. In addition, I have checked the features selected as predictive by the final model (across all 3 components), and 48/50 of the features from component 1 are biologically significant to one of my classes, whereas the biological significance of those from components 2 and 3 is unknown (to me).
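The stability check described above might look like this (a sketch assuming `final` is the 3-component splsda model; the `features$stable` slot is as in current mixOmics releases):

```r
# Hedged sketch: how often each feature is selected across CV folds/repeats.
library(mixOmics)
pf <- perf(final, validation = "Mfold", folds = 3, nrepeat = 50)

# Selection frequency of each feature on component 1; values near 1 mean
# the feature was selected in almost every fold.
stab1 <- pf$features$stable[[1]]
head(sort(stab1, decreasing = TRUE), 20)
```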
Thank you so much,
Great to hear my first response was of some help. Also good to hear that selecting your keepX values is a little less of a mystery now.
> whether I should run tune() again with a different test.keepX grid, or whether it makes sense to arbitrarily reduce the features
Your analysis of the stability of each feature would have been my first piece of advice (had you not already done it). Accompanying this, I’d suggest looking at the plotLoadings() function, but I believe this is what you did when you said “I have checked the features being selected as predictive by the final model”. If it wasn’t plotLoadings() you were referring to, then definitely have a look at it. It will provide key insights into the importance of each feature to each component and how they relate to the categorical response.
Having completed those, here is my recommendation. Build the following models:
- Run a model on the most stable features (eg. those above 80% stability in the perf() output)
- Run a model on the features that contribute most to each component (eg. the top 10 features from plotLoadings() on each component)
- Run tune() again but with a restricted keepX grid. Take the output and generate a model with this optimal keepX
Before evaluating their sensitivity, I’d examine the features of each model. Have a look at what features the third model selected and whether they are similar to either of the other two listed above. This will inform you of the efficacy of tune() in selecting “powerful” features within your data. Also, if tune() selects keepX values at the maximum of your inputted grid, then it’s safe to assume that reducing the feature count significantly will hinder the model.
After this exploration, run these models on the same test set as before. Evaluate the change in sensitivity (assuming a decrease). For each, ask yourself whether the decrease in classification ability is worth the reduction in model complexity and go from there.
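The comparison across candidate models could be sketched as follows (the model objects, test data, and "positive" label are placeholders for those discussed above):

```r
# Hedged sketch: evaluate each candidate splsda model on the same test set.
library(mixOmics)
models <- list(stable = m_stable, top10 = m_top10, retuned = m_retuned)

sens <- sapply(models, function(m) {
  nc   <- m$ncomp                                      # components in this model
  yhat <- predict(m, x_test, dist = "max.dist")$class$max.dist[, nc]
  sum(yhat == "positive" & y_test == "positive") / sum(y_test == "positive")
})
sens   # weigh any sensitivity drop against the reduction in complexity
```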
Once again, hope this has helped a bit.
Once again, thank you for your detailed response; being new to mixOmics, I find this incredibly helpful! Yes, plotLoadings() is what I had previously looked at. It seems interesting that the loadings on component 1 are mostly associated with one class (if I am interpreting this correctly), whereas those on components 2 and 3 are associated with both classes.
I ran the three models that you suggested:
- For the first model (on the most stable features), I used one component and 18 features (those with Freq > 0.8 in the perf() output; no features on either component 2 or 3 are > 0.8, which is also what made me think of reducing the features earlier on. On component 2, the most stable features are at around 0.5, and on component 3, around 0.3.)
- The second model I ran on the top 10 features for each component (so, 3 components and 30 features in total)
- When I ran tune() again with a keepX grid of seq(30, 70, 5), it suggested using one component and 70 features, and I built the third model using those parameters

The third model did select all 18 features from the first model among its 70 features, as well as the 10 features from the first component of the second model (which is redundant anyway); however, there is no overlap with any features from the second or third components of the second model. Based on your previous response, my understanding is that since the keepX values are being selected by tune() at the maximum of the inputted grid, I shouldn’t necessarily reduce the feature count too much.
All three models had slightly worse sensitivity than the original model I built, but only by one or two misclassifications, and I am working with a small sample size for my testing set, so the difference is negligible.
Based on this, would it make sense to try another iteration with a different keepX grid where the intervals between values are reduced, as you suggested in the post about deciding on keepX values that you linked earlier? Intuitively, I understand based on the loadings from the original model and looking at these three models that I should have more than one component (or at least that it wouldn’t hurt), but I still think that the features can probably be reduced based on looking at the stability.
Hello again my friend,
More than happy to help! Based on those loading plots, you have a pretty nice scenario on your hands. The fact that all the selected features on the first component are highly discriminative means you can likely develop a very simple model without sacrificing classification performance. The lower stabilities on the second and third components support this. All this suggests that one component is likely sufficient. However, we shouldn’t jump the gun: if we look at the sample plot above, the decision boundary forms as a combination of the first two components (it is diagonal).
If these simpler models reduce your model’s performance only “negligibly”, then I’d definitely recommend using one of them going forward.
To succinctly answer your question: yes. I’d recommend going through multiple iterations, increasing the resolution but decreasing the range at each step. This will keep runtime manageable and yield an optimum for you to use.
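As a concrete illustration of “increasing the resolution while decreasing the range” (the numbers here are only an example, not your data):

```r
# Each pass narrows the range around the previous pass's optimum and
# shrinks the step size.
grid_pass1 <- seq(10, 100, by = 10)   # coarse scan; suppose the optimum is 60
grid_pass2 <- seq(50, 70, by = 5)     # narrower range, finer step
grid_pass3 <- seq(56, 64, by = 2)     # final fine-grained pass
```

Each grid is cheap to evaluate because it contains only a handful of candidate keepX values.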
Hi again Max,
Thank you! It’s nice to confirm that the first component seems to be the most discriminative. I tried to build two more models by increasing the resolution and decreasing the range of the grid.
With a grid of seq(50, 70, 3), tune() once again selected 3 components, with 65, 65, and 68 features, respectively. This model, with 198 features compared to the 250 in the first model, had the exact same performance: 0 misclassifications on the training data, as would be expected, and the exact same performance on the testing data as the first model. Running tune() again with a grid of seq(45, 65, 3) selected 1 component with 63 features. This model performed slightly worse, with 7 misclassifications on the training data, but only 2 more misclassifications on the testing data.
At this point, I ran perf() on the three models, and I am a bit confused by the performance plot: it seems that for the models with 3 components, the second component slightly worsens performance, while the third one improves it? I’m not sure that I understand why this is the case.
In addition, despite the third model having slightly worse performance, since the features on the first component still seem the most powerful for discriminating the classes, I converted the prediction distances from the third model (model 6 in the figure above), obtained from predict(model, x_test, dist = "max.dist")$predict, into class probabilities by scaling the vector of prediction distances for class 1 using a sigmoid function and then computing the vector of probabilities for class 2 as 1 - class 1. I found that the class probabilities of the misclassified training samples are less confident (i.e., the mean probability of belonging to the wrong class was 0.51) compared to the correctly classified instances (mean probability of belonging to the correct class was 0.81). I plan to use class probabilities to apply my model to external data, and I was also wondering whether this conversion from prediction distances to class probabilities makes sense. I read the supplementary material of the mixOmics manuscript to understand the prediction distances, but could not find information about using class probabilities.
With this in mind, my question is whether I should be inclined to keep 3 components despite having a more complex model, given the perf() performance plots and the performance when predicting on testing data, or whether a simpler model would be more efficient given what I found regarding the class probabilities.
Thank you so much again for your help,
Sorry - I didn’t actually do a sigmoid transformation. What I did was scale the vector of prediction distances for class 1 using unity-based normalization, (x - min(x)) / (max(x) - min(x)), and then created the vector of probabilities for class 2 as 1 - class 1!
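In base R, the unity-based (min-max) normalisation described above can be sketched in a few lines (the distance values here are made up for illustration):

```r
# Min-max (unity-based) normalisation of class-1 prediction distances into
# pseudo-probabilities; the class-2 values are simply the complement.
minmax <- function(x) (x - min(x)) / (max(x) - min(x))

d_class1 <- c(-1.4, -0.2, 0.9, 2.1)   # illustrative prediction distances
p_class1 <- minmax(d_class1)          # rescaled into [0, 1]
p_class2 <- 1 - p_class1
```

One caveat worth noting: min-max rescaling makes the values comparable within a single test set, but they are not calibrated probabilities, since the mapping depends on the extremes of that particular set of distances.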