Best criteria for sPLS-DA feature selection: VIP, weight coefficients, or stability?

Hello everybody,
I am currently analyzing a phytochemical data set using sPLS-DA.
My goal was to see if trees of treatment group A differ in their chemical profile from trees of treatment group B and, if so, which chemical features may drive these differences. I have thousands of features. For my work, however, I would prefer to focus on only the top 10-30 features.

However, when it comes to feature selection using sPLS-DA I am a bit lost (I worked through the mixOmics SRBCT tutorial).
I tried out several approaches:

  1. Extract the VIPs from my initial PLS-DA model (command: "vip(plsda_model)") and only select features with the highest VIP (e.g. > 1). There is some literature on what VIP value is considered influential.

  2. Use the tuned sparse PLS-DA approach to identify the features with the highest weight coefficients (command: "selectVar(final_splsda_model, comp = opt_nr_comp)$value").
    Question 1: Is there some literature on how to interpret these weight coefficients? The higher, the more influential, I assume, but is there a cut-off, e.g. everything beyond +/- 0.1 is considered influential? I only want to focus on the top 10 most influential features.

  3. Determine the stability of the features selected in 2) to narrow down the field, i.e. keep only the features with the highest weight coefficients (see Question 1) that are also the most stable. Here too I have a question:
    Question 2: Is there some literature on how to judge whether a feature is stable or not? For example, is a stability of 0.75 considered stable (the feature was selected in 3 out of 4 runs for a component)?

From my limited understanding I think I would go with either 2 or 3. Yet I would somehow need to be able to justify my approach (ideally by referring to some previous work).

Thanks a lot for your advice

Michael

Question 1: There is most certainly literature about how to interpret these values, though I am not aware of any relating to your specific context. What will not be included in that literature is some global cut-off for an "influential" loading value. Removing even a few features from your dataset can change the model's loading values drastically, hence the threshold of importance is entirely case dependent. There is no global threshold (e.g. +/- 0.1), as the range of loading values will be different for every use of the algorithm.

Additionally, the loading values are not comparable across different packages (e.g. caret) and methods. For example, in mixOmics we apply regularisation to the loading values as part of the sparse methodology; most other packages do not, and use other transformations and processing instead. So once again, it is impossible to provide some global cut-off, as different implementations of sPLS-DA put their loadings on different scales.

If you want to focus on the top 10 most influential features, then extract the 10 features with the highest loading/stability/VIP (or whatever metric you decide to use).

On that note, avoid using VIP. Loadings are likely to be more appropriate for assessing feature importance, particularly for sparse methods; testing showed that VIP is prone to overestimation.
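If it helps, a minimal sketch of pulling out those top 10 features by loading in mixOmics might look like the following; the object name "model" is a placeholder for your fitted (s)PLS-DA object, and selectVar() returns the selected features ranked by absolute loading:

library(mixOmics)

# "model" is a placeholder for your fitted (s)PLS-DA object
loads <- selectVar(model, comp = 1)$value   # data frame, one row per selected feature
head(loads, 10)                             # the 10 features with the largest absolute loadings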


Question 2: Again, this is totally case dependent and there is no magic threshold for stability. You need to run your method many times to see the distribution of stabilities across your features and decide your threshold from there. This may be decided by selecting the top n features (e.g. the top 10 most stable), or by spotting a large drop-off in stability (the "elbow method"). Previous literature may inform this decision too, but I can't search for that literature for you: I don't know the context of your study nor the nature of your data.

Additionally, you say you ran stability measures across 4 models. These results are likely to be quite unreliable; you should be running stability estimates over at least 50-100 models.
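As a rough sketch of how that could look with mixOmics' perf() function, which reports how often each feature was selected across the resampled models; the model name and the fold/repeat settings below are placeholders:

library(mixOmics)

# "splsda.final" is a placeholder for your tuned sPLS-DA model
# 10 folds x 10 repeats = 100 resampled models
stab <- perf(splsda.final, validation = "Mfold", folds = 10, nrepeat = 10)

# selection frequency of each feature on component 1 (values between 0 and 1)
comp1.stab <- stab$features$stable[[1]]
head(sort(comp1.stab, decreasing = TRUE), 20)    # the 20 most stable features

# plot the sorted frequencies and look for an "elbow" to set your threshold
plot(sort(as.numeric(comp1.stab), decreasing = TRUE), type = "h",
     xlab = "feature rank", ylab = "selection frequency")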


Sorry that my answer is probably not what you wanted to hear. Unfortunately, there are no global truths with this sort of analysis. Given the growing popularity of methods like sPLS-DA, there will be work out there which can guide your decision making; it's just a matter of finding it. A methodology which I have found quite good for assessing feature importance is as follows:

  • Randomly split data into training and testing sets
  • Generate a model using training samples
  • Extract loadings of each feature from this model and store them
  • Assess model performance on testing samples and extract an error measure
  • Repeat above steps ~100 times
  • Average the loading values for each feature across all models, potentially weighting each model by its predictive performance (lower error rate → greater weight)

This will result in a vector of values, one for each feature. It combines measures of both stability and loadings, and can be scaled by each model's performance if desired. From there, you can assess feature importance by integrating multiple metrics.
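For concreteness, here is a minimal, hedged sketch of that procedure using mixOmics; X (feature matrix), Y (class factor) and the ncomp/keepX settings are placeholders to be replaced with your own tuned values.

library(mixOmics)

n_rep    <- 100
loadings <- matrix(0, nrow = ncol(X), ncol = n_rep,
                   dimnames = list(colnames(X), NULL))
error    <- numeric(n_rep)

for (i in seq_len(n_rep)) {
  # 1. random split into training (75%) and testing (25%) samples
  train <- sample(nrow(X), size = round(0.75 * nrow(X)))

  # 2. fit the model on the training samples only
  fit <- splsda(X[train, ], Y[train], ncomp = 2, keepX = c(10, 10))

  # 3. store the loadings of every feature on component 1
  loadings[, i] <- fit$loadings$X[, 1]

  # 4. classification error on the held-out samples (max.dist, 2 components)
  pred     <- predict(fit, X[-train, ])$class$max.dist[, 2]
  error[i] <- mean(pred != Y[-train])
}

# 5. average the absolute loadings, weighting better-performing models more
weights    <- (1 - error) / sum(1 - error)
importance <- as.vector(abs(loadings) %*% weights)
names(importance) <- colnames(X)
head(sort(importance, decreasing = TRUE), 10)   # the 10 most influential features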


Dear Max,
Thank you very much for your reply. It was very useful. However, it also raised some follow-up questions:

First of all, I was not very clear when it came to specifying my model structure. Sorry for that.
For assessing my model performance I split my data set into 10 groups and the model performance was assessed 100 times:
perf(plsda.result, validation = "Mfold", folds = 10, nrepeat = 100)

One question I have concerns your suggestion to randomly split the data into training and test sets:
The way I understood it, the chosen Mfold cross-validation approach already takes care of this (i.e. 10 groups that are assessed against each other 100 times). Is that assumption true?
In that case I would not have to specifically generate models using only the training sets.

Another question concerns the suggested loading-based feature selection:
For the feature selection I followed the example on the mixOmics page:
http://mixomics.org/case-studies/splsda-srbct-case-study/
Under the header “selecting the number of variables”
It says that I have to set up a grid of possible keepX values that will be tested for each component.
The example then suggests using the following grid:
"list.keepX <- c(1:10, seq(20, 300, 10))". I am dealing with roughly 5000 features. I have no idea if the suggested grid makes sense for my data. Do you know how to assess this?

Based on your suggestion I decided to use the loading values to assess the importance of my features.
Following the example on the mixOmics page I determined that n = 2 is the optimal number of components for my model. To identify the most important features I used the command:
features <- selectVar(splsda.final_ash, comp = 2)$value
What is not clear to me is: do I only look at the optimal component (here comp = 2), or is it recommended to also explore comp = 1?

Finally: when creating a sample plot with confidence ellipses I saw that my X-variate 2 explains more variance than my X-variate 1 (see attached image). Any suggestions why that could be? Is that normal?

Thanks a lot for your advice

Mike

Is that assumption true?

The reason I mentioned splitting the data was specifically for the methodology I suggested. If you are just going to use the tune() function, then you don't need to do this splitting manually.

I am dealing with roughly 5000 features. I have no idea if the suggested grid makes sense for my data. Do you know how to assess this?

No, that grid is specific to that context. You need to determine an appropriate grid given your data and your goals with the method. Ensure that when you tune these models you use an adequately high number of repeats (i.e. nrepeat = 100). As this may take a lot of time, I'd also suggest doing the tuning over multiple steps, increasing the resolution and decreasing the range of the grid at each iteration. For example, start with test.keepX = seq(10, 150, 10) and, based on the output (let's say it selects 50), undergo tuning again with test.keepX = seq(30, 70, 5); rinse and repeat. In your case, with 5000 features, the grids will look quite different, but the process is the same: increase the resolution and decrease the range of the grid in iterative tuning steps.
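A minimal sketch of that iterative tuning with tune.splsda(), where X and Y are placeholders for your data and class labels and the grids simply mirror the example values above:

library(mixOmics)

# first pass: coarse, wide grid
tune1 <- tune.splsda(X, Y, ncomp = 2, test.keepX = seq(10, 150, 10),
                     validation = "Mfold", folds = 10, nrepeat = 100)
tune1$choice.keepX          # suppose this suggests roughly 50 features

# second pass: narrower, finer grid centred on the first-pass choice
tune2 <- tune.splsda(X, Y, ncomp = 2, test.keepX = seq(30, 70, 5),
                     validation = "Mfold", folds = 10, nrepeat = 100)
tune2$choice.keepX          # use these values as keepX in the final splsda() call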

Do I only look at the optimal component (here comp = 2), or is it recommended to also explore comp = 1?

You'll definitely want to explore both components 1 and 2.

Any suggestions why that could be? Is that normal?

This is totally normal. A brief explanation: your components are generated to discriminate your classes. The first component does this quite well by itself, but only needs to capture a small portion of your data's variance to do so. The second component is then generated with more of a focus on captured variance, since the first already separates the classes so well.


Hi Max,
Thank you very much for your time and your explanations.
I am very sorry I have to reach out to you once more.

My main goal is to see if the black and green samples differ in their metabolomic profile. Apparently they do (see the figure uploaded in my previous post). My second goal is to identify, among my roughly 2500 chemical features (initially 5000, but many were filtered out), the 20 most influential features (top 10 for component 1, top 10 for component 2) that drive the separation between the green and black samples.

I am not necessarily interested in tuning my sPLS-DA model. However, the way I understood it (following the SRBCT example on the mixOmics page), it is a required step for identifying the most relevant features. Or is it not?

If it is necessary: could you give me some more pointers on how to set up my starting grid structure correctly? I am sorry, I know this is an unusual question, but I could not find any information regarding this anywhere on the mixOmics page. There are plenty of examples but, to my knowledge, none describes the logic behind setting up the starting points of the grid. In the FAQ section there is an example in which they start with seq(20, 300, 10).
I played around with the grid structure and it seems to be highly relevant. For example, when I used the grid seq(20, 300, 10) I ended up with 300 features being chosen for component 1 and 290 being chosen for component 2. I assumed this meant that I had to use a wider grid, so I basically did what you suggested above, just the other way around: I steadily increased the grid range and decreased the resolution. When I ran a grid with seq(10, 600, 20) I ended up with 380 features selected for comp 1 and 600 for comp 2. I further increased the grid dimensions to seq(20, 3000, 40) and ended up with 700 features selected for comp 1 and 1740 selected for comp 2. Finally, for the first time, the number of selected features for comp 2 was not equal to the grid maximum (i.e. 1740 < 3000).
I assume this indicates that we reached our optimal number of features? Does this also indicate that almost all of my 1740 features are relevant for comp 2 but only 700 for comp 1? So the differences between treatments are driven by many, many chemical features rather than by a few individual metabolites? Would the model now be ready for feature selection (the top 10 features of comp 1 all have weight coefficients around 0.10, the top 10 features of comp 2 have weight coefficients of around 0.06)? Or would I now have to decrease the grid width and increase the resolution to narrow down the field?
Moreover, in the FAQ section it says:

This iterative process can be avoided depending on the research question. If the aim is to identify a small set of predictors, then focus the grid around lower values (eg. seq(1, 20, 1)).

This is exactly what I want. Will this approach provide me with the (up to) 20 most influential features for each component?
If I use this grid structure, I end up with 20 features for comp 1 and 17 features for comp 2.
Moreover, the top 20 most influential features selected via the grid seq(1, 20, 1) and via seq(20, 3000, 40) are identical. The top 17 from the smaller grid and the larger grid, however, are rather different (some compounds are identical but many are not).
Hence, for identifying the most relevant features, would you recommend focusing only on the features of comp 1, given that they are more reliably selected?

Alternatively, given my main goal (see above), could I not just use my initial PLS-DA model (the one that produced the plot) to identify the most relevant features, i.e. select the top 20 or so features of each component? In that case the weight coefficients (or VIPs, if more legitimate in this situation) of each feature are rather low (the highest values on comp 1 are around 0.05). Would that be a legitimate approach? Or would you stick to what I did above?

Sorry for all the questions.
With much appreciation!
Michael

I am not necessarily interested in tuning my sPLS-DA model. However, the way I understood it, [tuning] is a required step for identifying the most relevant features.

Tuning is definitely required if you want to draw meaningful conclusions from the model. If you’re just using sPLSDA for some quick exploration before you begin your real analysis, then tuning isn’t as important.

I assume this indicates that we reached our optimal number of features?

Not necessarily. From here, if you want to truly optimise the model you need to keep narrowing the grid and increasing its resolution. To what degree you do this depends on how finely tuned you want your model to be.

Does this also indicate that almost all of my 1740 features are relevant for comp 2 but only 700 for comp1?

That depends on your definition of "relevant". This just means that using these numbers of features on your first and second components allows the model to best discriminate your classes. You have to decide on a "threshold of importance", so to speak.

So the differences between treatments are driven by many many chemical features rather than by a few individual metabolites?

Exactly. Some systems cannot be neatly explained by a number of factors that is convenient for us; just because we want to explain a system with x variables doesn't mean that we can. Cellular biochemistry involves highly complex systems, so it's hardly surprising that your model performs better when it has access to more of that information.

would the model now be ready for feature selection … or would I now have to decrease the grid width and increase the resolution to narrow down the field?

I don't know what sort of error rates your model is achieving, so I can't comment on whether it's "ready". I would personally narrow down your feature number once or twice and then move on to developing your final model using the resulting feature counts.
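Once you settle on the feature counts, the final model is just a matter of passing them to splsda(); a minimal sketch, where tune.final, X and Y are placeholders:

library(mixOmics)

final.keepX  <- tune.final$choice.keepX            # counts from your last tuning run
splsda.final <- splsda(X, Y, ncomp = 2, keepX = final.keepX)

# loadings of the features selected on each component
selectVar(splsda.final, comp = 1)$value
selectVar(splsda.final, comp = 2)$value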

Will this approach provide me with the (up to) 20 most influential features for each component?

It will attempt to do so. This will give you the 20 features which best discriminate your classes. However, given that your previous models selected hundreds (or thousands) of features as optimal, I can't imagine that considering only 20 features will result in a good model.

Moreover, the top 20 most influential features selected via the grid seq(1, 20, 1) and via seq(20, 3000, 40) are identical.

This is great in terms of validating your claim about these 20 “key” features.

The top 17 from the smaller grid and the larger grid, however, are rather different

This is likely a result of the way the sPLSDA algorithm functions. The first component is going to be very consistent across various runs of sPLSDA, but subsequent components are going to be much less stable.

would you recommend focusing only on the features of comp 1, given that they are more reliably selected?

Not without actually applying stability analysis (something like the methodology I outlined in my first response). Your first component does seem to be performing most of the discrimination though.

Would that be a legitimate approach? Or would you stick to what I did above?

The alternative method would be easier, yes. However, the claims you could make from it would require many more assumptions and would be much less reliable. I would personally stick to the more comprehensive process, but this is up to you and how much time you're willing to spend building your model.

To summarise somewhat:

I understand that it may be a bit frustrating that many of my answers "depend on" this and are "not necessarily" that. mixOmics provides you with methods to explore your data in a non-deterministic manner. This means that there is no single correct way to go about it.

The desire to have a neat set of (e.g. 20) features which are the most influential is a great goal. However, you need to consider the possibility that looking at only 20 features is not actually going to provide the understanding that you want.

Everything you're doing is great! However, no single one of these methods is going to be the "best"; there is no best. You have to decide based on the amount of time you have, how accurate you want or need the model to be, and how complex you can let the model become. Take stock of all the results you've attained so far and assess from there.

As a general methodological recommendation: make a model with 20 (or fewer) features per component and another model with however many features the tune.*() function suggests (e.g. 1700+). Compare the results of these models and determine whether the improvement in model performance provided by the hundreds of additional features is worth it. This will also show you how valid the results of the smaller model are in the face of a more holistic examination of the system.
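A minimal sketch of that comparison, where X, Y and the keepX values are placeholders (the large keepX just echoes the counts discussed earlier in this thread):

library(mixOmics)

model.small <- splsda(X, Y, ncomp = 2, keepX = c(20, 20))
model.large <- splsda(X, Y, ncomp = 2, keepX = c(700, 1740))

perf.small <- perf(model.small, validation = "Mfold", folds = 10, nrepeat = 100)
perf.large <- perf(model.large, validation = "Mfold", folds = 10, nrepeat = 100)

# compare balanced error rates: is the gain from hundreds of extra features
# worth the loss of interpretability?
perf.small$error.rate$BER
perf.large$error.rate$BER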


Hi Max,
Thank you so much for your help. This was super helpful!
You provided excellent advice that greatly helped me get a much better understanding of how to tackle and interpret an sPLS-DA (so far I have only used completely different statistical approaches that are much less "explorative").

Finally, I have two last fundamental beginner questions that I did not find an answer for:

  1. I created my sample plot (see above: green/black dots and ellipses) following the SRBCT case study, i.e. I used my entire data set, determined the optimal number of components and then created the sample plot using the "splsda" and "plotIndiv" functions. I found that the 95% ellipses do not overlap, concluding that my two treatments (black vs. green) seem to have different overall chemical fingerprints. However, when subjecting the same data to a PCA I did not find a clear separation between the two treatments.
    Is it still correct to conclude that the chemical fingerprints between the treatments differ? Since PLS-DA is a supervised method, it may be much "easier" for this algorithm to find differences (or is PLS-DA just a far more advanced algorithm than PCA?). Or do I have to back up the conclusions from my sample plot somehow (e.g. by providing error rates)? Also, since I only have roughly 20 data points per treatment, I do not see the option to use a "test set" for generating such a plot (also, the "splsda" function will always be involved).

  2. Do components 1 and 2 of a PCA and a PLS-DA both show the % of explained variance? Are they comparable? Oftentimes I feel that the (principal) component values of a PCA are larger than the component values of a PLS-DA; nevertheless, the PLS-DA approach leads to a much better separation.

That’s it I promise!

Best
Michael

Is it still correct to conclude that the chemical fingerprints between the treatments differ?

mixOmics is not running any sort of statistical test here, meaning you cannot draw concrete conclusions. While that figure provides great evidence for the claim, it does not "prove" or "conclude" it.

Since PLS-DA is a supervised method it may be much “easier” for this algorithm to find differences

It's not that it's easier, but that sPLSDA is an algorithm which focuses on discrimination. PCA has no "concern" for class labels.

or is PLS-DA just a far more advanced algorithm than PCA?

While there are similarities between them, they are very different algorithms.

Or do I have to back up the conclusions from my sample plot somehow (e.g. by providing error rates)?

You should certainly provide more than one figure to back up your claim. Examining error rates is a great place to start.

Also, since I only have roughly 20 data points per treatment, I do not see the option to use a "test set" for generating such a plot

20 samples per group is more than most people on this forum have access to. You can certainly use a portion of these as a testing set; a 3:1 split between training and testing should be perfect.

Do components 1 and 2 of a PCA and a PLS-DA both show the % of explained variance? Are they comparable?

Yes, they are directly comparable. It is just a measure of the variability captured by the component, scaled by the total variability in the data. Remember, though, that sPLS-DA does not construct its components to maximise captured variance.

oftentimes I feel that the (principal) component values of a PCA are larger than the component values

Yes, that's because PCA seeks to maximise captured variance while sPLSDA does not; it maximises class discrimination instead.
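If you want to see the two side by side, here is a minimal sketch (X and Y are placeholders for your feature matrix and treatment labels); plotIndiv() prints the percentage of explained variance on each axis label, so the PCA and PLS-DA sample plots can be compared directly:

library(mixOmics)

pca.res   <- pca(X, ncomp = 2, center = TRUE, scale = TRUE)
plsda.res <- plsda(X, Y, ncomp = 2)

# each axis label shows the % of total variance explained by that component
plotIndiv(pca.res,   group = Y, legend = TRUE, title = "PCA")
plotIndiv(plsda.res, legend = TRUE, title = "PLS-DA")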


Thanks again. This is very useful!
Best wishes,
Michael