Spls / keepx / keep specific variables

Hello,

Thank you for creating this tool.

I’m using the “spls” function to select variables, but I want to keep specific variables in my model (i.e. not have them assigned a coefficient of 0). Is it possible to achieve that, please?

Have a nice day,
Jérémy Tournayre

I was working on an implementation for this exact feature some weeks ago. Unfortunately, the conclusion I came to was that the most mathematically valid way to achieve this is the simplest. Hence, here’s what I’d do:

  • Use tune.spls to determine the optimal number of features.
  • Take these values and pass them to spls to give you a list of the optimal features for your analysis.
  • Extract these features (by name or index), along with your desired features to keep, from your input dataframe.
  • Run pls (the NON-sparse variant) using just this subset dataframe which includes your desired features and your selection of the optimal features selected by the previous model.
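The steps above can be sketched roughly as follows. This is a minimal, illustrative R sketch using toy data: the matrices `X`/`Y`, the `keepX` values, and the `must.keep` feature names are all placeholders for your own, and the tuning step is shown only in a comment to keep the example fast.

```r
library(mixOmics)
set.seed(42)

# Toy data standing in for your own: 40 samples, 30 named predictors
X <- matrix(rnorm(40 * 30), 40, dimnames = list(NULL, paste0("gene", 1:30)))
Y <- matrix(rnorm(40 * 2), 40)

# Step 1 (tuning) would be something like:
#   tune.res <- tune.spls(X, Y, ncomp = 2, test.keepX = c(5, 10, 15),
#                         validation = "Mfold", folds = 5)
# and you would take tune.res$choice.keepX; hard-coded here as a placeholder.
choice.keepX <- c(10, 10)

# Step 2: fit spls with the tuned keepX and list the selected features
spls.res <- spls(X, Y, ncomp = 2, keepX = choice.keepX)
selected <- unique(unlist(lapply(1:2, function(k)
  selectVar(spls.res, comp = k)$X$name)))

# Step 3: union of the optimal features and those you must keep
must.keep <- c("gene25", "gene30")          # placeholder names
subset.features <- union(selected, must.keep)

# Step 4: refit the NON-sparse pls() on this subset only
pls.res <- pls(X[, subset.features], Y, ncomp = 2)
```

The key point is that the forced-in features only enter at the final, non-sparse fit, so the earlier tuning and selection steps are left mathematically untouched.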

I hope this makes sense. If not, let me know and I can go into a bit more depth for you.

Hello,

Thanks!
If I understand correctly, the specific features will be added artificially on top of the optimal features. So the model will not be “sparse”, no?

To be truly optimal, I think the specific features should be declared within spls itself (i.e. as early as possible), so that some of the optimal features can be removed when the specific features replace them.

Maybe I misunderstood something?

Have a nice day,
Jérémy Tournayre

If I understand correctly, the specific features will be added artificially on top of the optimal features. So the model will not be “sparse”, no?

In the context of mixOmics, “sparse” just refers to methods that select a subset of features rather than using all input features. By taking the features selected by spls() in addition to your specific features, the model is still “sparse”, as only a subset of the input features is being used.

To be truly optimal, I think the specific features should be declared within spls itself (i.e. as early as possible), so that some of the optimal features can be removed when the specific features replace them.

I believed the same when attempting this implementation. However, it turns out to be entirely inappropriate. The short version is that by artificially setting some features aside, the mathematical consistency of the function is lost. The long version you can find by reading my responses here and here.

A potential amendment to my previous comment in this thread: in my last bullet point I suggested using the NON-sparse method on the subset of selected plus desired features. Using the sparse method (e.g. spls()) might be better here to address your concern: with the sparse method, the specific features can “replace” the optimal features if this results in a better model.
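To illustrate the amended suggestion as a hedged sketch: here `subset.features` stands in for the union of a tuned selection and your forced-in features (the names, data, and `keepX` values are all placeholders). Because `keepX` is smaller than the subset size, the second sparse fit is free to drop some of the originally “optimal” features in favour of your specific ones.

```r
library(mixOmics)
set.seed(1)

# Toy data: 40 samples, 30 named predictors
X <- matrix(rnorm(40 * 30), 40, dimnames = list(NULL, paste0("gene", 1:30)))
Y <- matrix(rnorm(40 * 2), 40)

# Suppose these came from tune.spls/selectVar plus your forced features
subset.features <- c(paste0("gene", 1:10), "gene25", "gene30")

# Refit the sparse method on just this subset; keepX (per component) is
# smaller than the subset size, so some features can still be dropped
spls.sub <- spls(X[, subset.features], Y, ncomp = 2, keepX = c(8, 8))

# Which features survived on component 1
selectVar(spls.sub, comp = 1)$X$name
```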

Let me know if this all makes sense.