I am not necessarily interested in tuning my sPLS-DA model. However, the way I understood it, [tuning] is a required step to identify the most relevant features.
Tuning is definitely required if you want to draw meaningful conclusions from the model. If you're just using sPLS-DA for some quick exploration before you begin your real analysis, then tuning isn't as important.
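For concreteness, here is a minimal tuning sketch using mixOmics' tune.splsda(); X, Y, and the grid/cross-validation settings are placeholders, not values from your analysis:

```r
library(mixOmics)

# X: samples x features matrix, Y: factor of class labels (placeholders)
tune.res <- tune.splsda(X, Y, ncomp = 2,
                        test.keepX = seq(20, 3000, 40),  # candidate feature counts
                        validation = "Mfold", folds = 5, nrepeat = 10,
                        dist = "max.dist", measure = "BER")

tune.res$choice.keepX  # suggested number of features per component
```

The $choice.keepX element is what you would then feed into the keepX argument of a final splsda() call.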
I assume this indicates that we reached our optimal number of features?
Not necessarily. From here, if you want to truly optimise the model, you need to start reducing the resolution of your tuning grid. How far you take this depends on how finely tuned you want your model to be.
Does this also indicate that almost all of my 1740 features are relevant for comp 2 but only 700 for comp1?
Depends on your definition of relevant. This just means that using these numbers of features on your first and second components allows the model to best discriminate your classes. You have to decide on a 'threshold of importance', so to speak.
So the differences between treatments are driven by many many chemical features rather than by a few individual metabolites?
Exactly. Some systems cannot be neatly explained by a number of factors that is desirable to us. Just because we want to explain a system with x variables doesn't mean that we can. Cellular biochemistry works through highly complex systems, so it's hardly surprising your model performs better when it has access to more of that information.
would the model now be ready for feature selection … or would I now have to decrease the grid width and increase the resolution to narrow down the field?
I don't know what sort of error rates your model is achieving, so I can't comment on whether 'it's ready'. I would personally narrow down your feature number once or twice and then move on to developing your final model using the resulting feature counts.
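As a sketch of that narrowing step, assuming hypothetical X and Y objects and an illustrative coarse-grid optimum of around 700 features on component 1:

```r
library(mixOmics)

# re-tune on a narrower, finer grid around the coarse-grid optimum
# (the 600-800 range is illustrative, not a recommendation)
fine.tune <- tune.splsda(X, Y, ncomp = 2,
                         test.keepX = seq(600, 800, 10),
                         validation = "Mfold", folds = 5, nrepeat = 10)

# then fit the final model with the refined feature counts
final.model <- splsda(X, Y, ncomp = 2, keepX = fine.tune$choice.keepX)
```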
Will this approach provide me with the (up to) 20 most influential features for each component?
It will attempt to do so. This will give you the 20 features which best discriminate your classes. However, given your previous models selected hundreds (or thousands) of features as optimal, I can't imagine considering only 20 features will result in a good model.
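If you do want to inspect what a 20-feature model selects, mixOmics exposes this via selectVar(); the data objects below are placeholders:

```r
library(mixOmics)

# fit a deliberately sparse model and inspect its selected features
small.model <- splsda(X, Y, ncomp = 2, keepX = c(20, 20))

sel <- selectVar(small.model, comp = 1)
sel$name         # names of the (up to) 20 features selected on component 1
head(sel$value)  # their loading weights, ordered by absolute value
```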
Moreover, the top 20 most influential features selected via the grids seq(1, 20, 1) and seq(20, 3000, 40) are identical.
This is great in terms of validating your claim about these 20 'key' features.
The top 17 of the smaller grid and the larger grid, however, are rather different.
This is likely a result of the way the sPLS-DA algorithm functions: each component is fitted on the data after deflation by the previous components, so the first component is going to be very consistent across various runs, while subsequent components are going to be much less stable.
would you recommend to only focus on the features of comp1 given that they are more reliably selected?
Not without actually applying stability analysis (something like the methodology I outlined in my first response). Your first component does seem to be performing most of the discrimination though.
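A stability check along those lines can be run with mixOmics' perf(), whose $features$stable element reports how often each feature was selected across repeated cross-validation; the model object, folds, and nrepeat below are placeholders:

```r
library(mixOmics)

# model: a fitted splsda object (placeholder name)
# repeated cross-validation; higher nrepeat gives smoother stability estimates
perf.res <- perf(model, validation = "Mfold", folds = 5, nrepeat = 50)

# selection frequency of each feature on component 1 across CV runs
perf.res$features$stable[[1]]
```

Features that are selected in a high proportion of runs are the ones you can more confidently call reliable.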
Would that be a legitimate approach? Or would you stick to what I did above?
The alternative method would be easier, yes. However, the claims you could make from it would require many more assumptions and would be much less reliable. I would personally stick to the more comprehensive process, but this is up to you and how much time you're willing to spend building your model.
To summarise somewhat:
I understand that it may be a bit frustrating that many of my answers 'depend on' this and are 'not necessarily' that. mixOmics provides you with methods to explore your data in a non-deterministic manner. This means that there is no single correct way to go about it.
The desire to have a neat set of (e.g. 20) features which are the most influential is a great goal. However, you need to consider the possibility that looking at only 20 features is not actually going to provide you with the understanding that you want.
Everything you're doing is great! However, no single one of these methods is going to be the 'best'; there is no best. You have to decide based on the amount of time you have, how accurate you want or need the model to be, and how complex you can let the model become. Take stock of all the results you've attained so far and assess from there.
As a general, methodological recommendation: make a model with 20 (or fewer) features per component and another model with however many the tune.*() function suggests (e.g. 1700+). Compare the results of these models and determine whether the improvement in model performance provided by the hundreds of additional features is worth it. This will also show you how valid the results of the smaller model are in the face of a more holistic examination of the system.
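That comparison could be sketched as follows; the keepX values echo the feature counts mentioned earlier in the thread, and X, Y, and the CV settings are placeholders:

```r
library(mixOmics)

# two candidate models: a sparse one and the tuned one (counts illustrative)
model.small <- splsda(X, Y, ncomp = 2, keepX = c(20, 20))
model.large <- splsda(X, Y, ncomp = 2, keepX = c(700, 1740))

perf.small <- perf(model.small, validation = "Mfold", folds = 5, nrepeat = 10)
perf.large <- perf(model.large, validation = "Mfold", folds = 5, nrepeat = 10)

# compare balanced error rates per component and distance metric
perf.small$error.rate$BER
perf.large$error.rate$BER
```

If the large model's error rate is only marginally better, that is evidence the sparse model captures most of the discriminative signal.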