Confused - choice.keepX returning all components instead of top choice?

I just have a quick question on understanding exactly what $choice.keepX and $choice.keepY mean. After running perf and then tune.spls (ncomp=10 and then a list of keepX variables), when I then run tune.spls$choice.keepX it returns all the components and how many features for each one.

I thought this function is supposed to return the optimal number of components to be used?
Or is it just telling me the optimal number of features to use with each comp?

For example this would be my output:

> tune.spls.cor$choice.kncomp
> tune.spls.cor$choice.keepX

comp1  comp2  comp3  comp4  comp5  comp6  comp7  comp8  comp9 comp10 
100     25     50    500     25     25     25     25    500    300

So, how do I know which comp to actually use?

Here is the figure that is produced:

Am I using the comp with the highest correlation value? I cannot find a clear description on what this figure is supposed to show me. From the figure I would think to use comp 1 with any of the keepX values because they’re all about cor=1?

I thought this function is supposed to return the optimal number of components to be used?
Or is it just telling me the optimal number of features to use with each comp?

tune() actually does both of these things! As you identified, the $choice.keepX object tells you the optimal number of features to use when constructing each component. The $choice.ncomp object (which you've typed as choice.kncomp, presumably just a typo) tells you the optimal number of components to use for your model. You can also use the perf() function to determine the optimal number of components, but it's usually easier to just use tune().
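In case it helps, here is a rough sketch of how the two objects fit together when building the final model. It uses the liver.toxicity data that ships with mixOmics as a stand-in for your own X and Y, and the keepX grid, fold count and seed are made up for illustration. I'm also assuming that $choice.ncomp comes back either as a plain number or as a list with an $ncomp element, which may differ between mixOmics versions, so check with str() on your own output.

```r
library(mixOmics)

# Example data shipped with mixOmics (stand-in for your own X and Y).
data(liver.toxicity)
X <- liver.toxicity$gene
Y <- liver.toxicity$clinic

set.seed(1)  # cross-validation is random, so fix the seed
tuned <- tune.spls(X, Y,
                   ncomp = 3,
                   test.keepX = c(5, 10, 25),  # hypothetical grid
                   validation = "Mfold", folds = 5, nrepeat = 3,
                   measure = "cor", mode = "regression")

tuned$choice.keepX  # one value PER tested component, not a ranking
tuned$choice.ncomp  # how many of those components to keep

# Assumption: choice.ncomp is either a plain number or a list with an
# $ncomp element, depending on the mixOmics version.
ncomp_opt <- tuned$choice.ncomp
if (is.list(ncomp_opt)) ncomp_opt <- ncomp_opt$ncomp
stopifnot(!is.null(ncomp_opt))

# Final model: keep the first ncomp_opt components and their keepX values.
final <- spls(X, Y, ncomp = ncomp_opt,
              keepX = tuned$choice.keepX[seq_len(ncomp_opt)],
              mode = "regression")
```

So $choice.keepX always has one entry per component you asked tune to test, and $choice.ncomp tells you how many of those entries to actually carry into spls().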

Based on your figure, it seems that a single component is optimal, because the correlation is already maximised on the first component. I would advise against using visual inspection to determine this though - make sure you use choice.ncomp. A t-test is used to determine the optimal component count, so the figure can sometimes be misleading (but this is not the case with your example).

Am I using the comp with the highest correlation value? I cannot find a clear description on what this figure is supposed to show me. From the figure I would think to use comp 1 with any of the keepX values because they’re all about cor=1?

You’ve mostly got the right idea! When building models, we are engaging in a constant balancing act. In most scenarios, adding more components which each use more features will increase model accuracy. However, a simpler model (fewer components and features) is also desirable. So we’re trying to maximise model accuracy (measured here by correlation) while using the minimal number of features/components.

For example, on your first component, 100 features are selected even though using 300, 500, or 1000 increases the correlation value. This is because the aforementioned t-test determined that adding features beyond 100 improves accuracy only negligibly while vastly increasing complexity, which is not a good trade. Hence, 100 features strikes the balance between accuracy and complexity.
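To make that trade-off concrete, here is a small base R illustration of the general idea. It is a simplified sketch, not the exact procedure mixOmics runs internally: the correlation values are invented, and the rule (only move to a larger keepX when a one-sided t-test at the 0.05 level says it is significantly better) is my assumption about how to mimic it.

```r
# Hypothetical cross-validated correlations: 10 repeats per keepX value.
keepX_grid <- c(25, 50, 100, 300, 500)
mean_cor   <- c(0.80, 0.88, 0.95, 0.955, 0.957)  # made-up averages
noise      <- c(-0.015, -0.010, -0.005, 0, 0.005,
                 0.010,  0.015, -0.010, 0, 0.010)  # made-up repeat-to-repeat spread

# Rows are repeats, columns are keepX values.
cv_cor <- sapply(mean_cor, function(m) m + noise)
colnames(cv_cor) <- keepX_grid

# Start from the smallest keepX and only move to a larger one when it is
# significantly better than the current choice (one-sided t-test).
best <- 1
for (j in seq_along(keepX_grid)[-1]) {
  p <- t.test(cv_cor[, j], cv_cor[, best], alternative = "greater")$p.value
  if (p < 0.05) best <- j
}

chosen <- keepX_grid[best]
chosen  # 100: the gains at 300 and 500 are too small to be significant
```

The averages at 300 and 500 are higher than at 100, but not by enough relative to the spread across repeats, so the simpler choice wins.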

I hope this all helps with your understanding a little bit. Please reach out if not.

Thank you so much, your explanations are extremely helpful. I understand this a lot better now. :slight_smile: