Interpretation of the perf plot

Hi!

I was wondering if you can help me gain a better intuition for the plot output of the perf function, such as the plot below:

The documentation of the plot.perf function says:

Function to plot classification performance for supervised methods, as a function of the number of components.

From this I understand that with each step on the x axis an additional component was used in the classification model, i.e. an additional component was considered for the distance calculation based on which the final prediction was made.

I would have assumed that the more components, the better the error rate, or at least I wouldn’t expect the error rate to become worse, as seen here when moving from 2 components to 3/4/5.

So I first wanted to make sure I understand correctly and that these errors are not per component on its own, but rather for a model based on all components up to the stated component number. Secondly, how would you interpret this worsening between a 2-component model and a 3-component one?

Many many thanks in advance,
Efrat

errors are not per component on its own, but rather for a model based on all components up to the stated component number

Your understanding is correct! For the third position on the x-axis, a model with three components was used, not just a model with the third component.

how would you interpret this worsening between a 2-component model and a 3-component one?

While it does seem counterintuitive, this is more common than you may think. One of the major considerations we have to keep in mind when modelling is preventing overfitting. For some sets of data, adding more components results in the model learning information that is very specific to the training samples. Then, when it faces new samples, that information leads it down the wrong path, as it isn’t generalisable. In this scenario, the third component is likely causing this overfitting, hence decreasing model accuracy.

A related explanation could be that of redundancy. Some datasets only have enough variance (or information) in them for a certain number of components. Attempting to construct components beyond what the data can support means the model has to almost invent information that isn’t there. Using such an invented component then decreases model accuracy, as it does not properly reflect the data.
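
If it helps to see this effect in isolation, here is a small toy simulation. It is only a sketch and does not use mixOmics or PLS-DA: the "components" are simulated columns, the classifier is a plain nearest-centroid rule on Euclidean distance, and every name and parameter in it (simulate, loo_error, shift, noise_sd and so on) is made up for illustration. The assumption built into it is that only the first two components carry class information, while components 3 to 5 are pure noise with a large variance. Each model uses all components up to k, as in the plot.

```python
import random


def simulate(rng, n_per_class=20, n_signal=2, n_noise=3, shift=1.5, noise_sd=5.0):
    """Simulate two classes. Signal components differ by class, noise ones do not."""
    rows, labels = [], []
    for label, sign in ((0, -1.0), (1, 1.0)):
        for _ in range(n_per_class):
            signal = [rng.gauss(sign * shift, 1.0) for _ in range(n_signal)]
            noise = [rng.gauss(0.0, noise_sd) for _ in range(n_noise)]
            rows.append(signal + noise)
            labels.append(label)
    return rows, labels


def centroid(rows, k):
    """Mean of the first k components over the given samples."""
    return [sum(r[j] for r in rows) / len(rows) for j in range(k)]


def loo_error(rows, labels, k):
    """Leave-one-out error of a nearest-centroid rule using components 1..k."""
    wrong = 0
    for i, (x, y) in enumerate(zip(rows, labels)):
        dists = {}
        for c in (0, 1):
            # Centroids are estimated without the held-out sample i.
            train = [r for j, r in enumerate(rows) if j != i and labels[j] == c]
            cen = centroid(train, k)
            dists[c] = sum((x[j] - cen[j]) ** 2 for j in range(k))
        pred = min(dists, key=dists.get)
        wrong += pred != y
    return wrong / len(rows)


def mean_error_by_ncomp(seed=1, n_datasets=20, max_comp=5):
    """Average the leave-one-out error over several simulated datasets, per k."""
    rng = random.Random(seed)
    totals = [0.0] * max_comp
    for _ in range(n_datasets):
        rows, labels = simulate(rng)
        for k in range(1, max_comp + 1):
            totals[k - 1] += loo_error(rows, labels, k)
    return [t / n_datasets for t in totals]


if __name__ == "__main__":
    for k, err in enumerate(mean_error_by_ncomp(), start=1):
        print(f"components 1..{k}: mean leave-one-out error = {err:.3f}")
```

In this setup the noisy components enter the distance calculation with a large weight, while the centroids estimated for them from a small training set reflect chance differences between the classes, so the models with more than two components tend to classify held-out samples worse. Real data will not be this clean cut, but the mechanism is the same one described above.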

Thank you Max for the clarification!