Multilevel PCA interpretation - bias?

Dear mixOmics team,

I have been using the mixOmics package for my research and have recently started working with longitudinal samples: individuals from 2 groups were each sampled at 2 time points.
I am working with metabolomic data. Because of the large number of metabolites and the small (unbalanced) number of individuals, I am well aware of the potential bias in the models I am using.

To examine how individuals are positioned with regard to group and time, using all metabolites, in an unsupervised way, I need to perform a PCA.
However, in order to focus on the effect of treatment on my groups and reduce individual variation, I am using a multilevel PCA, where I passed the individual number as the multilevel argument.

I am having difficulty interpreting the results and would appreciate your expert advice.

-When I performed a PCA without a multilevel argument, the groups seemed to overlap, with one group showing changes after treatment.

Image1

-However, when I used a PCA with the multilevel argument, the four groups were distinguished and there was a greater separation for the red group. The blue group, which showed less obvious changes in the PCA without the multilevel argument, showed changes in the multilevel PCA. However, the multilevel PCA suggests that the metabolomes of the two groups were different at baseline and that the difference between the two groups is conserved from baseline to after treatment.
Image2

My concern is that I am introducing some form of supervision of my groups by using the multilevel argument as I did. Should I trust my model and interpret the multilevel PCA? Or is there a better way to analyze my data to visualize whether the metabolome is modified after treatment, and in a different way depending on the treatment?

I apologize for the level of my explanation, and thank you in advance for any input you can provide on this,

Best regards,
Maëlle.

hi @mbonhomme,

No, I don't think there is any bias in the analysis here; if anything, it's a good example of multilevel performing well for your case :slight_smile:

If you look at the paper from Liquet et al., you will see that the samples are normalised individually to remove the individual effect, so we are not trying implicitly to regroup the samples according to the time points. The overlap you saw on a classic PCA shows that the individual variation is interfering with the time variation you were hoping to see.
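
A minimal sketch of that idea, assuming the multilevel step amounts to subtracting each individual's own mean profile from that individual's samples (a one-factor within-subject decomposition). It is written in plain Python only to stay self-contained; it is not mixOmics code, and the function name `within_subject_center` and the toy values are made up for illustration. Note that the group and time labels never enter the computation, only the individual IDs do:

```python
from collections import defaultdict

def within_subject_center(rows, subject_ids):
    """Subtract each subject's mean profile from that subject's samples.

    Hypothetical helper for illustration; only subject IDs are used,
    never the group or time labels.
    """
    by_subject = defaultdict(list)
    for idx, sid in enumerate(subject_ids):
        by_subject[sid].append(idx)

    n_vars = len(rows[0])
    centered = [None] * len(rows)
    for idxs in by_subject.values():
        # Mean of each variable over this subject's repeated samples.
        means = [sum(rows[i][j] for i in idxs) / len(idxs) for j in range(n_vars)]
        for i in idxs:
            centered[i] = [rows[i][j] - means[j] for j in range(n_vars)]
    return centered

# Toy data (invented): 2 subjects x 2 time points, 2 metabolites.
X = [
    [10.0, 5.0],  # subject A, baseline
    [12.0, 4.0],  # subject A, after treatment
    [20.0, 9.0],  # subject B, baseline
    [22.0, 8.0],  # subject B, after treatment
]
subjects = ["A", "A", "B", "B"]

Xw = within_subject_center(X, subjects)
for sid, row in zip(subjects, Xw):
    print(sid, row)
```

In this toy example the two subjects sit at very different overall levels, yet after centering their baseline rows coincide and their post-treatment rows coincide: the shared time effect is what remains once the between-individual offset is removed. A PCA run on such a centered matrix is still unsupervised with respect to group and time.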

At this stage PCA is unsupervised, so I would say this is good news. When you move to a multilevel sPLS-DA for variable selection, you will have access to performance measures (perf()) that use cross-validation to evaluate the overfitting of your approach, if any.

Kim-Anh

Hi Kim-Anh,
Thank you for your clear answer and the information you gave; much appreciated! :slight_smile:

My sPLS-DA is unfortunately not performed in R but in MetaboAnalyst, and I get the following performance parameters (Q2, Accuracy, R2). Can I evaluate the overfitting of my approach with these parameters?

Best regards,
Maëlle

hi @mbonhomme,

You would need to perform some cross-validation ‘manually’ with MetaboAnalyst, by separating the data into training and test sets and then recording the error rate on the prediction … Our mixOmics R function perf() has been quite complicated to code, so I would not think it would be easy to do. Or you could just move to mixOmics for this assessment, noting that there would be some (slight or not so slight) differences between the approaches. (For example, we center and scale each variable in the data sets.)
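
As a rough illustration of what such a manual cross-validation involves, here is a sketch in plain Python. It is not mixOmics or MetaboAnalyst code and does not reproduce perf(): a nearest-centroid classifier stands in for sPLS-DA, all function names and data values are invented, and keeping both samples of an individual in the same fold is an assumption made here to avoid leakage between training and test sets.

```python
import random

def fit_centroids(X, y):
    """Stand-in classifier (NOT sPLS-DA): one mean profile per class."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}

def predict(centroids, row):
    """Return the class whose centroid is closest in squared distance."""
    def sqdist(center):
        return sum((a - b) ** 2 for a, b in zip(row, center))
    return min(centroids, key=lambda lab: sqdist(centroids[lab]))

def grouped_cv_error(X, y, subjects, n_folds=3, seed=0):
    """Cross-validated error rate, holding out whole subjects per fold."""
    unique = sorted(set(subjects))
    random.Random(seed).shuffle(unique)
    folds = [unique[k::n_folds] for k in range(n_folds)]
    errors = 0
    for held_out in folds:
        held = set(held_out)
        train = [i for i, s in enumerate(subjects) if s not in held]
        test = [i for i, s in enumerate(subjects) if s in held]
        model = fit_centroids([X[i] for i in train], [y[i] for i in train])
        errors += sum(predict(model, X[i]) != y[i] for i in test)
    return errors / len(X)

# Toy data (invented): 6 subjects, each sampled at baseline and after treatment.
subjects = [s for s in ["s1", "s2", "s3", "s4", "s5", "s6"] for _ in range(2)]
y = ["baseline", "treated"] * 6
X = [[0.1], [10.2], [-0.3], [9.8], [0.4], [10.1],
     [0.0], [9.9], [-0.2], [10.3], [0.2], [9.7]]

error_rate = grouped_cv_error(X, y, subjects)
print("cross-validated error rate:", error_rate)
```

The point of the sketch is the bookkeeping: the model is refit on each training set, predictions are made only on held-out samples, and the misclassifications are pooled into one error rate. An error rate on held-out data that is much worse than the fit on the training data is the usual sign of overfitting.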

Kim-Anh