Hi!
I have a dataset with 3 different timepoints. It covers 24 plants from 9 different locations, but at each timepoint only about 14-15 plants were sampled to measure their metabolome. The data is unbalanced: I have Plant1_T1_Loc1, for example, but I don’t necessarily have the same plant at timepoint T2, because it died and another nearby plant from those 24 was taken instead.
I ran a multilevel PCA, which shows that the timepoints differ. The chosen locations do not seem to have an influence.
I then tried to confirm that with an sPLS-DA using the following code, and the locations do not cluster at all.
X <- tlog  # log2-transformed, normalized data
Y <- as.factor(location)
summary(Y)
This will be a tricky dataset to draw meaningful conclusions from. Without consistent samples across time or space, there are going to be a lot of spurious relationships in your data. Given this, it’s unsurprising that the locations didn’t cluster.
Additionally, in your call to splsda(), you use sampleID as the multilevel parameter. If you want to control for the time measurements or location, you need to pass that information to multilevel instead. I’d also recommend exploring the withinVariation() function, as it gives you a bit more control.
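To make the withinVariation() suggestion a bit more concrete, here is a minimal sketch on simulated data. Everything in it (the `plant` and `timepoint` vectors, the matrix sizes, the `keepX` values) is made up for illustration and is not taken from your dataset. It also assumes a balanced design where every plant is measured at every timepoint; as far as I’m aware, the decomposition expects each ID to occur more than once, so plants sampled only once may need to be dropped first.

```r
library(mixOmics)

set.seed(42)

# Hypothetical balanced design: 6 plants, each measured at 3 timepoints
plant     <- factor(rep(paste0("Plant", 1:6), times = 3))
timepoint <- factor(rep(c("T1", "T2", "T3"), each = 6))

# Simulated stand-in for the log-transformed metabolite matrix (18 samples x 20 features)
X <- matrix(rnorm(18 * 20), nrow = 18,
            dimnames = list(NULL, paste0("met", 1:20)))

# The first column of `design` identifies the repeated unit whose
# between-unit variation should be removed (here: the plant)
design <- data.frame(sample = plant)

# Within-plant part of the data: each plant's own mean is subtracted
Xw <- withinVariation(X, design = design)

# sPLS-DA on the within-plant matrix, discriminating timepoints
# (ncomp and keepX are arbitrary illustrative values)
fit <- splsda(Xw, timepoint, ncomp = 2, keepX = c(5, 5))
```

Running the decomposition yourself lets you inspect `Xw` before modelling, which is the extra control mentioned above.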
My question now is whether I can use the timepoints both as the group (Y) and as the multilevel factor in sPLS-DA?
I’m not sure I fully understand the question, as I don’t know why you would do this. If you use the timepoints as your multilevel parameter, the algorithm will attempt to “remove” the between-timepoint variation in your dataset. Then, if you pass the same timepoint vector as your Y in splsda(), it will attempt to generate a model which best discriminates between the timepoints. However, you will already have removed the variation between timepoints, so splsda() is unlikely to perform well at all.
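The point can be illustrated with a small base R sketch on simulated data. This is not the mixOmics implementation, just the same idea done by hand: subtract each timepoint’s mean from its samples and look at what is left. The `group_means` helper, the data, and the sizes are all hypothetical.

```r
set.seed(1)

# 15 simulated samples, 5 per timepoint, 4 features
timepoint <- factor(rep(c("T1", "T2", "T3"), each = 5))
X <- matrix(rnorm(15 * 4), nrow = 15)

# Add a clear timepoint effect: every feature is shifted by 2, 4 or 6
X <- X + 2 * as.integer(timepoint)

# Hypothetical helper: per-group column means (rows ordered by factor level)
group_means <- function(M, g) rowsum(M, g) / as.vector(table(g))

# "Remove" the between-timepoint variation by centring within each timepoint
X_within <- X - group_means(X, timepoint)[as.character(timepoint), ]

before <- group_means(X, timepoint)         # timepoint means clearly differ
after  <- group_means(X_within, timepoint)  # all means are numerically zero

print(round(before, 2))
print(round(after, 2))
```

After the centring step every timepoint has the same mean profile, so a classifier asked to separate the timepoints has no mean difference left to work with.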