PLS-DA with season predictor

Hi,
I previously posted this question and received some advice, but I still cannot get the analysis to work. In short, I am running a PLS-DA model to classify a binary trait (0/1) using a number of predictors. Most of the predictors are numeric, and the model runs normally with them. I now have one additional predictor, season, which is a factor with four levels (summer, autumn, winter, spring) that I coded as 1, 2, 3, 4 respectively.
Previously, Kim Anh suggested that I use unmap(data$season), so I did that, which produced 4 indicator predictors that I combined with the other predictors to run the model. Is this the right way? It does not seem to improve model accuracy compared to the multilevel approach described below.
Multilevel option: I found on the website that we might be able to use the multilevel option (http://mixomics.org/case-studies/multilevel-vac18/)
I followed this approach as shown in the code below:
design.train <- data.frame(sample = fert.train$Season)

Y <- fert.train[, 1]
X <- fert.train[, 2:556]
plsda.fert <- plsda(X, Y, ncomp = 10, scale = TRUE, multilevel = design.train)

External validation

Y <- fert.test[, 1]
X <- fert.test[, 2:555]
design.test <- data.frame(season = fert.test$Season)
predR <- predict(plsda.fert, X, ncomp = ccomp, multilevel = design.test)

The model ran well, but for some validation sets that contained records that were not repeated, the model produced an error:

Error in FUN(X[[i]], …) :
A multilevel analysis can not be performed when at least one some sample is not repeated.

This makes me think I am using the wrong code, since this option is meant for repeated records. BUT it did seem to improve prediction accuracy.

Can you please give me some comments about these 2 options?
Thanks,
Phuong

Hi Phuong,

The season variable, as you highlighted, can be analysed in different ways:

  • multilevel: here you ‘correct’ for the season effect. Is that what you intend to do?
  • using an indicator matrix with unmap: here you want to include season as a predictor in your X data.
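
To make the second option concrete, here is a minimal base-R sketch of a season indicator matrix bound to numeric predictors. It uses model.matrix() in place of unmap(), and the toy data frame (toy, x1, x2) is made up purely for illustration, so treat it as a sketch rather than the exact mixOmics workflow.

```r
# Toy data: hypothetical values, not the data discussed in this thread
toy <- data.frame(
  x1 = c(0.2, 1.5, 3.1, 0.7, 2.2, 1.9),
  x2 = c(5.0, 4.1, 3.3, 6.2, 2.8, 4.4),
  Season = c(1, 2, 3, 4, 1, 3)  # 1 = summer, 2 = autumn, 3 = winter, 4 = spring
)

# Treat season as a category rather than a number
season <- factor(toy$Season, levels = 1:4,
                 labels = c("summer", "autumn", "winter", "spring"))

# One 0/1 column per season (no intercept, so all four levels are kept)
season.ind <- model.matrix(~ season - 1)
colnames(season.ind) <- levels(season)

# Numeric predictors plus the four indicator columns
X.toy <- cbind(as.matrix(toy[, c("x1", "x2")]), season.ind)
print(X.toy)
```

Fixing the factor levels explicitly means that training and test matrices get the same four season columns, even when a season happens to be absent from one of them.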

It is unclear at the moment what you would like to do with the season variable, and whether you think it should help to discriminate your outcome Y.
A multilevel approach is useful if (as shown in the example) you have a strong season effect on a PCA plot that interferes with the Y information (in that case, a multilevel PCA should show that the season effect is reduced). Usually you don't need to use predict() unless you already have a test set. Instead, you should consider using the perf() function directly, which performs cross-validation internally. Perhaps that will solve the error you are facing (as the function cross-validates whilst ensuring all repeated measures are represented).

Kim-Anh

Hi Kim Anh,

Thanks for your reply. Regarding your question on how I want to include the seasonal effect in the model: normally I would choose the second option of including season as a predictor in the X data. In this case, we have a predictor variable called “season” with four levels (spring, summer, autumn, winter). However, the current version of PLS-DA in mixOmics cannot handle a ‘class/factor’ predictor, which is why you previously advised me to use the unmap() function. And yes, I used that and it worked fine, but I observed no improvement in prediction accuracy. Of course, this could be genuine, as season might not affect Y.
However, when I ran the model and accounted for ‘season’ through the multilevel option, I found that accuracy improved in the first two validation sets. I could not draw a conclusion because the model crashed when the validation set did not have all four seasons present. If you can help me run ‘multilevel’ in this case, then I would be able to say which option is better to use.

For the perf() function: yes, I am aware of it, but I am doing external validation, which is why I used predict().

Cheers,
Phuong

Ok, just checking with you though that, if you added multilevel = season, you removed your season variable from the X data set to avoid overfitting!

The problem with multilevel is that it relies on repeated measurements for all samples.
Option 1: remove the test samples that are not repeated.
Option 2: use withinVariation() to extract the within-variation data of your test set (if it runs, I am not sure!), then predict without the multilevel option (as it is already accounted for by withinVariation()).
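
Both options can be sketched in base R on toy data. The IDs and values below are hypothetical, and the centring step only illustrates what "within variation" means for a single repeated factor (each row minus the mean of its ID group). It is an assumption-laden stand-in, so please check it against what withinVariation() returns on your own data.

```r
# Hypothetical test set: 'id' plays the role of the first column of the multilevel design
id <- c("A", "A", "B", "B", "C")  # "C" occurs only once
X.test <- matrix(c(1, 3, 2, 6, 5,
                   10, 14, 8, 12, 9),
                 ncol = 2, dimnames = list(NULL, c("x1", "x2")))

# Option 1: keep only rows whose ID occurs at least twice
n.rep <- table(id)
keep <- id %in% names(n.rep)[n.rep >= 2]
X.rep <- X.test[keep, , drop = FALSE]
id.rep <- id[keep]

# Option 2 (illustration only): subtract each ID's column mean from its rows
X.within <- apply(X.rep, 2, function(col) col - ave(col, id.rep))
print(X.within)
```

With the second route, the within-transformed test matrix would then go to predict() without the multilevel argument.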

Kim-Anh