Projecting new samples onto a PCA / sPLS-DA space


Thank you for creating the mixOmics package, it has been really helpful in multiomics analysis. I was wondering if there was a way of projecting new samples into an existing sPLS-DA space. Let me elaborate on the question:

We have a sample processing methodology that is supposed to enrich for cancer-derived CNVs. Each patient that we analyze ends up being divided into four different samples, and only one of them should be more “cancer-like”. In addition, I have CNV data (read counts) from both healthy patients and cancer patients as my negative and positive controls. I performed both a PCA and an sPLS-DA on the controls to select for cancer-specific differences. They cluster perfectly without needing many features. Now I would like to project the patients that we processed into the sPLS-DA space that I created with my controls. What I’m expecting is that the more “cancer-like” samples will cluster closer to the cancer controls than the other three. Is there a way to do something along these lines?

Thanks, sorry if the question is a bit confusing.


Hi @montoyam,

Assuming that you observe perfect clustering using PCA, why don’t you try to create a PCA or sPCA model with all your data together in the first place? If the enriched samples cluster with the positive controls, and the non-enriched samples cluster with the negative controls, I would not go any further, as this is by far the most convincing way to demonstrate that your sample processing methodology works.
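A minimal sketch of that combined model; `cnv.controls` and `cnv.patients` are invented names for two count matrices with identical columns (CNV features), and the grouping vector is only for colouring the plot:

```r
library(mixOmics)

# Stack controls and patient samples into one matrix (assumed names).
X.all <- rbind(cnv.controls, cnv.patients)
group <- c(rep("control", nrow(cnv.controls)),
           rep("patient", nrow(cnv.patients)))

# Unsupervised PCA on everything together, then inspect the clustering.
pca.all <- pca(X.all, ncomp = 2, center = TRUE, scale = TRUE)
plotIndiv(pca.all, group = group, ind.names = FALSE, legend = TRUE)
```

If the enriched patient samples fall with the positive controls on this plot, no projection step is needed at all.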

Otherwise, see this example using the predict function:

library(mixOmics)
data(liver.toxicity)  # example dataset shipped with mixOmics

X <- liver.toxicity$gene
Y <- as.factor(liver.toxicity$treatment[, 4])

set.seed(42)  # for a reproducible train/test split
samp <- sample(1:4, nrow(X), replace = TRUE)
test <- which(samp == 1)
train <- setdiff(1:nrow(X), test)

# fit on the training samples, then project the held-out samples
plsda.train <- plsda(X[train, ], Y[train], ncomp = 2)
plsda.predict <- predict(plsda.train, X[test, ], dist = "max.dist")

# decision boundaries for the plot background
background <- background.predict(plsda.train, comp.predicted = 2)

plotIndiv(plsda.train, comp = 1:2, rep.space = "X-variate", style = "graphics",
          ind.names = FALSE, background = background, pch = 1)

# overlay the projected test samples and label them
points(plsda.predict$variates[, 1], plsda.predict$variates[, 2], pch = 4, cex = 1)
text(plsda.predict$variates[, 1], plsda.predict$variates[, 2],
     rownames(plsda.predict[["MajorityVote"]][["max.dist"]]), pos = 1, cex = 0.7)

In your case, you would simply skip the data splitting and use the four samples from each patient as the test dataset instead. I am curious to see whether this works, since the response variables in the training and test datasets are comparable, yet not quite the same thing.
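Adapted to the use case above, the call would look roughly like this; `splsda.controls` (the sPLS-DA model fitted on the controls) and `cnv.patients` (the new patient samples, same columns in the same order as the training data) are assumed names:

```r
# Project the new patient samples through the control-trained model.
patient.predict <- predict(splsda.controls, cnv.patients, dist = "max.dist")

# Coordinates of the new samples in the model's variate space,
# ready to overlay on the controls' plotIndiv output:
head(patient.predict$variates)

# Predicted class per new sample under the maximum-distance rule:
patient.predict$class$max.dist
```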

  • Christopher

Hi Christopher,

It worked nicely! Thank you so much for the help. I had to play around with the data because I was running into the “system is computationally singular” error, but I suspected it was due to multicollinearity from a large matrix with many similar values. After reducing my dataset to only the differential variables, everything worked.
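For anyone hitting the same singularity error, one simple way to reduce such collinearity is a variance filter before refitting. A base-R sketch, where `X` is an assumed name for the full count matrix and the 25% cutoff is purely illustrative:

```r
# Drop near-constant variables; keep only the most variable quarter.
vars <- apply(X, 2, var)                        # per-variable variance
X.filtered <- X[, vars > quantile(vars, 0.75)]  # arbitrary threshold
```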

Thanks for the help,