Project method for (si)pca and consistent element names for pca and (si)pca

Dear mixOmics developers,

Many thanks for putting so much good work in an excellent R package, it really streamlined the visualisation and analysis of my mass spec data! :slight_smile:

A lot of my work deals with finding good models for prediction of new, unknown samples, and after comparing many classification algorithms, PLS-DA is among the best algorithms (the best I found for my data so far is actually a kernel version of PLS-DA provided by the rchemo package, but it’s hard to interpret the kernel hyperparameter and the gain in accuracy isn’t very high).

But before classifying new data, I like to compare it to known data in an unguided, model-free (pca) way. Of course, one could simply do another PCA with the old and new data combined, but sometimes one does not want to create a new PCA space with completely unknown (and perhaps very different) samples.

Therefore, would it be meaningful to add a project() function (or predict method) for pca objects, which takes the pca object (which contains the rotation matrix) and new data as arguments, and returns a matrix of dimension [n_new, ncomp] with the scores of the new data?

Usually, it’s not hard to project new data into a PCA space defined by known data, but centering and scaling makes it a bit error-prone. Using prcomp(), a projection of new data into an existing PCA space would look like this:

res.pca <- prcomp(Xtrain, scale. = TRUE, center = TRUE) # or scale. = FALSE
res.new <- as.matrix(scale(Xnew, scale = res.pca$scale, center = res.pca$center)) %*% res.pca$rotation

Now, res.pca$x and res.new can be plotted and interpreted together, and completely model-free. I’m not sure whether this would also be as straightforward in the case of spca and ipca. In any case, the rchemo package has a generic transform() method, which projects new data using a given rotation matrix (fitted model) of various kinds (pca, pls, fda, …).
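For plain prcomp() objects, base R already ships a predict() method that applies the stored centering and scaling before rotating, so the manual projection above can be cross-checked against it. The sketch below uses simulated data (Xtrain and Xnew here are made-up matrices, not real mass spec data) and only shows that the two routes agree:

```r
# Simulated stand-in data: 40 training and 20 new samples, 5 variables
set.seed(1)
X <- matrix(rnorm(60 * 5), nrow = 60, ncol = 5,
            dimnames = list(NULL, paste0("V", 1:5)))
Xtrain <- X[1:40, ]
Xnew   <- X[41:60, ]

res.pca <- prcomp(Xtrain, center = TRUE, scale. = TRUE)

# Manual projection: apply the training centering/scaling, then rotate
res.manual <- scale(Xnew, center = res.pca$center, scale = res.pca$scale) %*%
  res.pca$rotation

# Built-in equivalent for prcomp objects
res.predict <- predict(res.pca, newdata = Xnew)

# Both routes should give the same scores for the new samples
all.equal(res.manual, res.predict, check.attributes = FALSE)
```

This only covers stats::prcomp(); whether the same two-step recipe carries over to spca and ipca objects is exactly the open question raised here.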

I noted that the centering and scaling vectors (or FALSE) are returned in pca objects ($center and $scale), but not in spca or ipca objects (admittedly, centering is part of Algorithm 1 of ipca). Instead, there appears to be a copy of the scores matrix (object$x and object$variates$X) for ipca and spca objects, but not for pca objects, which store it only in object$variates$X. Also, the rotation matrix is stored in $loadings$X for pca objects, but twice, in $rotation and $loadings$X, for spca and ipca objects. I found it a bit hard to find all the corresponding matrices in the various object classes.

Therefore, my suggestions for improvement would be:

  • return the scaling and centering vectors (or FALSE) also in spca and ipca (and sipca) objects
  • check consistency of list elements of related object classes, especially whether it’s needed to have the scores matrix ($x and $variates$X) or rotation matrix ($rotation and $loadings$X) returned twice in spca and ipca objects, since these can be heavy if ncomp is large.
  • (if meaningful) write a project.pca() or predict,pca-method that projects new data into an existing projection space, using the appropriate centering and scaling, where applicable (pca, spca? ipca?, …).
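To make the last suggestion concrete, here is a rough sketch of what such a helper could look like. The name project_pca() is hypothetical and not part of mixOmics. It assumes the object carries $loadings$X, $center and $scale as described above for pca objects, and it is run on a stand-in list built from prcomp() so that it works without mixOmics installed:

```r
# Hypothetical helper (not a mixOmics function): project new samples into
# an existing PCA space. Assumes `object` is a list holding
#   $loadings$X : p x ncomp rotation matrix
#   $center     : numeric vector of length p, or FALSE
#   $scale      : numeric vector of length p, or FALSE
project_pca <- function(object, newdata) {
  newdata  <- as.matrix(newdata)
  rotation <- object$loadings$X
  # Fail loudly if centering/scaling info is absent, rather than guessing
  if (is.null(object$center) || is.null(object$scale)) {
    stop("object must provide $center and $scale (vectors or FALSE)")
  }
  if (ncol(newdata) != nrow(rotation)) {
    stop("newdata must have the same number of variables as the training data")
  }
  # Same preprocessing as the training data, then rotate
  scale(newdata, center = object$center, scale = object$scale) %*% rotation
}

# Stand-in object mimicking the assumed layout, keeping 2 components
set.seed(1)
X    <- matrix(rnorm(50 * 4), nrow = 50)
fit  <- prcomp(X[1:30, ], center = TRUE, scale. = TRUE)
mock <- list(loadings = list(X = fit$rotation[, 1:2]),
             center = fit$center, scale = fit$scale)

new.scores <- project_pca(mock, X[31:50, ])
dim(new.scores)  # n_new x ncomp
```

The explicit stop() on missing $center or $scale is a deliberate choice: for object classes that do not return these elements, silently skipping the preprocessing would give plausible-looking but wrong scores.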

Many thanks again for the package,
Simon


Simon Crameri, Ph.D.

ETH Zurich
Plant Ecological Genetics
Institute of Integrative Biology (IBZ)
Department of Environmental Systems Science
Universitätstrasse 16, CHN G27
8092 Zurich, Switzerland

Hi @Simone,

Thank you very much for your detailed post and for sharing your suggestions and thoughts. We’ve been gradually changing some of these functions and have had to prioritise given the limited resources we have. We’ll certainly take these on board in future developments.

I opened two new issues to ensure we have these on our radar (feature request: projection functions to project new data into exisiting pca models · Issue #154 · mixOmicsTeam/mixOmics · GitHub and pca family to have more consistent output values · Issue #153 · mixOmicsTeam/mixOmics · GitHub)

Thanks again

Al

Hi @aljabadi Great, and many thanks for creating the new issues on github. Given that resources are limited, let me know if you’d like me to propose e.g. a predict.pca() function (accessible via the generic predict function) that takes a pca-object as an input and outputs the scores matrix. Regarding ipca, I would first need to know how it works in detail to be of any help.

Hi @scrameri,

My pleasure. Actually, that would be most welcome! We always welcome contributions of any form from the amazing community. Be it development or updating the documentation.