Dear mixOmics developers,
Many thanks for putting so much good work into an excellent R package. It has really streamlined the visualisation and analysis of my mass spec data!
A lot of my work deals with finding good models for predicting new, unknown samples. After comparing many classification algorithms, I found PLS-DA to be among the best (the best I have found for my data so far is actually a kernel version of PLS-DA provided by the rchemo package, but the kernel hyperparameter is hard to interpret and the gain in accuracy isn’t very high).
But before classifying new data, I like to compare it to known data in an unguided, model-free (PCA) way. Of course, one could simply run another PCA on the old and new data combined, but sometimes one does not want to create a new PCA space with completely unknown (and perhaps very different) samples.
Therefore, would it be meaningful to add a project() function (or predict method) for pca objects, which takes the pca object (which contains the rotation matrix) and new data as arguments, and returns a matrix of dimension [n_new, ncomp] with the scores of the new data?
Usually, it’s not hard to project new data into a PCA space defined by known data, but centering and scaling make it a bit error-prone. Using prcomp(), a projection of new data into an existing PCA space would look like this:
res.pca <- prcomp(Xtrain, scale. = TRUE, center = TRUE) # or scale. = FALSE
res.new <- as.matrix(scale(Xnew, scale = res.pca$scale, center = res.pca$center)) %*% res.pca$rotation
res.new can then be plotted and interpreted together with the scores of the known data (res.pca$x), completely model-free. I’m not sure whether this would be as straightforward in the case of spca and ipca. In any case, the rchemo package has a generic transform() method, which projects new data using a given rotation matrix (fitted model) of various kinds (pca, pls, fda, …).
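For plain prcomp objects, base R already ships a predict() method that applies the stored centering and scaling before rotating, so it can serve as a cross-check for the manual projection. The sketch below uses the built-in iris data purely as a stand-in for Xtrain and Xnew; the split and the seed are illustrative assumptions, not part of my actual workflow.

```r
# Illustrative split of iris into "known" and "new" samples (stand-in data)
set.seed(1)
idx <- sample(nrow(iris), 100)
Xtrain <- as.matrix(iris[idx, 1:4])
Xnew <- as.matrix(iris[-idx, 1:4])

res.pca <- prcomp(Xtrain, center = TRUE, scale. = TRUE)

# Manual projection, reusing the training centering and scaling vectors
res.new <- scale(Xnew, center = res.pca$center, scale = res.pca$scale) %*% res.pca$rotation

# predict() for prcomp objects (stats package) performs the same steps
res.pred <- predict(res.pca, newdata = Xnew)

# Both routes should agree, and the result has one row per new sample
all.equal(res.new, res.pred, check.attributes = FALSE)
dim(res.new)
```

Something equivalent for mixOmics pca objects is what I have in mind with the request above.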
I noted that the centering and scaling vectors (or FALSE) are returned in pca objects ($scale), but not in ipca objects (admittedly, centering is part of the algorithm 1 of ipca). Instead, there appears to be a copy of the scores matrix (object$variates$X) for ipca and spca objects, but not for pca objects, which store it in object$variates$X. Also, the rotation matrix is stored in $loadings$X for pca objects, but twice in $loadings$X in spca and ipca objects. I found it a bit hard to find all the corresponding matrices in the various object classes.
Therefore, my suggestions for improvement would be:
- return the scaling and centering vectors (or FALSE) also in spca and ipca (and sipca) objects
- check consistency of list elements of related object classes, especially whether it’s needed to have the scores matrix ($variates$X) or rotation matrix ($loadings$X) returned twice in spca and ipca objects, since these can be heavy if ncomp is large.
- (if meaningful) write a predict,pca-method that projects new data into an existing projection space, using the appropriate centering and scaling, where applicable (pca, spca? ipca?, …).
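To make the last suggestion more concrete, here is a minimal sketch of what such a projection helper could look like. The function name project_scores() and its arguments are hypothetical and not part of mixOmics; it only assumes that a loadings (rotation) matrix and the centering and scaling vectors (or FALSE) are available. I demonstrate it on a prcomp fit, since whether the same simple product is appropriate for spca or ipca is exactly what I am unsure about.

```r
# Hypothetical helper (not a mixOmics function): project new samples into an
# existing component space defined by a loadings (rotation) matrix.
project_scores <- function(newdata, loadings, center = FALSE, scale = FALSE) {
  newdata <- as.matrix(newdata)
  # Align variables by name when both sides carry names
  if (!is.null(colnames(newdata)) && !is.null(rownames(loadings))) {
    stopifnot(all(rownames(loadings) %in% colnames(newdata)))
    newdata <- newdata[, rownames(loadings), drop = FALSE]
  }
  stopifnot(ncol(newdata) == nrow(loadings))
  # base::scale() accepts a numeric vector or FALSE for center and scale
  scale(newdata, center = center, scale = scale) %*% loadings
}

# Demonstration on built-in data with a prcomp fit (centered, not scaled)
fit <- prcomp(USArrests[1:40, ], center = TRUE, scale. = FALSE)
new_scores <- project_scores(USArrests[41:50, ], fit$rotation[, 1:2],
                             center = fit$center, scale = fit$scale)
dim(new_scores)  # n_new rows, ncomp columns

# Assumed usage with a fitted mixOmics pca object 'obj' (untested; the
# $center field name is my assumption, $loadings$X and $scale are as above):
# project_scores(Xnew, obj$loadings$X, center = obj$center, scale = obj$scale)
```

Passing FALSE through unchanged is deliberate: it is the reason I would like the centering and scaling elements (or FALSE) returned consistently across object classes.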
Many thanks again for the package,
Simon Crameri, Ph.D.
Plant Ecological Genetics
Institute of Integrative Biology (IBZ)
Department of Environmental Systems Science
Universitätstrasse 16, CHN G27
8092 Zurich, Switzerland