Dear mixOmics developers,
Many thanks for putting so much good work into an excellent R package; it has really streamlined the visualisation and analysis of my mass spec data!
A lot of my work deals with finding good models for prediction of new, unknown samples, and after comparing many classification algorithms, PLS-DA is among the best algorithms (the best I found for my data so far is actually a kernel version of PLS-DA provided by the rchemo package, but it's hard to interpret the kernel hyperparameter and the gain in accuracy isn't very high).
But before classifying new data, I like to compare it to known data in an unguided, model-free (PCA) way. Of course, one could simply do another PCA with the old and new data combined, but sometimes one does not want to create a new PCA space with completely unknown (and perhaps very different) samples.
Therefore, would it be meaningful to add a project() function (or predict method) for pca objects, which takes the pca object (which contains the rotation matrix) and new data as arguments, and returns a matrix of dimension [n_new, ncomp] with the scores of the new data?
Usually, it's not hard to project new data into a PCA space defined by known data, but centering and scaling make it a bit error-prone. Using prcomp(), a projection of new data into an existing PCA space would look like this:
res.pca <- prcomp(Xtrain, scale. = TRUE, center = TRUE) # or scale. = FALSE
res.new <- as.matrix(scale(Xnew, scale = res.pca$scale, center = res.pca$center)) %*% res.pca$rotation
Now, res.pca$x and res.new can be plotted and interpreted together, and completely model-free. I'm not sure whether this would also be as straightforward in the case of spca and ipca. In any case, the rchemo package has a generic transform() method, which projects new data using a given rotation matrix (fitted model) of various kinds (pca, pls, fda, …).
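As a sanity check on the manual projection above, here is a minimal sketch in base R that compares it with the predict() method that the stats package provides for prcomp objects. The iris data and the train/new split are only illustrative assumptions, not my actual data.

```r
# Toy data (assumption for illustration): split iris into "known" and "new" samples
X <- as.matrix(iris[, 1:4])
set.seed(1)
idx <- sample(nrow(X), 100)
Xtrain <- X[idx, ]
Xnew <- X[-idx, ]

res.pca <- prcomp(Xtrain, center = TRUE, scale. = TRUE)

# Manual projection: apply the training centering/scaling, then rotate
res.manual <- scale(Xnew, center = res.pca$center, scale = res.pca$scale) %*% res.pca$rotation

# Built-in projection via the predict() method for prcomp objects
res.predict <- predict(res.pca, newdata = Xnew)

# Both approaches should agree up to numerical tolerance
all.equal(res.manual, res.predict, check.attributes = FALSE)
```

The point is only that the centering and scaling vectors have to come from the training fit, never from the new data.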
I noted that the centering and scaling vectors (or FALSE) are returned in pca objects ($center and $scale), but not in spca or ipca objects (admittedly, centering is part of the algorithm 1 of ipca). Instead, there appears to be a copy of the scores matrix (object$x and object$variates$X) for ipca and spca objects, but not for pca objects, which store it only in object$variates$X. Also, the rotation matrix is stored in $loadings$X for pca objects, but twice, in $rotation and $loadings$X, in spca and ipca objects. I found it a bit hard to find all the corresponding matrices in the various object classes.
Therefore, my suggestions for improvement would be:
- return the scaling and centering vectors (or FALSE) also in spca and ipca (and sipca) objects
- check consistency of list elements of related object classes, especially whether it's needed to have the scores matrix ($x and $variates$X) or rotation matrix ($rotation and $loadings$X) returned twice in spca and ipca objects, since these can be heavy if ncomp is large.
- (if meaningful) write a project.pca() or predict,pca-method that projects new data into an existing projection space, using the appropriate centering and scaling, where applicable (pca, spca? ipca?, …).
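To make the last suggestion concrete, here is a rough sketch of what such a helper could look like. The name project_pca() is hypothetical and not part of mixOmics. It assumes the object stores $center and $scale (numeric vectors or FALSE) and the rotation matrix in $loadings$X, as described above for pca objects. So that the sketch runs with base R only, the demo uses a stand-in list built from prcomp() rather than a real mixOmics object.

```r
# Hypothetical helper (not part of mixOmics): project new samples into an
# existing PCA space. Assumes `object` has $center, $scale and $loadings$X.
project_pca <- function(object, newdata) {
  newdata <- as.matrix(newdata)
  rotation <- object$loadings$X
  if (ncol(newdata) != nrow(rotation)) {
    stop("newdata must have the same variables as the training data")
  }
  # scale() accepts either a numeric vector or FALSE for both arguments
  scaled <- scale(newdata, center = object$center, scale = object$scale)
  scores <- scaled %*% rotation
  colnames(scores) <- colnames(rotation)
  scores
}

# Stand-in object built from prcomp() so the sketch runs without mixOmics
X <- as.matrix(iris[, 1:4])
fit <- prcomp(X[1:100, ], center = TRUE, scale. = TRUE)
mock <- list(center = fit$center, scale = fit$scale,
             loadings = list(X = fit$rotation[, 1:2]))

scores_new <- project_pca(mock, X[101:150, ])
dim(scores_new)  # n_new rows, ncomp columns
```

Whether the same recipe carries over to spca and ipca objects is exactly the open question; the sketch only covers the plain pca case.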
Many thanks again for the package,
Simon
Simon Crameri, Ph.D.
ETH Zurich
Plant Ecological Genetics
Institute of Integrative Biology (IBZ)
Department of Environmental Systems Science
Universitätstrasse 16, CHN G27
8092 Zurich, Switzerland