Customize plotIndiv() plot using plsda with mixOmics

Hello folks,

I am very new to mixOmics but have found it very useful for running PLS-DA. I have run the code on my data and tried to customize the plots using ggplot; however, I found it difficult to customize plotIndiv() and plotVar(). Here is my code and data. The data are grouped into two subgroups, cluster1 and cluster2. The variables are the elements. The dataset has been uploaded to Google Drive; here is the link to the data:
data used for plsda

My goal is to generate figures like these plots from the literature. Can anyone help me modify my code to get something like this? Thank you.
The red labels in the plot are the labeled groups, how can I add my clusters into my variable plot?

Simon

library(mixOmics)
library(ggplot2)
library(ggrepel)
plsda_data<-read.csv("plsda_data.csv",header = TRUE, sep = ",", stringsAsFactors = FALSE)
#set up the data as X expression matrix and Y as factor
X<-plsda_data[1:252, 2:26]
Y<-plsda_data$Cluster
summary(Y)
dim(X)
#PLSDA analysis 
data.plsda <- plsda(X, Y, ncomp = 25)  
plotIndiv(data.plsda , 
               comp = c(1,2),
               group = plsda_data$Cluster, 
               ind.names = FALSE, 
               ellipse = FALSE, 
               legend = TRUE, 
               title = 'PLSDA results',
               X.label = 'PLS-DA 1',
               Y.label = 'PLS-DA 2')
#output the variables
plotVar(data.plsda,
        comp=c(1,2),
        var.names =TRUE,
        legend = FALSE,
        plot=TRUE,
        overlap = TRUE,
        style='ggplot2',
        rad.in = 0,
        cutoff = 0,
        cex = 8,
        font = 2,
        col = 4,
        pch = 21,
        title = "PLS-DA results for variable correlations")

There are a good few things I should comment on here:

  1. The code you provided was not valid for the data. I’m not sure if these were just typos, but the code results in quite a few errors. E.g., plsda_data has 25 columns, yet you try to slice 2:26. Also, you don’t slice Y at all despite slicing X to only contain rows 1:252. The code you provided would not run as is, and you should check it for these inconsistencies.

  2. ncomp=25 is fairly unnecessary. I’m not sure why you’d do this, as it seems you are only visualising the first two components. Additionally, the maximum it should allow in this context is 23, as there are 23 predictor features.

  3. Looking at your data, there are three clusters, not two.

  4. The data provided is wildly different to the data you show in the above plots. I’m not sure if this is relevant or not.

  5. Your code (with adjustments) will successfully colour each sample by its cluster label within plotIndiv(). You cannot colour the variables in plotVar() by cluster as you are plotting the features, not the samples.

I hope all this clarifies some usage of R and the mixOmics package. Below is a version of your code which will actually run.

Cheers,
Max.

library(mixOmics)

plsda_data <- read.csv("C:/Users/Work/Desktop/plsda_data.csv",header = TRUE, sep = ",", stringsAsFactors = FALSE)

X<- plsda_data[1:252, 2:24]
Y<-plsda_data$Cluster[1:252]
table(Y)
#> Y
#> Cluster1 Cluster2 Cluster3 
#>      100      139       13
dim(X)
#> [1] 252  23

data.plsda <- plsda(X, Y, ncomp = 23) 

plotIndiv(data.plsda , 
               comp = c(1,2),
               group = plsda_data$Cluster[1:252], 
               ind.names = FALSE, 
               ellipse = FALSE, 
               legend = TRUE, 
               title = 'PLSDA results')

plotVar(data.plsda,
        comp=c(1,2),
        var.names =TRUE,
        legend = FALSE,
        plot=TRUE,
        overlap = TRUE,
        style='ggplot2',
        rad.in = 0,
        cutoff = 0,
        cex = 8,
        font = 2,
        col = 4,
        pch = 21,
        title = "PLS-DA results for variable correlations")

Created on 2022-06-15 by the reprex package (v2.0.1)

Hello Max,

Thank you very much for your response. In fact, I made a mistake when uploading my data. Here is the dataset for the code that I provided.

plsda_data_test
I just set ncomp = 25 to compute every component; usually, the first two components show the best discrimination.

I can generate the plots you showed using plotIndiv() and plotVar(), but my problems now are:

(1) How can I customize the plot, for example to show only t1 (16%) and t2 (12%) in the plotIndiv() plots? If I am understanding PLS-DA correctly, plotIndiv() shows the score plot while plotVar() shows the loadings. So is it possible for me to extract the scores and loadings from the plsda result?
(2) Is it possible to use ggplot to customize the labels and text size with plotVar() or plotIndiv()?

Thank you very much,

Cheers,
Simon

(1) Unfortunately, your understanding is a bit off. plotIndiv() uses the variates to plot each sample’s projection into the latent space. plotVar() does not use the loadings; rather, it plots the correlation of each feature with the two displayed components (it’s called a Correlation Circle Plot). The output of the plsda() function has a $loadings and a $variates component, which is what you’re looking for.
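As a rough sketch of what that extraction could look like (not the only way to do it): the built-in iris data stands in for the CSV from this thread, and the names fit, scores, loadings, circle, t1 and t2 are made up for illustration. The explained-variance field is an assumption that depends on the mixOmics version ($prop_expl_var in recent releases, $explained_variance in older ones), so the code tries both.

```r
library(mixOmics)
library(ggplot2)

# Stand-in data (assumption): iris replaces the plsda_data.csv from the thread
X <- iris[, 1:4]
Y <- iris$Species

fit <- plsda(X, Y, ncomp = 2)

# Sample scores: the variates that plotIndiv() draws
scores <- as.data.frame(fit$variates$X[, 1:2])
names(scores) <- c("t1", "t2")
scores$Cluster <- Y

# Loading weights of each feature on the two components
loadings <- as.data.frame(fit$loadings$X[, 1:2])

# Correlation of each feature with the two components,
# i.e. the quantity shown in the correlation circle plot
circle <- cor(X, fit$variates$X[, 1:2])

# Proportion of explained variance; the field name depends on the version
ev <- fit$prop_expl_var$X
if (is.null(ev)) ev <- fit$explained_variance$X

# Custom score plot built directly with ggplot2
p <- ggplot(scores, aes(x = t1, y = t2, colour = Cluster)) +
  geom_point(size = 2) +
  labs(x = sprintf("t1 (%.0f%%)", 100 * ev[1]),
       y = sprintf("t2 (%.0f%%)", 100 * ev[2]),
       title = "PLS-DA scores") +
  theme_bw(base_size = 14)
print(p)
```

Once the scores are in an ordinary data frame, every ggplot2 layer and theme option is available, which sidesteps the limits of the plotIndiv() arguments.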

(2) If you read the documentation for plotVar() and plotIndiv(), you’ll find the answer to your question. Look at the size.xlabel, size.ylabel and cex parameters.
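For illustration, a minimal sketch of those parameters in use, again with iris standing in for the thread's data. The specific sizes are arbitrary, and the axis labels "t1" and "t2" are placeholders. The last step assumes that plotIndiv() invisibly returns a list whose $graph element is a ggplot object when the default ggplot2 style is used; check ?plotIndiv for your installed version before relying on it.

```r
library(mixOmics)
library(ggplot2)

# Stand-in data (assumption): iris replaces the plsda_data.csv from the thread
fit <- plsda(iris[, 1:4], iris$Species, ncomp = 2)

# Text sizes and point size set through plotIndiv()'s own arguments
out <- plotIndiv(fit,
                 comp = c(1, 2),
                 ind.names = FALSE,
                 legend = TRUE,
                 title = "PLSDA results",
                 X.label = "t1",
                 Y.label = "t2",
                 size.title = 14,
                 size.xlabel = 12,
                 size.ylabel = 12,
                 size.axis = 10,
                 size.legend = 10,
                 cex = 3)

# Label size and font face of the variable names in the correlation circle
plotVar(fit,
        comp = c(1, 2),
        var.names = TRUE,
        cex = 4,
        font = 2,
        title = "PLS-DA results for variable correlations")

# Assumption: out$graph is a ggplot object that accepts further ggplot2 layers
if (inherits(out$graph, "ggplot")) {
  print(out$graph + theme(legend.position = "bottom"))
}
```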