Setting up data for PLS

Hello,
I was hoping you could help with how to properly set up data for PLS analysis. I am trying to run a simple analysis using gene expression data from two groups of samples (control vs case). While I can get through the tutorial OK, when I try to run it with my own data, I have issues with the X and Y data matrices. Could you please direct me to an actual example of how the data should be set up so that sPLS-DA reads my files appropriately? I have tried viewing the srbct dataset and modeling my data after the example, with no luck.

Thank you!

Hi @wilk0211,

X should be a numeric matrix with unique gene identifiers as column names and sample identifiers as row names. Y should be a vector (or factor) giving the class of each sample, in the same order as the rows of X. If you share the error, a screenshot and/or the code, I can help you identify the problem.
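
As a rough illustration of that layout, here is a minimal base R sketch. The sample names, gene names and values are made-up toy data, not taken from any real dataset, and the checks at the end are only one way of confirming that the shapes agree.

```r
# Toy expression matrix (hypothetical values): samples in rows, genes in columns.
X <- matrix(
  c(893, 64,
    354, 73,
    910, 58,
    340, 80),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    c("Control1", "Case1", "Control2", "Case2"),  # row names = sample identifiers
    c("gene1", "gene2")                           # column names = unique gene identifiers
  )
)

# One class label per sample, in the same order as the rows of X.
Y <- factor(c("Control", "Case", "Control", "Case"))

# Sanity checks before passing X and Y to an analysis function.
stopifnot(is.matrix(X), is.numeric(X))   # X is a numeric matrix
stopifnot(!anyDuplicated(colnames(X)))   # gene identifiers are unique
stopifnot(length(Y) == nrow(X))          # one class per sample
```

If any of these checks fails on your own data, that is probably the first thing to fix.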

  • Christopher

Thank you so much. I really appreciate the response. I will try to execute the procedure again in the next couple of days and keep you posted.

Thank you!

Jordan

OK, so I have tried a few different approaches and still seem to have an issue. In RStudio, I am importing two Excel datasets:
X is genes and identifiers:

          gene1  gene2
Control1    893     64
Case1       354     73

Y is class:
x
Control
Case

Here are the errors I get:
[screenshot of the errors; image not available]

I have also gotten:

MyResult.splsda <- splsda(X, Y, keepX = c(50,50)) # 1 Run the method
Error in `[<-`(`*tmp*`, classification == groups[j], j, value = 1) :
  subscript out of bounds

Thanks again for the help.

Hi @wilk0211,

It seems there is something wrong with Y. It should be a factor or class vector with the same length as the number of samples, but this is not the case when you assign RNA_PLS_Y to Y. Doing it manually as you did in the lower lines, Y <- c("control", "control" .....), should work, which tells me that there is something wrong with X as well. Try the import function in RStudio (File → Import Dataset → From Excel), make sure to set “First Row as Names”, and check that all the columns are numeric.
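
An imported Excel sheet arrives in R as a data frame rather than a matrix or vector, so it usually needs converting first. Below is a minimal sketch of that conversion. It assumes the sample identifiers ended up in the first column of the X sheet and that the Y sheet has a single class column. RNA_PLS_X is a hypothetical name chosen to mirror RNA_PLS_Y, and both data frames here are hand-built stand-ins with made-up values rather than real imports.

```r
# Stand-ins for the two imported sheets (hypothetical contents).
RNA_PLS_X <- data.frame(
  sample = c("Control1", "Case1", "Control2", "Case2"),
  gene1  = c(893, 354, 910, 340),
  gene2  = c(64, 73, 58, 80)
)
RNA_PLS_Y <- data.frame(x = c("Control", "Case", "Control", "Case"))

# Drop the identifier column, convert the rest to a matrix,
# and keep the sample identifiers as row names.
X <- as.matrix(RNA_PLS_X[, -1, drop = FALSE])
rownames(X) <- RNA_PLS_X[[1]]

# Extract the single class column as a factor (a data frame is not a vector).
Y <- factor(RNA_PLS_Y[[1]])

# X must be numeric, with one class label per sample.
stopifnot(is.numeric(X), length(Y) == nrow(X))

# With mixOmics loaded, the call from earlier in the thread would then be:
# MyResult.splsda <- splsda(X, Y, keepX = c(50, 50))
```

If is.numeric(X) comes back FALSE, at least one column was read in as text, which matches the advice above to check that all the columns are numeric.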

  • Christopher