Predict only predicts outcome for one of four samples

stepra · September 8, 2022, 10:09am

Hi,
I am fitting a model on 71% of my data, i.e. 13 samples, which is very few, I know. I then have 4 samples in my test set whose outcome I want to predict.
The model I get has an AUROC of 1 and 100% accuracy in predicting the outcome of the 13 samples used. This indicates serious overfitting to me. I thus find it essential to keep a few samples out and run predict on those. The code works perfectly fine for my X.train data:

pred=predict(MyResult.splsda.final, X.train)
Prediction <- pred$class$mahalanobis.dist[, 1] #using just comp1

giving me:

  PA.04   PA.06   PA.08   PA.09   PA.10   PA.12   PA.13   PA.14   PA.18   PA.35   PA.38 
 "R_Ep"  "R_Ep" "NR_Ep"  "R_Ep" "NR_Ep" "NR_Ep"  "R_Ep"  "R_Ep"  "R_Ep"  "R_Ep" "NR_Ep" 
  PA.39   PA.40 
 "R_Ep" "NR_Ep" 

but it does not work for my X.test:

pred=predict(MyResult.splsda.final, X.test)
Prediction <- pred$class$mahalanobis.dist[, 1]

giving me:

 PA.05  PA.23  PA.36  PA.37 
"R_Ep"     ""     ""     "" 

I do not understand why the other samples do not have a prediction. I have looked into the input file and I do not think that anything is wrong there. R correctly assigns the outcome when splitting the data:

prop.table(table(Resp.test$Epilepsy))

giving me:

NR_Ep  R_Ep 
  0.5   0.5

So it does not seem to be a format or spelling issue. I do not know where to look further and I hope you can help me solve this issue.
Thank you so much for your time!
/Stef

stepra · September 16, 2022, 6:55am

I know you guys are probably very busy! However, I have been stuck at this step for over a week now and really hope you can help me solve this, so I can move on with my analysis.
I very much appreciate your help @MaxBladen @kimanh.lecao @aljabadi
Best wishes,
Stef

MaxBladen · September 19, 2022, 10:31pm

I am unable to reproduce the error here. I receive the following results:

suppressMessages(library(mixOmics))
data(breast.TCGA)

train.samples <- c(1:6, 77:83)
X.train <- breast.TCGA$data.train$mirna[train.samples, ]
Y.train <- as.vector(breast.TCGA$data.train$subtype[train.samples])

test.samples <- c(7,8, 84, 85)
X.test <- breast.TCGA$data.train$mirna[test.samples, ]
Y.test <- as.vector(breast.TCGA$data.train$subtype[test.samples])

MyResult.splsda.final <- splsda(X.train, Y.train,
                                keepX = c(100,100)) # tried a range of values

# USING TRAINING DATA
pred=predict(MyResult.splsda.final, X.train) 
Prediction <- pred$class$mahalanobis.dist[, 1]
Prediction 
#>    A0FJ    A13E    A0G0    A0SX    A143    A0DA    A0B0    A18S    A0CS    A0EI 
#> "Basal" "Basal" "Basal" "Basal" "Basal" "Basal"  "LumA"  "LumA"  "LumA"  "LumA" 
#>    A0IO    A0T6    A1AU 
#>  "LumA"  "LumA"  "LumA"

# USING TESTING DATA
pred=predict(MyResult.splsda.final, X.test) 
Prediction <- pred$class$mahalanobis.dist[, 1]
Prediction
#>    A0B3    A0I2    A07Z    A0XS 
#>  "LumA" "Basal"  "LumA"  "LumA"

Created on 2022-09-20 with reprex v2.0.2

Session info

sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.1 (2022-06-23 ucrt)
#>  os       Windows 10 x64 (build 19044)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  English_United States.utf8
#>  ctype    English_United States.utf8
#>  tz       Australia/Sydney
#>  date     2022-09-20
#>  pandoc   2.18 @ D:/Programs/Work Programs/RStudio/bin/quarto/bin/tools/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package      * version  date (UTC) lib source
#>  assertthat     0.2.1    2019-03-21 [1] CRAN (R 4.2.1)
#>  BiocParallel   1.30.3   2022-06-07 [1] Bioconductor
#>  cli            3.3.0    2022-04-25 [1] CRAN (R 4.2.1)
#>  codetools      0.2-18   2020-11-04 [1] CRAN (R 4.2.1)
#>  colorspace     2.0-3    2022-02-21 [1] CRAN (R 4.2.1)
#>  corpcor        1.6.10   2021-09-16 [1] CRAN (R 4.2.0)
#>  DBI            1.1.3    2022-06-18 [1] CRAN (R 4.2.1)
#>  digest         0.6.29   2021-12-01 [1] CRAN (R 4.2.1)
#>  dplyr          1.0.9    2022-04-28 [1] CRAN (R 4.2.1)
#>  ellipse        0.4.3    2022-05-31 [1] CRAN (R 4.2.1)
#>  evaluate       0.16     2022-08-09 [1] CRAN (R 4.2.1)
#>  fansi          1.0.3    2022-03-24 [1] CRAN (R 4.2.1)
#>  fastmap        1.1.0    2021-01-25 [1] CRAN (R 4.2.1)
#>  fs             1.5.2    2021-12-08 [1] CRAN (R 4.2.1)
#>  generics       0.1.3    2022-07-05 [1] CRAN (R 4.2.1)
#>  ggplot2      * 3.3.6    2022-05-03 [1] CRAN (R 4.2.1)
#>  ggrepel        0.9.1    2021-01-15 [1] CRAN (R 4.2.1)
#>  glue           1.6.2    2022-02-24 [1] CRAN (R 4.2.1)
#>  gridExtra      2.3      2017-09-09 [1] CRAN (R 4.2.1)
#>  gtable         0.3.0    2019-03-25 [1] CRAN (R 4.2.1)
#>  highr          0.9      2021-04-16 [1] CRAN (R 4.2.1)
#>  htmltools      0.5.3    2022-07-18 [1] CRAN (R 4.2.1)
#>  igraph         1.3.4    2022-07-19 [1] CRAN (R 4.2.1)
#>  knitr          1.40     2022-08-24 [1] CRAN (R 4.2.1)
#>  lattice      * 0.20-45  2021-09-22 [1] CRAN (R 4.2.1)
#>  lifecycle      1.0.1    2021-09-24 [1] CRAN (R 4.2.1)
#>  magrittr       2.0.3    2022-03-30 [1] CRAN (R 4.2.1)
#>  MASS         * 7.3-58.1 2022-08-03 [1] CRAN (R 4.2.1)
#>  Matrix         1.4-1    2022-03-23 [1] CRAN (R 4.2.1)
#>  matrixStats    0.62.0   2022-04-19 [1] CRAN (R 4.2.1)
#>  mixOmics     * 6.20.0   2022-04-26 [1] Bioconductor (R 4.2.0)
#>  munsell        0.5.0    2018-06-12 [1] CRAN (R 4.2.1)
#>  pillar         1.8.1    2022-08-19 [1] CRAN (R 4.2.1)
#>  pkgconfig      2.0.3    2019-09-22 [1] CRAN (R 4.2.1)
#>  plyr           1.8.7    2022-03-24 [1] CRAN (R 4.2.1)
#>  purrr          0.3.4    2020-04-17 [1] CRAN (R 4.2.1)
#>  R.cache        0.16.0   2022-07-21 [1] CRAN (R 4.2.1)
#>  R.methodsS3    1.8.2    2022-06-13 [1] CRAN (R 4.2.0)
#>  R.oo           1.25.0   2022-06-12 [1] CRAN (R 4.2.0)
#>  R.utils        2.12.0   2022-06-28 [1] CRAN (R 4.2.1)
#>  R6             2.5.1    2021-08-19 [1] CRAN (R 4.2.1)
#>  rARPACK        0.11-0   2016-03-10 [1] CRAN (R 4.2.1)
#>  RColorBrewer   1.1-3    2022-04-03 [1] CRAN (R 4.2.0)
#>  Rcpp           1.0.9    2022-07-08 [1] CRAN (R 4.2.1)
#>  reprex         2.0.2    2022-08-17 [1] CRAN (R 4.2.1)
#>  reshape2       1.4.4    2020-04-09 [1] CRAN (R 4.2.1)
#>  rlang          1.0.4    2022-07-12 [1] CRAN (R 4.2.1)
#>  rmarkdown      2.16     2022-08-24 [1] CRAN (R 4.2.1)
#>  RSpectra       0.16-1   2022-04-24 [1] CRAN (R 4.2.1)
#>  rstudioapi     0.14     2022-08-22 [1] CRAN (R 4.2.1)
#>  scales         1.2.1    2022-08-20 [1] CRAN (R 4.2.1)
#>  sessioninfo    1.2.2    2021-12-06 [1] CRAN (R 4.2.1)
#>  stringi        1.7.8    2022-07-11 [1] CRAN (R 4.2.1)
#>  stringr        1.4.1    2022-08-20 [1] CRAN (R 4.2.1)
#>  styler         1.7.0    2022-03-13 [1] CRAN (R 4.2.1)
#>  tibble         3.1.8    2022-07-22 [1] CRAN (R 4.2.1)
#>  tidyr          1.2.0    2022-02-01 [1] CRAN (R 4.2.1)
#>  tidyselect     1.1.2    2022-02-21 [1] CRAN (R 4.2.1)
#>  utf8           1.2.2    2021-07-24 [1] CRAN (R 4.2.1)
#>  vctrs          0.4.1    2022-04-13 [1] CRAN (R 4.2.1)
#>  withr          2.5.0    2022-03-03 [1] CRAN (R 4.2.1)
#>  xfun           0.32     2022-08-10 [1] CRAN (R 4.2.1)
#>  yaml           2.3.5    2022-02-21 [1] CRAN (R 4.2.0)
#> 
#>  [1] D:/Programs/Work Programs/R-4.2.1/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

To help, I’d first need a reproducible example (as described here and in the banner). From there, we can potentially discuss me accessing your data and scripts to evaluate them for issues. Feel free to message me directly via the forum.

stepra · September 26, 2022, 3:45pm

Thank you @MaxBladen! I can indeed reproduce your code. I can also predict my train set without a problem. Just when it comes to the test set, it is not working. This is what I get from pred2=predict(MyResult.splsda.final, X.test)

But I have no NAs in my test set:
which(is.na(X.test))
integer(0)
So I am not sure why I get NaN as predict and variables.
Thank you very much for looking into this!
/Stef
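A few further checks might be worth running alongside the is.na() test above. This is only a sketch in base R, assuming X.train and X.test are numeric matrices with samples in rows and features in columns; the toy matrices below are made up for illustration and are not the data from this thread. Whether any of these checks explains the NaN output here is not known.

```r
# Toy stand-ins for X.train / X.test (hypothetical data, for illustration only).
set.seed(1)
X.train <- matrix(rnorm(13 * 5), nrow = 13,
                  dimnames = list(paste0("S", 1:13), paste0("V", 1:5)))
X.test <- matrix(rnorm(4 * 5), nrow = 4,
                 dimnames = list(paste0("T", 1:4), paste0("V", 1:5)))
X.train[, "V3"] <- 2        # a feature that is constant in the training set
X.test["T2", "V5"] <- Inf   # a non-finite value that is.na() does not flag

# 1. is.na() is TRUE for NA and NaN but FALSE for Inf / -Inf,
#    so is.finite() is the stricter test.
#    (For a data frame, convert with as.matrix() first.)
bad.cells <- which(!is.finite(X.test), arr.ind = TRUE)
bad.cells

# 2. Features with zero variance in the training set cannot be scaled
#    to unit variance, since that would mean dividing by zero.
zero.var <- colnames(X.train)[apply(X.train, 2, sd) == 0]
zero.var

# 3. The test set should carry the same features, in the same order,
#    as the training set.
same.cols <- identical(colnames(X.train), colnames(X.test))
same.cols

# 4. The test matrix should be stored as numbers, not characters.
is.numeric(X.test)
```

If the second check returns any feature names on real data, comparing the model with and without those features could show whether they matter.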

MaxBladen · September 26, 2022, 9:50pm

I don’t know what’s causing this. While I would usually offer to take your data and debug it for you, I’ve got a contract starting this week which will prevent me from doing any mixOmics work for a month or so.

The size and content of the B.hat component make no sense to me. My gut says this is where the issue is arising.

stepra · September 27, 2022, 5:55am

I understand @MaxBladen and I appreciate all the support you have given me so far. You helped me resolve several issues in the past and I am very grateful for it! Thanks to that, I have been able to publish one article with mixOmics analyses, a second is close to acceptance, and this is part of the third manuscript. I have also convinced my colleagues to use it.
Is there anyone else in the team who could take a look at this in the meantime? If so, could you tag that person here, please?
Best wishes,
PS: What is B.hat?

MaxBladen · September 27, 2022, 10:10pm

I’m really glad I’ve been able to help. It’s a fantastic package but can be a bit tricky to get into initially.

Unfortunately, I’m pretty much the only one who monitors the forums these days. I can ask some of my colleagues about your issue - but I don’t know when they’d get back to me.

B.hat are the regression coefficients used to generate the predicted values for the novel data. The first dimension represents each feature (995), the second dimension represents each level of your response vector and the third dimension represents each component.
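To make that layout concrete, here is a small sketch of inspecting an array shaped the way B.hat is described above (features x response levels x components). The array below is a made-up stand-in built in base R, not actual predict() output, and the dimension names are assumptions chosen for illustration.

```r
# Hypothetical stand-in for pred$B.hat:
# 6 features x 2 response levels x 2 components.
set.seed(1)
B.hat <- array(rnorm(6 * 2 * 2), dim = c(6, 2, 2),
               dimnames = list(paste0("V", 1:6),
                               c("NR_Ep", "R_Ep"),
                               c("dim1", "dim2")))
B.hat["V4", , ] <- NaN   # simulate one feature with undefined coefficients

dim(B.hat)       # number of features, response levels, components
B.hat[, , 1]     # coefficient matrix (features x levels) for component 1

# Names of features with any non-finite coefficient
# across response levels and components.
bad.features <- rownames(B.hat)[
  apply(B.hat, 1, function(b) any(!is.finite(b)))
]
bad.features
```

On real output, a first dimension that does not equal ncol(X.test), or any names returned by the last step, would be the places to look first.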
