Training and test samples

Hi there,
I am trying to implement sPLS-DA and DIABLO on my omics data and have a few questions that I hope you can help with.
First, I am using sPLS-DA to identify age-associated cognitive, methylation and transcriptomic alterations (I have categorized age into 3 groups: [50-60[, [60-70[ and [70,…[ ). I did not split my data into training and test sets because my goal is not to predict but simply to identify age-associated alterations.
Next, I want to use the selected features from the sPLS-DA models and integrate them in DIABLO. The idea is to test whether the age-associated alterations (methylation|transcriptomic) identified previously with sPLS-DA can discriminate individuals with higher and lower cognitive performance, and to look at the correlation between these variables.

Question 1: Is it correct to split my data set into training/test sets for the DIABLO model and predict cognitive performance if I used all samples in the sPLS-DA models?

Question 2: What is the rationale for selecting the number of folds? Does it depend on the number of samples? I have tested 5, 10 and leave-one-out and noted that the stability of the variable selection varies. I can't figure out which is more appropriate.
P.S. Number of samples|features per data set: methylation 41|734,668; transcriptomic 75|26,000; methylation and transcriptomic 35|roughly 200.

Question 3: Is it correct to use all features selected from the sPLS-DA in the DIABLO model, or should I only include the stable ones?

Question 4: I have very high classification error rates in parameter tuning (number of components and number of features). Can I still proceed with the analysis?

Thanks for your time.
Sonya

A few thoughts about your methodology:

  • How have you validated your sPLS-DA models without using them to predict novel samples? What metrics are you using?
  • With the sPLS-DA models, were you examining which features had the highest loadings? It’s a bit unclear what you mean by “alterations”.
  • Why are you using the selected features from sPLS-DA to generate a DIABLO model when you could select features based on an integrative approach (i.e. tune.block.splsda())?

Question 1

I would say if you are planning on using training and testing sets via DIABLO then you should recreate that same methodology with sPLS-DA prior to DIABLO application. Also, relating to my third point above, it would make a bit more sense to select your features using the tune.block.splsda() method (tuning DIABLO) rather than sPLS-DA.
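For reference, here is a minimal sketch of what that tuning could look like. It assumes your blocks are stored in a named list X.blocks with matching samples and that Y holds the age groups; the keepX grids and design weights are purely illustrative and should be adapted to your data:

library(mixOmics)

# illustrative grids of keepX values to test per block
test.keepX <- list(methylation   = c(5, 10, 20, 50),
                   transcriptome = c(5, 10, 20, 50))

# a weak design (0.1 off-diagonal) links the blocks while prioritising discrimination
design <- matrix(0.1, nrow = length(X.blocks), ncol = length(X.blocks),
                 dimnames = list(names(X.blocks), names(X.blocks)))
diag(design) <- 0

tune.res <- tune.block.splsda(X = X.blocks, Y = Y, ncomp = 2,
                              test.keepX = test.keepX, design = design,
                              validation = "Mfold", folds = 5, nrepeat = 10,
                              dist = "max.dist", measure = "BER")
tune.res$choice.keepX   # number of features to keep per block and component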

Question 2

First note: to run DIABLO, each of the blocks (data frames) needs to contain the same samples. Assuming the 35 samples common to your methylation and transcriptomic data are present in every block, you are capped at a sample size of 35 in DIABLO. Based on this, my recommendation would be to set folds = 5; each held-out fold will then contain 7 samples (leaving 28 for training), which should give you an idea of the general performance.
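As a rough sketch of how that 5-fold assessment could be run on a DIABLO model (again assuming X.blocks is the named list of blocks restricted to the 35 shared samples and Y the class labels; keepX and nrepeat are illustrative):

library(mixOmics)

diablo.res <- block.splsda(X = X.blocks, Y = Y, ncomp = 2,
                           keepX = list(methylation   = c(20, 20),
                                        transcriptome = c(20, 20)))

# repeated 5-fold cross-validation to stabilise the error estimates
perf.diablo <- perf(diablo.res, validation = "Mfold", folds = 5,
                    nrepeat = 50, progressBar = FALSE)
plot(perf.diablo)   # classification error rates per component and distance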

Question 3

How are you measuring stability? Again, I’d recommend building your DIABLO model using DIABLO methods rather than sPLS-DA. While the former is an extension of the latter, you will get the best results by examining all blocks simultaneously.
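If it helps, this is a minimal sketch of how selection stability is commonly inspected in mixOmics, via perf() on a fitted sPLS-DA model (splsda.res is assumed to come from splsda(); the reported frequencies are the proportion of cross-validation runs in which each feature was selected):

perf.splsda <- perf(splsda.res, validation = "Mfold", folds = 5,
                    nrepeat = 100, progressBar = FALSE)
head(perf.splsda$features$stable[[1]])   # selection frequency per feature, component 1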

Question 4

I don’t know what your goals are so I can’t say for sure whether you’re fine to continue or not. I would recommend some hesitation if you have really high error rates.

Good morning Max,
Thanks for your reply.

I am validating sPLS-DA by looking at the stability of selected features. Is this ok?

With the sPLS-DA I want to select the features that best discriminate the 3 age groups; then I proceed with DIABLO to see how these features correlate to discriminate between two cognitive groups (high and low cognitive performers). Basically, how age-related alterations correlate to discriminate higher and lower cognitive performers. Does that make sense?

But let's say I just look at the integrative part with DIABLO: I would have 35 samples (19/10/16 per group), 1,300 SNPs, 734,668 CpGs and 27,000 transcriptomic features.

Would you still recommend a 5 fold cross validation?

Would it be OK to integrate all the features without any feature reduction, or should I first try to reduce the number of features before integrating with DIABLO?

I was trying to use the nearZeroVar() function on the methylation and transcriptomic data but I'm having some issues:

zero <- nearZeroVar(metil)

zero
$Position
integer(0)

$Metrics
[1] freqRatio     percentUnique
<0 rows> (or 0-length row.names)

When I run tune.splsda() with nearZeroVar():

list.keepX <- c(5:10, seq(20, 100, 10))  # grid of possible keepX values that will be tested for each component

list.keepX
[1] 5 6 7 8 9 10 20 30 40 50 60 70 80 90 100

tune.splsda.srbct <- tune.splsda(metil, Y, ncomp = 1,
                                 validation = 'loo', dist = 'max.dist', progressBar = FALSE,
                                 measure = "BER", test.keepX = list.keepX, nearZeroVar(metil))

Error in tune.splsda(metil, Y, ncomp = 1, validation = "loo", dist = "max.dist",  :
  'already.tested.X' must be a vector of keepX values

I don't know what the problem is.

Once more thanks for your time.
Sonya

validating sPLS-DA by looking at the stability of selected features. Is this ok?

Out of curiosity, how many models are you assessing stability over?

Would you still recommend a 5 fold cross validation?

I’d still say experiment with 10 folds, but my guess would be 5 is about the best value to use here.

Would it be ok to integrate all the features without any reduction of features or should I first try to reduce the number of features before integrating with Diablo?

Again, it's best if you experiment with both. By doing so, you can gain an idea of the importance of feature reduction to your system, which may elucidate the complexity of said system. This case study might help!

You need to pass TRUE or FALSE to the near.zero.var parameter. You are passing the entire nearZeroVar() call in the position that corresponds to the already.tested.X parameter. Your call should look like:

tune.splsda.srbct <- tune.splsda(metil, Y, ncomp = 1,
                validation = 'loo', dist = 'max.dist', progressBar = FALSE,
                measure = "BER", test.keepX = list.keepX, near.zero.var = TRUE)

Out of curiosity, how many models are you assessing stability over?

By models, do you mean k-folds and repeats? I am testing 5-fold with 100 repeats, 10-fold with 100 repeats, and leave-one-out, but I'm not sure how I will assess which is best…

I’d still say experiment with 10 folds, but my guess would be 5 is about the best value to use here.

Yes, I will do that: test 5-fold with 100 repeats, 10-fold with 100 repeats, and leave-one-out, and see which gives the smaller error.

I am having problems understanding what nearZeroVar() is doing exactly. Is the function the same as the one from the caret package? What are its parameters? How can I see how many variables are excluded from the model?

Thank you for your patience.

It is the same as the caret package function.

I’d recommend reading the documentation for the nearZeroVar() function (via ?nearZeroVar). I find it’s usually more intuitive to use this function prior to model building rather than setting the near.zero.var parameter in a given method to TRUE.
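As a minimal sketch of that pre-filtering approach (assuming metil is your samples × features methylation matrix; the call uses the default freqCut/uniqueCut thresholds):

library(mixOmics)

nzv <- nearZeroVar(metil)            # same behaviour as caret's nearZeroVar()
length(nzv$Position)                 # how many features would be excluded
head(nzv$Metrics)                    # freqRatio / percentUnique for those features

if (length(nzv$Position) > 0) {
  metil.filtered <- metil[, -nzv$Position]
} else {
  metil.filtered <- metil            # nothing flagged, keep all features
}
dim(metil.filtered)

You can then pass the filtered matrix to tune.splsda() or splsda() directly, which also makes it easy to report exactly how many variables were dropped.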