How to choose the number for set.seed()

Hello, in DIABLO, how do I best choose the number for set.seed()? What does it depend on?

Hi Andreas,
I am not sure you need to choose a specific number for set.seed(). If I understand correctly, set.seed() fixes the state of R's random number generator, so that functions with a random component return the same result each time. You use it when you want your results to be reproducible. For instance, if you want to draw 10 values from a Gaussian distribution with mean 0 and standard deviation 1, you run

rnorm(n = 10, mean = 0, sd = 1)

If you run it twice,

rnorm(n = 10, mean = 0, sd = 1)
rnorm(n = 10, mean = 0, sd = 1)

you will get two different results, because the function draws 10 random values from this distribution each time.
If you want to get the same values twice, you set a seed before running rnorm. For instance, if you run

set.seed(524)
rnorm(n = 10, mean = 0, sd = 1)
set.seed(524)
rnorm(n = 10, mean = 0, sd = 1)

you will get the same results. Here I used the number “524”, but you can choose any number you want as long as you use the same one both times. In your case, if you just want reproducible results, it doesn’t matter whether you use set.seed(123), set.seed(546) or set.seed(1), as long as you keep using the same one.
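To check this directly, here is a minimal sketch in base R (the variable names are only for illustration). identical() confirms that the same seed reproduces the draw, while a different seed gives a different but equally valid draw from the same distribution.

```r
# Same seed, same draw
set.seed(524)
a <- rnorm(n = 10, mean = 0, sd = 1)
set.seed(524)
b <- rnorm(n = 10, mean = 0, sd = 1)

# A different seed gives a different draw from the same distribution
set.seed(1)
c1 <- rnorm(n = 10, mean = 0, sd = 1)

same_seed_matches  <- identical(a, b)    # TRUE
other_seed_matches <- identical(a, c1)   # FALSE
print(same_seed_matches)
print(other_seed_matches)
```

Neither draw is "better" than the other; the seed only decides which of the possible random draws you get.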

One case where I use set.seed() with DIABLO is when I want to permute values in my initial data to see whether my conclusions are consistent. The function ‘permute’ draws random values, so if you run it twice on the same data without set.seed(), you will get two different results. I basically do:

for (i in 1:100) {
  set.seed(i)                    # seed i makes permutation i reproducible
  new_data <- permute(my_data)   # random permutation of the data
  block.pls(new_data, …)
}

Most of the time, block.pls doesn’t converge (warning message “SGCCA did not converge”), but from time to time there is no warning. I then work only with these good permutations, which I can reproduce because I know the seed that was used.
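The bookkeeping for this can be sketched in base R. This is only an illustration under assumptions: fit_model() is a hypothetical stand-in that sometimes raises a warning (it is not a mixOmics function), and sample() plays the role of the permutation step. tryCatch() records which seeds ran without a warning.

```r
# Hypothetical stand-in for a model fit that sometimes warns
# (NOT a mixOmics function; replace with your own call).
fit_model <- function(x) {
  if (mean(x[1:5]) > 0) warning("did not converge")
  mean(x)
}

set.seed(42)
my_data <- rnorm(50)   # toy data for the illustration

good_seeds <- integer(0)
for (i in 1:100) {
  set.seed(i)
  new_data <- sample(my_data)   # random permutation of the values
  ok <- tryCatch({
    fit_model(new_data)
    TRUE                        # reached only if no warning was raised
  }, warning = function(w) FALSE)
  if (ok) good_seeds <- c(good_seeds, i)
}

# Any recorded seed reproduces its permutation exactly
set.seed(good_seeds[1])
perm_again <- sample(my_data)
```

Keeping the list of seeds means each retained permutation can be regenerated later without storing the permuted data itself.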

Hope it helps
Emile

Dear Emile, thank you so much for taking the time to respond to my question. I guess my problem is that I am not sure what I need set.seed() for when I tune and optimize a classifier model using DIABLO. What is it, actually, that I “set”? If the output model and its feature composition depend on it, it would be good to know how best to choose it. Am I missing something?

Hi Andreas,

I can give you my opinion, but I am not an expert on this subject at all. You should probably wait for advice from the mixOmics team, who are much more experienced than me in this field.
Please don’t hesitate to correct me if I say anything wrong, or to tell me if you have a different opinion on this subject!

I am not sure it is a good idea to search for the set.seed() value that gives you the best results. In my opinion you should just set any value you want, or maybe not use set.seed() at all.
Indeed, set.seed() only fixes the random part of a function. You can use it when you want yourself or your coworkers to get (and then work on) exactly the same result when running the code. If you look for the best result you can find by fixing the random part, that result may not be representative of your data. For instance, it could be an outlier that you shouldn’t interpret at all, because it no longer reflects the biological meaning of your omics data.

Moreover, if your results are very different when you change the set.seed() value, that is not good news. It means that the random part of the function has a huge impact on the final result, whereas what you are interested in is the impact of your omics data. It could mean that there is no strong link between your datasets.

My advice would be to run your code 10 times and see whether you get similar results. If you get really different results, your conclusion could be “with this dataset, I can’t conclude anything; I should try more preprocessing of my data and/or change the DIABLO parameters to get better results”. If all your results are really similar and look good, congratulations! If they are all similar but don’t look good, maybe that is because there is no real link between your datasets, but it could also be because of ‘bad’ preprocessing and/or DIABLO parameters. If most of your results are really similar but one or two are very different from the others, then the question is whether this is due to an outlier in the random part of the function or whether there is a biological reason in your data that could explain it.
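As a rough sketch of this check, here is a base R example on simulated data. Everything in it is an assumption made for illustration: select_features() is a hypothetical stand-in for a random tuning or selection step (it is not DIABLO), and the data are simulated so that only features f1 to f3 carry signal. The idea is simply to repeat the procedure with different seeds and count how often each feature is selected.

```r
# Simulated data: 40 samples, 20 features; only f1, f2, f3 carry signal.
set.seed(1)
n <- 40
p <- 20
X <- matrix(rnorm(n * p), nrow = n,
            dimnames = list(NULL, paste0("f", 1:p)))
y <- X[, 1] + X[, 2] + X[, 3] + rnorm(n, sd = 0.5)

# Hypothetical stand-in for a random selection step: rank features by
# absolute correlation with y on a random 80% subsample, keep the top k.
select_features <- function(X, y, k = 3) {
  idx <- sample(nrow(X), size = floor(0.8 * nrow(X)))
  score <- abs(cor(X[idx, ], y[idx]))[, 1]
  names(sort(score, decreasing = TRUE))[1:k]
}

# Repeat with 10 different seeds and count how often each feature appears.
runs <- lapply(1:10, function(s) {
  set.seed(s)
  select_features(X, y)
})
selection_freq <- sort(table(unlist(runs)), decreasing = TRUE)
print(selection_freq)
```

Features selected in nearly every run can be considered stable. Features that show up in only one or two runs depend mostly on the random part and deserve more caution.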

Hope it helps,
Emile

I concur with @emile.mardoc’s answer (thanks Emile!).
set.seed() is only used so that our material is completely reproducible amongst users, but there will still be some discrepancy (this is due to the parallelisation we use in R). For your own study I would not recommend using set.seed(). As Emile said, if you run a large number of repeats, the results should be very similar from one run to the next. If not, it could be because the sample size is too small to generalise using cross-validation.

Once you have chosen your parameters, you can run your final model and the selected variables will be the same. Only the tuning can be a bit unstable.

Kim-Anh


Thank you so much. What is actually set by set.seed()? What is that value?

Thank you so much, Emile.