Really simple question, but something I find a bit confusing. After tuning I understand I can use 2 components and 25 variables for keepX, so I would choose keepX = c(25, 25). If my matrix is fungal abundances for 300 fungi families, does this keepX mean that on the first component (this is for sPLS/DIABLO) it is choosing just the first 25 fungi it comes across in my matrix? Or is it analysing all 300 fungi against the genes (my keepY is 100 for genes) and keeping the top 25 with the highest correlations? Generally my component 1 heatmap is all red (positive correlations). So on component 2, is it keeping the 25 fungi with the LOWEST correlations? I'm just unsure HOW it is choosing the 25 variables. Same question for keepY.
Simply put, a regularisation process is used to reduce the number of features used per component. Using your example, the loading values of all 300 fungi families are taken. The optimisation problem is adjusted by LASSO penalisation, which shrinks the loading values until only 25 non-zero loading values remain. These 25 non-zero loadings correspond to the 25 features used on component 1.
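To make that concrete, here is a toy base-R sketch of the soft-thresholding idea behind LASSO-style selection. This is an illustration of the principle only, not mixOmics's actual implementation (which solves the penalised optimisation iteratively):

```r
# Toy illustration: shrink a loading vector towards zero until only the
# 'keep' largest-magnitude entries remain non-zero (soft-thresholding).
soft_threshold_keep <- function(loadings, keep) {
  # threshold at the (keep+1)-th largest absolute loading
  lambda <- sort(abs(loadings), decreasing = TRUE)[keep + 1]
  if (is.na(lambda)) lambda <- 0  # keep >= length(loadings): no shrinkage
  sign(loadings) * pmax(abs(loadings) - lambda, 0)
}

set.seed(1)
u <- rnorm(300)                    # e.g. loadings for 300 fungi families
u25 <- soft_threshold_keep(u, 25)
sum(u25 != 0)                      # -> 25: only 25 features retained
```

The 25 survivors are the features with the largest absolute loadings, i.e. the strongest contributors to the component, not the first 25 columns of your matrix.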
Okay, just to be sure I understand: if I choose keepX = c(25, 25) and keepY = c(50, 50), and I have 300 fungi and 1000 genes, then on the first component the fungi and genes kept are the 25 and 50, respectively, with the highest correlations. Then on the second component, it would be the 25 fungi and 50 genes with the lowest correlation values? So it is still looking at all the data and then choosing the top contributors to the variation in the data?
In the first component the fungi and genes kept are the 25 and 50, respectively, with the highest correlations
You're correct about the number of features selected. However, they are not chosen by their highest correlation (correlation to what?). Components are constructed (in sPLS and DIABLO) to maximise the covariance between themselves and the equivalent components from the other blocks of data.
Then in the second component, it would be 25 fungi and 50 genes that have the lowest correlation values?
Again, not the lowest correlation values. On the second component, the same process of regularisation and covariance maximisation occurs. It just can't reuse the features selected for the first component, so it selects the "next best" set.
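If you want to see exactly which features ended up with non-zero loadings on each component, mixOmics's selectVar() reports them. A sketch, where X and Y stand in for your fungi and gene matrices (samples in rows):

```r
library(mixOmics)

# X: samples x 300 fungi, Y: samples x 1000 genes (placeholders for your data)
res <- spls(X, Y, ncomp = 2, keepX = c(25, 25), keepY = c(50, 50))

# Features with non-zero loadings, i.e. the ones "kept" on each component
selectVar(res, comp = 1)$X$name   # the 25 fungi selected on component 1
selectVar(res, comp = 2)$X$name   # a different set of 25 on component 2
selectVar(res, comp = 1)$Y$name   # the 50 genes selected on component 1
```

Comparing the two lists shows that component 2's selection does not overlap with component 1's.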
So how would I ensure that my keepX and keepY would be extracting the strongest covariance? It seems when I play around with it, most of the data is the same but then there are some changes here and there. I don’t know if I should be using keepX and keepY or trying to use all the data at once - then extract the strongest correlations that way. If I set a keepX and keepY, would I be missing out on any important correlations between (for example) fungi and gene expression data?
So how would I ensure that my keepX and keepY would be extracting the strongest covariance?
If you're using DIABLO, it constructs components to maximise covariance. It does this mathematically as part of the fitting procedure, so you don't need to do it yourself.
It seems when I play around with it, most of the data is the same but then there are some changes here and there
This is because the model consistently approaches the globally optimal solution, but the random state of each run (e.g. the cross-validation fold assignment) causes minor perturbations. Explore the usage of set.seed() if you want to control the random state.
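A minimal base-R illustration of why fixing the seed removes the run-to-run differences (the seed value and the draw are arbitrary stand-ins for whatever randomness your tuning uses):

```r
set.seed(42)
a <- sample(1:300, 25)   # e.g. one random fold/feature draw
set.seed(42)
b <- sample(1:300, 25)   # same seed -> identical draw
identical(a, b)          # -> TRUE
```

Call set.seed() once before your tuning or fitting call and repeated runs will give identical results.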
If I set a keepX and keepY, would I be missing out on any important correlations between (for example) fungi and gene expression data?
This is something you have to experiment with. The answer is totally context- and data-dependent, so I can't answer that for you. Some general advice for when you tune these models: use an adequately high number of repeats (i.e. nrepeat = 100). As this may take a lot of time, I'd also suggest doing the tuning over multiple steps, increasing the resolution and decreasing the range of the grid at each iteration. E.g. start with test.keepX = seq(10, 150, 10), and based on the output (let's say it selects 50), tune again with test.keepX = seq(30, 70, 5); rinse and repeat.
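Putting that workflow together, here is a sketch of the coarse-to-fine tuning loop. tune.spls() and its arguments are from mixOmics, but X, Y, the fold settings, and the grids are placeholders you would adapt to your own data:

```r
library(mixOmics)

set.seed(123)  # reproducible cross-validation folds

# Pass 1: coarse grid over candidate keepX values
tune1 <- tune.spls(X, Y, ncomp = 2,
                   test.keepX = seq(10, 150, 10),
                   validation = "Mfold", folds = 5, nrepeat = 100)
tune1$choice.keepX   # suppose this suggests around 50 per component

# Pass 2: finer grid centred on the first estimate
tune2 <- tune.spls(X, Y, ncomp = 2,
                   test.keepX = seq(30, 70, 5),
                   validation = "Mfold", folds = 5, nrepeat = 100)
tune2$choice.keepX   # repeat with a narrower, finer grid as needed
```

Each pass narrows the range and shrinks the step size, so you get a precise keepX without paying for a single huge fine-grained grid search.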