I am trying to implement an ML algorithm using 10-fold cross validation, and I would like confirmation that my procedure is correct.
I am doing binary classification and have about 50 samples of each class in each of the 10 folds that I created, called fold 1, fold 2, and so on.
The sklearn command I am using is:
x_train, x_test, y_train, y_test = train_test_split(X, yy, test_size=0.3, random_state=1000)
Am I totally wrong here, and is this procedure actually just doing a 30% test / 70% train split? For the 10-fold cross validation, should I instead be using:
from sklearn.model_selection import KFold
kf = KFold(n_splits=2, random_state=42, shuffle=True)
"Am I totally wrong here and this procedure is actually just doing a 30% test and 70% train process?"

Yes: test_size=0.3 gives you a 30% test set and a 70% train set. We know this from reading the documentation:
test_size : float or int, default=None
If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split.
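To make the split sizes concrete, here is a minimal sketch on a hypothetical 50-sample dataset (not the asker's actual data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 50 samples, 2 features, two balanced classes.
X = np.arange(100).reshape(50, 2)
y = np.array([0] * 25 + [1] * 25)

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1000
)
print(len(x_train), len(x_test))  # 35 15 -- a 70% / 30% split of 50 samples
```

This is a single holdout split, not cross-validation.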
If you’re repeating this 10 times with different random_state values, then there will be some repeated elements in the test set among the 10 repetitions. The purpose of k-fold cross-validation is to create k disjoint sets, with each set used in turn as a holdout. Your procedure is not a cross-validation, because the sets produced this way will never be disjoint (you can prove this with the pigeonhole principle: ten test sets of 30% each would need 300% of the data).
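A quick sketch (with made-up indices, not the asker's data) makes the pigeonhole argument concrete:

```python
import numpy as np
from sklearn.model_selection import train_test_split

indices = np.arange(100)  # stand-in for 100 sample indices

# Repeat a 30% holdout split 10 times with different seeds.
test_sets = []
for seed in range(10):
    _, test_idx = train_test_split(indices, test_size=0.3, random_state=seed)
    test_sets.append(set(test_idx))

# 10 repetitions draw 10 * 30 = 300 test slots, but only 100 distinct
# indices exist, so some index must appear in more than one test set.
total_drawn = sum(len(s) for s in test_sets)
unique_drawn = len(set().union(*test_sets))
print(total_drawn, unique_drawn)  # 300 slots drawn, at most 100 unique
```

Because total_drawn exceeds unique_drawn, the test sets cannot be disjoint.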
kf = KFold(n_splits=2, random_state=42, shuffle=True)
This isn’t a 10-fold CV because n_splits=2. We know this from reading the documentation: the argument n_splits is the number of folds. You’ve said you want 10 folds, so set n_splits=10.
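A sketch of what the 10-fold loop could look like, using assumed toy data shaped like the question's (about 50 samples per class). StratifiedKFold is used here because it preserves the class balance in every fold; plain KFold(n_splits=10, shuffle=True, random_state=42) works the same way without stratification:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Assumed toy data: 100 samples, 50 per class (not the asker's dataset).
rng = np.random.default_rng(0)
X = rng.random((100, 4))
y = np.array([0] * 50 + [1] * 50)

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

test_folds = []
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    x_train, x_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    test_folds.append(set(test_idx))
    # fit and score your model on this fold here
```

The 10 test folds produced this way are pairwise disjoint and together cover every sample exactly once, which is what distinguishes this from repeated train_test_split calls.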
Answered By – Sycorax