I have a dataset given to me that was previously split into training and validation (test) data. I need to further split the training data into a separate training set and a calibration set. I don’t want to touch my current validation (test) set, and I don’t have access to the original dataset.
I would like to do this randomly, so that every time I run my script, I get a different training and calibration set. I am aware of the .sample() function, but my training dataset has 44000 rows.
    training = dataset.loc[dataset['split'] == 'train']
    print("Training Created")
    #print(training.head())
    validation = dataset.loc[dataset['split'] == 'valid']
    print("Validation Created")
    #print(validation.head())
Where I would need something like this:
    # proper training set
    x_train = breast_cancer.values[:-100, :-1]
    y_train = breast_cancer.values[:-100, -1]
    # calibration set
    x_cal = breast_cancer.values[-100:-1, :-1]
    y_cal = breast_cancer.values[-100:-1, -1]
    # (x_k+1, y_k+1)
    x_test = breast_cancer.values[-1, :-1]
    y_test = breast_cancer.values[-1, -1]
I am unsure what to do with the second split.
Example of Dataset
Object  | Variable | Split
Cancer1 | 55       | Train
Cancer5 | 45       | Train
Cancer2 | 56       | Valid
Cancer3 | 68       | Valid
Cancer4 | 75       | Valid
It seems you already have a column with the training and validation sets assigned. The usual way is to use sklearn.model_selection.train_test_split. So to further split your training data into training and “calibration” sets, just use it on the training set (note that you first have to split the data into features and target):
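    from sklearn.model_selection import train_test_split

    # initial split into train/test
    # (the labels must match the values stored in the column, including case)
    train = df.loc[df['Split'] == 'train']
    test = df.loc[df['Split'] == 'valid']
    # split the test set into features and target
    # (.iloc selects by position; this assumes the target is the last column)
    x_test = test.iloc[:, :-1]
    y_test = test.iloc[:, -1]
    # same with the train set
    X_train = train.iloc[:, :-1]
    y_train = train.iloc[:, -1]
    # split into train and calibration sets (rows are shuffled by default)
    X_train, X_calib, y_train, y_calib = train_test_split(X_train, y_train)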
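As a self-contained illustration of the same idea, here is a minimal sketch on a toy DataFrame. The column names `feature`, `target` and `split`, the values, and the 25% calibration fraction are all assumptions made for the example, not taken from the real dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real data (hypothetical column names and values).
df = pd.DataFrame({
    "feature": range(10),
    "target": [0, 1] * 5,
    "split": ["train"] * 8 + ["valid"] * 2,
})

# Keep the provided validation (test) rows as they are.
train = df.loc[df["split"] == "train"].drop(columns="split")
test = df.loc[df["split"] == "valid"].drop(columns="split")

# Separate features and target by column name.
X_train_full = train.drop(columns="target")
y_train_full = train["target"]

# Shuffle and hold out 25% of the training rows as the calibration set.
# random_state is left unset, so each run draws a different split.
X_train, X_calib, y_train, y_calib = train_test_split(
    X_train_full, y_train_full, test_size=0.25
)

print(len(X_train), len(X_calib), len(test))
```

If a fixed number of calibration rows is preferred, `test_size` also accepts an integer (for example `test_size=100`). Passing a `random_state` would make the split reproducible, which is the opposite of what the question asks for, so it is omitted here.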
Answered By – yatu
Answer Checked By – Clifford M. (AngularFixing Volunteer)