I am working on a random forest model for a medical prediction project. The dataset I am using contains each patient's features, diagnosis, and a patient ID (unique for each patient). Instead of splitting the dataset solely by row proportion (e.g. 75% of the data to train, 25% to test), I wish to split it by patient (i.e. randomly select patients) while still satisfying the "75% to train, 25% to test" ratio. Can anyone suggest how I can achieve this? Thanks in advance!
You can use initial_split from the rsample package. In this example, the column am serves as the stratum (in your case it would be your patient ID):
library(tidyverse)
library(rsample)

# Split the data, stratifying on the column am
my_split <- initial_split(mtcars, strata = am)
train_data <- training(my_split)
test_data <- testing(my_split)

# Original data
mtcars %>% count(am) %>% mutate(prop = n/sum(n))
#>   am  n    prop
#> 1  0 19 0.59375
#> 2  1 13 0.40625

# Training data
train_data %>% count(am) %>% mutate(prop = n/sum(n))
#>   am  n prop
#> 1  0 15  0.6
#> 2  1 10  0.4
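Note that strata keeps the proportions of a variable similar in the training and test sets; it does not by itself guarantee that all rows of one patient end up on the same side of the split. If a patient can have several rows and each patient should appear in only one set, one option is to sample the patient IDs rather than the rows. The following is a minimal base R sketch under that assumption; the data frame patients and its columns patient_id and feature are hypothetical stand-ins for your data. Recent versions of rsample also provide group_initial_split for this kind of grouped split.

```r
# Hypothetical toy data: 20 patients with 1 to 5 rows (e.g. visits) each
set.seed(42)
rows_per_patient <- sample(1:5, 20, replace = TRUE)
patients <- data.frame(
  patient_id = rep(sprintf("P%02d", 1:20), times = rows_per_patient),
  stringsAsFactors = FALSE
)
patients$feature <- rnorm(nrow(patients))

# Randomly pick 75% of the unique patient IDs for training
ids <- unique(patients$patient_id)
train_ids <- sample(ids, size = floor(0.75 * length(ids)))

# Every row of a selected patient goes to train, all other rows go to test
train_data <- patients[patients$patient_id %in% train_ids, ]
test_data  <- patients[!patients$patient_id %in% train_ids, ]

# Number of patients that appear in both sets (should be 0)
length(intersect(train_data$patient_id, test_data$patient_id))
```

With this approach the 75/25 ratio holds for the number of patients; the ratio of rows is only approximate when patients contribute different numbers of rows.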
Answered By – deschen
Answer Checked By – Marilyn (AngularFixing Volunteer)