Splitting a dataset into train and test sets by patient in R for a random forest model?

Issue

I am working on a random forest model for a medical prediction project. The dataset contains patient information: features, diagnosis, and a patient ID (unique to each patient). Instead of splitting the dataset solely by proportion of rows (i.e., 75% of the data to train, 25% to test), I want to split it by patient (i.e., randomly select patients) while still satisfying the "75% to train, 25% to test" ratio. Can anyone suggest how to achieve this? Thanks in advance!

Solution

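You can use initial_split from the rsample package. Its prop argument defaults to 3/4, which gives the 75%/25% train/test ratio. In this example, the column am serves as the stratum, so both sets keep roughly the same proportion of each am value as the original data. Note that strata balances the distribution of a variable across the two sets; it does not keep all rows of one patient together. If a patient has several rows and must not appear in both sets, sample the patient IDs rather than the rows, and use a column such as the diagnosis as the stratum instead of the patient ID: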

library(tidyverse)
library(rsample)

# prop defaults to 3/4 (75% train, 25% test); strata keeps the am proportions similar
my_split <- initial_split(mtcars, strata = am)

train_data <- training(my_split)
test_data <- testing(my_split)

# Original data
mtcars %>%
  count(am) %>%
  mutate(prop = n/sum(n))

  am  n    prop
1  0 19 0.59375
2  1 13 0.40625

# training data
train_data %>%
  count(am) %>%
  mutate(prop = n/sum(n))

  am  n prop
1  0 15  0.6
2  1 10  0.4
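
The example above stratifies on a column; it does not split by patient. A minimal base R sketch of a patient-level split follows, assuming several rows per patient. The data frame patients and its columns patient_id, feature, and diagnosis are made up for illustration and are not from the question.

```r
# Hypothetical example data: 40 patients with 3 rows (e.g. visits) each.
set.seed(42)
patients <- data.frame(
  patient_id = rep(1:40, each = 3),
  feature    = rnorm(120),
  diagnosis  = rep(sample(c("A", "B"), 40, replace = TRUE), each = 3)
)

# Sample 75% of the unique patient IDs, not 75% of the rows.
ids       <- unique(patients$patient_id)
train_ids <- sample(ids, size = floor(0.75 * length(ids)))

# All rows of a sampled patient go to train; everything else goes to test.
train_data <- patients[patients$patient_id %in% train_ids, ]
test_data  <- patients[!patients$patient_id %in% train_ids, ]

# Sanity checks: number of patients present in both sets, and the row ratio.
n_shared   <- length(intersect(train_data$patient_id, test_data$patient_id))
train_frac <- nrow(train_data) / nrow(patients)
```

Because the split is made on patients, the row ratio is exactly 75% only when every patient has the same number of rows; otherwise it is approximate. Recent rsample releases also appear to provide group_initial_split() with a group argument for this purpose; check the documentation of your installed version before relying on it.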

Answered By – deschen

Answer Checked By – Marilyn (AngularFixing Volunteer)
