Split data into training and test set: How to make sure all factor levels are included in the training set?

Issue

I have a data frame called b. I split this into a training set and test set.

smp_size <- floor(0.75 * nrow(b))
set.seed(123)
train_ind <- sample(seq_len(nrow(b)), size = smp_size)
b_train <- b[train_ind, ]
b_test <- b[-train_ind, ]

b contains a column, let’s say x, that I treat as a factor() with many different levels (categories).

I fit a linear model on b_train with lm(), then call predict() with the fitted lm() object and b_test. Unfortunately, b_train$x does not contain every category that occurs in b$x. As a result predict() fails, because b_test$x contains categories that the model never saw in b_train$x.
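The mismatch described above can be checked directly before fitting. The following is a minimal sketch on made-up data: the toy data frame, its column y, and the hand-picked split are assumptions for illustration only, not the original b.

```r
# Toy stand-in for b (hypothetical data): level "d" occurs only once
b <- data.frame(
  x = factor(c(rep("a", 5), rep("b", 5), rep("c", 5), "d")),
  y = c(1:15, 100)
)

# Hand-picked split that leaves the rare level out of the training rows
train_ind <- 1:12
b_train <- b[train_ind, ]
b_test  <- b[-train_ind, ]

# Levels that occur in the test rows but never in the training rows
missing_levels <- setdiff(unique(as.character(b_test$x)),
                          unique(as.character(b_train$x)))
missing_levels

# predict() stops with an error when it meets a level unseen during fitting
fit <- lm(y ~ x, data = b_train)
res <- tryCatch(predict(fit, newdata = b_test), error = function(e) e)
inherits(res, "error")
```

If missing_levels is empty, predict() should not fail for this reason.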

How can I make sure that every category is included in b_train$x?

Solution

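One straightforward way is the caret package’s createDataPartition() function. When it is given a factor, it samples within each level instead of across the whole data frame, so every level that occurs in b$x should be represented in the training set. Very rare levels may then appear only in the training set and not in the test set.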

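library(caret)
set.seed(123)  # make the split reproducible
# Stratified sample: draw about 75% of the rows within each level of x
samp <- createDataPartition(as.factor(b$x), p = 0.75, list = FALSE)

train <- b[samp, ]
test <- b[-samp, ]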
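If caret is not available, the same idea (sampling within each level) can be sketched in base R. This is an illustrative approximation, not caret's implementation; the toy data frame and the name rows_by_level are assumptions made for the example.

```r
# Toy stand-in for b (hypothetical data); level "d" has a single row
b <- data.frame(
  x = factor(c(rep("a", 8), rep("b", 6), rep("c", 5), "d")),
  y = seq_len(20)
)

set.seed(123)
p <- 0.75

# Row indices grouped by level of x (empty levels are dropped)
rows_by_level <- split(seq_len(nrow(b)), b$x, drop = TRUE)

# Within each level, draw ceiling(p * n) rows, so at least one row per level
train_ind <- unlist(lapply(rows_by_level, function(idx) {
  n_train <- ceiling(p * length(idx))
  idx[sample.int(length(idx), n_train)]
}), use.names = FALSE)

b_train <- b[train_ind, ]
b_test  <- b[-train_ind, ]

# Every level seen in the test set should also be in the training set
all(unique(b_test$x) %in% unique(b_train$x))
```

Rounding up with ceiling() sends a level with a single row to the training set, so such a level is never evaluated on the test set. Whether that is acceptable depends on how the test error will be used.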

Answered By – Not_Dave

Answer Checked By – Gilberto Lyons (AngularFixing Admin)
