# Split data into train and test stratified on label

## Issue

I have a data frame (df) with two columns (Numbers and Letters). See reproducible example:

```r
Numbers <- c(2.370653, 3.811336, 5.255120, 6.501197, 7.848100, 9.343938,
             10.843479, 12.164387, 13.476807, 14.922644, 16.419281,
             17.664224, 19.112835, 20.660367, 21.962732, 23.213675)
Letters <- c("a", "b", "c", "c", "d", "a", "b", "d", "d", "a", "a", "c", "b", "c", "c", "c")
# data.frame() keeps Numbers numeric; as.data.frame(cbind(...)) would coerce it to character
df <- data.frame(Numbers, Letters)
```

I want to randomly split the data frame into two data frames of equal size, each containing the same number of each letter. I have found the `stratified()` function, which takes a sample with 50% of each of the Letters:

```r
# Assuming stratified() comes from the splitstackshape package
library(splitstackshape)
test <- stratified(df, "Letters", .5)
```

But this is not really the same as splitting the data frame into two data frames. I do not want any value of `df$Numbers` to appear in both data frames, just the same number of each of `df$Letters` in each. Can you help me?

## Solution

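Try this approach with `rsample`, which is close to what you want. The comment of @AllanCameron is also totally valid: you cannot split three rows into two pieces of 1.5 each, so letters with an odd count (`b` and `d` have three rows each) cannot be divided evenly between the two samples: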

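```r
library(rsample)

# Fix the seed so the random split is reproducible
set.seed(123)
# Stratified 50/50 split on the Letters column
split_strat <- initial_split(df, prop = 0.5,
                             strata = 'Letters')
# training() and testing() return the two non-overlapping parts of the split
train_strat <- training(split_strat)
test_strat <- testing(split_strat)
```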

Check for proportions:

```r
table(train_strat$Letters)
#> a b c d
#> 2 2 3 2

table(test_strat$Letters)
#> a b c d
#> 2 1 3 1
```
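
If you would rather not depend on an extra package, the same idea can be sketched in base R: group the row indices by letter, randomly send half of each group (rounded down) to the first data frame, and put every remaining row in the second. This is only a minimal sketch, not what `rsample` does internally; the names `rows_by_letter`, `first_idx`, `half_a` and `half_b` are made up for illustration, and the seed is arbitrary.

```r
# Rebuild the example data from the question
Numbers <- c(2.370653, 3.811336, 5.255120, 6.501197, 7.848100, 9.343938,
             10.843479, 12.164387, 13.476807, 14.922644, 16.419281,
             17.664224, 19.112835, 20.660367, 21.962732, 23.213675)
Letters <- c("a", "b", "c", "c", "d", "a", "b", "d", "d", "a", "a", "c", "b", "c", "c", "c")
df <- data.frame(Numbers, Letters)

set.seed(123)  # arbitrary seed, only for reproducibility

# Row indices of df, grouped by letter
rows_by_letter <- split(seq_len(nrow(df)), df$Letters)

# From each group, randomly pick floor(n / 2) row indices for the first half.
# sample.int() on the group length avoids the sample() pitfall with length-1 input.
first_idx <- unlist(lapply(rows_by_letter, function(idx) {
  idx[sample.int(length(idx), floor(length(idx) / 2))]
}), use.names = FALSE)

# The second half is every row not chosen above, so no row can be in both
half_a <- df[first_idx, ]
half_b <- df[setdiff(seq_len(nrow(df)), first_idx), ]

# Letter counts per half; odd-sized groups (b and d) leave their extra row in half_b
table(half_a$Letters)
table(half_b$Letters)

# No row of df appears in both halves
length(intersect(rownames(half_a), rownames(half_b))) == 0
```

Because `b` and `d` each have three rows, `half_a` ends up with 7 rows and `half_b` with 9. To balance the sizes, you could alternate between rounding up and rounding down for the odd-sized groups.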

Answered By – Duck
