# How to efficiently split each row into test and train subsets using R?

## Issue

I have a data table that gives the length and composition of a set of vectors, for example:

```r
library(data.table)

set.seed(1)

dt <- data.table(length = c(100, 150),
                 n_A = c(30, 30),
                 n_B = c(20, 100),
                 n_C = c(50, 20))
```

I need to randomly split each vector into two subsets with 80% and 20% of observations respectively. I can currently do this using a for loop. For example:

```r
dt_80_list <- list() # create output lists
dt_20_list <- list()

for (i in 1:nrow(dt)) { # for each row in the data.table

  # create a randomised vector with the given number of each component
  sample_vec <- sample(c(rep("A", dt$n_A[i]),
                         rep("B", dt$n_B[i]),
                         rep("C", dt$n_C[i])))

  # subset 80% of the vector
  sample_vec_80 <- sample_vec[1:floor(length(sample_vec) * 0.8)]

  # count the number of each component in the subset and output to list
  dt_80_list[[i]] <- data.table(length = length(sample_vec_80),
                                n_A = sum(sample_vec_80 == "A"),
                                n_B = sum(sample_vec_80 == "B"),
                                n_C = sum(sample_vec_80 == "C"))

  # subtract the counts in the 80% subset to get the counts in the 20% subset
  dt_20_list[[i]] <- data.table(length = dt$length[i] - dt_80_list[[i]]$length,
                                n_A = dt$n_A[i] - dt_80_list[[i]]$n_A,
                                n_B = dt$n_B[i] - dt_80_list[[i]]$n_B,
                                n_C = dt$n_C[i] - dt_80_list[[i]]$n_C)
}

# collapse lists to output data.tables
dt_80 <- do.call("rbind", dt_80_list)
dt_20 <- do.call("rbind", dt_20_list)
```

However, the dataset I need to apply this to is very large, and this is too slow. Does anyone have any suggestions for how I could improve performance?

Thanks.

## Solution

(I assume your dataset consists of many more rows, but only a few columns.)

Here’s a version I came up with, with three main changes:

- use `.N` and `by=` to count the number of "A", "B", "C" drawn in each row
- use the `size` argument in `sample`
- join the original `dt` and `dt_80` to calculate `dt_20` without a for loop

```r
## draw training data
dt_80 <- dcast(
  dt[, row := 1:nrow(dt)
     ][, .(draw = sample(c(rep("A80", n_A),
                           rep("B80", n_B),
                           rep("C80", n_C)),
                         size = .8 * length)),
       by = row
     ][, .N, by = .(row, draw)],
  row ~ draw, value.var = "N")[, length80 := A80 + B80 + C80]

## draw test data
dt_20 <- dt[dt_80,
            .(A20 = n_A - A80,
              B20 = n_B - B80,
              C20 = n_C - C80),
            on = "row"][, length20 := A20 + B20 + C20]
```
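To see this run end to end, here is a self-contained check (my own addition: it repeats the pipeline above, with `dt_20` also keeping the `row` column so the two halves can be joined back) verifying that the 80% and 20% splits add back up to the original composition:

```r
library(data.table)
set.seed(1)

dt <- data.table(length = c(100, 150),
                 n_A = c(30, 30),
                 n_B = c(20, 100),
                 n_C = c(50, 20))

## draw training data (as above)
dt_80 <- dcast(
  dt[, row := 1:nrow(dt)
     ][, .(draw = sample(c(rep("A80", n_A), rep("B80", n_B), rep("C80", n_C)),
                         size = .8 * length)),
       by = row
     ][, .N, by = .(row, draw)],
  row ~ draw, value.var = "N")[, length80 := A80 + B80 + C80]

## draw test data, keeping `row` for the check below
dt_20 <- dt[dt_80, .(row, A20 = n_A - A80, B20 = n_B - B80, C20 = n_C - C80),
            on = "row"][, length20 := A20 + B20 + C20]

## both parts must reconstruct the original counts exactly
chk <- dt[dt_80, on = "row"][dt_20, on = "row"]
stopifnot(chk[, all(A80 + A20 == n_A)],
          chk[, all(B80 + B20 == n_B)],
          chk[, all(C80 + C20 == n_C)],
          chk[, all(length80 + length20 == length)])
```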

There is probably still room for optimization, but I hope it already helps 🙂

EDIT

Here is my initial idea. I did not post it at first because the code above is much faster, but this one might be more memory-efficient, which seems crucial in your case. So even if you already have a working solution, this might be of interest…

```r
library(data.table)
library(Rfast)

dt[, row := 1:nrow(dt)]

## sampling function
sampfunc <- function(n_A, n_B, n_C) {
  draw <- sample(c(rep("A80", n_A),
                   rep("B80", n_B),
                   rep("C80", n_C)),
                 size = .8 * (n_A + n_B + n_C))
  out <- Rfast::Table(draw)
  return(as.list(out))
}

## draw training data
dt_80 <- dt[, sampfunc(n_A, n_B, n_C), by = row]
```
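If memory really is the bottleneck, one further idea (my own addition, not part of the answer above): the category counts of a simple random sample drawn without replacement follow a multivariate hypergeometric distribution, so the 80% counts can be drawn directly with base R's `rhyper`, without ever materializing the character vectors. A sketch, assuming the same column names as above:

```r
library(data.table)
set.seed(1)

dt <- data.table(length = c(100, 150),
                 n_A = c(30, 30),
                 n_B = c(20, 100),
                 n_C = c(50, 20))

## number of observations in the 80% subset, per row
size <- floor(0.8 * dt$length)

## A80 ~ Hypergeometric(n_A successes, n_B + n_C failures, size draws);
## B80 is drawn conditional on A80; C80 is whatever remains
A80 <- rhyper(nrow(dt), m = dt$n_A, n = dt$n_B + dt$n_C, k = size)
B80 <- rhyper(nrow(dt), m = dt$n_B, n = dt$n_C, k = size - A80)
C80 <- size - A80 - B80

dt_80 <- data.table(length = size, n_A = A80, n_B = B80, n_C = C80)
dt_20 <- data.table(length = dt$length - size,
                    n_A = dt$n_A - A80,
                    n_B = dt$n_B - B80,
                    n_C = dt$n_C - C80)
```

Since `rhyper` is vectorized over its parameters, this needs no per-row loop or grouping at all; whether the resulting distribution over splits matches what you need is worth double-checking against the explicit sampling version.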