How can I specify a training set and test set from separate dataframes?


I have a dataframe with a mixture of news articles and Facebook posts (full texts), each with a corresponding label (a single set of labels for all the texts, both the articles and the posts). I want to train my classifier on both types of text (articles and posts), but have only Facebook posts in my test set. Is there any way to specify a group of rows (selected by a ‘source’ column) from which to draw the test set?

I’m using

from sklearn.model_selection import train_test_split

and simpletransformers for the classification model.



Splitting is done the following way:

# create X
X = df[<columns>]
# create y
y = df[<one column>]
# split to train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123, stratify=y)

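If you have two dataframes, you need to combine them before splitting (with pandas imported as pd):

df = pd.concat([df1, df2], ignore_index=True)  # DataFrame.append was removed in pandas 2.0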
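
To get a test set that contains only Facebook posts, one option is to split just the Facebook rows and then put all the articles into the training part. The following is a minimal sketch, not a definitive recipe. The column names 'text', 'label' and 'source', the source values 'facebook' and 'article', and the toy data are all assumptions made for illustration:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data; column names and 'source' values are hypothetical.
df = pd.DataFrame({
    "text": [f"text {i}" for i in range(20)],
    "label": [i % 2 for i in range(20)],
    "source": ["facebook"] * 12 + ["article"] * 8,
})

# Separate the rows by source.
fb = df[df["source"] == "facebook"]
articles = df[df["source"] != "facebook"]

# Split only the Facebook rows, stratified by label.
# Note: test_size is a fraction of the Facebook rows, not of the whole dataframe.
fb_train, test_df = train_test_split(
    fb, test_size=0.5, random_state=123, stratify=fb["label"]
)

# Training set: all articles plus the Facebook posts not held out for testing.
train_df = pd.concat([articles, fb_train], ignore_index=True)
test_df = test_df.reset_index(drop=True)
```

Here train_df and test_df stay as dataframes, which is probably convenient if the model is trained from a dataframe of texts and labels. Check the simpletransformers documentation for the column names it expects.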

Answered By – gtomer


