I have a dataframe with a mixture of news articles and Facebook posts (full texts), each with a corresponding label (a single set of labels is shared by all the texts, both the articles and the posts). I want to train my classifier on both types of texts (articles and posts), but have only Facebook posts in my test set. Is there any way to specify a group of rows (grouped by a 'source' column) from which to extract the test set?
I am using train_test_split (from sklearn.model_selection import train_test_split) for the split
and simpletransformers for the classification model.
Splitting is done the following way:
# create X
X = df[<columns>]
# create y
y = df[<one column>]
# split to train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123, stratify=y)
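If you have two dataframes, you need to combine them first. DataFrame.append was removed in pandas 2.0, so use pd.concat instead (with pandas imported as pd):
df = pd.concat([df1, df2])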
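The split shown earlier samples the test set from all rows, so articles can end up in it. One way to get a test set containing only Facebook posts, as the question asks, is to split just the Facebook rows and then add every article to the training part. The sketch below is a minimal illustration, not a definitive recipe: the toy data and the column names 'text', 'label' and 'source' (with the value 'facebook') are assumptions, so adjust them to your dataframe.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data; the column names and source values are assumptions.
df = pd.DataFrame({
    "text": [f"doc {i}" for i in range(20)],
    "label": [0, 1] * 10,
    "source": ["article"] * 8 + ["facebook"] * 12,
})

# Separate the rows by source.
is_fb = df["source"] == "facebook"
fb = df[is_fb]
articles = df[~is_fb]

# Split only the Facebook posts, stratified on their labels.
fb_train, fb_test = train_test_split(
    fb, test_size=0.3, random_state=123, stratify=fb["label"]
)

# Train on all articles plus the Facebook training part (shuffled);
# test on the held-out Facebook posts only.
train_df = (
    pd.concat([articles, fb_train])
    .sample(frac=1, random_state=123)
    .reset_index(drop=True)
)
test_df = fb_test.reset_index(drop=True)
```

Note that test_size here is a fraction of the Facebook posts only, not of the whole dataframe, so pick it with the number of posts in mind.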
Answered By – gtomer
Answer Checked By – David Marino (AngularFixing Volunteer)