I am having trouble preprocessing my dataset as a whole with ColumnTransformer. Maybe you can help:
First I split my dataset:
X_train, X_test, y_train, y_test = train_test_split(df, target, test_size=0.2, random_state=seed)
Then I do my preprocessing:
preprocessor = ColumnTransformer(
    transformers=[
        ("col_drop", "drop", ["col1", "col2"]),
        ("enc_1", BinaryEncoder(), ["Bank"]),
        ("enc_2", OneHotEncoder(), ["Chair"]),
        ("log", FunctionTransformer(np.log1p, validate=False), log_features),
        ("log_p", FunctionTransformer(np.log1p, validate=False), ["target_y"]),
        ("pow", PowerTransformer(method="yeo-johnson"), pow_features),
    ],
    remainder="passthrough",
    n_jobs=-1,
)
And after that I call a pipeline with my preprocessor:
This produces the error: A given column is not a column of the dataframe
This makes sense in a way, because I use the preprocessor to apply np.log1p to target_y, which is my target and is only present in y_train and y_test.
I assume this causes the error, because the target is not in X_train.
Question: Is it possible to preprocess X and y at once, or do I have to use a separate ColumnTransformer/pipeline for my y values? Is there a good solution for this?
You cannot preprocess targets in a Pipeline (unless you plan on putting them together with the independent variables and then splitting them out later); however, there is the TransformedTargetRegressor (docs) meant for this use-case.
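As a rough sketch of how that could look for the setup in the question: the feature pipeline only sees X, and TransformedTargetRegressor applies np.log1p to y before fitting and np.expm1 to the predictions afterwards. The synthetic data, the "income" column and the LinearRegression estimator are assumptions made up for illustration; BinaryEncoder and PowerTransformer are left out to keep the example self-contained.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

# Synthetic stand-in data (hypothetical; replace with your own df and target).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Chair": rng.choice(["a", "b", "c"], size=200),
    "income": rng.uniform(1.0, 100.0, size=200),
})
# Strictly positive target, so log1p is well defined.
target = pd.Series(
    np.exp(0.02 * df["income"] + rng.normal(0.0, 0.1, size=200)),
    name="target_y",
)

X_train, X_test, y_train, y_test = train_test_split(
    df, target, test_size=0.2, random_state=0
)

# The ColumnTransformer only references columns that exist in X.
preprocessor = ColumnTransformer(
    transformers=[
        ("enc_2", OneHotEncoder(handle_unknown="ignore"), ["Chair"]),
        ("log", FunctionTransformer(np.log1p), ["income"]),
    ],
    remainder="passthrough",
)

pipeline = Pipeline([
    ("prep", preprocessor),
    ("reg", LinearRegression()),  # placeholder estimator, an assumption
])

# The target transform lives here instead of in the ColumnTransformer:
# y is passed through log1p before fit, predictions through expm1.
model = TransformedTargetRegressor(
    regressor=pipeline,
    func=np.log1p,
    inverse_func=np.expm1,
)

model.fit(X_train, y_train)
predictions = model.predict(X_test)  # already back on the original scale of y
```

Because the inverse function is applied inside predict, metrics computed on predictions and y_test are on the original scale of the target, and y_train and y_test never need to be transformed by hand.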
Answered By – Ben Reiniger
Answer Checked By – Gilberto Lyons (AngularFixing Admin)