After writing a decision tree function, I decided to check how accurate the tree is, and to confirm that at least the first split stays the same if I build more trees from the same data.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  # needed for plt.figure / plt.show below
from sklearn import tree
from sklearn import preprocessing
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, cross_val_score, KFold
def desicion_tree(data_set: pd.DataFrame, val_1: str, val_2: str):
    # Encoder --> fit doesn't accept strings
    feature_cols = data_set.columns[0:-1]
    X = data_set[feature_cols]  # independent variables
    y = data_set.Mut  # class
    y = y.to_list()
    le = preprocessing.LabelBinarizer()
    y = le.fit_transform(y)

    # Split the data set into training set (75%) and test set (25%)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

    # Create a decision tree classifier object
    clf = DecisionTreeClassifier(max_depth=4, criterion='entropy')

    # Train the decision tree classifier
    clf.fit(X_train, y_train)

    # Predict the response for the test dataset
    y_pred = clf.predict(X_test)

    # Perform cross-validation
    for i in range(2, 8):
        plt.figure(figsize=(14, 7))
        # Perform k-fold cross-validation
        # cv = ShuffleSplit(test_size=0.25, random_state=0)
        kf = KFold(n_splits=5, shuffle=True)
        scores = cross_val_score(estimator=clf, X=X, y=y, n_jobs=4, cv=kf)
        print("%0.2f accuracy with a standard deviation of %0.2f" % (scores.mean(), scores.std()))
        tree.plot_tree(clf, filled=True, feature_names=feature_cols, class_names=[val_1, val_2])
        plt.show()

desicion_tree(car_rep_sep_20, 'Categorial', 'Non categorial')
Below, I wrote a loop to recreate the tree from the split data using KFold. The accuracy changes between runs (around 90%), but the plotted tree is always the same. Where is my mistake?
cross_val_score clones the estimator in order to fit and score on the various folds, so the clf object remains exactly as it was when you fit it on the training split before the loop; the plotted tree is therefore that one rather than any of the cross-validated ones.
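Not part of the original answer, but a minimal sketch of the cloning behavior described above: cross_val_score never touches the estimator you pass in, which you can verify by checking that the original object has no fitted tree_ attribute afterwards. The toy data here is an assumption standing in for the question's car_rep_sep_20.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score, KFold

# Toy two-class data (assumption: any classification data illustrates the point).
rng = np.random.RandomState(0)
X = rng.rand(100, 4)
y = (X[:, 0] > 0.5).astype(int)

clf = DecisionTreeClassifier(max_depth=4, criterion='entropy')
scores = cross_val_score(clf, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))

# The original clf was only cloned, never fitted: it has no tree_ attribute,
# so plotting it here would fail (or, if fit earlier, show that earlier tree).
print(hasattr(clf, 'tree_'))
```

This is why the plot in the question never changes: plot_tree(clf) always shows the one tree fitted before the loop, not the per-fold clones.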
To get what you’re after, I think you can use cross_validate with the option return_estimator=True. You also shouldn’t need the loop, if your cv object has the desired number of splits:
from sklearn.model_selection import cross_validate

kf = KFold(n_splits=5, shuffle=True)
cv_results = cross_validate(
    estimator=clf,
    X=X,
    y=y,
    n_jobs=4,
    cv=kf,
    return_estimator=True,
)
print("%0.2f accuracy with a standard deviation of %0.2f" % (
    cv_results['test_score'].mean(),
    cv_results['test_score'].std(),
))
for est in cv_results['estimator']:
    tree.plot_tree(est, filled=True, feature_names=feature_cols, class_names=[val_1, val_2])
    plt.show()
Alternatively, loop manually over the folds (or other cv iteration), fitting the model and plotting its tree in the loop.
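For the manual alternative, here is a minimal sketch (again using toy data as a stand-in for the question's dataset): KFold.split yields train/test index arrays, and you fit a fresh classifier per fold, keeping each fitted tree so it can later be passed to tree.plot_tree.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import KFold

# Toy two-class data (assumption: stands in for the question's X and y).
rng = np.random.RandomState(0)
X = rng.rand(100, 4)
y = (X[:, 0] > 0.5).astype(int)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fitted_trees = []
for train_idx, test_idx in kf.split(X):
    # Fit a fresh tree on this fold's training portion.
    fold_clf = DecisionTreeClassifier(max_depth=4, criterion='entropy')
    fold_clf.fit(X[train_idx], y[train_idx])
    print("fold accuracy:", fold_clf.score(X[test_idx], y[test_idx]))
    fitted_trees.append(fold_clf)  # each can be plotted with tree.plot_tree
```

Unlike the question's loop, each plotted tree here genuinely comes from a different train/test split, so the first split may well differ fold to fold.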
Answered By – Ben Reiniger