Decision Tree Classifier on the Pima Indians Diabetes Dataset
# Import pandas for data loading and manipulation
import pandas as pd
from sklearn.tree import DecisionTreeClassifier  # Import Decision Tree Classifier
from sklearn.model_selection import train_test_split  # Import train_test_split function
from sklearn import metrics  # Import scikit-learn metrics module for accuracy calculation
col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']
# Load the dataset; header=None assumes the CSV has no header row.
# If your copy of diabetes.csv starts with a header line, add skiprows=1
# so the column names are not read in as a data row.
pima = pd.read_csv("diabetes.csv", header=None, names=col_names)
# Select the feature columns and the target column
feature_cols = ['pregnant', 'insulin', 'bmi', 'age', 'glucose', 'bp', 'pedigree']
x = pima[feature_cols]  # Features
y = pima.label          # Target variable
# Split the dataset: 70% for training, 30% for testing (random_state=1 for reproducibility)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=1)
# Create a Decision Tree classifier object using entropy (information gain)
# as the split criterion, limiting the tree to depth 3 to reduce overfitting
clf = DecisionTreeClassifier(criterion='entropy', max_depth=3)
# Train the Decision Tree classifier on the training set
clf = clf.fit(x_train, y_train)
# Predict the response for the test dataset
y_pred = clf.predict(x_test)
# Model accuracy: percentage of test samples classified correctly
print("Accuracy:", metrics.accuracy_score(y_test, y_pred) * 100)
"""
train_test_split is used to split the dataset into training and testing sets. Here, 40% of the data is reserved for testing (test_size=0.4),
and random_state=1 sets a seed for random number generation.
This means that the data will be split in the same way every time you run the code with random_state=1, ensuring reproducibility.
Consistent Results: When you set random_state=1, every time you run the code with this setting,
you will get the same random split of the data into training and testing sets.
This is very useful for debugging, testing different models, and ensuring that your results are consistent across multiple runs of your code.
So, by setting random_state=1, you are making sure that even though randomness is involved in splitting your data,
the outcome will be the same every time you run the code.
This is particularly important when you want to compare the performance of different models or share your work with others,
as it allows others to reproduce your results exactly.
"""