Supervised Learning

Introduction

What is Supervised Learning?

In this section, we explore a number of ‘supervised learning’ techniques in an attempt to predict the outcome of BBWAA votes! For some background, supervised learning, unlike unsupervised learning, is the branch of machine learning where the algorithms are given the underlying truth about each point (in our case the voting result - elected, expired, etc.). With the target information present, the machine ‘learns’ to connect the patterns in the data with the specific outcomes, hopefully reaching a place where it can make accurate predictions for new, unseen data.

Within the realm of supervised learning, there are two major tasks that we will utilize in this section: classification and regression. Classification is used when the target variable is categorical in nature1. For example, predicting the outcome of the BBWAA vote would be a classification task, with categories like elected, expired, in limbo, etc. Within classification, there are two common tasks - binary classification and multi-class classification. The example just described would be a multi-class problem, as there are more than two possible outcomes. On the other hand, if we only work to predict election vs. non-election, it becomes a binary classification problem, as there are only two possible classes1.

The second process, regression, is used to make predictions of a numeric or continuous variable2. While the outcomes of the BBWAA voting are categorical, the underlying percentage of votes received can be predicted via regression. The process, at a high level, remains similar, with the regression model ‘learning’ the connections between the underlying data and the target values, but the output is now continuous2.

Supervised Methods Used

K-Nearest Neighbors

K-Nearest Neighbors is one of the simpler methods out there, taking direct advantage of the fact, which we saw earlier, that points with similar target classes/values tend to exist close to one another3. Using this fact, the K-Neighbors algorithm takes in a value for K, and for any prediction returns the majority class or average value of the K closest points3. For example, if the algorithm is making a prediction for a point with k=3, and the 3 data points closest to the prediction point are [elected, elected, eliminated], the prediction returned would be elected due to the majority vote. If instead we were utilizing K-Neighbors for regression, and the 3 nearest points were [50%, 60%, 70%] (votes received), the prediction would be 60% via the mean.
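
As a minimal sketch of this voting procedure, here is a tiny made-up, one-feature dataset (not our real ballot data) run through scikit-learn's K-Neighbors classifier and regressor with k=3:

# A minimal sketch of the K-Neighbors vote on made-up, one-feature data
# (the numbers below are illustrative only, not from the BBWAA dataset)
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

X = np.array([[10], [12], [13], [40], [42]])                      # a single toy feature
labels = ['elected', 'elected', 'eliminated', 'eliminated', 'eliminated']
votes = np.array([50, 60, 70, 5, 10])                             # toy vote percentages

# Classification: the 3 points closest to 11 are labeled elected, elected, eliminated,
# so the majority vote returns 'elected'
knn_clf = KNeighborsClassifier(n_neighbors=3).fit(X, labels)
print(knn_clf.predict([[11]]))

# Regression: those same 3 neighbors received 50%, 60%, and 70% of the vote,
# so the prediction is their mean, 60
knn_reg = KNeighborsRegressor(n_neighbors=3).fit(X, votes)
print(knn_reg.predict([[11]]))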

Decision Tree/Random Forest

Decision trees, like K-Nearest Neighbors, can be used for both regression and classification tasks. To make predictions, a decision tree splits the data a series of times, grouping points based on the values of specific features4. Once the data has been split, the majority outcome (or average value) of the final resulting group becomes the prediction. To ‘learn’, the decision tree determines which features and values to split the data on, such that the resulting groupings accurately reflect the true outcomes4.

Random forests utilize a multitude of smaller decision trees, returning the most common result across those trees as the prediction (or the average of their predictions in the regression case), similar to the majority ruling in K-Neighbors5.
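
To make the tree/forest relationship concrete, here is a minimal sketch on made-up data (nothing below comes from our dataset): a single shallow decision tree predicts from its final group, while the random forest aggregates the votes of 100 such trees.

# A minimal sketch of a single decision tree vs. a random forest on made-up data
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                    # three toy features
y = (X[:, 0] + X[:, 1] > 0).astype(int)          # toy target driven by a simple rule

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)            # one tree, at most 3 levels of splits
forest = RandomForestClassifier(n_estimators=100).fit(X, y)     # 100 trees, each on a bootstrap sample

new_point = [[0.5, 0.5, -1.0]]
print(tree.predict(new_point))      # the single tree's final group decides
print(forest.predict(new_point))    # majority vote across the forest's 100 trees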

Linear Regression

Linear Regression is the most popular form of regression, making predictions based on the line (or hyperplane in higher dimensions) learned during training that minimizes the prediction error across all points in the dataset. There are a few methods to find this optimal line, but they each work to find it such that the sum of the squares of the distances between each datapoint and the line is minimized. It is only used for regression tasks rather than classification tasks, and assumes a linear relationship between each of the features and the target variable6.
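
As a minimal sketch of that least-squares idea (again on made-up data), the fitted coefficients below are the ones that minimize the sum of squared errors, and the model roughly recovers the slopes used to generate the toy target:

# A minimal sketch of ordinary least squares on made-up data
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                                        # two toy features
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)      # toy target with a little noise

linear = LinearRegression().fit(X, y)
preds = linear.predict(X)

sum_squared_error = np.sum((y - preds) ** 2)      # the quantity the fit minimizes
print(linear.coef_, linear.intercept_)            # slopes near 3 and -2, intercept near 0
print(sum_squared_error)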

Logistic Regression

While technically titled a regression, Logistic Regression is most often used for classification problems7. This is because logistic regression predictions are bounded between 0 and 1, making them interpretable as probabilities of ‘success’. Thus, when utilizing logistic regression for binary classification, say of ‘election’, a model output of 0.76 would become a prediction of elected, as the probability of success is >50%. Logistic regression can also be utilized for multi-class predictions7.
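
A minimal sketch of that thresholding step, on made-up data: the model outputs a probability between 0 and 1, and anything above 0.5 becomes a positive (‘elected’) prediction.

# A minimal sketch of turning a logistic regression probability into a class label
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                                        # two toy features
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)      # noisy toy 'elected' indicator

logit = LogisticRegression().fit(X, y)

prob = logit.predict_proba([[1.2, 0.3]])[:, 1]    # estimated probability of 'success'
label = (prob > 0.5).astype(int)                  # e.g. an output of 0.76 becomes a prediction of 1
print(prob, label)
print(logit.predict([[1.2, 0.3]]))                # sklearn's predict applies the same 0.5 cutoff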

Code

# General imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score, mean_squared_error
# Import the BBWAA voting data for batters
batter_df = pd.read_csv('../../data/processed-data/batter_df_for_prediction.csv')

# Create the targets for multi-class classification, binary classification, and regression
multi_targets = batter_df.outcome
binary_targets = multi_targets == 'elected'
reg_targets = batter_df.votes_pct

# Drop the informational columns that we don't use to predict (like name)
# Plus the targets columns and the scandal column (which is all 0s)
batter_df = batter_df.drop(columns=['name', 'player_id', 'votes_pct', 'outcome', 'position', 'scandal'])

As a refresher, we are making predictions on the following columns:

batter_df.columns
Index(['voting_year', 'year_on_ballot', 'ly_votes_pct', 'b_war', 'b_h', 'b_hr',
       'b_sb', 'b_bb', 'b_so', 'b_batting_avg', 'b_onbase_plus_slugging_plus',
       'b_home_run_perc', 'b_strikeout_perc', 'b_base_on_balls_perc',
       'b_cwpa_bat', 'b_baseout_runs', 'mvps', 'gold_gloves', 'batting_titles',
       'all_stars', 'G_p_app', 'G_c', 'G_1b', 'G_2b', 'G_3b', 'G_ss',
       'G_lf_app', 'G_cf_app', 'G_rf_app', 'G_dh'],
      dtype='object')

Before we make predictions, we process our data once more with the methods described in the unsupervised learning section. This includes reducing dimensionality with PCA, and scaling all data with a standard scaler.

# Preprocess the data with PCA and a standard scaler
scaler = StandardScaler()
scaled_data = scaler.fit_transform(batter_df)

# We keep 95% of the variance in our PCA decomposition
pca = PCA(n_components=.95)
decomposed_data = pca.fit_transform(scaled_data)

print(f'The processed dataframe now has the shape {decomposed_data.shape}')
The processed dataframe now has the shape (3108, 20)

As a final step before training models and making predictions, we split our data into two categories: training data and testing data. We do this so that we can ensure our models adapt well to unseen data in the future. After splitting our data into these two groups, we will train the models on the training data before feeding them the ‘unseen’ testing data to make predictions on. These blind test results will give us the best understanding of how each model will generalize in the future. While splitting the data, we also ensure that the proportion of successful elections is similar across the two splits, as the majority of our dataset is non-elections.

Within this training data, we also undertake one further step that allows us to tweak each model in ways that optimize it. When training each model, we utilize a technique called cross-validation. This process splits the training data into a predefined number (k) of sets, then trains the model k times, each time holding out one set and validating the score on that ‘unseen’ fold. This allows us to train each model with different inputs (like the number of neighbors), without breaking into our final testing data!
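
As a rough sketch of what that splitting looks like (scikit-learn's cross_val_score uses k=5 by default), k-fold cross-validation simply partitions the rows into k folds and rotates which fold is held out:

# A minimal sketch of 5-fold cross-validation: each row serves as validation data exactly once
# (illustrative only -- the actual cross-validated training happens in the cells below)
from sklearn.model_selection import KFold

kfold = KFold(n_splits=5)
for fold_number, (train_idx, val_idx) in enumerate(kfold.split(decomposed_data)):
    print(f'Fold {fold_number}: train on {len(train_idx)} rows, validate on {len(val_idx)} rows')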

Binary Classification

# Split data into training and testing datasets for binary classification
# We reserve 20% of the data for testing
x_train, x_test, y_train, y_test = train_test_split(decomposed_data, binary_targets, test_size=0.2, random_state=5000, stratify=binary_targets)

Logistic Regression

# Initialize Logistic Regression Model and fit to the training data
logistic_model = LogisticRegression()
logistic_model.fit(x_train, y_train)

# Utilize cross validation training to train the logistic model
# for a baseline score
baseline_accuracy = cross_val_score(logistic_model, x_train, y_train).mean() * 100

print(f'Accuracy for the baseline Logistic Regression Model is {round(baseline_accuracy, 2)}%')
Accuracy for the baseline Logistic Regression Model is 97.63%

This is a great result! When attempting to predict whether a player will be elected to the Hall of Fame via cross validation, we are correct more than 97% of the time! We will confirm this result later, however, on the unseen test data.

K-Nearest Neighbors

# Initialize the KNeighbor model and fit to the training data
knn_classifier = KNeighborsClassifier()
knn_classifier.fit(x_train, y_train)

# Utilize cross validation training to train the KNeighbors model
# for a baseline score
baseline_accuracy = cross_val_score(knn_classifier, x_train, y_train).mean() * 100

print(f'Accuracy for the baseline K-Neighbors Classifier Model is {round(baseline_accuracy, 2)}%')
Accuracy for the baseline K-Neighbors Classifier Model is 97.02%

We can potentially build upon this result for K-Neighbors by altering the number of neighbors used in classification. We do this by running the cross validation procedure for a number of different neighbor counts, and using the best one for final training. This process is known as hyperparameter tuning.

knn_classifier = KNeighborsClassifier()

# Define the neighbor values to search
# We test each value from 5 to 50 in increments of 5
knn_grid ={'n_neighbors': [int(n) for n in np.linspace(5, 50, 10)]}

# Set up the grid search
knn_grid = GridSearchCV(knn_classifier, knn_grid)

# Fit the grid search, finding the optimal value
knn_grid.fit(x_train, y_train)

# Print the best model and its score
print(f'The best K-Neighbors model has {knn_grid.best_params_["n_neighbors"]} Neighbors')
print(f'Accuracy for the optimized K-Neighbors Classifier Model is {round(knn_grid.best_score_*100, 2)}%')
The best K-Neighbors model has 5 Neighbors
Accuracy for the optimized K-Neighbors Classifier Model is 97.02%

Decision Tree

# Initialize the tree and fit to the training data
tree_classifier = DecisionTreeClassifier()
tree_classifier.fit(x_train, y_train)

# Utilize cross validation training to train the model for a baseline score
baseline_accuracy = cross_val_score(tree_classifier, x_train, y_train).mean() * 100

print(f'Accuracy for the baseline Decision Tree Classifier Model is {round(baseline_accuracy, 2)}%')
Accuracy for the baseline Decision Tree Classifier Model is 97.02%

We again use a grid search in an attempt to optimize the tree classifier. The hyperparameters include the maximum depth of the tree (how many successive splits it can make), as well as the minimum number of data points a group must contain before it can be split further.

tree_classifier = DecisionTreeClassifier()

# Define the hyperparameter search grid
tree_grid ={'max_depth':[int(n) for n in np.linspace(5, 50, 10)] + [None],
           'min_samples_split':[2, 3, 4, 5, 6]}

# Set up the grid search
tree_grid = GridSearchCV(tree_classifier, tree_grid)

# Fit the grid search, finding the optimal value
tree_grid.fit(x_train, y_train)

# Print the best model and its score
print(f'The best Decision Tree model has a max depth of {tree_grid.best_params_["max_depth"]} and a minimum of {tree_grid.best_params_["min_samples_split"]} points per split')
print(f'Accuracy for the optimized Decision Tree Classifier Model is {round(tree_grid.best_score_*100, 2)}%')
The best Decision Tree model has a max depth of 5 and a minimum of 3 points per split
Accuracy for the optimized Decision Tree Classifier Model is 97.75%

Random Forest

# Initialize the tree and fit to the training data
forest_classifier = RandomForestClassifier()
forest_classifier.fit(x_train, y_train)

# Utilize cross validation training to train the model for a baseline score
baseline_accuracy = cross_val_score(forest_classifier, x_train, y_train).mean() * 100

print(f'Accuracy for the baseline Random Forest Classifier Model is {round(baseline_accuracy, 2)}%')
Accuracy for the baseline Random Forest Classifier Model is 97.83%

We next optimize the random forest classifier with the same hyperparameters as the Decision Tree, plus an extra that defines the number of Decision Trees the forest uses.

forest_classifier = RandomForestClassifier()

# Define the hyperparameter search grid
forest_grid ={'max_depth':[int(n) for n in np.linspace(5, 20, 4)] + [None],
           'min_samples_split':[2, 3, 4, 5, 6],
           'n_estimators':[50, 100, 150]}

# Set up the grid search
forest_grid = GridSearchCV(forest_classifier, forest_grid)

# Fit the grid search, finding the optimal value
forest_grid.fit(x_train, y_train)

# Print the best model and its score
print(f'The best Random Forest model has a max depth of {forest_grid.best_params_["max_depth"]}, a minimum of {forest_grid.best_params_["min_samples_split"]} points per split, and {forest_grid.best_params_["n_estimators"]} trees')
print(f'Accuracy for the optimized Random Forest Classifier Model is {round(forest_grid.best_score_*100, 2)}%')
The best Random Forest model has a max depth of 15, a minimum of 5 points per split, and 150 trees
Accuracy for the optimized Random Forest Classifier Model is 98.07%

The optimized Random Forest improves slightly over its baseline, making it the strongest of our classifiers on the training data, so we carry the tuned version forward.

With these strong scores on the training set, we move next into making predictions on the test set to see how well our models generalize to unseen data. Beyond considering models for their accuracy during this stage, we can also rely on two other metrics for classification tasks: Precision and Recall. Precision tells us our accuracy among positive predictions (if we say elected, how often are we right?), while recall tells us how many of the actual positives we found (what percent of the true HOFers did we identify?).

Finally, because precision and recall typically trade off against one another, we also introduce the F1 metric. This metric, the harmonic mean of precision and recall, balances the two and offers a more general view of how the model is performing.
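
A minimal sketch on ten made-up labels shows how the three metrics relate; the hand computations match scikit-learn's functions.

# A minimal sketch of precision, recall, and F1 on toy labels (not our real ballots)
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

y_true_toy = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])   # 4 actual elections
y_pred_toy = np.array([1, 1, 0, 0, 1, 0, 0, 0, 0, 0])   # we predict 3 elections, 2 of them correctly

tp, fp, fn = 2, 1, 2
print(tp / (tp + fp), precision_score(y_true_toy, y_pred_toy))   # precision: 2 of our 3 'elected' calls were right
print(tp / (tp + fn), recall_score(y_true_toy, y_pred_toy))      # recall: we found 2 of the 4 actual elections
print(f1_score(y_true_toy, y_pred_toy))                          # F1: harmonic mean of the two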

# Test logistic model
logistic_preds = logistic_model.predict(x_test)
logistic_accuracy = accuracy_score(y_test, logistic_preds)
logistic_precision = precision_score(y_test, logistic_preds)
logistic_recall = recall_score(y_test, logistic_preds)
logistic_f1 = f1_score(y_test, logistic_preds)

# Test KNN model
knn_test_preds = knn_grid.best_estimator_.predict(x_test)
knn_accuracy = accuracy_score(y_test, knn_test_preds)
knn_precision = precision_score(y_test, knn_test_preds)
knn_recall = recall_score(y_test, knn_test_preds)
knn_f1 = f1_score(y_test, knn_test_preds)

# Test Decision Tree model
tree_test_preds = tree_grid.best_estimator_.predict(x_test)
tree_accuracy = accuracy_score(y_test, tree_test_preds)
tree_precision = precision_score(y_test, tree_test_preds)
tree_recall = recall_score(y_test, tree_test_preds)
tree_f1 = f1_score(y_test, tree_test_preds)

# Test Random Forest model
forest_test_preds = forest_grid.best_estimator_.predict(x_test)
forest_accuracy = accuracy_score(y_test, forest_test_preds)
forest_precision = precision_score(y_test, forest_test_preds)
forest_recall = recall_score(y_test, forest_test_preds)
forest_f1 = f1_score(y_test, forest_test_preds)

# Convert scoring to a DataFrame
results = {
    "Model": ["Logistic Regression", "KNN", "Decision Tree", "Random Forest"],
    "Accuracy": [logistic_accuracy, knn_accuracy, tree_accuracy, forest_accuracy],
    "Precision": [logistic_precision, knn_precision, tree_precision, forest_precision],
    "Recall": [logistic_recall, knn_recall, tree_recall, forest_recall],
    "F1 Score": [logistic_f1, knn_f1, tree_f1, forest_f1]
}

# Convert the dictionary into a pandas DataFrame
results_df = pd.DataFrame(results)

print('Binary Classification Results by Model\n')

# Display the DataFrame
print(results_df)
Binary Classification Results by Model

                 Model  Accuracy  Precision    Recall  F1 Score
0  Logistic Regression  0.979100   0.875000  0.368421  0.518519
1                  KNN  0.971061   0.666667  0.105263  0.181818
2        Decision Tree  0.975884   1.000000  0.210526  0.347826
3        Random Forest  0.975884   0.750000  0.315789  0.444444

Here an interesting trend emerges. While the accuracy of each model is quite high, the secondary metrics are not as rosy. This is occurring because of the class imbalance in our dataset. With almost 97% of the datapoints being non-elections, we could achieve roughly 97% accuracy simply by guessing non-election for every single point. Thus, the precision and recall tell the better story: when we do predict a successful election we are fairly accurate, but we fail to identify many of the actual elections.

That said, we do still see a meaningful improvement over that naive baseline, which tells us that our models are genuinely learning to predict elections. With a baseline accuracy just under 97%, our logistic model cuts the number of incorrect predictions by roughly 30%.
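
To make that baseline concrete, here is a small sanity check using scikit-learn's DummyClassifier (added purely for illustration, not part of the pipeline above): a model that always predicts non-election posts a high accuracy on our test split but never finds a single election.

# A minimal sketch of the naive majority-class baseline described above
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

dummy = DummyClassifier(strategy='most_frequent').fit(x_train, y_train)
dummy_preds = dummy.predict(x_test)

print(f'Baseline accuracy: {round(accuracy_score(y_test, dummy_preds) * 100, 2)}%')
print(f'Baseline recall: {recall_score(y_test, dummy_preds)}')   # 0.0 -- it never predicts an election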

Multi-Class Classification

We repeat the same exercise as above, this time using the full set of classes: [elected, eliminated, expired, limbo].

# Split data into training and testing datasets for multi-class classification
# We reserve 20% of the data for testing
x_train, x_test, y_train, y_test = train_test_split(decomposed_data, multi_targets, test_size=0.2, random_state=5000, stratify=multi_targets)

# Initialize Logistic Regression Model and fit to the training data
logistic_model = LogisticRegression()
logistic_model.fit(x_train, y_train)

# Utilize cross validation training to train the logistic model
# for a baseline score
baseline_accuracy = cross_val_score(logistic_model, x_train, y_train).mean() * 100

print(f'Accuracy for the baseline Logistic Regression Model is {round(baseline_accuracy, 2)}% \n')

# Initialize the KNeighbor model and fit to the training data
knn_classifier = KNeighborsClassifier()
knn_classifier.fit(x_train, y_train)

# Utilize cross validation training to train the KNeighbors model
# for a baseline score
baseline_accuracy = cross_val_score(knn_classifier, x_train, y_train).mean() * 100

print(f'Accuracy for the baseline K-Neighbors Classifier Model is {round(baseline_accuracy, 2)}%')

knn_classifier = KNeighborsClassifier()

# Define the grid values to search
# We test each value from 5 to 50 in increments of 5
knn_grid ={'n_neighbors': [int(n) for n in np.linspace(5, 50, 10)]}

# Set up the grid search
knn_grid = GridSearchCV(knn_classifier, knn_grid)

# Fit the grid search, finding the optimal value
knn_grid.fit(x_train, y_train)

# Print the best model and its score
print(f'The best K-Neighbors model has {knn_grid.best_params_["n_neighbors"]} Neighbors')
print(f'Accuracy for the optimized K-Neighbors Classifier Model is {round(knn_grid.best_score_*100)}% \n')


# Initialize the tree and fit to the training data
tree_classifier = DecisionTreeClassifier()
tree_classifier.fit(x_train, y_train)

# Utilize cross validation training to train the model for a baseline score
baseline_accuracy = cross_val_score(tree_classifier, x_train, y_train).mean() * 100

print(f'Accuracy for the baseline Decision Tree Classifier Model is {round(baseline_accuracy, 2)}%')

tree_classifier = DecisionTreeClassifier()

# Define the hyperparameter search grid
tree_grid ={'max_depth':[int(n) for n in np.linspace(5, 50, 10)] + [None],
           'min_samples_split':[2, 3, 4, 5, 6]}

# Set up the grid search
tree_grid = GridSearchCV(tree_classifier, tree_grid)

# Fit the grid search, finding the optimal value
tree_grid.fit(x_train, y_train)

# Print the best model and its score
print(f'The best Decision Tree model has a max depth of {tree_grid.best_params_["max_depth"]} and a minimum of {tree_grid.best_params_["min_samples_split"]} points per split')
print(f'Accuracy for the optimized Decision Tree Classifier Model is {round(tree_grid.best_score_*100)}% \n')


# Initialize the tree and fit to the training data
forest_classifier = RandomForestClassifier()
forest_classifier.fit(x_train, y_train)

# Utilize cross validation training to train the model for a baseline score
baseline_accuracy = cross_val_score(forest_classifier, x_train, y_train).mean() * 100

print(f'Accuracy for the baseline Random Forest Classifier Model is {round(baseline_accuracy, 2)}%')

forest_classifier = RandomForestClassifier()

# Define the hyperparameter search grid
forest_grid ={'max_depth':[int(n) for n in np.linspace(5, 20, 4)] + [None],
           'min_samples_split':[2, 3, 4, 5, 6],
           'n_estimators':[50, 100, 150]}

# Set up the grid search
forest_grid = GridSearchCV(forest_classifier, forest_grid)

# Fit the grid search, finding the optimal value
forest_grid.fit(x_train, y_train)

# Print the best model and its score
print(f'The best Random Forest model has a max depth of {forest_grid.best_params_["max_depth"]}, a minimum of {forest_grid.best_params_["min_samples_split"]} points per split, and {forest_grid.best_params_["n_estimators"]} trees')
print(f'Accuracy for the optimized Random Forest Classifier Model is {round(forest_grid.best_score_*100)}%')
Accuracy for the baseline Logistic Regression Model is 81.66% 

Accuracy for the baseline K-Neighbors Classifier Model is 82.5%
The best K-Neighbors model has 5 Neighbors
Accuracy for the optimized K-Neighbors Classifier Model is 83%

Accuracy for the baseline Decision Tree Classifier Model is 82.1%
The best Decision Tree model has a max depth of 15 and a minimum of 2 points per split
Accuracy for the optimized Decision Tree Classifier Model is 83%

Accuracy for the baseline Random Forest Classifier Model is 87.05%
The best Random Forest model has a max depth of 20, a minimum of 3 points per split, and 50 trees
Accuracy for the optimized Random Forest Classifier Model is 88%
# Test logistic model
logistic_preds = logistic_model.predict(x_test)
logistic_accuracy = accuracy_score(y_test, logistic_preds)
logistic_precision = precision_score(y_test, logistic_preds, average='weighted')
logistic_recall = recall_score(y_test, logistic_preds, average='weighted')
logistic_f1 = f1_score(y_test, logistic_preds, average='weighted')

# Test KNN model
knn_test_preds = knn_grid.best_estimator_.predict(x_test)
knn_accuracy = accuracy_score(y_test, knn_test_preds)
knn_precision = precision_score(y_test, knn_test_preds, average='weighted')
knn_recall = recall_score(y_test, knn_test_preds, average='weighted')
knn_f1 = f1_score(y_test, knn_test_preds, average='weighted')

# Test Decision Tree model
tree_test_preds = tree_grid.best_estimator_.predict(x_test)
tree_accuracy = accuracy_score(y_test, tree_test_preds)
tree_precision = precision_score(y_test, tree_test_preds, average='weighted')
tree_recall = recall_score(y_test, tree_test_preds, average='weighted')
tree_f1 = f1_score(y_test, tree_test_preds, average='weighted')

# Test Random Forest model
forest_test_preds = forest_grid.best_estimator_.predict(x_test)
forest_accuracy = accuracy_score(y_test, forest_test_preds)
forest_precision = precision_score(y_test, forest_test_preds, average='weighted')
forest_recall = recall_score(y_test, forest_test_preds, average='weighted')
forest_f1 = f1_score(y_test, forest_test_preds, average='weighted')

# Convert scoring to a DataFrame
results = {
    "Model": ["Logistic Regression", "KNN", "Decision Tree", "Random Forest"],
    "Accuracy": [logistic_accuracy, knn_accuracy, tree_accuracy, forest_accuracy],
    "Precision": [logistic_precision, knn_precision, tree_precision, forest_precision],
    "Recall": [logistic_recall, knn_recall, tree_recall, forest_recall],
    "F1 Score": [logistic_f1, knn_f1, tree_f1, forest_f1]
}

# Convert the dictionary into a pandas DataFrame
results_df = pd.DataFrame(results)

print('Test results for Multi-Class Classification by Model\n')
# Display the DataFrame
print(results_df)
Test results for Multi-Class Classification by Model

                 Model  Accuracy  Precision    Recall  F1 Score
0  Logistic Regression  0.795820   0.793197  0.795820  0.794228
1                  KNN  0.834405   0.819622  0.834405  0.815207
2        Decision Tree  0.827974   0.819027  0.827974  0.822636
3        Random Forest  0.860129   0.844698  0.860129  0.847190

Regression

For our last task, we utilize the regression methods to predict the total vote percentage received by each player. For our metric of success, we use the ‘mean squared error’ (MSE), which is the average of the squared error of each individual prediction.
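
As a minimal worked example of the metric on three made-up vote percentages:

# A minimal sketch of mean squared error: square each prediction's error, then average
import numpy as np
from sklearn.metrics import mean_squared_error

true_pcts = np.array([75.0, 10.0, 40.0])      # made-up vote percentages
pred_pcts = np.array([70.0, 15.0, 50.0])      # made-up predictions

errors = true_pcts - pred_pcts                    # [5, -5, -10]
print(np.mean(errors ** 2))                       # (25 + 25 + 100) / 3 = 50.0
print(mean_squared_error(true_pcts, pred_pcts))   # the same value from scikit-learn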

# Split data into training and testing datasets for regression
# We reserve 20% of the data for testing
x_train, x_test, y_train, y_test = train_test_split(decomposed_data, reg_targets, test_size=0.2, random_state=5000, stratify=binary_targets)

Linear Regression

# Initialize Linear Regression Model and fit to the training data
linear_model = LinearRegression()
linear_model.fit(x_train, y_train)

# Utilize cross validation training to train the linear model
# for a baseline score
baseline_accuracy = cross_val_score(linear_model, x_train, y_train, scoring='neg_mean_squared_error').mean()

print(f'MSE for the baseline Linear Regression Model is {round(baseline_accuracy*-1, 2)}')
MSE for the baseline Linear Regression Model is 165.66

K-Neighbors Regression

# Initialize the KNeighbors regression model
knn_regressor = KNeighborsRegressor()


# Utilize cross validation training to train the KNeighbors model
# for a baseline score
baseline_accuracy = cross_val_score(knn_regressor, x_train, y_train, scoring='neg_mean_squared_error').mean()

print(f'MSE for the baseline K-Neighbors Regressor Model is {round(baseline_accuracy*-1, 2)}\n')

knn_regressor = KNeighborsRegressor()
knn_regressor.fit(x_train, y_train)

# Define the neighbor values to search
# We test each value from 5 to 50 in increments of 5
knn_grid ={'n_neighbors': [int(n) for n in np.linspace(5, 50, 10)]}

# Set up the grid search
knn_grid = GridSearchCV(knn_regressor, knn_grid, scoring='neg_mean_squared_error')

# Fit the grid search, finding the optimal value
knn_grid.fit(x_train, y_train)

# Print the best model and its score
print(f'The best K-Neighbors model has {knn_grid.best_params_["n_neighbors"]} Neighbors')
print(f'MSE for the optimized K-Neighbors Regressor Model is {round(knn_grid.best_score_*-1, 2)}')
MSE for the baseline K-Neighbors Regressor Model is 153.64

The best K-Neighbors model has 5 Neighbors
MSE for the optimized K-Neighbors Regressor Model is 153.64
# Initialize the Decision Tree regression model
tree_regressor = DecisionTreeRegressor()

# Utilize cross validation training to train the model
# for a baseline score
baseline_accuracy = cross_val_score(tree_regressor, x_train, y_train, scoring='neg_mean_squared_error').mean()

print(f'MSE for the baseline Decision Tree Regressor Model is {round(baseline_accuracy*-1, 2)}\n')

tree_regressor = DecisionTreeRegressor()
tree_regressor.fit(x_train, y_train)

# Define the grid values to search
# We test each value from 5 to 50 in increments of 5
tree_grid = {'max_depth':[int(n) for n in np.linspace(5, 50, 10)] + [None],
            'min_samples_split':[2, 3, 4, 5, 6]}

# Set up the grid search
tree_grid = GridSearchCV(tree_regressor, tree_grid, scoring='neg_mean_squared_error')

# Fit the grid search, finding the optimal value
tree_grid.fit(x_train, y_train)

# Print the best model and its score
print(f'The optimized Decision Tree model has a max depth of {tree_grid.best_params_["max_depth"]} and a minimum of {tree_grid.best_params_["min_samples_split"]} points per split')
print(f'MSE for the optimized Decision Tree Regressor Model is {round(tree_grid.best_score_*-1, 2)}')
MSE for the baseline Decision Tree Regressor Model is 168.81

The optimized Decision Tree model has a max depth of None and a minimum of 5 points per split
MSE for the optimized Decision Tree Regressor Model is 162.21

Random Forest Regressor

# Initialize the model and fit to the training data
forest_regressor = RandomForestRegressor()
forest_regressor.fit(x_train, y_train)

# Utilize cross validation training to train the model
# for a baseline score
baseline_accuracy = cross_val_score(forest_regressor, x_train, y_train, scoring='neg_mean_squared_error').mean()

print(f'MSE for the baseline Random Forest Regressor Model is {round(baseline_accuracy*-1, 2)}\n')

forest_regressor = RandomForestRegressor()

# Define the grid values to search
forest_grid = {'max_depth':[int(n) for n in np.linspace(5, 20, 4)] + [None],
               'min_samples_split':[4,5,6], 
               'n_estimators':[100, 150]}

# Set up the grid search
forest_grid = GridSearchCV(forest_regressor, forest_grid, scoring='neg_mean_squared_error')

# Fit the grid search, finding the optimal value
forest_grid.fit(x_train, y_train)

# Print the best model and its score
print(f'The best Random Forest model has a max depth of {forest_grid.best_params_["max_depth"]}, a minimum of {forest_grid.best_params_["min_samples_split"]} points per split, and {forest_grid.best_params_["n_estimators"]} trees')
print(f'MSE for the optimized Random Forest Regressor Model is {round(forest_grid.best_score_*-1, 2)}')
MSE for the baseline Random Forest Regressor Model is 98.09

The best Random Forest model has a max depth of None, a minimum of 5 points per split, and 150 trees
MSE for the optimized Random Forest Regressor Model is 95.97
# Test linear model
linear_preds = linear_model.predict(x_test)
linear_mse = mean_squared_error(y_test, linear_preds)

# Test Nearest Neighbors model
knn_preds = knn_regressor.predict(x_test)
knn_mse = mean_squared_error(y_test, knn_preds)

# Test Decision Tree model
tree_preds = tree_grid.best_estimator_.predict(x_test)
tree_mse = mean_squared_error(y_test, tree_preds)

# Test Random Forest model
forest_preds = forest_grid.best_estimator_.predict(x_test)
forest_mse = mean_squared_error(y_test, forest_preds)


# Convert scoring to a DataFrame
results = {
    "Model": ["Logistic Regression", "KNN", "Decision Tree", "Random Forest"],
    "MSE": [linear_mse, knn_mse, tree_mse, forest_mse],
}

# Convert the dictionary into a pandas DataFrame
results_df = pd.DataFrame(results)

print('Test results for Regression by Model\n')
# Display the DataFrame
print(results_df)
Test results for Regression by Model

                 Model         MSE
0    Linear Regression  137.182293
1                  KNN   94.947084
2        Decision Tree   87.421412
3        Random Forest   65.643072

Once again, we see pretty strong results! With our random forest model we are able to predict voting percentages with an MSE of ~65. This is a sizable improvement over the cross-validation scores on the training data, which may be partly due to randomness in the test split. Even with this caveat, the KNN and decision tree models also offer MSEs in the range of 85-95. We should remember, however, that the mean voting percentage is ~13%, given the multitude of players who do not come close to the HOF in any year on the ballot. That said, if we compute the MSE for a naive guesser that predicts ~13% every time, it comes out to >400, so we certainly see an improvement over that baseline!
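
As a quick sanity check of that baseline figure, one could compute the MSE of a constant guesser directly; a sketch using the reg_targets series and regression test split defined above:

# A minimal sketch of the constant 'mean vote share' baseline described above
import numpy as np
from sklearn.metrics import mean_squared_error

mean_vote_pct = reg_targets.mean()                        # roughly 13%
constant_preds = np.full(len(y_test), mean_vote_pct)      # the same guess for every player

print(f'MSE for a constant mean-vote guesser: {round(mean_squared_error(y_test, constant_preds), 2)}')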

This concludes the section on supervised learning, where we saw how it is possible to predict the outcomes of the BBWAA HOF ballot and the underlying voting percentages by utilizing an array of both classification and regression methods. For a more detailed report on the project as a whole, make sure to check out the report section!

References

1.
2.
GeeksforGeeks. Regression in Machine Learning.
3.
IBM. KNN.
4.
5.
6.
7.