Data Cleaning

Summary

Like any project, cleaning the raw data collected in the prior step is critical to ensuring we have a dataset that can be explored and modeled to extract insights. In this section, we read in the raw scraped data and process it in the following steps:

  • Reading in all the scraped data
  • Looking carefully through each dataset and:
    • Ensuring data is read in properly
    • Checking for and potentially removing missing values
    • Dropping unnecessary columns that do not relate to the research questions
    • Creating new features for analysis
    • Merging the datasets into a single final dataset each for batters and pitchers

Code


# General Imports
import pandas as pd
import pickle as pkl

from sklearn.impute import SimpleImputer

Reading in Raw Data

# Read in our datasets individually
all_inductees_df = pd.read_csv('../../data/raw-data/all_hof_inductees_table.csv')

yearly_hof_voting_df = pd.read_csv('../../data/raw-data/yearly_hof_voting_data.csv')

all_player_info = pd.read_csv('../../data/raw-data/all_player_info.csv')

with open('../../data/raw-data/all_player_stats.pkl', 'rb') as fpath:
    all_player_stats_dict = pkl.load(fpath)
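
Because some scrapes failed and stored something other than a dictionary for a player, it can be helpful to first count how many entries are usable. A minimal sketch, based on the isinstance checks used below:

# Count how many scraped entries are usable dicts vs. failed scrapes
usable = sum(isinstance(v, dict) for v in all_player_stats_dict.values())
print(f"{usable} of {len(all_player_stats_dict)} players scraped successfully")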

Because we scraped our player stats into a dictionary keyed by player, we next split this dictionary into DataFrames for each key type - Career Stats, Annual Stats, and Appearances by Position. While doing this, we also attach the ID from the dictionary key to the underlying DataFrame so that we don't lose the connection.

# Define the list of all player IDs scraped
ids = list(all_player_stats_dict.keys())

for player_id in ids:
    if isinstance(all_player_stats_dict[player_id], dict):
        all_player_stats_dict[player_id].get('career_stats')['player_id'] = player_id
        all_player_stats_dict[player_id].get('annual_stats')['player_id'] = player_id
        all_player_stats_dict[player_id].get('appearances')['player_id'] = player_id

# Create a list of each DataFrame stored within each ID, accounting for the occurrence of non-dictionaries when scraping failed
all_player_career_stats = [all_player_stats_dict[player_id].get('career_stats') if isinstance(all_player_stats_dict[player_id], dict) else None for player_id in ids]
all_player_annual_stats = [all_player_stats_dict[player_id].get('annual_stats') if isinstance(all_player_stats_dict[player_id], dict) else None for player_id in ids]
all_player_career_appearences = [all_player_stats_dict[player_id].get('appearances') if isinstance(all_player_stats_dict[player_id], dict) else None for player_id in ids]

Cleaning Data

All Inducted Players DataFrame

Working through our DataFrames, we start with the one holding all successfully inducted players, taking a glance at the first few rows of the data.

all_inductees_df.head()
voting_year player living_status voting_body inducted_as votes vote_percentage
0 2024 Adrian Beltré 1979-Living BBWAA Player 366.0 95.1%
1 2024 Todd Helton 1973-Living BBWAA Player 307.0 79.7%
2 2024 Jim Leyland 1944-Living Contemporary Baseball Era Non-Players Manager NaN NaN
3 2024 Joe Mauer 1983-Living BBWAA Player 293.0 76.1%
4 2023 Fred McGriff 1963-Living Contemporary Baseball Era Player 16.0 100.0%

We observe a few necessary fixes for the table, including:

  • Player names not properly encoded
  • Voting bodies other than BBWAA
  • Non-Players included via non-BBWAA methods
  • NaN values for vote tallies for non-BBWAA methods

Additionally, there are some cosmetic alterations we can make, including:

  • Removing the living status, which we will not use for further analysis
### Clean the all_inductees_df DataFrame

# Define a function for decoding the player names
def decode_player_name(name):
    # Names were scraped as UTF-8 bytes but read as latin-1, producing mojibake;
    # re-encoding as latin-1 and decoding as UTF-8 reverses the damage
    encoded_name = name.encode('latin-1')
    corrected_name = encoded_name.decode('utf-8')
    return corrected_name.strip()

# Decode the player names
all_inductees_df.player = all_inductees_df.player.apply(decode_player_name)

# Filter to only BBWAA votes
all_inductees_df = all_inductees_df[all_inductees_df.voting_body == 'BBWAA']

# Drop the living status column
all_inductees_df = all_inductees_df.drop(columns=['living_status'])
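
To make the encoding fix concrete, here is a minimal sketch of the mojibake round-trip on a hypothetical garbled name (UTF-8 bytes that were mistakenly decoded as latin-1):

# 'é' encoded in UTF-8 is the two bytes 0xC3 0xA9; read as latin-1, they render as 'Ã©'
garbled = 'BeltrÃ©'
print(decode_player_name(garbled))  # 'Beltré'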

Looking at the table again, all previous concerns are now resolved:

all_inductees_df.head()
voting_year player voting_body inducted_as votes vote_percentage
0 2024 Adrian Beltré BBWAA Player 366.0 95.1%
1 2024 Todd Helton BBWAA Player 307.0 79.7%
3 2024 Joe Mauer BBWAA Player 293.0 76.1%
5 2023 Scott Rolen BBWAA Player 297.0 76.3%
12 2022 David Ortiz BBWAA Player 307.0 77.9%

Next, we confirm the full table is clean, with no missing values and all the columns we desire.

all_inductees_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 134 entries, 0 to 345
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   voting_year      134 non-null    int64  
 1   player           134 non-null    object 
 2   voting_body      134 non-null    object 
 3   inducted_as      134 non-null    object 
 4   votes            134 non-null    float64
 5   vote_percentage  134 non-null    object 
dtypes: float64(1), int64(1), object(4)
memory usage: 7.3+ KB

We also observe that the player names are now properly encoded, resulting in a cleaned DataFrame. Thus, we save the data back to the processed data folder.

all_inductees_df.player.iloc[1:25]
1           Todd Helton
3             Joe Mauer
5           Scott Rolen
12          David Ortiz
13          Derek Jeter
16         Larry Walker
18         Roy Halladay
19       Edgar Martínez
20         Mike Mussina
21       Mariano Rivera
23    Vladimir Guerrero
24       Trevor Hoffman
25        Chipper Jones
27            Jim Thome
29         Jeff Bagwell
30           Tim Raines
31       Iván Rodríguez
34      Ken Griffey Jr.
35          Mike Piazza
36         Craig Biggio
37        Randy Johnson
38       Pedro Martinez
39          John Smoltz
41          Tom Glavine
Name: player, dtype: object
all_inductees_df.to_csv('../../data/processed-data/all_hof_inductees.csv', index=False)

Yearly BBWAA HOF Ballot Voting DataFrame

The next DataFrame contains the yearly voting results of the BBWAA ballot. We again print out the first few rows to observe the table structure.

yearly_hof_voting_df.head()
rank name player_page_url year_on_ballot votes votes_pct hof_monitor hof_standard experience WAR_career ... G_p GS SV IP H_p HR_p BB_p SO_p pos_summary voting_year
0 1 Ty Cobb https://www.baseball-reference.com/players/c/c... 1st 222 98.2% 445.0 75.0 24.0 151.4 ... 3.0 0.0 1.0 5.0 6.0 0.0 2.0 0.0 *8*9H7/3145 1936
1 2 Babe Ruth https://www.baseball-reference.com/players/r/r... 1st 215 95.1% 411.0 79.0 22.0 162.2 ... 163.0 147.0 4.0 1221.1 974.0 10.0 441.0 488.0 *9*71H83 1936
2 3 Honus Wagner https://www.baseball-reference.com/players/w/w... 1st 215 95.1% 313.0 75.0 21.0 131.0 ... 2.0 0.0 0.0 8.1 7.0 0.0 6.0 6.0 *6*O3594H7/81 1936
3 4 Christy Mathewson https://www.baseball-reference.com/players/m/m... 1st 205 90.7% 303.0 84.0 17.0 106.7 ... 636.0 552.0 30.0 4788.2 4219.0 89.0 848.0 2507.0 *1/H397 1936
4 5 Walter Johnson https://www.baseball-reference.com/players/j/j... 1st 189 83.6% 364.0 82.0 21.0 166.9 ... 802.0 666.0 34.0 5914.1 4913.0 97.0 1363.0 3509.0 *1H/897 1936

5 rows × 41 columns

Again, there are some immediate fixes we must complete. These include:

  • Correctly encoding player names
  • Converting the year on ballot and vote percentage columns to numeric values

As well as some cosmetic edits:

  • Dropping all the stats columns, plus rank, url, and raw vote tallies
    • We already have all the stats we will use in the stats DataFrames

For this DataFrame however, there is also some Feature Engineering/Creation we complete:

  • Building a column for player ID, taken from the scraped URL
  • Building a column for the player's vote percentage in the year prior
    • Again, this leverages a single-step Markov assumption: that the current vote % is impacted by (and only by) last year's vote %
  • Creating an 'outcome' column, based on whether the player was elected, expired, or remains in limbo
# Decode the player names
yearly_hof_voting_df.name = yearly_hof_voting_df.name.apply(decode_player_name)

# Convert the year on ballot and vote percent columns to numeric values
yearly_hof_voting_df.year_on_ballot = yearly_hof_voting_df.year_on_ballot.apply(lambda x: int(x[:-2]))  # strip the ordinal suffix ('1st' -> 1, '10th' -> 10)
yearly_hof_voting_df.votes_pct = yearly_hof_voting_df.votes_pct.apply(lambda x: float(x[:-1]))

# Build the ID column
yearly_hof_voting_df['player_id'] = yearly_hof_voting_df['player_page_url'].apply(lambda x: x.split('/')[-1].split('.')[0])

# Build the prior year vote percent column
# For players who are on the ballot for the first year, we insert None, and will later impute a value
yearly_hof_voting_df['ly_votes_pct'] = yearly_hof_voting_df.apply(lambda x: yearly_hof_voting_df[(yearly_hof_voting_df.player_id == x.player_id) & (yearly_hof_voting_df.year_on_ballot == x.year_on_ballot - 1)].votes_pct.iloc[0] if x.year_on_ballot > 1 else None, axis=1)

# Create outcome column
def return_ballot_outcome(year_on_ballot, current_vote):
    '''Returns the outcome for a player in the BBWAA vote, depending on their vote share and number
      of previous years on the ballot'''
    
    if current_vote >= 75:
        return 'elected'
    elif current_vote < 5:
        return 'eliminated'
    elif current_vote < 75 and year_on_ballot == 9:
        return 'expired'
    else:
        return 'limbo'

yearly_hof_voting_df['outcome'] = yearly_hof_voting_df.apply(lambda x: return_ballot_outcome(x.year_on_ballot, x.votes_pct), axis=1)

# Drop unnecessary columns
yearly_hof_voting_df = yearly_hof_voting_df[['name', 'player_id', 'voting_year', 'year_on_ballot', 'votes_pct', 'ly_votes_pct', 'outcome']]
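
As an aside, the same prior-year feature could have been built with a grouped shift rather than the row-wise lookup above; a sketch, assuming each player's ballot years are consecutive (which the single-step Markov framing already presumes):

# Equivalent vectorized construction of the prior-year feature (shown for reference):
# sort within each player by ballot year, then shift votes_pct down one row
shifted = (yearly_hof_voting_df
           .sort_values(['player_id', 'year_on_ballot'])
           .groupby('player_id')['votes_pct']
           .shift(1))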

Because we’ve filled in the prior year’s vote percentage as None for players in their first year on the ballot, we now must impute, or fill in, those missing values. To do this, we impute with the average vote percentage of first-year players. We don’t impute with 0, as the values are numeric and relative, so 0 may undervalue these players in a model. We also don’t choose the overall mean, as we already have the additional information that these are first-year players, so including ballot ‘veterans’ in the mean could introduce distortion.

# Calculate the mean vote percentage garnered by first year players
mean_value = yearly_hof_voting_df[['year_on_ballot', 'votes_pct']].groupby(by='year_on_ballot').mean().loc[1].votes_pct

# Impute the mean value
yearly_hof_voting_df.ly_votes_pct = yearly_hof_voting_df.ly_votes_pct.fillna(mean_value)

Finally, we once again check the first few rows of the dataset to ensure our changes have taken effect, and print info about the dataset as a whole to confirm the cleaning process is complete, before writing the dataset out to the processed data folder.

yearly_hof_voting_df.head()
name player_id voting_year year_on_ballot votes_pct ly_votes_pct outcome
0 Ty Cobb cobbty01 1936 1 98.2 12.975208 elected
1 Babe Ruth ruthba01 1936 1 95.1 12.975208 elected
2 Honus Wagner wagneho01 1936 1 95.1 12.975208 elected
3 Christy Mathewson mathech01 1936 1 90.7 12.975208 elected
4 Walter Johnson johnswa01 1936 1 83.6 12.975208 elected
yearly_hof_voting_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4900 entries, 0 to 4899
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   name            4900 non-null   object 
 1   player_id       4900 non-null   object 
 2   voting_year     4900 non-null   int64  
 3   year_on_ballot  4900 non-null   int64  
 4   votes_pct       4900 non-null   float64
 5   ly_votes_pct    4900 non-null   float64
 6   outcome         4900 non-null   object 
dtypes: float64(2), int64(2), object(3)
memory usage: 268.1+ KB
# Write DataFrame to csv in processed data folder
yearly_hof_voting_df.to_csv('../../data/processed-data/all_bbwaa_voting.csv', index=False)

Player Names and IDs DataFrame

Next, we inspect the DataFrame containing the general player information.

all_player_info.head()
name id url_suffix
0 David Aardsma aardsda01 /players/a/aardsda01.shtml
1 Henry Aaron aaronha01 /players/a/aaronha01.shtml
2 Tommie Aaron aaronto01 /players/a/aaronto01.shtml
3 Don Aase aasedo01 /players/a/aasedo01.shtml
4 Andy Abad abadan01 /players/a/abadan01.shtml
all_player_info.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23370 entries, 0 to 23369
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   name        23370 non-null  object
 1   id          23370 non-null  object
 2   url_suffix  23370 non-null  object
dtypes: object(3)
memory usage: 547.9+ KB

We see that the DataFrame is already clean and tidy, but we can drop the URL suffix, as we will not need it for further research. We then save the DataFrame to the processed data folder.

# Drop the URL column
all_player_info = all_player_info.drop(columns=['url_suffix'])

# Save the DataFrame
all_player_info.to_csv('../../data/processed-data/all_player_info.csv', index=False)

Career Stats for All Players

Moving on to the career stats for every player throughout history, we begin by dividing the data between batters and pitchers. We accomplish this by filtering based on whether a player meets a threshold for innings pitched and, if not, a threshold for plate appearances. During this process we also drop duplicated column names so that we can combine DataFrames in a future step.

  • If players do not meet either threshold, we simply ignore them, as there is zero possibility they will be eligible for the HOF under BBWAA rules
batter_career_stats_l = []
pitcher_career_stats_l = []

for series in all_player_career_stats:
    if isinstance(series, pd.DataFrame):
        # Check if the player pitched more than 150 innings
        if 'p_ip' in series.columns and float(series.p_ip.T.iloc[0,0]) > 150 and series['player_id'].iloc[0] != 'ruthba01': # We also make Babe Ruth a batter
            duplicated_columns = series.columns.duplicated()
            pitcher_career_stats_l.append(series.loc[:, ~duplicated_columns])

        # Check if the batter had more than 850 plate appearances
        elif 'b_pa' in series.columns and float(series.b_pa.T.iloc[0,0]) > 850:
            duplicated_columns = series.columns.duplicated()
            batter_career_stats_l.append(series.loc[:, ~duplicated_columns])

# Finally, we concatenate all the individual DataFrames into one
pitcher_career_stats = pd.concat(pitcher_career_stats_l, ignore_index=True)
batter_career_stats = pd.concat(batter_career_stats_l, ignore_index=True)
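
Having split the players, we can quickly confirm the size of each group before inspecting the data:

# How many qualifying batters and pitchers survived the thresholds
print(f"batters: {len(batter_career_stats)}, pitchers: {len(pitcher_career_stats)}")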

Now we inspect the first few rows of both the batting and pitching DataFrames.

batter_career_stats.head()
b_war b_games b_pa b_ab b_r b_h b_doubles b_triples b_hr b_rbi ... p_avg_exit_velo p_hard_hit_perc p_ld_perc p_gb_perc p_fb_perc p_gb_fb_ratio p_wpa_def p_cwpa_def p_baseout_runs p_win_loss_perc
0 143.1 3298 13941 12364 2174 3771 624 98 755 2297 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 -2.8 437 1046 944 102 216 42 6 13 94 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 8.7 855 3479 3044 355 772 99 43 11 324 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 1.8 452 1965 1756 307 493 67 46 19 280 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 0.5 702 2227 2044 273 523 109 23 62 242 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 99 columns

pitcher_career_stats.head()
b_war b_games b_pa b_ab b_r b_h b_doubles b_triples b_hr b_rbi ... p_cwpa_def p_baseout_runs accomplishments player_id b_gb_fb_ratio b_run_scoring_perc b_extra_bases_taken_perc b_stolen_base_perc b_avg_exit_velo b_hard_hit_perc
0 -0.1 139 5 4 0 0 0 0 0 0 ... -0.1% 11.21 aardsda01 NaN NaN NaN NaN NaN NaN
1 -0.1 81 5 5 0 0 0 0 0 0 ... 8.1% 36.38 1x All-Star aasedo01 NaN NaN NaN NaN NaN NaN
2 -0.1 159 9 9 0 1 0 0 0 0 ... -1.3% 10.25 abadfe01 3.00 0.0 NaN NaN NaN NaN
3 -1.0 79 252 225 21 38 3 3 0 17 ... NaN NaN abbeybe01 NaN 35.6 NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 0.7% 18.95 abbotan01 NaN NaN NaN NaN NaN NaN

5 rows × 99 columns

Both DataFrames carry the same large set of 99 columns, so we print the full column list once to determine which features are necessary.

pitcher_career_stats.columns
Index(['b_war', 'b_games', 'b_pa', 'b_ab', 'b_r', 'b_h', 'b_doubles',
       'b_triples', 'b_hr', 'b_rbi', 'b_sb', 'b_cs', 'b_bb', 'b_so',
       'b_batting_avg', 'b_onbase_perc', 'b_slugging_perc',
       'b_onbase_plus_slugging', 'b_onbase_plus_slugging_plus', 'b_roba',
       'b_rbat_plus', 'b_tb', 'b_gidp', 'b_hbp', 'b_sh', 'b_sf', 'b_ibb',
       'pos', 'awards', 'b_batting_avg_bip', 'b_iso_slugging',
       'b_home_run_perc', 'b_strikeout_perc', 'b_base_on_balls_perc',
       'b_ld_perc', 'b_gb_perc', 'b_fb_perc', 'b_pull_perc', 'b_center_perc',
       'b_oppo_perc', 'b_wpa_bat', 'b_cwpa_bat', 'b_baseout_runs', 'p_war',
       'p_w', 'p_l', 'p_win_loss_perc', 'p_earned_run_avg', 'p_g', 'p_gs',
       'p_gf', 'p_cg', 'p_sho', 'p_sv', 'p_ip', 'p_h', 'p_r', 'p_er', 'p_hr',
       'p_bb', 'p_ibb', 'p_so', 'p_hbp', 'p_bk', 'p_wp', 'p_bfp',
       'p_earned_run_avg_plus', 'p_fip', 'p_whip', 'p_hits_per_nine',
       'p_hr_per_nine', 'p_bb_per_nine', 'p_so_per_nine',
       'p_strikeouts_per_base_on_balls', 'p_batting_avg', 'p_onbase_perc',
       'p_slugging_perc', 'p_onbase_plus_slugging', 'p_batting_avg_bip',
       'p_home_run_perc', 'p_strikeout_perc', 'p_base_on_balls_perc',
       'p_avg_exit_velo', 'p_hard_hit_perc', 'p_ld_perc', 'p_gb_perc',
       'p_fb_perc', 'p_gb_fb_ratio', 'p_wpa_def', 'p_cwpa_def',
       'p_baseout_runs', 'accomplishments', 'player_id', 'b_gb_fb_ratio',
       'b_run_scoring_perc', 'b_extra_bases_taken_perc', 'b_stolen_base_perc',
       'b_avg_exit_velo', 'b_hard_hit_perc'],
      dtype='object')

We will not need this many features in our final datasets. Thus, for pitchers, we will use a concise set of pitching metrics alongside a few simple batting metrics. For batters, we will filter down to a more concise set of batting metrics, while leaving out any pitching metrics, because there is effectively no sample of batters who consistently pitch (outside of a few notable exceptions).

Batting Metrics

Metric Description
b_war Wins above Replacement (full definition here)
b_pa Plate Appearances
b_h Hits
b_hr Home Runs
b_sb Stolen Bases
b_bb Walks
b_so Strikeouts
b_batting_avg Batting Average
b_onbase_plus_slugging On Base % plus Slugging % (full definition here)
b_onbase_plus_slugging_plus On Base % plus Slugging % indexed to league average value of 100 (full definition here)
b_home_run_perc Home Run Percent
b_strikeout_perc Strikeout Percent
b_base_on_balls_perc Walk Percent
b_cwpa_bat Championship win probability added (full definition here)
b_baseout_runs Run Expectancy Change (full definition here)
accomplishments Awards won throughout career

Pitching Metrics

Metric Description
p_war Wins above Replacement (full definition here)
p_w Wins
p_win_loss_perc Win percentage
p_earned_run_avg Earned Run Average (earned runs allowed per nine innings pitched)
p_earned_run_avg_plus Earned Run Average indexed to league average value of 100 (full definition here)
p_g Games appeared in
p_gs Games started
p_sho Shutouts
p_sv Saves
p_ip Innings Pitched
p_so Strikeouts
p_whip Walks and Hits per Innings Pitched
p_fip Fielding Independent Pitching (full definition here)
p_strikeouts_per_base_on_balls Ratio of strikeouts to walks
p_batting_avg Batting Average of batters faced
p_onbase_plus_slugging On Base % + Slugging % of batters faced
p_home_run_perc Home Run %
p_strikeout_perc Strikeout %
p_cwpa_def Championship win probability added (full definition here)
p_baseout_runs Run Expectancy Change (full definition here)
b_war Wins above Replacement as a batter (full definition here)
b_batting_avg Batting Average (as a batter)
b_onbase_plus_slugging_plus On Base % plus Slugging % indexed to league average value of 100 (as a batter)
accomplishments Awards won throughout career
# Define the important columns and filter the datasets down
batter_columns = [
    "b_war", "b_pa", "b_h", "b_hr", "b_sb", "b_bb", "b_so", "b_batting_avg",
    "b_onbase_plus_slugging", "b_onbase_plus_slugging_plus", "b_home_run_perc",
    "b_strikeout_perc", "b_base_on_balls_perc", "b_cwpa_bat", "b_baseout_runs",
    "accomplishments", 'player_id'
]

pitcher_columns = [
    "p_war", "p_w", "p_win_loss_perc", "p_earned_run_avg", "p_earned_run_avg_plus",
    "p_g", "p_gs", "p_sho", "p_sv", "p_ip", "p_so", "p_whip", "p_fip",
    "p_strikeouts_per_base_on_balls", "p_batting_avg", "p_onbase_plus_slugging",
    "p_home_run_perc", "p_strikeout_perc", "p_cwpa_def", "p_baseout_runs",
    "accomplishments", 'b_war','b_batting_avg', 'b_onbase_plus_slugging_plus', 'player_id'
]

batter_career_stats = batter_career_stats[batter_columns]
pitcher_career_stats = pitcher_career_stats[pitcher_columns]

Looking through the career stats DataFrames, we still see some missing values present. Specifically, these occur for a handful of metrics, for players who played so long ago that these metrics were not tracked. To solve this problem, we impute with the median of each stat. We choose the median because the dataset consists of all major league players, so the underlying data is likely skewed by outliers from the few excellent players in each statistic.

pitcher_career_stats.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3835 entries, 0 to 3834
Data columns (total 25 columns):
 #   Column                          Non-Null Count  Dtype 
---  ------                          --------------  ----- 
 0   p_war                           3835 non-null   object
 1   p_w                             3835 non-null   object
 2   p_win_loss_perc                 3835 non-null   object
 3   p_earned_run_avg                3835 non-null   object
 4   p_earned_run_avg_plus           3835 non-null   object
 5   p_g                             3835 non-null   object
 6   p_gs                            3835 non-null   object
 7   p_sho                           3835 non-null   object
 8   p_sv                            3835 non-null   object
 9   p_ip                            3835 non-null   object
 10  p_so                            3835 non-null   object
 11  p_whip                          3835 non-null   object
 12  p_fip                           3835 non-null   object
 13  p_strikeouts_per_base_on_balls  3835 non-null   object
 14  p_batting_avg                   3271 non-null   object
 15  p_onbase_plus_slugging          3271 non-null   object
 16  p_home_run_perc                 3835 non-null   object
 17  p_strikeout_perc                3835 non-null   object
 18  p_cwpa_def                      3259 non-null   object
 19  p_baseout_runs                  3259 non-null   object
 20  accomplishments                 3835 non-null   object
 21  b_war                           3750 non-null   object
 22  b_batting_avg                   3605 non-null   object
 23  b_onbase_plus_slugging_plus     3605 non-null   object
 24  player_id                       3835 non-null   object
dtypes: object(25)
memory usage: 749.1+ KB
batter_career_stats.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4425 entries, 0 to 4424
Data columns (total 17 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   b_war                        4425 non-null   object
 1   b_pa                         4425 non-null   object
 2   b_h                          4425 non-null   object
 3   b_hr                         4425 non-null   object
 4   b_sb                         4412 non-null   object
 5   b_bb                         4425 non-null   object
 6   b_so                         4267 non-null   object
 7   b_batting_avg                4425 non-null   object
 8   b_onbase_plus_slugging       4425 non-null   object
 9   b_onbase_plus_slugging_plus  4425 non-null   object
 10  b_home_run_perc              4425 non-null   object
 11  b_strikeout_perc             4267 non-null   object
 12  b_base_on_balls_perc         4425 non-null   object
 13  b_cwpa_bat                   3777 non-null   object
 14  b_baseout_runs               3777 non-null   object
 15  accomplishments              4425 non-null   object
 16  player_id                    4425 non-null   object
dtypes: object(17)
memory usage: 587.8+ KB

To complete the imputing, we must also convert all possible stats to numeric values

# Convert values to numeric
batter_career_stats.b_cwpa_bat = batter_career_stats.b_cwpa_bat.apply(lambda x: x.replace('%', '') if isinstance(x, str) else x)
pitcher_career_stats.p_cwpa_def = pitcher_career_stats.p_cwpa_def.apply(lambda x: x.replace('%', '') if isinstance(x, str) else x)

# Helper to convert a column to numeric where possible, leaving text columns untouched
# (replaces the deprecated errors='ignore' option of pd.to_numeric)
def to_numeric_if_possible(col):
    try:
        return pd.to_numeric(col)
    except (ValueError, TypeError):
        return col

batter_career_stats = batter_career_stats.apply(to_numeric_if_possible)
pitcher_career_stats = pitcher_career_stats.apply(to_numeric_if_possible)

# Impute column medians, ignoring the string columns
imputer = SimpleImputer(strategy='median')
batter_career_stats.loc[:, ~batter_career_stats.columns.isin(['accomplishments', 'player_id'])] = imputer.fit_transform(batter_career_stats.loc[:, ~batter_career_stats.columns.isin(['accomplishments', 'player_id'])])
pitcher_career_stats.loc[:, ~pitcher_career_stats.columns.isin(['accomplishments', 'player_id'])] = imputer.fit_transform(pitcher_career_stats.loc[:, ~pitcher_career_stats.columns.isin(['accomplishments', 'player_id'])])
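
Note that the same median imputation could also be done in pandas directly, without sklearn; a minimal sketch of the equivalent operation:

# Pandas-only equivalent of the median imputation above (shown for reference)
num_cols = batter_career_stats.columns.difference(['accomplishments', 'player_id'])
batter_career_stats[num_cols] = batter_career_stats[num_cols].fillna(batter_career_stats[num_cols].median())
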
pitcher_career_stats.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3835 entries, 0 to 3834
Data columns (total 25 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   p_war                           3835 non-null   float64
 1   p_w                             3835 non-null   int64  
 2   p_win_loss_perc                 3835 non-null   float64
 3   p_earned_run_avg                3835 non-null   float64
 4   p_earned_run_avg_plus           3835 non-null   int64  
 5   p_g                             3835 non-null   int64  
 6   p_gs                            3835 non-null   int64  
 7   p_sho                           3835 non-null   int64  
 8   p_sv                            3835 non-null   int64  
 9   p_ip                            3835 non-null   float64
 10  p_so                            3835 non-null   int64  
 11  p_whip                          3835 non-null   float64
 12  p_fip                           3835 non-null   float64
 13  p_strikeouts_per_base_on_balls  3835 non-null   float64
 14  p_batting_avg                   3835 non-null   float64
 15  p_onbase_plus_slugging          3835 non-null   float64
 16  p_home_run_perc                 3835 non-null   float64
 17  p_strikeout_perc                3835 non-null   float64
 18  p_cwpa_def                      3835 non-null   float64
 19  p_baseout_runs                  3835 non-null   float64
 20  accomplishments                 3835 non-null   object 
 21  b_war                           3835 non-null   float64
 22  b_batting_avg                   3835 non-null   float64
 23  b_onbase_plus_slugging_plus     3835 non-null   float64
 24  player_id                       3835 non-null   object 
dtypes: float64(16), int64(7), object(2)
memory usage: 749.1+ KB

Now that the table is free from missing values, the last step is to account for the awards data, which is currently unstructured text. To do this, we write a function that searches the text for any of the five major awards we want to include, as well as their counts, before adding new columns for the award wins. By looking at awards, we also aim to capture greatness from players who had short stretches of dominance but didn't necessarily play for many years. The major awards are:

  • MVP - Most Valuable Player (Generally a hitter)
  • Cy Young - Best Pitcher
  • Batting Title - Highest Batting Average
  • Gold Glove - Best defensive player at a given position
  • All Star - Being named to the ‘All Star’ team in a given year
def scan_awards(text, award):
    '''Given the text from our award scraping function, and a specific award, return the number of times
       the award was won'''
    # Check if the award was won at all
    win = award.lower() in text.lower()

    # If the award was not won, return 0. Else, return the number of times it was won
    if not win:
        return 0
    else:
        if ',' in text: # The player won multiple awards, and the formatting is different
            try:
                num_wins = int(text.lower().split(award.lower())[0].split(',')[-1].strip().replace('x', ''))
            except (IndexError, ValueError): # Sometimes when winning the award once, the format changes
                return 1
        else:
            try:
                num_wins = int(text.lower().split(award.lower())[0].strip().replace('x', ''))
            except (IndexError, ValueError):
                return 1

        return num_wins
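
Before applying the function across both DataFrames, a quick sanity check on a hypothetical accomplishments string (formatted like the scraped text) confirms the parsing behaves as expected:

# Hypothetical accomplishments text for illustration
sample = '3x MVP, 12x All-Star, 2x Gold Glove'
print(scan_awards(sample, 'MVP'))            # 3
print(scan_awards(sample, 'All-Star'))       # 12
print(scan_awards(sample, 'Batting Title'))  # 0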

# Insert Pitching Awards  
pitcher_career_stats["cy_youngs"] = pitcher_career_stats.accomplishments.apply(lambda x: scan_awards(x, "Cy Young"))
pitcher_career_stats["gold_gloves"] = pitcher_career_stats.accomplishments.apply(lambda x: scan_awards(x, "Gold Glove"))
pitcher_career_stats["mvps"] = pitcher_career_stats.accomplishments.apply(lambda x: scan_awards(x, "MVP"))
pitcher_career_stats["all_stars"] = pitcher_career_stats.accomplishments.apply(lambda x: scan_awards(x, "All-Star"))

# Insert Batting Awards
batter_career_stats["mvps"] = batter_career_stats.accomplishments.apply(lambda x: scan_awards(x, "MVP"))
batter_career_stats["gold_gloves"] = batter_career_stats.accomplishments.apply(lambda x: scan_awards(x, "Gold Glove"))
batter_career_stats["batting_titles"] = batter_career_stats.accomplishments.apply(lambda x: scan_awards(x, "Batting Title"))
batter_career_stats["all_stars"] = batter_career_stats.accomplishments.apply(lambda x: scan_awards(x, "All-Star"))

# Remove the general accomplishments columns from each DataFrame
batter_career_stats = batter_career_stats.drop(columns=['accomplishments'])
pitcher_career_stats = pitcher_career_stats.drop(columns=['accomplishments'])
# View the first few rows of the awards data for batters
batter_career_stats.head()[['player_id', 'mvps', 'gold_gloves', 'batting_titles', 'all_stars']]
player_id mvps gold_gloves batting_titles all_stars
0 aaronha01 1 3 2 25
1 aaronto01 0 0 0 0
2 abbated01 0 0 0 0
3 abbeych01 0 0 0 0
4 abbotku01 0 0 0 0

With the DataFrames completely clean, and the awards dealt with, we save them back to the processed data folder.

batter_career_stats.to_csv('../../data/processed-data/batter_career_stats.csv', index=False)
pitcher_career_stats.to_csv('../../data/processed-data/pitcher_career_stats.csv', index=False)

Appearances by Position DataFrame

Finally, we move on to the appearances-by-position dataset, which we will use to gain a more granular look at the positions each player played throughout their career. Because this is a list of DataFrames, we start by concatenating everything into a single DataFrame.

# pd.concat silently drops the None entries left by failed scrapes
appearances_df = pd.concat(all_player_career_appearences, ignore_index=True)

appearances_df.head()
lg_ID G_all GS G_batting G_defense G_p_app G_c G_1b G_2b G_3b G_ss G_lf_app G_cf_app G_rf_app G_of_app G_dh G_ph G_pr player_id
0 331 0 139 331 331 0 0 0 0 0 0 0 0 0 0 0 0 aardsda01
1 3298 3173 3298 2985 0 0 210 43 7 0 315 308 2174 2760 201 122 1 aaronha01
2 437 206 437 346 0 0 232 7 10 0 135 1 2 137 0 102 35 aaronto01
3 448 91 81 448 448 0 0 0 0 0 0 0 0 0 0 0 0 aasedo01
4 15 4 15 9 0 0 8 0 0 0 0 0 1 1 0 7 1 abadan01

The dataset looks quite clean already! The first few columns tally general appearance types (all games, starts, games batting, and games on defense), the final two before player_id tally games as a pinch hitter and pinch runner, and the G_[position] columns each tally games at a given position. The two steps we take to finish the job are dropping unnecessary columns, and converting the game tallies to numeric values and then to percentage shares.

Additionally, a small subset of players have missing data for their positions. To handle this, we fill the missing values with an equal share for every position column, signaling no relative impact from position.

# Convert values to numeric types
appearances_df = appearances_df.apply(to_numeric_if_possible)

# Define the important columns to keep and normalize
position_cols = ['G_p_app', 'G_c', 'G_1b', 'G_2b', 'G_3b',
                 'G_ss', 'G_lf_app', 'G_cf_app', 'G_rf_app',
                 'G_dh']

# Normalize the games by position into percentages
appearances_df[position_cols] = appearances_df[position_cols].div(appearances_df[position_cols].sum(axis=1), axis=0)

# Filter down to our final DataFrame
appearances_df = appearances_df[position_cols + ['player_id']]

# Fill missing values with an equal share across the ten position columns
appearances_df = appearances_df.fillna(1 / len(position_cols))
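
As a quick check on the normalization, we can look at the distribution of row sums across the position columns, which should sit at (approximately) 1.0 throughout, assuming missing values occurred only as fully missing rows:

# Row sums should all be ~1.0 after normalization and imputation
row_sums = appearances_df[position_cols].sum(axis=1)
print(row_sums.describe())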

Looking at the final DataFrame, we see that it is now clean and can be written to the processed data folder.

appearances_df.head()
G_p_app G_c G_1b G_2b G_3b G_ss G_lf_app G_cf_app G_rf_app G_dh player_id
0 1.0 0.0 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.000000 0.000000 aardsda01
1 0.0 0.0 0.064457 0.013198 0.002149 0.0 0.096685 0.094537 0.667281 0.061694 aaronha01
2 0.0 0.0 0.599483 0.018088 0.025840 0.0 0.348837 0.002584 0.005168 0.000000 aaronto01
3 1.0 0.0 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.000000 0.000000 aasedo01
4 0.0 0.0 0.888889 0.000000 0.000000 0.0 0.000000 0.000000 0.111111 0.000000 abadan01
appearances_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21228 entries, 0 to 21227
Data columns (total 11 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   G_p_app    21228 non-null  float64
 1   G_c        21228 non-null  float64
 2   G_1b       21228 non-null  float64
 3   G_2b       21228 non-null  float64
 4   G_3b       21228 non-null  float64
 5   G_ss       21228 non-null  float64
 6   G_lf_app   21228 non-null  float64
 7   G_cf_app   21228 non-null  float64
 8   G_rf_app   21228 non-null  float64
 9   G_dh       21228 non-null  float64
 10  player_id  21228 non-null  object 
dtypes: float64(10), object(1)
memory usage: 1.8+ MB
appearances_df.to_csv('../../data/processed-data/appearances.csv', index=False)

Final Dataset Building

Now that all of the individual DataFrames are cleaned (woohoo!), we move on to condensing the data into one final DataFrame for batters and one for pitchers, which we will use for future exploration and analysis. To accomplish this, we take the BBWAA voting DataFrame as our base, as it already contains only those players who have appeared on a BBWAA ballot, and then we attach extra features, like player stats, from our adjacent DataFrames.

# Read in the core DataFrames
final_dataset = pd.read_csv('../../data/processed-data/all_bbwaa_voting.csv')

# Read in the stats DataFrames
batter_stats = pd.read_csv('../../data/processed-data/batter_career_stats.csv')
pitcher_stats = pd.read_csv('../../data/processed-data/pitcher_career_stats.csv')
def sort_row_by_bp(player_id, batter_stats, pitcher_stats):
    '''Label a player as a pitcher or batter based on which stats DataFrame contains their ID'''
    if player_id in pitcher_stats.player_id.values:
        return 'pitcher'
    elif player_id in batter_stats.player_id.values:
        return 'batter'
    else:
        return 'missing'

# Attach the batter/pitcher designation for each entry and check the results
final_dataset['position'] = final_dataset.player_id.apply(lambda x: sort_row_by_bp(x, batter_stats, pitcher_stats))
final_dataset.position.value_counts()
position
batter     3144
missing    1226
pitcher     530
Name: count, dtype: int64

Upon investigating the players with ‘missing’ values for their position, we can confirm that the vast majority are ‘missing’ because they were among the group that failed to be scraped in the data collection phase due to changes in the underlying HTML. Below, we print the only ‘missing’ players whose data was successfully scraped.

for player in final_dataset[final_dataset.position == 'missing'].player_id:
    if player == 'mccarjo99': # skip; this ID differs across datasets and would raise a KeyError
        continue
    val = all_player_stats_dict[player]
    if not isinstance(val, str):
        print(player)
rickebr01
rickebr01
rickebr01
rickebr01
ensje01
beaucji01

There are only four such players who did successfully have data scraped (the three unique IDs printed above, plus the skipped mccarjo99). However, a quick search reveals that these players were actually primarily managers or executives, who played early enough that the BBWAA eligibility rules were different, such that they were voted on from this pool. Because of this, we drop all players who are ‘missing’ in this dataset.

final_dataset = final_dataset[final_dataset.position != 'missing']
final_dataset.head()
name player_id voting_year year_on_ballot votes_pct ly_votes_pct outcome position
0 Ty Cobb cobbty01 1936 1 98.2 12.975208 elected batter
1 Babe Ruth ruthba01 1936 1 95.1 12.975208 elected batter
2 Honus Wagner wagneho01 1936 1 95.1 12.975208 elected batter
5 Nap Lajoie lajoina01 1936 1 64.6 12.975208 limbo batter
6 Tris Speaker speaktr01 1936 1 58.8 12.975208 limbo batter

Now that we are sure that all dataset entries have a valid connection to the stats data, we split the data into batters and pitchers before attaching the career stats.

# Use .copy() so later column assignments don't trigger SettingWithCopyWarning
pitcher_df = final_dataset[final_dataset.position == 'pitcher'].copy()
batter_df = final_dataset[final_dataset.position == 'batter'].copy()

Before we merge the stats with the BBWAA voting data, we will also add in the scandal data as set forth in the earlier FanGraphs paper. We do this with a manual list of known scandal-tainted players taken from that paper, filtered down to the players in our dataset. Reasons for scandals include:

  • Use of Performance Enhancing Drugs
  • Gambling
  • Conduct detrimental to the game and personal image

After adding in the presence of a scandal, we remove the players with scandals, as the presence of a scandal essentially interferes with the normal voting process and, critically, is a factor known before the voting itself. Thus, for future data we would also be able to exclude these players from predictions.

# Define the ids with scandals
scandal_ids = [
    'bondsba01', 'rodrial01',
    'pettian01', 'ortizda01',
    'schilcu01', 'dykstle01',
    'ramirma02', 'mcgwima01',
    'rosepe01']

# Create a scandal column in both the pitching and batting DataFrames
batter_df['scandal'] = batter_df.player_id.apply(lambda x: 1 if x in scandal_ids else 0)
pitcher_df['scandal'] = pitcher_df.player_id.apply(lambda x: 1 if x in scandal_ids else 0)

# Remove players with scandals
batter_df = batter_df[batter_df.scandal == 0]
pitcher_df = pitcher_df[pitcher_df.scandal == 0]
batter_df = batter_df.merge(batter_stats, on='player_id', how='left')
pitcher_df = pitcher_df.merge(pitcher_stats, on='player_id', how='left')

Next, we attach the appearances by position data with another merge

batter_df = batter_df.merge(appearances_df, on='player_id', how='left')
pitcher_df = pitcher_df.merge(appearances_df, on='player_id', how='left')
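
Because these are left merges, any ID that fails to match would surface as missing values; a quick check we might run:

# Count missing values per column after the merges; nonzero counts flag unmatched IDs
print(batter_df.isna().sum().loc[lambda s: s > 0])
print(pitcher_df.isna().sum().loc[lambda s: s > 0])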

One other thing we must deal with is the fact that some advanced metrics, like championship win probability added, do not appear in the data until 1912, so players whose careers straddle 1912 have depressed totals due to the cumulative nature of these metrics. To counter this, we override the values of these columns for the earliest years of voting, through 1945, with the mean among those early players.

# Confirm the full column list after the merges
pitcher_df.columns
Index(['name', 'player_id', 'voting_year', 'year_on_ballot', 'votes_pct',
       'ly_votes_pct', 'outcome', 'position', 'scandal', 'p_war', 'p_w',
       'p_win_loss_perc', 'p_earned_run_avg', 'p_earned_run_avg_plus', 'p_g',
       'p_gs', 'p_sho', 'p_sv', 'p_ip', 'p_so', 'p_whip', 'p_fip',
       'p_strikeouts_per_base_on_balls', 'p_batting_avg',
       'p_onbase_plus_slugging', 'p_home_run_perc', 'p_strikeout_perc',
       'p_cwpa_def', 'p_baseout_runs', 'b_war', 'b_batting_avg',
       'b_onbase_plus_slugging_plus', 'cy_youngs', 'gold_gloves', 'mvps',
       'all_stars', 'G_p_app', 'G_c', 'G_1b', 'G_2b', 'G_3b', 'G_ss',
       'G_lf_app', 'G_cf_app', 'G_rf_app', 'G_dh'],
      dtype='object')
# Update some of the advanced stats for very early players
early_batters = batter_df[batter_df.voting_year <= 1945]
early_cwpa =  early_batters.b_cwpa_bat.mean()
early_re24 = early_batters.b_baseout_runs.mean()

batter_df.b_cwpa_bat = batter_df.apply(lambda x: early_cwpa if x.voting_year <= 1945 else x.b_cwpa_bat, axis=1)
batter_df.b_baseout_runs = batter_df.apply(lambda x: early_re24 if x.voting_year <= 1945 else x.b_baseout_runs, axis=1)

# And do the same for the pitchers DataFrame
early_pitchers = pitcher_df[pitcher_df.voting_year <= 1945]
early_cwpa =  early_pitchers.p_cwpa_def.mean()
early_re24 = early_pitchers.p_baseout_runs.mean()

pitcher_df.p_cwpa_def = pitcher_df.apply(lambda x: early_cwpa if x.voting_year <= 1945 else x.p_cwpa_def, axis=1)
pitcher_df.p_baseout_runs = pitcher_df.apply(lambda x: early_re24 if x.voting_year <= 1945 else x.p_baseout_runs, axis=1)
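
As an aside, the same override can be expressed with a boolean mask, avoiding the row-wise apply; a sketch for the pitchers DataFrame:

# Equivalent vectorized form of the override above
early_mask = pitcher_df.voting_year <= 1945
pitcher_df.loc[early_mask, 'p_cwpa_def'] = early_cwpa
pitcher_df.loc[early_mask, 'p_baseout_runs'] = early_re24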

Finally, we write out our final datasets to the processed data folder!

batter_df.to_csv('../../data/processed-data/final_batter_df.csv', index=False)
pitcher_df.to_csv('../../data/processed-data/final_pitcher_df.csv', index=False)

Now that all of our data is cleaned, we move forward to exploring and analyzing our datasets in depth! This work can be found in the Exploratory Data Analysis tab.