Data Cleaning

Summary

Like any project, cleaning the raw data collected in the prior step is critical to ensuring we have a dataset that can be explored and modeled to extract insights. In this section, we read in the raw scraped data and process it in the following steps:

  • Reading in all the scraped data
  • Looking carefully through each dataset and:
    • Ensuring data is read in properly
    • Checking for and potentially removing missing values
    • Dropping unnecessary columns that do not relate to the research questions
    • Creating new features for analysis
    • Merging the datasets into a single final dataset each for batters and pitchers

Code


# General Imports
import pandas as pd
import pickle as pkl

from sklearn.impute import SimpleImputer

Reading in Raw Data

# Read in our datasets individually
all_inductees_df = pd.read_csv('../../data/raw-data/all_hof_inductees_table.csv')

yearly_hof_voting_df = pd.read_csv('../../data/raw-data/yearly_hof_voting_data.csv')

all_player_info = pd.read_csv('../../data/raw-data/all_player_info.csv')

with open('../../data/raw-data/all_player_stats.pkl', 'rb') as fpath:
    all_player_stats_dict = pkl.load(fpath)
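
Because some scrapes failed and stored something other than a dictionary for a player, it can be helpful to first count how many entries are usable. A minimal sketch, based on the isinstance checks used below:

# Count how many scraped entries are usable dicts vs. failed scrapes
usable = sum(isinstance(v, dict) for v in all_player_stats_dict.values())
print(f"{usable} of {len(all_player_stats_dict)} players scraped successfully")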

Because we scraped our player stats into a dictionary keyed by player, we next split this dictionary into DataFrames for each key type - Career Stats, Annual Stats, and Appearances by Position. While doing this, we also attach the ID from the dictionary key to the underlying DataFrame so that we don't lose the connection.

# Define the list of all player IDs scraped
ids = list(all_player_stats_dict.keys())

for player_id in ids:
    if isinstance(all_player_stats_dict[player_id], dict):
        all_player_stats_dict[player_id].get('career_stats')['player_id'] = player_id
        all_player_stats_dict[player_id].get('annual_stats')['player_id'] = player_id
        all_player_stats_dict[player_id].get('appearances')['player_id'] = player_id

# Create a list of each DataFrame stored within each ID, accounting for the occurrence of non-dictionaries when scraping failed
all_player_career_stats = [all_player_stats_dict[player_id].get('career_stats') if isinstance(all_player_stats_dict[player_id], dict) else None for player_id in ids]
all_player_annual_stats = [all_player_stats_dict[player_id].get('annual_stats') if isinstance(all_player_stats_dict[player_id], dict) else None for player_id in ids]
all_player_career_appearences = [all_player_stats_dict[player_id].get('appearances') if isinstance(all_player_stats_dict[player_id], dict) else None for player_id in ids]

Cleaning Data

All Inducted Players DataFrame

Working through our DataFrames, we start with the one holding all successfully inducted players, taking a glance at the first few rows of the data.

all_inductees_df.head()
voting_year player living_status voting_body inducted_as votes vote_percentage
0 2024 Adrian Beltré 1979-Living BBWAA Player 366.0 95.1%
1 2024 Todd Helton 1973-Living BBWAA Player 307.0 79.7%
2 2024 Jim Leyland 1944-Living Contemporary Baseball Era Non-Players Manager NaN NaN
3 2024 Joe Mauer 1983-Living BBWAA Player 293.0 76.1%
4 2023 Fred McGriff 1963-Living Contemporary Baseball Era Player 16.0 100.0%

We observe a few necessary fixes for the table, including:

  • Player names not properly encoded
  • Voting bodies other than BBWAA
  • Non-Players included via non-BBWAA methods
  • NaN values for vote tallies for non-BBWAA methods

Additionally, there are some cosmetic alterations we can make, including:

  • Removing the living status, which we will not use for further analysis
### Clean the all_inductees_df DataFrame

# Define a function for decoding the player names
def decode_player_name(name):
    # Names were scraped as UTF-8 bytes but read as latin-1, producing mojibake;
    # re-encoding as latin-1 and decoding as UTF-8 reverses the damage
    encoded_name = name.encode('latin-1')
    corrected_name = encoded_name.decode('utf-8')
    return corrected_name.strip()

# Decode the player names
all_inductees_df.player = all_inductees_df.player.apply(decode_player_name)

# Filter to only BBWAA votes
all_inductees_df = all_inductees_df[all_inductees_df.voting_body == 'BBWAA']

# Drop the living status column
all_inductees_df = all_inductees_df.drop(columns=['living_status'])
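
To make the encoding fix concrete, here is a minimal sketch of the mojibake round-trip on a hypothetical garbled name (UTF-8 bytes that were mistakenly decoded as latin-1):

# 'é' encoded in UTF-8 is the two bytes 0xC3 0xA9; read as latin-1, they render as 'Ã©'
garbled = 'BeltrÃ©'
print(decode_player_name(garbled))  # 'Beltré'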

Looking at the table again, all previous concerns are now resolved:

all_inductees_df.head()
voting_year player voting_body inducted_as votes vote_percentage
0 2024 Adrian Beltré BBWAA Player 366.0 95.1%
1 2024 Todd Helton BBWAA Player 307.0 79.7%
3 2024 Joe Mauer BBWAA Player 293.0 76.1%
5 2023 Scott Rolen BBWAA Player 297.0 76.3%
12 2022 David Ortiz BBWAA Player 307.0 77.9%

Next, we confirm the full table is clean, with no missing values and all the columns we desire.

all_inductees_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 134 entries, 0 to 345
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   voting_year      134 non-null    int64  
 1   player           134 non-null    object 
 2   voting_body      134 non-null    object 
 3   inducted_as      134 non-null    object 
 4   votes            134 non-null    float64
 5   vote_percentage  134 non-null    object 
dtypes: float64(1), int64(1), object(4)
memory usage: 7.3+ KB

We also observe that the player names are now properly encoded, resulting in a cleaned DataFrame. Thus, we save the data back to the processed data folder.

all_inductees_df.player.iloc[1:25]
1           Todd Helton
3             Joe Mauer
5           Scott Rolen
12          David Ortiz
13          Derek Jeter
16         Larry Walker
18         Roy Halladay
19       Edgar Martínez
20         Mike Mussina
21       Mariano Rivera
23    Vladimir Guerrero
24       Trevor Hoffman
25        Chipper Jones
27            Jim Thome
29         Jeff Bagwell
30           Tim Raines
31       Iván Rodríguez
34      Ken Griffey Jr.
35          Mike Piazza
36         Craig Biggio
37        Randy Johnson
38       Pedro Martinez
39          John Smoltz
41          Tom Glavine
Name: player, dtype: object
all_inductees_df.to_csv('../../data/processed-data/all_hof_inductees.csv', index=False)

Yearly BBWAA HOF Ballot Voting DataFrame

The next DataFrame contains the yearly voting results of the BBWAA ballot. We again print out the first few rows to observe the table structure.

yearly_hof_voting_df.head()
rank name player_page_url year_on_ballot votes votes_pct hof_monitor hof_standard experience WAR_career ... G_p GS SV IP H_p HR_p BB_p SO_p pos_summary voting_year
0 1 Ty Cobb https://www.baseball-reference.com/players/c/c... 1st 222 98.2% 445.0 75.0 24.0 151.4 ... 3.0 0.0 1.0 5.0 6.0 0.0 2.0 0.0 *8*9H7/3145 1936
1 2 Babe Ruth https://www.baseball-reference.com/players/r/r... 1st 215 95.1% 411.0 79.0 22.0 162.2 ... 163.0 147.0 4.0 1221.1 974.0 10.0 441.0 488.0 *9*71H83 1936
2 3 Honus Wagner https://www.baseball-reference.com/players/w/w... 1st 215 95.1% 313.0 75.0 21.0 131.0 ... 2.0 0.0 0.0 8.1 7.0 0.0 6.0 6.0 *6*O3594H7/81 1936
3 4 Christy Mathewson https://www.baseball-reference.com/players/m/m... 1st 205 90.7% 303.0 84.0 17.0 106.7 ... 636.0 552.0 30.0 4788.2 4219.0 89.0 848.0 2507.0 *1/H397 1936
4 5 Walter Johnson https://www.baseball-reference.com/players/j/j... 1st 189 83.6% 364.0 82.0 21.0 166.9 ... 802.0 666.0 34.0 5914.1 4913.0 97.0 1363.0 3509.0 *1H/897 1936

5 rows × 41 columns

Again, there are some immediate fixes we must complete. These include:

  • Correctly encoding player names
  • Converting the year on ballot and vote percentage columns to numeric values

As well as some cosmetic edits:

  • Dropping all the stats columns, plus rank, url, and raw vote tallies
    • We already have all the stats we will use in the stats DataFrames

For this DataFrame however, there is also some Feature Engineering/Creation we complete:

  • Building a column for player ID, taken from the scraped URL
  • Building a column for the player's vote percentage in the year prior
    • Again, this leverages a single-step Markov assumption: that the current vote % is impacted by (and only by) last year's vote %
  • Creating an 'outcome' column, based on whether the player was elected, expired, or remains in limbo
# Decode the player names
yearly_hof_voting_df.name = yearly_hof_voting_df.name.apply(decode_player_name)

# Convert the year on ballot and vote percent columns to numeric values
yearly_hof_voting_df.year_on_ballot = yearly_hof_voting_df.year_on_ballot.apply(lambda x: int(x[:-2]))  # strip the ordinal suffix ('1st' -> 1, '10th' -> 10)
yearly_hof_voting_df.votes_pct = yearly_hof_voting_df.votes_pct.apply(lambda x: float(x[:-1]))

# Build the ID column
yearly_hof_voting_df['player_id'] = yearly_hof_voting_df['player_page_url'].apply(lambda x: x.split('/')[-1].split('.')[0])

# Build the prior year vote percent column
# For players who are on the ballot for the first year, we insert None, and will later impute a value
yearly_hof_voting_df['ly_votes_pct'] = yearly_hof_voting_df.apply(lambda x: yearly_hof_voting_df[(yearly_hof_voting_df.player_id == x.player_id) & (yearly_hof_voting_df.year_on_ballot == x.year_on_ballot - 1)].votes_pct.iloc[0] if x.year_on_ballot > 1 else None, axis=1)

# Create outcome column
def return_ballot_outcome(year_on_ballot, current_vote):
    '''Returns the outcome for a player in the BBWAA vote, depending on their vote share and number
      of previous years on the ballot'''
    
    if current_vote >= 75:
        return 'elected'
    elif current_vote < 5:
        return 'eliminated'
    elif current_vote < 75 and year_on_ballot == 9:
        return 'expired'
    else:
        return 'limbo'

yearly_hof_voting_df['outcome'] = yearly_hof_voting_df.apply(lambda x: return_ballot_outcome(x.year_on_ballot, x.votes_pct), axis=1)

# Drop unnecessary columns
yearly_hof_voting_df = yearly_hof_voting_df[['name', 'player_id', 'voting_year', 'year_on_ballot', 'votes_pct', 'ly_votes_pct', 'outcome']]
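
As an aside, the same prior-year feature could have been built with a grouped shift rather than the row-wise lookup above; a sketch, assuming each player's ballot years are consecutive (which the single-step Markov framing already presumes):

# Equivalent vectorized construction of the prior-year feature (shown for reference):
# sort within each player by ballot year, then shift votes_pct down one row
shifted = (yearly_hof_voting_df
           .sort_values(['player_id', 'year_on_ballot'])
           .groupby('player_id')['votes_pct']
           .shift(1))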

Because we’ve filled in the prior year’s vote percentage as None for players in their first year on the ballot, we now must impute, or fill in, those missing values. To do this, we impute with the average vote percentage of first-year players. We don’t impute with 0, as the values are numeric and relative, so 0 may undervalue these players in a model. We also don’t choose the overall mean, as we already have the additional information that these are first-year players, so including ballot ‘veterans’ in the mean could introduce distortion.

# Calculate the mean vote percentage garnered by first year players
mean_value = yearly_hof_voting_df[['year_on_ballot', 'votes_pct']].groupby(by='year_on_ballot').mean().loc[1].votes_pct

# Impute the mean value
yearly_hof_voting_df.ly_votes_pct = yearly_hof_voting_df.ly_votes_pct.fillna(mean_value)

Finally, we once again check the first few rows of the dataset to ensure our changes have taken effect, and print info about the dataset as a whole to confirm the cleaning process is complete, before writing the dataset out to the processed data folder.

yearly_hof_voting_df.head()
name player_id voting_year year_on_ballot votes_pct ly_votes_pct outcome
0 Ty Cobb cobbty01 1936 1 98.2 12.975208 elected
1 Babe Ruth ruthba01 1936 1 95.1 12.975208 elected
2 Honus Wagner wagneho01 1936 1 95.1 12.975208 elected
3 Christy Mathewson mathech01 1936 1 90.7 12.975208 elected
4 Walter Johnson johnswa01 1936 1 83.6 12.975208 elected
yearly_hof_voting_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4900 entries, 0 to 4899
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   name            4900 non-null   object 
 1   player_id       4900 non-null   object 
 2   voting_year     4900 non-null   int64  
 3   year_on_ballot  4900 non-null   int64  
 4   votes_pct       4900 non-null   float64
 5   ly_votes_pct    4900 non-null   float64
 6   outcome         4900 non-null   object 
dtypes: float64(2), int64(2), object(3)
memory usage: 268.1+ KB
# Write DataFrame to csv in processed data folder
yearly_hof_voting_df.to_csv('../../data/processed-data/all_bbwaa_voting.csv', index=False)

Player Names and IDs DataFrame

Next, we inspect the DataFrame containing the general player information.

all_player_info.head()
name id url_suffix
0 David Aardsma aardsda01 /players/a/aardsda01.shtml
1 Henry Aaron aaronha01 /players/a/aaronha01.shtml
2 Tommie Aaron aaronto01 /players/a/aaronto01.shtml
3 Don Aase aasedo01 /players/a/aasedo01.shtml
4 Andy Abad abadan01 /players/a/abadan01.shtml
all_player_info.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23370 entries, 0 to 23369
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   name        23370 non-null  object
 1   id          23370 non-null  object
 2   url_suffix  23370 non-null  object
dtypes: object(3)
memory usage: 547.9+ KB

We see that the DataFrame is already clean and tidy, but we can drop the URL suffix, as we will not need it for further research. We then save the DataFrame to the processed data folder.

# Drop the URL column
all_player_info = all_player_info.drop(columns=['url_suffix'])

# Save the DataFrame
all_player_info.to_csv('../../data/processed-data/all_player_info.csv', index=False)

Career Stats for All Players

Moving on to the career stats for every player throughout history, we begin by dividing the data between batters and pitchers. We accomplish this by filtering based on whether a player meets a threshold for innings pitched and, if not, a threshold for plate appearances. During this process we also drop duplicated column names so that we can combine DataFrames in a future step.

  • If players do not meet either threshold, we simply ignore them, as there is zero possibility they will be eligible for the HOF under BBWAA rules
batter_career_stats_l = []
pitcher_career_stats_l = []

for series in all_player_career_stats:
    if isinstance(series, pd.DataFrame):
        # Check if the player pitched more than 150 innings
        if 'p_ip' in series.columns and float(series.p_ip.T.iloc[0,0]) > 150 and series['player_id'].iloc[0] != 'ruthba01': # We also make Babe Ruth a batter
            duplicated_columns = series.columns.duplicated()
            pitcher_career_stats_l.append(series.loc[:, ~duplicated_columns])

        # Check if the batter had more than 850 plate appearances
        elif 'b_pa' in series.columns and float(series.b_pa.T.iloc[0,0]) > 850:
            duplicated_columns = series.columns.duplicated()
            batter_career_stats_l.append(series.loc[:, ~duplicated_columns])

# Finally, we concatenate all the individual DataFrames into one
pitcher_career_stats = pd.concat(pitcher_career_stats_l, ignore_index=True)
batter_career_stats = pd.concat(batter_career_stats_l, ignore_index=True)
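
Having split the players, we can quickly confirm the size of each group before inspecting the data:

# How many qualifying batters and pitchers survived the thresholds
print(f"batters: {len(batter_career_stats)}, pitchers: {len(pitcher_career_stats)}")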

Now we inspect the first few rows of both the batting and pitching DataFrames.

batter_career_stats.head()
b_war b_games b_pa b_ab b_r b_h b_doubles b_triples b_hr b_rbi ... p_avg_exit_velo p_hard_hit_perc p_ld_perc p_gb_perc p_fb_perc p_gb_fb_ratio p_wpa_def p_cwpa_def p_baseout_runs p_win_loss_perc
0 143.1 3298 13941 12364 2174 3771 624 98 755 2297 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 -2.8 437 1046 944 102 216 42 6 13 94 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 8.7 855 3479 3044 355 772 99 43 11 324 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 1.8 452 1965 1756 307 493 67 46 19 280 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 0.5 702 2227 2044 273 523 109 23 62 242 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 99 columns

pitcher_career_stats.head()
b_war b_games b_pa b_ab b_r b_h b_doubles b_triples b_hr b_rbi ... p_cwpa_def p_baseout_runs accomplishments player_id b_gb_fb_ratio b_run_scoring_perc b_extra_bases_taken_perc b_stolen_base_perc b_avg_exit_velo b_hard_hit_perc
0 -0.1 139 5 4 0 0 0 0 0 0 ... -0.1% 11.21 aardsda01 NaN NaN NaN NaN NaN NaN
1 -0.1 81 5 5 0 0 0 0 0 0 ... 8.1% 36.38 1x All-Star aasedo01 NaN NaN NaN NaN NaN NaN
2 -0.1 159 9 9 0 1 0 0 0 0 ... -1.3% 10.25 abadfe01 3.00 0.0 NaN NaN NaN NaN
3 -1.0 79 252 225 21 38 3 3 0 17 ... NaN NaN abbeybe01 NaN 35.6 NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 0.7% 18.95 abbotan01 NaN NaN NaN NaN NaN NaN

5 rows × 99 columns

Both DataFrames carry the same large set of 99 columns, so we print the full column list once to determine which features are necessary.

pitcher_career_stats.columns
Index(['b_war', 'b_games', 'b_pa', 'b_ab', 'b_r', 'b_h', 'b_doubles',
       'b_triples', 'b_hr', 'b_rbi', 'b_sb', 'b_cs', 'b_bb', 'b_so',
       'b_batting_avg', 'b_onbase_perc', 'b_slugging_perc',
       'b_onbase_plus_slugging', 'b_onbase_plus_slugging_plus', 'b_roba',
       'b_rbat_plus', 'b_tb', 'b_gidp', 'b_hbp', 'b_sh', 'b_sf', 'b_ibb',
       'pos', 'awards', 'b_batting_avg_bip', 'b_iso_slugging',
       'b_home_run_perc', 'b_strikeout_perc', 'b_base_on_balls_perc',
       'b_ld_perc', 'b_gb_perc', 'b_fb_perc', 'b_pull_perc', 'b_center_perc',
       'b_oppo_perc', 'b_wpa_bat', 'b_cwpa_bat', 'b_baseout_runs', 'p_war',
       'p_w', 'p_l', 'p_win_loss_perc', 'p_earned_run_avg', 'p_g', 'p_gs',
       'p_gf', 'p_cg', 'p_sho', 'p_sv', 'p_ip', 'p_h', 'p_r', 'p_er', 'p_hr',
       'p_bb', 'p_ibb', 'p_so', 'p_hbp', 'p_bk', 'p_wp', 'p_bfp',
       'p_earned_run_avg_plus', 'p_fip', 'p_whip', 'p_hits_per_nine',
       'p_hr_per_nine', 'p_bb_per_nine', 'p_so_per_nine',
       'p_strikeouts_per_base_on_balls', 'p_batting_avg', 'p_onbase_perc',
       'p_slugging_perc', 'p_onbase_plus_slugging', 'p_batting_avg_bip',
       'p_home_run_perc', 'p_strikeout_perc', 'p_base_on_balls_perc',
       'p_avg_exit_velo', 'p_hard_hit_perc', 'p_ld_perc', 'p_gb_perc',
       'p_fb_perc', 'p_gb_fb_ratio', 'p_wpa_def', 'p_cwpa_def',
       'p_baseout_runs', 'accomplishments', 'player_id', 'b_gb_fb_ratio',
       'b_run_scoring_perc', 'b_extra_bases_taken_perc', 'b_stolen_base_perc',
       'b_avg_exit_velo', 'b_hard_hit_perc'],
      dtype='object')

We will not need this many features in our final datasets. Thus, for pitchers, we will use a concise set of pitching metrics alongside a few simple batting metrics. For batters, we will filter down to a more concise set of batting metrics, while leaving out any pitching metrics, because there is effectively no sample of batters who consistently pitch (outside of a few notable exceptions).

Batting Metrics

Metric Description
b_war Wins above Replacement (full definition here)
b_pa Plate Appearances
b_h Hits
b_hr Home Runs
b_sb Stolen Bases
b_bb Walks
b_so Strikeouts
b_batting_avg Batting Average
b_onbase_plus_slugging On Base % plus Slugging % (full definition here)
b_onbase_plus_slugging_plus On Base % plus Slugging % indexed to league average value of 100 (full definition here)
b_home_run_perc Home Run Percent
b_strikeout_perc Strikeout Percent
b_base_on_balls_perc Walk Percent
b_cwpa_bat Championship win probability added (full definition here)
b_baseout_runs Run Expectancy Change (full definition here)
accomplishments Awards won throughout career

Pitching Metrics

Metric Description
p_war Wins above Replacement (full definition here)
p_w Wins
p_win_loss_perc Win percentage
p_earned_run_avg Earned Run Average (earned runs allowed per nine innings pitched)
p_earned_run_avg_plus Earned Run Average indexed to league average value of 100 (full definition here)
p_g Games appeared in
p_gs Games started
p_sho Shutouts
p_sv Saves
p_ip Innings Pitched
p_so Strikeouts
p_whip Walks and Hits per Innings Pitched
p_fip Fielding Independent Pitching (full definition here)
p_strikeouts_per_base_on_balls Ratio of strikeouts to walks
p_batting_avg Batting Average of batters faced
p_onbase_plus_slugging On Base % + Slugging % of batters faced
p_home_run_perc Home Run %
p_strikeout_perc Strikeout %
p_cwpa_def Championship win probability added (full definition here)
p_baseout_runs Run Expectancy Change (full definition here)
b_war Wins above Replacement as a batter (full definition here)
b_batting_avg Batting Average (as a batter)
b_onbase_plus_slugging_plus On Base % plus Slugging % indexed to league average value of 100 (as a batter)
accomplishments Awards won throughout career
# Define the important columns and filter the datasets down
batter_columns = [
    "b_war", "b_pa", "b_h", "b_hr", "b_sb", "b_bb", "b_so", "b_batting_avg",
    "b_onbase_plus_slugging", "b_onbase_plus_slugging_plus", "b_home_run_perc",
    "b_strikeout_perc", "b_base_on_balls_perc", "b_cwpa_bat", "b_baseout_runs",
    "accomplishments", 'player_id'
]

pitcher_columns = [
    "p_war", "p_w", "p_win_loss_perc", "p_earned_run_avg", "p_earned_run_avg_plus",
    "p_g", "p_gs", "p_sho", "p_sv", "p_ip", "p_so", "p_whip", "p_fip",
    "p_strikeouts_per_base_on_balls", "p_batting_avg", "p_onbase_plus_slugging",
    "p_home_run_perc", "p_strikeout_perc", "p_cwpa_def", "p_baseout_runs",
    "accomplishments", 'b_war','b_batting_avg', 'b_onbase_plus_slugging_plus', 'player_id'
]

batter_career_stats = batter_career_stats[batter_columns]
pitcher_career_stats = pitcher_career_stats[pitcher_columns]

Looking through the career stats DataFrames, we still see some missing values present. Specifically, these occur for a handful of metrics, for players who played so long ago that these metrics were not tracked. To solve this problem, we impute with the median of each stat. We choose the median because the dataset consists of all major league players, so the underlying data is likely skewed by outliers from the few excellent players in each statistic.

pitcher_career_stats.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3835 entries, 0 to 3834
Data columns (total 25 columns):
 #   Column                          Non-Null Count  Dtype 
---  ------                          --------------  ----- 
 0   p_war                           3835 non-null   object
 1   p_w                             3835 non-null   object
 2   p_win_loss_perc                 3835 non-null   object
 3   p_earned_run_avg                3835 non-null   object
 4   p_earned_run_avg_plus           3835 non-null   object
 5   p_g                             3835 non-null   object
 6   p_gs                            3835 non-null   object
 7   p_sho                           3835 non-null   object
 8   p_sv                            3835 non-null   object
 9   p_ip                            3835 non-null   object
 10  p_so                            3835 non-null   object
 11  p_whip                          3835 non-null   object
 12  p_fip                           3835 non-null   object
 13  p_strikeouts_per_base_on_balls  3835 non-null   object
 14  p_batting_avg                   3271 non-null   object
 15  p_onbase_plus_slugging          3271 non-null   object
 16  p_home_run_perc                 3835 non-null   object
 17  p_strikeout_perc                3835 non-null   object
 18  p_cwpa_def                      3259 non-null   object
 19  p_baseout_runs                  3259 non-null   object
 20  accomplishments                 3835 non-null   object
 21  b_war                           3750 non-null   object
 22  b_batting_avg                   3605 non-null   object
 23  b_onbase_plus_slugging_plus     3605 non-null   object
 24  player_id                       3835 non-null   object
dtypes: object(25)
memory usage: 749.1+ KB
batter_career_stats.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4425 entries, 0 to 4424
Data columns (total 17 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   b_war                        4425 non-null   object
 1   b_pa                         4425 non-null   object
 2   b_h                          4425 non-null   object
 3   b_hr                         4425 non-null   object
 4   b_sb                         4412 non-null   object
 5   b_bb                         4425 non-null   object
 6   b_so                         4267 non-null   object
 7   b_batting_avg                4425 non-null   object
 8   b_onbase_plus_slugging       4425 non-null   object
 9   b_onbase_plus_slugging_plus  4425 non-null   object
 10  b_home_run_perc              4425 non-null   object
 11  b_strikeout_perc             4267 non-null   object
 12  b_base_on_balls_perc         4425 non-null   object
 13  b_cwpa_bat                   3777 non-null   object
 14  b_baseout_runs               3777 non-null   object
 15  accomplishments              4425 non-null   object
 16  player_id                    4425 non-null   object
dtypes: object(17)
memory usage: 587.8+ KB

To complete the imputing, we must also convert all possible stats to numeric values

# Convert values to numeric
batter_career_stats.b_cwpa_bat = batter_career_stats.b_cwpa_bat.apply(lambda x: x.replace('%', '') if isinstance(x, str) else x)
pitcher_career_stats.p_cwpa_def = pitcher_career_stats.p_cwpa_def.apply(lambda x: x.replace('%', '') if isinstance(x, str) else x)

# Helper to convert a column to numeric where possible, leaving text columns untouched
# (replaces the deprecated errors='ignore' option of pd.to_numeric)
def to_numeric_if_possible(col):
    try:
        return pd.to_numeric(col)
    except (ValueError, TypeError):
        return col

batter_career_stats = batter_career_stats.apply(to_numeric_if_possible)
pitcher_career_stats = pitcher_career_stats.apply(to_numeric_if_possible)

# Impute column medians, ignoring the string columns
imputer = SimpleImputer(strategy='median')
batter_career_stats.loc[:, ~batter_career_stats.columns.isin(['accomplishments', 'player_id'])] = imputer.fit_transform(batter_career_stats.loc[:, ~batter_career_stats.columns.isin(['accomplishments', 'player_id'])])
pitcher_career_stats.loc[:, ~pitcher_career_stats.columns.isin(['accomplishments', 'player_id'])] = imputer.fit_transform(pitcher_career_stats.loc[:, ~pitcher_career_stats.columns.isin(['accomplishments', 'player_id'])])
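
Note that the same median imputation could also be done in pandas directly, without sklearn; a minimal sketch of the equivalent operation:

# Pandas-only equivalent of the median imputation above (shown for reference)
num_cols = batter_career_stats.columns.difference(['accomplishments', 'player_id'])
batter_career_stats[num_cols] = batter_career_stats[num_cols].fillna(batter_career_stats[num_cols].median())
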
pitcher_career_stats.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3835 entries, 0 to 3834
Data columns (total 25 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   p_war                           3835 non-null   float64
 1   p_w                             3835 non-null   int64  
 2   p_win_loss_perc                 3835 non-null   float64
 3   p_earned_run_avg                3835 non-null   float64
 4   p_earned_run_avg_plus           3835 non-null   int64  
 5   p_g                             3835 non-null   int64  
 6   p_gs                            3835 non-null   int64  
 7   p_sho                           3835 non-null   int64  
 8   p_sv                            3835 non-null   int64  
 9   p_ip                            3835 non-null   float64
 10  p_so                            3835 non-null   int64  
 11  p_whip                          3835 non-null   float64
 12  p_fip                           3835 non-null   float64
 13  p_strikeouts_per_base_on_balls  3835 non-null   float64
 14  p_batting_avg                   3835 non-null   float64
 15  p_onbase_plus_slugging          3835 non-null   float64
 16  p_home_run_perc                 3835 non-null   float64
 17  p_strikeout_perc                3835 non-null   float64
 18  p_cwpa_def                      3835 non-null   float64
 19  p_baseout_runs                  3835 non-null   float64
 20  accomplishments                 3835 non-null   object 
 21  b_war                           3835 non-null   float64
 22  b_batting_avg                   3835 non-null   float64
 23  b_onbase_plus_slugging_plus     3835 non-null   float64
 24  player_id                       3835 non-null   object 
dtypes: float64(16), int64(7), object(2)
memory usage: 749.1+ KB

Now that the table is free from missing values, the last step is to account for the awards data, which is currently unstructured text. To do this, we write a function that searches the text for any of the five major awards we want to include, as well as their counts, before adding new columns for the award wins. By looking at awards, we also aim to capture greatness from players who had short stretches of dominance but didn't necessarily play for many years. The major awards are:

  • MVP - Most Valuable Player (Generally a hitter)
  • Cy Young - Best Pitcher
  • Batting Title - Highest Batting Average
  • Gold Glove - Best defensive player at a given position
  • All Star - Being named to the ‘All Star’ team in a given year
def scan_awards(text, award):
    '''Given the text from our award scraping function, and a specific award, return the number of times
       the award was won'''
    # Check if the award was won at all
    win = award.lower() in text.lower()

    # If the award was not won, return 0. Else, return the number of times it was won
    if not win:
        return 0
    else:
        if ',' in text: # The player won multiple awards, and the formatting is different
            try:
                num_wins = int(text.lower().split(award.lower())[0].split(',')[-1].strip().replace('x', ''))
            except (IndexError, ValueError): # Sometimes when winning the award once, the format changes
                return 1
        else:
            try:
                num_wins = int(text.lower().split(award.lower())[0].strip().replace('x', ''))
            except (IndexError, ValueError):
                return 1

        return num_wins
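
Before applying the function across both DataFrames, a quick sanity check on a hypothetical accomplishments string (formatted like the scraped text) confirms the parsing behaves as expected:

# Hypothetical accomplishments text for illustration
sample = '3x MVP, 12x All-Star, 2x Gold Glove'
print(scan_awards(sample, 'MVP'))            # 3
print(scan_awards(sample, 'All-Star'))       # 12
print(scan_awards(sample, 'Batting Title'))  # 0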

# Insert Pitching Awards  
pitcher_career_stats["cy_youngs"] = pitcher_career_stats.accomplishments.apply(lambda x: scan_awards(x, "Cy Young"))
pitcher_career_stats["gold_gloves"] = pitcher_career_stats.accomplishments.apply(lambda x: scan_awards(x, "Gold Glove"))
pitcher_career_stats["mvps"] = pitcher_career_stats.accomplishments.apply(lambda x: scan_awards(x, "MVP"))
pitcher_career_stats["all_stars"] = pitcher_career_stats.accomplishments.apply(lambda x: scan_awards(x, "All-Star"))

# Insert Batting Awards
batter_career_stats["mvps"] = batter_career_stats.accomplishments.apply(lambda x: scan_awards(x, "MVP"))
batter_career_stats["gold_gloves"] = batter_career_stats.accomplishments.apply(lambda x: scan_awards(x, "Gold Glove"))
batter_career_stats["batting_titles"] = batter_career_stats.accomplishments.apply(lambda x: scan_awards(x, "Batting Title"))
batter_career_stats["all_stars"] = batter_career_stats.accomplishments.apply(lambda x: scan_awards(x, "All-Star"))

# Remove the general accomplishments columns from each DataFrame
batter_career_stats = batter_career_stats.drop(columns=['accomplishments'])
pitcher_career_stats = pitcher_career_stats.drop(columns=['accomplishments'])
# View the first few rows of the awards data for batters
batter_career_stats.head()[['player_id', 'mvps', 'gold_gloves', 'batting_titles', 'all_stars']]
player_id mvps gold_gloves batting_titles all_stars
0 aaronha01 1 3 2 25
1 aaronto01 0 0 0 0
2 abbated01 0 0 0 0
3 abbeych01 0 0 0 0
4 abbotku01 0 0 0 0

With the DataFrames completely clean, and the awards dealt with, we save them back to the processed data folder.

batter_career_stats.to_csv('../../data/processed-data/batter_career_stats.csv', index=False)
pitcher_career_stats.to_csv('../../data/processed-data/pitcher_career_stats.csv', index=False)

Appearances by Position DataFrame

Finally, we move on to the appearances-by-position dataset, which we will use to gain a more granular look at the positions each player played throughout their career. Because this is a list of DataFrames, we start by concatenating everything into a single DataFrame.

# pd.concat silently drops the None entries left by failed scrapes
appearances_df = pd.concat(all_player_career_appearences, ignore_index=True)

appearances_df.head()
lg_ID G_all GS G_batting G_defense G_p_app G_c G_1b G_2b G_3b G_ss G_lf_app G_cf_app G_rf_app G_of_app G_dh G_ph G_pr player_id
0 331 0 139 331 331 0 0 0 0 0 0 0 0 0 0 0 0 aardsda01
1 3298 3173 3298 2985 0 0 210 43 7 0 315 308 2174 2760 201 122 1 aaronha01
2 437 206 437 346 0 0 232 7 10 0 135 1 2 137 0 102 35 aaronto01
3 448 91 81 448 448 0 0 0 0 0 0 0 0 0 0 0 0 aasedo01
4 15 4 15 9 0 0 8 0 0 0 0 0 1 1 0 7 1 abadan01

The dataset looks quite clean already! The first few columns tally general appearance types (all games, starts, games batting, and games on defense), the final two before player_id tally games as a pinch hitter and pinch runner, and the G_[position] columns each tally games at a given position. The two steps we take to finish the job are dropping unnecessary columns, and converting the game tallies to numeric values and then to percentage shares.

Additionally, a small subset of players have missing data for their positions. To handle this, we fill the missing values with an equal share for every position column, signaling no relative impact from position.

# Convert values to numeric types
appearances_df = appearances_df.apply(to_numeric_if_possible)

# Define the important columns to keep and normalize
position_cols = ['G_p_app', 'G_c', 'G_1b', 'G_2b', 'G_3b',
                 'G_ss', 'G_lf_app', 'G_cf_app', 'G_rf_app',
                 'G_dh']

# Normalize the games by position into percentages
appearances_df[position_cols] = appearances_df[position_cols].div(appearances_df[position_cols].sum(axis=1), axis=0)

# Filter down to our final DataFrame
appearances_df = appearances_df[position_cols + ['player_id']]

# Fill missing values with an equal share across the ten position columns
appearances_df = appearances_df.fillna(1 / len(position_cols))
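
As a quick check on the normalization, we can look at the distribution of row sums across the position columns, which should sit at (approximately) 1.0 throughout, assuming missing values occurred only as fully missing rows:

# Row sums should all be ~1.0 after normalization and imputation
row_sums = appearances_df[position_cols].sum(axis=1)
print(row_sums.describe())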

Looking at the final DataFrame, we see that it is now clean and can be written to the processed data folder.

appearances_df.head()
G_p_app G_c G_1b G_2b G_3b G_ss G_lf_app G_cf_app G_rf_app G_dh player_id
0 1.0 0.0 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.000000 0.000000 aardsda01
1 0.0 0.0 0.064457 0.013198 0.002149 0.0 0.096685 0.094537 0.667281 0.061694 aaronha01
2 0.0 0.0 0.599483 0.018088 0.025840 0.0 0.348837 0.002584 0.005168 0.000000 aaronto01
3 1.0 0.0 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.000000 0.000000 aasedo01
4 0.0 0.0 0.888889 0.000000 0.000000 0.0 0.000000 0.000000 0.111111 0.000000 abadan01
appearances_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21228 entries, 0 to 21227
Data columns (total 11 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   G_p_app    21228 non-null  float64
 1   G_c        21228 non-null  float64
 2   G_1b       21228 non-null  float64
 3   G_2b       21228 non-null  float64
 4   G_3b       21228 non-null  float64
 5   G_ss       21228 non-null  float64
 6   G_lf_app   21228 non-null  float64
 7   G_cf_app   21228 non-null  float64
 8   G_rf_app   21228 non-null  float64
 9   G_dh       21228 non-null  float64
 10  player_id  21228 non-null  object 
dtypes: float64(10), object(1)
memory usage: 1.8+ MB
appearances_df.to_csv('../../data/processed-data/appearances.csv', index=False)

Final Dataset Building

Now that all of the individual DataFrames are cleaned (woohoo!), we move on to condensing the data into one final DataFrame for batters and one for pitchers, which we will use for future exploration and analysis. To accomplish this, we take the BBWAA voting DataFrame as our base, as it already contains only those players who have appeared on a BBWAA ballot, and then we attach extra features, like player stats, from our adjacent DataFrames.

# Read in the core DataFrames
final_dataset = pd.read_csv('../../data/processed-data/all_bbwaa_voting.csv')

# Read in the stats DataFrames
batter_stats = pd.read_csv('../../data/processed-data/batter_career_stats.csv')
pitcher_stats = pd.read_csv('../../data/processed-data/pitcher_career_stats.csv')
def sort_row_by_bp(player_id, batter_stats, pitcher_stats):
    '''Label a player as a pitcher or batter based on which stats DataFrame contains their ID'''
    if player_id in pitcher_stats.player_id.values:
        return 'pitcher'
    elif player_id in batter_stats.player_id.values:
        return 'batter'
    else:
        return 'missing'

# Attach the batter/pitcher designation for each entry and check the results
final_dataset['position'] = final_dataset.player_id.apply(lambda x: sort_row_by_bp(x, batter_stats, pitcher_stats))
final_dataset.position.value_counts()
position
batter     3144
missing    1226
pitcher     530
Name: count, dtype: int64

Upon investigating the players with ‘missing’ values for their position, we can confirm that the vast majority are ‘missing’ because they were among the group that failed to be scraped in the data collection phase due to changes in the underlying HTML. Below, we print the only ‘missing’ players whose data was successfully scraped.

for player in final_dataset[final_dataset.position == 'missing'].player_id:
    if player == 'mccarjo99': # skip; this ID differs across datasets and would raise a KeyError
        continue
    val = all_player_stats_dict[player]
    if not isinstance(val, str):
        print(player)
rickebr01
rickebr01
rickebr01
rickebr01
ensje01
beaucji01

There are only four such players who did successfully have data scraped (the three unique IDs printed above, plus the skipped mccarjo99). However, a quick search reveals that these players were actually primarily managers or executives, who played early enough that the BBWAA eligibility rules were different, such that they were voted on from this pool. Because of this, we drop all players who are ‘missing’ in this dataset.

final_dataset = final_dataset[final_dataset.position != 'missing']
final_dataset.head()
name player_id voting_year year_on_ballot votes_pct ly_votes_pct outcome position
0 Ty Cobb cobbty01 1936 1 98.2 12.975208 elected batter
1 Babe Ruth ruthba01 1936 1 95.1 12.975208 elected batter
2 Honus Wagner wagneho01 1936 1 95.1 12.975208 elected batter
5 Nap Lajoie lajoina01 1936 1 64.6 12.975208 limbo batter
6 Tris Speaker speaktr01 1936 1 58.8 12.975208 limbo batter

Now that we are sure that all dataset entries have a valid connection to the stats data, we split the data into batters and pitchers before attaching the career stats.

# Use .copy() so later column assignments don't trigger SettingWithCopyWarning
pitcher_df = final_dataset[final_dataset.position == 'pitcher'].copy()
batter_df = final_dataset[final_dataset.position == 'batter'].copy()

Before we merge the stats with the BBWAA voting data, we will also add in the scandal data as set forth in the earlier FanGraphs paper. We do this with a manual list of known scandal-tainted players taken from that paper, filtered down to the players in our dataset. Reasons for scandals include:

  • Use of Performance Enhancing Drugs
  • Gambling
  • Conduct detrimental to the game and personal image

After adding in the presence of a scandal, we remove the players with scandals, as the presence of a scandal essentially interferes with the normal voting process and, critically, is a factor known before the voting itself. Thus, for future data we would also be able to exclude these players from predictions.

# Define the ids with scandals
scandal_ids = [
    'bondsba01', 'rodrial01',
    'pettian01', 'ortizda01',
    'schilcu01', 'dykstle01',
    'ramirma02', 'mcgwima01',
    'rosepe01']

# Create a scandal column in both the pitching and batting DataFrames
batter_df['scandal'] = batter_df.player_id.apply(lambda x: 1 if x in scandal_ids else 0)
pitcher_df['scandal'] = pitcher_df.player_id.apply(lambda x: 1 if x in scandal_ids else 0)

# Remove players with scandals
batter_df = batter_df[batter_df.scandal == 0]
pitcher_df = pitcher_df[pitcher_df.scandal == 0]
batter_df = batter_df.merge(batter_stats, on='player_id', how='left')
pitcher_df = pitcher_df.merge(pitcher_stats, on='player_id', how='left')

Next, we attach the appearances by position data with another merge

batter_df = batter_df.merge(appearances_df, on='player_id', how='left')
pitcher_df = pitcher_df.merge(appearances_df, on='player_id', how='left')
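
Because these are left merges, any ID that fails to match would surface as missing values; a quick check we might run:

# Count missing values per column after the merges; nonzero counts flag unmatched IDs
print(batter_df.isna().sum().loc[lambda s: s > 0])
print(pitcher_df.isna().sum().loc[lambda s: s > 0])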

One other thing we must deal with is the fact that some advanced metrics, like championship win probability added, do not appear in the data until 1912, so players whose careers straddle 1912 have depressed totals due to the cumulative nature of these metrics. To counter this, we override the values of these columns for the earliest years of voting, through 1945, with the mean among those early players.

# Confirm the full column list after the merges
pitcher_df.columns
Index(['name', 'player_id', 'voting_year', 'year_on_ballot', 'votes_pct',
       'ly_votes_pct', 'outcome', 'position', 'scandal', 'p_war', 'p_w',
       'p_win_loss_perc', 'p_earned_run_avg', 'p_earned_run_avg_plus', 'p_g',
       'p_gs', 'p_sho', 'p_sv', 'p_ip', 'p_so', 'p_whip', 'p_fip',
       'p_strikeouts_per_base_on_balls', 'p_batting_avg',
       'p_onbase_plus_slugging', 'p_home_run_perc', 'p_strikeout_perc',
       'p_cwpa_def', 'p_baseout_runs', 'b_war', 'b_batting_avg',
       'b_onbase_plus_slugging_plus', 'cy_youngs', 'gold_gloves', 'mvps',
       'all_stars', 'G_p_app', 'G_c', 'G_1b', 'G_2b', 'G_3b', 'G_ss',
       'G_lf_app', 'G_cf_app', 'G_rf_app', 'G_dh'],
      dtype='object')
# Update some of the advanced stats for very early players
early_batters = batter_df[batter_df.voting_year <= 1945]
early_cwpa =  early_batters.b_cwpa_bat.mean()
early_re24 = early_batters.b_baseout_runs.mean()

batter_df.b_cwpa_bat = batter_df.apply(lambda x: early_cwpa if x.voting_year <= 1945 else x.b_cwpa_bat, axis=1)
batter_df.b_baseout_runs = batter_df.apply(lambda x: early_re24 if x.voting_year <= 1945 else x.b_baseout_runs, axis=1)

# And do the same for the pitchers DataFrame
early_pitchers = pitcher_df[pitcher_df.voting_year <= 1945]
early_cwpa =  early_pitchers.p_cwpa_def.mean()
early_re24 = early_pitchers.p_baseout_runs.mean()

pitcher_df.p_cwpa_def = pitcher_df.apply(lambda x: early_cwpa if x.voting_year <= 1945 else x.p_cwpa_def, axis=1)
pitcher_df.p_baseout_runs = pitcher_df.apply(lambda x: early_re24 if x.voting_year <= 1945 else x.p_baseout_runs, axis=1)
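
As an aside, the same override can be expressed with a boolean mask, avoiding the row-wise apply; a sketch for the pitchers DataFrame:

# Equivalent vectorized form of the override above
early_mask = pitcher_df.voting_year <= 1945
pitcher_df.loc[early_mask, 'p_cwpa_def'] = early_cwpa
pitcher_df.loc[early_mask, 'p_baseout_runs'] = early_re24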

Finally, we write out our final datasets to the processed data folder!

batter_df.to_csv('../../data/processed-data/final_batter_df.csv', index=False)
pitcher_df.to_csv('../../data/processed-data/final_pitcher_df.csv', index=False)

Now that all of our data is cleaned, we move forward to exploring and analyzing our datasets in depth! This work can be found in the Exploratory Data Analysis tab.