# General Imports
import pandas as pd
import pickle as pkl
from sklearn.impute import SimpleImputer
Data Cleaning
Summary
Like any project, cleaning the raw data collected in the prior step is critical to ensuring we have a dataset that can be explored and modeled to extract insights. In this section, we read in the raw scraped data and process it in the following steps:
- Reading in all the scraped data
- Looking carefully through each dataset and:
- Ensuring data is read in properly
- Checking for and potentially removing missing values
- Dropping unnecessary columns that do not relate to the research questions
- Creating new features for analysis
- Merging the datasets into one final dataset each for batters and pitchers
Code
Reading in Raw Data
# Read in our datasets individually
all_inductees_df = pd.read_csv('../../data/raw-data/all_hof_inductees_table.csv')
yearly_hof_voting_df = pd.read_csv('../../data/raw-data/yearly_hof_voting_data.csv')
all_player_info = pd.read_csv('../../data/raw-data/all_player_info.csv')
with open('../../data/raw-data/all_player_stats.pkl', 'rb') as fpath:
    all_player_stats_dict = pkl.load(fpath)
Because we scraped our player stats into a dictionary with keys for each player, we next split this dictionary into DataFrames for each key type - Career Stats, Annual Stats, and Appearances by Position. While doing this, we also attach the ID from the dictionary key to the underlying DataFrame so that we don't lose the connection.
# Define the list of all player IDs scraped
ids = list(all_player_stats_dict.keys())
for id in ids:
    if isinstance(all_player_stats_dict[id], dict):
        all_player_stats_dict[id].get('career_stats')['player_id'] = id
        all_player_stats_dict[id].get('annual_stats')['player_id'] = id
        all_player_stats_dict[id].get('appearances')['player_id'] = id
# Create a list of each DataFrame stored within each ID, accounting for the occurrence of non-dictionaries when scraping fails
all_player_career_stats = [all_player_stats_dict[player_id].get('career_stats') if isinstance(all_player_stats_dict[player_id], dict) else None for player_id in ids]
all_player_annual_stats = [all_player_stats_dict[player_id].get('annual_stats') if isinstance(all_player_stats_dict[player_id], dict) else None for player_id in ids]
all_player_career_appearences = [all_player_stats_dict[player_id].get('appearances') if isinstance(all_player_stats_dict[player_id], dict) else None for player_id in ids]
Cleaning Data
All Inducted Players DataFrame
Working through our DataFrames, we start with the one holding all successfully inducted players, taking a glance at the first few rows of the data.
all_inductees_df.head()

|   | voting_year | player | living_status | voting_body | inducted_as | votes | vote_percentage |
|---|---|---|---|---|---|---|---|
| 0 | 2024 | Adrian Beltré | 1979-Living | BBWAA | Player | 366.0 | 95.1% |
| 1 | 2024 | Todd Helton | 1973-Living | BBWAA | Player | 307.0 | 79.7% |
| 2 | 2024 | Jim Leyland | 1944-Living | Contemporary Baseball Era Non-Players | Manager | NaN | NaN |
| 3 | 2024 | Joe Mauer | 1983-Living | BBWAA | Player | 293.0 | 76.1% |
| 4 | 2023 | Fred McGriff | 1963-Living | Contemporary Baseball Era | Player | 16.0 | 100.0% |
We observe a few necessary fixes for the table, including:
- Player names not properly encoded
- Voting bodies other than BBWAA
- Non-Players included via non-BBWAA methods
- NaN values for vote tallies for non-BBWAA methods
Additionally, there are some cosmetic alterations we can make, including:
- Removing the living status, which we will not use for further analysis
### Clean the all_inductees_df DataFrame
# Define a function for encoding the player names
def decode_player_name(name):
    encoded_name = name.encode('latin-1')
    corrected_name = encoded_name.decode('utf-8')
    return corrected_name.strip()
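# Illustrative (hypothetical) example: a UTF-8 'é' read as latin-1 appears
# as 'Ã©'; round-tripping the bytes repairs it:
# decode_player_name('Adrian BeltrÃ©')  ->  'Adrian Beltré'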
# Encode the player names
all_inductees_df.player = all_inductees_df.player.apply(decode_player_name)
# Filter to only BBWAA votes
all_inductees_df = all_inductees_df[all_inductees_df.voting_body == 'BBWAA']
# Drop the living status column
all_inductees_df = all_inductees_df.drop(columns=['living_status'])
Looking at the table again, all previous concerns are now resolved:
all_inductees_df.head()

|   | voting_year | player | voting_body | inducted_as | votes | vote_percentage |
|---|---|---|---|---|---|---|
| 0 | 2024 | Adrian Beltré | BBWAA | Player | 366.0 | 95.1% |
| 1 | 2024 | Todd Helton | BBWAA | Player | 307.0 | 79.7% |
| 3 | 2024 | Joe Mauer | BBWAA | Player | 293.0 | 76.1% |
| 5 | 2023 | Scott Rolen | BBWAA | Player | 297.0 | 76.3% |
| 12 | 2022 | David Ortiz | BBWAA | Player | 307.0 | 77.9% |
Next, we confirm the full table is clean, with no missing values, and all the columns we desire.
all_inductees_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 134 entries, 0 to 345
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 voting_year 134 non-null int64
1 player 134 non-null object
2 voting_body 134 non-null object
3 inducted_as 134 non-null object
4 votes 134 non-null float64
5 vote_percentage 134 non-null object
dtypes: float64(1), int64(1), object(4)
memory usage: 7.3+ KB
We also observe that the player names are now properly encoded, resulting in a cleaned DataFrame. Thus, we save the data back to the processed data folder.
all_inductees_df.player.iloc[1:25]

1 Todd Helton
3 Joe Mauer
5 Scott Rolen
12 David Ortiz
13 Derek Jeter
16 Larry Walker
18 Roy Halladay
19 Edgar Martínez
20 Mike Mussina
21 Mariano Rivera
23 Vladimir Guerrero
24 Trevor Hoffman
25 Chipper Jones
27 Jim Thome
29 Jeff Bagwell
30 Tim Raines
31 Iván Rodríguez
34 Ken Griffey Jr.
35 Mike Piazza
36 Craig Biggio
37 Randy Johnson
38 Pedro Martinez
39 John Smoltz
41 Tom Glavine
Name: player, dtype: object
all_inductees_df.to_csv('../../data/processed-data/all_hof_inductees.csv', index=False)
Yearly BBWAA HOF Ballot Voting DataFrame
The next DataFrame contains the yearly voting results of the BBWAA ballot. We again print out the first few rows to observe the table structure.
yearly_hof_voting_df.head()

|   | rank | name | player_page_url | year_on_ballot | votes | votes_pct | hof_monitor | hof_standard | experience | WAR_career | ... | G_p | GS | SV | IP | H_p | HR_p | BB_p | SO_p | pos_summary | voting_year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Ty Cobb | https://www.baseball-reference.com/players/c/c... | 1st | 222 | 98.2% | 445.0 | 75.0 | 24.0 | 151.4 | ... | 3.0 | 0.0 | 1.0 | 5.0 | 6.0 | 0.0 | 2.0 | 0.0 | *8*9H7/3145 | 1936 |
| 1 | 2 | Babe Ruth | https://www.baseball-reference.com/players/r/r... | 1st | 215 | 95.1% | 411.0 | 79.0 | 22.0 | 162.2 | ... | 163.0 | 147.0 | 4.0 | 1221.1 | 974.0 | 10.0 | 441.0 | 488.0 | *9*71H83 | 1936 |
| 2 | 3 | Honus Wagner | https://www.baseball-reference.com/players/w/w... | 1st | 215 | 95.1% | 313.0 | 75.0 | 21.0 | 131.0 | ... | 2.0 | 0.0 | 0.0 | 8.1 | 7.0 | 0.0 | 6.0 | 6.0 | *6*O3594H7/81 | 1936 |
| 3 | 4 | Christy Mathewson | https://www.baseball-reference.com/players/m/m... | 1st | 205 | 90.7% | 303.0 | 84.0 | 17.0 | 106.7 | ... | 636.0 | 552.0 | 30.0 | 4788.2 | 4219.0 | 89.0 | 848.0 | 2507.0 | *1/H397 | 1936 |
| 4 | 5 | Walter Johnson | https://www.baseball-reference.com/players/j/j... | 1st | 189 | 83.6% | 364.0 | 82.0 | 21.0 | 166.9 | ... | 802.0 | 666.0 | 34.0 | 5914.1 | 4913.0 | 97.0 | 1363.0 | 3509.0 | *1H/897 | 1936 |
5 rows × 41 columns
Again, there are some immediate fixes we must complete. These include:
- Correctly encoding player names
- Converting the year on ballot and vote percentage columns to numeric values
As well as some cosmetic edits:
- Dropping all the stats columns, plus rank, url, and raw vote tallies
- We already have all the stats we will use in the stats DataFrames
For this DataFrame, however, there is also some feature engineering we complete:
- Building a column for player ID, taken from the scraped URL
- Building a column for the player's vote percentage in the year prior
- Again, this leverages a single-step Markov assumption: the current vote % is influenced by (and only by) last year's vote percentage.
- Creating an 'outcome' column, based on whether the player was elected, expired, or remained in limbo
# Decode the player names
yearly_hof_voting_df.name = yearly_hof_voting_df.name.apply(decode_player_name)
# Convert the year on ballot and vote percent columns to numeric values
# (stripping the ordinal suffix so multi-digit years like '10th' parse correctly)
yearly_hof_voting_df.year_on_ballot = yearly_hof_voting_df.year_on_ballot.apply(lambda x: int(''.join(ch for ch in x if ch.isdigit())))
yearly_hof_voting_df.votes_pct = yearly_hof_voting_df.votes_pct.apply(lambda x: float(x[:-1]))
# Build the ID column
yearly_hof_voting_df['player_id'] = yearly_hof_voting_df['player_page_url'].apply(lambda x: x.split('/')[-1].split('.')[0])
# Build the prior year vote percent column
# For players who are on the ballot for the first year, we insert None, and will later impute a value
yearly_hof_voting_df['ly_votes_pct'] = yearly_hof_voting_df.apply(lambda x: yearly_hof_voting_df[(yearly_hof_voting_df.player_id == x.player_id) & (yearly_hof_voting_df.year_on_ballot == x.year_on_ballot - 1)].votes_pct.iloc[0] if x.year_on_ballot > 1 else None, axis=1)
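# Note: the lookup above rescans the full DataFrame once per row. An
# equivalent vectorized sketch (assuming one row per player per ballot year):
# yearly_hof_voting_df = yearly_hof_voting_df.sort_values(['player_id', 'year_on_ballot'])
# yearly_hof_voting_df['ly_votes_pct'] = yearly_hof_voting_df.groupby('player_id')['votes_pct'].shift(1)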
# Create outcome column
def return_ballot_outcome(year_on_ballot, current_vote):
    '''Returns the outcome for a player in the BBWAA vote, depending on their vote share and number
    of previous years on the ballot'''
    if current_vote >= 75:
        return 'elected'
    elif current_vote < 5:
        return 'eliminated'
    elif current_vote < 75 and year_on_ballot == 9:
        return 'expired'
    else:
        return 'limbo'
yearly_hof_voting_df['outcome'] = yearly_hof_voting_df.apply(lambda x: return_ballot_outcome(x.year_on_ballot, x.votes_pct), axis=1)
# Drop unnecessary columns
yearly_hof_voting_df = yearly_hof_voting_df[['name', 'player_id', 'voting_year', 'year_on_ballot', 'votes_pct', 'ly_votes_pct', 'outcome']]
Because we've filled in the prior year's vote percentage as None for players in their first year on the ballot, we now must impute, or fill in, those missing values. To do this, we impute with the average vote percentage of first-year players. We don't impute with 0, as the values are numeric and relative, so 0 may undervalue these players in a model. We also don't choose the overall mean value, as we already have the additional information that these are first-year players, so we may introduce falsehoods by including ballot 'veterans' in the mean.
# Calculate the mean vote percentage garnered by first year players
mean_value = yearly_hof_voting_df[['year_on_ballot', 'votes_pct']].groupby(by='year_on_ballot').mean().loc[1].votes_pct
# Impute the mean value
yearly_hof_voting_df.ly_votes_pct = yearly_hof_voting_df.ly_votes_pct.fillna(mean_value)
Finally, we once again check the first few rows of the dataset to ensure our changes have taken effect, and print info about the dataset as a whole to confirm the cleaning process is complete, before writing the dataset out to the processed data folder.
yearly_hof_voting_df.head()

|   | name | player_id | voting_year | year_on_ballot | votes_pct | ly_votes_pct | outcome |
|---|---|---|---|---|---|---|---|
| 0 | Ty Cobb | cobbty01 | 1936 | 1 | 98.2 | 12.975208 | elected |
| 1 | Babe Ruth | ruthba01 | 1936 | 1 | 95.1 | 12.975208 | elected |
| 2 | Honus Wagner | wagneho01 | 1936 | 1 | 95.1 | 12.975208 | elected |
| 3 | Christy Mathewson | mathech01 | 1936 | 1 | 90.7 | 12.975208 | elected |
| 4 | Walter Johnson | johnswa01 | 1936 | 1 | 83.6 | 12.975208 | elected |
yearly_hof_voting_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4900 entries, 0 to 4899
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 name 4900 non-null object
1 player_id 4900 non-null object
2 voting_year 4900 non-null int64
3 year_on_ballot 4900 non-null int64
4 votes_pct 4900 non-null float64
5 ly_votes_pct 4900 non-null float64
6 outcome 4900 non-null object
dtypes: float64(2), int64(2), object(3)
memory usage: 268.1+ KB
# Write DataFrame to csv in processed data folder
yearly_hof_voting_df.to_csv('../../data/processed-data/all_bbwaa_voting.csv', index=False)
Player Names and IDs DataFrame
Next, we inspect the DataFrame containing the general player information.
all_player_info.head()

|   | name | id | url_suffix |
|---|---|---|---|
| 0 | David Aardsma | aardsda01 | /players/a/aardsda01.shtml |
| 1 | Henry Aaron | aaronha01 | /players/a/aaronha01.shtml |
| 2 | Tommie Aaron | aaronto01 | /players/a/aaronto01.shtml |
| 3 | Don Aase | aasedo01 | /players/a/aasedo01.shtml |
| 4 | Andy Abad | abadan01 | /players/a/abadan01.shtml |
all_player_info.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23370 entries, 0 to 23369
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 name 23370 non-null object
1 id 23370 non-null object
2 url_suffix 23370 non-null object
dtypes: object(3)
memory usage: 547.9+ KB
We see that the DataFrame is already clean and tidy, but we can drop the URL suffix, as we will not need it for further research. We then save the DataFrame to the processed data folder.
# Drop the URL column
all_player_info = all_player_info.drop(columns=['url_suffix'])
# Save the DataFrame
all_player_info.to_csv('../../data/processed-data/all_player_info.csv', index=False)
Career Stats for All Players
Moving on to the career stats for every player throughout history, we begin by dividing the data between batters and pitchers. We accomplish this by filtering DataFrames based on whether the player meets a threshold for innings pitched and, if not, a threshold for plate appearances. During this process we also drop duplicated column names so that we can combine DataFrames in a future step.
- If players do not meet either threshold, we simply ignore them, as there is zero possibility they will be eligible for the HOF under BBWAA rules
batter_career_stats_l = []
pitcher_career_stats_l = []
for series in all_player_career_stats:
    if isinstance(series, pd.DataFrame):
        # Check if the player pitched more than 150 innings
        if 'p_ip' in series.columns and float(series.p_ip.T.iloc[0,0]) > 150 and series['player_id'].iloc[0] != 'ruthba01': # We also make Babe Ruth a batter
            duplicated_columns = series.columns.duplicated()
            pitcher_career_stats_l.append(series.loc[:, ~duplicated_columns])
        # Check if the batter had more than 850 plate appearances
        elif 'b_pa' in series.columns and float(series.b_pa.T.iloc[0,0]) > 850:
            duplicated_columns = series.columns.duplicated()
            batter_career_stats_l.append(series.loc[:, ~duplicated_columns])
# Finally, we concatenate all the individual DataFrames into one
pitcher_career_stats = pd.concat(pitcher_career_stats_l, ignore_index=True)
batter_career_stats = pd.concat(batter_career_stats_l, ignore_index=True)
Now we inspect the first few rows of both the batting and pitching DataFrames.
batter_career_stats.head()

|   | b_war | b_games | b_pa | b_ab | b_r | b_h | b_doubles | b_triples | b_hr | b_rbi | ... | p_avg_exit_velo | p_hard_hit_perc | p_ld_perc | p_gb_perc | p_fb_perc | p_gb_fb_ratio | p_wpa_def | p_cwpa_def | p_baseout_runs | p_win_loss_perc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 143.1 | 3298 | 13941 | 12364 | 2174 | 3771 | 624 | 98 | 755 | 2297 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | -2.8 | 437 | 1046 | 944 | 102 | 216 | 42 | 6 | 13 | 94 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 8.7 | 855 | 3479 | 3044 | 355 | 772 | 99 | 43 | 11 | 324 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 1.8 | 452 | 1965 | 1756 | 307 | 493 | 67 | 46 | 19 | 280 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 0.5 | 702 | 2227 | 2044 | 273 | 523 | 109 | 23 | 62 | 242 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 99 columns
pitcher_career_stats.head()

|   | b_war | b_games | b_pa | b_ab | b_r | b_h | b_doubles | b_triples | b_hr | b_rbi | ... | p_cwpa_def | p_baseout_runs | accomplishments | player_id | b_gb_fb_ratio | b_run_scoring_perc | b_extra_bases_taken_perc | b_stolen_base_perc | b_avg_exit_velo | b_hard_hit_perc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.1 | 139 | 5 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | ... | -0.1% | 11.21 | aardsda01 | NaN | NaN | NaN | NaN | NaN | NaN | |
| 1 | -0.1 | 81 | 5 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 8.1% | 36.38 | 1x All-Star | aasedo01 | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | -0.1 | 159 | 9 | 9 | 0 | 1 | 0 | 0 | 0 | 0 | ... | -1.3% | 10.25 | abadfe01 | 3.00 | 0.0 | NaN | NaN | NaN | NaN | |
| 3 | -1.0 | 79 | 252 | 225 | 21 | 38 | 3 | 3 | 0 | 17 | ... | NaN | NaN | abbeybe01 | NaN | 35.6 | NaN | NaN | NaN | NaN | |
| 4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 0.7% | 18.95 | abbotan01 | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 99 columns
Both DataFrames have a large number of features, so it is necessary to print out the full column list and determine which features are necessary.
pitcher_career_stats.columns

Index(['b_war', 'b_games', 'b_pa', 'b_ab', 'b_r', 'b_h', 'b_doubles',
'b_triples', 'b_hr', 'b_rbi', 'b_sb', 'b_cs', 'b_bb', 'b_so',
'b_batting_avg', 'b_onbase_perc', 'b_slugging_perc',
'b_onbase_plus_slugging', 'b_onbase_plus_slugging_plus', 'b_roba',
'b_rbat_plus', 'b_tb', 'b_gidp', 'b_hbp', 'b_sh', 'b_sf', 'b_ibb',
'pos', 'awards', 'b_batting_avg_bip', 'b_iso_slugging',
'b_home_run_perc', 'b_strikeout_perc', 'b_base_on_balls_perc',
'b_ld_perc', 'b_gb_perc', 'b_fb_perc', 'b_pull_perc', 'b_center_perc',
'b_oppo_perc', 'b_wpa_bat', 'b_cwpa_bat', 'b_baseout_runs', 'p_war',
'p_w', 'p_l', 'p_win_loss_perc', 'p_earned_run_avg', 'p_g', 'p_gs',
'p_gf', 'p_cg', 'p_sho', 'p_sv', 'p_ip', 'p_h', 'p_r', 'p_er', 'p_hr',
'p_bb', 'p_ibb', 'p_so', 'p_hbp', 'p_bk', 'p_wp', 'p_bfp',
'p_earned_run_avg_plus', 'p_fip', 'p_whip', 'p_hits_per_nine',
'p_hr_per_nine', 'p_bb_per_nine', 'p_so_per_nine',
'p_strikeouts_per_base_on_balls', 'p_batting_avg', 'p_onbase_perc',
'p_slugging_perc', 'p_onbase_plus_slugging', 'p_batting_avg_bip',
'p_home_run_perc', 'p_strikeout_perc', 'p_base_on_balls_perc',
'p_avg_exit_velo', 'p_hard_hit_perc', 'p_ld_perc', 'p_gb_perc',
'p_fb_perc', 'p_gb_fb_ratio', 'p_wpa_def', 'p_cwpa_def',
'p_baseout_runs', 'accomplishments', 'player_id', 'b_gb_fb_ratio',
'b_run_scoring_perc', 'b_extra_bases_taken_perc', 'b_stolen_base_perc',
'b_avg_exit_velo', 'b_hard_hit_perc'],
dtype='object')
batter_career_stats.columns

Index(['b_war', 'b_games', 'b_pa', 'b_ab', 'b_r', 'b_h', 'b_doubles',
'b_triples', 'b_hr', 'b_rbi', 'b_sb', 'b_cs', 'b_bb', 'b_so',
'b_batting_avg', 'b_onbase_perc', 'b_slugging_perc',
'b_onbase_plus_slugging', 'b_onbase_plus_slugging_plus', 'b_roba',
'b_rbat_plus', 'b_tb', 'b_gidp', 'b_hbp', 'b_sh', 'b_sf', 'b_ibb',
'pos', 'awards', 'b_batting_avg_bip', 'b_iso_slugging',
'b_home_run_perc', 'b_strikeout_perc', 'b_base_on_balls_perc',
'b_ld_perc', 'b_gb_perc', 'b_fb_perc', 'b_pull_perc', 'b_center_perc',
'b_oppo_perc', 'b_wpa_bat', 'b_cwpa_bat', 'b_baseout_runs', 'p_war',
'p_w', 'p_l', 'p_win_loss_perc', 'p_earned_run_avg', 'p_g', 'p_gs',
'p_gf', 'p_cg', 'p_sho', 'p_sv', 'p_ip', 'p_h', 'p_r', 'p_er', 'p_hr',
'p_bb', 'p_ibb', 'p_so', 'p_hbp', 'p_bk', 'p_wp', 'p_bfp',
'p_earned_run_avg_plus', 'p_fip', 'p_whip', 'p_hits_per_nine',
'p_hr_per_nine', 'p_bb_per_nine', 'p_so_per_nine',
'p_strikeouts_per_base_on_balls', 'p_batting_avg', 'p_onbase_perc',
'p_slugging_perc', 'p_onbase_plus_slugging', 'p_batting_avg_bip',
'p_home_run_perc', 'p_strikeout_perc', 'p_base_on_balls_perc',
'p_avg_exit_velo', 'p_hard_hit_perc', 'p_ld_perc', 'p_gb_perc',
'p_fb_perc', 'p_gb_fb_ratio', 'p_wpa_def', 'p_cwpa_def',
'p_baseout_runs', 'accomplishments', 'player_id', 'b_gb_fb_ratio',
'b_run_scoring_perc', 'b_extra_bases_taken_perc', 'b_stolen_base_perc',
'b_avg_exit_velo', 'b_hard_hit_perc'],
dtype='object')
We will not need this many features in our final datasets. Thus, for pitchers, we will use a concise set of pitching metrics alongside a few simple batting metrics. For batters, we will filter down to a more concise set of batting metrics while leaving out any pitching metrics, because there is effectively no sample of batters who consistently pitch (outside of a few notable exceptions).
Batting Metrics
| Metric | Description |
|---|---|
| b_war | Wins above Replacement (full definition here) |
| b_pa | Plate Appearances |
| b_h | Hits |
| b_hr | Home Runs |
| b_sb | Stolen Bases |
| b_bb | Walks |
| b_so | Strikeouts |
| b_batting_avg | Batting Average |
| b_onbase_plus_slugging | On Base % plus Slugging % (full definition here) |
| b_onbase_plus_slugging_plus | On Base % plus Slugging % indexed to league average value of 100 (full definition here) |
| b_home_run_perc | Home Run Percent |
| b_strikeout_perc | Strikeout Percent |
| b_base_on_balls_perc | Walk Percent |
| b_cwpa_bat | Championship win probability added (full definition here) |
| b_baseout_runs | Run Expectancy Change (full definition here) |
| accomplishments | Awards won throughout career |
Pitching Metrics
| Metric | Description |
|---|---|
| p_war | Wins above Replacement (full definition here) |
| p_w | Wins |
| p_win_loss_perc | Win percentage |
| p_earned_run_avg | Earned Run Average (earned runs allowed per nine innings pitched) |
| p_earned_run_avg_plus | Earned Run Average indexed to league average value of 100 (full definition here) |
| p_g | Games appeared in |
| p_gs | Games started |
| p_sho | Shutouts |
| p_sv | Saves |
| p_ip | Innings Pitched |
| p_so | Strikeouts |
| p_whip | Walks and Hits per Inning Pitched |
| p_fip | Fielding Independent Pitching (full definition here) |
| p_strikeouts_per_base_on_balls | Ratio of strikeouts to walks |
| p_batting_avg | Batting Average of batters faced |
| p_onbase_plus_slugging | On Base % + Slugging % of batters faced |
| p_home_run_perc | Home Run % |
| p_strikeout_perc | Strikeout % |
| p_cwpa_def | Championship win probability added (full definition here) |
| p_baseout_runs | Run Expectancy Change (full definition here) |
| b_war | Wins above Replacement (as a batter) (full definition here) |
| b_batting_avg | Batting Average (as a batter) |
| b_onbase_plus_slugging_plus | On Base % plus Slugging %, indexed to league average (as a batter) |
| accomplishments | Awards won throughout career |
# Define the important columns and filter the datasets down
batter_columns = [
"b_war", "b_pa", "b_h", "b_hr", "b_sb", "b_bb", "b_so", "b_batting_avg",
"b_onbase_plus_slugging", "b_onbase_plus_slugging_plus", "b_home_run_perc",
"b_strikeout_perc", "b_base_on_balls_perc", "b_cwpa_bat", "b_baseout_runs",
"accomplishments", 'player_id'
]
pitcher_columns = [
"p_war", "p_w", "p_win_loss_perc", "p_earned_run_avg", "p_earned_run_avg_plus",
"p_g", "p_gs", "p_sho", "p_sv", "p_ip", "p_so", "p_whip", "p_fip",
"p_strikeouts_per_base_on_balls", "p_batting_avg", "p_onbase_plus_slugging",
"p_home_run_perc", "p_strikeout_perc", "p_cwpa_def", "p_baseout_runs",
"accomplishments", 'b_war','b_batting_avg', 'b_onbase_plus_slugging_plus', 'player_id'
]
batter_career_stats = batter_career_stats[batter_columns]
pitcher_career_stats = pitcher_career_stats[pitcher_columns]
Looking through the career stats DataFrames, we still see that some missing values are present. Specifically, these occur for a handful of metrics, for players who played so long ago that these metrics were not tracked in games. To solve this problem, we impute with the median of each stat. We choose the median because the dataset consists of all major league players, so the underlying data is likely skewed by outliers from the few excellent players in each statistic.
pitcher_career_stats.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3835 entries, 0 to 3834
Data columns (total 25 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 p_war 3835 non-null object
1 p_w 3835 non-null object
2 p_win_loss_perc 3835 non-null object
3 p_earned_run_avg 3835 non-null object
4 p_earned_run_avg_plus 3835 non-null object
5 p_g 3835 non-null object
6 p_gs 3835 non-null object
7 p_sho 3835 non-null object
8 p_sv 3835 non-null object
9 p_ip 3835 non-null object
10 p_so 3835 non-null object
11 p_whip 3835 non-null object
12 p_fip 3835 non-null object
13 p_strikeouts_per_base_on_balls 3835 non-null object
14 p_batting_avg 3271 non-null object
15 p_onbase_plus_slugging 3271 non-null object
16 p_home_run_perc 3835 non-null object
17 p_strikeout_perc 3835 non-null object
18 p_cwpa_def 3259 non-null object
19 p_baseout_runs 3259 non-null object
20 accomplishments 3835 non-null object
21 b_war 3750 non-null object
22 b_batting_avg 3605 non-null object
23 b_onbase_plus_slugging_plus 3605 non-null object
24 player_id 3835 non-null object
dtypes: object(25)
memory usage: 749.1+ KB
batter_career_stats.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4425 entries, 0 to 4424
Data columns (total 17 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 b_war 4425 non-null object
1 b_pa 4425 non-null object
2 b_h 4425 non-null object
3 b_hr 4425 non-null object
4 b_sb 4412 non-null object
5 b_bb 4425 non-null object
6 b_so 4267 non-null object
7 b_batting_avg 4425 non-null object
8 b_onbase_plus_slugging 4425 non-null object
9 b_onbase_plus_slugging_plus 4425 non-null object
10 b_home_run_perc 4425 non-null object
11 b_strikeout_perc 4267 non-null object
12 b_base_on_balls_perc 4425 non-null object
13 b_cwpa_bat 3777 non-null object
14 b_baseout_runs 3777 non-null object
15 accomplishments 4425 non-null object
16 player_id 4425 non-null object
dtypes: object(17)
memory usage: 587.8+ KB
To complete the imputation, we must also convert all possible stats to numeric values.
# Convert values to numeric
batter_career_stats.b_cwpa_bat = batter_career_stats.b_cwpa_bat.apply(lambda x: x.replace('%', '') if type(x) == str else x)
pitcher_career_stats.p_cwpa_def = pitcher_career_stats.p_cwpa_def.apply(lambda x: x.replace('%', '') if type(x) == str else x)
batter_career_stats = batter_career_stats.apply(lambda x: pd.to_numeric(x, errors='ignore'))
pitcher_career_stats = pitcher_career_stats.apply(lambda x: pd.to_numeric(x, errors='ignore'))
# Impute column medians, ignoring the string columns
imputer = SimpleImputer(strategy='median')
batter_career_stats.loc[:, ~batter_career_stats.columns.isin(['accomplishments', 'player_id'])] = imputer.fit_transform(batter_career_stats.loc[:, ~batter_career_stats.columns.isin(['accomplishments', 'player_id'])])
pitcher_career_stats.loc[:, ~pitcher_career_stats.columns.isin(['accomplishments', 'player_id'])] = imputer.fit_transform(pitcher_career_stats.loc[:, ~pitcher_career_stats.columns.isin(['accomplishments', 'player_id'])])

/var/folders/x1/v8sbf2px7bd3mh5s_v4zrxp40000gn/T/ipykernel_53307/1207380148.py:5: FutureWarning: errors='ignore' is deprecated and will raise in a future version. Use to_numeric without passing `errors` and catch exceptions explicitly instead
batter_career_stats = batter_career_stats.apply(lambda x: pd.to_numeric(x, errors='ignore'))
/var/folders/x1/v8sbf2px7bd3mh5s_v4zrxp40000gn/T/ipykernel_53307/1207380148.py:6: FutureWarning: errors='ignore' is deprecated and will raise in a future version. Use to_numeric without passing `errors` and catch exceptions explicitly instead
pitcher_career_stats = pitcher_career_stats.apply(lambda x: pd.to_numeric(x, errors='ignore'))
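As the FutureWarning notes, `errors='ignore'` is deprecated in newer pandas versions. A minimal sketch of an equivalent, future-proof conversion (assuming the same DataFrames as above) catches the exception explicitly instead:

```python
import pandas as pd

# Attempt numeric conversion per column; keep the original values when a
# column (e.g. 'accomplishments' or 'player_id') is not numeric
def to_numeric_if_possible(col):
    try:
        return pd.to_numeric(col)
    except (ValueError, TypeError):
        return col

batter_career_stats = batter_career_stats.apply(to_numeric_if_possible)
pitcher_career_stats = pitcher_career_stats.apply(to_numeric_if_possible)
```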
pitcher_career_stats.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3835 entries, 0 to 3834
Data columns (total 25 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 p_war 3835 non-null float64
1 p_w 3835 non-null int64
2 p_win_loss_perc 3835 non-null float64
3 p_earned_run_avg 3835 non-null float64
4 p_earned_run_avg_plus 3835 non-null int64
5 p_g 3835 non-null int64
6 p_gs 3835 non-null int64
7 p_sho 3835 non-null int64
8 p_sv 3835 non-null int64
9 p_ip 3835 non-null float64
10 p_so 3835 non-null int64
11 p_whip 3835 non-null float64
12 p_fip 3835 non-null float64
13 p_strikeouts_per_base_on_balls 3835 non-null float64
14 p_batting_avg 3835 non-null float64
15 p_onbase_plus_slugging 3835 non-null float64
16 p_home_run_perc 3835 non-null float64
17 p_strikeout_perc 3835 non-null float64
18 p_cwpa_def 3835 non-null float64
19 p_baseout_runs 3835 non-null float64
20 accomplishments 3835 non-null object
21 b_war 3835 non-null float64
22 b_batting_avg 3835 non-null float64
23 b_onbase_plus_slugging_plus 3835 non-null float64
24 player_id 3835 non-null object
dtypes: float64(16), int64(7), object(2)
memory usage: 749.1+ KB
Now that the table is free from all missing values, the last step is to account for the awards data, which is currently unstructured text. To do this, we write a function that searches the text for any of the 5 major awards we want to include, as well as their counts, before adding new columns for the award wins. By looking at awards, we also aim to capture greatness from players who had short stretches of dominance but didn't play for many years. The major awards are:
- MVP - Most Valuable Player (Generally a hitter)
- Cy Young - Best Pitcher
- Batting Title - Highest Batting Average
- Gold Glove - Best defensive player at a given position
- All Star - Being named to the ‘All Star’ team in a given year
def scan_awards(text, award):
    '''Given the text from our award scraping function, and a specific award, return the number of times
    the award was won'''
    # Check if the award was won at all
    win = award.lower() in text.lower()
    # If the award was not won, return 0. Else, return the number of times it was won
    if not win:
        return 0
    else:
        if ',' in text: # The player won multiple awards, and the formatting is different
            try:
                num_wins = int(text.lower().split(award.lower())[0].split(',')[-1].strip().replace('x', ''))
            except (IndexError, ValueError): # Sometimes when winning the award once, the format changes
                return 1
        else: # Single-award texts have no comma, so we split on the award itself
            try:
                num_wins = int(text.lower().split(award.lower())[0].strip().replace('x', ''))
            except (IndexError, ValueError):
                return 1
        return num_wins
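# Quick illustrative checks (hypothetical accomplishment strings):
# scan_awards('25x All-Star, 2x Gold Glove, 1x MVP', 'Gold Glove')  ->  2
# scan_awards('1x All-Star', 'All-Star')                            ->  1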
# Insert Pitching Awards
pitcher_career_stats["cy_youngs"] = pitcher_career_stats.accomplishments.apply(lambda x: scan_awards(x, "Cy Young"))
pitcher_career_stats["gold_gloves"] = pitcher_career_stats.accomplishments.apply(lambda x: scan_awards(x, "Gold Glove"))
pitcher_career_stats["mvps"] = pitcher_career_stats.accomplishments.apply(lambda x: scan_awards(x, "MVP"))
pitcher_career_stats["all_stars"] = pitcher_career_stats.accomplishments.apply(lambda x: scan_awards(x, "All-Star"))
# Insert Batting Awards
batter_career_stats["mvps"] = batter_career_stats.accomplishments.apply(lambda x: scan_awards(x, "MVP"))
batter_career_stats["gold_gloves"] = batter_career_stats.accomplishments.apply(lambda x: scan_awards(x, "Gold Glove"))
batter_career_stats["batting_titles"] = batter_career_stats.accomplishments.apply(lambda x: scan_awards(x, "Batting Title"))
batter_career_stats["all_stars"] = batter_career_stats.accomplishments.apply(lambda x: scan_awards(x, "All-Star"))
# Remove the general accomplishments columns from each DataFrame
batter_career_stats = batter_career_stats.drop(columns=['accomplishments'])
pitcher_career_stats = pitcher_career_stats.drop(columns=['accomplishments'])

# View the first few rows of the awards data for batters
batter_career_stats.head()[['player_id', 'mvps', 'gold_gloves', 'batting_titles', 'all_stars']]

|   | player_id | mvps | gold_gloves | batting_titles | all_stars |
|---|---|---|---|---|---|
| 0 | aaronha01 | 1 | 3 | 2 | 25 |
| 1 | aaronto01 | 0 | 0 | 0 | 0 |
| 2 | abbated01 | 0 | 0 | 0 | 0 |
| 3 | abbeych01 | 0 | 0 | 0 | 0 |
| 4 | abbotku01 | 0 | 0 | 0 | 0 |
With the DataFrames completely clean and the awards dealt with, we save them back to the processed data folder.
batter_career_stats.to_csv('../../data/processed-data/batter_career_stats.csv', index=False)
pitcher_career_stats.to_csv('../../data/processed-data/pitcher_career_stats.csv', index=False)
Appearances by Position DataFrame
Finally, we move on to the appearances by position dataset, which we will use to gain a more granular look at the positions each player played throughout their career. Because this is a list of DataFrames, we start by concatenating everything into a single DataFrame.
appearances_df = pd.concat(all_player_career_appearences, ignore_index=True)
appearances_df.head()

|   | lg_ID | G_all | GS | G_batting | G_defense | G_p_app | G_c | G_1b | G_2b | G_3b | G_ss | G_lf_app | G_cf_app | G_rf_app | G_of_app | G_dh | G_ph | G_pr | player_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 331 | 0 | 139 | 331 | 331 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | aardsda01 | |
| 1 | 3298 | 3173 | 3298 | 2985 | 0 | 0 | 210 | 43 | 7 | 0 | 315 | 308 | 2174 | 2760 | 201 | 122 | 1 | aaronha01 | |
| 2 | 437 | 206 | 437 | 346 | 0 | 0 | 232 | 7 | 10 | 0 | 135 | 1 | 2 | 137 | 0 | 102 | 35 | aaronto01 | |
| 3 | 448 | 91 | 81 | 448 | 448 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | aasedo01 | |
| 4 | 15 | 4 | 15 | 9 | 0 | 0 | 8 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 7 | 1 | abadan01 |
The dataset looks quite clean already! The first few columns tally overall game totals, the final two (excluding player_id) tally games as a pinch hitter and pinch runner, while the G_[position] columns each tally games at a given position. The two steps we take to finish the job are dropping unnecessary columns, and converting game tallies to numeric values and then percentage shares.
Additionally, a small subset of players have missing data for their positions. To combat this, we fill the missing values with equal percentages across positions, signaling no relative impact from position.
# Convert values to numeric types
appearances_df = appearances_df.apply(lambda x: pd.to_numeric(x, errors='ignore'))
# Define the important columns to keep and normalize
position_cols = ['G_p_app', 'G_c', 'G_1b', 'G_2b', 'G_3b',
'G_ss', 'G_lf_app', 'G_cf_app', 'G_rf_app',
'G_dh']
# Normalize the games by position into percentages
appearances_df[position_cols] = appearances_df[position_cols].div(appearances_df[position_cols].sum(axis=1), axis=0)
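# Note: rows with zero games across all position columns become 0/0 = NaN
# here; the fillna below replaces those rows with equal shares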
# Filter down to our final DataFrame
appearances_df = appearances_df[position_cols + ['player_id']]
# Fill missing values
appearances_df = appearances_df.fillna(1/9)

/var/folders/x1/v8sbf2px7bd3mh5s_v4zrxp40000gn/T/ipykernel_53307/2786824259.py:2: FutureWarning: errors='ignore' is deprecated and will raise in a future version. Use to_numeric without passing `errors` and catch exceptions explicitly instead
appearances_df = appearances_df.apply(lambda x: pd.to_numeric(x, errors='ignore'))
Looking at the final DataFrame, we see that it is now clean and can be written to the processed data folder.
appearances_df.head()

|   | G_p_app | G_c | G_1b | G_2b | G_3b | G_ss | G_lf_app | G_cf_app | G_rf_app | G_dh | player_id |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | aardsda01 |
| 1 | 0.0 | 0.0 | 0.064457 | 0.013198 | 0.002149 | 0.0 | 0.096685 | 0.094537 | 0.667281 | 0.061694 | aaronha01 |
| 2 | 0.0 | 0.0 | 0.599483 | 0.018088 | 0.025840 | 0.0 | 0.348837 | 0.002584 | 0.005168 | 0.000000 | aaronto01 |
| 3 | 1.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | aasedo01 |
| 4 | 0.0 | 0.0 | 0.888889 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.111111 | 0.000000 | abadan01 |
appearances_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21228 entries, 0 to 21227
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 G_p_app 21228 non-null float64
1 G_c 21228 non-null float64
2 G_1b 21228 non-null float64
3 G_2b 21228 non-null float64
4 G_3b 21228 non-null float64
5 G_ss 21228 non-null float64
6 G_lf_app 21228 non-null float64
7 G_cf_app 21228 non-null float64
8 G_rf_app 21228 non-null float64
9 G_dh 21228 non-null float64
10 player_id 21228 non-null object
dtypes: float64(10), object(1)
memory usage: 1.8+ MB
appearances_df.to_csv('../../data/processed-data/appearances.csv', index=False)
Final Dataset Building
Now that all of the individual DataFrames are cleaned (woohoo!), we move on to condensing the data into one DataFrame for batters and one for pitchers, which we will use for future exploration and analysis. To accomplish this, we take the BBWAA voting DataFrame as our base, as it already contains only those players who have been voted on in a BBWAA ballot, and then attach extra features, like the players' stats, from our adjacent DataFrames.
# Read in the core DataFrames
final_dataset = pd.read_csv('../../data/processed-data/all_bbwaa_voting.csv')
# Read in the stats DataFrames
batter_stats = pd.read_csv('../../data/processed-data/batter_career_stats.csv')
pitcher_stats = pd.read_csv('../../data/processed-data/pitcher_career_stats.csv')

def sort_row_by_bp(player_id, batter_stats, pitcher_stats):
    if player_id in pitcher_stats.player_id.values:
        return 'pitcher'
    elif player_id in batter_stats.player_id.values:
        return 'batter'
    else:
        return 'missing'
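# Performance note (sketch): for many repeated membership tests, building
# sets once is much faster than scanning .values on every call:
# pitcher_ids = set(pitcher_stats.player_id)
# batter_ids = set(batter_stats.player_id)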
# Attach the batter/pitcher designation for each entry and check the results
final_dataset['position'] = final_dataset.player_id.apply(lambda x: sort_row_by_bp(x, batter_stats, pitcher_stats))
final_dataset.position.value_counts()

position
batter 3144
missing 1226
pitcher 530
Name: count, dtype: int64
Upon investigation of the players with 'missing' values for their position, we can confirm that the vast majority are 'missing' because they were among the group that failed to be scraped in the data collection phase due to changes in the underlying HTML. Below we see the only names of players who did successfully have data scraped.
for player in final_dataset[final_dataset.position == 'missing'].player_id:
    if player == 'mccarjo99': # skip; throws an error because the id differs across datasets
        continue
    val = all_player_stats_dict[player]
    if not isinstance(val, str):
        print(player)

rickebr01
rickebr01
rickebr01
rickebr01
ensje01
beaucji01
There are only 4 players left who did successfully have data scraped. However, a quick search reveals that these players were actually primarily managers or executives who played early enough that the BBWAA eligibility rules were different, such that they were voted on from this pool. Because of this, we drop all players who are 'missing' in this dataset.
final_dataset = final_dataset[final_dataset.position != 'missing']
final_dataset.head()

|   | name | player_id | voting_year | year_on_ballot | votes_pct | ly_votes_pct | outcome | position |
|---|---|---|---|---|---|---|---|---|
| 0 | Ty Cobb | cobbty01 | 1936 | 1 | 98.2 | 12.975208 | elected | batter |
| 1 | Babe Ruth | ruthba01 | 1936 | 1 | 95.1 | 12.975208 | elected | batter |
| 2 | Honus Wagner | wagneho01 | 1936 | 1 | 95.1 | 12.975208 | elected | batter |
| 5 | Nap Lajoie | lajoina01 | 1936 | 1 | 64.6 | 12.975208 | limbo | batter |
| 6 | Tris Speaker | speaktr01 | 1936 | 1 | 58.8 | 12.975208 | limbo | batter |
Now that we are sure all dataset entries have a valid connection to the stats data, we split the data into batters and pitchers before attaching the career stats.
pitcher_df = final_dataset[final_dataset.position == 'pitcher']
batter_df = final_dataset[final_dataset.position == 'batter']
Before we merge the stats with the BBWAA voting data, we will also add in the scandal data as set forth in the earlier FanGraphs paper. We do this with a manual list of known scandalous players from that paper, filtered down to the players in our dataset. Reasons for scandals include:
- Use of Performance Enhancing Drugs
- Gambling
- Conduct detrimental to the game and personal image
After adding in the presence of a scandal, we remove the players with scandals, as the presence of a scandal essentially interferes with the normal voting process and, critically, is a factor known before the voting process itself. Thus, for future data we would also be able to exclude them from predictions.
# Define the ids with scandals
scandal_ids = [
'bondsba01', 'rodrial01',
'pettian01', 'ortizda01',
'schilcu01', 'dykstle01',
'ramirma02', 'mcgwima01',
'rosepe01']
# Create a scandal column in both the pitching and batting DataFrames
batter_df['scandal'] = batter_df.player_id.apply(lambda x: 1 if x in scandal_ids else 0)
pitcher_df['scandal'] = pitcher_df.player_id.apply(lambda x: 1 if x in scandal_ids else 0)
# Remove players with scandals
batter_df = batter_df[batter_df.scandal == 0]
pitcher_df = pitcher_df[pitcher_df.scandal == 0]

/var/folders/x1/v8sbf2px7bd3mh5s_v4zrxp40000gn/T/ipykernel_53307/2900954371.py:10: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
batter_df['scandal'] = batter_df.player_id.apply(lambda x: 1 if x in scandal_ids else 0)
/var/folders/x1/v8sbf2px7bd3mh5s_v4zrxp40000gn/T/ipykernel_53307/2900954371.py:11: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
pitcher_df['scandal'] = pitcher_df.player_id.apply(lambda x: 1 if x in scandal_ids else 0)
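The SettingWithCopyWarning above appears because `pitcher_df` and `batter_df` are slices of `final_dataset`. A minimal sketch of one way to silence it (assuming the same split as above) is to take explicit copies, so the later column assignments modify data the new DataFrames own:

```python
# Explicit copies make the new DataFrames independent of final_dataset
pitcher_df = final_dataset[final_dataset.position == 'pitcher'].copy()
batter_df = final_dataset[final_dataset.position == 'batter'].copy()
```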
batter_df = batter_df.merge(batter_stats, on='player_id', how='left')
pitcher_df = pitcher_df.merge(pitcher_stats, on='player_id', how='left')
Next, we attach the appearances by position data with another merge.
batter_df = batter_df.merge(appearances_df, on='player_id', how='left')
pitcher_df = pitcher_df.merge(appearances_df, on='player_id', how='left')
One other thing we must deal with is the fact that some advanced metrics, like championship win probability added, don't appear in the data until 1912, so players who played both before and after 1912 have depressed values due to the metric's summative property. To counter this, we override the values for these columns for the first few years of voting, until the league resumed play after WWII in 1946.
pitcher_df.head().columns

Index(['name', 'player_id', 'voting_year', 'year_on_ballot', 'votes_pct',
'ly_votes_pct', 'outcome', 'position', 'scandal', 'p_war', 'p_w',
'p_win_loss_perc', 'p_earned_run_avg', 'p_earned_run_avg_plus', 'p_g',
'p_gs', 'p_sho', 'p_sv', 'p_ip', 'p_so', 'p_whip', 'p_fip',
'p_strikeouts_per_base_on_balls', 'p_batting_avg',
'p_onbase_plus_slugging', 'p_home_run_perc', 'p_strikeout_perc',
'p_cwpa_def', 'p_baseout_runs', 'b_war', 'b_batting_avg',
'b_onbase_plus_slugging_plus', 'cy_youngs', 'gold_gloves', 'mvps',
'all_stars', 'G_p_app', 'G_c', 'G_1b', 'G_2b', 'G_3b', 'G_ss',
'G_lf_app', 'G_cf_app', 'G_rf_app', 'G_dh'],
dtype='object')
# Update some of the advanced stats for very early players
early_batters = batter_df[batter_df.voting_year <= 1945]
early_cwpa = early_batters.b_cwpa_bat.mean()
early_re24 = early_batters.b_baseout_runs.mean()
batter_df.b_cwpa_bat = batter_df.apply(lambda x: early_cwpa if x.voting_year <= 1945 else x.b_cwpa_bat, axis=1)
batter_df.b_baseout_runs = batter_df.apply(lambda x: early_re24 if x.voting_year <= 1945 else x.b_baseout_runs, axis=1)
# And do that same for the pitchers DataFrame
early_pitchers = pitcher_df[pitcher_df.voting_year <= 1945]
early_cwpa = early_pitchers.p_cwpa_def.mean()
early_re24 = early_pitchers.p_baseout_runs.mean()
pitcher_df.p_cwpa_def = pitcher_df.apply(lambda x: early_cwpa if x.voting_year <= 1945 else x.p_cwpa_def, axis=1)
pitcher_df.p_baseout_runs = pitcher_df.apply(lambda x: early_re24 if x.voting_year <= 1945 else x.p_baseout_runs, axis=1)
Finally, we write out our final datasets to the processed data folder!
batter_df.to_csv('../../data/processed-data/final_batter_df.csv', index=False)
pitcher_df.to_csv('../../data/processed-data/final_pitcher_df.csv', index=False)
Now that all of our data is cleaned, we move forward to exploring our datasets in depth and begin analyzing our data! This can be found in the Exploratory Data Analysis tab.