Predicting Baseball Hall of Fame Admittance Using Machine Learning

Alex Rodriguez https://bostonglobe-prod.cdn.arcpublishing.com/resizer/zUfgMp90Hk1A_PakUjRPPGtg_LA=/1440×0/arc-anglerfish-arc2-prod-bostonglobe.s3.amazonaws.com/public/5ZAV2FEQJAI6HPJJQYWQQX3GL4.jpg

Every year, one of the main topics of the baseball off-season is Hall of Fame voting. Nearly 400 Baseball Writers’ Association of America (BBWAA) voters cast a ballot each year. If a player receives at least 75% of the vote, they are elected to the Hall of Fame. Over 15,000 players have played Major League Baseball, yet only 333 are in the Hall of Fame, so it’s a rare feat, reserved for the game’s greatest players. In the 2021 vote, no player was elected, and fans continued to be disgruntled by the process.

What qualifies a player as a Hall of Famer is very subjective. Some voters consider traditional counting statistics, while others weigh advanced metrics heavily. Certain voters refuse to vote for players facing performance-enhancing drug allegations, while others are willing to look past them. One of the most controversial debates this year was over Curt Schilling. Schilling, one of the most clutch postseason pitchers of all time, has seen his reputation go sharply downhill since retiring, headlined by bankrupting a company, offensive remarks and an outspoken alt-right personality.

Curt Schilling, Rhode Island and the Fall of 38 Studios - The New York Times — https://static01.nyt.com/images/2013/04/21/business/21-SCHILLING-JP8/21-SCHILLING-JP8-jumbo.jpg

Many fans are frustrated, and some of the game’s greatest players are being left out of the Hall of Fame. While the character clause is used against players such as Schilling, some argue that players in previous generations held offensive views as well, yet still have a place in Cooperstown. Many of the things Schilling says are incredibly hurtful and wrong, and I am not defending him, or any past players who have held viewpoints different from mine.

The purpose of this project is to create an objective way to look at whether a player deserves admittance into the Hall of Fame. For this analysis, I looked only at hitters, but a similar report could be done for pitchers. My goal was to create a model where I could enter the career statistics for a baseball player, and the algorithm would say whether they belong in the Hall of Fame.

I compiled career statistics for the top hitters in baseball history. I then created a column called Hall of Fame, where each player has a 1 if they are in the Hall, or a 0 if not. By doing this, I could use classification algorithms to build my model. These algorithms seek to recognize patterns in the data, and then try to find those patterns in future data sets. This allowed me to see what patterns exist among Hall of Fame hitters. Once I had an optimal model, I could input statistics for recently retired players to predict whether they belong in the Hall of Fame. This is an objective way to look at whether someone deserves to be in the Hall of Fame: the classification model will suggest that someone should get in if the patterns in their statistics are similar to those of other Hall of Fame hitters.
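
As a rough illustration of the labeling step, here is a minimal sketch in plain Python. The player names, numbers and the set of inducted players are all made-up placeholders, not rows from my data set; the point is only to show how a 0/1 Hall of Fame column turns a stat table into predictors and a target.

```python
# Hypothetical career rows; names and numbers are placeholders, not real data.
players = [
    {"PLAYER": "Player A", "H": 3010, "HR": 410, "BA": 0.302},
    {"PLAYER": "Player B", "H": 2150, "HR": 95, "BA": 0.271},
    {"PLAYER": "Player C", "H": 2680, "HR": 330, "BA": 0.289},
]

# Hypothetical set of inducted players used to derive the label.
inducted = {"Player A", "Player C"}

# Add the binary target column: 1 if in the Hall, 0 if not.
for row in players:
    row["Hall of Fame"] = 1 if row["PLAYER"] in inducted else 0

# Separate the predictors (X) from the target (y), the shape classifiers expect.
feature_names = ["H", "HR", "BA"]
X = [[row[name] for name in feature_names] for row in players]
y = [row["Hall of Fame"] for row in players]

print(y)
```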

[Image: First 5 Rows of Data]

The simplified image below shows what a classification algorithm does, and how it differs from regression, a topic that I have covered before. Regression is used to predict a continuous output, such as estimated attendance at an NBA game based on a number of factors. In this case, classification has a binary output: 1 if the player is in the Hall of Fame, or 0 if not.

[Image: classification in machine learning, from https://whataftercollege.com/wp-content/uploads/2020/05/Classification-of-Machine-Learning.jpg]

For the data set, I looked at the top 451 hitters (in terms of career hits) in MLB history. I omitted active players, as well as players who are still on the Hall of Fame ballot, since it would be unfair to classify them before their fate is sealed. Of these 451 hitters, 139 are in the Hall of Fame. To build my model, I split the data into a training set (70%) and a test set (30%). Scoring the model on data it has never seen is a check against overfitting, which is a concern given the small sample size.
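
To make the 70/30 split concrete, here is a small sketch in plain Python. It uses placeholder labels with the same class balance as my data (139 of 451), and it stratifies by class so both sets keep roughly the same share of Hall of Famers. The stratification, the seed and the function name `stratified_split` are assumptions for illustration, not a record of exactly what I ran.

```python
import random

def stratified_split(labels, test_fraction=0.30, seed=42):
    """Return (train_idx, test_idx), holding out test_fraction of each class."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)

def hof_rate(labels, idx):
    """Share of Hall of Famers among the rows in idx."""
    return sum(labels[i] for i in idx) / len(idx)

# Placeholder labels with the class balance reported above: 139 of 451 inducted.
labels = [1] * 139 + [0] * 312
train_idx, test_idx = stratified_split(labels)

print(len(train_idx), len(test_idx))
print(round(hof_rate(labels, train_idx), 3), round(hof_rate(labels, test_idx), 3))
```

With so few positive examples, an unstratified split could by chance leave the test set with noticeably more or fewer Hall of Famers than the training set, which makes the scores harder to interpret.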

There are many different classification algorithms available, so using Python I tested 5 of them to see which would give me the best results. The 5 that I looked at were Logistic Regression, Support Vector Machines, MLP Classifier, Gaussian Naïve Bayes and Random Forest. More information about these can be found at the scikit-learn website. For each algorithm I looked at prediction accuracy on the test set and performed 10-fold cross-validation. I also looked at F1, Recall and AUC scores. Those metrics are explained here. If anyone wants to dive deeper into these technical details and the process behind creating the model, please reach out.
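
For readers who want to see the shape of this comparison, here is a hedged sketch using scikit-learn. It runs on synthetic data from `make_classification` sized like my data set (451 rows, 14 predictors, roughly 31% positives), so the scores it prints say nothing about real players. The scaling step and every hyperparameter shown are assumptions, not necessarily the settings I used.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the real data: 451 rows, 14 predictors, ~31% positives.
X, y = make_classification(
    n_samples=451, n_features=14, n_informative=6,
    weights=[0.69, 0.31], random_state=0,
)

# Scaling is an assumption; it mainly helps the distance- and gradient-based models.
models = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "Support Vector Machine": make_pipeline(StandardScaler(), SVC()),
    "MLP Classifier": make_pipeline(StandardScaler(), MLPClassifier(max_iter=500, random_state=0)),
    "Gaussian Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(random_state=0),
}

# 10-fold cross-validation: report the mean and standard deviation of accuracy.
results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
    results[name] = (scores.mean(), scores.std())
    print(f"{name}: mean={scores.mean():.2f}, st. dev.={scores.std():.2f}")
```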

Model	Prediction Accuracy	10-Fold Mean	10-Fold St. Dev.	AUC	F1	Recall
Random Forest	0.88	0.87	0.05	0.91	0.88	0.88
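
To show what the accuracy, recall and F1 columns measure, here is a small worked sketch in plain Python on ten made-up predictions (not output from my model). It also illustrates one general fact worth knowing when reading a table like the one above: when recall is averaged across both classes weighted by class size, it always equals accuracy. Whether that averaging explains the matching 0.88 values is a guess on my part, and the helper names are hypothetical.

```python
def precision_for(y_true, y_pred, cls):
    """Of the rows predicted as cls, the share that truly are cls."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    predicted = sum(1 for p in y_pred if p == cls)
    return hits / predicted if predicted else 0.0

def recall_for(y_true, y_pred, cls):
    """Of the rows that truly are cls, the share predicted as cls."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    actual = sum(1 for t in y_true if t == cls)
    return hits / actual if actual else 0.0

def f1_for(y_true, y_pred, cls):
    """Harmonic mean of precision and recall for cls."""
    p = precision_for(y_true, y_pred, cls)
    r = recall_for(y_true, y_pred, cls)
    return 2 * p * r / (p + r) if p + r else 0.0

# Made-up labels: 3 Hall of Famers (1) and 7 others (0), with 2 mistakes.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

# Recall averaged over both classes, weighted by how common each class is.
weighted_recall = sum(
    (y_true.count(cls) / len(y_true)) * recall_for(y_true, y_pred, cls)
    for cls in (0, 1)
)

print(round(accuracy, 3), round(weighted_recall, 3))
print(round(recall_for(y_true, y_pred, 1), 3), round(f1_for(y_true, y_pred, 1), 3))
```

On this toy example the Hall of Fame class on its own has a recall of 2/3 even though accuracy is 0.8, which is why the positive-class numbers are worth reporting separately when the classes are imbalanced.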

The best-performing model was the Random Forest Classifier. Additionally, I used grid search to find the best hyperparameters for the model. I was able to look at feature importance, which showed me which statistics were most important in predicting Hall of Fame admittance. Out of the 14 predictor variables, the top three were batting average, hits and runs. Now that I had created my model, I used it to make out-of-sample predictions.
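
Here is a hedged sketch of what a grid search plus feature-importance ranking can look like in scikit-learn. The data is again synthetic, so attaching the 14 column names from this post to it is purely cosmetic and the ranking it prints is meaningless for baseball. The parameter grid, the 5-fold setting and the random seeds are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# The 14 predictor names used in this post; attached to synthetic columns here.
FEATURES = ["YRS", "G", "AB", "R", "H", "2B", "3B", "HR", "RBI", "BB", "SO", "SB", "CS", "BA"]

# Synthetic stand-in data: 451 rows, 14 predictors, ~31% positives.
X, y = make_classification(
    n_samples=451, n_features=len(FEATURES), n_informative=6,
    weights=[0.69, 0.31], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)

# Small illustrative grid; each combination is scored with cross-validation.
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0), param_grid, cv=5, scoring="accuracy"
)
search.fit(X_train, y_train)

# The refit best model exposes one importance per feature, summing to 1.
best_model = search.best_estimator_
ranked = sorted(
    zip(FEATURES, best_model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)

print("Best hyperparameters:", search.best_params_)
print("Held-out accuracy:", round(best_model.score(X_test, y_test), 2))
for name, importance in ranked[:3]:
    print(f"{name}: {importance:.3f}")
```

Note that the grid search only ever sees the training set; the held-out 30% is touched once at the end, so the reported accuracy is not inflated by the tuning.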

I then uploaded career statistics for 25 players who will appear on the Hall of Fame ballot for the first time in the next 5 years. Now, I was able to objectively determine which of these players belong in the Hall of Fame.

PLAYER	YRS	G	AB	R	H	2B	3B	HR	RBI	BB	SO	SB	CS	BA
Carl Crawford	15	1716	6655	998	1931	309	123	136	766	377	1067	480	109	0.290
Prince Fielder	12	1611	5821	862	1645	321	10	319	1028	847	1155	18	11	0.283
Ryan Howard	13	1572	5707	848	1475	277	21	382	1194	709	1843	12	5	0.258
David Ortiz	20	2408	8640	1419	2472	632	19	541	1768	1319	1750	17	9	0.286
A.J. Pierzynski	19	2059	7290	807	2043	407	24	188	909	308	895	15	23	0.280
Alex Rodriguez	22	2784	10566	2021	3115	548	31	696	2086	1338	2287	329	76	0.295
Jimmy Rollins	17	2275	9294	1421	2455	511	115	231	936	813	1264	470	105	0.264
Mark Teixeira	14	1862	6936	1099	1862	408	18	409	1298	918	1441	26	7	0.268
Carlos Beltran	20	2586	9768	1582	2725	565	78	435	1587	1084	1795	312	49	0.279
Jose Bautista	15	1798	6051	1022	1496	312	17	344	975	1032	1394	70	32	0.247
Adrian Beltre	21	2933	11068	1524	3166	636	38	477	1707	848	1732	121	42	0.286
Adrian Gonzalez	15	1929	7139	997	2050	437	12	317	1202	782	1401	6	7	0.287
Matt Holliday	15	1903	7009	1157	2096	468	32	316	1220	802	1362	108	37	0.299
Victor Martinez	16	1973	7297	914	2153	423	3	246	1178	730	891	7	7	0.295
Joe Mauer	15	1858	6930	1018	2123	428	30	143	923	939	1034	52	19	0.306
Jose Reyes	16	1877	7552	1180	2138	387	131	145	719	589	914	517	127	0.283
Chase Utley	16	1937	6857	1103	1885	411	58	259	1025	724	1193	154	22	0.275
David Wright	14	1585	5998	949	1777	390	26	242	970	762	1292	196	65	0.296
Melky Cabrera	15	1887	6878	895	1962	383	45	144	854	510	891	101	37	0.285
Curtis Granderson	16	2057	7236	1217	1800	346	95	344	937	924	1916	153	50	0.249
Adam Jones	14	1823	7009	963	1939	336	29	282	945	340	1395	97	35	0.277
Ian Kinsler	14	1888	7423	1243	1999	416	41	257	909	693	1046	243	74	0.269
Dustin Pedroia	14	1512	6031	922	1805	394	15	140	725	624	654	138	46	0.299
Hanley Ramirez	15	1668	6349	1049	1834	375	32	271	917	660	1234	281	93	0.289
Ichiro Suzuki	19	2653	9934	1420	3089	362	96	117	780	647	1080	509	117	0.311
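
Before handing rows like these to a model, it is worth checking them. The sketch below, in plain Python, parses three rows copied from the table above, confirms that each listed batting average matches hits divided by at-bats, and arranges the 14 predictors in column order. A fitted classifier is not included; passing these vectors to something like a hypothetical `model.predict` would be the next step.

```python
HEADER = "PLAYER YRS G AB R H 2B 3B HR RBI BB SO SB CS BA".split()

# Rows copied from the table above, comma-separated here for readability.
ROWS = [
    "David Ortiz,20,2408,8640,1419,2472,632,19,541,1768,1319,1750,17,9,0.286",
    "Alex Rodriguez,22,2784,10566,2021,3115,548,31,696,2086,1338,2287,329,76,0.295",
    "Ichiro Suzuki,19,2653,9934,1420,3089,362,96,117,780,647,1080,509,117,0.311",
]

def parse_row(line):
    """Split one row into the player name and a dict of numeric stats."""
    values = line.split(",")
    stats = dict(zip(HEADER[1:], (float(v) for v in values[1:])))
    return values[0], stats

features = {}
for line in ROWS:
    name, stats = parse_row(line)
    # Sanity check: the listed batting average should equal hits / at-bats.
    assert round(stats["H"] / stats["AB"], 3) == stats["BA"], name
    # Keep the 14 predictors in header order, one vector per player.
    features[name] = [stats[col] for col in HEADER[1:]]

print({name: len(vec) for name, vec in features.items()})
```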

Of these 25, the model predicts that David Ortiz, Alex Rodriguez, Carlos Beltran, Adrian Beltre, and Ichiro Suzuki will be admitted. Based on my intuition, that made sense. There were some other really good players on that list, but those 5 definitely stand out.

David Ortiz https://images2.minutemediacdn.com/image/fetch/w_736,h_485,c_fill,g_auto,f_auto/https%3A%2F%2Fbosoxinjection.com%2Fwp-content%2Fuploads%2Fgetty-images%2F2017%2F07%2F186588264-850×560.jpeg

I must admit, this project was more about getting comfortable with Python and studying machine learning algorithms than about creating a perfectly realistic model. For example, the model only takes into account traditional statistics like hits and home runs; it does not include any advanced metrics. Additionally, it is offense only, and doesn’t factor in fielding performance. Based on hitting statistics alone, these 5 players deserve to get in, but it will be interesting to see whether the voters agree. Alex Rodriguez is an admitted PED user, and Beltran was involved in the Astros sign-stealing scandal. I personally believe they both belong in the Hall of Fame, which is a museum of baseball history, not some sort of moral high ground. It may not be realistic to determine who belongs in the Hall of Fame based on a single model, but perhaps these baseball writers have too much power in a subjective process, and more universal standards should be implemented. Whether you agree or disagree, I welcome your thoughts, and I am very much looking forward to this baseball season!
