Predicting Baseball Hall of Fame Admittance Using Machine Learning


Every year, one of the main topics of the baseball off-season is Hall of Fame voting. Nearly 400 members of the Baseball Writers’ Association of America (BBWAA) cast a ballot each year, and a player who receives at least 75% of the vote is elected to the Hall of Fame. Over 15,000 players have played Major League Baseball, yet only 333 are in the Hall of Fame, so it’s a rare feat, reserved for the game’s greatest players. In the 2021 balloting, the writers elected no one, and fans continued to be disgruntled by the process.

What qualifies a player as a Hall of Famer is very subjective. Some voters consider traditional counting statistics, while others heavily weigh advanced metrics. Certain voters refuse to vote for players facing performance-enhancing drug allegations, while others are willing to look past them. One of the most controversial debates this year was over Curt Schilling. Schilling, one of the most clutch postseason pitchers of all time, has seen his reputation decline sharply since retiring, headlined by bankrupting a company, making offensive remarks, and adopting an outspoken alt-right persona.

Curt Schilling, Rhode Island and the Fall of 38 Studios - The New York Times

Many fans are frustrated, and some of the game’s greatest players are being left out of the Hall of Fame. While the character clause is used against players such as Schilling, some argue that players in previous generations held offensive views as well, yet still have a place in Cooperstown. Many of the things Schilling says are incredibly hurtful and wrong, and I am not defending him, or any past players who have held viewpoints different than mine.

The purpose of this project is to create an objective way to look at whether a player deserves admittance into the Hall of Fame. For this analysis, I looked only at hitters, but a similar report could be done for pitchers. My goal was to create a model into which I could enter a player’s career statistics and have the algorithm say whether they belong in the Hall of Fame.

I compiled career statistics for the top hitters in baseball history. I then created a column called Hall of Fame, which holds a 1 if the player is in the Hall and a 0 if not. With this label in place, I could use classification algorithms to build my model. These algorithms learn patterns from labeled data and then look for the same patterns in new data, which let me see what Hall of Fame hitters have in common. Once I had an optimal model, I could input statistics for recently retired players to predict whether they belong in the Hall of Fame. The model suggests that a player should get in when their statistics show patterns similar to those of existing Hall of Fame hitters, which makes it an objective way to judge a candidacy.
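
As a minimal sketch of the labeling step described above, the snippet below adds a binary `hall_of_fame` field to a few career records and separates predictors from labels. The player names, numbers, and column names are placeholders I made up for illustration; they are not the rows of the actual data set.

```python
# Hypothetical career lines: names and numbers are placeholders, not real data.
careers = [
    {"player": "Player A", "hits": 3100, "runs": 1700, "batting_avg": 0.310},
    {"player": "Player B", "hits": 2100, "runs": 1000, "batting_avg": 0.272},
    {"player": "Player C", "hits": 1900, "runs": 950, "batting_avg": 0.265},
]
inducted = {"Player A"}  # hypothetical set of Hall of Fame members

# Add the binary target column: 1 = in the Hall of Fame, 0 = not.
for row in careers:
    row["hall_of_fame"] = 1 if row["player"] in inducted else 0

feature_cols = ["hits", "runs", "batting_avg"]
X = [[row[col] for col in feature_cols] for row in careers]  # predictors
y = [row["hall_of_fame"] for row in careers]                 # labels

print(y)  # [1, 0, 0]
```

A classifier is then trained on `X` and `y`, and later asked for a 0 or 1 on rows it has never seen.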

First 5 Rows of Data

The simplified image below shows what a classification algorithm does and how it differs from regression, a topic that I have covered before. Regression predicts a continuous output, such as estimated attendance at an NBA game based on a number of factors. Classification, in this case, has a binary output: 1 if the player is in the Hall of Fame, 0 if not.

[Image: classification versus regression in machine learning]

For the data set, I looked at the top 451 hitters (in terms of career hits) in MLB history. I omitted active players, as well as players who are still on the Hall of Fame ballot, since it would be unfair to classify them before their fate is sealed. Of these 451 hitters, 139 are in the Hall of Fame. To build my model, I split the data into a training set (70%) and a test set (30%). Holding out a test set makes it possible to check for overfitting, which is a concern given the small sample size.
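
A 70/30 split like the one described above can be sketched with scikit-learn's `train_test_split`. The features here are random placeholders with the same shape as the data set (451 players, 139 positives, 14 predictors). The `stratify` argument and the `random_state` value are my assumptions, not something the original analysis is known to have used; stratifying simply keeps the share of Hall of Famers similar in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_players, n_hof, n_features = 451, 139, 14

# Placeholder data: random features, with 139 of 451 rows labeled 1.
X = rng.normal(size=(n_players, n_features))
y = np.zeros(n_players, dtype=int)
y[:n_hof] = 1

# Hold out 30% of the players for testing; stratify is an assumption.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)
print(round(y_train.mean(), 3), round(y_test.mean(), 3))
```

With only 451 rows, the test set holds 136 players, so a single split gives a fairly noisy accuracy estimate; this is one reason to pair it with cross-validation.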

There are many classification algorithms available, so using Python I tested five of them to see which would give me the best results: Logistic Regression, Support Vector Machines, MLP Classifier, Gaussian Naïve Bayes and Random Forest. More information about these can be found at the scikit-learn website. For each algorithm I looked at prediction accuracy and performed 10-fold cross-validation. I also looked at F1, Recall and AUC scores. Those metrics are explained here. If anyone wants to dive deeper into these technical details and the process behind creating the model, please reach out.
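
For readers who want to see the shape of such a comparison, here is a rough sketch that runs 10-fold cross-validation over the five model families named above. It is not the original code. The data comes from `make_classification` as a stand-in for the real career statistics, the scaling step and all hyperparameters are my assumptions, and every metric is averaged over the folds for brevity rather than computed on a separate test set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in: 451 rows, 14 features, roughly 31% positives.
X, y = make_classification(
    n_samples=451, n_features=14, n_informative=6,
    weights=[0.69, 0.31], random_state=0,
)

# Scale-sensitive models get a StandardScaler in front (an assumption).
models = {
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)),
    "Support Vector Machine": make_pipeline(StandardScaler(), SVC()),
    "MLP Classifier": make_pipeline(
        StandardScaler(), MLPClassifier(max_iter=2000, random_state=0)),
    "Gaussian Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

scoring = ["accuracy", "f1", "recall", "roc_auc"]
results = {}
for name, model in models.items():
    # cv=10 uses stratified 10-fold splits for classifiers.
    cv = cross_validate(model, X, y, cv=10, scoring=scoring)
    results[name] = {
        "acc_mean": cv["test_accuracy"].mean(),
        "acc_std": cv["test_accuracy"].std(),
        "auc": cv["test_roc_auc"].mean(),
        "f1": cv["test_f1"].mean(),
        "recall": cv["test_recall"].mean(),
    }
    print(name, {k: round(v, 2) for k, v in results[name].items()})
```

Because the data is synthetic, the numbers this prints say nothing about Hall of Fame voting; only the workflow carries over.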

Model          Prediction Accuracy  10-Fold Mean  10-Fold St. Dev.  AUC   F1    Recall
Random Forest  0.88                 0.87          0.05              0.91  0.88  0.88

The best-performing model was the Random Forest Classifier. I then used grid search to find the best hyperparameters for the model. I was also able to look at feature importance, which showed me which statistics mattered most in predicting Hall of Fame admittance. Out of the 14 predictor variables, the top three were batting average, hits and runs. With the model built, I used it to make out-of-sample predictions.
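
The tuning and feature-importance steps might look roughly like the sketch below, which uses `GridSearchCV` and the fitted forest's `feature_importances_`. The parameter grid, the 5-fold setting, and the 14 column names are assumptions for illustration, and the data is again synthetic, so the ranking it prints will not reproduce the batting average, hits and runs result.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Assumed labels for the 14 predictors; the real column names may differ.
feature_names = ["Yrs", "G", "AB", "R", "H", "2B", "3B",
                 "HR", "RBI", "BB", "SO", "SB", "CS", "BA"]

# Synthetic stand-in for the career statistics.
X, y = make_classification(
    n_samples=451, n_features=len(feature_names), n_informative=6,
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

# A small, hypothetical grid; the values actually searched are not known.
param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [None, 5],
    "min_samples_leaf": [1, 3],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=0), param_grid,
    cv=5, scoring="accuracy",
)
search.fit(X_train, y_train)  # refits the best model on all training rows

best_rf = search.best_estimator_
print("best params:", search.best_params_)
print("test accuracy:", round(best_rf.score(X_test, y_test), 2))

# Impurity-based importances sum to 1; larger means more influence.
ranked = sorted(zip(feature_names, best_rf.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked[:3]:
    print(name, round(importance, 3))
```

Impurity-based importances can favor correlated or high-variance features, so they are best read as a rough guide rather than a precise ranking.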

I then uploaded career statistics for 25 players who will appear on the Hall of Fame ballot for the first time in the next 5 years. Now, I was able to objectively determine which of these players belong in the Hall of Fame.
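
Scoring new players with a fitted model comes down to calling `predict` (and optionally `predict_proba`) on rows that have the same columns, in the same order, as the training data. The sketch below shows only that mechanic. The training data is random, the labeling rule is arbitrary, and the candidate names are placeholders, so the printed labels carry no baseball meaning.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

n_features = 14
rng = np.random.default_rng(0)

# Placeholder training data with an arbitrary rule that creates two classes.
X_train = rng.normal(size=(315, n_features))
y_train = (X_train[:, 4] + X_train[:, 13] > 0.8).astype(int)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Hypothetical candidates: each row must follow the training column order.
new_players = {
    "Candidate 1": rng.normal(size=n_features),
    "Candidate 2": rng.normal(size=n_features),
}
X_new = np.vstack(list(new_players.values()))

labels = model.predict(X_new)             # 1 = predicted Hall of Famer
probs = model.predict_proba(X_new)[:, 1]  # share of trees voting for class 1

for name, label, prob in zip(new_players, labels, probs):
    print(name, int(label), round(float(prob), 2))
```

Looking at the probability alongside the 0/1 label helps separate clear-cut cases from borderline ones.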

Player             Yrs     G     AB     R     H   2B   3B   HR   RBI    BB    SO   SB   CS     BA
Carl Crawford       15  1716   6655   998  1931  309  123  136   766   377  1067  480  109  0.290
Prince Fielder      12  1611   5821   862  1645  321   10  319  1028   847  1155   18   11  0.283
Ryan Howard         13  1572   5707   848  1475  277   21  382  1194   709  1843   12    5  0.258
David Ortiz         20  2408   8640  1419  2472  632   19  541  1768  1319  1750   17    9  0.286
A.J. Pierzynski     19  2059   7290   807  2043  407   24  188   909   308   895   15   23  0.280
Alex Rodriguez      22  2784  10566  2021  3115  548   31  696  2086  1338  2287  329   76  0.295
Jimmy Rollins       17  2275   9294  1421  2455  511  115  231   936   813  1264  470  105  0.264
Mark Teixeira       14  1862   6936  1099  1862  408   18  409  1298   918  1441   26    7  0.268
Carlos Beltran      20  2586   9768  1582  2725  565   78  435  1587  1084  1795  312   49  0.279
Jose Bautista       15  1798   6051  1022  1496  312   17  344   975  1032  1394   70   32  0.247
Adrian Beltre       21  2933  11068  1524  3166  636   38  477  1707   848  1732  121   42  0.286
Adrian Gonzalez     15  1929   7139   997  2050  437   12  317  1202   782  1401    6    7  0.287
Matt Holliday       15  1903   7009  1157  2096  468   32  316  1220   802  1362  108   37  0.299
Victor Martinez     16  1973   7297   914  2153  423    3  246  1178   730   891    7    7  0.295
Joe Mauer           15  1858   6930  1018  2123  428   30  143   923   939  1034   52   19  0.306
Jose Reyes          16  1877   7552  1180  2138  387  131  145   719   589   914  517  127  0.283
Chase Utley         16  1937   6857  1103  1885  411   58  259  1025   724  1193  154   22  0.275
David Wright        14  1585   5998   949  1777  390   26  242   970   762  1292  196   65  0.296
Melky Cabrera       15  1887   6878   895  1962  383   45  144   854   510   891  101   37  0.285
Curtis Granderson   16  2057   7236  1217  1800  346   95  344   937   924  1916  153   50  0.249
Adam Jones          14  1823   7009   963  1939  336   29  282   945   340  1395   97   35  0.277
Ian Kinsler         14  1888   7423  1243  1999  416   41  257   909   693  1046  243   74  0.269
Dustin Pedroia      14  1512   6031   922  1805  394   15  140   725   624   654  138   46  0.299
Hanley Ramirez      15  1668   6349  1049  1834  375   32  271   917   660  1234  281   93  0.289
Ichiro Suzuki       19  2653   9934  1420  3089  362   96  117   780   647  1080  509  117  0.311

Of these 25, the model predicts that David Ortiz, Alex Rodriguez, Carlos Beltran, Adrian Beltre, and Ichiro Suzuki will be admitted. Based on my intuition, that made sense. There were some other really good players on that list, but those 5 definitely stand out.


I must admit, this project was more about me getting comfortable using Python and studying machine learning algorithms than it was about creating a perfectly realistic model. For example, the model only takes into account traditional counting statistics like hits and home runs; it does not include any advanced metrics. Additionally, it is offense only, and doesn’t factor in fielding performance. Based on hitting statistics alone, these 5 players deserve to get in, but it will be interesting to see how the voting plays out. Alex Rodriguez is an admitted PED user, and Beltran was involved in the Astros sign-stealing scandal. I personally believe they both belong in the Hall of Fame, which is a museum of baseball history, not some sort of moral high ground. It may not be realistic to determine who belongs in the Hall of Fame based on a single model, but perhaps these baseball writers have too much power in a subjective process, and more universal standards should be implemented. Whether you agree or disagree, I welcome your thoughts, and I am very much looking forward to this baseball season!
