
Every year, one of the main topics surrounding the baseball off-season is Hall of Fame voting. There are nearly 400 Baseball Writers Association of America (BBWAA) voters who cast a ballot every year. If a player receives at least 75% of the vote, they are elected to the Hall of Fame. Over 15,000 players have played Major League Baseball, yet only 333 are in the Hall of Fame, so it’s a rare feat, reserved for the game’s greatest players. In 2020, no baseball player was elected, and fans continued to be disgruntled by the process.
What qualifies a player as a Hall of Famer is very subjective. Some voters consider traditional counting statistics, some heavily weigh advanced metrics. Certain voters refuse to vote in players with performance enhancing drug allegations, while others are willing to look past it. One of the most controversial debates this year was over Curt Schilling. Schilling, one of the most clutch postseason pitchers of all time, has seen his reputation go sharply downhill since retiring, headlined by bankrupting a company, offensive remarks and an outspoken alt-right personality.

Many fans are frustrated, and some of the game’s greatest players are being left out of the Hall of Fame. While the character clause is used against players such as Schilling, some argue that players in previous generations held offensive views as well, yet still have a place in Cooperstown. Many of the things Schilling says are incredibly hurtful and wrong, and I am not defending him, or any past players who have held viewpoints different than mine.
The purpose of this project is to create an objective way to look at whether a player deserves admittance into the Hall of Fame. For this analysis, I looked only at hitters, but a similar report could be done for pitchers. My goal was to create a model where I could enter in the career statistics for a baseball player, and the algorithm would say if they belonged in the Hall of Fame or not.
I compiled career statistics for the top hitters in baseball history. I then created a column called Hall of Fame, where the player had either a 1 if they are in the Hall, or 0 if not. By doing this, I could use classification algorithms to build my model. These algorithms seek to recognize patterns in the data, and then try to find those patterns in future data sets. This allowed me to see what patterns exist among Hall of Fame hitters. Once I created an optimal model, I could input statistics for recently retired players to predict if they belong in the Hall of Fame or not. This is an objective way to look at if someone deserves to be in the Hall of Fame. This classification model will suggest that someone should get in if there are patterns in their statistics similar to other Hall of Fame hitters.

This simplified image below shows what a classification algorithm does, and how it’s different than regression, a topic that I have covered before. Regression is used to predict a continuous output, such as estimated attendance at an NBA game based on a number of factors. In this case, classification has a binary output, where it is either 1 if the player is in the Hall of Fame or 0 if not.

For the data set, I looked at the top 451 hitters (in terms of career hits) in MLB history. I omitted active players, as well as players who are still on the Hall of Fame ballot, since it would be unfair to classify them before their fate is sealed. Of these 451 hitters, 139 of them are in the Hall of Fame. To build my model, I split the data into a training (70%) and test (30%) set. This was to try and prevent overfitting, which is a concern given the small sample size.
There are many different classification algorithms available, so using Python I tested 5 of them to see which would give me the best results. The 5 that I looked at were Logistic Regression, Support Vector Machines, MLP Classifier, Gaussian Naïve Bayes and Random Forest. More information about these can be found at the scikit-learn website. For each algorithm I looked at prediction accuracy and performed 10-fold cross validation as well. In addition to prediction accuracy, I looked at F1, Recall and AUC scores as well. Those metrics are explained here. If anyone wants to dive deeper into these technical details and the process behind creating the model, please reach out.
Model | Prediction Accuracy | 10-Fold Mean | 10-Fold St. Dev. | AUC | F1 | Recall |
Random Forest | 0.88 | 0.87 | 0.05 | 0.91 | 0.88 | 0.88 |
The best performing model was the Random Forest Classifier algorithm. Additionally, I used grid search to find the best hyper parameters for the model. I was able to look at feature importance, which showed me which statistics were most important in predicting Hall of Fame admittance. Out of the 14 predictor variables, the top three were batting average, hits and runs. Now that I had created my model, I used it to make out of sample predictions.
I then uploaded career statistics for 25 players who will appear on the Hall of Fame ballot for the first time in the next 5 years. Now, I was able to objectively determine which of these players belong in the Hall of Fame.
PLAYER | YRS | G | AB | R | H | 2B | 3B | HR | RBI | BB | SO | SB | CS | BA |
Carl Crawford | 15 | 1716 | 6655 | 998 | 1931 | 309 | 123 | 136 | 766 | 377 | 1067 | 480 | 109 | 0.29 |
Prince Fielder | 12 | 1611 | 5821 | 862 | 1645 | 321 | 10 | 319 | 1028 | 847 | 1155 | 18 | 11 | 0.283 |
Ryan Howard | 13 | 1572 | 5707 | 848 | 1475 | 277 | 21 | 382 | 1194 | 709 | 1843 | 12 | 5 | 0.258 |
David Ortiz | 20 | 2408 | 8640 | 1419 | 2472 | 632 | 19 | 541 | 1768 | 1319 | 1750 | 17 | 9 | 0.286 |
A.J. Pierzynski | 19 | 2059 | 7290 | 807 | 2043 | 407 | 24 | 188 | 909 | 308 | 895 | 15 | 23 | 0.28 |
Alex Rodriguez | 22 | 2784 | 10566 | 2021 | 3115 | 548 | 31 | 696 | 2086 | 1338 | 2287 | 329 | 76 | 0.295 |
Jimmy Rollins | 17 | 2275 | 9294 | 1421 | 2455 | 511 | 115 | 231 | 936 | 813 | 1264 | 470 | 105 | 0.264 |
Mark Teixeira | 14 | 1862 | 6936 | 1099 | 1862 | 408 | 18 | 409 | 1298 | 918 | 1441 | 26 | 7 | 0.268 |
Carlos Beltran | 20 | 2586 | 9768 | 1582 | 2725 | 565 | 78 | 435 | 1587 | 1084 | 1795 | 312 | 49 | 0.279 |
Jose Bautista | 15 | 1798 | 6051 | 1022 | 1496 | 312 | 17 | 344 | 975 | 1032 | 1394 | 70 | 32 | 0.247 |
Adrian Beltre | 21 | 2933 | 11068 | 1524 | 3166 | 636 | 38 | 477 | 1707 | 848 | 1732 | 121 | 42 | 0.286 |
Adrian Gonzalez | 15 | 1929 | 7139 | 997 | 2050 | 437 | 12 | 317 | 1202 | 782 | 1401 | 6 | 7 | 0.287 |
Matt Holliday | 15 | 1903 | 7009 | 1157 | 2096 | 468 | 32 | 316 | 1220 | 802 | 1362 | 108 | 37 | 0.299 |
Victor Martinez | 16 | 1973 | 7297 | 914 | 2153 | 423 | 3 | 246 | 1178 | 730 | 891 | 7 | 7 | 0.295 |
Joe Mauer | 15 | 1858 | 6930 | 1018 | 2123 | 428 | 30 | 143 | 923 | 939 | 1034 | 52 | 19 | 0.306 |
Jose Reyes | 16 | 1877 | 7552 | 1180 | 2138 | 387 | 131 | 145 | 719 | 589 | 914 | 517 | 127 | 0.283 |
Chase Utley | 16 | 1937 | 6857 | 1103 | 1885 | 411 | 58 | 259 | 1025 | 724 | 1193 | 154 | 22 | 0.275 |
David Wright | 14 | 1585 | 5998 | 949 | 1777 | 390 | 26 | 242 | 970 | 762 | 1292 | 196 | 65 | 0.296 |
Melky Cabrera | 15 | 1887 | 6878 | 895 | 1962 | 383 | 45 | 144 | 854 | 510 | 891 | 101 | 37 | 0.285 |
Curtis Granderson | 16 | 2057 | 7236 | 1217 | 1800 | 346 | 95 | 344 | 937 | 924 | 1916 | 153 | 50 | 0.249 |
Adam Jones | 14 | 1823 | 7009 | 963 | 1939 | 336 | 29 | 282 | 945 | 340 | 1395 | 97 | 35 | 0.277 |
Ian Kinsler | 14 | 1888 | 7423 | 1243 | 1999 | 416 | 41 | 257 | 909 | 693 | 1046 | 243 | 74 | 0.269 |
Dustin Pedroia | 14 | 1512 | 6031 | 922 | 1805 | 394 | 15 | 140 | 725 | 624 | 654 | 138 | 46 | 0.299 |
Hanley Ramirez | 15 | 1668 | 6349 | 1049 | 1834 | 375 | 32 | 271 | 917 | 660 | 1234 | 281 | 93 | 0.289 |
Ichiro Suzuki | 19 | 2653 | 9934 | 1420 | 3089 | 362 | 96 | 117 | 780 | 647 | 1080 | 509 | 117 | 0.311 |
Of these 25, the model predicts that David Ortiz, Alex Rodriguez, Carlos Beltran, Adrian Beltre, and Ichiro Suzuki will be admitted. Based on my intuition that made sense. There were some other really good players on that list, but those 5 definitely stand out.

I must admit, this project was more about me getting comfortable using Python and studying machine learning algorithms than it was about creating a perfectly realistic model. For example, the model only takes into account traditional counting statistics like hits and home runs, it does not include any advanced metrics. Additionally, it is offense only, and doesn’t factor in fielding performance. Based on hitting statistics alone, these 5 players deserve to get in, but it will be interesting to see. Alex Rodriguez is an admitted PED user, and Beltran was involved in the Astros sign stealing scandal. I personally believe they both belong in the Hall of Fame, which is a museum of baseball history, not some sort of moral high ground. It may not be realistic to determine who belongs in the Hall of Fame based on a single model, but perhaps these baseball writers have too much power in a subjective process, and more universal standards should be implemented. Whether you agree or disagree, I welcome your thoughts, and I am very much looking forward to this baseball season!