One of my priorities this semester was to become more proficient in R. I have been practicing through a data analytics course, as well as doing projects on my own time. In class last week we learned about model selection. That is, when we have many input variables, how do we know which ones to include in our equation to estimate output? In this project, I used best subset selection to determine which variables to include in my linear model. In addition to using what I’ve learned in class, this book has been very helpful for me in learning R.
In this analysis, I am looking to determine which variables have the greatest impact on a college basketball team’s record. I used this dataset from Kaggle, which includes data from the 2015-2019 college basketball seasons. I chose to exclude 2020 because the season was shortened. Additionally, I included only teams from the ACC, B10, B12, SEC, P12 and Big East conferences. The original data comes from this website. It includes advanced metrics such as adjusted offensive efficiency, adjusted defensive efficiency, turnover percentage allowed, and offensive rebound rate. If you’re curious as to what those mean, the Kaggle site has a description for each variable.
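Once I loaded the data into R, I filtered it to include only the six conferences I wanted and saved the result to a data frame called cbb1. I then used the regsubsets function from the leaps package. This function identifies the best subset of predictors for each model size. For example, if I have 15 candidate variables and want only 8 in my model, it will tell me which 8 fit best according to R^2 (equivalently, which 8 give the lowest residual sum of squares).
I then used the which.max function to tell me which number of predictors has the highest adjusted R-squared value. Plain R^2 never decreases when a variable is added, so it cannot be used to compare models of different sizes; adjusted R-squared penalizes extra predictors, which makes that comparison fair. This lets me know how many predictors make the best model, and which predictors to use.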
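A minimal sketch of the filtering and subset-selection step. The original code is not shown here, so this runs on simulated data rather than the Kaggle file; the column names (CONF, W, ADJOE, ADJDE, EFG_O, TOR, ADJ_T), the conference codes, and the coefficients used to simulate wins are all assumptions made for illustration.

```r
# install.packages("leaps")  # run once if the package is not installed
library(leaps)

set.seed(42)
n <- 300

# Simulated stand-in for the Kaggle data (column names are assumptions)
cbb <- data.frame(
  CONF  = sample(c("ACC", "B10", "B12", "SEC", "P12", "BE", "A10", "MWC"),
                 n, replace = TRUE),
  ADJOE = rnorm(n, 105, 6),  # adjusted offensive efficiency
  ADJDE = rnorm(n, 100, 6),  # adjusted defensive efficiency
  EFG_O = rnorm(n, 50, 3),   # effective field goal %
  TOR   = rnorm(n, 18, 2),   # turnover %
  ADJ_T = rnorm(n, 68, 3)    # adjusted tempo
)
# Made-up relationship: wins depend on offense and defense plus noise
cbb$W <- round(17 + 0.4 * (cbb$ADJOE - 105) - 0.4 * (cbb$ADJDE - 100) +
                 rnorm(n, sd = 2))

# Keep only the six conferences of interest
power6 <- c("ACC", "B10", "B12", "SEC", "P12", "BE")
cbb1 <- cbb[cbb$CONF %in% power6, ]

# nvmax caps the largest subset size considered; the default is 8,
# so it must be raised when there are more candidate predictors than that
best <- regsubsets(W ~ ADJOE + ADJDE + EFG_O + TOR + ADJ_T,
                   data = cbb1, nvmax = 5)

# One row per subset size; TRUE marks the predictors in that best subset
summary(best)$which
```

With the real data, the formula would list the actual candidate columns, and nvmax would need to be at least as large as the biggest model you want to consider.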
I now know that the model with 11 variables would be the best. I can then run the coef function to see which 11 variables are used and what their coefficients are.
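The two steps above (picking the model size with which.max, then reading off the coefficients with coef) might look roughly like this. It is a sketch on simulated data with assumed column names, so the chosen size here will not be 11.

```r
library(leaps)

set.seed(42)
n <- 300

# Simulated stand-in for cbb1 (column names and effects are assumptions)
cbb1 <- data.frame(
  ADJOE = rnorm(n, 105, 6),
  ADJDE = rnorm(n, 100, 6),
  EFG_O = rnorm(n, 50, 3),
  TOR   = rnorm(n, 18, 2),
  ADJ_T = rnorm(n, 68, 3)
)
cbb1$W <- round(17 + 0.4 * (cbb1$ADJOE - 105) - 0.4 * (cbb1$ADJDE - 100) +
                  rnorm(n, sd = 2))

best <- regsubsets(W ~ ., data = cbb1, nvmax = 5)

# adjr2 holds one adjusted R-squared value per subset size (1, 2, ...)
adj_r2 <- summary(best)$adjr2

# The position of the maximum is the number of predictors in the best model
best_size <- which.max(adj_r2)
best_size

# Intercept plus the predictors chosen for that size, with their coefficients
coef(best, best_size)
```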
I then used the lm function to run a linear regression of wins on these 11 variables.
All of the variables are statistically significant at the 0.05 level or better, with 9 of the 11 significant at the 0.001 level. This model has an adjusted R-squared value of 0.8637, meaning roughly 86% of the variation in wins can be explained by these 11 predictors. The model is:
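A sketch of the regression step, again on simulated data with assumed column names and only three predictors instead of the 11 selected above.

```r
set.seed(42)
n <- 300

# Simulated stand-in for cbb1 (column names and effects are assumptions)
cbb1 <- data.frame(
  ADJOE = rnorm(n, 105, 6),
  ADJDE = rnorm(n, 100, 6),
  EFG_O = rnorm(n, 50, 3)
)
cbb1$W <- round(17 + 0.4 * (cbb1$ADJOE - 105) - 0.4 * (cbb1$ADJDE - 100) +
                  rnorm(n, sd = 2))

# Fit wins as a linear function of the chosen predictors
fit <- lm(W ~ ADJOE + ADJDE + EFG_O, data = cbb1)

# Coefficient table with standard errors, p-values and significance stars
summary(fit)

# Adjusted R-squared on its own
summary(fit)$adj.r.squared
```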
Estimated Wins = 5.84 + 0.31(Adjusted Offensive Efficiency) - 0.22(Adjusted Defensive Efficiency) + 0.60(Effective Field Goal %) - 0.70(Opp Effective Field Goal %) - 0.63(Turnover %) + 0.64(Opp Turnover %) + 0.22(Offensive Rebound Rate) - 0.28(Opp Offensive Rebound Rate) + 0.15(Free Throw Rate) - 0.15(Opp Free Throw Rate) + 0.09(Adjusted Tempo)
The largest coefficient in absolute value belongs to opponent effective field goal percentage (-0.70). eFG% takes into account that three pointers are worth more than two point shots by counting each made three as 1.5 made field goals, so it measures how efficiently a team shoots. The opponent version measures how well a team defends shots.
I then looked at the residuals to see which teams had significantly more (or fewer) wins in reality compared to what the model predicted. I got this line of code from the Analyzing Baseball Data with R book, and altered it to fit my data.
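For reference, effective field goal percentage is usually computed as (FGM + 0.5 × 3PM) / FGA. A small sketch with made-up shooting numbers; the function and argument names are only illustrative.

```r
# eFG% = (field goals made + 0.5 * three-pointers made) / field goal attempts
efg <- function(fgm, fg3m, fga) {
  100 * (fgm + 0.5 * fg3m) / fga
}

# Two hypothetical teams that both make 25 of 50 shots
efg(fgm = 25, fg3m = 0,  fga = 50)  # 50: every make was a two
efg(fgm = 25, fg3m = 10, fga = 50)  # 60: ten of the makes were threes
```

Both teams shoot 50% from the field, but the second one scores more points per shot, and eFG% picks up that difference.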
The root mean square error (RMSE), which is essentially the standard deviation of the residuals, is 2.28. If the residuals are roughly normally distributed, we would expect about 68% of the estimated win totals to be within 2.28 wins of the actual total, and about 95% within 4.56. The code produced a plot of all of the residuals, and I highlighted the 20 teams with the largest residuals.
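A sketch of the residual analysis on simulated data. This is not the book's code: it uses base R plotting as a stand-in, and the column names (TEAM, W, ADJOE, ADJDE) are assumptions.

```r
set.seed(42)
n <- 300

# Simulated stand-in for cbb1 (column names and effects are assumptions)
cbb1 <- data.frame(
  TEAM  = paste0("Team", seq_len(n)),
  ADJOE = rnorm(n, 105, 6),
  ADJDE = rnorm(n, 100, 6)
)
cbb1$W <- round(17 + 0.4 * (cbb1$ADJOE - 105) - 0.4 * (cbb1$ADJDE - 100) +
                  rnorm(n, sd = 2))

fit <- lm(W ~ ADJOE + ADJDE, data = cbb1)
cbb1$pred <- fitted(fit)  # predicted wins
cbb1$res  <- resid(fit)   # actual wins minus predicted wins

# RMSE: square root of the mean squared residual
rmse <- sqrt(mean(cbb1$res^2))

# Share of teams within one and two RMSEs of their prediction
within1 <- mean(abs(cbb1$res) <= rmse)
within2 <- mean(abs(cbb1$res) <= 2 * rmse)

# Row indices of the 20 largest residuals in absolute value
top <- order(abs(cbb1$res), decreasing = TRUE)[1:20]

# Residual plot with the 20 biggest misses highlighted and labelled
plot(cbb1$pred, cbb1$res, xlab = "Predicted wins", ylab = "Residual",
     col = "grey50")
abline(h = 0, lty = 2)
points(cbb1$pred[top], cbb1$res[top], pch = 19, col = "red")
text(cbb1$pred[top], cbb1$res[top], labels = cbb1$TEAM[top],
     pos = 3, cex = 0.6)
```

Printing within1 and within2 gives a quick check on whether the 68% and 95% rule of thumb is reasonable for a given model.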
The largest residual was for the 2018-19 Washington team. They won 27 games (and the Pac-12 regular season title), while the model predicted they would win only 21. Their roster included Pac-12 Defensive Player of the Year and now 76er Matisse Thybulle. They were 8th in the country in steals and 2nd in blocks. I looked a little further to try to understand why they performed so much better than expected. They were 9-3 in close games, which could help explain part of it, since most teams go about .500 in close games. Looking at the barttorvik site, Washington was 3rd in the country in their “F.U.N.” statistic, a metric which the site says is similar to luck.
The largest negative residual was the 2015-16 Vanderbilt team. They were 2-6 in close games, and last in the country in that F.U.N. statistic, so perhaps the unluckiest team.
The model is by no means perfect, as other factors could influence win-loss records, such as strength of schedule or even just plain old luck. My goal for this project was to become more comfortable using R. If anyone is a fan of one of these teams with a large residual, I’m curious to hear why you think they over- or underperformed so much compared to the model prediction.