
Are The Hands Hot?

Examining the hot hand theory: when NBA players go on a scoring streak, they feel like they're "on fire". Are they?

A final project for CS109a at Harvard University.

He shoots— he scores. He shoots again— and score again he does. Three times now; four. Will he make the fifth? The crowd is going wild. "He's on fire," they say. "Nothing can stop him now."

Three weeks ago, Klay Thompson of the Golden State Warriors scored 60 points in 29 minutes. The concept of the hot hand is simple and intuitive: a basketball player who is doing well will continue to do well, whether because they are in the right mood or because something is actually affecting their ability.

But is there a mathematical basis for the "hot hand"?

To examine this phenomenon, we sourced data on every single shot taken in an NBA season, 2014-2015. Previous studies worked with a single basketball team or a single college team. This time, we're examining the data for the entire league.


Data cleaning

We found that this dataset had no columns for a player’s team name, the opposing team for the game, or the date of the game. Instead, it included a column called “MATCHUP” that combined all of this information, for example “MAR 04, 2015 - CHA @ BKN”. We therefore extracted the relevant data from this column to create three columns: “MATCHUP”, which now held only the date of the match; “TEAM”, which held the player’s team; and “OPPONENT”, which held the opposing team during that match.
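The extraction described above could be sketched as follows. This is a minimal illustration, not the notebook's actual code: the function name `parse_matchup` is hypothetical, and it assumes that the first team listed is the shooter's team and that the date always follows the "MAR 04, 2015" pattern shown in the example.

```python
from datetime import date, datetime

def parse_matchup(matchup):
    """Split a MATCHUP string into (game date, player's team, opponent)."""
    date_part, teams_part = matchup.split(" - ")
    # "MAR 04, 2015" -> "Mar 04, 2015" so that %b matches the month abbreviation
    game_date = datetime.strptime(date_part.strip().title(), "%b %d, %Y").date()
    # Assumption: the first team listed is the shooter's team and the last
    # token is the opponent, with a separator such as "@" in between.
    tokens = teams_part.split()
    team, opponent = tokens[0], tokens[-1]
    return game_date, team, opponent

game_date, team, opponent = parse_matchup("MAR 04, 2015 - CHA @ BKN")
```

Returning a real `date` object rather than a string makes later sorting and grouping by game straightforward.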

In addition, “SHOT_RESULT” took values of “missed” or “made.” This was encoded into binary values, with 1 representing a shot that was made and 0, missed. The same was done for the “LOCATION” column. “A,” or an away game, was encoded to 0. “H,” or a home game, was encoded to 1. The “W” column was similarly converted, where “W,” or a win, was encoded to 1. “L,” or a loss, was encoded to 0.
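A small sketch of this encoding, assuming each shot is held as a dictionary keyed by column name (the helper `encode_row` is hypothetical; the column names and codes are the ones listed above):

```python
# Mappings taken from the description above: 1 for made/home/win, 0 otherwise.
SHOT_RESULT_CODES = {"made": 1, "missed": 0}
LOCATION_CODES = {"H": 1, "A": 0}
WIN_CODES = {"W": 1, "L": 0}

def encode_row(row):
    """Return a copy of the row with the three categorical columns as 0/1."""
    encoded = dict(row)
    encoded["SHOT_RESULT"] = SHOT_RESULT_CODES[row["SHOT_RESULT"]]
    encoded["LOCATION"] = LOCATION_CODES[row["LOCATION"]]
    encoded["W"] = WIN_CODES[row["W"]]
    return encoded

example = encode_row({"SHOT_RESULT": "made", "LOCATION": "A", "W": "L"})
```

Using explicit dictionaries means an unexpected value raises a `KeyError` instead of being silently encoded.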

We then checked the data for null values and found that one column, SHOT_CLOCK, had 5567 of them. We imputed these values using linear regression, which was reasonable because the variable is continuous. The imputed values were reasonable as well, with none above 24 seconds, the maximum.
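The idea of regression-based imputation can be sketched as below. The write-up does not say which predictors were used, so this sketch assumes a single numeric predictor; the column name `TOUCH_TIME` in the example and both function names are hypothetical stand-ins, and the numbers are made up.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def impute_shot_clock(rows, predictor):
    """Fill missing SHOT_CLOCK values from a line fitted on the complete rows."""
    known = [r for r in rows if r["SHOT_CLOCK"] is not None]
    slope, intercept = fit_line([r[predictor] for r in known],
                                [r["SHOT_CLOCK"] for r in known])
    filled = []
    for r in rows:
        r = dict(r)  # copy so the input rows are left untouched
        if r["SHOT_CLOCK"] is None:
            r["SHOT_CLOCK"] = slope * r[predictor] + intercept
        filled.append(r)
    return filled

# Made-up rows for illustration only.
rows = [
    {"TOUCH_TIME": 1.0, "SHOT_CLOCK": 20.0},
    {"TOUCH_TIME": 2.0, "SHOT_CLOCK": 18.0},
    {"TOUCH_TIME": 3.0, "SHOT_CLOCK": 16.0},
    {"TOUCH_TIME": 4.0, "SHOT_CLOCK": None},
]
filled = impute_shot_clock(rows, "TOUCH_TIME")
```

After imputing, a check such as `max(r["SHOT_CLOCK"] for r in filled) <= 24` mirrors the sanity check described above.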

View calculations

View the iPython notebook for raw calculations and data processing.

Data exploration

Commentary

Here we notice some interesting behavior, some of which you would expect: there is a jump in shot attempts at the three-point line, where players take many of their shots. That is matched by a drop just inside the three-point line, where a long two-point attempt is not worth it (moving closer to the basket gives an easier shot for the same two points).

Otherwise, NBA players are generally not interested in taking shots from long distance. That kind of cautious shot selection could influence our analysis.

Commentary

In our final data analysis, we drop players who have taken fewer than 200 shots in their careers, mainly because we are not interested in the confounding that inexperienced players may introduce.

We can also see that examining players with more than 200 shots presents us with a well-distributed collection of players, with the number dropping off as the number of career shots increases.
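The 200-shot cutoff could be applied with a sketch like the following, assuming one record per shot and a hypothetical `player_id` key (the helper name is also an illustration, not taken from the notebook):

```python
from collections import Counter

def drop_low_volume_players(shots, min_shots=200, id_key="player_id"):
    """Keep only shots by players with at least `min_shots` recorded shots."""
    counts = Counter(shot[id_key] for shot in shots)
    return [shot for shot in shots if counts[shot[id_key]] >= min_shots]

# Tiny made-up example with a cutoff of 3 instead of 200.
shots = [{"player_id": 1}] * 3 + [{"player_id": 2}] * 2
kept = drop_low_volume_players(shots, min_shots=3)
```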

Commentary

After dropping players with fewer than 200 shots from our dataset, we see that all the remaining players have a good number of games played.

This is important because it means that our players have enough games to include both "good" and "bad" ones. In other words, we prevent a particularly good or bad game (caused by other, perhaps environmental, factors) from skewing our data.

Player exploration

Commentary

For the visualization here, we split up the streak length predictor into two streaks: hit streak and miss streak. We see for the overall set of players that the most common hit streak is 1 (a hit that immediately follows a miss), which corresponds to a 0 in the overall streak length predictor. Streaks can be quite long, sometimes reaching 7+ hits in a row.

Commentary

The distribution for miss streak is almost identical to that of hit streak. We again see that, for the overall set of players, the most common miss streak is 1 (a miss that immediately follows a hit), which corresponds to a 0 in the overall streak length predictor. We also see that streaks can reach 7+ misses in a row.

The model

Feature selection

To conclude that the data support the hot hand theory, we would have to find that adding a streak predictor to the regression significantly increases accuracy in predicting whether a shot is made. To that end, we selected features that were relevant to the shot being made and removed the following variables:

  • Specific game ID should not matter, as the other predictors should cover any influence a specific game ID would have, e.g. opponent and location.
  • Matchup should not matter as day of the year is assumed to not affect gameplay.
  • Closest defender's name should not be considered as we already count the closest defender's ID.
  • FGM, points, and shot result should not be considered as predictors because they are, or directly encode, the response variable.
  • Player name should not be considered as we already count the player's ID.

Finalizing the model

We considered two baseline models against which we could compare our prediction of whether a shot was made:

  • Random chance: this was reasonable because proportions of shots made and shots missed were approximately equal.
  • A model without the "streak length" predictor: comparing a model that considers streak length against this one should indicate whether streak length improved the accuracy of the model.
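To see why near-equal class proportions make random chance a sensible reference point, the two simplest baselines can be computed directly. This is an illustrative sketch with made-up numbers, not the dataset's actual proportions; the function name is hypothetical.

```python
def baseline_accuracies(results):
    """results: list of 0/1 shot outcomes. Returns (coin_flip, majority_class)."""
    made_rate = sum(results) / len(results)
    # A fair coin guess is right half the time whatever the class balance.
    coin_flip = 0.5 * made_rate + 0.5 * (1 - made_rate)
    # Always guessing the more common outcome scores the larger class share.
    majority = max(made_rate, 1 - made_rate)
    return coin_flip, majority

# Made-up split: 45 makes and 55 misses.
coin_flip, majority = baseline_accuracies([1] * 45 + [0] * 55)
```

When the classes are close to balanced, the majority-class baseline is only slightly above 50%, so random chance is a fair point of comparison.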

In our simple logistic regression model, the predictors were these selected features plus the length of the player's streak going into (but not including) that shot, and the response variable was the shot result. Streak length was calculated by iterating through each of a player's shots in a game. It was incremented for each consecutive made shot and decremented for each consecutive missed shot, and it was reset to 0 at the start of a new game, when the player missed a shot following a hit streak, or when the player hit a shot following a miss streak. As such, a positive value in the streak length predictor indicated a hit streak, while a negative value indicated a miss streak.
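One literal reading of this streak rule is sketched below for a single player in a single game. The function name is hypothetical, and the handling of the shot that breaks a streak (resetting to 0 rather than starting the opposite streak at 1 or -1) is our interpretation of the description above, not confirmed code from the notebook.

```python
def streak_lengths(made_flags):
    """Return, for each shot in one player's game, the streak value going into it.

    made_flags: sequence of 1 (made) / 0 (missed) in the order the shots were taken.
    """
    streaks = []
    current = 0  # every game starts with no streak
    for made in made_flags:
        streaks.append(current)  # value as of, but not including, this shot
        if made:
            # extend a hit streak, or reset if this make ends a miss streak
            current = current + 1 if current >= 0 else 0
        else:
            # extend a miss streak, or reset if this miss ends a hit streak
            current = current - 1 if current <= 0 else 0
    return streaks

# Two makes, two misses, then a make.
example = streak_lengths([1, 1, 0, 0, 1])
```

Because the value is recorded before the outcome of the current shot is applied, the predictor never leaks the response into the features.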

We chose a simple logistic regression model in order to (1) replicate the model found in the Gilovich et al. (1985) paper, (2) be able to add custom priors or class weights, and (3) view the coefficients under lasso and ridge regularization to see which predictors were the most important.

After constructing a simple logistic regression model, we adjusted the tuning parameters (type of regularization and value of C) and used 10-fold cross-validation to find the most accurate model (~61%). We chose k = 10 because it provides enough folds for a stable estimate, remains computationally efficient, and let us observe the expected pattern: accuracy peaks as C increases and then declines as the model overfits the data. Once we found the most appropriate parameters, we tested for confounding variables by adding variables one by one.
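The cross-validation procedure itself can be sketched in plain Python. This is a generic k-fold helper under our own assumptions, not the notebook's code: `fit` and `predict` are hypothetical callables standing in for whatever model is being tuned, and the example uses a trivial majority-class model on made-up labels rather than logistic regression.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle indices 0..n_samples-1 and deal them into k near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(fit, predict, X, y, k=10, seed=0):
    """Return the mean held-out accuracy over k folds (requires len(X) >= k)."""
    scores = []
    for held_out in k_fold_indices(len(X), k, seed):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        model = fit([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(model, X[i]) == y[i] for i in held_out)
        scores.append(correct / len(held_out))
    return sum(scores) / k

# Trivial stand-in model: always predict the most common training label.
fit_majority = lambda X, y: max(set(y), key=y.count)
predict_majority = lambda model, x: model

X = [[d] for d in range(25)]   # made-up single feature
y = [1] * 25                   # made-up labels, all "made"
score = cross_validate(fit_majority, predict_majority, X, y, k=10)
```

Tuning C would then amount to calling `cross_validate` once per candidate value and keeping the value with the highest mean accuracy.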


Commentary

Tuning parameters for an L1-regularized model indicated that the best value of C was 0.001, with an accuracy of ~61%. We also observed the pattern expected when tuning C: as C increased, accuracy initially rose as the appropriate predictors were minimized, then fell as the model overfitted to the training data.
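For reference, the role of C can be written out explicitly. The write-up does not name the library used, so this assumes the convention of scikit-learn's LogisticRegression, where C is the inverse of the regularization strength:

```latex
% y_i \in \{-1, +1\} is the shot result, x_i the predictor vector,
% w the coefficients, b the intercept.
\text{L1 (lasso):}\quad \min_{w,\,b}\; \lVert w \rVert_1 + C \sum_{i=1}^{n} \log\!\left(1 + e^{-y_i (x_i^{\top} w + b)}\right)

\text{L2 (ridge):}\quad \min_{w,\,b}\; \tfrac{1}{2}\, w^{\top} w + C \sum_{i=1}^{n} \log\!\left(1 + e^{-y_i (x_i^{\top} w + b)}\right)
```

Under this convention, a small C makes the penalty term dominate and shrinks coefficients toward zero, while a large C lets the model fit the training data more closely.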

Commentary

Tuning parameters for an L2-regularized model indicated that values of C did not significantly affect accuracy beyond C = 10^-6. This was reasonable because the predictors shrunk by ridge regression were so insignificant that they were reduced to nearly 0 regardless of the value of C. The accuracies of these models were slightly lower than that of the best L1-regularized model.

Commentary

Once we found the most appropriate parameters, we tested for confounding variables by adding variables one by one (in the order shown in the table). The last predictor added was streak length. Adding it did not have a significant impact on accuracy, which in fact decreased slightly to 60.8%. Since that accuracy is also lower than that of some models with a smaller subset of predictors, it is likely that the hot hand theory is not supported by this data.

Commentary

We see that each of the relevant predictors selected by lasso regression under the optimal regularization parameter contributes additional predictive accuracy when added to the model, and that shot distance contributes the most.

It makes intuitive sense that shot distance would have a strong bearing on whether a shot is made, since shots from far away are relatively likely to miss while shots from close to the basket are relatively likely to go in.

Future steps

Our calculations showed that the hot hand theory is not supported at the league level. Neither hit streaks nor miss streaks meaningfully improved our predictions.

In future experiments, however, we could go in many different directions. Do certain players or teams have more of a "hot hand" than others? To what extent can other predictors of experience and strength be considered? And can we consider environmental factors? With additional, more comprehensive data, we can perhaps answer some of these questions.

Aaron, Athena, Brandon, and Brandon are sophomores at Harvard College.

Are things not rendering properly? Contact brandonwang@college.harvard.edu.
 