The NBA Last 2 Minute (L2M) reports began on 2015-03-01, grading all calls and relevant no-calls for games that were within 5 points at the 2:00 mark of the 4th quarter. The NBA’s League Operations senior management team graded that day’s games, which involved Cleveland at Houston, Golden State at Boston, and New Orleans at Denver. The L2M reports began partly out of a desire for transparency about referee judgment, in order to quell any public distrust of NBA officials that may have resulted from the betting scandal involving former official Tim Donaghy. The NBA’s accompanying FAQ gives the league’s official reason for making the L2M reports public:
“L2Ms are part of the NBA’s ongoing effort to build a greater awareness and understanding of the rules and processes that govern our game. Additionally, they serve as a mechanism of accountability to our fans and the media who fairly seek clarifications after our games.”
As of 2017-02-15, the NBA has reviewed a total of 865 games, covering both the regular season and the playoffs. The NBA Officials post each graded L2M report to their website the day after the game and keep an archive of all graded games. The reports have a nice structure to them: every notable play is tagged with the time it occurred, the actual (or potential) call made at that moment, the committing player, a possibly disadvantaged player, a judgement on the call, and comments on the particular play. The judgement can be a correct call (CC), a correct no-call (CNC), an incorrect call (IC), or an incorrect no-call (INC). The NBA senior management maintains that there is a fifth category, where they do not assign any of CC, CNC, IC, or INC but instead leave the decision blank. They describe these as “plays that are only observable with the help of a stop-watch, zoom or other technical support, are noted in brackets along with the explanatory comments but are not deemed to be incorrectly officiated.” This is a troublesome description for many reasons, but these plays effectively describe an INC, and I will treat them as such.
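The blank-grade convention above can be made explicit when cleaning the data. Here is a minimal sketch, assuming the grade arrives as a raw string (possibly empty or missing); the function name and the input format are hypothetical, not taken from the NBA’s reports or from my data file.

```python
# Hypothetical helper: the raw string format is an assumption for illustration.
VALID_GRADES = {"CC", "CNC", "IC", "INC"}

def normalize_grade(raw):
    """Map a raw L2M grade to one of CC, CNC, IC, INC.

    Blank or missing grades (the bracketed "technical support" plays)
    are treated as INC, following the convention adopted in this post.
    """
    grade = (raw or "").strip().upper()
    if grade == "":
        return "INC"
    if grade not in VALID_GRADES:
        raise ValueError(f"unexpected grade: {raw!r}")
    return grade
```

Raising on unknown values rather than silently passing them through makes it easier to notice if the report format ever changes.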
NBA Referee Accuracy
Across the L2M dataset, there are 13,494 plays which have been graded on the CC, CNC, IC, and INC scale. Defining the accuracy of referees is a bit problematic due to the nature of a no-call and how frequently the NBA decides to make a judgement on a no-call. It’s obvious that an INC should be graded, but what threshold should the NBA use in judging a CNC? The L2M has identified 46 types of calls; should each play have a judgement on every type of call? That seems a bit ridiculous, but it’s hard to identify what should trigger a judgement on a no-call. Tabling this no-call issue for the moment, we can observe how frequently plays receive each grade:
Effectively, the L2M grades tell us both the truth (correct or incorrect) and what a statistical test (the referees) determined (call or no-call). A referee can either make a call or a no-call on every play demarcated in the L2M reports, where the default option (null hypothesis) appears to be that there should not be a call (thus making a call the alternative hypothesis). The referees are either correct or incorrect in their calls or no-calls. A CC or a CNC is not too interesting in and of itself, mainly because it is what should happen. But in what way can the referees be wrong, and to what degree is this a problem?
Referees are wrong when they make a call that they should not have (IC) or fail to make a call that they should have (INC). In statistics terminology, these are type 1 and type 2 errors: a false positive (IC) and a false negative (INC). A nice mnemonic for recalling the difference between a type 1 and a type 2 error is the story of the boy who cried wolf. At first, the boy claimed there was a wolf that didn’t exist and people believed him (type 1 error). But later, no one believed him when he cried wolf even though it was there (type 2 error).
So how accurate are the referees? Well, it depends on what role one believes a referee should play within a game. If one believes that the referees’ responsibility is to blow the whistle only when an infraction has actually occurred, this relates to type 1 errors and is generally termed statistical significance. In a scientific setting, one generally does not know the true underlying probabilities of being correct and has to make assumptions, which leads to debates and is a large reason why performing a rigorous scientific analysis is difficult. But with the L2M reports, we only need to assume that the NBA’s assessment is the truth. Under this setting, the NBA’s referees are incorrect on 2.4% of the plays where an infraction did not occur. A typical level of significance in the social sciences is around 5-10% (the lower the better) to be assured that an effect exists (i.e., an infraction did occur), and the referees clear this level. From this perspective, the referees do a good job of ensuring that an infraction has occurred when they blow their whistle in the last 2 minutes of an NBA contest. This rate is clearly tied to how frequently the NBA deems plays a CNC: as more (fewer) plays are graded as CNC, the type 1 error rate will decrease (increase).
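To make the CNC dependence concrete, here is a small sketch of the type 1 error rate as IC / (IC + CNC). The counts are made up purely for illustration and are not the L2M totals; the function name is my own.

```python
def type1_error_rate(ic, cnc):
    """Share of plays with no infraction on which a whistle was blown."""
    return ic / (ic + cnc)

# Made-up counts for illustration only; these are NOT the L2M totals.
ic = 30
baseline = type1_error_rate(ic, cnc=970)   # 30 / 1000
padded = type1_error_rate(ic, cnc=1470)    # same 30 ICs, 500 extra CNCs graded

print(f"{baseline:.3f} -> {padded:.3f}")
```

Holding the number of incorrect calls fixed, grading more correct no-calls only grows the denominator, so the rate falls without any change in how the referees actually whistle.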
But looking at the type 1 error rate may not be the correct way to evaluate referees if one feels that the referees’ job is to make sure that the game is clean and that they detect all fouls. This relates to type 2 errors and to the statistical property termed the power of a statistical test. It addresses a different question: how often do the referees miss a call? In this setting, the referees detect 71.3% of the infractions that occur (the higher the better), which corresponds to missing 28.7% of the infractions. The convention in the social sciences is to target a power above 0.8, although this really depends on the research goal, and there’s an interplay between the type 1 and type 2 error rates. Lowering the type 1 error rate will increase the type 2 error rate, and vice versa. Or in NBA terms, if the NBA wants to call more fouls in order to clean up the game, it does so at the cost of having more incorrect calls. This ends the metaphor of L2M reports as statistical tests, but there are other ways we can evaluate this new dataset.
Another way is to consider how often referees get their calls correct. In other words, given that the referees have called an infraction, what’s the probability that they are correct? With this interpretation, the referees’ calls are correct 96.5% of the time.
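The two measures just described differ only in their denominators. A short sketch, again with made-up counts rather than the L2M totals and with function names of my own choosing:

```python
def power(cc, inc):
    """Share of actual infractions that were whistled: CC / (CC + INC)."""
    return cc / (cc + inc)

def call_accuracy(cc, ic):
    """Share of whistles that were correct: CC / (CC + IC)."""
    return cc / (cc + ic)

# Made-up counts for illustration only; these are NOT the L2M totals.
cc, ic, inc = 800, 50, 200

detected = power(cc, inc)          # infractions caught
missed = 1 - detected              # type 2 error rate
correct_when_called = call_accuracy(cc, ic)
```

Note that CNC does not enter either formula, so neither measure moves when more correct no-calls are added to the reports.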
There’s no benchmark or baseline against which to judge these L2M reports, so are the values high? Low? Well, we don’t know. We need more data to answer this particular question. But we can take a look at other aspects of these L2M reports.
Each play graded in the L2M has a corresponding call and type of infraction in question. There are 9 call types within the reports, although the overwhelming majority are fouls:
In terms of what types of infractions can occur, there are 46 types of calls which are in the L2M reports. While the distribution is not nearly as skewed as the calls, there’s still a skew for the types and displaying all of these would be a bit of overkill. It turns out the top 6 are good enough for our purposes:
Top 6 Infractions
The overwhelming majority of calls are personal, shooting, offensive, and loose-ball fouls. For those who watch NBA games, this makes sense, as that is where all the action is within a game. There are also two polarizing types that show up: traveling and support rulings (which stem from when referees go to the replay center for verification of a call). The proportion table does not show the accuracy of these particular calls, which might be useful in evaluating referees. So let’s check this out:
Top 6 Infractions Accuracy
There do not appear to be substantial differences in the incorrect calls across these call types, with the exception of traveling and support rulings. There are almost no rulings that the NBA evaluated as incorrect among the support rulings, which should be expected. Support rulings involve technology in the form of video replays from many different angles, which can be slowed down and zoomed in, as well as extra time to evaluate. There should be no incorrect calls or no-calls when this is activated, and it appears that’s just about the case. However, the rate at which the referees fail to call a travel is astonishing, although not completely surprising. There is a reason why so many “traveling truthers” exist: the referees fail to call travels more than any other event in an NBA game. And it’s not even close.
There is certainly something to be said about the purpose of the rules and the objective of the NBA. Basketball is an entertainment industry; the NBA sells the sport to networks, advertisers, corporations, and fans. As such, the league tries to make the game as attractive to its consumers as possible. At the same time, there needs to be a structure to the game to ensure that the participating teams are well aware of what is allowable in play. When a travel occurs yet is not called, it is typically on a play that is extremely entertaining, which spurs interest across consumers…although there are exceptions to this.
I’m a rules purist, but I’m also not someone who feels that there is some sort of sanctity to the current set of rules. If the NBA wants to put a rule in place, I’m fine with it. However, I am not fine with a rule being selectively enforced, because there’s a reason it is in place. If the NBA wants to selectively enforce a rule, it should instead look into changing the rule so that the selective enforcement is built into the rule. This would align both interests: ensuring an entertaining product while also letting teams know what is legal within the field of play.
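For readers who want to reproduce this kind of per-type breakdown, here is a minimal sketch. It assumes each play has already been reduced to an (infraction type, grade) pair; that layout, the function name, and the toy rows are all hypothetical and are not drawn from the actual dataset.

```python
from collections import Counter, defaultdict

def grade_shares_by_type(plays):
    """plays: iterable of (infraction_type, grade) pairs.

    Returns {infraction_type: {grade: share of that type's plays}}.
    """
    counts = defaultdict(Counter)
    for infraction_type, grade in plays:
        counts[infraction_type][grade] += 1
    shares = {}
    for infraction_type, grade_counts in counts.items():
        total = sum(grade_counts.values())
        shares[infraction_type] = {g: n / total for g, n in grade_counts.items()}
    return shares

# Toy rows for illustration only; NOT real L2M data.
toy_plays = [
    ("Traveling", "INC"), ("Traveling", "INC"),
    ("Traveling", "CNC"), ("Traveling", "CC"),
    ("Shooting", "CC"), ("Shooting", "CC"),
    ("Shooting", "CNC"), ("Shooting", "INC"),
]
shares = grade_shares_by_type(toy_plays)
```

Comparing the INC share across types is then a matter of reading `shares[type].get("INC", 0.0)` for each type of interest.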
There’s probably more insight to be gained from call types, but we’ll leave that for a future date. Let’s jump into a few other aspects of the L2M data.
One interesting aspect of the L2M reports is their reporting consistency over time. So far there’s only one full season’s worth of data and two partial seasons for us to evaluate, which is an OK amount. But one thing I recently noticed is how overtime calls are dealt with. Currently, the L2M FAQ states its criteria for grading a game as follows:
What is the criteria for a game to have an L2M?
An L2M is done for any game in which one team’s lead over the other is five points or fewer at the 2:00 minute mark of the relevant period.
This would be the last 2:00 minutes of the fourth quarter and only the last 2:00 minutes of each subsequent overtime period. The problem is that this wasn’t the original statement, intent, or criteria of the L2M reports. We can tell this just by plotting the game time remaining when each call was made against the date on which the game occurred:
It is clear that the NBA changed their L2M criteria for the 2016-17 season and possibly even in the 2016 playoffs. This is important and notable, especially because I cannot find any reference from the NBA saying that they decided to change their L2M criteria. It’s not from lack of trying, but it may be from a lack of knowing where to look. If anyone knows of a public announcement from the NBA that their L2M criteria changed, please notify me.
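The same check can be done without a plot by flagging overtime plays that fall outside the final 2:00 of their period. This is a rough sketch under assumptions of my own: periods are numbered with 5 and above meaning overtime, time is stored as seconds remaining in the period, and the function names are hypothetical. The toy rows are invented and their dates do not represent when the change actually happened.

```python
from datetime import date

def outside_final_two_minutes(period, seconds_remaining):
    """True for an overtime play logged with more than 2:00 left in its period."""
    return period > 4 and seconds_remaining > 120

def last_date_with_early_overtime_play(plays):
    """plays: iterable of (game_date, period, seconds_remaining).

    Returns the latest game date with a graded overtime play outside the
    final 2:00, or None if there is no such play.
    """
    dates = [d for d, period, secs in plays
             if outside_final_two_minutes(period, secs)]
    return max(dates) if dates else None

# Toy rows for illustration only; NOT real L2M data.
toy_plays = [
    (date(2016, 1, 10), 5, 250),   # overtime, 4:10 left: flagged
    (date(2016, 2, 1), 4, 110),    # 4th quarter: never flagged
    (date(2016, 3, 2), 5, 90),     # overtime, inside the final 2:00
    (date(2016, 11, 20), 5, 100),  # overtime, inside the final 2:00
]
cutoff = last_date_with_early_overtime_play(toy_plays)
```

On the real data, the latest flagged date gives a bound on when the old grading criteria were last applied.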
But aside from this structural change in reporting, let’s dive into other structural changes of the L2M evaluations across time. I’ve already mentioned that the CNC category is a dubious one because a no-call is difficult to define. But let’s first take a look at how the number of calls per quarter has changed over time in the NBA seasons that have had L2M reports (the vertical dashed line marks the beginning of the playoffs):
I made no corrections for the overtime issue stated above, so the 2017 number of calls is a lower bound on what the truth should be, but there are some interesting patterns here. For one, we can see a clear upward trend in 2016, with accelerated increases at the beginning of the season and after the All-Star break. We can also see that something is going on with the 2017 season: a marked decrease in the number of graded calls sometime in November, but a noticeable uptick in calls around early January. At the same time, the 2017 call rate has been consistently above the 2016 call rate, which was also clearly above the 2015 call rate. This is in spite of a reduced number of calls available to grade due to the overtime grading change. So clearly, the L2M reports are noting more calls over time. But what kinds of calls? Well, the most interesting one would be the CNC because of the previously mentioned issue of how the type 1 error rate can be reduced by simply adding more correct no-calls. So let’s see the percentage of graded plays that the CNC decision has accounted for over time:
There’s a clear upward trend of correct no-calls in the reports, although it’s not a continual increase except for most of the 2015 season. The 2016 season has an interesting pattern where there’s an immediate increase in the percentage of CNC until around December. The 2016 CNC rate is stagnant from December until right around playoff time, when it shoots up. And the 2017 season? Well, it appears that the CNC rate matches the calls-per-quarter trend. Take a second on that. The CNC rate matches the overall trend of raw calls per quarter. Or in other words, the CNC calls are skyrocketing. This can be seen by looking at the percentage of the graded plays which are calls versus no-calls, which tracks the CNC trend fairly well:
If it is the case that all no-calls are increasing, then we should see that the INC call rate is also increasing. Is that the case? Let’s take a look:
There doesn’t appear to be any trend in the INC rate unless one thinks it has a cyclical component, which is tough to defend with not even two full seasons of observations yet. So we’re left to essentially conclude that, over time, the L2M has reported more and more CNCs. Why are they doing this? Well, it’s not entirely clear, because it can be explained in so many different ways. The NBA might have been too conservative in reporting when first starting off the L2Ms. The actual style of play may have changed over this time, with NBA players now forcing more close calls. It could also be the NBA’s way of reducing the type 1 error rate of the officials due to pressure from the NBA Referees Union.
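The CNC and INC comparison above comes down to tracking each grade’s share of graded plays over time. Here is a minimal sketch that buckets by calendar month; the ISO date strings, the function name, and the toy rows are assumptions for illustration and are not the actual data.

```python
from collections import Counter

def grade_share_by_month(plays, grade):
    """plays: iterable of (iso_date, grade), with iso_date like '2016-03-09'.

    Returns {'YYYY-MM': share of that month's graded plays with `grade`}.
    """
    totals = Counter()
    hits = Counter()
    for iso_date, g in plays:
        month = iso_date[:7]
        totals[month] += 1
        if g == grade:
            hits[month] += 1
    return {m: hits[m] / totals[m] for m in sorted(totals)}

# Toy rows for illustration only; NOT real L2M data.
toy_plays = [
    ("2015-03-01", "CNC"), ("2015-03-02", "CC"),
    ("2016-03-01", "CNC"), ("2016-03-05", "CNC"),
    ("2016-03-09", "CNC"), ("2016-03-10", "INC"),
]
cnc_share = grade_share_by_month(toy_plays, "CNC")
inc_share = grade_share_by_month(toy_plays, "INC")
```

If all no-calls were rising together, both series would trend upward; a rising CNC share alongside a flat INC share is the pattern described above.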
There’s really no way to discern between these theories at the moment. This does highlight that it is undoubtedly important to keep tracking L2M data to try and answer these questions I’ve left scattered throughout here.
This is also a call for releasing the entire referee call reports. The only publicly available data cover a small subset of each game. Is there a significant break in how the plays are graded right at the 2:00 mark of the 4th quarter? Have the non-public data exhibited the same upward trend in the number of graded no-calls over time? There are likely other questions, but answering these can give us insight into the established trends we’ve seen in the L2M reports across time.
Addendum: This particular post was planned earlier in the week. On Thursday, February 16th, Russell Goldenberg released his own version of the L2M dataset on The Pudding along with some neat visualizations. I’m not using his dataset but rather my own version, which I believe is more comprehensive because he appears to be missing 29 games that I have data on. At the same time, his data has additional information, like which referees worked each game. It appears we both worked on L2M data at the same time, each unaware of the other. Oh well, just a little bit of wasted manpower ;)
Here’s a current version of my data for those interested: L2M_2-15-2017.txt