Hey, have you guys heard about the disgraced urology researcher? He got busted for pee-hacking.
Is everyone sitting down? I have some bad news about power rankings.
When I set out on this project, I expected—wanted—to wind up affirming what we all know to be true of power rankings: That they're nothing but a facile review of the league standings with a little extra weighting on the last week or two despite claiming to be forward-looking. That they exist on major sports network and magazine websites merely to check a box because all the other sites do the same. That they maybe provide some value to the casual-enough-to-not-be-aware-of-the-standings fan (can one be a fan of a sport without having some awareness of the standings?), but that we, the enlightened baseball thinkers that we are, could—and should—continue derisively ignoring them. And, well, none of that is totally wrong, but it's wrong by enough.
With the caveat here that the analysis is limited (for now) solely to ESPN's power rankings1, which may or may not be representative of the whole industry, let's take a looksee at some numbers.
The initial goal of this project was merely to reverse engineer the implicit algorithm by which ESPN constructs its weekly power rankings. We'd all have a good laugh at their expense over the absurd simplicity and return to triggering one another with superfluous comments about first basemen2. After scraping and scrubbing ESPN's power ranking data for the 2013, 2014, and 2015 seasons, I ran regressions against a variety of intentionally simple metrics3. Excerpts of some of the regression outputs follow, beginning with the most obvious option: season-to-date winning percentage.
For our purposes, we mostly care about the R or R^2 values4. If this looks like Greek5 to you, just know that R indicates the strength of the relationship between the x-variable(s) (in this case, season-to-date winning percentage) and the y-variable (power ranking), while R^2 measures the extent to which the variation in the y-variable is explained by the x-variable(s). In both cases the maximum value is 1, and the higher the number6, the stronger the relationship.
As you can see above, using just winning percentage we get an R^2 of .73—pretty tasty, but we can do better. Perhaps there's an overweighting of prior week, prior two week, and/or prior month results.
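For anyone who wants to play along at home, here's a minimal sketch of what that first regression looks like in Python. The actual number-crunching was done in Excel, and the DataFrame and column names below are made-up stand-ins for the scraped team-week data, so treat it as illustrative rather than a reproduction of the methodology.

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical team-week rows: ESPN power ranking and season-to-date
# winning percentage. The real data set covers 2013-2015.
df = pd.DataFrame({
    "power_rank": [1, 5, 12, 30, 18, 3, 22, 9],
    "win_pct":    [.680, .590, .520, .310, .470, .640, .410, .550],
})

X = sm.add_constant(df["win_pct"])        # add an intercept term
fit = sm.OLS(df["power_rank"], X).fit()

print(fit.rsquared)   # the R^2 discussed above (~.73 on the real data; toy numbers won't match)
print(fit.params)     # intercept and slope
```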
Hm, no luck there. All other combinations were similarly ineffectual. What if ESPN is biased towards big-market teams? Surely there's some incentive to inflate7 the rankings of the teams whose fans give ESPN the most views and clicks. To test this, I counted how many times each team was tagged in an article in ESPN.com's archive published between April 1, 2014 and March 31, 20158 and compared that to their 2015 power rankings. This resulted in an R^2 of just 0.07, and regressing the ordinal ranking of article tags against power rankings gave an R^2 of just 0.11. Despite these small figures, they could still provide some inferential value because, unlike all the other x-variables tested, they are not autocorrelated9 with winning percentage.
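Checking that (footnote 9's loose sense of "autocorrelation", i.e., collinearity between x-variables) is just a straight correlation between the two columns. A sketch, again with hypothetical per-team numbers:

```python
import pandas as pd

# Hypothetical 2015 per-team data: article-tag counts and winning percentage.
teams = pd.DataFrame({
    "win_pct":          [.620, .560, .540, .480, .400],
    "article_mentions": [310, 95, 480, 120, 260],
})

# How collinear is the candidate x-variable with winning percentage?
print(teams["win_pct"].corr(teams["article_mentions"]))

# Same check for the ordinal version (1 = most-mentioned team).
mention_rank = teams["article_mentions"].rank(ascending=False)
print(teams["win_pct"].corr(mention_rank))
```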
So after all that, we're still stuck at an R^2 of roughly .73. That's nice, but it's not enough for the cynic in me. There's no way that ~27% of the variation in power rankings is explained by secret sauce. Looking at the ordinal rankings of ESPN article mentions, thankfully, provided the spark to do the same with winning percentage. What happens when you regress power rankings against ordinal winning percentage rankings?
A 0.91 R^2 is what happens10! Clearly we're on the right track now. Let's give ESPN even less credit and assume they can never really let go of their preseason power rankings—no one likes admitting when they make an error, after all.
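Mechanically, the only new wrinkle is converting winning percentage to its ordinal rank before regressing. A sketch, with hypothetical numbers for a single week:

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical snapshot for one week: each team's season-to-date winning
# percentage and its ESPN power ranking.
week = pd.DataFrame({
    "win_pct":    [.650, .610, .580, .520, .470, .430, .380],
    "power_rank": [1, 3, 2, 4, 5, 7, 6],
})

# 1 = best record; this is just the team's place in the standings.
week["win_pct_rank"] = week["win_pct"].rank(ascending=False)

X = sm.add_constant(week["win_pct_rank"])
fit = sm.OLS(week["power_rank"], X).fit()
print(fit.rsquared)   # ~.91 on the real three seasons of data; toy numbers won't match
```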
Nice. And now, if we go full-on p-hacking and autocorrelation-ignoring11 and include prior week power ranking as an x-variable:
So here we are, a .95 R^2 while being as lazy as we could (and .92 with methodology that at least isn't blatantly invalid). Hypothesis confirmed: Power rankings are pointless and lazy and bad12. If you knew nothing about a team but its ordinal rank in the standings, you'd be able to predict its place in the power rankings with very high accuracy.
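For completeness, the lazy version is just the same regression with more columns thrown in. The preseason-ranking and prior-week-power-ranking columns below are hypothetical stand-ins for the real data:

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical team-week rows: ordinal winning-percentage rank, preseason
# power ranking, and the prior week's power ranking as predictors.
df = pd.DataFrame({
    "power_rank":      [2, 6, 11, 25, 17, 4, 28, 9],
    "win_pct_rank":    [1, 7, 12, 26, 15, 5, 29, 8],
    "preseason_rank":  [3, 5, 14, 22, 20, 2, 27, 10],
    "prev_power_rank": [2, 7, 10, 24, 18, 3, 28, 11],
})

X = sm.add_constant(df[["win_pct_rank", "preseason_rank", "prev_power_rank"]])
fit = sm.OLS(df["power_rank"], X).fit()
print(fit.rsquared)   # ~.95 per the article on the real data
```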
But. But! But what does determining the simplicity behind the power rankings' construction actually tell us? The data above doesn't really answer the question of whether power rankings have any value to their readers. And it seems odd to discount something just because its construction is simple. In fact, we often do the opposite13. So: The data below shows the correlations (by week of the season) between rest-of-season winning percentage and each of power ranking, ordinal win percentage ranking, win percentage, and Pythagorean record (i.e., run differential).
Startlingly, the power rankings had the highest correlation almost across the board, beating even Pythagorean record14! In hindsight this probably shouldn't be that surprising for the period from week 18 (mid-to-late July) onwards, as humans are naturally better able to price in the impact of newly acquired/departed players than is a model that is rigid by design. But seeing power rankings outperform Pythag in the first half of the season is very surprising. Even being as charitable to Pythag as I can by ignoring the first three weeks of the season (since there's so much noise in the data early on) and beginning with the first week in which it outperforms the power rankings, it still loses out by an average of .012 per week over the 14-week period from week 4 to week 17. I guess we now know what happened to that missing 0.08 R^2 from the regression!
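If you want to replicate the comparison: Pythagorean record is run differential dressed up (expected winning percentage of RS^x / (RS^x + RA^x), classically with x = 2), and the week-by-week figures are simple correlations against rest-of-season winning percentage. Here's a sketch with hypothetical numbers; note that power ranking correlates negatively (1 is the best team), hence the absolute values from footnote 6.

```python
import pandas as pd

# Hypothetical team-week rows: runs scored/allowed to date, ESPN power
# ranking, and the team's actual rest-of-season winning percentage.
df = pd.DataFrame({
    "week":         [4, 4, 4, 4, 10, 10, 10, 10],
    "runs_scored":  [110, 95, 130, 88, 280, 240, 310, 225],
    "runs_allowed": [90, 105, 100, 120, 230, 260, 250, 300],
    "power_rank":   [3, 18, 2, 27, 5, 20, 1, 29],
    "ros_win_pct":  [.560, .470, .590, .400, .540, .460, .610, .380],
})

# Pythagorean expectation with the classic exponent of 2.
rs, ra = df["runs_scored"], df["runs_allowed"]
df["pythag"] = rs**2 / (rs**2 + ra**2)

# Week-by-week correlations with rest-of-season winning percentage.
# Power ranking correlates negatively (1 = best), so compare absolute values.
for week, grp in df.groupby("week"):
    print(week,
          grp["ros_win_pct"].corr(grp["power_rank"]),
          grp["ros_win_pct"].corr(grp["pythag"]))
```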
With that, I open the floor to the commentariat to figure out how this can happen and what the implications are. Feel free to also point out any methodological flaws (keeping in mind that to some extent the methodology is intentionally flawed).
* * * * *
Footnotes
(1) Nothing for or against ESPN here, theirs were just the least troublesome rankings to pull, though still awfully troublesome. If anyone has ideas for a good way to collect SI power ranking data in a way that, even if messy, is at least scrubbable in a reasonably systematic fashion, please let me know in the comments.
(2) 1B > Yasiel Puig.
(3) The cynic in me (rightly) wouldn't allow for the possibility that ESPN was actually putting some amount of scientific rigour into this—why would they? Also, the nature of regressions is such that adding any additional independent variable will increase the R^2 by some amount, no matter how irrelevant or autocorrelated. While my goal here is admittedly mostly to p-hack my way to an answer, allowing for an unlimited number of x-variables would be messy and pointless (and doesn't result in a materially higher R^2 than the maximum presented in this article).
(4) If it matters to you, the P-value for only one variable was greater than 1E-13 and most were so small that Excel can't even process them as anything but zero, and the largest was still well below 0.01.
(5) Which would be weird, because R doesn't exist in the Greek alphabet and 2 is part of the Arabic numeral system, but I digress.
(6) Fine, the absolute value.
(7) "Embrace Debate" notwithstanding
(8) Which, unsurprisingly, really really sucked to do, so it's only for 2015 instead of all 3 seasons.
(9) Roughly: the correlation of an x-variable with another x-variable. See also [link]
(10) Using ordinal winning percentage ranking and ordinal article mention ranking as x-variables for just 2015 actually results in an even higher R^2 than does this regression, despite having only one third as many observations. There's likely some additional R^2 to be gained above the maximum (no spoilers! keep reading the article) were I to do this for all 3 years for which I have data, but I don't get paid enough for that.
(11) Prior week power ranking has a .94 correlation (.88 R^2) with current week power ranking
(12) And 1B is the greatest first baseman of all time
(13) You thought I would also link to "Brevity is the soul of wit", but actually Polonius is "Generally regarded as wrong in every judgment he makes over the course of the play"
(14) And the pattern is fairly consistent across each year; it's not just a case of averaging hiding some wild noise in the data.