This post comes out of the discussion about "advanced stats vs. traditional stats" in craig in calgary's Fanpost about things that grind his gears. But even beyond that, I think it's worth taking a look at why certain things are included in FIP and others excluded.
One of the things that grinds my gears is the claim that ERA tells us what happened (the actual results), whereas FIP is merely predictive (i.e., not telling us what actually happened). On the contrary, FIP only measures things that actually happened - strikeouts, walks, and home runs - and then assigns each a value based on its long-run effect on preventing or producing runs. Why these things, and only these things? Simply put, they are the things over which pitchers have the most control, and which reflect repeatable skills. What FIP strips out are the things research has shown pitchers have much less control over, or essentially no ability to control. In other words, if a pitcher deviates from the average in those areas, it is not because of their skill level, but because of random variation (some use the term luck here, but for a number of reasons it's not necessarily luck).
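Since FIP counts only those outcomes (plus hit batters, which the formula groups with walks), the whole calculation fits in a few lines. Here is a sketch of the standard Fangraphs formulation; the constant is set each season so that league-average FIP matches league-average ERA, and the 3.10 used here is a typical value, not the exact figure for any given year:

```python
# Sketch of the Fangraphs FIP formula. The 3.10 constant is an assumed
# typical value; the real constant is recalculated each season so that
# league-average FIP equals league-average ERA.
def fip(hr, bb, hbp, k, ip, constant=3.10):
    """Fielding Independent Pitching: only HR, walks, HBP, and strikeouts."""
    return (13 * hr + 3 * (bb + hbp) - 2 * k) / ip + constant

# Hypothetical season line: 200 IP, 20 HR, 50 BB, 5 HBP, 180 K
print(round(fip(hr=20, bb=50, hbp=5, k=180, ip=200.0), 2))
```

Note that nothing about balls in play or baserunners appears anywhere in the formula - that's the whole point.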
The two biggest factors here are Left on Base percentage (LOB%) and BABIP. I don't want to spend a lot of time describing what they are, so I've linked to the Fangraphs glossary, which does a much better job than I ever could. The important thing is the effect they can have on the runs a pitcher allows versus what FIP predicts. What I'm going to do is look at season-to-season differences in these two factors among qualified pitchers, to show why they are considered largely the result of random variation, and therefore excluded from FIP. The reason to use qualified starters is that they have enough innings in each season to make for a meaningful sample. There are generally around 90 qualified starters per season, or roughly 3 per team. So it's important to remember that these pitchers are, on average, better than the league-wide averages, since the group excludes pitchers who get injured or don't perform well enough to justify more innings.
On average, pitchers leave around 72% of runners on base. A higher rate of leaving guys on base means fewer runs scored, and usually results in a pitcher's ERA being lower than his FIP, and vice versa. But generally, pitchers have little control over their strand rate, and having a high strand rate does not mean it will continue in the future, or even that it's likely to. I compare it to flipping a fair coin - if you flip tails 8 times out of 10, not only does that tell you nothing about future likelihoods, it's the result of random variation, not skill. Since it's not a result of skill, it doesn't make sense to credit or debit a person based on those results - they don't control the result.
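The coin-flip analogy is easy to check with a quick simulation (an illustration I'm adding here, not data from the tables below): take a large pool of fair coins, keep only the "talented" ones that came up tails at least 8 times in 10 flips, then flip those coins again.

```python
# Toy simulation of the coin-flip analogy: flip 10,000 fair coins 10
# times each, keep the "lucky" ones that landed tails 8+ times, then
# give each of them 10 more flips and check the follow-up tails rate.
import random

random.seed(42)
lucky = [c for c in range(10_000)
         if sum(random.random() < 0.5 for _ in range(10)) >= 8]
followup = [sum(random.random() < 0.5 for _ in range(10)) for _ in lucky]
rate = sum(followup) / (10 * len(followup))
print(f"{len(lucky)} lucky coins, follow-up tails rate: {rate:.3f}")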
Below I show the top 10 qualified pitchers in terms of 2010 LOB% (along with ERA and FIP), and then how they performed in 2011. Below that, I show the bottom 10 qualified pitchers in 2010 LOB%, as well as their performance in 2011:
In 2010, the top 10 LOB% group stranded 80% of their runners on base, well above the average of 73.2% overall (among qualified pitchers). However, in 2011, they fell back to 75.5%, though the league average was basically the same. In other words, they regressed 70% towards the average. The worst 10 LOB% group stranded only 66.7% in 2010, but in 2011 that came up to 72.7%, basically right at the league average. The striking fact is this - in 2010, the difference between the worst 10 and best 10 was 13.3%. In 2011, that fell to 2.8%. In other words, 80% of the gap disappeared.
I think it's also interesting to look at ERA vs. FIP. In 2010, the best 10 pitchers had an ERA 0.70 lower than their FIP. In 2011, that gap narrowed to 0.10, even though they had a lower collective FIP (we would actually expect some regression upward to the mean). On the other hand, the worst 10 had a negative gap of 0.48, which narrowed to 0.09 in 2011 when their LOB% normalized. It's instructive that both the FIP-ERA gap and the gap between LOB% and average LOB% moved in the same direction.
Going back a year, from 2009 to 2010 we see the exact same pattern, only stronger. The best 10 pitchers at stranding runners regressed almost entirely to the mean (from 79.5% to 73.8%, against an average of 73.2%), and the worst 10 regressed from 66.8% to 73.1%. A gap of 12.7% in 2009 between these two groups became 0.7% in 2010 - over 90% of the difference went away! Likewise, a positive FIP-ERA gap of 0.62 for the top 10 became 0.09, and the negative gap of 0.71 for the worst 10 fell to 0.05.
Remember, these aren't randomly selected groups of pitchers we are comparing - these are literally the best performers in a particular metric against the worst. If this were a repeatable skill, we would not expect over 80% (and as much as over 90%) of the gap to simply disappear from year to year; the pitchers who had the skill would exhibit it again the next season. Instead, what we see is, for the most part, positive and negative random variation that the pitcher doesn't really control.
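The "how much of the gap disappeared" arithmetic used above is simple enough to write down. Here it is as a small helper, applied to the 2010-11 LOB% figures quoted earlier (80.0%/66.7% in 2010, 75.5%/72.7% in 2011):

```python
# How much of the year-one gap between the best and worst groups
# survives into year two? Inputs are the 2010 -> 2011 LOB% figures
# quoted in the text.
def gap_retained(best_y1, worst_y1, best_y2, worst_y2):
    """Fraction of the year-one best/worst gap still present in year two."""
    return (best_y2 - worst_y2) / (best_y1 - worst_y1)

retained = gap_retained(best_y1=80.0, worst_y1=66.7,
                        best_y2=75.5, worst_y2=72.7)
print(f"{retained:.0%} of the LOB% gap retained")  # ~21%, i.e. ~79% gone
```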
The league average BABIP is usually around .290, and allowing a lower BABIP will usually mean fewer runs and an ERA lower than FIP. Now, there is one significant difference between LOB% and BABIP, in that we know pitchers have some control over their BABIP, and that in the very long run, the best pitchers will have a lower BABIP than league average (conversely, guys who are very hittable tend to get selected out, if not in the minors then fairly rapidly in the majors). So if we're looking at the career of a pitcher with a thousand innings or so, I can get on board with using ERA (well, actually RA with a defensive adjustment, but that's a different story) rather than FIP, because the substantial sample makes it unlikely that random variation is responsible for the difference. But on a season-to-season basis, this is not the case. I repeat the same exercise as above for the top 10 qualified pitchers in 2010 BABIP versus the bottom 10, and their respective follow-up performance in 2011:
We see a similar story as with LOB%. The top 10 pitchers had a group BABIP of .250 in 2010, well below the average of .288, but in 2011 they came back up to .278 as a group - again, almost 70% of the way back towards the mean. The group of the ten worst had a BABIP of .327 in 2010, falling back to .290 in 2011, almost exactly average. A gap of 77 points of BABIP in 2010 became 12 in 2011 - roughly 85% of the gap disappeared.
We see a similar story looking at what happens to ERA vs. FIP. The best 10 had a collective gap of 0.82 in 2010, which fell to 0.20 in 2011. The worst 10 went from having an ERA 0.74 runs higher than their FIP to only 0.09 runs higher. Again, the BABIP regression moved in the exact same direction as the FIP-ERA regression. Let's go back a year and look at 2009 to 2010:
Same deal, although to a milder extent. The best 10 pitchers went from .263 in 2009 to .272 in 2010, still well below the average of .288. The worst 10 went from .325 to .302, also significantly different from the average (we're talking nearly 2000 IP as a group). The total gap of 62 points fell to 30 points, meaning roughly 50% of the gap persisted, which is significant. Likewise, the FIP-ERA gaps were more persistent. Maybe there's some skill after all? Two points. First, the best 10 pitchers' average was closer to the mean in 2009 than in 2010, so there was less room to regress (the gap between the best and worst was 77 points in 2010 vs. 62 in 2009, a tighter band). Second, as I said above, there is some pitcher control over BABIP - in a group of pitchers with ~2000 IP, we should see some of that. So let's go back another year and take a peek at 2008 versus 2009:
This is the most fascinating one to me. In 2008, the best 10 BABIP pitchers had an average of .253, similar to 2010's group. But in 2009, that group's BABIP ballooned to .310 - almost 20 points worse than average. The worst 10 BABIPers went from .329 in 2008 to .292 in 2009, basically right at the average. In other words, the gap of 76 points became 18 points in the other direction. So if the 2009-10 comparison was bad for regression, the 2008-09 comparison is very good for it. I suspect the truth lies somewhere in the middle, more like 2010-11.
It's also instructive to look at what happened to the FIP-ERA gap, especially for the top 10 BABIP pitchers. In 2008, the top 10 had a positive gap of 0.84 runs. But in 2009, when their BABIP ballooned, that became a negative difference of 0.39, even though their group FIP stayed remarkably static (4.59 in 2008, 4.67 in 2009). Correlation is not causation, but in every single case I've outlined, the FIP-ERA gap has regressed in the same direction and at roughly the same magnitude as the LOB% or BABIP regressed.
Comparing this with K/9 and BB/9
Let's do the same thing with K/9 and BB/9 to see the degree to which these skills regress from season to season, starting with K/9 for 2010 to 2011:
In 2010, the top 10 K artists struck out 9.4 per nine innings, which fell back to 8.6 in 2011. The 10 worst guys went from striking out 4.7 per nine in 2010 to 5.0 in 2011. There was some regression, which we would expect, but the rates were quite stable. A gap of 4.7 between the two groups in 2010 fell to 3.6 in 2011, meaning about 75% of the gap was retained, unlike with LOB% and BABIP, where a similar amount or more was lost. Also instructive is the fact that the FIP-ERA gaps are small and relatively similar, which also points towards K/9 being a repeatable skill. Let's look at 2009 to 2010:
Again, a similar pattern, though the retention of the gap was closer to 65% than 75%. Also, we see many of the same guys repeating on the lists. There is some random variation from year to year, but striking guys out is a persistent skill. Let's look at BB/9 for 2010 to 2011:
Again, we see the same pattern as with K/9. The best 10 pitchers at preventing walks went from 1.7 per 9 in 2010 to 1.8 in 2011. The worst 10 pitchers went from 3.9 per 9 in 2010 to 3.5 in 2011. The gap of 2.2 between them in 2010 fell to 1.7 in 2011 - again, some regression to the mean, but the gap is largely preserved. Again, the FIP-ERA gaps among these groups are stable and very small. Moving on to 2009 to 2010:
Same story. The best pitchers regress from 1.6 to 1.9, the worst guys from 4.2 to 3.7. The gap of 2.6 in 2009 falls to 1.8 in 2010, and is largely preserved. The FIP-ERA gaps are stable and small, and many of the same guys appear on the list from year to year. This is largely a repeatable skill, with some expected random variation.
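Running the retention arithmetic on the K/9 and BB/9 figures quoted above (2010 to 2011) makes the contrast with LOB% and BABIP explicit - most of the gap survives rather than vanishing:

```python
# Fraction of the 2010 best/worst gap still present in 2011, using the
# K/9 and BB/9 figures quoted in the text. abs() handles BB/9, where
# the "best" group has the lower number.
def gap_retained(best_y1, worst_y1, best_y2, worst_y2):
    return abs(best_y2 - worst_y2) / abs(best_y1 - worst_y1)

k9 = gap_retained(9.4, 4.7, 8.6, 5.0)   # strikeout rate gap
bb9 = gap_retained(1.7, 3.9, 1.8, 3.5)  # walk rate gap
print(f"K/9 gap retained: {k9:.0%}, BB/9 gap retained: {bb9:.0%}")
```

Roughly three-quarters of each gap is retained, versus the ~20% or less that persisted for LOB% and BABIP.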
When looking at the best pitchers in terms of their LOB% in a season, or their BABIP in a season, there is a marked tendency for that outperformance to strongly regress towards the mean the following season. Likewise, among the worst pitchers for these outcomes in a season, there is a strong tendency to regress towards the mean. This largely represents random variation, and reflects the fact that the pitcher has little control over these outcomes.
When performing the same analysis on K/9 and BB/9, key inputs into FIP, we see that there is some regression towards the mean, but the outcomes are quite consistent from year to year. This reflects the control that the pitcher has over these outcomes as a result of their skill.
This is precisely why I prefer FIP to ERA - not as a predictive measure (that's xFIP's job), but as a measure of what actually happened. It includes the things the pitcher most controls - his skills - and excludes the things that tend to be random variation, not indicative of his actual skills.
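For completeness, the xFIP tweak mentioned above swaps actual home runs for the number "expected" from the pitcher's fly balls at the league HR/FB rate. This sketch assumes a typical ~10.5% league rate and a typical 3.10 constant, and otherwise mirrors the FIP formula:

```python
# Sketch of xFIP: identical to FIP except actual HR are replaced with
# fly balls * league HR/FB rate. Both the 10.5% rate and the 3.10
# constant are assumed typical values, not exact figures for any year.
def xfip(fly_balls, bb, hbp, k, ip, lg_hr_per_fb=0.105, constant=3.10):
    expected_hr = fly_balls * lg_hr_per_fb
    return (13 * expected_hr + 3 * (bb + hbp) - 2 * k) / ip + constant

# Hypothetical line: 200 IP, 200 fly balls, 50 BB, 5 HBP, 180 K
print(round(xfip(fly_balls=200, bb=50, hbp=5, k=180, ip=200.0), 2))
```

Because home-run-per-fly-ball rate is itself noisy, xFIP is the better forward-looking tool, while FIP scores the strikeouts, walks, and homers that actually happened.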
What does this mean for a pitcher like Henderson Alvarez, who has a 3.30 ERA but a 4.99 FIP on the back of a .250 BABIP and a 79% LOB% (both of which would have ranked in the top 10 last season)? One school of thought says he's prevented runs, so good on him. I see it as mostly positive random variation, and it doesn't make sense to give him credit for most of that, in the same way you would not credit the guy who flips tails 8 times out of 10 with a fair coin.