There are probably still some folks out there who have never heard of WAR, but they're becoming few and far between, and very few of them are likely at this site. But what actually constitutes this "replacement player"? It's fairly widely cited that a replacement player is roughly two wins (twenty runs) per 600 plate appearances worse than a league-average player. In fact, that's supposedly what our "replacement level" is based on -- not who is actually readily available, but what league-average production is.
Thus, we say that a two-to-2.5-win player is "league average." Of course, note the tautology here: we're saying that a two-to-2.5-win player is league average because he's worth two to 2.5 wins more than a replacement player, while also saying that a replacement player is worth two wins less than league average. However, it's entirely possible that the relationship between WAR and wins actually changes from year to year, because there's no perfect way to define "average." Even though WAR is based on the statistical mean of player performance in any given year, the distribution of that performance may change. This is particularly true when looking at the game across long timescales. Consider for a moment how much the base level of play considered "replacement level" changed during wartime, league expansion, integration, etc.
Part of the reason Honus Wagner was able to accrue four consecutive 10-WAR seasons was that he was an incredible ballplayer, but another part was how many players were systematically excluded from baseball at that time. The talent pool was smaller, so the league included a lot of players who would not make a similarly sized league today. Of course, the leagues have since expanded, so we kind of assume these things even out -- but they don't necessarily.
To make matters even more confusing, the two most frequently cited sites for WAR -- FanGraphs (fWAR) and Baseball-Reference (rWAR) -- calculate it differently. In fact, not only do they calculate WAR differently, they calculate replacement level differently. But which method actually scales more closely with actual numbers of wins? I decided to try to figure out what the relationship between fWAR, rWAR, and wins has been over the past three seasons.
Before we get started, I'm not going to go into details about the differences between the two metrics because other folks have done that already. If you're interested, there are plenty of folks who'd be more than happy to explain in the comments section.
So, first, I downloaded the past three seasons' worth of team WAR across MLB (a total of 90 team-seasons). Then, using R, I constructed linear models to determine the relationship between wins above replacement level and actual wins.
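The fitting step can be sketched in Python as well (the analysis itself was done in R). The team-season values below are synthetic stand-ins generated exactly from the fWAR model reported further down, not the real downloaded data -- the point is just to show how the intercept and slope fall out of an ordinary least-squares fit.

```python
import numpy as np

# Hypothetical team fWAR totals; wins generated noise-free from the
# reported model Wins = 45.2 + 0.93 * fWAR (synthetic, for illustration).
fwar = np.array([20.0, 30.0, 38.5, 45.0, 55.0])
wins = 45.2 + 0.93 * fwar

# Degree-1 polyfit is an ordinary least-squares linear regression.
slope, intercept = np.polyfit(fwar, wins, 1)
print(round(intercept, 1), round(slope, 2))  # intercept = wins of a 0-WAR (replacement) team
```

With real data the fit is noisy, of course, but the interpretation is the same: the intercept is how many games a team of replacement players would win.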
Here is the relationship between wins and fWAR:
and here is the relationship between wins and rWAR:
These graphs might look similar at first glance, but note the differences in scale on the x-axis (!). As expected, the slopes are similar (one more WAR should mean one extra team win), but where that relationship begins is very different depending on whether we're looking at fWAR or rWAR. So a main reason we see a big difference between fWAR and rWAR is that rWAR assumes that replacement-level is higher than fWAR assumes it is. How much higher?
Well, the y-intercept of these graphs should tell us how much a replacement-level team would win (since team WAR would be equal to zero).
The linear model describing the relationship between rWAR and wins is:
Wins = 53.8 + 0.88(rWAR)
So a replacement-level team would win about 54 games (only slightly more than the 52 games that Baseball-Reference purports a replacement-level team should win -- a gap that could be due simply to variance, since we used only 90 team-seasons). The model doesn't quite give a 1:1 ratio between WAR and wins, but it is pretty close.
The linear model describing the relationship between fWAR and wins is:
Wins = 45.2 + 0.93(fWAR).
In this case, the replacement-level team would win only 45 games. Again, the model is pretty close to a 1:1 ratio between WAR and wins (and actually somewhat closer than the rWAR model). So we see that the FanGraphs "replacement-level" player is likely significantly worse than the Rally "replacement player." This isn't necessarily a major concern when comparing the two, because replacement players don't really exist. However, average teams do exist. So how big is the difference?
Well, if we assume that an "average team" should win 81 games, it would take 38.5 fWAR or 30.9 rWAR to get there. The ratio for average teams is 1.25 fWAR : 1.00 rWAR, so either 1) FanGraphs is overestimating players or 2) Rally is underestimating them.
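As a sanity check, this inversion is simple enough to script. This is pure arithmetic on the two fitted models above, nothing more:

```python
def war_needed(wins, intercept, slope):
    """Solve wins = intercept + slope * WAR for WAR."""
    return (wins - intercept) / slope

fwar_avg = war_needed(81, 45.2, 0.93)  # FanGraphs model
rwar_avg = war_needed(81, 53.8, 0.88)  # Baseball-Reference model

print(round(fwar_avg, 1))              # fWAR needed by an 81-win team
print(round(rwar_avg, 1))              # rWAR needed by an 81-win team
print(round(fwar_avg / rwar_avg, 2))   # ratio between the two scales
```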
What does this mean when we try to determine how valuable an "average player" should be (in terms of fWAR or rWAR)? Well, we need to break this model down a bit further, to account for the fact that we have to look at pitchers and position players separately. According to FanGraphs, of 3445.2 total fWAR over the past three seasons, position players have accounted for 59.3% (2043.3) and pitchers have accounted for 40.7% (1401.9). Of 2779 total rWAR over the past three seasons, position players have accounted for 57.5% (1597.6) and pitchers have accounted for 42.5% (1181.4).
We should also separate out starter value from bullpen value. Unfortunately, we're unable to do that with rWAR, though we can for FanGraphs. Leverage also muddles calculations with rWAR, because Baseball-Reference includes it in pitching WAR -- which means that bullpen arms get a bonus for pitching in high-pressure situations. No worries: we will use the FanGraphs value ratio and then convert that based on average leverage index afterwards. According to FanGraphs, this season, on average, starters were worth 81.1% and bullpens about 18.9% of pitching runs above replacement.
Over the three seasons I looked at, the average team threw something like 1445 innings (1445.6): starters pitched about 970 and bullpens about 475 of those innings. Each team also accrued something like 6200 plate appearances (6197).
So to calculate the value of an average position player per plate appearance, our formula would be:
(WAR produced by an average team) * (proportion of that WAR that is produced by position players) / (6200 plate appearances per team)
So, for FanGraphs WAR, we work it out to be:
38.5 team WAR * (0.593 position-player WAR / 1 team WAR) / 6200 PA = 0.00368 WAR / plate appearance.
Over 650 plate appearances, that works out to:
(0.00368 WAR/PA) * (650 PA) = 2.4 fWAR / 650 PA
Now let's work out Rally WAR:
30.9 team WAR * (0.575 position-player WAR / 1 team WAR) / 6200 PA = 0.00287 WAR / plate appearance.
Over 650 plate appearances that works out to:
(0.00287 WAR/PA) * (650 PA) = 1.9 rWAR / 650 PA
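The position-player arithmetic above can be sketched in a few lines (all percentages and totals are the article's figures):

```python
TEAM_PA = 6200  # approximate plate appearances per team-season

# Average team WAR * position-player share, spread over a team's PA.
fwar_per_pa = 38.5 * 0.593 / TEAM_PA  # FanGraphs
rwar_per_pa = 30.9 * 0.575 / TEAM_PA  # Baseball-Reference

print(round(fwar_per_pa * 650, 1))  # avg position player over 650 PA, fWAR
print(round(rwar_per_pa * 650, 1))  # avg position player over 650 PA, rWAR
```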
Now let's work out the starting pitchers. This should work similarly, except we will replace the 6200 plate appearance scale with a 970 inning scale.
First, fWAR:
38.5 team WAR * (0.407 pitching WAR / 1 team WAR) * (0.811 starter WAR / 1 pitching WAR) / 970 IP = 0.0131 WAR / IP
Over a 185 inning season, this works out to:
(0.0131 WAR / IP) * (185 IP) = 2.4 WAR
Next, rWAR:
30.9 team WAR * (0.425 pitching WAR / 1 team WAR) * (0.811 starter WAR / 1 pitching WAR) / 970 IP = 0.0110 WAR / IP
Over a 185 inning season, this works out to:
(0.0110 WAR/IP) * (185 IP) = 2.0 WAR
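The same arithmetic for starters, swapping the plate-appearance scale for a 970-inning scale (again, the article's figures):

```python
STARTER_IP = 970  # approximate starter innings per team-season

# Team WAR * pitching share * starter share, spread over starter innings.
fwar_per_ip = 38.5 * 0.407 * 0.811 / STARTER_IP  # FanGraphs
rwar_per_ip = 30.9 * 0.425 * 0.811 / STARTER_IP  # Baseball-Reference

print(round(fwar_per_ip * 185, 1))  # avg starter over 185 IP, fWAR
print(round(rwar_per_ip * 185, 1))  # avg starter over 185 IP, rWAR
```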
Now, the relief pitchers.
fWAR:
38.5 team WAR * (0.407 pitching WAR / 1 team WAR) * (0.189 bullpen WAR / 1 pitching WAR) / 475 IP = 0.0062 WAR / IP
Over a 70 inning season, this works out to:
(0.0062 WAR / IP) * (70 IP) = 0.4 WAR
rWAR:
30.9 team WAR * (0.425 pitching WAR / 1 team WAR) * (0.189 bullpen WAR / 1 pitching WAR) / 475 IP = 0.0052 WAR / IP
However, rWAR also bakes in an average leverage index of 1.27 for relievers, which means that, per inning, an average reliever would be worth:
0.0052 * 1.27 = 0.0066 rWAR / IP
Over a 70 inning season, this works out to:
(0.0066 WAR / IP) * (70 IP) = 0.5 WAR
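And the reliever arithmetic, including the 1.27 average leverage index applied on the rWAR side:

```python
RELIEF_IP = 475  # approximate bullpen innings per team-season
LEVERAGE = 1.27  # average leverage index for relievers, as cited above

# Team WAR * pitching share * bullpen share, spread over bullpen innings;
# rWAR additionally gets the leverage bump.
fwar_per_ip = 38.5 * 0.407 * 0.189 / RELIEF_IP
rwar_per_ip = 30.9 * 0.425 * 0.189 / RELIEF_IP * LEVERAGE

print(round(fwar_per_ip * 70, 1))  # avg reliever over 70 IP, fWAR
print(round(rwar_per_ip * 70, 1))  # avg reliever over 70 IP, leverage-adjusted rWAR
```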
So, interestingly, we see that calling a 2.0-to-2.5-WAR season "average" is only partly right. By fWAR, treating 2.5 as average production for position players and starting pitchers is close to the mark (both work out to about 2.4), so it's a good benchmark -- but it's a poor benchmark by rWAR, where average works out closer to 1.9-2.0 WAR. Comparing relief pitchers by WAR is quite difficult to do, since not every pitcher gets an equal opportunity to produce. However, it is worthwhile to note that, due to leverage index, 1 relief-pitcher rWAR is actually slightly less valuable than 1 relief-pitcher fWAR.
So, now that we've discussed what these mean for our "average" players, it comes time to attempt to answer the question of which metric is "better." As we can tell from the slopes, wins scale slightly better with fWAR than with rWAR, suggesting the fWAR method is better. The r-squared values, however, suggest that neither method is superior. The fWAR model has an r-squared value of 0.7748 while the rWAR model has an r-squared value of 0.7755. A model that incorporates both (r-sq = 0.825) is actually slightly better than including only one of them, which suggests that using both methods of calculation (i.e., FIP vs. RA and UZR vs. TZ) does give us a better picture of a player's actual contribution to team wins. Additionally, the relative importances of each metric to that model are essentially the same (rWAR = 0.50045, fWAR = 0.49955), so the model thinks they're both equally useful.
In summation, I'd say that neither version of WAR is necessarily more useful (or "better") than the other; however, it is important to keep in mind that the methods do scale differently. Consider this for a moment: we talk about a 7-WAR player being a likely MVP candidate. Well, add a 7-WAR player to a team full of FanGraphs replacement players (45 + 7 = 52 wins) and they're likely to win no more games than a team full of Baseball-Reference replacement players (54 wins) without that 7-WAR player. So what do you all think about that?