Begin with a data set, preferably one in which many people are interested. Let’s say, World Series results from 1903 to the present.

Now ask a question about the data, one that should be easy to answer with a highly simplified model. Our question will be: have World Series teams, historically, been evenly matched?

Our model will ignore home-field advantage. In baseball the home team wins 53% or 54% of the time; nonetheless, we will assume that each team has a probability of 0.5 of winning each game. This gives the following expected probabilities for a best-of-seven series running four, five, six, or seven games:

P(4) = 0.125

P(5) = 0.250

P(6) = 0.3125

P(7) = 0.3125
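These figures come from simple counting: a series ends in game n when one team takes its fourth win in that game, having won exactly three of the first n - 1. Here is a minimal Python sketch, assuming independent games and a fixed win probability; the function name `series_length_probs` is made up for illustration.

```python
from math import comb

def series_length_probs(p=0.5, wins_needed=4):
    """Probability that a series ends in exactly n games, for each possible n.

    Assumes independent games and a fixed probability p that team A wins each one.
    """
    q = 1.0 - p
    probs = {}
    for n in range(wins_needed, 2 * wins_needed):
        # The eventual winner takes game n and exactly wins_needed - 1 of the first n - 1.
        ways = comb(n - 1, wins_needed - 1)
        a_wins = ways * p**wins_needed * q**(n - wins_needed)
        b_wins = ways * q**wins_needed * p**(n - wins_needed)
        probs[n] = a_wins + b_wins
    return probs

print(series_length_probs())  # {4: 0.125, 5: 0.25, 6: 0.3125, 7: 0.3125}
```

Passing a value such as `p=0.54` shows how little a modest edge for one side moves these numbers.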

Remember that if the model is too simple to fit the data, you can clean the data. Since 1903, the World Series has been played every year but two. There were a few best-of-nine series and a few more that included ties, which are too complicated to deal with. Throw them out. This leaves 95 series. Draw up a little chart comparing actual and expected probabilities, like so:

| Possible outcomes | P(Expected) | P(Actual) |
|---|---|---|
| 4-0 | 0.125 | 0.179 |
| 4-1 | 0.250 | 0.221 |
| 4-2 | 0.3125 | 0.242 |
| 4-3 | 0.3125 | 0.358 |

Now answer your own question. If the teams were evenly matched, the results would hew reasonably closely to the expected probabilities from the model. In fact there are anomalies. There are *always* anomalies. The World Series has been swept 17 times, five more than the model would predict. Plug this into the BINOMDIST function in Excel. (Understanding how this function works is optional and may in some cases be a disadvantage.) You find that, if the probabilities in the model were correct, there would be 17 or more sweeps in 95 occurrences only 8% of the time. A rotten break: you’re three lousy percent under statistical significance. But that aside, eleven of those were won by the team with the better regular-season record, several by teams considered among the all-time greats, including the 1927, 1939 and 1998 Yankees. That probably means *something*. On the other hand, the team that held the American League record for wins before 1998, the 1954 Indians, was swept by the Giants. Conclude judiciously that, on the whole, the data imply an occasional mismatch.
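For anyone without Excel, the sweep figure is an upper binomial tail, the same quantity as 1 - BINOMDIST(16, 95, 0.125, TRUE). Here is a short sketch in plain Python; the helper name `binom_upper_tail` is invented here.

```python
from math import comb

def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed exactly term by term."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# 17 or more sweeps in 95 series, if each series is a sweep with probability 0.125.
print(f"{binom_upper_tail(17, 95, 0.125):.3f}")
```

The result should land close to the 8% quoted above.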

Look for any bonus anomalies. It doesn’t matter if they have nothing to do with your original question. Our data set turns up a nice one: the series went to seven games 34 out of 95 times, about four more than the 29.7 the model predicts. This would occur randomly, assuming correct probabilities, only 20% of the time.

Damn, we’ve missed out on statistical significance again. Instead of looking at how often the series went seven, we can look at how often the team behind 3-2 won the sixth game. 34 out of 57, a somewhat more unusual result. Plug it back into BINOMDIST: we’re down to 9%, which is close but not close enough.

It has become inconvenient to look at the entire data set; let’s take just a chunk of it, say, 1945 to 2002. In those 58 years the World Series lasted seven games 27 times, which would happen by chance a mere 1% of the time. Furthermore, the team behind 3-2 won the sixth game 27 of 39 times; again, a 1% chance. Statistical significance at last!
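How much this kind of chunk-hunting inflates the odds of a "significant" find can be estimated by simulation. The rough sketch below generates purely random histories under the model, scans every contiguous window, and counts how often at least one window clears the 5% bar. The minimum window length of 30 series is an arbitrary assumption, and all names are invented here.

```python
import random
from math import comb

P_SEVEN = 0.3125   # model probability that a series goes seven games
N_SERIES = 95      # series kept in the data set
MIN_WINDOW = 30    # assumption: shortest "chunk" anyone would dare report

def upper_tail_table(n, p):
    """Return a list t where t[k] = P(X >= k) for X ~ Binomial(n, p)."""
    pmf = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    tails = [0.0] * (n + 2)
    for k in range(n, -1, -1):
        tails[k] = tails[k + 1] + pmf[k]
    return tails

# Precompute tails for every window length so the scan is just table lookups.
TAILS = {n: upper_tail_table(n, P_SEVEN) for n in range(MIN_WINDOW, N_SERIES + 1)}

def smallest_window_p(outcomes):
    """Smallest upper-tail p-value over all contiguous windows of the data."""
    prefix = [0]
    for x in outcomes:
        prefix.append(prefix[-1] + x)
    best = 1.0
    for start in range(len(outcomes) - MIN_WINDOW + 1):
        for end in range(start + MIN_WINDOW, len(outcomes) + 1):
            count = prefix[end] - prefix[start]
            best = min(best, TAILS[end - start][count])
    return best

def share_with_significant_window(trials=500, alpha=0.05, seed=0):
    """Fraction of purely random histories in which some window has p < alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        outcomes = [1 if rng.random() < P_SEVEN else 0 for _ in range(N_SERIES)]
        if smallest_window_p(outcomes) < alpha:
            hits += 1
    return hits / trials

# The chunk chosen above: 27 seven-game series out of the 57 played in 1945-2002
# (58 years, but no Series in 1994).
print(f"{TAILS[57][27]:.3f}")
print(share_with_significant_window())
```

The first number should come out near the 1% quoted above. The second shows how often random data alone would hand you some window worth writing up.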

Next, concoct plausible explanations for your new, statistically significant anomaly. Maybe the team that is behind plays harder, with their backs against the wall. Maybe they use all of their best pitchers, holding nothing in reserve for the seventh game. Maybe the team that is ahead chokes and cannot close it out.

Under no circumstances should you test these explanations. In the World Series the team that won Game Six also won Game Seven 18 times out of 34 — not likely if they had squandered their resources to win Game Six. In basketball, in the NBA Finals, the team that led 3-2 won Game Six 26 times out of 45. This is the opposite of what we found in baseball, in a sport that rewards hard play more and is far more conducive to choking, as anyone knows who has tried to shoot a free throw in a big game. In other words, your explanations, though plausible, are false. The result is probably due to random variation. This should not discourage you from completing your article. Write up your doubts in a separate note several months later.

Finally, check the literature to make sure your idea is original. If it isn’t, which is likely, mention your predecessor prominently in your acknowledgements, and include a footnote in which you pick a few nits.

Submit to suitable journals. Repeat unto death, or tenure, whichever comes first.

**Update:** Actual professional statisticians comment. Evolgen, who may or may not be a professional statistician, comments.

Damned lies.

Do teams throw in the towel after falling behind 3-0 in games? Nobody in baseball or the NBA, at any postseason level, had ever come back from that deficit until the 2004 (?) Red Sox. I don’t think anybody in baseball had ever even tied the series 3-3 after falling behind 0-3. That seems unreasonably low to me, suggesting a quitting attitude.

In the simple model a team could be expected to rally from a 3-0 deficit once in 16 tries. A team has fallen behind 3-0 in games 20 times. It has won the fourth game three times, and never the fifth, let alone the sixth or seventh.

If the simple model were true — which it isn’t quite — the team leading 3-0 should win Game 4 half the time. The actual result of 17 out of 20 would occur by chance approximately 0.6% of the time. (The chance that not a single team would reach Game 6 in 20 tries is even lower, 0.3%.) Yeah, I’d call that unreasonably low.

There are at least two plausible explanations. One, as Steve suggests, is that teams throw in the towel. Another is that the team down 3-0 is simply overmatched. If we use a 0.4 probability of winning in place of 0.5, the chance of the actual result rises to 5.1%, or just outside of statistical significance. But 0.4 is pretty low for a team that won a league championship. I’m inclined to think that Steve’s explanation is probably true. Can anyone think of a better one?
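The percentages in this reply are lower binomial tails on the number of Game 4 wins by the trailing team, and they are sensitive to where the cutoff is drawn. The sketch below (helper name invented here) prints both candidates. The 0.6% and 5.1% figures match a cutoff of four or fewer wins in 20. Counting three or fewer, which is what actually happened, gives about 0.13% and 1.6%.

```python
from math import comb

def binom_lower_tail(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), summed exactly term by term."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(0, k + 1))

N = 20  # teams that fell behind 3-0

for p in (0.5, 0.4):
    at_most_3 = binom_lower_tail(3, N, p)  # trailing team wins Game 4 three times or fewer
    at_most_4 = binom_lower_tail(4, N, p)  # cutoff that reproduces the quoted percentages
    print(f"p={p}: P(X<=3)={at_most_3:.4f}  P(X<=4)={at_most_4:.4f}")

# No trailing team reaching Game 6 in 20 tries: each one must win Games 4 and 5,
# which the simple model prices at 0.5 * 0.5 = 0.25.
print(f"{(1 - 0.25) ** N:.4f}")
```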

It’s not hugely uncommon for a team to lose the first two on the road, then come home, win game 3, and go on to take the series in seven or even six games. But losing game 3 at home seems to be a psychological death blow. It will be interesting over the next several decades to see if the Red Sox rally from down 3-0 in 2004 will change that psychology.

The funny thing is that going all out in a baseball game just isn’t that hard, except for the pitchers. You’d think baseball players wouldn’t give up, but it looks like they sometimes do.

Still, when you work through the history of a sweep, you can see why the losers might pack it in.

In the past, when teams had four-man rotations in the regular season, they’d use their three best pitchers in the Series (there are off days after Games 2 and 5). If they all won, that could be depressing to the team that was down.

For example, in the 1963 World Series, the mighty Yankees lost to Sandy Koufax in the first game 5-2 in Yankee Stadium, with Koufax striking out 15, then lost to Johnny Podres in the second 4-1. Then they went to Dodger Stadium, and Don Drysdale beat them 1-0.

So, now the Yankees are down 3-0 on the road, the Dodgers are giving up 1.0 runs per game, and the opposing pitcher in Game 4 is, oh crap, Sandy Koufax again, who went 25-5 during the season. And if they manage to beat Koufax, then they’ve got to beat Podres, who had 5 shutouts during the season, in Game 5, and then beat Drysdale, who had won 25 the year before, in Game 6.

And, then, even if they somehow won three straight, they’d still have to beat Koufax again in Game 7. Not surprisingly, they lost Game 4 2-1 and were swept.

So you can see how teams down 3-0 would get depressed.

Nowadays, with five-man rotations, a team leading 3-0 is likely to send its number 4 starter out for the fourth game (assuming both teams won the LCS quickly), while the desperate trailing team might send its ace out on three days’ rest. So the immediate situation isn’t so dire, but the long-term situation is even worse, because your pitchers will all be on short rest for the remainder of the series, unless it rains.

Why is it called the World Series? How many countries participate?

So, what we need is a formula for setting the required significance level based on how long the researcher can afford to sift through the data, looking for an anomaly, and how many models he can test per unit of time? Should the reviewer reject the paper if he can produce an equally significant observation from the data with no apparent meaning or theoretical significance?
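One textbook answer to the first question is to tighten the per-test threshold as the number of hypotheses examined grows. The sketch below uses the Bonferroni and Šidák adjustments, offered as an illustration rather than as what the commenter had in mind. The function names are invented here, and the count of hypotheses `m` is an assumption the researcher would have to supply honestly.

```python
def bonferroni_threshold(alpha, m):
    """Per-test significance level that keeps the overall error rate at or below alpha."""
    return alpha / m

def sidak_threshold(alpha, m):
    """Per-test level giving overall error rate exactly alpha for m independent tests."""
    return 1 - (1 - alpha) ** (1 / m)

def familywise_error(alpha, m):
    """Chance of at least one false positive among m independent tests at level alpha."""
    return 1 - (1 - alpha) ** m

m = 20  # hypothetical number of anomalies the researcher looked at
print(f"Bonferroni: {bonferroni_threshold(0.05, m):.4f}")
print(f"Sidak:      {sidak_threshold(0.05, m):.4f}")
print(f"Uncorrected chance of a false positive: {familywise_error(0.05, m):.2f}")
```

With 14 or more independent looks at the uncorrected 5% level, a spurious finding is more likely than not.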

Steve: Before we buy into this theory of mailing it in, we should probably check it against other sports, like basketball. In the NBA (and ABA) finals teams down 3-0 have come back to win the next game 6 out of 13, approximately what you’d expect. That’s all I can be bothered to check, but I’ll be less willing to credit the baseball results if they can’t be reproduced in other sports.

Albatross: Unfortunately you can usually find “significant” regressions in even completely random data sets. Fortunately there are more rigorous tests that can help to weed out spurious ones. Econometricians run into this problem all the time.

DavidB: At least 50, though not in the sense you mean.

This is the excellent foppery of the world of baseball, that, when a team is sick in fortune — often the surfeit of its own behavior — it makes guilty of its disasters its prior disasters; and, despite they be champions, to lay its present circumstances on those immediately before.

No NBA team has ever come back from being down 3-0 at any playoff level. Two NHL teams have.

I believe there is more randomness in baseball results than in basketball because pitchers have such a huge influence on the outcome. It’s common to see matchups of starting pitchers where the team that is inferior overall has a big advantage in a single game due to sending, say, its #1 starter out against the better team’s #4 starter.

Theories of information and interestingness

Jean-Luc pointed me to Anomaly Hunt; or, How To Write a Research Paper. This brings me to the vague topic of what is interesting. They say that you haven’t understood a concept until you have been able to explain it…

Humble apologies for being off-topic; I’d prefer to e-mail, but don’t see an e-mail address.

Thanks muchly for your sidebar link! (I didn’t know you were even aware of my existence.) Trivial note: the first link, under my name, is broken.

Thanks again.

Well, the usual test used for this type of problem is the chi-squared test. The chi-squared test asks:

Assuming that two different datasets were drawn from identical distributions, what is the likelihood that the results are as different as they are? Usually you set a significance level (5% is common), and if the results were less than 5% likely, you reject the null hypothesis that they are actually from the same distribution.

(Note: the chi-squared test takes advantage of the fact that the binomial distribution is approximately normal when the true probability p is not too close to 0 or 1 and the sample size is large enough. These hypotheses are easily satisfied here.)

How does that work here?

See: http://en.wikipedia.org/wiki/Pearson's_chi-square_test

But here is the calculation:

Let Ei = expected number of 4-i series = expected probability given by Aaron * 95

Let Ai = actual number of 4-i series = actual probability given by Aaron * 95

Form the sum over i of (Ai - Ei)^2 / Ei

In this case this equals 4.68

This is a chi-squared variable with 3 degrees of freedom (there are 4 possible outcomes but if you know the number for 3 of them, the 4’th is determined by subtraction from 95).

For a chi-squared variable with 3 degrees of freedom, the cutoff for p=0.05 significance is 7.82.

Thus we cannot reject the Null hypothesis that the distribution of results is actually generated by independent coin flips.
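The calculation above can be reproduced in a few lines of Python. This sketch works from whole-number counts recovered from the table (17, 21, 23 and 34 series), which is an assumption for the two middle rows since the post gives them only as proportions. It compares the statistic with the tabulated cutoff rather than computing a p-value.

```python
observed = [17, 21, 23, 34]                  # series ending 4-0, 4-1, 4-2, 4-3
model_probs = [0.125, 0.25, 0.3125, 0.3125]  # coin-flip model
total = sum(observed)                        # 95 series

expected = [p * total for p in model_probs]
chi_sq = sum((a - e) ** 2 / e for a, e in zip(observed, expected))

CRITICAL_5PCT_DF3 = 7.815  # tabulated chi-squared cutoff, 3 degrees of freedom, p = 0.05

print(f"chi-squared = {chi_sq:.2f}")
print("reject" if chi_sq > CRITICAL_5PCT_DF3 else "cannot reject")
```

Whole counts give roughly 4.66. The 4.68 quoted above presumably reflects the rounded proportions in the table. Either way the statistic sits well below the cutoff.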

As that rara avis, an actual professional statistician, I observe that people can make good money doing this sort of thing (think government, Wall Street, or writing political columns) while learning nothing at all about the real world. If you take a dataset, look for “anomalies”, and test everything you find (we used to refer to “torturing the data until they confessed”), the significance tests and methodologies are very different.