Jan 04 2007

Begin with a data set, preferably one in which many people are interested. Let’s say, World Series results from 1903 to the present.

Now ask a question about the data, one that should be easy to answer with a highly simplified model. Our question will be: have World Series teams, historically, been evenly matched?

Our model will ignore home-field advantage. In baseball the home team wins 53% or 54% of the time; nonetheless, we will assume that each team has a probability of 0.5 of winning each game. This gives the following expected probabilities for a best-of-seven series running four, five, six, or seven games:

P(4) = 0.125
P(5) = 0.250
P(6) = 0.3125
P(7) = 0.3125
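For readers who would rather check these figures than take them on faith, here is a minimal Python sketch. It assumes independent games and a fixed per-game win probability `p` for one team, exactly as the model above does; the function name `series_length_probs` is made up for this illustration.

```python
from math import comb


def series_length_probs(p=0.5, wins_needed=4):
    """Probability that a first-to-`wins_needed` series lasts n games.

    Assumes independent games and a fixed win probability p for one team
    (the same simplification the text makes).
    """
    q = 1.0 - p
    probs = {}
    for n in range(wins_needed, 2 * wins_needed):
        # The series ends in game n when the eventual winner takes game n
        # and exactly wins_needed - 1 of the previous n - 1 games.
        ways = comb(n - 1, wins_needed - 1)
        losses = n - wins_needed
        probs[n] = ways * (p**wins_needed * q**losses + q**wins_needed * p**losses)
    return probs


if __name__ == "__main__":
    for n, prob in series_length_probs().items():
        print(f"P({n}) = {prob}")
```

With `p = 0.5` this reproduces the four values listed above, and the probabilities sum to 1 for any `p`.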

Remember that if the model is too simple to fit the data, you can clean the data. Since 1903, the World Series has been played every year but two. There were a few best-of-nine series and a few more that included ties, which are too complicated to deal with. Throw them out. This leaves 95 series. Draw up a little chart comparing actual and expected probabilities, like so:

Outcome   P(Expected)   P(Actual)
4-0       0.125         0.179
4-1       0.250         0.221
4-2       0.3125        0.242
4-3       0.3125        0.358

Now answer your own question. If the teams were evenly matched, the results would hew reasonably closely to the expected probabilities from the model. In fact there are anomalies. There are always anomalies. The World Series has been swept 17 times, five more than the model would predict. Plug this into the BINOMDIST function in Excel. (Understanding how this function works is optional and may in some cases be a disadvantage.) You find that, if the probabilities in the model were correct, there would be 17 or more sweeps in 95 occurrences only 8% of the time. A rotten break: you’re three lousy percent under statistical significance. But that aside, eleven of those were won by the team with the better regular-season record, several by teams considered among the all-time greats, including the 1927, 1939 and 1998 Yankees. That probably means something. On the other hand, the team that held the American League record for wins before 1998, the 1954 Indians, was swept by the Giants. Conclude judiciously that, on the whole, the data imply an occasional mismatch.
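If you lack Excel, or would like to see what the function is doing, the sweep calculation can be sketched in a few lines of Python. This is an assumption about what was computed: an exact upper-tail binomial probability, which in Excel would presumably be something like `=1-BINOMDIST(16,95,0.125,TRUE)`. The helper name `binom_upper_tail` is invented for this example.

```python
from math import comb


def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed exactly term by term."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


if __name__ == "__main__":
    n_series = 95    # series left after cleaning, from the text
    p_sweep = 0.125  # model probability of a 4-0 series
    print(f"Expected sweeps: {n_series * p_sweep:.1f}")
    print(f"P(17 or more sweeps): {binom_upper_tail(17, n_series, p_sweep):.3f}")
```

The tail probability should land near the 8% quoted above. Note that it answers only one narrow question: how often 17 or more sweeps would turn up if every game really were a coin flip.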

Look for any bonus anomalies. It doesn’t matter if they have nothing to do with your original question. Our data set turns up a nice one: the series went to seven games 34 out of 95 times, about four too many according to the model (95 × 0.3125 ≈ 29.7). This would occur randomly, assuming correct probabilities, only 20% of the time.

Damn, we’ve missed out on statistical significance again. Instead of looking at how often the series went seven, we can look at how often the team behind 3-2 won the sixth game. 34 out of 57, a somewhat more unusual result. Plug it back into BINOMDIST: we’re down to 9%, which is close but not close enough.

It has become inconvenient to look at the entire data set; let’s take just a chunk of it, say, 1945 to 2002. In those 58 years (57 Series, since none was played in 1994) the World Series lasted seven games 27 times, which would happen by chance a mere 1% of the time. Furthermore, the team behind 3-2 won the sixth game 27 of 39 times; again, a 1% chance. Statistical significance at last!
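The successive attempts at significance can be lined up side by side. The sketch below recomputes each upper-tail probability from the counts given in the text. Two things here are assumptions rather than anything the text states: each figure is treated as an exact one-sided binomial tail, and the 1945 to 2002 window is taken to hold 57 series, because no Series was played in 1994. The names `binom_upper_tail`, `CHECKS`, and `tail_table` are hypothetical.

```python
from math import comb


def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed exactly term by term."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# (label, observed count, trials, per-trial probability under the model)
CHECKS = [
    ("seven-game series, full data set", 34, 95, 0.3125),
    ("trailing team wins Game Six, full data set", 34, 57, 0.5),
    # 57 trials assumes one series per year except 1994 (an assumption).
    ("seven-game series, 1945-2002", 27, 57, 0.3125),
    ("trailing team wins Game Six, 1945-2002", 27, 39, 0.5),
]


def tail_table(checks=CHECKS):
    """Return (label, upper-tail probability) for each check."""
    return [(label, binom_upper_tail(k, n, p)) for label, k, n, p in checks]


if __name__ == "__main__":
    for label, tail in tail_table():
        print(f"{label}: {tail:.3f}")
```

Each row is a separate test on the same data, and the last two were chosen after looking at the first two. The p-values know nothing about that.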

Next, concoct plausible explanations for your new, statistically significant anomaly. Maybe the team that is behind plays harder, with their backs against the wall. Maybe they use all of their best pitchers, holding nothing in reserve for the seventh game. Maybe the team that is ahead chokes and cannot close it out.

Under no circumstances should you test these explanations. In the World Series the team that won Game Six also won Game Seven 18 times out of 34 — not likely if they had squandered their resources to win Game Six. In basketball, in the NBA Finals, the team that led 3-2 won Game Six 26 times out of 45. This is the opposite of what we found in baseball, in a sport that rewards hard play more and is far more conducive to choking, as anyone knows who has tried to shoot a free throw in a big game. In other words, your explanations, though plausible, are false. The result is probably due to random variation. This should not discourage you from completing your article. Write up your doubts in a separate note several months later.

Finally, check the literature to make sure your idea is original. If it isn’t, which is likely, mention your predecessor prominently in your acknowledgements, and include a footnote in which you pick a few nits.

Submit to suitable journals. Repeat unto death, or tenure, whichever comes first.

Update: Actual professional statisticians comment. Evolgen, who may or may not be a professional statistician, comments.