Comments on: Anomaly Hunt; or, How To Write a Research Paper

By: Hal

Hal — Mon, 24 Jun 2013 05:14:02 +0000

As that rara avis, an actual professional statistician, I observe that people can make good money doing this sort of thing (think government, Wall Street, or writing political columns) while learning nothing at all about the real world. If you take a dataset, look for “anomalies”, and test everything you find (we used to refer to “torturing the data until they confessed”), the significance tests and methodologies are very different.

By: Josh Sher

Josh Sher — Thu, 30 Aug 2007 06:40:27 +0000

Well, the usual test used for this type of problem is the chi-squared test. The chi-squared test asks:

Assuming that two different datasets were drawn from identical distributions, what is the likelihood that the results are as different as they are? Usually you set a significance level (5% is common): if results at least this different would occur less than 5% of the time, you reject the null hypothesis that they are actually from the same distribution.

(Note: the chi-squared test takes advantage of the fact that the binomial distribution is approximately normal when the true probability p is not too close to 0 or 1, or the sample size is large enough. These hypotheses are easily satisfied here.)

How does that work here?

See: http://en.wikipedia.org/wiki/Pearson's_chi-square_test

But here is the calculation:
Let Ei = expected number of 4-i series = probability given by Aaron * 95
Let Ai = actual (observed) number of 4-i series among the 95

Form the sum over i of (Ai-Ei)^2/Ei

In this case this equals 4.68

This is a chi-squared variable with 3 degrees of freedom (there are 4 possible outcomes, but if you know the number for 3 of them, the 4th is determined by subtraction from 95).

For a chi-squared variable with 3 degrees of freedom, the cutoff for p=0.05 significance is 7.82.

Thus we cannot reject the Null hypothesis that the distribution of results is actually generated by independent coin flips.
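
The calculation above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are made up for this example, the expected probabilities assume fair, independent coin flips in a best-of-seven series (which may not be exactly the probabilities Aaron used), and the observed counts are not reproduced here, so the sketch only checks the reported statistic of 4.68 against the 3-degrees-of-freedom distribution.

```python
import math
from math import comb


def series_length_probs(p=0.5, wins_needed=4):
    """Chance a series ends 4-i (i = 0..3) if games are independent
    and one team wins each game with probability p (assumption: p = 0.5)."""
    probs = []
    for i in range(wins_needed):
        games_before_clincher = wins_needed - 1 + i
        ways = comb(games_before_clincher, i)  # where the loser's i wins fall
        either_team = (p ** wins_needed * (1 - p) ** i
                       + (1 - p) ** wins_needed * p ** i)
        probs.append(ways * either_team)
    return probs


def chi_square_statistic(observed, expected):
    """Pearson's statistic: sum over i of (Ai - Ei)^2 / Ei."""
    return sum((a - e) ** 2 / e for a, e in zip(observed, expected))


def chi2_sf_3df(x):
    """P(X >= x) for a chi-squared variable with exactly 3 degrees of
    freedom (closed form; not valid for other degrees of freedom)."""
    return (math.erfc(math.sqrt(x / 2))
            + math.sqrt(2 * x / math.pi) * math.exp(-x / 2))


# Expected counts Ei for 95 series under the fair-coin assumption.
expected = [95 * q for q in series_length_probs()]

# p-value of the statistic reported above.
p_value = chi2_sf_3df(4.68)
reject_null = p_value < 0.05
```

With the reported statistic of 4.68, the p-value comes out at about 0.2, well above 0.05, which agrees with the comparison against the 7.82 cutoff.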

By: Gary Farber

Gary Farber — Tue, 16 Jan 2007 20:06:51 +0000

Humble apologies for being off-topic; I’d prefer to e-mail, but don’t see an e-mail address.

Thanks muchly for your sidebar link! (I didn’t know you were even aware of my existence.) Trivial note: the first link, under my name, is broken.

Thanks again.

By: Statistical Modeling, Causal Inference, and Social Science

Statistical Modeling, Causal Inference, and Social Science — Wed, 10 Jan 2007 19:15:16 +0000

Theories of information and interestingness

Jean-Luc pointed me to Anomaly Hunt; or, How To Write a Research Paper. This brings me to the vague topic of what is interesting. They say that you haven’t understood a concept until you have been able to explain it…

By: Steve Sailer

Steve Sailer — Tue, 09 Jan 2007 22:59:26 +0000

No NBA team has ever come back from being down 3-0 at any playoff level. Two NHL teams have.

I believe there is more randomness in baseball results than in basketball because pitchers have such a huge influence on the outcome. It's common to see matchups of starting pitchers where the team that is inferior overall has a big advantage in a single game due to sending, say, its #1 starter out against the better team's #4 starter.

By: Bill Kaplan

Bill Kaplan — Tue, 09 Jan 2007 16:42:38 +0000

This is the excellent foppery of the world of baseball, that, when a team is sick in fortune — often the surfeit of its own behavior — it makes guilty of its disasters its prior disasters; and, despite they be champions, to lay its present circumstances on those immediately before.

By: Aaron Haspel

Aaron Haspel — Tue, 09 Jan 2007 15:56:15 +0000

Steve: Before we buy into this theory of mailing it in, we should probably check it against other sports, like basketball. In the NBA (and ABA) finals teams down 3-0 have come back to win the next game 6 out of 13, approximately what you'd expect. That's all I can be bothered to check, but I'll be less willing to credit the baseball results if they can't be reproduced in other sports.

Albatross: Unfortunately you can usually find "significant" regressions in even completely random data sets. Fortunately there are more rigorous tests that can help to weed out spurious ones. Econometricians run into this problem all the time.

DavidB: At least 50, though not in the sense you mean.

By: albatross

albatross — Tue, 09 Jan 2007 14:43:36 +0000

So, what we need is a formula for setting the required significance level based on how long the researcher can afford to sift through the data, looking for an anomaly, and how many models he can test per unit of time? Should the reviewer reject the paper if he can produce an equally significant observation from the data with no apparent meaning or theoretical significance?
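
One crude version of such a formula is a multiple-comparisons correction. The sketch below is only illustrative: the function names are invented for this example, and the Sidak formula assumes the tests are independent, which is rarely true of many models fit to the same dataset. The number of tests m would be the researcher's sifting time multiplied by models tested per unit of time.

```python
def familywise_error_rate(alpha, m):
    """Chance of at least one false positive across m independent
    tests, each run at significance level alpha."""
    return 1.0 - (1.0 - alpha) ** m


def sidak_alpha(alpha_family, m):
    """Per-test level that keeps the overall false-positive rate at
    alpha_family, assuming m independent tests (Sidak correction)."""
    return 1.0 - (1.0 - alpha_family) ** (1.0 / m)


def bonferroni_alpha(alpha_family, m):
    """Simpler, slightly stricter per-test level; it does not need the
    independence assumption (Bonferroni correction)."""
    return alpha_family / m


# Hypothetical example: 20 models tested, each at the usual 5% level.
m = 20
chance_of_spurious_hit = familywise_error_rate(0.05, m)
required_level = sidak_alpha(0.05, m)
```

With 20 tests at the 5% level, the chance of at least one spurious "significant" result is about 64%, so the per-test level has to drop to roughly 0.0026 to keep the overall rate at 5%.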

By: David B

David B — Tue, 09 Jan 2007 09:20:02 +0000

Why is it called the World Series? How many countries participate?

By: Steve Sailer

Steve Sailer — Tue, 09 Jan 2007 08:43:09 +0000

It's not hugely uncommon for a team to lose the first two on the road, then come home, win game 3, and go on to take the series in seven or even six games. But losing game 3 at home seems to be a psychological death blow. It will be interesting over the next several decades to see if the Red Sox rally from down 3-0 in 2004 will change that psychology.

The funny thing is that going all out in a baseball game just isn't that hard, except for the pitchers. You'd think baseball players wouldn't give up, but it looks like they sometimes do.

Still, when you work through the history of a sweep, you can see why the losers might pack it in.

In the past, when teams had four-man rotations in the regular season, they'd use their three best pitchers in the Series (there are off days after Games 2 and 5). If they all won, that could be depressing to the team that was down.

For example, in the 1963 World Series, the mighty Yankees lost to Sandy Koufax in the first game 5-2 in Yankee Stadium, with Koufax striking out 15, then lost to Johnny Podres in the second 4-1. Then they went to Dodger Stadium, and Don Drysdale beat them 1-0.

So, now the Yankees are down 3-0 on the road, the Dodgers are giving up 1.3 runs per game, and the opposing pitcher in Game 4 is, oh crap, Sandy Koufax again, who went 25-5 during the season. And if they manage to beat Koufax, then they've got to beat Podres in Game 5, who had 5 shutouts during the season, and then beat Drysdale in Game 6, who had won 25 the year before.

And, then, even if they somehow won three straight, they'd still have to beat Koufax again in Game 7. Not surprisingly, they lost Game 4 2-1 and were swept.

So you can see how teams down 3-0 would get depressed.

Nowadays, with five-man rotations, a team up 3-0 is likely to send its number 4 starter out for the 4th game (assuming both teams won the LCS quickly), while the desperate trailing team might send its ace out on 3 days' rest. So the immediate situation isn't so dire, but the long-term situation is even worse, because your pitchers will all be on short rest for the rest of the series, unless it rains.