Distribution requirements
Steven Dubner wrote a post today highlighting the bursty nature of the linescore for the Rangers’ recent 30 to 3 victory over the Orioles. Knowing that the Rangers scored 30 runs, Steven says he would have predicted an inning-by-inning score that looked something like 4 3 1 0 5 6 3 5 3, but the real score was 0 0 0 5 0 9 0 10 6.
Steven’s prediction was definitely off the mark. In baseball, even if you know the total number of runs, you still don’t know the total number of hits (or similarly, men on base). If you have a very even run distribution over the 9 innings, a lot more people probably got on base, since you usually need 2 or 3 players on base before the first run is scored. This makes scoring just a few runs in nearly every inning very unlikely.
Still, you can’t take this reasoning too far, as a linescore that looks like 0 0 0 0 30 0 0 0 0 is also very suspicious. Specifically, it can only happen 9 ways (all 30 runs could be scored in any one of the 9 innings). In contrast, Steven’s numbers could appear in all kinds of orderings; it’s just that the numbers themselves weren’t realistically chosen.
Obviously it’s extremely unlikely that anyone would guess the true linescore for a 30 to 3 game. The interesting question is this: Given a bunch of linescores, can you tell if they are real or fake? As Steven points out, when asked to fake a long sequence of coin flips, people tend to severely underestimate the number of times they should have many heads or tails appear in a row. Similarly, I’ve noticed that when people randomly put dots on a sheet of paper they tend to spread the marks more evenly than if they were truly random. As a result, it is often possible to detect falsified data, as human-generated numbers tend to deviate dramatically from the desired distribution.
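If you want to see the coin-flip effect for yourself, a quick simulation helps. The sketch below is only an illustration: the helper names, the choice of 100 flips, the 2000 trials, and the fixed seed are all my own assumptions. It estimates the average length of the longest streak of identical outcomes in a sequence of fair flips, which you can compare against a sequence you made up by hand.

```python
import random
from itertools import groupby

def longest_run(flips):
    """Length of the longest streak of identical consecutive outcomes."""
    return max((len(list(group)) for _, group in groupby(flips)), default=0)

def average_longest_run(n_flips=100, trials=2000, seed=0):
    """Average longest streak over many simulated sequences of fair flips."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same estimate
    total = 0
    for _ in range(trials):
        flips = [rng.choice("HT") for _ in range(n_flips)]
        total += longest_run(flips)
    return total / trials

print(longest_run("HTHHTHTTHT"))  # 2
print(average_longest_run())      # estimate for 100 fair flips
```

A hand-written sequence whose longest streak falls well below the simulated average is a reasonable candidate for a fake.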
By far the most interesting thing I’ve ever seen on this subject is Benford’s Law. In the 1930s, a scientist named Frank Benford observed that the first digit in all sorts of real-world measurements tended to obey a very specific distribution.
Specifically, he saw that the number 1 tends to appear as the first digit about 30 percent of the time, and the number 2 about 18 percent of the time. If you were forging your income taxes you’d probably be tempted to make up a bunch of random deductions. If you don’t want to get caught, however, you’d better make sure the number 1 appears with the correct frequency. This sounds far-fetched, but Benford’s Law has actually been used to identify fraudulent returns.
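For the curious, Benford’s Law is usually stated as: the leading digit d appears with probability log10(1 + 1/d). The Python sketch below computes those expected frequencies and compares them with the leading digits of a sample. The function names are hypothetical, and I use the first 1000 powers of 2 as stand-in data only because that sequence is a standard example that follows the law closely; it is not real tax data, and the code assumes nonzero integers.

```python
from collections import Counter
from math import log10

def benford_expected():
    """Expected leading-digit frequencies under Benford's Law."""
    return {d: log10(1 + 1 / d) for d in range(1, 10)}

def first_digit_frequencies(values):
    """Observed leading-digit frequencies (assumes nonzero integers)."""
    digits = [int(str(abs(v))[0]) for v in values]
    counts = Counter(digits)
    return {d: counts[d] / len(digits) for d in range(1, 10)}

expected = benford_expected()
print(round(expected[1], 3))  # 0.301
print(round(expected[2], 3))  # 0.176

# Stand-in sample: the first 1000 powers of 2.
sample = [2 ** n for n in range(1000)]
observed = first_digit_frequencies(sample)
for d in range(1, 10):
    print(d, round(observed[d], 3), round(expected[d], 3))
```

A set of made-up deductions whose leading digits are spread roughly evenly from 1 to 9 would stand out clearly against these expected frequencies.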