I often see these sorts of statistics that purport to show that some fraction of the X’s are disproportionately prone to event Y. One paper I read, for instance, reported that 10% of all police officers in a department account for 25% of all abuse complaints, and used that as evidence for the proposition that some police officers are especially prone to misbehavior. One can imagine similar claims when, say, 10% of all holes on a golf course account for 25% of all holes-in-one, or 10% of all slot machines account for 25% of all jackpots, and so on.
The trouble is that this data point, standing alone, is entirely consistent with the X’s being equally prone to Y. Even if, for instance, all the holes on a golf course are equally difficult (or all the police officers equally prone to abuse complaints), and holes-in-one (or complaints) are entirely randomly distributed across all holes (or officers), one can easily see a 10%/25% distribution, a 20%/80% distribution, or whatever else.
Consider a boundary case: Say that each police officer has a 10% chance of having a complaint this year. Then, on average 10% of all officers will have 100% of this year’s complaints. Likewise, say that each police officer has a 1% chance of having a complaint each year for 10 years, and the probabilities are independent from year to year (since complaints are entirely random, and all the officers are equally prone to them). Then, on average about 9.6% (1 – 0.99^10) of all police officers will have 100% of the complaints over the 10 years, since a fraction 0.99^10 (about 90.4%) of the officers will have no complaints.
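For readers who want to check that arithmetic, here is a minimal Python sketch (the 1% annual probability and the 10-year window are just the hypothetical numbers above, not real data):

```python
# Boundary case: each officer has a 1% chance of a complaint per year,
# independent across 10 years.
p_none = 0.99 ** 10   # fraction of officers with no complaints: ~0.904
p_some = 1 - p_none   # fraction with at least one complaint: ~0.096
print(f"{p_some:.1%} of officers hold 100% of the complaints")
```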
Or consider a less boundary case, where the math is still easy to follow. Say that you have 100 honest coins, each equally likely to come up heads or tails on any toss. You toss each coin twice. On average,
- 25 of the coins will come up heads twice, accounting for 50 heads.
- 50 of the coins will come up heads once and tails once, accounting for 50 heads.
- 25 of the coins will come up tails twice, accounting for no heads.
This means that 25% of the coins account for 50% of the heads — but because of randomness, not because some particular coins are more likely to turn up heads than others.
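Here is a quick simulation sketch of that coin example (the random seed and trial count are arbitrary choices of mine):

```python
import random

# 100 fair coins, each tossed twice; average over many trials.
# Under this null model the double-heads coins -- about a quarter of
# them -- still collect about half of all heads.
random.seed(0)
trials = 10_000
frac_coins = frac_heads = 0.0
for _ in range(trials):
    heads = [sum(random.random() < 0.5 for _ in range(2)) for _ in range(100)]
    double = [h for h in heads if h == 2]
    frac_coins += len(double) / 100
    frac_heads += sum(double) / max(sum(heads), 1)
print(f"double-heads coins: {frac_coins / trials:.1%} of all coins")  # ~25%
print(f"their share of heads: {frac_heads / trials:.1%}")             # ~50%
```

Raising the 100 coins to a million leaves those percentages essentially unchanged, which bears on the sample-size point in the update below.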
Likewise, we see the same in slightly more complicated models. Say that each police officer has a 10% chance of having a complaint each year, and we’re looking at results over 10 years. Then about 7% of all officers will have 3 or more complaints (that’s SUM (10-choose-i x 0.1^i x 0.9^(10-i)) as i goes from 3 to 10). But those 7% will account for about 22.5% of all complaints (that’s SUM (10-choose-i x 0.1^i x 0.9^(10-i) x i) as i goes from 3 to 10, divided by the expected total of 1 complaint per officer, i.e. 10 years x 0.1 per year). And again this is so even though each officer is equally likely to get a complaint in any year.
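The same figures fall out of a few lines of Python, using the exact binomial probabilities (this is just the calculation above, nothing more):

```python
from math import comb

# Complaints per officer over 10 years: Binomial(n=10, p=0.1).
n, p = 10, 0.1
pmf = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]

officers = sum(pmf[3:])                                     # P(3 or more complaints)
share = sum(i * pmf[i] for i in range(3, n + 1)) / (n * p)  # their share of complaints
print(f"{officers:.1%} of officers have 3 or more complaints")  # ~7.0%
print(f"and account for {share:.1%} of all complaints")         # ~22.5%
```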
Now of course it seems very likely that in fact some officers are more prone to complaints than others. My point is simply that this conclusion can’t flow from our observation of the 10%/25% disparity, or 7%/22.5% disparity, or even a 20%/80% disparity. We can reasonably believe it for other reasons (such as our knowledge of human nature), but not because of that disparity, because that disparity is entirely consistent with a model in which all officers are equally prone to complaints.
If you have more data, that data can indeed support the disproportionate-propensity conclusion. For instance, if nearly the same group of officers leads the complaint tallies each year (or nearly the same group of slot machines leads the payouts two months running), that’s generally not consistent with the random model I describe. Likewise, if you have more statistics of some other sort — for instance, if you know what the complaint rate per officer is, and can look at that together with the “X% of all officers yield Y% of the complaints” numbers — that too could be inconsistent with a random distribution.
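To illustrate that persistence test, here is a simulation sketch under assumed numbers of my own (10,000 officers, a 1% chance of a complaint per month, a “leader” meaning an officer with two or more complaints in a year; none of these come from any real data set). Under the equal-propensity model, one year’s leaders almost never repeat:

```python
import random

# Equal-propensity null model: every officer has the same 1% monthly
# complaint probability. Compare the "leaders" of two simulated years.
random.seed(0)
N, p = 10_000, 0.01

def leaders():
    """Officers with 2 or more complaints in a simulated 12-month year."""
    return {i for i in range(N)
            if sum(random.random() < p for _ in range(12)) >= 2}

y1, y2 = leaders(), leaders()
print(len(y1), len(y2), len(y1 & y2))  # e.g. around 60, 60, 0
# Under the null the expected overlap is about len(y1) * len(y2) / N,
# i.e. a fraction of one officer here. If the same officers led the
# tallies year after year, that would be evidence of real differences
# in propensity.
```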
But often we hear just a “10% of all X’s account for 25% of all Y’s” report, or some such, and are asked to infer from there that those 10% have a disproportionate propensity to Y. And that inference is not sound, because these numbers can easily be reached even if everyone’s propensity is equal.
UPDATE: (1) Some commenters suggested this phenomenon “depends on the sample size; if the sample size is large enough, the inference is sound.” That’s not quite right, I think.
The sample size in the sense of the number of police officers / golf holes / coins does not affect the result. I could give my coin example, where 25% of all coins yield 50% of all heads, with a million coins (each still tossed twice) rather than a hundred.
The sample size in the sense of the number of intervals during which an event can happen (e.g., the length of time the officers are on the force, if in the model there’s a certain probability of a complaint each year) does affect the result. But if the probability per interval is low enough, we can see this result even when there are many intervals.
Say, for instance, that there’s a 1% chance of a complaint per month for each officer, and we look at 240 months (representing an average police career of 20 years). Then even when all officers have precisely the same propensity to draw a complaint, 9.5% of all officers would have 5 or more complaints, and would account for over 21.5% of all complaints. So a 9.5%/21.5% split would be consistent with all officers having an identical propensity to generate complaints, even with a “sample size” of 240 intervals. If the monthly complaint probability were 0.005, then 12% of all officers (those with 3 or more complaints) would account for over 33% of all complaints.
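Those figures, too, can be checked with the exact binomial tail (the helper function below is mine; the 3-or-more cutoff in the second call is the one that reproduces the 12%/33% split):

```python
from math import comb

def tail_share(n, p, k):
    """Fraction of officers with k or more complaints under
    Binomial(n, p), and those officers' share of all expected complaints."""
    pmf = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    officers = sum(pmf[k:])
    share = sum(i * pmf[i] for i in range(k, n + 1)) / (n * p)
    return officers, share

print(tail_share(240, 0.01, 5))    # ~(0.095, 0.218): the 9.5%/21.5% split
print(tail_share(240, 0.005, 3))   # ~(0.120, 0.336): the 12%/33% split
```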
(2) More broadly, this isn’t a matter of “sample size” in the sense that we’d use the term when discussing significance testing, and talking about “statistical significance” wouldn’t be that helpful, I think. If you have a lot of data points, you can determine whether some difference between two sets of results over those data points is statistically significant. But here I’m talking about people’s drawing inferences from one piece of (aggregated) data — 10% of all X’s account for 25% of all Y’s. Statistical significance testing is not apt here.