I often see these sorts of statistics that purport to show that some fraction of the X's are disproportionately prone to event Y. One paper I read, for instance, reported that 10% of all police officers in a department account for 25% of all abuse complaints, and used that as evidence for the proposition that some police officers are especially prone to misbehavior. One can imagine similar claims when, say, 10% of all holes on a golf course account for 25% of all hole-in-ones, or 10% of all slot machines account for 25% of all jackpots, and so on.
The trouble is that this data point, standing alone, is entirely consistent with the X's being equally prone to Y. Even if, for instance, all the holes on a golf course are equally difficult (or all the police officers equally prone to abuse complaints), and hole-in-ones (or complaints) are entirely randomly distributed across all holes (or officers), one can easily see the 10%/25% distribution, or 20%/80% distribution, or whatever else.
Consider a boundary case: Say that each police officer has a 10% chance of having a complaint this year. Then, on average, 10% of all officers will have 100% of this year's complaints. Likewise, say that each police officer has a 1% chance of having a complaint each year for 10 years, and the probabilities are independent from year to year (since complaints are entirely random, and all the officers are equally prone to them). Then, on average, about 9.6% (1 - 0.99^10) of all police officers will have 100% of the complaints over the 10 years, since 0.99^10 (about 90.4%) of the officers will have no complaints.
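The arithmetic in the second boundary case can be checked directly. This is a minimal sketch; the variable names are mine, and the only inputs are the 1% yearly chance and the 10-year window from the example.

```python
# Assumed model (from the example): every officer has the same independent
# 1% chance of drawing a complaint in each of 10 years.
p_per_year = 0.01
years = 10

share_with_none = (1 - p_per_year) ** years   # officers with zero complaints
share_with_any = 1 - share_with_none          # officers holding 100% of complaints

print(f"{share_with_any:.3f}")  # 0.096
```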
Or consider a less extreme case, where the math is still intuitive. Say that you have 100 honest coins, each equally likely to turn up heads or tails. You toss each coin twice. On average,
25 of the coins will come up heads twice, accounting for 50 heads.
50 of the coins will come up heads once and tails once, accounting for 50 heads.
25 of the coins will come up tails twice, accounting for no heads.
This means that 25% of the coins account for 50% of the heads — but because of randomness, not because some particular coins are more likely to turn up heads than others.
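The coin tally can be reproduced exactly with fractions. This is only a sketch of the two-toss example; the function name head_count_distribution is my own, not anything from the post.

```python
from fractions import Fraction
from math import comb

def head_count_distribution(tosses):
    """Return {k: P(a fair coin shows exactly k heads in `tosses` tosses)}."""
    return {k: Fraction(comb(tosses, k), 2 ** tosses) for k in range(tosses + 1)}

dist = head_count_distribution(2)
expected_heads = sum(k * p for k, p in dist.items())   # average heads per coin

top_share_of_coins = dist[2]                        # coins that came up heads twice
top_share_of_heads = 2 * dist[2] / expected_heads   # their share of all heads

print(top_share_of_coins, top_share_of_heads)  # 1/4 1/2
```

Every coin here is identical by construction, so the 25%/50% split comes from chance alone.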
Likewise, we see the same in slightly more complicated models. Say that each police officer has a 10% chance of having a complaint each year, and we're looking at results over 10 years. Then 7% of all officers will have 3 or more complaints (that's SUM (10-choose-i x 0.1^i x 0.9^(10-i)) as i goes from 3 to 10). But those 7% will account for 22.5% of all complaints (that's SUM (10-choose-i x 0.1^i x 0.9^(10-i) x i) as i goes from 3 to 10). And again this is so even though each officer is equally likely to get a complaint in any year.
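The two sums above are easy to evaluate. Here is a short sketch, assuming the model as stated (10 independent years, 10% chance per year, at most one complaint per year); tail_shares is a name I made up for illustration.

```python
from math import comb

def tail_shares(n, p, k_min):
    """For X ~ Binomial(n, p), return (P(X >= k_min), E[X; X >= k_min] / E[X])."""
    pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    officer_share = sum(pmf[k_min:])
    complaint_share = sum(k * pmf[k] for k in range(k_min, n + 1)) / (n * p)
    return officer_share, complaint_share

officers, complaints = tail_shares(n=10, p=0.1, k_min=3)
print(f"{officers:.3f} {complaints:.3f}")  # 0.070 0.225
```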
Now of course it seems very likely that in fact some officers are more prone to complaints than others. My point is simply that this conclusion can't flow from our observation of the 10%/25% disparity, or 7%/22.5% disparity, or even a 20%/80% disparity. We can reasonably believe it for other reasons (such as our knowledge of human nature), but not because of that disparity, because that disparity is entirely consistent with a model in which all officers are equally prone to complaints.
If you have more data, that data can indeed support the disproportionate-propensity conclusion. For instance if nearly the same group of officers lead the complaint tallies each year (or nearly the same group of slot machines leads the payouts two months running), that's generally not consistent with the random model I describe. Likewise, if you have more statistics of some other sort — for instance, if you know what the complaint rate per officer is, and can look at that together with the "X% of all officers yield Y% of the complaints" numbers — that too could be inconsistent with a random distribution.
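To see why year-over-year persistence is informative, one can simulate two years under each model. This is a rough sketch: the 10% yearly rate matches the examples above, but the "mixed" department (90% of officers at 2%, 10% at 82%) is purely hypothetical, chosen only so that the average rate is still 10%.

```python
import random

def repeat_rate(propensities, rng):
    """Simulate two independent years. Return (share of officers with a
    complaint in year 2, that same share among year-1 complaint recipients)."""
    year1 = [rng.random() < p for p in propensities]
    year2 = [rng.random() < p for p in propensities]
    base = sum(year2) / len(year2)
    flagged = [y2 for y1, y2 in zip(year1, year2) if y1]
    return base, sum(flagged) / len(flagged)

rng = random.Random(0)   # fixed seed so the run is repeatable
n = 100_000
equal = [0.10] * n                                    # everyone equally prone
mixed = [0.02] * (n * 9 // 10) + [0.82] * (n // 10)   # hypothetical unequal mix

equal_base, equal_repeat = repeat_rate(equal, rng)
mixed_base, mixed_repeat = repeat_rate(mixed, rng)
print(equal_base, equal_repeat)
print(mixed_base, mixed_repeat)
```

In the equal-propensity department, last year's complaint recipients should be no more likely than anyone else to draw a complaint this year; in the mixed one they should be far more likely, even though both departments have the same overall rate.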
But often we hear just a "10% of all X's account for 25% of all Y's" report, or some such, and are asked to infer from there that those 10% have a disproportionate propensity to Y. And that inference is not sound, because these numbers can easily be reached even if everyone's propensity is equal.
UPDATE: (1) Some commenters suggested this phenomenon "depends on the sample size; if the sample size is large enough, the inference is sound." That's not quite right, I think.
The sample size in the sense of the number of police officers / golf holes / coins does not affect the result. I could give my coin example, where 25% of all coins yield 50% of all heads, with a million coins.
The sample size in the sense of the number of intervals during which an event can happen (e.g., the length of time the officers are on the force, if in the model there's a certain probability of a complaint each year) does affect the result. But if the probability per interval is low enough, we can see this result even when there are many intervals.
Say, for instance, that there's a 1% chance of a complaint per month for each officer, and we look at 240 months (representing an average police career of 20 years). Then even when all officers have precisely the same propensity to draw a complaint, 9.5% of all officers would have 5 or more complaints, and would account for over 21.5% of all complaints. So a 9.5%/21.5% split would be consistent with all officers having an identical propensity to generate complaints, even with a "sample size" of 240 intervals. If the monthly complaint probability were 0.005, then 12% of all officers (those with 3 or more complaints) would account for over 33% of all complaints.
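The 240-month figures follow from the same binomial model. In this sketch, the cutoff of 3 or more complaints for the 0.005 case is my reading of which group makes up the 12%; the function name is hypothetical.

```python
from math import comb

def tail_shares(n, p, k_min):
    """For X ~ Binomial(n, p), return (P(X >= k_min), E[X; X >= k_min] / E[X])."""
    pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    officer_share = sum(pmf[k_min:])
    complaint_share = sum(k * pmf[k] for k in range(k_min, n + 1)) / (n * p)
    return officer_share, complaint_share

months = 240
# (monthly probability, complaint cutoff); the second cutoff is an assumption
cases = [(0.01, 5), (0.005, 3)]
results = {p: tail_shares(months, p, k_min) for p, k_min in cases}

for p, (officers, complaints) in results.items():
    print(f"p={p}: {officers:.1%} of officers, {complaints:.1%} of complaints")
```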
(2) More broadly, this isn't a matter of "sample size" in the sense that we'd use the term when discussing significance testing, and talking about "statistical significance" wouldn't be that helpful, I think. If you have a lot of data points, you can determine whether some difference between two sets of results over those data points is statistically significant. But here I'm talking about people's drawing inferences from one piece of (aggregated) data -- 10% of all X's account for 25% of all Y's. Statistical significance testing is not apt here.
Common sense would tell us that some officers are more often than others in circumstances that could lead to complaints of abuse. The cop sitting at the desk all day and not out on patrol will have fewer encounters with the public in hostile situations. The cop on patrol will have more opportunities to be accused.
Or take the drunk who wanted to know why he was waking up every morning with a hangover. He kept careful records of what he drank as follows:
Monday: Scotch and soda, result: hangover
Tuesday: Whiskey and soda, result: hangover
Wednesday: Rye and soda, result: hangover
Thursday: Vodka and soda, result: hangover
Conclusion: Soda causes hangovers.
I suggest you fix the typo. [Whoops, fixed, thanks! -EV]
And it would be reckless to use this statistic alone. I doubt that it is. (And of course your slot machine example is exactly correct since the house randomly and frequently reprograms slot machines to pay out at better odds--a single machine might be paying out 2 cents on the dollar one day and 12 cents the next day).
Suppose, instead, that the expected number of complaints (=total #complaints/#officers) is higher (maybe 20). In this case, the 10/25 realization would indeed be significant. What is the expected number of complaints in the paper referred to?
Zathras: That's the problem -- the cited study didn't give any other data points that could lead us to reject the randomness hypothesis.
The percentages standing alone are not meaningful, but they can be if they also talk numbers. Incidentally, it's this same type of statistic that is widely used in the Gini curve to look at distribution of wealth and income. While there are problems with that analysis (chiefly that it is a static picture and does not account for movement between groups - problems that EV pointed out for this case as well), it can provide a static picture which can be combined with a dynamic understanding of events as well.
(sorry bob, couldn't resist)
For this reason I disagree with Professor Volokh that the 10%/25% formulation, without more, is necessarily misleading. It is only misleading when the sample size is too small to support the inference, or when it is unclear whether the sample size is large enough to support the inference. Sometimes it is obvious that the sample size is large enough, and in those cases it is unnecessary to include the sample size or any other data.
For example, it would be valid for a journalist to write that 10% of major league baseball players account for 25% of all home runs, in order to show that some players are better at hitting home runs than others. She would not have to include further data, because most baseball fans know the approximate number of games and at-bats in a baseball season, and the underlying statistical significance is intuitively obvious.
Whether or not it is used alone, the statistic is irrelevant, since it neither supports nor refutes the proposition. Given that it's likely to make some readers more supportive of the proposition, it's misleading (I wouldn't go so far as to say reckless, but its use is certainly troubling).
In most of the examples we've considered, the number of observations per individual is small relative to the number of individuals, which means that the limit argument almost certainly does not apply, even approximately.
**Yes I stole that from Scott Adams.
I recommend the book Innumeracy by John Allen Paulos that discusses the inability of the general public to understand basic statistics and the way that they misapprehend risk and misinterpret data.
A Poisson process may be a better model. It gives the probability that an officer gets any given number of complaints (possibly zero) in a year, so a single officer can contribute multiple times to the overall complaint count. That more closely matches the language of "25% of all complaints" (rather than "25% of those with complaints lodged against them").
My question is what percentage of all gun sales are by those same 2% of gun dealers?
Most gun dealers only sell a handful of guns in a year. The majority of gun sales are by a small number of high-volume gun dealers. (The same is true for pretty much everything - remember Pareto's law).
It's impossible to judge whether those dealers represent a problem without the additional information that the media never seems to report.
Why is that? Bias? Or simple laziness?
The problem with your counterargument is that people (and golf courses) aren't coins. That is, common sense teaches that people are not identical; some of them are miscreants, and some are not. Likewise, the holes on a golf course are not identical; some of them have more hazards than others. Thus, based on his or her common sense and common experience, a person reading the relevant statistic (10% of officers are responsible for 25% of the complaints) will rightly assume that there is a non-random distribution of complaints, and that the distribution is likely to stay that way even as the number of complaints increases.
In your example, any given coin is equally likely to populate the "2 coins give 2 heads" class; in real life, not every officer is equally likely to populate the "receives more complaints" class.
SenatorX: The sample size in the sense of the number of police officers / golf holes / coins does not affect the result. I could give my coin example, where 25% of all coins yield 50% of all heads, with a million coins.
The sample size in the sense of the number of intervals during which an event can happen (e.g., the length of time the officers are on the force, if in the model there's a certain probability of a complaint each year) does affect the result. But if the probability per interval is low enough, we can see this result even when there are many intervals.
Say, for instance, that there's a 1% chance of a complaint per month for each officer, and we look at 240 months (representing an average police career of 20 years). Then even with an entirely random model, 9.5% of all officers would have 5 or more complaints, yet they would account for over 21.5% of all complaints. So a 9.5%/21.5% split would be consistent with all officers having an identical propensity to generate complaints, even with 240 intervals to work from. If the monthly complaint probability were 0.005, then 12% of all officers would account for over 33% of all complaints.
As I mentioned in my response to JerryH, we could still infer that some police officers do have a higher propensity to generate complaints than others do. But we can't assume that from the 9.5%/21.5% split, or a 12%/33% split, or whatever else. Only if we have more data (e.g., evidence that the same officers tend to have the most complaints every year) would we be able to draw a sensible inference from that data.
Of course, that's nothing compared to the money lawyers can make. If there are a hundred different brands of detergent, people using about 3% of the brands will have an above-average cancer rate five years in a row. And about one in a thousand will have it ten years in a row. Same with obstetricians and birth defects.
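The detergent numbers come from a simple streak calculation. This sketch assumes, as the comment implicitly does, that each brand independently has a 50% chance of an above-average year; that assumption and the function name are mine.

```python
from fractions import Fraction

def streak_probability(years, p_above=Fraction(1, 2)):
    """Chance that one brand is above average in every one of `years` years,
    assuming independent years with probability p_above each."""
    return p_above ** years

brands = 100
for years in (5, 10):
    p = streak_probability(years)
    print(years, p, float(brands * p))
# 5 1/32 3.125
# 10 1/1024 0.09765625
```

So about 3 of 100 brands would show a five-year streak by chance, and any one brand has roughly a one-in-a-thousand chance of a ten-year streak.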
Westie: your common sense assumes its conclusion. That is, you notice differences and assume they're significant of something, when that's not always the case. I would agree that this accurately describes normal human behavior, but its prevalence doesn't make it any better-considered.
i) Great post Prof. V! If I were in charge of things, statistics would be taught much more extensively and early in high school and college than it currently is.
ii) The examples (coins and police officers) all involve sums or averages of binomially distributed variables. There is a small sample size problem that the commenters started to zero in on, and that Prof. V. addressed somewhat in the update. But I think the plainest way to say it is that formulating a meaningful, relevant "x% of X accounts for y% of Y" statistic would involve individual estimation of the binomial probability p for each member of the population. In other words, adding more police officers or coins to our data set isn't going to help. We need more flips of each coin or more years observing each officer. If we individually estimate p (probability of success/failure) for each coin or officer, standard statistical tests can tell you how likely it is that the p values would arise by chance.
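One simple way to act on this, short of a formal test, is to compare the spread of the observed counts with the spread a single shared p would produce. This is a sketch of a dispersion ratio, not a full significance test; the function name and the second data set are hypothetical.

```python
def dispersion_ratio(counts, trials):
    """Observed variance of per-individual success counts divided by the
    binomial variance trials * p * (1 - p), with p estimated from the pooled
    data. A ratio near 1 is what a single shared p would give."""
    n = len(counts)
    mean = sum(counts) / n
    observed_var = sum((c - mean) ** 2 for c in counts) / n
    p_hat = mean / trials
    return observed_var / (trials * p_hat * (1 - p_hat))

# The post's coin example: 100 fair coins, two tosses each, average outcome.
fair_coins = [2] * 25 + [1] * 50 + [0] * 25
# Hypothetical contrast: half the coins always land heads, half never do.
rigged_coins = [2] * 50 + [0] * 50

fair_ratio = dispersion_ratio(fair_coins, trials=2)
rigged_ratio = dispersion_ratio(rigged_coins, trials=2)
print(fair_ratio, rigged_ratio)  # 1.0 2.0
```

With only two tosses per coin the ratio is noisy in real data, which is the commenter's point: more observations per individual are what make the comparison informative.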
iii) Another way to eyeball whether "x% of X accounts for y% of Y" statistics make sense for a given data set is just to construct the Lorenz curve for the data. This curve shows how y varies with x, so you can read off an "x% of X accounts for y% of Y" figure for any value of x you want. If the Lorenz curve has hops, skips, or jumps, it's suggestive that you might not have enough data points to have confidence in your statistic. If it is relatively smooth, it may be meaningful to invoke the statistic.
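Building such a curve takes only a few lines. This sketch sorts individuals from most to fewest events so that "top x%" figures can be read off directly (a conventional Lorenz curve sorts the other way); lorenz_points is a name I chose, and the data are the post's coin example.

```python
def lorenz_points(counts):
    """Return [(cumulative share of individuals, cumulative share of events)],
    with individuals ordered from most events to fewest."""
    ordered = sorted(counts, reverse=True)
    total = sum(ordered)
    points, running = [(0.0, 0.0)], 0
    for i, c in enumerate(ordered, start=1):
        running += c
        points.append((i / len(ordered), running / total))
    return points

# 100 fair coins tossed twice, average outcome: 25 / 50 / 25.
coin_heads = [2] * 25 + [1] * 50 + [0] * 25
points = lorenz_points(coin_heads)
print(points[25])  # (0.25, 0.5)
```

Note that this curve has visible kinks at 25% and 75% because each coin can only score 0, 1, or 2, which is the kind of lumpiness the comment warns about.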
I would put it this way: Some data is random and some has an assignable cause. Statistical outliers should be investigated. If ten percent of your staff is taking 80 percent of the sick days, you should probably have a talk with that ten percent. Rank order whatever data you're analyzing. Focus on the cops with the most complaints. Changing their approach to handling the public has the possibility of making the biggest reduction in complaints.
What normal curve? Who's talking about a normal curve?
However, as is usual with such things the original statement was very poorly put and merely shows that EV is astatistical.
On a guess, I decided to look for trends in the one place a safety professional never should look — the people. After pulling 5 years of injury and illness data, it was plain that for this group of 3000 people, only 6 percent of the population was responsible for 96% of the injuries. Shocked, I re-ran the numbers and the data still came out the same. Then I looked at how many of that 6% were in my building and, shockingly, I had 70% of them. As a co-op, I was the safety person responsible for the Second Shift. We were averaging a few injuries a week at that point — about the same as the other two shifts but with no patterns we could detect. Armed with my data (I shared it with my boss and another co-op who both dismissed it), I went out every night and personally spoke with every person on my list — I asked how they were doing, if they were having any issues with their jobs, about their kids... basically BS'd with them and I did it every night.
What happened next was dramatic. We went from having a few injuries per week to having none for my last three months in the facility (I left for a better job). The change occurred almost instantly once I started working my "list".
One other thing we noted: when we cross-checked the "list" with FMLA data, that 6% accounted for 82% of the FMLA time and 91% of the lost work days for either work-related injury or personal medical reasons.
I've held a lot of safety roles since that job, each increasing in responsibility. And every time I start a new job, I do a quick, discreet check of the historical data to see if this pattern is present. Fortunately, I've never found the pattern in the years since.
I think your point (2) misses the mark. Journalism is not scholarship; the math should be correct, but the reporter doesn't have to "show his work." If the 10%/25% inference is in fact sound -- which it is for some populations -- it doesn't strike me as highly blameworthy for a reporter to draw that inference without providing additional data.
Interestingly, one of your examples -- holes in one-- comes close to illustrating my point. Holes in one are unusual so the 10%/25% inference may not be valid with respect to them. But with respect to more common occurrences like birdies, a 10%/25% distribution would very likely indicate significant differences among the difficulty of golf holes (holding constant the skill of the golfers). It would be nice to see the calculations, but I wouldn't blame a reporter for making this point without providing additional information.
An example of a power-law distribution can be seen by looking at the number of kills made by fighter pilots in WW I & II (where there are a large number of samples). A relatively small number of pilots accounted for most of the kills. As you move down the "tail," there are a large number of pilots with a few kills and many with none at all. Given the large number of examples and the time involved this cannot all be luck. Some pilots are just better -- a lot better -- than others, and it shows. For a much better explanation and more examples I recommend Chris Anderson's THE LONG TAIL and N. N. Taleb's THE BLACK SWAN.
Thus in your example, analyzing it with a Gaussian model shows it as being random noise, while a power law model explains very well that there are a few bad cops who are causing the problems. Of course they may also be the most effective ones as well, but that's another problem. FWIW Malcolm Gladwell did an article (in the New Yorker, I think) showing that in prisons most complaints of brutality can be traced to a small number of guards.
Also, for whatever reason, it looks like I've been readmitted to the fold if you are reading this.
I'm confused by FredR's comment, because I haven't noticed anyone talking about the Gaussian distribution to model anything. Well-developed statistical methods can determine how well an experimentally measured distribution (e.g. of coin flips or police complaints) fits to the equations for a power-law distribution, or a normal distribution. I don't think these methods are particularly relevant here.
Eugene was just using complaints against police as one example of a frequent phenomenon, but even there you're missing the point. I'm not a cop, but in every job involving public contact I've seen, there are earned complaints and there are undeserved complaints. I'd expect that some citizens just have unreasonable expectations of cops. I've known people that have an animus against cops in general that may be a motive for trying to get a cop in trouble, or that are willing to fabricate a case against a cop in a particular instance to make a political point. And some complaints are flat out lies by criminals trying to gain leverage against the department in hopes they can negotiate a reduction in charges.
Presumably the unjustified complaints will be randomly distributed across all the officers exposed to the situations where such complaints arise, while bad cops will repeatedly receive justified ones. So a way to test whether a large police department has a good process for sorting justified from unjustified complaints and disciplining or getting rid of bad cops is to compare the distribution of complaints across the department to what would be expected if they were all undeserved. The excess number of cops with many complaints then gives an estimate of how many bad cops there are, which can be compared to the number of disciplinary actions arising from complaints.
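The comparison described here might be sketched as follows. Everything in it is an assumption for illustration: the complaint counts are invented, and the baseline treats complaints as Poisson with one shared rate set to the department's overall average. Because that average includes any deserved complaints, the baseline is too generous and the excess is, if anything, understated.

```python
from math import exp, factorial

def expected_at_least(k_min, mean, n_officers):
    """Expected number of officers with >= k_min complaints if every officer's
    complaints are Poisson with the same mean."""
    p_below = sum(exp(-mean) * mean**k / factorial(k) for k in range(k_min))
    return n_officers * (1 - p_below)

# Hypothetical department of 1,000 officers: complaints per officer.
counts = [0] * 620 + [1] * 230 + [2] * 70 + [3] * 30 + [5] * 30 + [8] * 20

threshold = 5
mean = sum(counts) / len(counts)
observed = sum(1 for c in counts if c >= threshold)
expected = expected_at_least(threshold, mean, len(counts))
excess = observed - expected

print(f"observed {observed}, expected {expected:.1f}, excess {excess:.1f}")
```

The excess is only a rough estimate of how many officers have an elevated propensity; it says nothing about which complaints were justified.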
Reminds me of E. O. Wilson's (an expert on ants) pithy evaluation of socialism. "Good idea, wrong species."
Just asserting that you're right doesn't convince me. Why are power laws "right" and other distributions "wrong"? Keep in mind that in the OP, Prof. Volokh was simply making up his own hypotheticals to show how "x% of X accounts for y% of Y" statistics can be misleading.
What do power laws have to do with that? Where is your proof that Prof. Volokh's hypotheticals are invalid? Is there a Fundamental Theorem (or Very Large Incontrovertible Data Set) of Cop Complaints Distributions or something that I don't know?
Also, I still don't see what an example with coins has to do with the Gaussian distribution. Coins are usually assumed to follow a binomial distribution.