Multiple Comparison (or Multiple Testing) Issue


http://en.wikipedia.org/wiki/Multiple_comparisons

Here is a practical example. Suppose you compare the search quality of two search engines many times: one engine is genuinely good, the other genuinely bad. If you repeat the comparison often enough, you will eventually observe the bad engine winning a comparison purely through accumulated statistical testing error, even though it loses in the other 99 cases.
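As a rough illustration, here is a toy simulation (all scores and parameters are hypothetical, not from the original example): even though one engine is truly better on average, the worse engine still comes out ahead in a handful of the 100 repeated comparisons just from sampling noise.

```python
# Toy simulation of the search-engine example (hypothetical numbers):
# the "good" engine truly scores higher on average, yet across 100
# repeated comparisons the "bad" engine still wins a few of them
# purely through sampling noise.
import random

random.seed(0)
n_comparisons = 100
n_queries = 50          # queries sampled per comparison
wins_for_bad = 0

for _ in range(n_comparisons):
    good_scores = [random.gauss(0.60, 0.15) for _ in range(n_queries)]
    bad_scores = [random.gauss(0.55, 0.15) for _ in range(n_queries)]
    if sum(bad_scores) > sum(good_scores):   # the bad engine looks better this round
        wins_for_bad += 1

print(f"bad engine 'won' {wins_for_bad} of {n_comparisons} comparisons")
```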

The standard solution to this problem is to decrease \alpha for each individual test so that the total (family-wise) type I error rate cannot exceed \alpha, e.g., the Bonferroni correction.
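A minimal sketch of the Bonferroni correction (the p-values below are made up for illustration): each of the m tests is run at level \alpha / m, which keeps the family-wise type I error rate at or below \alpha.

```python
# Bonferroni correction: with m tests, run each individual test at
# significance level alpha / m so the family-wise type I error rate
# stays at or below alpha.
alpha = 0.05                             # desired overall type I error rate
p_values = [0.001, 0.02, 0.04, 0.30]     # hypothetical p-values from m tests
m = len(p_values)

per_test_alpha = alpha / m               # Bonferroni-adjusted threshold
for i, p in enumerate(p_values):
    rejected = p < per_test_alpha
    print(f"test {i}: p = {p:.3f}, reject null: {rejected}")
```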

Wikipedia has a great explanation of this problem:

For example, one might declare that a coin was biased if in 10 flips it landed heads at least 9 times. Indeed, if one assumes as a null hypothesis that the coin is fair, then the probability that a fair coin would come up heads at least 9 out of 10 times is [latex](10+1)\times(1/2)^{10}=0.0107[/latex]. This is relatively unlikely, and under statistical criteria such as p-value < 0.05, one would declare that the null hypothesis should be rejected — i.e., the coin is unfair. A multiple-comparisons problem arises if one wanted to use this test (which is appropriate for testing the fairness of a single coin), to test the fairness of many coins. Imagine if one was to test 100 fair coins by this method. Given that the probability of a fair coin coming up 9 or 10 heads in 10 flips is 0.0107, one would expect that in flipping 100 fair coins ten times each, to see a particular (i.e., pre-selected) coin come up heads 9 or 10 times would still be very unlikely, but seeing some coin behave that way, without concern for which one, would be more likely than not. Precisely, the likelihood that all 100 fair coins are identified as fair by this criterion is [latex](1-0.0107)^{100} \approx 0.34[/latex]. Therefore the application of our single-test coin-fairness criterion to multiple comparisons would more likely than not falsely identify at least one fair coin as unfair.
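The numbers in the quoted example can be reproduced in a few lines of Python (a quick check, counting the 9-heads and 10-heads outcomes directly):

```python
# Reproduces the figures from the quoted Wikipedia example:
# P(a fair coin shows >= 9 heads in 10 flips) and
# P(at least one of 100 fair coins is flagged as unfair by that criterion).
from math import comb

p_single = (comb(10, 9) + comb(10, 10)) / 2**10   # = 11/1024 ≈ 0.0107
p_all_look_fair = (1 - p_single) ** 100           # ≈ 0.34
p_some_flagged = 1 - p_all_look_fair              # ≈ 0.66

print(f"P(>= 9 heads in 10 flips)          = {p_single:.4f}")
print(f"P(all 100 coins look fair)         = {p_all_look_fair:.4f}")
print(f"P(at least one coin looks unfair)  = {p_some_flagged:.4f}")
```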