Proportion estimation in R

Given a sequence of 1s (heads) and 0s (tails) such as 101011110101…, we want to estimate the proportion of 1s in the population, i.e., how likely the given coin is to land heads.

Let X be a random variable that equals 1 for heads and 0 for tails. If the probability of observing heads is p, then X ~ Bernoulli(p). If we flip the coin n times and sum the outcomes, then SUM ~ B(n, p). If np >= 5 and n(1-p) >= 5, then by the Central Limit Theorem SUM is approximately N(np, np(1-p)). The sample proportion Y = SUM/n is therefore approximately N(p, p(1-p)/n). Since p itself is unknown, we plug in the observed proportion p_hat, which gives the approximate 95% confidence interval p_hat +/- 1.96 * sqrt(p_hat(1-p_hat)/n).
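As a minimal sketch of the normal-approximation interval derived above, the formula can be evaluated by hand in R. The count of 87 heads out of 100 flips and the variable names (`p_hat`, `se`, `wald_ci`) are chosen purely for illustration.

```r
# Normal-approximation (Wald) 95% interval for a proportion.
# Assumed illustrative data: 87 heads in 100 flips.
heads <- 87
n     <- 100

p_hat <- heads / n                      # sample proportion
se    <- sqrt(p_hat * (1 - p_hat) / n)  # estimated standard error
z     <- qnorm(0.975)                   # about 1.96 for a 95% interval

wald_ci <- c(lower = p_hat - z * se, upper = p_hat + z * se)
print(wald_ci)
```

By construction this interval is symmetric around p_hat, which is worth keeping in mind when comparing it with what prop.test reports.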

To estimate a proportion in R:

> heads <- rbinom(1, size=100, prob = .8)
> prop.test(heads, 100)

 1-sample proportions test with continuity
 correction

data:  heads out of 100, null probability 0.5 
X-squared = 53.29, df = 1, p-value = 2.878e-13
alternative hypothesis: true p is not equal to 0.5 
95 percent confidence interval:
 0.7843987 0.9262321 
sample estimates:
   p 
0.87

What’s interesting here is that the interval is not symmetric around the estimate:

> .87-0.7843987
[1] 0.0856013
> 0.9262321 -.87
[1] 0.0562321

This is because prop.test does not use the Gaussian interval described above. Instead, it inverts the score test, which yields the Wilson score interval (here with continuity correction):
http://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/prop.test.html
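To see where the asymmetric bounds come from, here is a sketch that recomputes the interval by hand using the Wilson score formula with continuity correction, for 87 heads out of 100. The variable names are illustrative, and the formula is written from the standard textbook form rather than copied from R's source, so treat it as a cross-check rather than a description of the implementation.

```r
# Wilson score interval with continuity correction, computed by hand.
x <- 87    # observed heads (matches the sample estimate 0.87 above)
n <- 100   # number of flips

p_hat <- x / n
z     <- qnorm(0.975)
cc    <- 0.5 / n          # continuity correction on the proportion scale
z22n  <- z^2 / (2 * n)

p_lo <- p_hat - cc        # corrected centre for the lower bound
p_hi <- p_hat + cc        # corrected centre for the upper bound

lower <- (p_lo + z22n - z * sqrt(p_lo * (1 - p_lo) / n + z22n / (2 * n))) / (1 + 2 * z22n)
upper <- (p_hi + z22n + z * sqrt(p_hi * (1 - p_hi) / n + z22n / (2 * n))) / (1 + 2 * z22n)

manual_ci <- c(lower, upper)
print(manual_ci)
print(prop.test(x, n)$conf.int)   # for comparison
```

The interval is pulled toward 0.5, so for an estimate near 1 the lower side is longer than the upper side.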

If two or more proportions are involved, we can test whether they are equal (see the prop.test manual in R):

## Data from Fleiss (1981), p. 139.
## H0: The null hypothesis is that the four populations from which
##     the patients were drawn have the same true proportion of smokers.
## A:  The alternative is that this proportion is different in at
##     least one of the populations.
> smokers  <- c( 83, 90, 129, 70 )
> patients <- c( 86, 93, 136, 82 )
> prop.test(smokers, patients)

 4-sample test for equality of proportions without
 continuity correction

data:  smokers out of patients 
X-squared = 12.6004, df = 3, p-value = 0.005585
alternative hypothesis: two.sided 
sample estimates:
   prop 1    prop 2    prop 3    prop 4 
0.9651163 0.9677419 0.9485294 0.8536585

Since the p-value is below 0.05, we reject H0 in favor of A. Note the relationship between prop.test and chisq.test (the chi-square test on a contingency table): http://stats.stackexchange.com/questions/2391/what-is-the-relationship-between-a-chi-square-test-and-test-of-equal-proportions
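The relationship can be checked directly with the smokers data: arranging the counts as a 2 x 4 contingency table (smokers vs. non-smokers in each group) and running chisq.test should reproduce the prop.test statistic. This is a small sketch; the table name `tab` and the row labels are illustrative.

```r
# Compare prop.test with chisq.test on the equivalent contingency table.
smokers  <- c(83, 90, 129, 70)
patients <- c(86, 93, 136, 82)

# Rows: smokers / non-smokers; columns: the four patient groups.
tab <- rbind(smokers = smokers, nonsmokers = patients - smokers)

pt <- prop.test(smokers, patients)
ct <- chisq.test(tab)

print(c(prop.test = unname(pt$statistic), chisq.test = unname(ct$statistic)))
print(c(prop.test = pt$p.value, chisq.test = ct$p.value))
```

With more than two groups neither function applies a continuity correction, so the two tests agree exactly here.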