Before we get started, open up Gaussian distribution and Chi, t, F distributions if you need some reference on the math.
One sample t-test
If we don't know the variance of the population (which is usually the case), the test statistic for $H_0: \mu = \mu_0$ is

$$ t = \frac{\bar{X} - \mu_0}{s / \sqrt{n}} \sim t(n - 1) $$

where $s$ is the sample standard deviation and $n$ is the number of samples.
As an example, to test if the mean of “1, 3, 2, 7, 8, 9, 3, 4, 5” is 5, we should test if they’re normally distributed:
> x = c(1, 3, 2, 7, 8, 9, 3, 4, 5)
> shapiro.test(x)

        Shapiro-Wilk normality test

data:  x
W = 0.9409, p-value = 0.5917
As the p-value > 0.05, we cannot reject H0, i.e., there is no evidence against normality. See Testing Normality for additional ways of testing normality.
Now, apply the t-test:
> t.test(x, mu=5)

        One Sample t-test

data:  x
t = -0.3592, df = 8, p-value = 0.7287
alternative hypothesis: true mean is not equal to 5
95 percent confidence interval:
 2.526785 6.806548
sample estimates:
mean of x 
 4.666667 
As the p-value is 0.7287 > 0.05, H0 is not rejected, meaning that we cannot conclude the true mean differs from 5.
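To see what t.test is computing, here is a minimal sketch of the formula above done by hand; it should reproduce the t = -0.3592 and p-value = 0.7287 reported above:

> t.stat = (mean(x) - 5) / (sd(x) / sqrt(length(x)))   # (sample mean - mu0) / (s / sqrt(n))
> t.stat
> 2 * pt(-abs(t.stat), df=length(x) - 1)               # two-sided p-value from t(n-1)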
Or to see if the mean of x is larger than 5:
> t.test(x, mu=5, alternative="greater")

        One Sample t-test

data:  x
t = -0.3592, df = 8, p-value = 0.6356
alternative hypothesis: true mean is greater than 5
95 percent confidence interval:
 2.941079      Inf
sample estimates:
mean of x 
 4.666667 
In this case, we cannot conclude that the true mean is greater than 5, as the p-value > 0.05.
Independent two sample t-test
Here, we want to know if the means of $X$ and $Y$ are the same when $X$ and $Y$ are independent and $X \sim N(\mu_X, \sigma_X^2)$, $Y \sim N(\mu_Y, \sigma_Y^2)$.
1) If we know $\sigma_X^2$ and $\sigma_Y^2$.
Under $H_0: \mu_X = \mu_Y$, the test statistic is

$$ z = \frac{\bar{X} - \bar{Y}}{\sqrt{\sigma_X^2 / n + \sigma_Y^2 / m}} \sim N(0, 1) $$

where $n$ and $m$ are the sizes of the two samples.
However, we usually don't know $\sigma_X^2$ and $\sigma_Y^2$.
2) We don't know $\sigma_X^2$ and $\sigma_Y^2$, but $n$ and $m$ are big enough.
Then the test statistic is:

$$ z = \frac{\bar{X} - \bar{Y}}{\sqrt{s_X^2 / n + s_Y^2 / m}} \approx N(0, 1) $$

Usually 30 is the magic number used to decide whether a sample size is big enough.
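As a minimal sketch of case 2 (the vectors big.x and big.y and the simulated values below are made up for illustration, not part of this post's example data), the z statistic can be computed directly:

> set.seed(1)
> big.x = rnorm(50, mean=5, sd=2)   # hypothetical sample with n = 50 >= 30
> big.y = rnorm(60, mean=5, sd=3)   # hypothetical sample with m = 60 >= 30
> z = (mean(big.x) - mean(big.y)) / sqrt(var(big.x)/50 + var(big.y)/60)
> 2 * pnorm(-abs(z))                # two-sided p-value from N(0, 1)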
3) We don't know $\sigma_X^2$ and $\sigma_Y^2$, but $\sigma_X^2 = \sigma_Y^2$.
The test statistic can be written as

$$ t = \frac{\bar{X} - \bar{Y}}{s_p \sqrt{1/n + 1/m}} \sim t(n + m - 2) $$

where $s_p^2$ is the so-called pooled sample variance:

$$ s_p^2 = \frac{(n - 1) s_X^2 + (m - 1) s_Y^2}{n + m - 2} $$

(Note: We're still assuming that $X$ and $Y$ follow normal distributions. See the assumptions of the t-test.)
As an example, let’s test if the means are the same for “1, 3, 2, 7, 8, 9, 3, 4, 5” and “1, 2, 4, 3, 2, 5, 6, 7, 8, 2, 3, 5”.
Let’s test if the variances are the same:
> var.test(x, y)

        F test to compare two variances

data:  x and y
F = 1.5787, num df = 8, denom df = 11, p-value = 0.4734
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.4308902 6.6990915
sample estimates:
ratio of variances
          1.578704
As the p-value > 0.05, we cannot reject the hypothesis that their variances are the same.
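The F statistic here is just the ratio of the two sample variances (with x and y as defined in the next code block), so you can verify it directly:

> var(x) / var(y)   # same as the "ratio of variances" reported by var.test
[1] 1.578704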
If we need to test normality, we want to see if the error term $\epsilon$ is normally distributed. In this example, we know that the variance is the same for $X$ and $Y$. So, when using shapiro.test, we can think of this t-test as a simplified version of ANOVA: $X_i = \mu_X + \epsilon_i$ and $Y_j = \mu_Y + \epsilon_j$ where $\epsilon \sim N(0, \sigma^2)$. As $X - \mu_X$ and $Y - \mu_Y$ are normally distributed with the same mean and variance, we can put them together and test normality once. Suppose that we have data "1, 3, 2, 7, 8, 9, 3, 4, 5" and "1, 2, 4, 3, 2, 5, 6, 7, 8, 2, 3, 5". Then, run shapiro.test like below:
> x = c(1, 3, 2, 7, 8, 9, 3, 4, 5)
> y = c(1, 2, 4, 3, 2, 5, 6, 7, 8, 2, 3, 5)
> shapiro.test(c(x - mean(x), y - mean(y)))

        Shapiro-Wilk normality test

data:  c(x - mean(x), y - mean(y))
W = 0.9426, p-value = 0.2452
In this case, H0 is not rejected: there is no evidence against normality.
Another way of doing this is using lm:
> f = data.frame(val=c(x, y), klass=c(rep("x", NROW(x)), rep("y", NROW(y))))
> f
   val klass
1    1     x
2    3     x
3    2     x
4    7     x
5    8     x
6    9     x
7    3     x
8    4     x
9    5     x
10   1     y
11   2     y
12   4     y
13   3     y
14   2     y
15   5     y
16   6     y
17   7     y
18   8     y
19   2     y
20   3     y
21   5     y
> # As klass is a factor variable, lm fits val = mu + alpha * klass + epsilon
> # (klass is dummy-coded as 0 or 1), so the residuals are the centered x and y values.
> shapiro.test(resid(lm(val ~ klass, data=f)))

        Shapiro-Wilk normality test

data:  resid(lm(val ~ klass, data = f))
W = 0.9426, p-value = 0.2452
As you can see, using lm gives the same result as subtracting the mean from x and y separately.
If the variances were different, we would run shapiro.test on each of x and y separately.
Now, the t-test:
> t.test(x, y, var.equal=TRUE)

        Two Sample t-test

data:  x and y
t = 0.6119, df = 19, p-value = 0.5479
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.613802  2.947136
sample estimates:
mean of x mean of y 
 4.666667  4.000000 
Its confidence interval includes zero and the p-value > 0.05, meaning that we cannot reject H0 that their means are the same.
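To tie this back to the formulas in case 3, here is a sketch that computes the pooled variance and the t statistic by hand; it should reproduce t = 0.6119 from the output above:

> n = length(x); m = length(y)
> sp2 = ((n - 1) * var(x) + (m - 1) * var(y)) / (n + m - 2)   # pooled sample variance
> (mean(x) - mean(y)) / (sqrt(sp2) * sqrt(1/n + 1/m))         # should match t = 0.6119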
4) If we know that $\sigma_X^2 \neq \sigma_Y^2$.
We're still assuming that $X$ and $Y$ follow normal distributions and that they're independent. As their variances are not the same, we just use the fact that $\bar{X} - \bar{Y} \sim N(\mu_X - \mu_Y, \sigma_X^2 / n + \sigma_Y^2 / m)$.
Because we do not know their variances, use the sample variances:

$$ t = \frac{\bar{X} - \bar{Y}}{\sqrt{s_X^2 / n + s_Y^2 / m}} $$

which approximately follows a t distribution with degrees of freedom given by the Welch–Satterthwaite approximation (Welch's t-test).
Code for R is the same, except that we use t.test(x, y, var.equal=FALSE).
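As a sketch of what t.test does when var.equal=FALSE (which is actually its default), the Welch statistic and the Welch–Satterthwaite degrees of freedom can be computed by hand and compared with t.test(x, y, var.equal=FALSE):

> vx = var(x) / length(x)
> vy = var(y) / length(y)
> t.stat = (mean(x) - mean(y)) / sqrt(vx + vy)                        # Welch's t statistic
> df = (vx + vy)^2 / (vx^2/(length(x) - 1) + vy^2/(length(y) - 1))    # Welch–Satterthwaite df
> 2 * pt(-abs(t.stat), df)                                            # two-sided p-value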
But one should think really hard about why they want to compare means in the first place when the variances are different.
Paired sample t-test
I think this is the data that any intelligent engineer will try to get from their experiment. Paired samples have data in this form: $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$. For example, it could be data of (old method performance, new method performance) observed on several machines.
If $X$ and $Y$ are normally distributed, the difference $D = X - Y$ follows a normal distribution. Even when that's not the case, the Central Limit Theorem says that the sample average is approximately normal. Therefore $\bar{D} \sim N(\mu_D, \sigma_D^2 / n)$, at least approximately.
As we do not know the variance of $D$, use the sample variance to get

$$ t = \frac{\bar{D} - \mu_D}{s_D / \sqrt{n}} \sim t(n - 1) $$

and test $H_0: \mu_D = 0$.
In R (I am assuming $X$ and $Y$ are normally distributed; without that, one should run a normality test first, as we have a small number of data points in this example):
> x = c(1, 2, 3, 4, 3, 2)
> y = c(5, 3, 2, 3, 1, 7)
> t.test(x, y, paired=TRUE)

        Paired t-test

data:  x and y
t = -0.8452, df = 5, p-value = 0.4366
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -4.041553  2.041553
sample estimates:
mean of the differences 
                     -1 
We conclude that we cannot reject H0 that the true means are the same. In human language, "there is no evidence that their means differ".
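Since the paired t-test is just a one-sample t-test on the differences, the following gives the same t, df, and p-value as the call above:

> t.test(x - y, mu=0)   # equivalent to t.test(x, y, paired=TRUE)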
If the assumptions do not hold
All the methods above rely on assumptions such as a large sample size or normally distributed data.
If such assumptions look invalid, one could use non-parametric rank tests instead. For example, for the paired t-test case above, the Wilcoxon signed rank test:
> x = c(1, 2, 3, 4, 3, 2)
> y = c(5, 3, 2, 3, 1, 7)
> library(BSDA)
> wilcox.test(x, y, paired=TRUE)

        Wilcoxon signed rank test with continuity correction

data:  x and y
V = 8, p-value = 0.6716
alternative hypothesis: true location shift is not equal to 0
See Rank Tests for more examples.