서평: GGPLOT2, Elegant Graphics for Data Analysis (Use R!)

GGPLOT2는 R을 위한 문법 기반의 그래픽 시스템입니다. 기본적으로 포함된 R의 plotting function 들이 보통 하나의 함수안에서 모든 기능을 다 넣기때문에 plot을 여러가지로 변형하거나 재사용하거나 확장하기가 어려웠던 반면, gpplot2는 graphics자체를 다시 생각하고 차트의 요소를 geom, statistics, scales, coordinate system, faceting, position, aesthetics등으로 분리했습니다. 그리고 각 차트는 이러한 요소들의 조합으로 그려지게 됩니다. 그렇기 때문에 각 요소를 손쉽게 조합해서 더 복잡한 차트를 만들거나 더 잘 customize할 수 있게 되었습니다.

제가 아는한은 ggplot2에 대한 책은 GGPLOT2, Elegant Graphics for Data Analysis (Use R!)가 거의 유일하기때문에 ggplot2를 배우겠다고 생각한다면 아마 이책외에는 답이 없을 것 같습니다. 하지만 다행스럽게도 책은 읽기쉽고 예제가 충분합니다. 코드와 차트가 번갈아 가면서 나오는형태라서 볼만하고, 가끔 문법적인 설명이 부족한 경우가 있기는 하지만, 그래도 다른 R책에 비해서는 상당히 양호합니다.

단점이라면 차트 그리기만 계속 설명하는 한권의 책을 읽는다는게 좀 지겹다는 정도겠네요.

마지막으로 ggplot2로 그린 몇개 차트를 예시로 올려봅니다. 먼저 airquality데이터에서 월별 오존양에 대한 density 차트입니다.

> library(datasets)
> head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6
> airquality$MonthF <- factor(airquality$Month)
# x축은 Ozone값, y축은 각 Ozone값에대한 density입니다. 월별로 분리해서 density를 그렸습니다.
> qplot(Ozone, data=airquality, geom="density", fill=MonthF, alpha=I(0.2))

linear model도 쉽게 그려볼 수 있습니다.

# Straight line
> qplot(Wind, Ozone, data=airquality,  geom=c("point", "smooth"), method="lm")
# y = b + ax + ax^2
> qplot(Wind, Ozone, data=airquality,  geom=c("point", "smooth"), method="lm", formula=y ~ poly(x, 2))


Similar Posts:

Octave Tutorial

GNU Octave is a high-level interpreted language, primarily intended for numerical computations. (http://www.gnu.org/software/octave/)

Here’s tutorial from Andrew Ng. He mentions that Octave is great prototyping language for implementing algorithm, and that Octave is superior to NumPy as Python is clunkier than Octave for ML programming purpose.

Similar Posts:

서평: R을 이용한 누구나하는 통계 분석

R을 이용한 누구나 하는 통계분석은 책의 저자가 서문에 적었듯이 잘 만들어진 R cookbook입니다. 그렇기에 통계적 방법에 대한 설명이 체계적으로 나열되고, 결과에 대한 분석도 빠지지 않고 잘 설명되어있습니다. 개인적으로는 쿡북으로 유명한 오라일리에서 나온 R cookbook 책보다 훨씬 가치가 있다고 생각이 드는 책입니다. 특히 이 책의 전반부에서 나오는 다양한 환경에서의 평균 비교(paired, two-sample, non-parametric) 방법에 대한 구성이나 예제는 다른 어떤 책보다 잘 되어있다고 생각이 듭니다.

하지만 어떤 모델을 R로 구현할때 모델의 가정이나 모델 수식을 적지 않거나 가볍게 다루는 점은 아쉬웠습니다. 예를들어 제일 마지막 챕터에서는 원하는 검정력(power)을 얻기 위해 필요한 표본수를 설명하는데 뜻밖에도 검정력에 대한 정의는 짚어주지 않습니다. 이런점은 기초 통계학을 R로 설명한 다른 책(예를들어 R을 이용한 통계 프로그래밍 기초)에서는 보통 수식이나 기본 개념을 꼭 복습하고 시작한다는 점을 생각해 볼 때 아쉬운 점입니다. 그렇기에 이 책의 독자는 몇가지 조건을 만족해야합니다. 일단 다뤄지는 주제(기술통계, 회귀분석, 분산분석 등)에 대해 기본적인 통계지식을 사전에 갖고 있어야합니다. 두번째로는 R을 이미 알고 있어야합니다. 이런 조건이 너무 빡빡해 보이기는 하지만 의외로 이책이 벌써 3쇄까지 나왔다는 것만 봐도 그런 상황에 있는 독자가 많다는 것을 의미합니다.

개인적으로는 책을 읽으면서 이해가 안가는 점을 저자가 운영하는 카페에 질문하고 정확한 설명을 들을 수 있었고, 또 상당히 많은 분석방법을 체계적으로 정리했다는데에서 추천하고 싶은 책입니다.

Similar Posts:

Sentiment Analysis Resource

Sentiment Symposium Tutorial is a nice website with detailed explanation and even some codes.

Thumbs up? Sentiment Classification using Machine Learning Techniques is a paper quoted +1700 times.

These two are recommended reading material from nlp-class.org on sentiment analysis.

Similar Posts:

Kappa for inter-rater agreement

Cohen’s kappa coefficient is a statistical measure of inter-rater agreement or inter-annotator agreement for qualitative (categorical) items. (See http://en.wikipedia.org/wiki/Cohen’s_kappa)

Kappa is computed as:
  \kappa = \dfrac{P(a) - P(e)}{1 - P(e)}

P(a) is observed prob. of agrement and P(e) is prob. of agreement by chance, i.e., P(e) is the chance of agreement assuming the independence of raters. So, the equation is looking at ‘prob. of observed agreement – prob. of chance agreement’ over ‘perfect agreement(P(a)=1) – prob. of chance agreement’. See http://en.wikipedia.org/wiki/Cohen’s_kappa#Example for example.

I find fmsb package has readable output though there’s other pakcage like irr (Various Coefficients of Interrater Reliability and Agreement).


> library(fmsb)
> d = matrix(c(10, 1, 1, 10), nrow=2)
> d
     [,1] [,2]
[1,]   10    1
[2,]    1   10
> Kappa.test(d)
$Result

	Estimate Cohen's kappa statistics and test the null hypothesis that
	the extent of agreement is same as random (kappa=0)

data:  d
Z = 3.8376, p-value = 6.212e-05
95 percent confidence interval:
 0.5779259 1.0584377
sample estimates:
[1] 0.8181818

$Judgement
[1] "Almost perfect agreement"

Read $Judgement for the answer. In this example, we observed almost perfect agreement with very low p-value.

Similar Posts:

Rattle for exploration of variables

I’m reading a book on rattle. In rattle, data exploration is easy.

To see pairs (or splom) plot, select explore tab then click execute. Following is the output for weather dataset.

There’s histogram in the diagonal. Upper right side has correlation in numbers. Lower left has scatter plot with smoothing lines.

To see correlation, check correlation radio button in Explore tab and click execute:

Shape in the picture represents correlation; high correlation = line and low correlation = ellipsis or circle. Blue color represents plus correlation while red is for minus.

I like rattle as its output comes with corresponding R commands. In Log tab, every commands used for diagrams are shown.

Similar Posts:

Random forest for variable selection

Package randomForest has importance() to estimate the importance of variables.

The example in the reference manual has this:

> library(randomForest)
> data(mtcars)
> head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
> mtcars.rf <- randomForest(mpg ~ ., data=mtcars, ntree=1000, keep.forest=FALSE, importance=TRUE)
> importance(mtcars.rf)
       %IncMSE IncNodePurity
cyl  16.050788     171.09822
disp 18.868236     232.56372
hp   17.031602     198.29501
drat  7.728328      64.23068
wt   18.595598     260.77604
qsec  5.607246      33.88488
vs    5.124934      26.49292
am    3.938463      13.72707
gear  4.482608      18.85271
carb  7.823431      33.94279
> importance(mtcars.rf, type=1)
       %IncMSE
cyl  16.050788
disp 18.868236
hp   17.031602
drat  7.728328
wt   18.595598
qsec  5.607246
vs    5.124934
am    3.938463
gear  4.482608
carb  7.823431

In importance(), type=1 shows mean squared error increase if each variable is removed from the predictors. Type 2 shows increase in node impurity averaged over all trees.

To visualize:

> varImpPlot(mtcars.rf)

To get the top three important variables:

> mtcars.imp <- importance(mtcars.rf, type=1)
> mtcars.imp[order(mtcars.imp, decreasing=TRUE),]
     disp        wt        hp       cyl      carb      drat      qsec        vs
18.868236 18.595598 17.031602 16.050788  7.823431  7.728328  5.607246  5.124934
     gear        am
 4.482608  3.938463
> names(mtcars.imp[order(mtcars.imp, decreasing=TRUE),])[1:3]
[1] "disp" "wt"   "hp"

Thus we get disp, wt, and hp.

Similar Posts:

ROC graph 101


Tom Fawcet, ROC Graphs: Notes and Practical Considerations for Data Mining Researcher, HP Labs Technical Reports, 2003.

This is a paper on the ROC graph, and I really enjoyed reading it. Though many ‘introduction to machine learning’ books describe ROC curve, none of them could explain it in this much depth.

Starting from algorithms to draw the graph correctly and efficiently, it explains that ROC curve is class skew invariant unlike precision-recall graph, and it explains how to use cross validation to draw a vertically averaged graph(so that we can find confidence interval for each false positive rate) and to draw an averaged curve by threshold(which may not be attractive if we’re averaging different models and if scores are not probabilities).

The paper goes even further to explain cost sensitive ROC curve and multi-class ROC graph(and AUC of it). Finally, it describes interpolation of classifiers to get a classifier somewhere in the middle of two points in the ROC graph(we can do this by random sampling classifier output) and it describes conditional classifier for removing concavities in ROC graph. Chained classifier was also discussed (by mentioning that it’s violating the assumption that each model in ROC graph is supposed to be independent).

I recommend this to everyone who didn’t study ROC graph in details.

Similar Posts:

Tweaking bayes theorem

Tweaking Bayes’ Theorem

This is my own trial to explain the tweak mentioned in the above link.

In the video, what we want is to find the best english text for the given foreign text, and it can be written as:
  Pr(e|f) = Pr(e)Pr(f|e)/Pr(f).

For the purpose of finding english text, ignore Pr(f), i.e.,:

  argmax_e Pr(e|f) \\  = argmax_e Pr(e)Pr(f|e)/Pr(f) \\  = argmax_e Pr(e)Pr(f|e) \\  \simeq argmax_e p(e)p(f|e) \\  \simeq argmax_e p(e)^{1.5}p(f|e)

What’s pointed out as interesting in the linked document is 1.5.

Here’s my explanation.

As it’s probability^1.5, it makes the probability lower, but not higher, i.e., x^1.5 < x if 0 <= x <=1.

I think this might be a tweak due to data scarcity. For p(e), there’s tons of data to build a model. On the other hand, p(f|e) requires for you to get parallel corpus (texts that’s written in both of English and foreign language) which is inherently scarce.

As a result, p(e)^1.5 * p(e|f) lowers p(e) as it’s supposed to be too high compared to p(f|e).

Similar Posts:

T-test for comparing means

Before we get started, open up Gaussian distribution and Chi, t, F distributions if you need some reference on the math.

One sample t-test

If we don’t know variance of population (that is usually the case), for X \sim N(\mu, \sigma^2):

  \dfrac{\overline{X}-\mu}{s/\sqrt{n}} \sim t(n-1)

where s is standard deviation of samples and n is the number of samples.

As an example, to test if the mean of “1, 3, 2, 7, 8, 9, 3, 4, 5″ is 5, we should test if they’re normally distributed:

> x = c(1, 3, 2, 7, 8, 9, 3, 4, 5)
> shapiro.test(x)

	Shapiro-Wilk normality test

data:  x
W = 0.9409, p-value = 0.5917

As p-value > 0.05, we can not reject H0, i.e., it’s following normal distribution. See Testing Normality for additional way of testing normality.

Now, to apply t-test:

> t.test(x, mu=5)

	One Sample t-test

data:  x
t = -0.3592, df = 8, p-value = 0.7287
alternative hypothesis: true mean is not equal to 5
95 percent confidence interval:
 2.526785 6.806548
sample estimates:
mean of x
 4.666667

As p-value is 0.7287 > 0.05, H0 is not rejected, meaning that the true mean is 5.

Or to see if the mean of x is larger than 5:

> t.test(x, mu=5, alternative="greater")

	One Sample t-test

data:  x
t = -0.3592, df = 8, p-value = 0.6356
alternative hypothesis: true mean is greater than 5
95 percent confidence interval:
 2.941079      Inf
sample estimates:
mean of x
 4.666667

In this case, true mean is NOT greater than 5 as p-value > 0.05.

Independent two sample t-test

Here, we want to know if the mean of X_1, X_2, \cdots, X_{n_1} and Y_1, Y_2, \cdots, Y_{n_2} are the same when X_{n_1} and Y_{n_2} are independent and X \sim N(\mu_1, \sigma_1), Y \sim N(\mu_2, \sigma_2).

1) If we know \sigma_1 and \sigma_2.
The test statistics is

  \dfrac{\overline{X}-\overline{Y}}{\sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}}} \sim N(0,1)

However, we usually don’t know \sigma_1 and \sigma_2.

2) We don’t know \sigma_1, \sigma_2, but n_1 and n_2 are big enough.
Then the test statistics is:

  \dfrac{\overline{X}-\overline{Y}}{\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}} \sim N(0, 1)

Usually 30 is magic number to determine if the sample size is big.

3) We don’t know \sigma_1, \sigma_2, but \sigma_1=\sigma_2
Test statistics can be written as

  \dfrac{\hat{X}-\hat{Y}-(\mu_1-\mu2)}{S_p\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}} \sim t(n_1 + n_2 - 2)

where S_p is so called pooled sample variance:

  S_p=\dfrac{(n_1-1)S_1^2 + (n_2-1)S_2^2}{(n_1 + n_2 - 2)}

(Note: We’re still assuming that X and Y follows normal distribution. See assumptions of t-test.)

As an example, let’s test if the means are the same for “1, 3, 2, 7, 8, 9, 3, 4, 5″ and “1, 2, 4, 3, 2, 5, 6, 7, 8, 2, 3, 5″.

Let’s test if the variances are the same:


> var.test(x,y)

	F test to compare two variances

data:  x and y
F = 1.5787, num df = 8, denom df = 11, p-value = 0.4734
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.4308902 6.6990915
sample estimates:
ratio of variances
          1.578704

As p-value > 0.05, we can not reject that their variances are the same.

If we need to test normality, we want to see if \overline{X}-\overline{Y} is normally distributed. In this example, we know that the variance is the same for X and Y. So, when using shapiro.test, we need to think of this t-test as simplified version of anova. Then, X_i = \mu_i + \epsilon_i and Y_j = \mu_j + \epsilon_j where \epsilon_{i} \sim N(0, \sigma_E),~\epsilon_{j} \sim N(0, \sigma_E). As \epsilon_i and \epsilon_j is normally distributed with the same mean and variance, put them together and test normality. Suppose that we have data “1, 3, 2, 7, 8, 9, 3, 4, 5″ and “1, 2, 4, 3, 2, 5, 6, 7, 8, 2, 3, 5″. Then, run shapiro.test like the below:

> x = c(1, 3, 2, 7, 8, 9, 3, 4, 5)
> y = c(1, 2, 4, 3, 2, 5, 6, 7, 8, 2, 3, 5)
> shapiro.test(c(x-mean(x), y-mean(y)))

	Shapiro-Wilk normality test

data:  c(x - mean(x), y - mean(y))
W = 0.9426, p-value = 0.2452

In this case, H0 holds: it’s normal.

Another way of doing this is using lm:

> f = data.frame(val=c(x, y), klass=c(rep("x", NROW(x)), rep("y", NROW(y))))
> f
   val klass
1    1     x
2    3     x
3    2     x
4    7     x
5    8     x
6    9     x
7    3     x
8    4     x
9    5     x
10   1     y
11   2     y
12   4     y
13   3     y
14   2     y
15   5     y
16   6     y
17   7     y
18   8     y
19   2     y
20   3     y
21   5     y
> # As klass is a factor variable, val = alpha * klass + epsilon where alpha is either 0 or 1.
> shapiro.test(resid(lm(val ~ klass, data=f)))

	Shapiro-Wilk normality test

data:  resid(lm(val ~ klass, data = f))
W = 0.9426, p-value = 0.2452

As you can see, using lm gives the same result with subtracting mean from x and y separately.

If the variances were different, we would use shapiro.test for each of X an Y separately.

Now, t-test:

> t.test(x, y, var.equal=TRUE)

	Two Sample t-test

data:  x and y
t = 0.6119, df = 19, p-value = 0.5479
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.613802  2.947136
sample estimates:
mean of x mean of y
 4.666667  4.000000

It’s confidence interval includes zero. Thus p-value > 0.05, meaning that we can not reject H0 that their means are the same.

3) If we know that \sigma_1 \neq \sigma_2

We’re still assuming that X and Y follow normal distribution and they’re independent. As their variances are not the same, we just use the fact that X - Y = N(\mu_1 - \mu_2, \sigma_1 + \sigma_2).

Because we do not know their variances, use sample variance:

  \dfrac{\overline{X}-\overline{Y}}{\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}} \sim t(df)

Code for R is the same, except that we use t.test(x, y, var.equal=FALSE).

But one should think really hard why he/she want to compare mean in the first place when they have different variances.

Paired sample t-test

I think this is the data that any intelligent engineer will try to get from their experiment. Paired samples has data in this form: (X_1, Y_1),~ (X_2, Y_2),~ \cdots,~ (X_n, Y_n). For example, it could be like data of (old method performance, new method performance) observed from several machines.

If X and Y are normally distributed, D=X-Y follows normal distribution. Even when it’s not the case, Central Limit Theorem states that sample average follows normal distribution. Therefore D \sim N.

As we do not know variance of D, use sample variance to get:

  \dfrac{D - \mu_D}{S_D / \sqrt{n}} \sim t(n-1)

In R (I am assuming X \sim N and Y \sim N. Without it, one should run normality test first as we have small number of data in this example):

> x = c(1, 2, 3, 4, 3, 2)
> y = c(5, 3, 2, 3, 1, 7)
> t.test(x, y, paired=TRUE)

	Paired t-test

data:  x and y
t = -0.8452, df = 5, p-value = 0.4366
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -4.041553  2.041553
sample estimates:
mean of the differences
                     -1

We conclude that we can not reject H0 that true means are the same. In human language, “their mean are the same”.

If all the assumptions do not hold

All the methods in the above have some kind of assumptions like sample size is large or normal distribution.

If such assumptions look invalid, one could use non-parametric methods like rank sum test. For example, for the paired t-test case in the above:

> x = c(1, 2, 3, 4, 3, 2)
> y = c(5, 3, 2, 3, 1, 7)
> library(BSDA)
> wilcox.test(x, y, paired=TRUE)

	Wilcoxon signed rank test with continuity correction

data:  x and y
V = 8, p-value = 0.6716
alternative hypothesis: true location shift is not equal to 0

See Rank Tests for more examples.

Refernces)
배도선 외, 통계학 이론과 응용, 청문각.
임동훈, R을 이용한 비모수 통계학, 자유아카데미.
김재희, R을 이용한 통계 프로그래밍 기초, 자유아카데미.
안재형, R을 이용한 누구나하는 통계분석, 한나래.

Similar Posts: