<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Computational Prediction</title>
	<atom:link href="http://mkseo.pe.kr/stats/?feed=rss2" rel="self" type="application/rss+xml" />
	<link>http://mkseo.pe.kr/stats</link>
	<description></description>
	<lastBuildDate>Thu, 17 May 2012 15:29:08 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
		<item>
		<title>Book review: GGPLOT2, Elegant Graphics for Data Analysis (Use R!)</title>
		<link>http://mkseo.pe.kr/stats/?p=567</link>
		<comments>http://mkseo.pe.kr/stats/?p=567#comments</comments>
		<pubDate>Thu, 17 May 2012 15:29:08 +0000</pubDate>
		<dc:creator>Minkoo Seo</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://mkseo.pe.kr/stats/?p=567</guid>
		<description><![CDATA[ggplot2 is a grammar-based graphics system for R. R's built-in plotting functions usually pack every feature into a single function, which makes plots hard to vary, reuse, or extend. ggplot2 instead rethinks graphics itself and separates a chart into components: geoms, statistics, scales, coordinate systems, faceting, position, and aesthetics. Each chart is then drawn as a combination of these components. As a result, the components can easily be [...]]]></description>
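			<content:encoded><![CDATA[<p><a href="http://had.co.nz/ggplot2/">ggplot2</a> is a grammar-based graphics system for R. R's built-in plotting functions usually pack every feature into a single function, which makes plots hard to vary, reuse, or extend. ggplot2 instead rethinks graphics itself and separates a chart into components: geoms, statistics, scales, coordinate systems, faceting, position, and aesthetics. Each chart is then drawn as a combination of these components. As a result, the components can easily be combined to build more complex charts or to customize a chart in more detail.</p>
<p>As far as I know, <a href="http://tinyurl.com/ggplot2-book">GGPLOT2, Elegant Graphics for Data Analysis (Use R!)</a> is almost the only book on ggplot2, so if you have decided to learn ggplot2 there is probably no alternative to it. Fortunately, the book is easy to read and has plenty of examples. Code and charts alternate throughout, which makes it pleasant to follow. The explanation of the syntax is occasionally thin, but compared with other R books it still holds up quite well.</p>
<p>If there is a drawback, it is that reading an entire book that does nothing but explain how to draw charts gets a little tedious.</p>
<p>Finally, here are a few charts drawn with ggplot2 as examples. The first is a density chart of ozone levels by month from the airquality data.</p>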
<pre class="brush: plain; title: ; notranslate">
&gt; library(datasets)
&gt; library(ggplot2)
&gt; head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6
&gt; airquality$MonthF &lt;- factor(airquality$Month)
# The x-axis is the Ozone value and the y-axis is the density at each Ozone value. One density curve is drawn per month.
&gt; qplot(Ozone, data=airquality, geom=&quot;density&quot;, fill=MonthF, alpha=I(0.2))
</pre>
<p><img src="http://mkseo.pe.kr/stats/wp-content/uploads/2012/05/ozone_density.png" alt="" title="ozone_density" width="655" height="318" class="alignnone size-full wp-image-568" /></p>
<p>linear model도 쉽게 그려볼 수 있습니다.</p>
<pre class="brush: plain; title: ; notranslate">
# Straight line
&gt; qplot(Wind, Ozone, data=airquality,  geom=c(&quot;point&quot;, &quot;smooth&quot;), method=&quot;lm&quot;)
# Quadratic fit: y = b0 + b1*x + b2*x^2
&gt; qplot(Wind, Ozone, data=airquality,  geom=c(&quot;point&quot;, &quot;smooth&quot;), method=&quot;lm&quot;, formula=y ~ poly(x, 2))
</pre>
<p><img src="http://mkseo.pe.kr/stats/wp-content/uploads/2012/05/ggplot_lm.png" alt="" title="ggplot_lm" width="675" height="311" class="alignnone size-full wp-image-569" /><br />
<img src="http://mkseo.pe.kr/stats/wp-content/uploads/2012/05/ggpot_lm_poly.png" alt="" title="ggpot_lm_poly" width="672" height="315" class="alignnone size-full wp-image-570" /></p>
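<p>The same two kinds of chart can also be written with the layered ggplot() interface, which makes the grammar more visible: each geom_ call adds one layer. The following is only a minimal sketch, assuming ggplot2 is installed and using the built-in airquality data; the variable names p_density and p_fit are made up for illustration. As far as I know, qplot() is marked as deprecated in recent ggplot2 releases, so this form may also be the safer one to learn.</p>
<pre class="brush: plain; title: ; notranslate">
library(ggplot2)

# Density of Ozone, one semi-transparent curve per month.
p_density &lt;- ggplot(airquality, aes(x = Ozone, fill = factor(Month))) +
  geom_density(alpha = 0.2)

# Scatter plot of Ozone against Wind with a quadratic least-squares fit.
p_fit &lt;- ggplot(airquality, aes(x = Wind, y = Ozone)) +
  geom_point() +
  geom_smooth(method = &quot;lm&quot;, formula = y ~ poly(x, 2))

# Rows with a missing Ozone value are dropped with a warning when drawing.
print(p_density)
print(p_fit)
</pre>
<p>Swapping geom_density() for another geom, or adding a facet, changes one component without touching the rest.</p>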
]]></content:encoded>
			<wfw:commentRss>http://mkseo.pe.kr/stats/?feed=rss2&#038;p=567</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Octave Tutorial</title>
		<link>http://mkseo.pe.kr/stats/?p=564</link>
		<comments>http://mkseo.pe.kr/stats/?p=564#comments</comments>
		<pubDate>Tue, 08 May 2012 15:27:31 +0000</pubDate>
		<dc:creator>Minkoo Seo</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://mkseo.pe.kr/stats/?p=564</guid>
		<description><![CDATA[GNU Octave is a high-level interpreted language, primarily intended for numerical computations. (http://www.gnu.org/software/octave/) Here&#8217;s a tutorial from Andrew Ng. He mentions that Octave is a great prototyping language for implementing algorithms, and that he considers Octave superior to NumPy because Python is clunkier than Octave for ML programming.]]></description>
			<content:encoded><![CDATA[<blockquote><p>
GNU Octave is a high-level interpreted language, primarily intended for numerical computations. (<a href="http://www.gnu.org/software/octave/">http://www.gnu.org/software/octave/</a>)
</p></blockquote>
<p>Here&#8217;s a tutorial from Andrew Ng. He mentions that Octave is a great prototyping language for implementing algorithms, and that he considers Octave superior to NumPy because Python is clunkier than Octave for ML programming.</p>
<p><iframe width="560" height="315" src="http://www.youtube.com/embed/xi4K57bgCXk" frameborder="0" allowfullscreen></iframe></p>
<p><iframe width="560" height="315" src="http://www.youtube.com/embed/f-OIDo6Bxw0" frameborder="0" allowfullscreen></iframe></p>
<p><iframe width="560" height="315" src="http://www.youtube.com/embed/7IpwLWHxM5U" frameborder="0" allowfullscreen></iframe></p>
<p><iframe width="560" height="315" src="http://www.youtube.com/embed/r4pJori2klI" frameborder="0" allowfullscreen></iframe></p>
<p><iframe width="560" height="315" src="http://www.youtube.com/embed/jrWZvYDxAfw" frameborder="0" allowfullscreen></iframe></p>
<p><iframe width="560" height="315" src="http://www.youtube.com/embed/ABk_FmspLqs" frameborder="0" allowfullscreen></iframe></p>
]]></content:encoded>
			<wfw:commentRss>http://mkseo.pe.kr/stats/?feed=rss2&#038;p=564</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Book review: R을 이용한 누구나 하는 통계분석</title>
		<link>http://mkseo.pe.kr/stats/?p=557</link>
		<comments>http://mkseo.pe.kr/stats/?p=557#comments</comments>
		<pubDate>Tue, 17 Apr 2012 13:52:09 +0000</pubDate>
		<dc:creator>Minkoo Seo</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://mkseo.pe.kr/stats/?p=557</guid>
		<description><![CDATA[R을 이용한 누구나 하는 통계분석 is, as its author writes in the preface, a well-made R cookbook. Statistical methods are therefore laid out systematically, and the interpretation of the results is always explained carefully. Personally, I think it is far more valuable than the R Cookbook from O'Reilly, the publisher famous for its cookbooks. In particular, the organization and [...]]]></description>
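			<content:encoded><![CDATA[<p><a href="http://www.yes24.com/24/Goods/4510634?Acode=101">R을 이용한 누구나 하는 통계분석</a> (roughly, &#8220;Statistical Analysis with R That Anyone Can Do&#8221;) is, as its author writes in the preface, a well-made R cookbook. Statistical methods are therefore laid out systematically, and the interpretation of the results is always explained carefully. Personally, I think it is far more valuable than the R Cookbook from O'Reilly, the publisher famous for its cookbooks. In particular, the organization and examples for comparing means in various settings (paired, two-sample, non-parametric) in the first half of the book are, I think, better than in any other book.</p>
<p>However, it was disappointing that when the book implements a model in R, it omits or only lightly covers the model's assumptions and equations. For example, the last chapter explains the sample size needed to achieve a desired power, yet surprisingly never defines power. This is regrettable given that other books that teach basic statistics with R (for example, <a href="http://www.yes24.com/24/Goods/3413742?Acode=101">R을 이용한 통계 프로그래밍 기초</a>) usually review the equations and basic concepts before starting. Readers of this book therefore need to meet a few conditions. First, they need basic statistical knowledge of the topics covered (descriptive statistics, regression analysis, analysis of variance, and so on) beforehand. Second, they need to know R already. These conditions may look strict, but the fact that the book is already in its third printing suggests that many readers are in exactly that position.</p>
<p>Personally, I recommend the book because I was able to ask about the points I did not understand at the online cafe run by the author and get accurate explanations, and because it organizes a large number of analysis methods systematically.</p>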
]]></content:encoded>
			<wfw:commentRss>http://mkseo.pe.kr/stats/?feed=rss2&#038;p=557</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Sentiment Analysis Resource</title>
		<link>http://mkseo.pe.kr/stats/?p=552</link>
		<comments>http://mkseo.pe.kr/stats/?p=552#comments</comments>
		<pubDate>Wed, 04 Apr 2012 14:06:41 +0000</pubDate>
		<dc:creator>Minkoo Seo</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://mkseo.pe.kr/stats/?p=552</guid>
		<description><![CDATA[Sentiment Symposium Tutorial is a nice website with detailed explanations and even some code. Thumbs up? Sentiment Classification using Machine Learning Techniques is a paper that has been cited more than 1,700 times. These two are the recommended reading on sentiment analysis from nlp-class.org.]]></description>
			<content:encoded><![CDATA[<p><a href="http://sentiment.christopherpotts.net/">Sentiment Symposium Tutorial</a> is a nice website with detailed explanations and even some code.</p>
<p><a href="http://www.cs.cornell.edu/home/llee/papers/sentiment.pdf">Thumbs up? Sentiment Classification using Machine Learning Techniques</a> is a paper that has been cited more than 1,700 times.</p>
<p>These two are the recommended reading on sentiment analysis from nlp-class.org.</p>
]]></content:encoded>
			<wfw:commentRss>http://mkseo.pe.kr/stats/?feed=rss2&#038;p=552</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Kappa for inter-rater agreement</title>
		<link>http://mkseo.pe.kr/stats/?p=531</link>
		<comments>http://mkseo.pe.kr/stats/?p=531#comments</comments>
		<pubDate>Sun, 01 Apr 2012 15:39:41 +0000</pubDate>
		<dc:creator>Minkoo Seo</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://mkseo.pe.kr/stats/?p=531</guid>
		<description><![CDATA[Cohen&#8217;s kappa coefficient is a statistical measure of inter-rater agreement or inter-annotator agreement for qualitative (categorical) items. (See http://en.wikipedia.org/wiki/Cohen&#8217;s_kappa) Kappa is computed as kappa = (P(a) - P(e)) / (1 - P(e)), where P(a) is the observed probability of agreement and P(e) is the probability of agreement by chance, i.e., P(e) is the chance of agreement assuming the raters are independent. So, the equation is looking at &#8216;prob. of observed [...]]]></description>
			<content:encoded><![CDATA[<p>Cohen&#8217;s kappa coefficient is a statistical measure of inter-rater agreement or inter-annotator agreement for qualitative (categorical) items. (See <a href="http://en.wikipedia.org/wiki/Cohen's_kappa">http://en.wikipedia.org/wiki/Cohen&#8217;s_kappa</a>)</p>
<p>Kappa is computed as:<br />
<img src='http://s.wordpress.com/latex.php?latex=%20%20%5Ckappa%20%3D%20%5Cdfrac%7BP%28a%29%20-%20P%28e%29%7D%7B1%20-%20P%28e%29%7D%20%20&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='  \kappa = \dfrac{P(a) - P(e)}{1 - P(e)}  ' title='  \kappa = \dfrac{P(a) - P(e)}{1 - P(e)}  ' class='latex' /></p>
<p><img src='http://s.wordpress.com/latex.php?latex=P%28a%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='P(a)' title='P(a)' class='latex' /> is the observed probability of agreement and <img src='http://s.wordpress.com/latex.php?latex=P%28e%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='P(e)' title='P(e)' class='latex' /> is the probability of agreement by chance, i.e., <img src='http://s.wordpress.com/latex.php?latex=P%28e%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='P(e)' title='P(e)' class='latex' /> is the chance of agreement assuming the raters are independent. So, the equation divides &#8216;prob. of observed agreement &#8211; prob. of chance agreement&#8217; by &#8216;perfect agreement (<img src='http://s.wordpress.com/latex.php?latex=P%28a%29%3D1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='P(a)=1' title='P(a)=1' class='latex' />) &#8211; prob. of chance agreement&#8217;. See <a href="http://en.wikipedia.org/wiki/Cohen's_kappa#Example">http://en.wikipedia.org/wiki/Cohen&#8217;s_kappa#Example</a> for a worked example.</p>
<p>I find that the <a href="http://cran.r-project.org/web/packages/fmsb/fmsb.pdf">fmsb</a> package has readable output, though there are other packages such as <a href="http://cran.r-project.org/web/packages/irr/irr.pdf">irr (Various Coefficients of Interrater Reliability and Agreement)</a>.</p>
<pre class="brush: plain; title: ; notranslate">

&gt; library(fmsb)
&gt; d = matrix(c(10, 1, 1, 10), nrow=2)
&gt; d
     [,1] [,2]
[1,]   10    1
[2,]    1   10
&gt; Kappa.test(d)
$Result

	Estimate Cohen's kappa statistics and test the null hypothesis that
	the extent of agreement is same as random (kappa=0)

data:  d
Z = 3.8376, p-value = 6.212e-05
95 percent confidence interval:
 0.5779259 1.0584377
sample estimates:
[1] 0.8181818

$Judgement
[1] &quot;Almost perfect agreement&quot;
</pre>
<p>Read $Judgement for the verdict. In this example, we observe almost perfect agreement (kappa = 0.82) with a very low p-value, so the agreement is very unlikely to be due to chance alone.</p>
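<p>The estimate can be checked against the formula for kappa given earlier. The following is a minimal sketch in base R using the same 2x2 table; the names p_a, p_e, and kappa_hat are my own, and the chance agreement is computed from the row and column totals under the assumption that the raters are independent.</p>
<pre class="brush: plain; title: ; notranslate">
d &lt;- matrix(c(10, 1, 1, 10), nrow = 2)
n &lt;- sum(d)

# Observed agreement: share of items on the diagonal (both raters agree).
p_a &lt;- sum(diag(d)) / n

# Chance agreement: for each category, the product of the two raters'
# marginal proportions, summed over categories.
p_e &lt;- sum(rowSums(d) * colSums(d)) / n^2

kappa_hat &lt;- (p_a - p_e) / (1 - p_e)
kappa_hat   # 0.8181818, the same estimate Kappa.test reports
</pre>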
]]></content:encoded>
			<wfw:commentRss>http://mkseo.pe.kr/stats/?feed=rss2&#038;p=531</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Rattle for exploration of variables</title>
		<link>http://mkseo.pe.kr/stats/?p=519</link>
		<comments>http://mkseo.pe.kr/stats/?p=519#comments</comments>
		<pubDate>Thu, 22 Mar 2012 15:50:39 +0000</pubDate>
		<dc:creator>Minkoo Seo</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://mkseo.pe.kr/stats/?p=519</guid>
		<description><![CDATA[I&#8217;m reading a book on rattle. In rattle, data exploration is easy. To see a pairs (or splom) plot, select the Explore tab and click Execute. The following is the output for the weather dataset. There&#8217;s a histogram on the diagonal. The upper right side shows the correlations as numbers. The lower left shows scatter plots with smoothing lines. To see correlations, check [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;m reading <a href="http://www.amazon.com/gp/product/1441998896/ref=as_li_qf_sp_asin_tl?ie=UTF8&#038;tag=togaware-20&#038;linkCode=as2&#038;camp=217145&#038;creative=399373&#038;creativeASIN=1441998896">a book on rattle</a>. In <a href="http://rattle.togaware.com/">rattle</a>, data exploration is easy.</p>
<p>To see a pairs (or splom) plot, select the Explore tab and click Execute. The following is the output for the weather dataset.<br />
<img src="http://mkseo.pe.kr/stats/wp-content/uploads/2012/03/rattle_pairs.png" alt="" title="rattle_pairs" width="627" height="654" class="alignnone size-full wp-image-520" /></p>
<p>There&#8217;s a histogram on the diagonal. The upper right side shows the correlations as numbers. The lower left shows scatter plots with smoothing lines.</p>
<p>To see correlations, check the Correlation radio button in the Explore tab and click Execute:<br />
<img src="http://mkseo.pe.kr/stats/wp-content/uploads/2012/03/rattle_correlation.png" alt="" title="rattle_correlation" width="548" height="609" class="alignnone size-full wp-image-521" /></p>
<p>The shape in each cell represents the correlation: a high correlation is drawn as a line, and a low correlation as an ellipse or circle. Blue represents positive correlation, while red represents negative correlation.</p>
<p>I like rattle because its output comes with the corresponding R commands. In the Log tab, every command used to produce the diagrams is shown.</p>
]]></content:encoded>
			<wfw:commentRss>http://mkseo.pe.kr/stats/?feed=rss2&#038;p=519</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Random forest for variable selection</title>
		<link>http://mkseo.pe.kr/stats/?p=513</link>
		<comments>http://mkseo.pe.kr/stats/?p=513#comments</comments>
		<pubDate>Thu, 22 Mar 2012 15:27:52 +0000</pubDate>
		<dc:creator>Minkoo Seo</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://mkseo.pe.kr/stats/?p=513</guid>
		<description><![CDATA[The randomForest package provides importance() to estimate the importance of variables. The example in the reference manual has this: In importance(), type=1 shows the increase in mean squared error on the out-of-bag data when the values of each variable are randomly permuted. Type 2 shows the total decrease in node impurity from splits on the variable, averaged over all trees. To visualize: To get the top three important variables: Thus [...]]]></description>
			<content:encoded><![CDATA[<p>The randomForest package provides importance() to estimate the importance of variables.</p>
<p>The example in the <a href="http://cran.r-project.org/web/packages/randomForest/randomForest.pdf">reference manual</a> has this:</p>
<pre class="brush: plain; title: ; notranslate">
&gt; library(randomForest)
&gt; data(mtcars)
&gt; head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
&gt; mtcars.rf &lt;- randomForest(mpg ~ ., data=mtcars, ntree=1000, keep.forest=FALSE, importance=TRUE)
&gt; importance(mtcars.rf)
       %IncMSE IncNodePurity
cyl  16.050788     171.09822
disp 18.868236     232.56372
hp   17.031602     198.29501
drat  7.728328      64.23068
wt   18.595598     260.77604
qsec  5.607246      33.88488
vs    5.124934      26.49292
am    3.938463      13.72707
gear  4.482608      18.85271
carb  7.823431      33.94279
&gt; importance(mtcars.rf, type=1)
       %IncMSE
cyl  16.050788
disp 18.868236
hp   17.031602
drat  7.728328
wt   18.595598
qsec  5.607246
vs    5.124934
am    3.938463
gear  4.482608
carb  7.823431
</pre>
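<p>In importance(), type=1 shows the increase in mean squared error on the out-of-bag data when the values of each variable are randomly permuted (%IncMSE). Type 2 shows the total decrease in node impurity from splits on the variable, averaged over all trees (IncNodePurity); for regression, impurity is measured by the residual sum of squares.</p>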
<p>To visualize:</p>
<pre class="brush: plain; title: ; notranslate">
&gt; varImpPlot(mtcars.rf)
</pre>
<p><img src="http://mkseo.pe.kr/stats/wp-content/uploads/2012/03/importance_rf.png" alt="" title="importance_rf" width="605" height="524" class="alignnone size-full wp-image-514" /></p>
<p>To get the top three important variables:</p>
<pre class="brush: plain; title: ; notranslate">
&gt; mtcars.imp &lt;- importance(mtcars.rf, type=1)
&gt; mtcars.imp[order(mtcars.imp, decreasing=TRUE),]
     disp        wt        hp       cyl      carb      drat      qsec        vs
18.868236 18.595598 17.031602 16.050788  7.823431  7.728328  5.607246  5.124934
     gear        am
 4.482608  3.938463
&gt; names(mtcars.imp[order(mtcars.imp, decreasing=TRUE),])[1:3]
[1] &quot;disp&quot; &quot;wt&quot;   &quot;hp&quot;
</pre>
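<p>Thus we get disp, wt, and hp. The exact values vary from run to run because the forest is built randomly; call set.seed() before randomForest() for reproducible results.</p>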
]]></content:encoded>
			<wfw:commentRss>http://mkseo.pe.kr/stats/?feed=rss2&#038;p=513</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>ROC graph 101</title>
		<link>http://mkseo.pe.kr/stats/?p=505</link>
		<comments>http://mkseo.pe.kr/stats/?p=505#comments</comments>
		<pubDate>Wed, 14 Mar 2012 15:36:38 +0000</pubDate>
		<dc:creator>Minkoo Seo</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://mkseo.pe.kr/stats/?p=505</guid>
		<description><![CDATA[Tom Fawcett, ROC Graphs: Notes and Practical Considerations for Data Mining Researchers, HP Labs Technical Reports, 2003. This is a paper on the ROC graph, and I really enjoyed reading it. Though many &#8216;introduction to machine learning&#8217; books describe the ROC curve, none of them explains it in this much depth. Starting from algorithms to [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.hpl.hp.com/techreports/2003/HPL-2003-4.pdf"><br />
Tom Fawcett, ROC Graphs: Notes and Practical Considerations for Data Mining Researchers, HP Labs Technical Reports, 2003.</a></p>
<p>This is a paper on the ROC graph, and I really enjoyed reading it. Though many &#8216;introduction to machine learning&#8217; books describe the ROC curve, none of them explains it in this much depth.</p>
<p>Starting from algorithms that draw the graph correctly and efficiently, it explains that the ROC curve, unlike the precision-recall graph, is invariant to class skew. It also explains how to use cross validation to draw a vertically averaged graph (so that we can find a confidence interval at each false positive rate) and a threshold-averaged curve (which may not be attractive if we&#8217;re averaging different models and if the scores are not probabilities).</p>
<p>The paper goes even further to explain cost-sensitive ROC curves and multi-class ROC graphs (and their AUC). Finally, it describes interpolating classifiers to get a classifier somewhere between two points in the ROC graph (we can do this by randomly sampling the classifiers&#8217; outputs), and it describes conditional classifiers for removing concavities in the ROC graph. Chained classifiers are also discussed, with the caveat that they violate the assumption that the models in an ROC graph are independent.</p>
<p>I recommend this to everyone who hasn&#8217;t studied the ROC graph in detail.</p>
]]></content:encoded>
			<wfw:commentRss>http://mkseo.pe.kr/stats/?feed=rss2&#038;p=505</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Tweaking bayes theorem</title>
		<link>http://mkseo.pe.kr/stats/?p=485</link>
		<comments>http://mkseo.pe.kr/stats/?p=485#comments</comments>
		<pubDate>Mon, 12 Mar 2012 16:22:43 +0000</pubDate>
		<dc:creator>Minkoo Seo</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://mkseo.pe.kr/stats/?p=485</guid>
		<description><![CDATA[Tweaking Bayes’ Theorem This is my own attempt to explain the tweak mentioned in the above link. In the video, what we want is to find the best English text for the given foreign text, and it can be written as: Pr(e|f) = Pr(e)Pr(f|e)/Pr(f). Since Pr(f) does not depend on e, it can be ignored when searching for the best English text. What&#8217;s pointed out as [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://java.dzone.com/articles/tweaking-bayes%E2%80%99-theorem">Tweaking Bayes’ Theorem</a></p>
<p>This is my own attempt to explain the tweak mentioned in the above link.</p>
<p>In the video, what we want is to find the best English text for the given foreign text, and it can be written as:<br />
<img src='http://s.wordpress.com/latex.php?latex=%20%20Pr%28e%7Cf%29%20%3D%20Pr%28e%29Pr%28f%7Ce%29%2FPr%28f%29.%20%20&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='  Pr(e|f) = Pr(e)Pr(f|e)/Pr(f).  ' title='  Pr(e|f) = Pr(e)Pr(f|e)/Pr(f).  ' class='latex' /></p>
<p>Since Pr(f) does not depend on e, it can be ignored when searching for the best English text, i.e.:</p>
<img src='http://s.wordpress.com/latex.php?latex=%20%20argmax_e%20Pr%28e%7Cf%29%20%5C%5C%20%20%3D%20argmax_e%20Pr%28e%29Pr%28f%7Ce%29%2FPr%28f%29%20%5C%5C%20%20%3D%20argmax_e%20Pr%28e%29Pr%28f%7Ce%29%20%5C%5C%20%20%5Csimeq%20argmax_e%20p%28e%29p%28f%7Ce%29%20%5C%5C%20%20%5Csimeq%20argmax_e%20p%28e%29%5E%7B1.5%7Dp%28f%7Ce%29%20%20&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='  argmax_e Pr(e|f) \\  = argmax_e Pr(e)Pr(f|e)/Pr(f) \\  = argmax_e Pr(e)Pr(f|e) \\  \simeq argmax_e p(e)p(f|e) \\  \simeq argmax_e p(e)^{1.5}p(f|e)  ' title='  argmax_e Pr(e|f) \\  = argmax_e Pr(e)Pr(f|e)/Pr(f) \\  = argmax_e Pr(e)Pr(f|e) \\  \simeq argmax_e p(e)p(f|e) \\  \simeq argmax_e p(e)^{1.5}p(f|e)  ' class='latex' />
<p>What the linked article points out as interesting is the exponent 1.5.</p>
<p>Here&#8217;s my explanation.</p>
<p>Because a probability lies between 0 and 1, raising it to the power 1.5 makes it smaller, never larger, i.e., x^1.5 &lt; x for 0 &lt; x &lt; 1.</p>
<p>I think this might be a tweak due to data scarcity. For p(e), there is plenty of data to build a model. On the other hand, p(f|e) requires a parallel corpus (texts written in both English and the foreign language), which is inherently scarce.</p>
<p>As a result, p(e)^1.5 * p(f|e) lowers p(e), which is presumed to be too high compared to p(f|e). Note that for the argmax only the relative ranking of the candidates matters: in log space the score is 1.5 log p(e) + log p(f|e), so the exponent also makes differences in p(e) count for more than differences in p(f|e).</p>
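<p>A small numeric sketch shows how the exponent can change which translation wins. The two candidates and all four probabilities below are made up purely for illustration; they do not come from the linked article or from any real model.</p>
<pre class="brush: plain; title: ; notranslate">
# Hypothetical scores for two candidate translations e1 and e2 of one sentence f.
p_e   &lt;- c(e1 = 0.04, e2 = 0.02)   # language model p(e)
p_f_e &lt;- c(e1 = 0.20, e2 = 0.50)   # translation model p(f|e)

plain   &lt;- p_e * p_f_e        # e1: 0.0080, e2: 0.0100
tweaked &lt;- p_e^1.5 * p_f_e    # e1: 0.0016, e2: about 0.0014

names(which.max(plain))    # &quot;e2&quot;: the translation model decides
names(which.max(tweaked))  # &quot;e1&quot;: the language model now outweighs it
</pre>
<p>Both tweaked scores are smaller than the plain ones, but the ranking flips because the gap between the candidates in p(e) has been stretched.</p>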
]]></content:encoded>
			<wfw:commentRss>http://mkseo.pe.kr/stats/?feed=rss2&#038;p=485</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>T-test for comparing means</title>
		<link>http://mkseo.pe.kr/stats/?p=396</link>
		<comments>http://mkseo.pe.kr/stats/?p=396#comments</comments>
		<pubDate>Fri, 09 Mar 2012 06:42:24 +0000</pubDate>
		<dc:creator>Minkoo Seo</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://mkseo.pe.kr/stats/?p=396</guid>
		<description><![CDATA[Before we get started, open up Gaussian distribution and Chi, t, F distributions if you need some reference on the math. One sample t-test If we don&#8217;t know the population variance (which is usually the case), then for X ~ N(mu, sigma^2): (Xbar - mu) / (s / sqrt(n)) ~ t(n - 1), where s is the sample standard deviation and n is the sample size. As an example, [...]]]></description>
			<content:encoded><![CDATA[<p>Before we get started, open up <a href="http://mkseo.pe.kr/archives/gaussian_distribution_and_chi_t_F.pdf">Gaussian  distribution  and  Chi,  t,  F  distributions</a> if you need some reference on the math.</p>
<h3>One sample t-test</h3>
<p>If we don&#8217;t know the population variance (which is usually the case), then for <img src='http://s.wordpress.com/latex.php?latex=X%20%5Csim%20N%28%5Cmu%2C%20%5Csigma%5E2%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='X \sim N(\mu, \sigma^2)' title='X \sim N(\mu, \sigma^2)' class='latex' />:</p>
<img src='http://s.wordpress.com/latex.php?latex=%20%20%5Cdfrac%7B%5Coverline%7BX%7D-%5Cmu%7D%7Bs%2F%5Csqrt%7Bn%7D%7D%20%5Csim%20t%28n-1%29%20%20&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='  \dfrac{\overline{X}-\mu}{s/\sqrt{n}} \sim t(n-1)  ' title='  \dfrac{\overline{X}-\mu}{s/\sqrt{n}} \sim t(n-1)  ' class='latex' />
<p>where s is the sample standard deviation and n is the sample size.</p>
<p>As an example, to test if the mean of &#8220;1, 3, 2, 7, 8, 9, 3, 4, 5&#8221; is 5, we should first check whether the values are normally distributed:</p>
<pre class="brush: plain; title: ; notranslate">
&gt; x = c(1, 3, 2, 7, 8, 9, 3, 4, 5)
&gt; shapiro.test(x)

	Shapiro-Wilk normality test

data:  x
W = 0.9409, p-value = 0.5917
</pre>
<p>As the p-value > 0.05, we cannot reject H0 that the data follow a normal distribution, so we proceed as if they do. See <a href="http://mkseo.pe.kr/stats/?p=244">Testing Normality</a> for additional ways of testing normality.</p>
<p>Now, to apply t-test:</p>
<pre class="brush: plain; title: ; notranslate">
&gt; t.test(x, mu=5)

	One Sample t-test

data:  x
t = -0.3592, df = 8, p-value = 0.7287
alternative hypothesis: true mean is not equal to 5
95 percent confidence interval:
 2.526785 6.806548
sample estimates:
mean of x
 4.666667
</pre>
<p>As the p-value is 0.7287 > 0.05, H0 is not rejected, meaning that the data are consistent with a true mean of 5.</p>
<p>Or to see if the mean of x is larger than 5:</p>
<pre class="brush: plain; title: ; notranslate">
&gt; t.test(x, mu=5, alternative=&quot;greater&quot;)

	One Sample t-test

data:  x
t = -0.3592, df = 8, p-value = 0.6356
alternative hypothesis: true mean is greater than 5
95 percent confidence interval:
 2.941079      Inf
sample estimates:
mean of x
 4.666667
</pre>
<p>In this case, we cannot conclude that the true mean is greater than 5, as the p-value > 0.05.</p>
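<p>Both results can be reproduced directly from the formula at the start of this section. This is a minimal sketch in base R; the names mu0, t_stat, p_two_sided, and p_greater are my own.</p>
<pre class="brush: plain; title: ; notranslate">
x &lt;- c(1, 3, 2, 7, 8, 9, 3, 4, 5)
mu0 &lt;- 5            # hypothesized mean
n &lt;- length(x)

# t = (mean(x) - mu0) / (s / sqrt(n)), with n - 1 degrees of freedom
t_stat &lt;- (mean(x) - mu0) / (sd(x) / sqrt(n))

# Two-sided p-value: probability of a |t| at least this large under H0
p_two_sided &lt;- 2 * pt(-abs(t_stat), df = n - 1)

# One-sided p-value for the alternative that the true mean is greater than 5
p_greater &lt;- pt(t_stat, df = n - 1, lower.tail = FALSE)

# Rounds to -0.3592, 0.7287, and 0.6356, matching the t.test output above
round(c(t = t_stat, two.sided = p_two_sided, greater = p_greater), 4)
</pre>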
<h3>Independent two sample t-test</h3>
<p>Here, we want to know if the means of <img src='http://s.wordpress.com/latex.php?latex=X_1%2C%20X_2%2C%20%5Ccdots%2C%20X_%7Bn_1%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='X_1, X_2, \cdots, X_{n_1}' title='X_1, X_2, \cdots, X_{n_1}' class='latex' /> and <img src='http://s.wordpress.com/latex.php?latex=Y_1%2C%20Y_2%2C%20%5Ccdots%2C%20Y_%7Bn_2%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='Y_1, Y_2, \cdots, Y_{n_2}' title='Y_1, Y_2, \cdots, Y_{n_2}' class='latex' /> are the same when <img src='http://s.wordpress.com/latex.php?latex=X_%7Bn_1%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='X_{n_1}' title='X_{n_1}' class='latex' /> and <img src='http://s.wordpress.com/latex.php?latex=Y_%7Bn_2%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='Y_{n_2}' title='Y_{n_2}' class='latex' /> are independent and <img src='http://s.wordpress.com/latex.php?latex=X%20%5Csim%20N%28%5Cmu_1%2C%20%5Csigma_1%5E2%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='X \sim N(\mu_1, \sigma_1^2)' title='X \sim N(\mu_1, \sigma_1^2)' class='latex' />, <img src='http://s.wordpress.com/latex.php?latex=Y%20%5Csim%20N%28%5Cmu_2%2C%20%5Csigma_2%5E2%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='Y \sim N(\mu_2, \sigma_2^2)' title='Y \sim N(\mu_2, \sigma_2^2)' class='latex' />.</p>
<p><b>1) If we know <img src='http://s.wordpress.com/latex.php?latex=%5Csigma_1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\sigma_1' title='\sigma_1' class='latex' /> and <img src='http://s.wordpress.com/latex.php?latex=%5Csigma_2&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\sigma_2' title='\sigma_2' class='latex' />.</b><br />
The test statistic is</p>
<img src='http://s.wordpress.com/latex.php?latex=%20%20%5Cdfrac%7B%5Coverline%7BX%7D-%5Coverline%7BY%7D%7D%7B%5Csqrt%7B%5Cfrac%7B%5Csigma_1%5E2%7D%7Bn_1%7D%2B%5Cfrac%7B%5Csigma_2%5E2%7D%7Bn_2%7D%7D%7D%20%5Csim%20N%280%2C1%29%20%20&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='  \dfrac{\overline{X}-\overline{Y}}{\sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}}} \sim N(0,1)  ' title='  \dfrac{\overline{X}-\overline{Y}}{\sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}}} \sim N(0,1)  ' class='latex' />
<p>However, we usually don&#8217;t know <img src='http://s.wordpress.com/latex.php?latex=%5Csigma_1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\sigma_1' title='\sigma_1' class='latex' /> and <img src='http://s.wordpress.com/latex.php?latex=%5Csigma_2&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\sigma_2' title='\sigma_2' class='latex' />. </p>
<p><b>2) We don&#8217;t know <img src='http://s.wordpress.com/latex.php?latex=%5Csigma_1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\sigma_1' title='\sigma_1' class='latex' />, <img src='http://s.wordpress.com/latex.php?latex=%5Csigma_2&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\sigma_2' title='\sigma_2' class='latex' />, but <img src='http://s.wordpress.com/latex.php?latex=n_1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='n_1' title='n_1' class='latex' /> and <img src='http://s.wordpress.com/latex.php?latex=n_2&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='n_2' title='n_2' class='latex' /> are big enough.</b><br />
Then the test statistic is:</p>
<img src='http://s.wordpress.com/latex.php?latex=%20%20%5Cdfrac%7B%5Coverline%7BX%7D-%5Coverline%7BY%7D%7D%7B%5Csqrt%7B%5Cfrac%7BS_1%5E2%7D%7Bn_1%7D%20%2B%20%5Cfrac%7BS_2%5E2%7D%7Bn_2%7D%7D%7D%20%5Csim%20N%280%2C%201%29%20%20&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='  \dfrac{\overline{X}-\overline{Y}}{\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}} \sim N(0, 1)  ' title='  \dfrac{\overline{X}-\overline{Y}}{\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}} \sim N(0, 1)  ' class='latex' />
<p>As a rule of thumb, a sample of 30 or more is usually treated as big enough.</p>
<p><b>3) We don&#8217;t know <img src='http://s.wordpress.com/latex.php?latex=%5Csigma_1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\sigma_1' title='\sigma_1' class='latex' />, <img src='http://s.wordpress.com/latex.php?latex=%5Csigma_2&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\sigma_2' title='\sigma_2' class='latex' />, but <img src='http://s.wordpress.com/latex.php?latex=%5Csigma_1%3D%5Csigma_2&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\sigma_1=\sigma_2' title='\sigma_1=\sigma_2' class='latex' /></b><br />
The test statistic can be written as</p>
<img src='http://s.wordpress.com/latex.php?latex=%20%20%5Cdfrac%7B%5Coverline%7BX%7D-%5Coverline%7BY%7D-%28%5Cmu_1-%5Cmu_2%29%7D%7BS_p%5Csqrt%7B%5Cfrac%7B1%7D%7Bn_1%7D%2B%5Cfrac%7B1%7D%7Bn_2%7D%7D%7D%20%5Csim%20t%28n_1%20%2B%20n_2%20-%202%29%20%20&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='  \dfrac{\overline{X}-\overline{Y}-(\mu_1-\mu_2)}{S_p\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}} \sim t(n_1 + n_2 - 2)  ' title='  \dfrac{\overline{X}-\overline{Y}-(\mu_1-\mu_2)}{S_p\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}} \sim t(n_1 + n_2 - 2)  ' class='latex' />
<p>where <img src='http://s.wordpress.com/latex.php?latex=S_p%5E2&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S_p^2' title='S_p^2' class='latex' /> is the so-called pooled sample variance, whose square root is the pooled standard deviation used in the denominator:</p>
<img src='http://s.wordpress.com/latex.php?latex=%20%20S_p%5E2%3D%5Cdfrac%7B%28n_1-1%29S_1%5E2%20%2B%20%28n_2-1%29S_2%5E2%7D%7B%28n_1%20%2B%20n_2%20-%202%29%7D%20%20&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='  S_p^2=\dfrac{(n_1-1)S_1^2 + (n_2-1)S_2^2}{(n_1 + n_2 - 2)}  ' title='  S_p^2=\dfrac{(n_1-1)S_1^2 + (n_2-1)S_2^2}{(n_1 + n_2 - 2)}  ' class='latex' />
<p>(Note: We&#8217;re still assuming that <img src='http://s.wordpress.com/latex.php?latex=X&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='X' title='X' class='latex' /> and <img src='http://s.wordpress.com/latex.php?latex=Y&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='Y' title='Y' class='latex' /> follow a normal distribution. See <a href="http://en.wikipedia.org/wiki/Student%27s_t-test#Assumptions">assumptions of t-test</a>.)</p>
<p>As an example, let&#8217;s test whether the means are the same for &#8220;1, 3, 2, 7, 8, 9, 3, 4, 5&#8221; and &#8220;1, 2, 4, 3, 2, 5, 6, 7, 8, 2, 3, 5&#8221;.</p>
<p>First, let&#8217;s test whether the variances are the same:</p>
<pre class="brush: plain; title: ; notranslate">

&gt; x = c(1, 3, 2, 7, 8, 9, 3, 4, 5)
&gt; y = c(1, 2, 4, 3, 2, 5, 6, 7, 8, 2, 3, 5)
&gt; var.test(x,y)

	F test to compare two variances

data:  x and y
F = 1.5787, num df = 8, denom df = 11, p-value = 0.4734
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.4308902 6.6990915
sample estimates:
ratio of variances
          1.578704
</pre>
<p>As the p-value is greater than 0.05, we cannot reject the hypothesis that their variances are the same.</p>
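<p>To make the F test less of a black box, here is a small sketch that recomputes the statistic by hand. It assumes the two-sided p-value is twice the smaller tail probability of the F distribution, and it should reproduce the F = 1.5787 and p-value = 0.4734 reported above:</p>
<pre class="brush: plain; title: ; notranslate">
x = c(1, 3, 2, 7, 8, 9, 3, 4, 5)
y = c(1, 2, 4, 3, 2, 5, 6, 7, 8, 2, 3, 5)

# F statistic: ratio of the two sample variances.
f_stat = var(x) / var(y)

# Degrees of freedom of the numerator and the denominator.
df1 = length(x) - 1
df2 = length(y) - 1

# Two-sided p-value: twice the smaller of the two tail probabilities.
p_lower = pf(f_stat, df1, df2)
p_value = 2 * min(p_lower, 1 - p_lower)
print(c(F = f_stat, p_value = p_value))
</pre>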
<p>If we need to test normality, we want to check whether <img src='http://s.wordpress.com/latex.php?latex=%5Coverline%7BX%7D-%5Coverline%7BY%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\overline{X}-\overline{Y}' title='\overline{X}-\overline{Y}' class='latex' /> is normally distributed. In this example, we could not reject that the variance is the same for <img src='http://s.wordpress.com/latex.php?latex=X&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='X' title='X' class='latex' /> and <img src='http://s.wordpress.com/latex.php?latex=Y&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='Y' title='Y' class='latex' />. So, when using shapiro.test, we can think of this t-test as a simplified version of ANOVA. Then, <img src='http://s.wordpress.com/latex.php?latex=X_i%20%3D%20%5Cmu_i%20%2B%20%5Cepsilon_i&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='X_i = \mu_i + \epsilon_i' title='X_i = \mu_i + \epsilon_i' class='latex' /> and <img src='http://s.wordpress.com/latex.php?latex=Y_j%20%3D%20%5Cmu_j%20%2B%20%5Cepsilon_j&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='Y_j = \mu_j + \epsilon_j' title='Y_j = \mu_j + \epsilon_j' class='latex' /> where <img src='http://s.wordpress.com/latex.php?latex=%5Cepsilon_%7Bi%7D%20%5Csim%20N%280%2C%20%5Csigma_E%29%2C%7E%5Cepsilon_%7Bj%7D%20%5Csim%20N%280%2C%20%5Csigma_E%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\epsilon_{i} \sim N(0, \sigma_E),~\epsilon_{j} \sim N(0, \sigma_E)' title='\epsilon_{i} \sim N(0, \sigma_E),~\epsilon_{j} \sim N(0, \sigma_E)' class='latex' />. As <img src='http://s.wordpress.com/latex.php?latex=%5Cepsilon_i&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\epsilon_i' title='\epsilon_i' class='latex' /> and <img src='http://s.wordpress.com/latex.php?latex=%5Cepsilon_j&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\epsilon_j' title='\epsilon_j' class='latex' /> are normally distributed with the same mean and variance, we can pool the residuals from both samples and test their normality together.
With the same data as above, &#8220;1, 3, 2, 7, 8, 9, 3, 4, 5&#8221; and &#8220;1, 2, 4, 3, 2, 5, 6, 7, 8, 2, 3, 5&#8221;, run shapiro.test as below:</p>
<pre class="brush: plain; title: ; notranslate">
&gt; x = c(1, 3, 2, 7, 8, 9, 3, 4, 5)
&gt; y = c(1, 2, 4, 3, 2, 5, 6, 7, 8, 2, 3, 5)
&gt; shapiro.test(c(x-mean(x), y-mean(y)))

	Shapiro-Wilk normality test

data:  c(x - mean(x), y - mean(y))
W = 0.9426, p-value = 0.2452
</pre>
<p>In this case, we cannot reject H0 that the residuals are normally distributed.</p>
<p>Another way of doing this is using lm:</p>
<pre class="brush: plain; title: ; notranslate">
&gt; f = data.frame(val=c(x, y), klass=c(rep(&quot;x&quot;, NROW(x)), rep(&quot;y&quot;, NROW(y))))
&gt; f
   val klass
1    1     x
2    3     x
3    2     x
4    7     x
5    8     x
6    9     x
7    3     x
8    4     x
9    5     x
10   1     y
11   2     y
12   4     y
13   3     y
14   2     y
15   5     y
16   6     y
17   7     y
18   8     y
19   2     y
20   3     y
21   5     y
&gt; # As klass is a factor, lm fits val = mu + alpha * d + epsilon, where the dummy variable d is 0 for x and 1 for y.
&gt; shapiro.test(resid(lm(val ~ klass, data=f)))

	Shapiro-Wilk normality test

data:  resid(lm(val ~ klass, data = f))
W = 0.9426, p-value = 0.2452
</pre>
<p>As you can see, using lm gives the same result as subtracting the mean from x and y separately.</p>
<p>If the variances were different, we would use shapiro.test for each of <img src='http://s.wordpress.com/latex.php?latex=X&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='X' title='X' class='latex' /> and <img src='http://s.wordpress.com/latex.php?latex=Y&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='Y' title='Y' class='latex' /> separately.</p>
<p>Now, the t-test:</p>
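<p>A minimal sketch of that per-sample check, reusing the example data only to show the calls (for these data the variances were not found to differ, so this is purely illustrative):</p>
<pre class="brush: plain; title: ; notranslate">
x = c(1, 3, 2, 7, 8, 9, 3, 4, 5)
y = c(1, 2, 4, 3, 2, 5, 6, 7, 8, 2, 3, 5)

# Run the Shapiro-Wilk test on each sample separately and keep the p-values.
p_values = sapply(list(x = x, y = y), function(v) shapiro.test(v)$p.value)
print(p_values)
</pre>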
<pre class="brush: plain; title: ; notranslate">
&gt; t.test(x, y, var.equal=TRUE)

	Two Sample t-test

data:  x and y
t = 0.6119, df = 19, p-value = 0.5479
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.613802  2.947136
sample estimates:
mean of x mean of y
 4.666667  4.000000
</pre>
<p>The 95% confidence interval includes zero and, correspondingly, the p-value is greater than 0.05, so we cannot reject H0 that their means are the same.</p>
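<p>As a sanity check, the equal-variance statistic can also be computed directly from the pooled variance formula. This is only a sketch of the arithmetic; it should agree with the t = 0.6119, df = 19 and p-value = 0.5479 that t.test printed above:</p>
<pre class="brush: plain; title: ; notranslate">
x = c(1, 3, 2, 7, 8, 9, 3, 4, 5)
y = c(1, 2, 4, 3, 2, 5, 6, 7, 8, 2, 3, 5)
n1 = length(x)
n2 = length(y)

# Pooled sample variance: weighted average of the two sample variances.
sp2 = ((n1 - 1) * var(x) + (n2 - 1) * var(y)) / (n1 + n2 - 2)

# Test statistic under H0: mu_1 - mu_2 = 0.
t_stat = (mean(x) - mean(y)) / sqrt(sp2 * (1 / n1 + 1 / n2))

# Two-sided p-value from the t distribution with n1 + n2 - 2 degrees of freedom.
p_value = 2 * pt(-abs(t_stat), df = n1 + n2 - 2)
print(c(t = t_stat, df = n1 + n2 - 2, p_value = p_value))
</pre>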
<p><b>4) If we know that <img src='http://s.wordpress.com/latex.php?latex=%5Csigma_1%20%5Cneq%20%5Csigma_2&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\sigma_1 \neq \sigma_2' title='\sigma_1 \neq \sigma_2' class='latex' /></b></p>
<p>We&#8217;re still assuming that <img src='http://s.wordpress.com/latex.php?latex=X&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='X' title='X' class='latex' /> and <img src='http://s.wordpress.com/latex.php?latex=Y&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='Y' title='Y' class='latex' /> follow normal distributions and are independent. As their variances are not the same, we just use the fact that <img src='http://s.wordpress.com/latex.php?latex=X%20-%20Y%20%5Csim%20N%28%5Cmu_1%20-%20%5Cmu_2%2C%20%5Csigma_1%5E2%20%2B%20%5Csigma_2%5E2%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='X - Y \sim N(\mu_1 - \mu_2, \sigma_1^2 + \sigma_2^2)' title='X - Y \sim N(\mu_1 - \mu_2, \sigma_1^2 + \sigma_2^2)' class='latex' />, where the second parameter is the variance.</p>
<p>Because we do not know their variances, we use the sample variances:</p>
<img src='http://s.wordpress.com/latex.php?latex=%20%20%5Cdfrac%7B%5Coverline%7BX%7D-%5Coverline%7BY%7D%7D%7B%5Csqrt%7B%5Cfrac%7BS_1%5E2%7D%7Bn_1%7D%20%2B%20%5Cfrac%7BS_2%5E2%7D%7Bn_2%7D%7D%7D%20%5Csim%20t%28df%29%20%20&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='  \dfrac{\overline{X}-\overline{Y}}{\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}} \sim t(df)  ' title='  \dfrac{\overline{X}-\overline{Y}}{\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}} \sim t(df)  ' class='latex' />
<p>The R code is the same, except that we use t.test(x, y, var.equal=FALSE). This is Welch&#8217;s t-test, which is also what t.test does by default.</p>
<p>But one should think hard about why one wants to compare the means in the first place when the variances are different.</p>
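<p>For the unequal-variance case, the df in the formula above is not an integer in general. The sketch below assumes the Welch-Satterthwaite approximation for df, which is what R&#8217;s t.test uses when var.equal=FALSE, and reuses the earlier example data only to show the arithmetic:</p>
<pre class="brush: plain; title: ; notranslate">
x = c(1, 3, 2, 7, 8, 9, 3, 4, 5)
y = c(1, 2, 4, 3, 2, 5, 6, 7, 8, 2, 3, 5)

# Estimated variance of each sample mean.
v1 = var(x) / length(x)
v2 = var(y) / length(y)

# Test statistic with separate (unpooled) variances.
t_stat = (mean(x) - mean(y)) / sqrt(v1 + v2)

# Welch-Satterthwaite approximation of the degrees of freedom.
welch_df = (v1 + v2)^2 / (v1^2 / (length(x) - 1) + v2^2 / (length(y) - 1))

# Two-sided p-value.
p_value = 2 * pt(-abs(t_stat), df = welch_df)
print(c(t = t_stat, df = welch_df, p_value = p_value))
</pre>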
<h3>Paired sample t-test</h3>
<p>I think this is the kind of data that any intelligent engineer will try to get from an experiment. Paired samples have data in this form: <img src='http://s.wordpress.com/latex.php?latex=%28X_1%2C%20Y_1%29%2C%7E%20%28X_2%2C%20Y_2%29%2C%7E%20%5Ccdots%2C%7E%20%28X_n%2C%20Y_n%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='(X_1, Y_1),~ (X_2, Y_2),~ \cdots,~ (X_n, Y_n)' title='(X_1, Y_1),~ (X_2, Y_2),~ \cdots,~ (X_n, Y_n)' class='latex' />. For example, it could be (old method performance, new method performance) pairs observed on several machines.</p>
<p>If <img src='http://s.wordpress.com/latex.php?latex=X&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='X' title='X' class='latex' /> and <img src='http://s.wordpress.com/latex.php?latex=Y&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='Y' title='Y' class='latex' /> are normally distributed, <img src='http://s.wordpress.com/latex.php?latex=D%3DX-Y&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='D=X-Y' title='D=X-Y' class='latex' /> follows a normal distribution. Even when that is not the case, the <a href="http://en.wikipedia.org/wiki/Central_limit_theorem">Central Limit Theorem</a> states that the sample mean is approximately normally distributed when the sample is large. Therefore <img src='http://s.wordpress.com/latex.php?latex=%5Coverline%7BD%7D%20%5Csim%20N&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\overline{D} \sim N' title='\overline{D} \sim N' class='latex' />, at least approximately.</p>
<p>As we do not know the variance of <img src='http://s.wordpress.com/latex.php?latex=D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='D' title='D' class='latex' />, we use the sample variance to get:</p>
<img src='http://s.wordpress.com/latex.php?latex=%20%20%5Cdfrac%7B%5Coverline%7BD%7D%20-%20%5Cmu_D%7D%7BS_D%20%2F%20%5Csqrt%7Bn%7D%7D%20%5Csim%20t%28n-1%29%20%20&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='  \dfrac{\overline{D} - \mu_D}{S_D / \sqrt{n}} \sim t(n-1)  ' title='  \dfrac{\overline{D} - \mu_D}{S_D / \sqrt{n}} \sim t(n-1)  ' class='latex' />
<p>In R (I am assuming <img src='http://s.wordpress.com/latex.php?latex=X%20%5Csim%20N&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='X \sim N' title='X \sim N' class='latex' /> and <img src='http://s.wordpress.com/latex.php?latex=Y%20%5Csim%20N&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='Y \sim N' title='Y \sim N' class='latex' />. Without this assumption, one should run a normality test first, as we have only a small number of observations in this example):</p>
<pre class="brush: plain; title: ; notranslate">
&gt; x = c(1, 2, 3, 4, 3, 2)
&gt; y = c(5, 3, 2, 3, 1, 7)
&gt; t.test(x, y, paired=TRUE)

	Paired t-test

data:  x and y
t = -0.8452, df = 5, p-value = 0.4366
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -4.041553  2.041553
sample estimates:
mean of the differences
                     -1
</pre>
<p>We conclude that we cannot reject H0 that the true means are the same. In plain language, the data give no evidence that the means differ.</p>
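<p>The paired test is just a one-sample t-test on the differences, which the following sketch spells out by hand. It should reproduce the t = -0.8452 and p-value = 0.4366 shown above:</p>
<pre class="brush: plain; title: ; notranslate">
x = c(1, 2, 3, 4, 3, 2)
y = c(5, 3, 2, 3, 1, 7)

# Work with the per-pair differences only.
d = x - y
n = length(d)

# One-sample t statistic for H0: mu_D = 0.
t_stat = mean(d) / (sd(d) / sqrt(n))

# Two-sided p-value from the t distribution with n - 1 degrees of freedom.
p_value = 2 * pt(-abs(t_stat), df = n - 1)
print(c(t = t_stat, df = n - 1, p_value = p_value))
</pre>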
<h3>If the assumptions do not hold</h3>
<p>All the methods above rely on some assumption, such as a large sample size or normally distributed data.</p>
<p>If such assumptions look invalid, one could use non-parametric methods such as rank tests. For example, for the paired t-test case above, we can use the Wilcoxon signed rank test:</p>
<pre class="brush: plain; title: ; notranslate">
&gt; x = c(1, 2, 3, 4, 3, 2)
&gt; y = c(5, 3, 2, 3, 1, 7)
&gt; wilcox.test(x, y, paired=TRUE)

	Wilcoxon signed rank test with continuity correction

data:  x and y
V = 8, p-value = 0.6716
alternative hypothesis: true location shift is not equal to 0
</pre>
<p>See <a href="http://mkseo.pe.kr/stats/?p=319">Rank Tests</a> for more examples.</p>
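<p>For two independent samples, the non-parametric counterpart is the Wilcoxon rank sum test. The sketch below applies it to the two independent samples used earlier in this post. Because the data contain ties, exact=FALSE is passed to request the normal approximation explicitly; that choice is mine, not something the original analysis specified:</p>
<pre class="brush: plain; title: ; notranslate">
x = c(1, 3, 2, 7, 8, 9, 3, 4, 5)
y = c(1, 2, 4, 3, 2, 5, 6, 7, 8, 2, 3, 5)

# Unpaired samples: rank sum test instead of the two-sample t-test.
# exact = FALSE asks for the normal approximation, since an exact
# p-value cannot be computed when the data contain ties.
res = wilcox.test(x, y, exact = FALSE)
print(res)
</pre>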
<p>References)<br />
배도선 외, 통계학 이론과 응용, 청문각.<br />
임동훈, R을 이용한 비모수 통계학, 자유아카데미.<br />
김재희, R을 이용한 통계 프로그래밍 기초, 자유아카데미.<br />
안재형, R을 이용한 누구나하는 통계분석, 한나래.</p>
]]></content:encoded>
			<wfw:commentRss>http://mkseo.pe.kr/stats/?feed=rss2&#038;p=396</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

