• A paper on imbalanced training sets

    Miroslav Kubat and Stan Matwin, "Addressing the Curse of Imbalanced Training Sets: One-Sided Selection," in Proceedings of the Fourteenth International Conference on Machine Learning, 1997. For a classification task where the output is either T or F, if the training set contains too many T examples while the number of F examples is small, then the classifier performs…
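
    The paper's one-sided selection prunes redundant and borderline majority examples (via CNN and Tomek links). As a much simpler illustration of the rebalancing idea, here is a minimal R sketch of random undersampling of the majority class; the data and class sizes are made up for the example.

    ```r
    # Synthetic imbalanced data: 950 T vs. 50 F.
    set.seed(1)
    df <- data.frame(x = rnorm(1000),
                     y = factor(c(rep("T", 950), rep("F", 50))))

    majority <- df[df$y == "T", ]
    minority <- df[df$y == "F", ]

    # Keep only as many majority examples as there are minority examples.
    keep <- majority[sample(nrow(majority), nrow(minority)), ]
    balanced <- rbind(minority, keep)
    table(balanced$y)  # F: 50, T: 50
    ```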


  • Testing Normality

    In this post, I'll demonstrate one-sample tests for checking whether a given sample comes from a normal distribution with mean = 0 and stddev = 1. The most representative test is the Shapiro-Wilk test. As p > 0.05, we cannot reject H0 (normal distribution). Another test is the Kolmogorov-Smirnov test, a popular non-parametric test (this implies that the K-S test works for…
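
    A minimal sketch of both tests in R, using a simulated N(0, 1) sample (the sample itself is made up for the demonstration):

    ```r
    # Draw a sample from the standard normal distribution.
    set.seed(42)
    x <- rnorm(100, mean = 0, sd = 1)

    # Shapiro-Wilk: H0 is that the sample is normally distributed.
    shapiro.test(x)

    # Kolmogorov-Smirnov against the standard normal N(0, 1).
    ks.test(x, "pnorm", mean = 0, sd = 1)
    ```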


  • STL for extra large datasets

    http://stxxl.sourceforge.net/ Tools like this are appearing these days. Ultimately, the end goal would be for existing programs to be parallelized automatically, without any effort from the programmer.


  • Multivariate Adaptive Regression Spline

    Multivariate Adaptive Regression Spline (MARS) is an extension of linear regression that handles non-linearity, as explained on Wikipedia. In R: Certainly, this cannot be described by a simple linear model. We use earth: As one can see, we have a couple of hinges. More at Notes on earth package. There's another package called mars…
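
    For illustration, a minimal earth sketch on made-up hinge-shaped data (the data-generating function here is an assumption for the example):

    ```r
    library(earth)  # install.packages("earth") if needed

    # Hinge-shaped data that a plain linear model cannot capture.
    set.seed(1)
    x <- seq(0, 10, length.out = 200)
    y <- pmax(0, x - 4) + rnorm(200, sd = 0.3)

    fit <- earth(y ~ x, data = data.frame(x, y))
    summary(fit)  # the printed basis functions are the fitted hinges, e.g. h(x-4)
    ```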


  • Support Vector Regression

    SVM can also be used for regression: see A Tutorial on Support Vector Regression. For libraries in R, see Support Vector Machines in R.
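
    A minimal sketch using the svm() function from the e1071 package (one of several R SVM implementations; the sine data is made up for the example):

    ```r
    library(e1071)

    set.seed(1)
    x <- seq(0, 2 * pi, length.out = 100)
    y <- sin(x) + rnorm(100, sd = 0.1)
    d <- data.frame(x, y)

    # type = "eps-regression" selects epsilon-SVR rather than classification.
    model <- svm(y ~ x, data = d, type = "eps-regression", kernel = "radial")
    pred  <- predict(model, d)
    ```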


  • Random Forest in R

    Random Forest is an ensemble classifier of decision trees. Wikipedia has a nice explanation of the learning algorithm. The key idea is to use bagging, which fits m models, each trained on n' samples drawn from the data of size n. Plus, when choosing criteria for decision tree node splits, one feature…
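
    A minimal sketch with the randomForest package on the built-in iris data; the ntree and mtry values here are just illustrative:

    ```r
    library(randomForest)  # install.packages("randomForest") if needed

    set.seed(1)
    # ntree bagged trees; mtry features are considered at each node split.
    fit <- randomForest(Species ~ ., data = iris, ntree = 500, mtry = 2)
    print(fit)  # out-of-bag error estimate and confusion matrix
    ```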


  • Why do we use 0.05 for statistical significance?

    http://www.jerrydallal.com/LHSP/p05.htm Maybe I can summarize it like this. First of all, Fisher started using 0.05 when deciding whether a result is statistically meaningful. p = 0.05 corresponds to roughly two standard deviations and to odds of 1 in 20, so it looks good enough. Also, once it was accepted, moving to p = 0.06 or p = 0.07 was not welcomed. In addition, using p = 0.05 is easy for scientific…
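
    The "two standard deviations" and "1 in 20" claims are easy to check numerically in R:

    ```r
    # The two-sided tail area beyond about two standard deviations of N(0, 1)
    # is roughly 0.05, i.e. a 1-in-20 chance.
    2 * pnorm(-1.96)  # ~0.0500
    qnorm(0.975)      # ~1.96, the exact cutoff for a two-sided 0.05 tail
    ```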


  • unclass and as.character

    http://www2.warwick.ac.uk/fac/sci/moac/people/students/peter_cock/r/iris_plots/ Look for unclass() in the middle of the page, where it is used for drawing three-dimensional data in a 2D plot with pch=unclass(…). Or one can use as.character().
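
    A minimal sketch with the built-in iris data, along the lines of the linked page:

    ```r
    # unclass() strips the factor class, exposing the underlying integer codes,
    # which plot() accepts as plotting symbols via pch.
    plot(iris$Petal.Length, iris$Petal.Width, pch = unclass(iris$Species))

    # as.character() on the codes gives single-character symbols instead.
    plot(iris$Petal.Length, iris$Petal.Width,
         pch = as.character(unclass(iris$Species)))
    ```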


  • HMC Calculus Tutorial

    http://www.math.hmc.edu/calculus/tutorials/ Contains calculus and linear algebra tutorials.


  • Expectation Maximization

    http://see.stanford.edu/see/materials/aimlcs229/handouts.aspx See 'Mixtures of Gaussians and the EM algorithm' and 'The EM algorithm'. This is the easiest-to-understand explanation of this topic I've ever seen on the internet. For books, Pattern Classification by Duda has a good chapter on it.
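
    As a quick reminder of what the handouts derive (in the CS229 notation for a mixture of k Gaussians), EM alternates between the two steps below; this is only a summary of the standard updates, not a substitute for the notes.

    ```latex
    % E-step: responsibilities of component j for point x^{(i)}.
    \[
    w_j^{(i)} =
      \frac{\phi_j \, \mathcal{N}\!\left(x^{(i)}; \mu_j, \Sigma_j\right)}
           {\sum_{l=1}^{k} \phi_l \, \mathcal{N}\!\left(x^{(i)}; \mu_l, \Sigma_l\right)}
    \]
    % M-step: re-estimate the parameters from the responsibilities.
    \[
    \phi_j = \frac{1}{n} \sum_{i=1}^{n} w_j^{(i)}, \qquad
    \mu_j  = \frac{\sum_{i=1}^{n} w_j^{(i)} x^{(i)}}{\sum_{i=1}^{n} w_j^{(i)}}, \qquad
    \Sigma_j = \frac{\sum_{i=1}^{n} w_j^{(i)}
               \left(x^{(i)} - \mu_j\right)\left(x^{(i)} - \mu_j\right)^{\top}}
              {\sum_{i=1}^{n} w_j^{(i)}}
    \]
    ```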
