• A paper on imbalanced training sets

    Miroslav Kubat and Stan Matwin, "Addressing the Curse of Imbalanced Training Sets: One-Sided Selection," in Proceedings of the Fourteenth International Conference on Machine Learning, 1997. For a classification task where the output is either T or F, if the training set contains too many T examples while the number of F examples is small, then the classifier performs…
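
    The paper's one-sided selection prunes redundant and borderline majority examples (via CNN and Tomek links). As a much simpler illustration of the rebalancing idea, here is a minimal R sketch of random undersampling of the majority class; the data and class sizes are made up for the example.

    ```r
    # Synthetic imbalanced data: 950 T vs. 50 F.
    set.seed(1)
    df <- data.frame(x = rnorm(1000),
                     y = factor(c(rep("T", 950), rep("F", 50))))

    majority <- df[df$y == "T", ]
    minority <- df[df$y == "F", ]

    # Keep only as many majority examples as there are minority examples.
    keep <- majority[sample(nrow(majority), nrow(minority)), ]
    balanced <- rbind(minority, keep)
    table(balanced$y)  # F: 50, T: 50
    ```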


  • Testing Normality

    In this post, I'll demonstrate one-sample tests for checking whether a given sample comes from a normal distribution with mean = 0 and stddev = 1. The most representative test is the Shapiro-Wilk test. As p > 0.05, we cannot reject H0 (normal distribution). Another test is the Kolmogorov-Smirnov test, a popular non-parametric test (this implies that the K-S test works for…
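
    A minimal sketch of both tests in R, using a simulated N(0, 1) sample (the sample itself is made up for the demonstration):

    ```r
    # Draw a sample from the standard normal distribution.
    set.seed(42)
    x <- rnorm(100, mean = 0, sd = 1)

    # Shapiro-Wilk: H0 is that the sample is normally distributed.
    shapiro.test(x)

    # Kolmogorov-Smirnov against the standard normal N(0, 1).
    ks.test(x, "pnorm", mean = 0, sd = 1)
    ```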


  • STL for extra large datasets

    http://stxxl.sourceforge.net/ Tools like this are appearing these days. Ultimately, the end goal would be for existing programs to be parallelized automatically, without any effort from the programmer.


  • Multivariate Adaptive Regression Spline

    Multivariate Adaptive Regression Spline (MARS) is an extension of linear regression that handles non-linearity, as explained on Wikipedia. In R: Certainly, this cannot be described by a simple linear model. We use earth: As one can see, we have a couple of hinges. More at Notes on earth package. There's another package called mars…
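
    For illustration, a minimal earth sketch on made-up hinge-shaped data (the data-generating function here is an assumption for the example):

    ```r
    library(earth)  # install.packages("earth") if needed

    # Hinge-shaped data that a plain linear model cannot capture.
    set.seed(1)
    x <- seq(0, 10, length.out = 200)
    y <- pmax(0, x - 4) + rnorm(200, sd = 0.3)

    fit <- earth(y ~ x, data = data.frame(x, y))
    summary(fit)  # the printed basis functions are the fitted hinges, e.g. h(x-4)
    ```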


  • Support Vector Regression

    SVM can also be used for regression: see A Tutorial on Support Vector Regression. For libraries in R, see Support Vector Machines in R.
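
    A minimal sketch using the svm() function from the e1071 package (one of several R SVM implementations; the sine data is made up for the example):

    ```r
    library(e1071)

    set.seed(1)
    x <- seq(0, 2 * pi, length.out = 100)
    y <- sin(x) + rnorm(100, sd = 0.1)
    d <- data.frame(x, y)

    # type = "eps-regression" selects epsilon-SVR rather than classification.
    model <- svm(y ~ x, data = d, type = "eps-regression", kernel = "radial")
    pred  <- predict(model, d)
    ```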


  • Random Forest in R

    Random Forest is an ensemble classifier of decision trees. Wikipedia has a nice explanation of the learning algorithm. The key idea is to use bagging, which fits m models, each trained on n' samples drawn from the data of size n. Plus, when choosing criteria for decision tree node splits, one feature…
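
    A minimal sketch with the randomForest package on the built-in iris data; the ntree and mtry values here are just illustrative:

    ```r
    library(randomForest)  # install.packages("randomForest") if needed

    set.seed(1)
    # ntree bagged trees; mtry features are considered at each node split.
    fit <- randomForest(Species ~ ., data = iris, ntree = 500, mtry = 2)
    print(fit)  # out-of-bag error estimate and confusion matrix
    ```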


  • Why do we use 0.05 for statistical significance?

    http://www.jerrydallal.com/LHSP/p05.htm Maybe I can summarize it like this. First of all, Fisher started using 0.05 when deciding whether a result is statistically meaningful. p = 0.05 corresponds to roughly two standard deviations and to odds of 1 in 20, so it looks good enough. Also, once it was accepted, moving to p = 0.06 or p = 0.07 was not welcomed. In addition, using p = 0.05 is easy for scientific…
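
    The "two standard deviations" and "1 in 20" claims are easy to check numerically in R:

    ```r
    # The two-sided tail area beyond about two standard deviations of N(0, 1)
    # is roughly 0.05, i.e. a 1-in-20 chance.
    2 * pnorm(-1.96)  # ~0.0500
    qnorm(0.975)      # ~1.96, the exact cutoff for a two-sided 0.05 tail
    ```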


  • unclass and as.character

    http://www2.warwick.ac.uk/fac/sci/moac/people/students/peter_cock/r/iris_plots/ Look for unclass() in the middle of the page, where it is used for drawing three-dimensional data in a 2D plot with pch=unclass(…). Or one can use as.character().
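
    A minimal sketch with the built-in iris data, along the lines of the linked page:

    ```r
    # unclass() strips the factor class, exposing the underlying integer codes,
    # which plot() accepts as plotting symbols via pch.
    plot(iris$Petal.Length, iris$Petal.Width, pch = unclass(iris$Species))

    # as.character() on the codes gives single-character symbols instead.
    plot(iris$Petal.Length, iris$Petal.Width,
         pch = as.character(unclass(iris$Species)))
    ```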


  • HMC Calculus Tutorial

    http://www.math.hmc.edu/calculus/tutorials/ Contains calculus and linear algebra tutorials.


  • Expectation Maximization

    http://see.stanford.edu/see/materials/aimlcs229/handouts.aspx See 'Mixtures of Gaussians and the EM algorithm' and 'The EM algorithm'. This is the easiest-to-understand explanation of this topic I've ever seen on the internet. For books, Pattern Classification by Duda has a good chapter on it.
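
    As a quick reminder of what the handouts derive (in the CS229 notation for a mixture of k Gaussians), EM alternates between the two steps below; this is only a summary of the standard updates, not a substitute for the notes.

    ```latex
    % E-step: responsibilities of component j for point x^{(i)}.
    \[
    w_j^{(i)} =
      \frac{\phi_j \, \mathcal{N}\!\left(x^{(i)}; \mu_j, \Sigma_j\right)}
           {\sum_{l=1}^{k} \phi_l \, \mathcal{N}\!\left(x^{(i)}; \mu_l, \Sigma_l\right)}
    \]
    % M-step: re-estimate the parameters from the responsibilities.
    \[
    \phi_j = \frac{1}{n} \sum_{i=1}^{n} w_j^{(i)}, \qquad
    \mu_j  = \frac{\sum_{i=1}^{n} w_j^{(i)} x^{(i)}}{\sum_{i=1}^{n} w_j^{(i)}}, \qquad
    \Sigma_j = \frac{\sum_{i=1}^{n} w_j^{(i)}
               \left(x^{(i)} - \mu_j\right)\left(x^{(i)} - \mu_j\right)^{\top}}
              {\sum_{i=1}^{n} w_j^{(i)}}
    \]
    ```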
