-
A paper on imbalanced training sets
Miroslav Kubat and Stan Matwin, "Addressing the Curse of Imbalanced Training Sets: One-Sided Selection," in Proceedings of the Fourteenth International Conference on Machine Learning, 1997. For a classification task where the output is either T or F, if the training set contains too many T examples while the number of F examples is small, then the classifier performs…
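To make the imbalance problem concrete, here is a minimal Python sketch of plain random undersampling, the naive baseline that the paper's one-sided selection refines (their method removes only borderline, noisy, or redundant majority examples rather than random ones). The function name `undersample_majority` is my own, for illustration:

```python
import random

def undersample_majority(examples, labels, majority_label, rng=None):
    """Randomly drop majority-class examples until classes are balanced.

    This is plain random undersampling -- NOT the one-sided selection of
    Kubat & Matwin, which keeps informative majority examples instead of
    a random subset.
    """
    rng = rng or random.Random(0)
    majority = [(x, y) for x, y in zip(examples, labels) if y == majority_label]
    minority = [(x, y) for x, y in zip(examples, labels) if y != majority_label]
    kept = rng.sample(majority, k=len(minority))  # keep as many as the minority class
    balanced = minority + kept
    rng.shuffle(balanced)
    xs, ys = zip(*balanced)
    return list(xs), list(ys)

# 90 "T" examples vs. 10 "F" examples -> balanced 10 vs. 10
xs, ys = undersample_majority(list(range(100)), ["T"] * 90 + ["F"] * 10, "T")
```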
Tags:
-
Testing Normality
In this post, I’ll demonstrate one-sample tests for checking whether a given sample comes from a normal distribution with mean = 0 and stddev = 1. The most representative test is the Shapiro-Wilk test. As p > 0.05, we cannot reject H0 (normal distribution). Another test is the Kolmogorov-Smirnov test, a popular non-parametric test (this implies that the K-S test works for…
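The post presumably uses R's `shapiro.test` and `ks.test`; as a language-neutral sketch, here is the one-sample K-S statistic against N(0, 1) computed by hand in Python using only the standard library (the helper names are mine). The D statistic is the largest gap between the empirical CDF and the standard normal CDF:

```python
import math
import random

def std_normal_cdf(x):
    """CDF of N(0, 1) via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def ks_statistic(sample):
    """One-sample Kolmogorov-Smirnov D against N(0, 1):
    the maximum distance between the empirical CDF and the normal CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        cdf = std_normal_cdf(x)
        # ECDF jumps from i/n to (i+1)/n at x; check the gap on both sides
        d = max(d, abs((i + 1) / n - cdf), abs(cdf - i / n))
    return d

rng = random.Random(42)
d_normal = ks_statistic([rng.gauss(0, 1) for _ in range(500)])   # small D
d_shifted = ks_statistic([rng.uniform(5, 6) for _ in range(500)])  # huge D
```

For a sample of 500, the 5% critical value is roughly 1.36 / sqrt(500) ≈ 0.061: a D below that means we cannot reject normality.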
Tags:
-
STL for extra large datasets
http://stxxl.sourceforge.net/ These days even tools like this are appearing. Ultimately, the end goal would be for existing programs to be parallelized automatically, without any effort from the programmer.
Tags:
-
Multivariate Adaptive Regression Spline
Multivariate Adaptive Regression Splines (MARS) is an extension of linear regression that handles non-linearity, as explained on Wikipedia. In R, certainly, this cannot be described by a simple linear model; we use earth. As one can see, we have a couple of hinges. More at Notes on the earth package. There’s another package called mars…
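The building block of a MARS model is the hinge function h(x) = max(0, x − c), which is zero on one side of a knot c and linear on the other. A minimal Python sketch (the post itself uses R's earth package; the coefficients and knot below are illustrative, not fitted):

```python
def hinge(x, c):
    """Right hinge max(0, x - c): zero below the knot c, linear above it."""
    return max(0.0, x - c)

def mars_like_model(x):
    """A hand-built MARS-style model with one knot at c = 5:
    y = 2 + 3 * h(x - 5) + 1.5 * h(5 - x).
    A fitted model would choose the knot and coefficients from data."""
    return 2.0 + 3.0 * hinge(x, 5.0) + 1.5 * hinge(5.0 - x, 0.0)

# Piecewise linear: slope -1.5 below the knot, +3 above it
ys = [mars_like_model(x) for x in (3.0, 5.0, 7.0)]
```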
Tags:
-
Support Vector Regression
SVM can also be used for regression: see A Tutorial on Support Vector Regression. For libraries in R, see Support Vector Machines in R.
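The key idea that turns SVM into a regression method is the epsilon-insensitive loss: errors within a tube of width ±epsilon around the prediction cost nothing, and beyond that the cost grows linearly. A tiny Python sketch of just that loss (function name is mine):

```python
def epsilon_insensitive_loss(y_true, y_pred, eps=0.1):
    """SVR's epsilon-insensitive loss: zero inside the +/- eps tube,
    linear in the error outside it."""
    return max(0.0, abs(y_true - y_pred) - eps)

inside = epsilon_insensitive_loss(1.0, 1.05)        # within the tube -> 0
outside = epsilon_insensitive_loss(1.0, 1.5)        # 0.4 beyond the tube
```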
Tags:
-
Random Forest in R
Random Forest is an ensemble classifier of decision trees. Wikipedia has a nice explanation of the learning algorithm. The key idea is bagging, which trains m models, each of which is fit on n′ samples drawn from the data of size n. Plus, when choosing criteria for decision tree node splits, one feature…
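The two sources of randomness described above can be sketched in a few lines of Python (the post works with R's randomForest; these helper names are mine, and sqrt(#features) per split is the common default for classification):

```python
import math
import random

def bootstrap_sample(data, n_prime, rng):
    """Bagging step: draw n' examples with replacement from the data."""
    return [rng.choice(data) for _ in range(n_prime)]

def random_feature_subset(n_features, rng):
    """Split step: consider only ~sqrt(#features) randomly chosen features
    at each node, which decorrelates the trees."""
    k = max(1, int(math.sqrt(n_features)))
    return rng.sample(range(n_features), k)

rng = random.Random(0)
data = list(range(100))
samples = [bootstrap_sample(data, len(data), rng) for _ in range(10)]  # m = 10 trees
feats = random_feature_subset(16, rng)  # 4 of 16 features at this split
```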
Tags:
-
Why do we use 0.05 for statistical significance?
http://www.jerrydallal.com/LHSP/p05.htm Maybe I can summarize it like this. First of all, Fisher started using 0.05 when deciding statistical significance. p = 0.05 corresponds roughly to two standard deviations, and to 1 in 20, so it looks good enough. Also, once it was accepted, using p = 0.06 or p = 0.07 was not welcomed. In addition, using p = 0.05 is easy for scientific…
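The "two standard deviations ≈ 1 in 20" claim is easy to check numerically. A short Python sketch computing the two-sided tail probability of a standard normal via the error function:

```python
import math

def two_sided_tail(z):
    """P(|Z| > z) for a standard normal Z, via the error function."""
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(z / math.sqrt(2.0))))

p_2sigma = two_sided_tail(2.0)   # about 0.0455 -- roughly 1 in 20
p_196 = two_sided_tail(1.96)     # about 0.0500 -- the exact 5% cutoff
```

So 2 sigma actually gives p ≈ 0.0455; the exact two-sided 5% cutoff is 1.96 sigma, which is why "about two standard deviations" and "1 in 20" line up.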
Tags:
-
unclass and as.character
http://www2.warwick.ac.uk/fac/sci/moac/people/students/peter_cock/r/iris_plots/ Look for unclass() in the middle of the page, used for drawing 3-dimensional data in a 2D plot with pch=unclass(…). Or one can use as.character().
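What `unclass()` does to an R factor is strip it down to its integer level codes, which can then serve directly as plot symbol (pch) values. A Python analogue of that mapping, for illustration only (`level_codes` is my own name):

```python
def level_codes(labels):
    """Map each label to a small integer code, like unclass() on an R
    factor; R factor codes start at 1, so we do the same here."""
    levels = sorted(set(labels))
    index = {lev: i + 1 for i, lev in enumerate(levels)}
    return [index[lab] for lab in labels]

species = ["setosa", "versicolor", "virginica", "setosa"]
codes = level_codes(species)  # one plot symbol per species
```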
Tags:
-
HMC Calculus Tutorial
http://www.math.hmc.edu/calculus/tutorials/ Contains calculus and linear algebra tutorials.
Tags:
-
Expectation Maximization
http://see.stanford.edu/see/materials/aimlcs229/handouts.aspx See ‘Mixtures of Gaussians and the EM algorithm’ and ‘The EM algorithm’. This is the easiest-to-understand explanation of this topic I’ve ever seen on the internet. For books, Pattern Classification by Duda has a good chapter on it.
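As a companion to those handouts, here is a minimal EM loop for a 1-D mixture of two Gaussians in plain Python, showing the alternation the notes describe: the E-step computes each point's responsibility under the current parameters, and the M-step re-estimates weights, means, and stddevs from those responsibilities. Initialization and names are my own simplifications:

```python
import math
import random

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def em_two_gaussians(data, iters=50):
    """EM for a 1-D two-component Gaussian mixture.
    Crude init: means at the data extremes, unit stddevs, equal weights."""
    mu = [min(data), max(data)]
    sigma = [1.0, 1.0]
    w = [0.5, 0.5]
    for _ in range(iters):
        # E-step: posterior probability that each point came from each component
        resp = []
        for x in data:
            p = [w[k] * normal_pdf(x, mu[k], sigma[k]) for k in range(2)]
            s = sum(p)
            resp.append([pk / s for pk in p])
        # M-step: responsibility-weighted re-estimates of the parameters
        for k in range(2):
            nk = sum(r[k] for r in resp)
            w[k] = nk / len(data)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, data)) / nk
            sigma[k] = max(math.sqrt(var), 1e-3)  # floor to avoid collapse
    return w, mu, sigma

rng = random.Random(1)
data = [rng.gauss(0, 1) for _ in range(200)] + [rng.gauss(5, 1) for _ in range(200)]
w, mu, sigma = em_two_gaussians(data)  # means recovered near 0 and 5
```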
Tags: