Passion is like genius; a miracle.

Tag: statistics

Run test for testing randomness

The one-sample runs test examines whether observations are random. Here, a run means a sequence of consecutive observations of the same class, out of two categories. For example, MMMFFF: we observed three males, then three females, so # of runs = 2. MFMF: we observed man, woman, man, then woman, so # of runs = 4. This idea can be used for,…
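The run counting described above can be sketched in a few lines. This is a minimal Python illustration, not the code from the original post; the function name `count_runs` is made up for this example.

```python
from itertools import groupby


def count_runs(observations):
    """Count runs, i.e. maximal stretches of identical consecutive values."""
    # groupby yields one group per stretch of equal neighbours,
    # so the number of groups is the number of runs.
    return sum(1 for _ in groupby(observations))


print(count_runs("MMMFFF"))  # 2
print(count_runs("MFMF"))    # 4
```

Very few runs (as in MMMFFF) or very many (as in MFMF) both hint that the order is not random, which is what the test statistic builds on.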

February 26, 2012
Kruskal Wallis Test

The Kruskal-Wallis test is a nonparametric version of one-way ANOVA (analysis of variance). It’s also the extension of Wilcoxon’s rank sum test to more than 2 populations. The basic idea is very similar to the rank sum test: we sort all observations and see whether the mean rank differs between classes. In the example…
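To make the ranking idea concrete, here is a small Python sketch of the usual H statistic, H = 12 / (N(N+1)) * sum(R_i^2 / n_i) - 3(N+1), where R_i is the rank sum of group i. It assumes average ranks for ties and applies no tie correction; the function names and the sample groups are hypothetical, not taken from the post.

```python
def average_ranks(values):
    """Rank values from 1..N, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the block of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """H statistic from the pooled ranks (no tie correction)."""
    pooled = [x for g in groups for x in g]
    ranks = average_ranks(pooled)
    n = len(pooled)
    total = 0.0
    start = 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    return 12 / (n * (n + 1)) * total - 3 * (n + 1)


# Hypothetical data: three clearly separated groups give a large H.
print(kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
```

Under the null hypothesis, H is approximately chi-square distributed with (number of groups - 1) degrees of freedom.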

February 25, 2012
Rank tests

This post discusses nonparametric testing methods: the sign test, the signed rank test and the rank sum test. Sign Test The sign test tests whether the median of the data is md. For example, given: If , then the number of positive signs of x – 5 should be about 4 (= the number of data / 2 =…
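The sign-counting idea can be sketched as follows. This is an illustrative Python version assuming an exact two-sided binomial p-value with values equal to md dropped; the function name `sign_test` and the data are hypothetical, since the excerpt does not show the post's own data.

```python
from math import comb


def sign_test(data, md):
    """Return (positives, negatives, two-sided exact p-value) for H0: median == md."""
    pos = sum(1 for x in data if x > md)
    neg = sum(1 for x in data if x < md)
    n = pos + neg          # observations equal to md are ignored
    k = min(pos, neg)
    # Under H0 each sign is positive with probability 1/2,
    # so the count follows Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return pos, neg, min(1.0, 2 * tail)


# Hypothetical sample of 8 values tested against md = 5.
print(sign_test([1, 2, 3, 4, 6, 7, 8, 9], 5))
```

With 8 observations, roughly 4 positive signs are expected when the median really is md; a count far from 4 gives a small p-value.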

February 25, 2012
k nearest neighbor

The k-nearest neighbor (knn) algorithm is a nonparametric classification method that labels a test point using its nearest neighbors in the data. The idea is simple, and we can simply use knn() in the ‘class’ package. But one thing to remember is that we should normalize data before measuring distance. For example, suppose that we measured variable x in cm and y in…
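The point about normalizing before measuring distance can be sketched like this. It is a plain Python illustration, not the knn() function from the ‘class’ package; it assumes z-score standardization (one common choice of normalization) and a simple majority vote, and all names and data are hypothetical.

```python
from collections import Counter
from math import dist
from statistics import mean, pstdev


def standardize(rows):
    """Z-score each column; return scaled rows plus the means and std devs used."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    scaled = [tuple((v - m) / s for v, m, s in zip(r, mus, sds)) for r in rows]
    return scaled, mus, sds


def knn_predict(train_x, train_y, query, k=3):
    """Classify query by majority vote among its k nearest standardized neighbours."""
    scaled, mus, sds = standardize(train_x)
    # The query must be scaled with the training means and std devs.
    q = tuple((v - m) / s for v, m, s in zip(query, mus, sds))
    nearest = sorted(zip(scaled, train_y), key=lambda p: dist(p[0], q))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Hypothetical data: two features on very different scales.
train_x = [(150, 1.0), (152, 1.1), (180, 5.0), (182, 5.2)]
train_y = ["a", "a", "b", "b"]
print(knn_predict(train_x, train_y, (151, 1.05), k=3))  # a
```

Without the scaling step, the feature with the larger numeric range would dominate the Euclidean distance regardless of how informative it is.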

February 24, 2012
Chi-square test

The chi-square test can be used for testing independence using the formula at wiki. Given a table: We want to know whether the values are independent of the class, i.e., A, B, C. It’s very trivial in R: Or one can use a matrix: Or one can just use a list of values or its table: Two things to remember: –…
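For readers who want to see what the statistic computes, here is a small Python sketch of the Pearson chi-square statistic for independence: the sum over cells of (observed - expected)^2 / expected, with expected = row total * column total / grand total. The function name and the tables are hypothetical, and no continuity correction or p-value is computed here.

```python
def chi_square_statistic(table):
    """Return (statistic, degrees of freedom) for a contingency table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(col_totals) - 1)
    return stat, df


# Hypothetical 2x2 table of counts.
print(chi_square_statistic([[10, 20], [20, 10]]))
```

A table whose rows are exact multiples of each other gives a statistic of 0, since every observed count equals its expected count.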

February 24, 2012
Scalable Machine Learning

http://alex.smola.org/teaching/berkeley2012/index.html Scalable machine learning class at Berkeley, with videos on YouTube and PDF slides. A nice resource for studying big data mining. I really appreciate researchers who make their teaching materials freely available.

February 22, 2012
Local Outlier Factor for finding outlier.

Local Outlier Factor (LOF) is a state-of-the-art method for finding outliers (according to the book Data Mining with R). Its main idea is to find objects whose local density is considerably lower than the local density of their neighbors. DMwR has an implementation, lofactor(), to compute LOF.

February 21, 2012
Naive Bayes in R

The e1071 package provides the naiveBayes function. It assumes independence of the predictors, and assumes a Gaussian distribution for metric predictors. Its example uses the iris sample: Our target variable is Species: To see its performance (remember that the 5th column is Species): For prediction, use predict. When type="raw" is given, the class probabilities are printed:

February 19, 2012
R data mining resource

rdatamining.com has some nice documents: http://www.rdatamining.com/docs. Especially, don’t miss the R reference card containing the list of important data mining functions for R.

February 17, 2012
SMOTE for handling class imbalance

The DMwR package, which is described at length in the book Data Mining with R: Learning with Case Studies, has several interesting functions. Among them, SMOTE is an easy-to-use function for handling class imbalance. To quote the package’s example, first generate a small sample: Then, we generate a new data set by 1) adding new…

February 17, 2012