Tag: statistics

  • Run test for testing randomness

    The one-sample runs test examines whether a sequence of observations is random. Here, a run means a block of consecutive observations of the same class, where each observation falls into one of two categories. For example, MMMFFF: three males were observed, then three females, so # of runs = 2. MFMF: a man, a woman, a man, then a woman were observed, so # of runs = 4. This idea can be used for,…
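
    As a small illustration of the run count only (not the full test), here is a Python sketch. The function name count_runs is made up for this example.

```python
from itertools import groupby


def count_runs(sequence):
    """Count blocks of identical consecutive symbols in a sequence."""
    # groupby yields one group per block of equal neighbours.
    return sum(1 for _ in groupby(sequence))


print(count_runs("MMMFFF"))  # 2
print(count_runs("MFMF"))    # 4
```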

  • Kruskal Wallis Test

    The Kruskal-Wallis test is a nonparametric version of one-way analysis of variance (ANOVA). It's also the extension of the Wilcoxon rank sum test to more than 2 populations. The basic idea is very similar to the rank sum test. We just rank all observations together, and see if the mean rank differs depending on class. In the example…
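
    The ranking idea can be sketched in Python as below. This is a rough illustration, not the R workflow from the post: the helper names are invented, tied values get their average rank, and the usual tie correction of the H statistic is left out.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the block of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = shared
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """H = 12 / (N (N + 1)) * sum(R_i^2 / n_i) - 3 (N + 1), no tie correction."""
    pooled = [v for group in groups for v in group]
    ranks = average_ranks(pooled)
    n_total = len(pooled)
    weighted = 0.0
    start = 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        weighted += rank_sum ** 2 / len(group)
        start += len(group)
    return 12.0 / (n_total * (n_total + 1)) * weighted - 3 * (n_total + 1)


# Made-up data: three clearly separated groups give a large H.
print(kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
```

    Under the null hypothesis H is compared against a chi-square distribution with (number of groups - 1) degrees of freedom.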

  • Rank tests

    This post discusses nonparametric testing methods: the sign test, the signed rank test and the rank sum test. Sign test: the sign test checks whether the median of the data is md. For example, given a sample x: if md = 5, then the number of positive signs of x – 5 should be about 4 (= the number of data / 2 =…
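
    A minimal Python sketch of the sign test follows. The function name and the sample values are assumptions for illustration (the post's own data is not shown in this excerpt). Values equal to md are dropped, and the p-value is the exact two-sided binomial one.

```python
from math import comb


def sign_test(data, md):
    """Return (positive signs, usable n, two-sided p-value) for median md."""
    positives = sum(1 for x in data if x > md)
    negatives = sum(1 for x in data if x < md)
    n = positives + negatives  # observations equal to md are ignored
    smaller = min(positives, negatives)
    # P(X <= smaller) for X ~ Binomial(n, 1/2), doubled for two sides.
    tail = sum(comb(n, i) for i in range(smaller + 1)) / 2 ** n
    return positives, n, min(1.0, 2 * tail)


# Hypothetical sample of 8 values tested against md = 5.
print(sign_test([1, 2, 3, 4, 6, 7, 8, 9], 5))
```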

  • k nearest neighbor

    The k-nearest neighbor (knn) algorithm is a nonparametric classification method that labels a test point using its nearest neighbors in the training data. The idea is simple, and we can simply use knn() in the ‘class’ package. But one thing to remember is that we should normalize the data before measuring distance. For example, suppose that we measured variable x in cm and y in…
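
    To see why normalization matters, here is a small Python sketch (not the knn() function from the ‘class’ package). The helper names, the min-max scaling choice and the two-point data set are all assumptions made for illustration.

```python
from collections import Counter
from math import dist


def fit_min_max(rows):
    """Learn per-column minimum and range from the training rows."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    spans = [(max(col) - min(col)) or 1.0 for col in columns]  # avoid /0
    return lows, spans


def apply_min_max(row, lows, spans):
    """Rescale one row with the parameters learned from training data."""
    return [(v - lo) / span for v, lo, span in zip(row, lows, spans)]


def knn_predict(train, labels, query, k=1):
    """Majority vote among the k training rows closest to query."""
    nearest = sorted(range(len(train)), key=lambda i: dist(train[i], query))[:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]


# Made-up data: the first column has a much larger scale than the second.
train = [[100.0, 0.1], [130.0, 0.9]]
labels = ["a", "b"]
query = [120.0, 0.15]

lows, spans = fit_min_max(train)
scaled_train = [apply_min_max(row, lows, spans) for row in train]
scaled_query = apply_min_max(query, lows, spans)

print(knn_predict(train, labels, query))                # raw distances
print(knn_predict(scaled_train, labels, scaled_query))  # after scaling
```

    On the raw data the large-scale column dominates the distance, so the two calls can disagree about the nearest neighbor.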

  • Chi-square test

    The chi-square test can be used to test independence, using the formula on Wikipedia. Given a table: We want to know whether the values are independent of class, i.e., A, B, C. It’s trivial in R: Or one can use a matrix: Or one can just pass a list of values or its table: Two things to remember: –…
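
    The statistic itself is easy to compute by hand. Below is a Python sketch with a hypothetical function name and made-up tables. It applies no continuity correction, so for 2x2 tables it may differ from what R's chisq.test reports by default.

```python
def chi_square_statistic(table):
    """Return (statistic, degrees of freedom) for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Made-up tables: proportional rows (independent) vs. a strong association.
print(chi_square_statistic([[10, 20], [20, 40]]))
print(chi_square_statistic([[10, 0], [0, 10]]))
```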

  • Scalable Machine Learning

    http://alex.smola.org/teaching/berkeley2012/index.html A scalable machine learning class at Berkeley, with videos on YouTube and PDF slides. A nice resource for studying big data mining. I really appreciate researchers who make their teaching materials freely available.

  • Local Outlier Factor for finding outlier.

    Local Outlier Factor (LOF) is a state-of-the-art method for finding outliers (according to the book Data Mining with R). Its main idea is to find objects whose local density is considerably lower than the local density of their neighbors. DMwR has an implementation, lofactor(), to compute LOF.
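
    The density comparison can be sketched in plain Python as follows. This is a simplified illustration, not DMwR's lofactor(): the function name is made up, each point gets exactly k neighbors (ties at the k-th distance are cut off), and duplicate points are assumed absent because they would make a density infinite.

```python
from math import dist


def lof_scores(points, k):
    """Return one LOF score per point; values well above 1 suggest outliers."""
    n = len(points)
    neighbours, k_distance = [], []
    for i in range(n):
        others = sorted((dist(points[i], points[j]), j) for j in range(n) if j != i)
        neighbours.append([j for _, j in others[:k]])
        k_distance.append(others[k - 1][0])  # distance to the k-th neighbour

    def reach_dist(i, j):
        # Reachability distance of i from j: never below j's k-distance.
        return max(k_distance[j], dist(points[i], points[j]))

    # Local reachability density: inverse of the mean reachability distance.
    lrd = [k / sum(reach_dist(i, j) for j in neighbours[i]) for i in range(n)]
    # LOF: average neighbour density relative to the point's own density.
    return [sum(lrd[j] for j in neighbours[i]) / (k * lrd[i]) for i in range(n)]


# Made-up data: four corners of a unit square plus one far-away point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
scores = lof_scores(points, k=2)
print(scores)
```

    Points inside the cluster score close to 1, while the isolated point scores much higher.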

  • Naive Bayes in R

    Package e1071 provides the naiveBayes function. It assumes the predictors are independent given the class, and assumes a Gaussian distribution for metric predictors. Its example uses the iris data set: Our target variable is Species: To see its performance (remember that the 5th column is Species): For prediction, use predict. When type="raw" is given, the class probabilities are printed:
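
    To make the two assumptions concrete, here is a small Python sketch of a Gaussian naive Bayes classifier. It is not e1071's implementation: the function names and the toy data are made up, only numeric predictors are handled, and every class needs at least two rows with non-zero spread in each column.

```python
from collections import defaultdict
from math import exp, log, pi
from statistics import mean, stdev


def fit_gaussian_nb(rows, labels):
    """Estimate a prior and per-column (mean, sd) for every class."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    model = {}
    for label, members in by_class.items():
        prior = len(members) / len(rows)
        params = [(mean(col), stdev(col)) for col in zip(*members)]
        model[label] = (prior, params)
    return model


def predict_proba(model, row):
    """Return class probabilities, similar in spirit to type="raw"."""
    log_scores = {}
    for label, (prior, params) in model.items():
        total = log(prior)
        # Independence assumption: per-column log densities simply add up.
        for x, (mu, sigma) in zip(row, params):
            total += -0.5 * log(2 * pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)
        log_scores[label] = total
    top = max(log_scores.values())  # subtract the max for numerical stability
    weights = {label: exp(score - top) for label, score in log_scores.items()}
    norm = sum(weights.values())
    return {label: w / norm for label, w in weights.items()}


# Made-up one-column data with two well separated classes.
rows = [[1.0], [1.2], [0.8], [5.0], [5.2], [4.8]]
labels = ["a", "a", "a", "b", "b", "b"]
model = fit_gaussian_nb(rows, labels)
print(predict_proba(model, [1.1]))
```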

  • R data mining resource

    rdatamining.com has some nice documents: http://www.rdatamining.com/docs. In particular, don’t miss the R reference cards, which list important data mining functions for R.

  • SMOTE for handling class imbalance

    Package DMwR, which is described at length in the book Data Mining with R: Learning with Case Studies, has several interesting functions. Among them, SMOTE is an easy-to-use function for handling class imbalance. To quote the example from the package, first, generate a small sample: Then, we generate a new data set by 1) adding new…
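
    The oversampling idea behind SMOTE can be sketched in Python as below. This is only an illustration of generating synthetic minority points by interpolation, not DMwR's SMOTE function: the name smote_oversample, its parameters and the sample points are assumptions, and at least two minority points are required.

```python
import random
from math import dist


def smote_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points between minority points and their neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen base point.
        candidates = sorted(
            (p for p in minority if p is not base),
            key=lambda p: dist(base, p),
        )[:k]
        other = rng.choice(candidates)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Made-up minority class with three points.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_points = smote_oversample(minority, n_new=5)
print(new_points)
```

    Every synthetic point lies on a line segment between two existing minority points, so the new data stays inside the region the minority class already occupies.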