• k nearest neighbor

    k-nearest neighbor (kNN) is a non-parametric classification algorithm that labels a test point according to the classes of its nearest neighbors in the training data. The idea is simple, and in R we can use knn() from the ‘class’ package. One thing to remember is that the data should be normalized before distances are measured. For example, suppose that we measured variable x in cm and y in…
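
    The normalization point can be sketched as follows. This is a minimal example rather than the post's own code: the iris data, the 100/50 split, the seed, and k = 3 are all assumptions chosen for illustration. The test set is scaled with the training set's mean and standard deviation so that both sets share one scale.

```r
library(class)  # provides knn(); a recommended package in standard R installs

set.seed(1)                               # arbitrary seed, for reproducibility
idx <- sample(nrow(iris), 100)            # hypothetical 100/50 train/test split
train_x <- iris[idx, 1:4]
test_x  <- iris[-idx, 1:4]

# Standardize the training predictors to mean 0 and sd 1.
train_s <- scale(train_x)

# Apply the *training* centers and scales to the test predictors.
test_s <- scale(test_x,
                center = attr(train_s, "scaled:center"),
                scale  = attr(train_s, "scaled:scale"))

# Classify each test row by majority vote among its 3 nearest training rows.
pred <- knn(train = train_s, test = test_s, cl = iris$Species[idx], k = 3)

# Confusion matrix: predicted vs. actual species on the held-out rows.
table(predicted = pred, actual = iris$Species[-idx])
```

    Without the scaling step, a variable recorded in large units would dominate the Euclidean distance that knn() uses.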


  • Chi-square test

    The chi-square test can be used to test independence, using the formula given on Wikipedia. Given a table: We want to know whether the values are independent of the class, i.e., A, B, C. It’s trivial in R: Or one can use a matrix: Or one can just pass a list of values or its table: Two things to remember: –…
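
    As a rough illustration of the matrix form, the sketch below runs chisq.test() on a small contingency table. The counts and the row labels are hypothetical, made up for this example; only the class names A, B, C come from the text above.

```r
# Hypothetical counts: rows are two outcomes, columns are classes A, B, C.
counts <- matrix(c(20, 30, 25,
                   30, 20, 25),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(outcome = c("yes", "no"),
                                 class   = c("A", "B", "C")))

# Pearson's chi-square test of independence between outcome and class.
res <- chisq.test(counts)

res$statistic  # the chi-square statistic
res$parameter  # degrees of freedom: (rows - 1) * (cols - 1)
res$p.value    # a small p-value would suggest outcome depends on class
res$expected   # expected counts under independence
```

    The same function also accepts two factors of equal length, or the table() built from them.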


  • Pulse architecture

    http://eng.pulse.me/scaling-to-10m-on-aws/ The system architecture of Pulse, a well-known RSS reader.


  • Scalable Machine Learning

    http://alex.smola.org/teaching/berkeley2012/index.html A scalable machine learning class at Berkeley, with videos on YouTube and PDF slides. A nice resource for studying big data mining. I really appreciate researchers who make their teaching materials freely available.


  • Local Outlier Factor for finding outliers

    Local Outlier Factor (LOF) is a state-of-the-art method for finding outliers (according to the book Data Mining with R). Its main idea is to find objects whose local density is considerably lower than the local density of their neighbors. DMwR provides an implementation, lofactor(), to compute LOF.
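
    A minimal sketch of calling lofactor(), assuming the DMwR package is installed. The choice of the iris measurements as data and of k = 5 neighbors is arbitrary and only for illustration.

```r
library(DMwR)  # assumed to be installed; provides lofactor()

# LOF works on numeric data, so drop the Species factor column.
x <- iris[, 1:4]

# One LOF score per row; k is the number of neighbors that define
# the local density (5 is an arbitrary choice here).
scores <- lofactor(x, k = 5)

# Scores well above 1 mark points in sparser regions than their neighbors.
top5 <- head(order(scores, decreasing = TRUE), 5)
iris[top5, ]
```

    Scores near 1 indicate a point about as dense as its neighborhood, so a cutoff for "outlier" has to be chosen per data set.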


  • Naive Bayes in R

    Package e1071 provides the naiveBayes function. It assumes that the predictors are independent, and that metric predictors follow a Gaussian distribution. Its example uses the iris data set: Our target variable is Species: To see its performance (remember that the 5th column is Species): For prediction, use predict. When type="raw" is given, probabilities are printed:
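
    The steps described above can be sketched roughly as follows. This is a reconstruction under assumptions, not the post's original listing: it fits and evaluates on the full iris data, and the choice of the first three rows for the raw probabilities is arbitrary.

```r
library(e1071)  # provides naiveBayes()

# Fit the model with Species as the target and all other columns as predictors.
model <- naiveBayes(Species ~ ., data = iris)

# Predict classes for the predictors only (the 5th column is Species).
pred <- predict(model, iris[, -5])

# Confusion matrix on the training data (an optimistic performance estimate).
table(predicted = pred, actual = iris$Species)

# type = "raw" returns posterior class probabilities instead of labels.
raw <- predict(model, iris[1:3, -5], type = "raw")
raw
```

    For an honest performance estimate, the model would be evaluated on rows held out from fitting.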


  • Cookie syncing

    http://www.adopsinsider.com/ad-exchanges/ssp-to-dsp-cookie-synching-explained/ A way to share cookies between two sites: simply pass my cookie as a request parameter to the other site. The callee can then pair the passed cookie with its own cookie.


  • R data mining resource

    rdatamining.com has some nice documents: http://www.rdatamining.com/docs. In particular, don’t miss the R reference cards, which list important data mining functions for R.


  • SMOTE for handling class imbalance

    Package DMwR, which is described at length in the book Data Mining with R: Learning with Case Studies, has several interesting functions. Among them, SMOTE is an easy-to-use function for handling class imbalance. To quote the package’s example, first generate a small sample data set: Then, we generate a new data set by 1) adding new…
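
    A minimal sketch of a SMOTE() call, assuming the DMwR package is installed. The imbalanced data set is built here from iris purely for illustration, and the perc.over and perc.under values are example settings, not recommendations.

```r
library(DMwR)  # assumed to be installed; provides SMOTE()

# Build a small imbalanced data set: setosa becomes the rare class.
d <- iris[, c(1, 2, 5)]
d$Species <- factor(ifelse(d$Species == "setosa", "rare", "common"))
table(d$Species)  # class counts before resampling

# perc.over controls how many synthetic minority cases are generated;
# perc.under controls how many majority cases are sampled relative to those.
set.seed(1)  # arbitrary seed; SMOTE() samples at random
balanced <- SMOTE(Species ~ ., d, perc.over = 600, perc.under = 100)
table(balanced$Species)  # class counts after resampling
```

    Resampling like this should be applied to the training data only, so that the evaluation data keeps the original class distribution.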


  • Cumulative gains and lift chart

    http://www2.cs.uregina.ca/~dbd/cs831/notes/lift_chart/lift_chart.html http://www.nd.edu/~busiforc/handouts/DataMining/Lift%20Charts.html Cumulative gains and lift charts compare a predictive model with a baseline. They emphasize recall rather than precision.
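
    To make the comparison with the baseline concrete, here is a small base-R sketch that computes both curves. The scores and outcomes are simulated, so every number it produces is hypothetical. The baseline is random selection, under which contacting a fraction p of cases captures a fraction p of the positives.

```r
set.seed(42)                            # arbitrary seed, for reproducibility
n <- 1000
score  <- runif(n)                      # simulated model scores in [0, 1]
actual <- rbinom(n, 1, prob = score)    # positives are likelier at high scores

# Rank cases from highest to lowest score, as if contacting the best first.
ord <- order(score, decreasing = TRUE)
frac_contacted <- seq_len(n) / n

# Cumulative gains: share of all positives captured so far (i.e. recall).
gains <- cumsum(actual[ord]) / sum(actual)

# Lift: gains relative to the random baseline, whose gains equal frac_contacted.
lift <- gains / frac_contacted

# Summarize both curves at each decile of contacted cases.
deciles <- seq(n / 10, n, by = n / 10)
data.frame(contacted = frac_contacted[deciles],
           gains     = round(gains[deciles], 3),
           lift      = round(lift[deciles], 3))
```

    Because gains is recall at each cutoff, both charts say how many positives are captured, not how many of the contacted cases were positive.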
