Passion is like genius; a miracle. – Page 27 – Blog on Software, Statistics, and Quant

Automatic machine learning

Google has an interesting automatic prediction API: https://developers.google.com/prediction/ It has an easy-to-follow hello world example that predicts the language (ENGLISH/SPANISH/FRENCH) of a given sentence: https://developers.google.com/prediction/docs/hello_world In the hello world example, one confusing step was ‘Switching to private mode’. For that, you just need to turn on OAuth 2.0 on the top right of…

February 28, 2012

Tags:

statistics
Runs test for testing randomness

The one-sample runs test examines whether observations are random. Here, a run means a block of consecutive observations of the same class, where each observation falls into one of two categories. For example, MMMFFF: observed three males, then three females. # of runs = 2. MFMF: observed man, woman, man, then woman. # of runs = 4. This idea can be used for,…
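
The run counts in the two examples above can be reproduced with a short Python sketch. The z-score part is an assumption on my side: it uses the standard Wald-Wolfowitz normal approximation, which may not be the exact procedure the full post follows, and it is only a rough guide for samples this small. The function names are made up for illustration.

```python
from itertools import groupby
from math import sqrt


def count_runs(observations):
    """Count runs: maximal blocks of consecutive identical observations."""
    return sum(1 for _ in groupby(observations))


def runs_z_score(observations):
    """Normal-approximation z-score for the number of runs (two categories)."""
    labels = sorted(set(observations))
    if len(labels) != 2:
        raise ValueError("the runs test needs exactly two categories")
    n1 = sum(1 for o in observations if o == labels[0])
    n2 = len(observations) - n1
    n = n1 + n2
    # Mean and variance of the number of runs under randomness.
    expected = 2 * n1 * n2 / n + 1
    variance = 2 * n1 * n2 * (2 * n1 * n2 - n) / (n ** 2 * (n - 1))
    return (count_runs(observations) - expected) / sqrt(variance)


print(count_runs("MMMFFF"))  # 2
print(count_runs("MFMF"))    # 4
```

Too few runs (clustering) gives a strongly negative z-score, too many runs (alternation) a strongly positive one.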

February 26, 2012

Tags:

statistics
Kruskal-Wallis Test

The Kruskal-Wallis test is a non-parametric version of one-way analysis of variance (ANOVA). It is also the extension of the Wilcoxon rank sum test to more than 2 populations. The basic idea is very similar to the rank sum test: we rank all observations together, and see whether the mean rank differs between classes. In the example…
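
As a rough illustration of the ranking idea, here is a minimal Python sketch of the H statistic, H = 12 / (N (N + 1)) * sum(R_i^2 / n_i) - 3 (N + 1), where R_i is the rank sum of group i. The three groups are made-up data, not the example from the post, and the sketch leaves out the tie correction that statistical packages usually apply.

```python
def average_ranks(values):
    """1-based ranks of the pooled values; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic without tie correction."""
    pooled = [x for group in groups for x in group]
    ranks = average_ranks(pooled)
    n_total = len(pooled)
    total, start = 0.0, 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        total += rank_sum ** 2 / len(group)
        start += len(group)
    return 12 / (n_total * (n_total + 1)) * total - 3 * (n_total + 1)


# Hypothetical data: three fully separated groups.
print(kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
```

Under the null hypothesis, H is approximately chi-square distributed with (number of groups - 1) degrees of freedom, so a large H suggests that at least one group differs.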

February 25, 2012

Tags:

statistics
Rank tests

This post discusses non-parametric testing methods: the sign test, the signed rank test, and the rank sum test. Sign Test The sign test tests whether the median of the data is md. For example, given: If , then the number of positive signs of x – 5 should be about 4 (= the number of data / 2 =…
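
The sign-counting idea can be sketched in a few lines of Python. The eight data values below are hypothetical (the post's own data is not reproduced here), and dropping values equal to md is a common convention that I am assuming rather than taking from the post.

```python
from math import comb


def sign_test(data, md):
    """Two-sided exact sign test for H0: median == md.

    Returns (number of positive signs, number of non-tied values, p-value).
    """
    diffs = [x - md for x in data if x != md]  # drop ties with md
    n = len(diffs)
    positives = sum(1 for d in diffs if d > 0)
    # Under H0 the number of positive signs is Binomial(n, 0.5).
    lower = sum(comb(n, i) for i in range(0, positives + 1)) / 2 ** n
    upper = sum(comb(n, i) for i in range(positives, n + 1)) / 2 ** n
    p_value = min(1.0, 2 * min(lower, upper))
    return positives, n, p_value


# Hypothetical sample of 8 values, tested against md = 5.
positives, n, p_value = sign_test([2, 4, 6, 7, 8, 9, 10, 12], 5)
print(positives, n, p_value)  # 6 8 0.2890625
```

With 6 positive signs out of 8, the count is not far enough from 4 to reject a median of 5 at the usual 5% level.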

February 25, 2012

Tags:

statistics
k nearest neighbor

The k-nearest neighbor (knn) algorithm is a non-parametric classification method that labels a test point using its nearest neighbors in the training data. The idea is simple, and we can simply use knn() in the ‘class’ package. But one thing to remember is that we should normalize the data before measuring distance. For example, suppose that we measured variable x in cm and y in…
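
To show why normalization matters, here is a small Python sketch (not the R knn() call from the post) with made-up points in which x has a much larger scale than y. The helper names and the choice of z-score standardization are my own assumptions; other scalings such as min-max would work in a similar way.

```python
from collections import Counter
from math import dist
from statistics import mean, pstdev


def knn_predict(train_x, train_y, query, k=1):
    """Majority vote among the k training points closest to the query."""
    nearest = sorted(range(len(train_x)), key=lambda i: dist(train_x[i], query))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]


def standardize(train_x, query):
    """Z-score each column using the training mean and standard deviation."""
    columns = list(zip(*train_x))
    mus = [mean(c) for c in columns]
    sds = [pstdev(c) for c in columns]
    scale = lambda row: tuple((v - m) / s for v, m, s in zip(row, mus, sds))
    return [scale(row) for row in train_x], scale(query)


# Hypothetical data: x varies over hundreds, y over a few units.
train_x = [(100.0, 1.0), (200.0, 1.1), (110.0, 5.0), (210.0, 5.1)]
train_y = ["A", "A", "B", "B"]
query = (190.0, 4.9)

raw = knn_predict(train_x, train_y, query, k=1)
scaled_x, scaled_query = standardize(train_x, query)
scaled = knn_predict(scaled_x, train_y, scaled_query, k=1)
print(raw, scaled)  # A B
```

On the raw data the distance is dominated by x, so the query is matched to class A even though its y value clearly belongs with class B; after standardization both variables contribute comparably.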

February 24, 2012

Tags:

statistics
Chi-square test

The chi-square test can be used for testing independence, using the formula on Wikipedia. Given a table: We want to know whether the values are independent of class, i.e., A, B, C. It’s trivial in R: Or one can use a matrix: Or one can just pass a list of values or its table: Two things to remember: –…
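
For reference, the statistic behind the independence test is the sum over cells of (observed - expected)^2 / expected, where expected = row total * column total / grand total. The Python sketch below computes it for a made-up 2 x 3 table with classes A, B, C as columns; the counts are hypothetical and are not the table from the post.

```python
def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Hypothetical counts: rows are two value levels, columns are classes A, B, C.
statistic, dof = chi_square_independence([[10, 20, 30], [20, 20, 20]])
print(round(statistic, 4), dof)  # 5.3333 2
```

The statistic is then compared against a chi-square distribution with the given degrees of freedom to obtain a p-value.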

February 24, 2012

Tags:

statistics
Pulse architecture

http://eng.pulse.me/scaling-to-10m-on-aws/ System architecture of Pulse (a well-known RSS reader).

February 22, 2012

Tags:

software
Scalable Machine Learning

http://alex.smola.org/teaching/berkeley2012/index.html Scalable machine learning class at Berkeley, with videos on YouTube and PDF slides. A nice resource for studying big data mining. I really appreciate researchers who make their teaching materials freely available.

February 22, 2012

Tags:

statistics
Local Outlier Factor for finding outliers

Local Outlier Factor (LOF) is a state-of-the-art method for finding outliers (according to the book Data Mining with R). Its main idea is to find objects whose local density is considerably lower than the local density of their neighbors. DMwR has an implementation, lofactor(), to compute LOF.
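
The density comparison can be sketched in plain Python. This is a simplified illustration, not the DMwR implementation: it assumes each point uses exactly its k nearest neighbors (ties at the k-th distance are not expanded) and that there are no duplicate points. The five points are made up.

```python
from math import dist


def local_outlier_factors(points, k=2):
    """Simplified LOF: values near 1 are inliers, much larger values are outliers."""
    n = len(points)
    # k nearest neighbors of each point (excluding the point itself).
    neighbors = [
        sorted((j for j in range(n) if j != i), key=lambda j: dist(points[i], points[j]))[:k]
        for i in range(n)
    ]
    # k-distance: distance from a point to its k-th nearest neighbor.
    k_distance = [dist(points[i], points[neighbors[i][-1]]) for i in range(n)]

    def local_reachability_density(i):
        # Reachability distance of i from o is max(k-distance(o), d(i, o)).
        reach = [max(k_distance[o], dist(points[i], points[o])) for o in neighbors[i]]
        return len(reach) / sum(reach)

    lrd = [local_reachability_density(i) for i in range(n)]
    # LOF: mean density of the neighbors relative to the point's own density.
    return [sum(lrd[o] for o in neighbors[i]) / (k * lrd[i]) for i in range(n)]


# Hypothetical data: a tight square of four points plus one distant point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
lof = local_outlier_factors(points, k=2)
print([round(v, 2) for v in lof])
```

The four clustered points get a score of about 1, while the distant point gets a score far above 1 because its neighbors are much denser than it is.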

February 21, 2012

Tags:

statistics
Naive Bayes in R

The e1071 package provides the naiveBayes function. It assumes independence of the predictors, and assumes a Gaussian distribution for metric predictors. Its example uses the iris data set: Our target variable is Species: To see its performance (remember that the 5th column is Species): For prediction, use predict. When type="raw" is given, probabilities are printed:
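
To make the two assumptions (independent predictors, Gaussian metric predictors) concrete, here is a from-scratch Python sketch of a Gaussian naive Bayes classifier. It is not the e1071 code and does not use the iris data; the function names and the tiny data set are hypothetical. The probabilities it returns play the same role as the type="raw" output mentioned above.

```python
import math
from collections import defaultdict
from statistics import mean, stdev


def fit_gaussian_nb(rows, labels):
    """Per class: prior probability plus (mean, sd) of every predictor."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    model = {}
    for label, members in by_class.items():
        prior = len(members) / len(rows)
        stats = [(mean(col), stdev(col)) for col in zip(*members)]
        model[label] = (prior, stats)
    return model


def predict_proba(model, row):
    """Posterior class probabilities, treating predictors as independent Gaussians."""
    log_post = {}
    for label, (prior, stats) in model.items():
        lp = math.log(prior)
        for value, (mu, sd) in zip(row, stats):
            lp += -math.log(sd * math.sqrt(2 * math.pi)) - (value - mu) ** 2 / (2 * sd ** 2)
        log_post[label] = lp
    # Normalize in log space for numerical stability.
    top = max(log_post.values())
    unnorm = {label: math.exp(lp - top) for label, lp in log_post.items()}
    total = sum(unnorm.values())
    return {label: v / total for label, v in unnorm.items()}


# Hypothetical two-class data with two metric predictors.
rows = [[1.0, 2.0], [1.2, 1.9], [0.9, 2.2], [3.0, 4.1], [3.2, 3.9], [2.9, 4.0]]
labels = ["a", "a", "a", "b", "b", "b"]
model = fit_gaussian_nb(rows, labels)
probs = predict_proba(model, [1.1, 2.0])
print(max(probs, key=probs.get))  # a
```

Each class needs at least two training rows with some spread in every predictor, otherwise the standard deviation is undefined or zero.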

February 19, 2012

Tags:

statistics