-
Automatic machine learning
Google has an interesting automatic prediction API: https://developers.google.com/prediction/ It has an easy-to-follow hello world that predicts the language (English/Spanish/French) of a given sentence: https://developers.google.com/prediction/docs/hello_world In the hello world example, one thing that was confusing was ‘Switching to private mode’. For that, you just need to turn on OAuth 2.0 on the top right of…
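As a rough sketch of the kind of labeled data the hello world trains on (a CSV with the class label in the first column and the example text in the second), here is a small R snippet; the file name and the sentences are placeholders of my own, not the actual hello-world file.

```r
# Sketch only: labeled training rows (label first, text second).
# The sentences and file name below are placeholders, not the official data.
train <- data.frame(
  label = c("English", "Spanish", "French"),
  text  = c("to err is human, to forgive divine",
            "errar es humano, perdonar es divino",
            "l'erreur est humaine, le pardon est divin"),
  stringsAsFactors = FALSE
)

# Write as a plain CSV with no header row
write.table(train, "language_training.csv", sep = ",",
            row.names = FALSE, col.names = FALSE)
```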
-
Run test for testing randomness
The one-sample runs test examines whether observations are random. Here, a run is a maximal sequence of consecutive observations from the same one of two categories, and the test statistic is the number of runs. For example, MMMFFF: three males observed, then three females; number of runs = 2. MFMF: man, woman, man, woman; number of runs = 4. This idea can be used for…
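One way to try this in R is runs.test() from the tseries package, which takes a two-level factor (the post does not name a package, so tseries is my assumption); the M/F sequences below just reproduce the toy example above.

```r
library(tseries)  # provides runs.test() for a two-level factor

# Toy sequences from the post: MMMFFF has 2 runs, MFMF has 4 runs
x1 <- factor(c("M", "M", "M", "F", "F", "F"))
x2 <- factor(c("M", "F", "M", "F"))

runs.test(x1)  # very few runs suggests non-randomness (sample is tiny, though)
runs.test(x2)
```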
-
Kruskal-Wallis Test
The Kruskal-Wallis test is a nonparametric version of one-way ANOVA. It is also an extension of the Wilcoxon rank sum test to more than two populations. The basic idea is very similar to the rank sum test: we rank all observations together and check whether the mean rank differs across classes. In the example…
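Base R already provides kruskal.test(); a minimal sketch using the built-in iris data (my choice of example, not necessarily the one from the post):

```r
# Kruskal-Wallis: do Sepal.Length distributions differ across Species?
kruskal.test(Sepal.Length ~ Species, data = iris)

# Equivalent call with a list of samples, one vector per group
kruskal.test(split(iris$Sepal.Length, iris$Species))
```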
-
Rank tests
This post discusses nonparametric testing methods: the sign test, the signed rank test, and the rank sum test. Sign Test: the sign test checks whether the median of the data equals md. For example, given some data x: if md = 5, then the number of positive signs of x – 5 should be about 4 (= the number of data / 2 =…
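In R these three tests map onto binom.test() (sign test) and wilcox.test() (signed rank and rank sum); the data below are made up for illustration, with md = 5 as in the excerpt.

```r
set.seed(1)
x <- rnorm(8,  mean = 5.5)   # made-up sample; H0: median of x is 5
y <- rnorm(10, mean = 6)     # second made-up sample for the two-sample test

# Sign test: count positive signs of x - 5 and test against Binomial(n, 0.5)
binom.test(sum(x > 5), length(x), p = 0.5)

# Wilcoxon signed rank test of the same hypothesis (uses ranks, not just signs)
wilcox.test(x, mu = 5)

# Wilcoxon rank sum (Mann-Whitney) test comparing the two samples
wilcox.test(x, y)
```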
-
k nearest neighbor
The k-nearest neighbor (knn) algorithm is a nonparametric classifier that labels a test point using its nearest neighbors in the training data. The idea is simple, and we can simply use knn() in the ‘class’ package. But one thing to remember is that data should be normalized before measuring distances. For example, suppose that we measured variable x in cm and y in…
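A minimal sketch with class::knn() on iris, scaling the predictors first so no variable dominates the distance purely because of its units; the train/test split and k = 5 are arbitrary choices of mine.

```r
library(class)

set.seed(42)
idx   <- sample(nrow(iris), 100)      # random train/test split
X     <- scale(iris[, 1:4])           # normalize all predictors together
train <- X[idx, ]
test  <- X[-idx, ]
cl    <- iris$Species[idx]

pred <- knn(train, test, cl, k = 5)
table(pred, iris$Species[-idx])       # confusion matrix on the held-out rows
```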
-
Chi-square test
The chi-square test can be used for testing independence, using the formula at wiki. Given a table: We want to know whether the values are independent of the class, i.e., A, B, C. It’s very trivial in R: Or one can use a matrix: Or one can just pass a list of values or its table: Two things to remember: –…
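A minimal sketch of the calls mentioned above, using a made-up 3x2 table of counts for classes A, B, C (not the table from the post):

```r
# Made-up counts: rows are classes A, B, C; columns are two outcome values
counts <- matrix(c(20, 30,
                   25, 25,
                   40, 10),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(class = c("A", "B", "C"),
                                 value = c("yes", "no")))

chisq.test(counts)            # a matrix of counts works directly
chisq.test(as.table(counts))  # ... and so does a table object
```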
-
Pulse architecture
http://eng.pulse.me/scaling-to-10m-on-aws/ The system architecture of Pulse (a well-known RSS reader).
-
Scalable Machine Learning
http://alex.smola.org/teaching/berkeley2012/index.html A scalable machine learning class at Berkeley, with videos on YouTube and PDF slides. A nice resource for studying big data mining. I really appreciate researchers who make their teaching materials freely available.
-
Local Outlier Factor for finding outliers
Local Outlier Factor (LOF) is a state-of-the-art method for finding outliers (according to the book Data Mining with R). Its main idea is to find objects whose local density is considerably lower than the local density of their neighbors. DMwR has an implementation, lofactor(), to compute LOF.
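A minimal sketch, assuming the DMwR package is installed; lofactor() returns one LOF score per row, and larger scores suggest outliers. The use of iris, the scaling step, and k = 5 are my own choices here.

```r
library(DMwR)   # provides lofactor()

# Numeric columns of iris, scaled so no variable dominates the distances
data <- scale(iris[, 1:4])

scores <- lofactor(data, k = 5)            # one LOF score per observation
head(order(scores, decreasing = TRUE), 5)  # indices of the 5 most outlying rows
```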
-
Naive Bayes in R
The e1071 package provides the naiveBayes() function. It assumes independence of the predictors, and a Gaussian distribution for metric predictors. Its example uses the iris data: our target variable is Species (the 5th column). To see its performance, compare predictions against the true Species. For prediction, use predict(); when type="raw" is given, class probabilities are printed:
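The excerpt does not show the post's own code, so here is a small reconstruction of those steps on iris:

```r
library(e1071)

# Fit Naive Bayes with Species (the 5th column) as the target
model <- naiveBayes(Species ~ ., data = iris)

# Performance on the training data: confusion matrix
pred <- predict(model, iris[, -5])
table(pred, iris$Species)

# Posterior class probabilities instead of hard labels
head(predict(model, iris[, -5], type = "raw"))
```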