Scalable Machine Learning

http://alex.smola.org/teaching/berkeley2012/index.html

A scalable machine learning class at Berkeley, with videos on YouTube and PDF slides. A nice resource for studying large-scale data mining.

I really appreciate researchers who make their teaching materials freely available.

Local Outlier Factor for finding outliers

Local Outlier Factor (LOF) is a state-of-the-art method for finding outliers (according to the book Data Mining with R). Its main idea is to find objects whose local density is considerably lower than the local density of their neighbors. DMwR provides an implementation, lofactor(), to compute LOF.
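
As a rough illustration of the density idea, here is a minimal LOF sketch in base R. It is written for this note and is not the DMwR code: lof_sketch and its argument k are hypothetical names, ties in the k-th neighbor distance are ignored, and duplicate points are not handled.

```r
# Simplified LOF: compares each point's local density with that of its neighbors.
lof_sketch <- function(x, k = 5) {
  x <- as.matrix(x)
  n <- nrow(x)
  d <- as.matrix(dist(x))   # pairwise Euclidean distances
  diag(d) <- Inf            # a point is not its own neighbor

  # Row i holds the indices of the k nearest neighbors of point i.
  nn <- t(apply(d, 1, function(row) order(row)[1:k]))

  # k-distance: distance from each point to its k-th nearest neighbor.
  kdist <- sapply(seq_len(n), function(i) d[i, nn[i, k]])

  # Local reachability density: inverse of the mean reachability distance,
  # where reach(i, o) = max(k-distance(o), d(i, o)).
  lrd <- sapply(seq_len(n), function(i) {
    reach <- pmax(d[i, nn[i, ]], kdist[nn[i, ]])
    1 / mean(reach)
  })

  # LOF: mean density of the neighbors divided by the point's own density.
  sapply(seq_len(n), function(i) mean(lrd[nn[i, ]]) / lrd[i])
}

# Synthetic data (an assumption for the demo): 50 clustered points plus one far away.
set.seed(1)
pts <- rbind(matrix(rnorm(100), ncol = 2), c(10, 10))
scores <- lof_sketch(pts, k = 5)
which.max(scores)
```

Scores near 1 mean a point is about as dense as its neighbors; the isolated point in row 51 should stand out with a much larger score.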

Naive Bayes in R

The e1071 package provides the naiveBayes function. It assumes that the predictors are independent given the class, and that metric (numeric) predictors follow a Gaussian distribution.

Its documentation includes an example on the iris data:

> library(e1071)
> data(iris)
> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

Our target variable is Species:


> m <- naiveBayes(Species ~ ., iris)
> m

Naive Bayes Classifier for Discrete Predictors

Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)

A-priori probabilities:
Y
    setosa versicolor  virginica
 0.3333333  0.3333333  0.3333333 

Conditional probabilities:
            Sepal.Length
Y             [,1]      [,2]
  setosa     5.006 0.3524897
  versicolor 5.936 0.5161711
  virginica  6.588 0.6358796

            Sepal.Width
Y             [,1]      [,2]
  setosa     3.428 0.3790644
  versicolor 2.770 0.3137983
  virginica  2.974 0.3224966

            Petal.Length
Y             [,1]      [,2]
  setosa     1.462 0.1736640
  versicolor 4.260 0.4699110
  virginica  5.552 0.5518947

            Petal.Width
Y             [,1]      [,2]
  setosa     0.246 0.1053856
  versicolor 1.326 0.1977527
  virginica  2.026 0.2746501

To see its performance on the training data (remember that the 5th column is Species):


> table(predict=predict(m, iris[, -5]), true=iris[,5])
            true
predict      setosa versicolor virginica
  setosa         50          0         0
  versicolor      0         47         3
  virginica       0          3        47

For prediction, use predict. When type="raw" is given, the posterior class probabilities are returned instead of the class labels:


> predict(m, iris[1:10, -5])
 [1] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
Levels: setosa versicolor virginica

> predict(m, iris[1:10, -5], type="raw")
      setosa   versicolor    virginica
 [1,]      1 2.981309e-18 2.152373e-25
 [2,]      1 3.169312e-17 6.938030e-25
 [3,]      1 2.367113e-18 7.240956e-26
 [4,]      1 3.069606e-17 8.690636e-25
 [5,]      1 1.017337e-18 8.885794e-26
 [6,]      1 2.717732e-14 4.344285e-21
 [7,]      1 2.321639e-17 7.988271e-25
 [8,]      1 1.390751e-17 8.166995e-25
 [9,]      1 1.990156e-17 3.606469e-25
[10,]      1 7.378931e-18 3.615492e-25
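
To make the Gaussian assumption concrete, here is a minimal sketch in base R that redoes the computation by hand. It is an illustration, not the e1071 source; it assumes the [,1] and [,2] columns printed by the model are the per-class mean and standard deviation, and all variable names are my own.

```r
data(iris)
X <- iris[, -5]
y <- iris$Species
classes <- levels(y)

# A-priori class probabilities.
prior <- setNames(as.numeric(table(y)) / length(y), classes)

# Per-class mean and standard deviation of every predictor (classes x predictors).
mu    <- sapply(X, function(col) as.vector(tapply(col, y, mean)))
sigma <- sapply(X, function(col) as.vector(tapply(col, y, sd)))
rownames(mu) <- rownames(sigma) <- classes

# Log posterior up to a constant: log prior + sum of Gaussian log densities.
log_post <- sapply(classes, function(cl) {
  ll <- mapply(function(col, m, s) dnorm(col, mean = m, sd = s, log = TRUE),
               X, mu[cl, ], sigma[cl, ])
  log(prior[[cl]]) + rowSums(ll)
})

# Normalize to probabilities (subtract the row maximum for numerical stability).
post <- exp(log_post - apply(log_post, 1, max))
post <- post / rowSums(post)

pred <- factor(classes[apply(post, 1, which.max)], levels = classes)
table(predict = pred, true = y)
```

The rows of post play the role of the type="raw" output, and the table can be compared with the confusion matrix shown earlier.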

R data mining resource

rdatamining.com has some nice documents: http://www.rdatamining.com/docs.

In particular, don't miss the R reference card, which lists important data mining functions for R.

SMOTE for handling class imbalance

The DMwR package, which is described at length in the book Data Mining with R: Learning with Case Studies, has several interesting functions.

Among them, SMOTE is an easy-to-use function for handling class imbalance. To quote the example from the package, first generate a small sample data set:

> data(iris)
> data <- iris[, c(1, 2, 5)]
> data$Species <- factor(ifelse(data$Species == "setosa", "rare", "common"))
> head(data)
  Sepal.Length Sepal.Width Species
1          5.1         3.5    rare
2          4.9         3.0    rare
3          4.7         3.2    rare
4          4.6         3.1    rare
5          5.0         3.6    rare
6          5.4         3.9    rare
> table(data$Species)

common   rare
   100     50

Then we generate a new data set by 1) adding new examples of the minority class based on k-NN and interpolation, and 2) under-sampling the majority class examples:

> newData <- SMOTE(Species ~., data, perc.over=600, perc.under=100)
> table(newData$Species)

common   rare
   300    350
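
The counts above follow from the two percentages: perc.over=600 creates 6 synthetic cases per rare case (300 new + 50 original = 350), and perc.under=100 keeps one common case per synthetic case (300). The base R sketch below mimics these two steps. It is a simplified illustration, not the DMwR implementation; the variable names, k = 5, and sampling the majority class with replacement are my assumptions.

```r
set.seed(42)
data(iris)
d <- iris[, c(1, 2, 5)]
d$Species <- factor(ifelse(d$Species == "setosa", "rare", "common"))

perc_over  <- 600   # 6 synthetic cases per minority case
perc_under <- 100   # 1 majority case per synthetic case
k <- 5              # neighbors considered for interpolation

minority <- as.matrix(d[d$Species == "rare", 1:2])
dm <- as.matrix(dist(minority))
diag(dm) <- Inf

# 1) Over-sampling: interpolate between each minority case and a random near neighbor.
synthetic <- do.call(rbind, lapply(seq_len(nrow(minority)), function(i) {
  nn <- order(dm[i, ])[1:k]
  t(sapply(seq_len(perc_over / 100), function(j) {
    nb <- minority[sample(nn, 1), ]
    minority[i, ] + runif(1) * (nb - minority[i, ])
  }))
}))

# 2) Under-sampling: draw majority cases in proportion to the synthetic ones.
n_major <- nrow(synthetic) * perc_under / 100
major_idx <- sample(which(d$Species == "common"), n_major, replace = TRUE)

new_data <- rbind(
  d[d$Species == "rare", ],
  data.frame(Sepal.Length = synthetic[, 1],
             Sepal.Width  = synthetic[, 2],
             Species      = factor("rare", levels = levels(d$Species))),
  d[major_idx, ]
)
table(new_data$Species)
```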

Cumulative gains and lift chart

http://www2.cs.uregina.ca/~dbd/cs831/notes/lift_chart/lift_chart.html
http://www.nd.edu/~busiforc/handouts/DataMining/Lift%20Charts.html

Cumulative gains and lift charts compare a predictive model with a baseline such as random selection. They emphasize recall rather than precision.
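
A minimal sketch of how the two curves can be computed in base R. The scores and labels are simulated (an assumption for the demo), and the variable names are my own.

```r
set.seed(1)
n <- 1000
score  <- runif(n)               # model scores (simulated)
actual <- rbinom(n, 1, score)    # a higher score makes a positive more likely

# Rank cases from the highest to the lowest score.
ord <- order(score, decreasing = TRUE)
frac_selected <- seq_len(n) / n

# Cumulative gain: share of all positives captured in the top fraction (recall).
gain <- cumsum(actual[ord]) / sum(actual)

# Lift: gain relative to random selection, which captures positives
# in proportion to the fraction selected.
lift <- gain / frac_selected

idx <- round(seq(0.1, 1, by = 0.1) * n)
data.frame(fraction = frac_selected[idx], gain = gain[idx], lift = lift[idx])

# plot(frac_selected, gain, type = "l"); abline(0, 1, lty = 2)  # gains chart with baseline
```

A useful model shows a lift well above 1 in the first deciles, and both curves reach the baseline once every case is selected.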

A paper on imbalanced training sets

Miroslav Kubat and Stan Matwin, Addressing the Curse of Imbalanced Training Sets: One-Sided Selection, In Proceedings of the Fourteenth International Conference on Machine Learning, 1997

For a classification task where the output is either T or F, if the training set contains many T examples and only a few F examples, the classifier performs poorly at predicting F. This is because the classifier can achieve high accuracy by returning T most of the time, so it hardly learns how to classify data as F. The paper proposes a solution called one-sided selection, and it has been cited 634 times according to Google Scholar.
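
A tiny example of the effect in base R. The class counts (950 T, 50 F) are hypothetical numbers chosen for illustration.

```r
# Hypothetical imbalanced data: 950 T cases and 50 F cases.
actual    <- factor(c(rep("T", 950), rep("F", 50)), levels = c("T", "F"))
# A degenerate classifier that always answers T.
predicted <- factor(rep("T", 1000), levels = c("T", "F"))

accuracy <- mean(predicted == actual)
recall_F <- sum(predicted == "F" & actual == "F") / sum(actual == "F")

accuracy   # 0.95
recall_F   # 0
```

Accuracy looks good even though not a single F case is found, which is why accuracy alone is a poor measure on imbalanced data.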

Testing Normality

In this post, I'll demonstrate one-sample tests for checking whether a given sample comes from a normal distribution with mean=0, stddev=1.

> x = rnorm(30, 0, 1)

The most representative test is the Shapiro-Wilk test.

> shapiro.test(x)

	Shapiro-Wilk normality test

data:  x
W = 0.9605, p-value = 0.3187

As p > 0.05, we cannot reject H0 (that the sample comes from a normal distribution).

Another test is the Kolmogorov-Smirnov (K-S) test, a popular non-parametric test that checks whether a single sample comes from a given distribution, or whether two samples come from the same distribution. Being non-parametric, it can be used for small samples, for which we generally cannot assume a particular distribution. One limitation of the K-S test is that the parameters of the reference distribution (mean and stddev, in this example) must not be estimated from the sample being tested. Instead, we need to specify the model fully in advance.

Let's see how the K-S test works.

> ks.test(x, "pnorm", mean=0, sd=1)

	One-sample Kolmogorov-Smirnov test

data:  x
D = 0.2201, p-value = 0.09332
alternative hypothesis: two-sided
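
The statistic D is the largest gap between the empirical CDF of the sample and the reference CDF. A minimal sketch in base R that computes it by hand and compares it with ks.test (the seed and variable names are my own, so the numbers differ from the output above):

```r
set.seed(123)
x <- rnorm(30, 0, 1)
n <- length(x)

# Reference CDF evaluated at the sorted sample values.
Fx <- pnorm(sort(x), mean = 0, sd = 1)

# The empirical CDF jumps from (i-1)/n to i/n at the i-th sorted value,
# so the gap has to be checked on both sides of each jump.
d_plus  <- max(seq_len(n) / n - Fx)
d_minus <- max(Fx - (seq_len(n) - 1) / n)
D <- max(d_plus, d_minus)

# Should agree with the statistic reported by ks.test.
all.equal(D, unname(ks.test(x, "pnorm", mean = 0, sd = 1)$statistic))
```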

As explained, we need to specify the model parameters. The Anderson-Darling test is one that overcomes this limitation.

> library(nortest)
> ad.test(x)

	Anderson-Darling normality test

data:  x
A = 0.6474, p-value = 0.08246

For more discussion, read:
1) Kirkman, T.W. (1996) Statistics to Use. http://www.physics.csbsju.edu/stats/ (Feb. 2011): Read the Kolmogorov-Smirnov test section. It's a nice document explaining how the K-S test statistic is computed.
2) Kolmogorov-Smirnov Goodness-of-Fit Test, Engineering Handbook
3) Vito Ricci, Fitting distributions with R.
4) Juergen Gross, Package nortest.
5) 임동훈, Nonparametric Statistics Using R (R을 이용한 비모수 통계학), 자유아카데미.

Multivariate Adaptive Regression Spline

Multivariate Adaptive Regression Splines (MARS) is an extension of linear regression that handles non-linearity, as explained in Wikipedia.

In R,

> d <- data.frame(x=c(1:50), y=c(1:25*(-2)+100, 26:50*3+7))
> plot(d)

Certainly, this cannot be described by a simple linear model.

We use earth:

> library(earth)
> e <- earth(y ~ x, d)
> plotmo(e)

 grid:    x
       25.5

As one can see, we have a couple of hinges:

> summary(e)
Call: earth(formula=y~x, data=d)

            coefficients
(Intercept)    52.594370
h(x-22)         6.208889
h(22-x)         2.237602
h(x-28)        -3.132070

Selected 4 of 4 terms, and 1 of 1 predictors
Importance: x
Number of terms at each degree of interaction: 1 3 (additive model)
GCV 21.50975    RSS 795.4307    GRSq 0.9767953    RSq 0.9821302

More at Notes on earth package. The mda package also implements this method in its mars function. One may also consider the splines library.
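
A hinge term h(x-c) is simply max(0, x-c), so the idea can be sketched with lm in base R. This is only an illustration: the knot at 25 is fixed by hand (an assumption based on how the data were built), whereas earth searches for the knots automatically.

```r
d <- data.frame(x = 1:50, y = c(1:25 * (-2) + 100, 26:50 * 3 + 7))

# Hinge function: zero on one side of the knot, linear on the other.
h <- function(z) pmax(0, z)

linear_fit <- lm(y ~ x, data = d)
hinge_fit  <- lm(y ~ h(x - 25) + h(25 - x), data = d)   # knot at 25 chosen by hand

r2_linear <- summary(linear_fit)$r.squared
r2_hinge  <- summary(hinge_fit)$r.squared
c(linear = r2_linear, hinge = r2_hinge)
```

Even a single hand-picked pair of hinges fits the V-shaped data far better than the straight line; earth improves on this by choosing the number and position of the knots itself.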

Support Vector Regression

SVM can also be used for regression:
A Tutorial on Support Vector Regression

For libraries in R:
Support Vector Machines in R.
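
What distinguishes support vector regression from least squares is the epsilon-insensitive loss: errors smaller than epsilon cost nothing, and larger errors are penalized linearly. A minimal sketch in base R (eps_insensitive_loss is a hypothetical helper written for this note, not a function from any package):

```r
# Epsilon-insensitive loss: max(0, |y - f| - eps).
eps_insensitive_loss <- function(y, f, eps = 0.1) {
  pmax(0, abs(y - f) - eps)
}

y <- c(1, 2, 3)          # observed values
f <- c(1.05, 2.5, 1)     # predicted values
loss <- eps_insensitive_loss(y, f, eps = 0.1)
loss
```

The first prediction falls inside the epsilon tube and costs nothing; the other two are charged only for the part of the error that exceeds epsilon.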
