LDA (Latent Dirichlet Allocation) Example in R

LDA is a generative model for finding topics in documents. If you don’t know what LDA is, I recommend the "Probabilistic Topic Models" article in Communications of the ACM.
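To make "generative model" concrete, here is a toy sketch of the process LDA assumes: each topic is a distribution over words, each document draws its own mixture of topics, and every word is produced by first picking a topic and then picking a word from that topic. The vocabulary, the Dirichlet parameter 0.5, and the helper names (rdirichlet_one, generate_document) are made up for illustration; this is not how topicmodels fits a model, only a simulation of the assumed data-generating story.

```r
# Toy simulation of the LDA generative process (base R only).
set.seed(42)

# Draw one sample from a Dirichlet distribution by normalizing gamma draws.
rdirichlet_one <- function(alpha) {
  g <- rgamma(length(alpha), shape = alpha, rate = 1)
  g / sum(g)
}

vocab <- c("oil", "barrel", "opec", "merger", "shares", "bid")  # made-up vocabulary
k <- 2           # number of topics
doc_length <- 8  # words per document (fixed here for simplicity)

# One word distribution per topic (rows: topics, columns: words).
beta <- t(replicate(k, rdirichlet_one(rep(0.5, length(vocab)))))
colnames(beta) <- vocab

generate_document <- function() {
  theta <- rdirichlet_one(rep(0.5, k))                      # topic mixture of this document
  z <- sample(k, doc_length, replace = TRUE, prob = theta)  # one topic per word slot
  words <- vapply(z, function(topic) sample(vocab, 1, prob = beta[topic, ]),
                  character(1))
  list(theta = theta, topics = z, words = words)
}

doc <- generate_document()
print(doc$words)
```

Fitting LDA goes the other way: given only the words, it estimates the per-topic word distributions and the per-document topic mixtures.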

I found two R packages for LDA: lda and topicmodels. I chose topicmodels because it comes with a vignette.

Below is a simple example that uses LDA() to model two topics from a document collection combining crude (oil-related news) and acq (M&A-related news).

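# If you get a gsl error when installing topicmodels,
# run 'sudo apt-get install libgsl0-dev' in Ubuntu.

library(tm)           # crude and acq corpora, DocumentTermMatrix(), meta()
library(topicmodels)  # LDA(), topics()
library(foreach)      # foreach() and %do%

data(crude)
data(acq)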

crude_acq <- c(DocumentTermMatrix(crude), DocumentTermMatrix(acq))
crude_doc_ids <- unlist(meta(crude, type='local', tag='id'))
acq_doc_ids <- unlist(meta(acq, type='local', tag='id'))

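# Two topics, 1000 random starts with Gibbs sampling.
num_random_start <- 1000
# NOTE: the original call is cut off after method="Gibbs". The control list
# below is an assumed completion; topicmodels needs one seed per random start.
m <- LDA(crude_acq, k=2, method="Gibbs",
         control=list(nstart=num_random_start, seed=seq_len(num_random_start)))

# Precision and recall of the predicted document ids against the reference ids.
evaluate <- function(reference, prediction) {
  return(list(precision=NROW(intersect(reference, prediction)) / NROW(prediction),
              recall=NROW(intersect(reference, prediction)) / NROW(reference)))
}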

# Topic numbers are arbitrary, so there is no way to know in advance which
# topic corresponds to which data set. Try both assignments and guess based
# on mean precision.
mean_precision <- foreach(i=1:2, .combine=c) %do% {
  maybe_crude <- names(which(topics(m) == i))
  maybe_acq <- names(which(topics(m) == 2 - i + 1))
  mean(c(
    evaluate(crude_doc_ids, maybe_crude)$precision,
    evaluate(acq_doc_ids, maybe_acq)$precision))
}

lda_crude_doc_ids <- names(which(topics(m) == which.max(mean_precision)))
lda_acq_doc_ids <- names(which(topics(m) == 2 - which.max(mean_precision) + 1))

evaluate(crude_doc_ids, lda_crude_doc_ids)
evaluate(acq_doc_ids, lda_acq_doc_ids)
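The topic-matching step above tries both assignments and keeps the one with the higher mean precision. As a quick sanity check on that idea, a cross-tabulation of true labels against topic numbers shows the same information at a glance. The sketch below uses made-up labels and topic assignments for eight documents (truth and topic are hypothetical, not output from the model above) and only base R.

```r
# Hypothetical true labels and topic assignments for eight documents.
truth <- c("crude", "crude", "crude", "acq", "acq", "acq", "acq", "acq")
topic <- c(2, 2, 1, 1, 1, 1, 2, 1)

# Rows: true label, columns: assigned topic number.
confusion <- table(truth, topic)
print(confusion)

# For each label, pick the topic number that holds most of its documents.
idx <- apply(confusion, 1, which.max)
best_topic <- setNames(as.integer(colnames(confusion)[idx]), rownames(confusion))
print(best_topic)
```

With real data, topic would be topics(m) and truth would be built from the two id vectors. Note that this per-label majority vote can map both labels to the same topic, which the try-both-assignments loop avoids.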


> evaluate(crude_doc_ids, lda_crude_doc_ids)
$precision
[1] 0.7272727

$recall
[1] 0.8

> evaluate(acq_doc_ids, lda_acq_doc_ids)
$precision
[1] 0.9166667

$recall
[1] 0.88
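To read these numbers: evaluate() returns precision first (what fraction of the documents assigned to a topic really belong to that data set) and recall second (what fraction of a data set ended up in its topic). Here is a small worked example on made-up ids, using a slightly condensed copy of evaluate() so that it runs on its own.

```r
# Same logic as evaluate() above, with the shared intersection computed once.
evaluate <- function(reference, prediction) {
  hits <- length(intersect(reference, prediction))
  list(precision = hits / length(prediction),
       recall = hits / length(reference))
}

reference  <- c("d1", "d2", "d3", "d4", "d5")          # made-up true ids
prediction <- c("d1", "d2", "d3", "d4", "d9", "d10")   # made-up predicted ids

# 4 of the 6 predictions are correct, and 4 of the 5 reference ids are found.
result <- evaluate(reference, prediction)
print(result)
```

Assuming crude holds 20 documents and acq holds 50, as in the tm data sets, the results above are consistent with 16 of 22 documents in the crude topic being correct (16/22 ≈ 0.727, 16/20 = 0.8) and 44 of 48 in the acq topic (44/48 ≈ 0.917, 44/50 = 0.88).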

