LDA (Latent Dirichlet Allocation) is a generative model for finding topics in documents. If you don’t know what LDA is, I recommend the article “Probabilistic Topic Models” in Communications of the ACM.
I found two R packages for LDA: lda and topicmodels. I chose topicmodels because it comes with a vignette.
Below is a simple example that uses LDA() to model two topics from the documents in two corpora from the tm package: crude (oil-related news) and acq (M&A-related news).
# If you get a GSL error when installing topicmodels,
# run 'sudo apt-get install libgsl0-dev' on Ubuntu.
library(topicmodels)
library(tm)
library(randtoolbox)
library(foreach)

data(acq)
data(crude)
crude_acq <- c(DocumentTermMatrix(crude), DocumentTermMatrix(acq))
crude_doc_ids <- unlist(meta(crude, type='local', tag='id'))
acq_doc_ids <- unlist(meta(acq, type='local', tag='id'))

# Two topics, 1000 random starts with Gibbs sampling.
# Each start needs its own seed; the first 1000 primes are used as seeds.
num_random_start <- 1000
m <- LDA(crude_acq, k=2, method="Gibbs",
         control=list(seed=get.primes(num_random_start),
                      nstart=num_random_start))

# Precision and recall of a predicted set of document ids against a reference set.
evaluate <- function(reference, prediction) {
  hits <- NROW(intersect(reference, prediction))
  return(list(precision=hits / NROW(prediction),
              recall=hits / NROW(reference)))
}

# LDA topics are unlabeled, so there is no direct way to tell which topic
# corresponds to which corpus. Try both assignments and guess based on
# mean precision.
mean_precision <- foreach(i=1:2, .combine=c) %do% {
  maybe_crude <- names(which(topics(m) == i))
  maybe_acq <- names(which(topics(m) == 3 - i))
  # Wrap the values in c(): mean(a, b) would treat b as the 'trim' argument.
  mean(c(evaluate(crude_doc_ids, maybe_crude)$precision,
         evaluate(acq_doc_ids, maybe_acq)$precision))
}

crude_topic <- which.max(mean_precision)
lda_crude_doc_ids <- names(which(topics(m) == crude_topic))
lda_acq_doc_ids <- names(which(topics(m) == 3 - crude_topic))

evaluate(crude_doc_ids, lda_crude_doc_ids)
evaluate(acq_doc_ids, lda_acq_doc_ids)
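To make the evaluation step easier to follow, here is a minimal sketch in base R that runs the same kind of precision/recall helper on a toy input. The document ids (d1, d2, ...) and the predicted set are made up for illustration and are not taken from the crude or acq corpora.

```r
# Precision/recall helper, equivalent to evaluate() in the example.
evaluate <- function(reference, prediction) {
  hits <- length(intersect(reference, prediction))
  list(precision = hits / length(prediction),
       recall = hits / length(reference))
}

# Hypothetical ids: d1..d4 are the documents that really belong to one corpus.
reference <- c("d1", "d2", "d3", "d4")
# Hypothetical topic assignment: two correct ids and one wrong id.
prediction <- c("d1", "d2", "d9")

res <- evaluate(reference, prediction)
res$precision  # share of predicted ids that are correct (2 of 3)
res$recall     # share of reference ids that were found (2 of 4)

# To average several scores, pass them as one vector.
avg <- mean(c(res$precision, res$recall))
avg
```

Precision answers "how many of the documents in this topic belong to the corpus", while recall answers "how many of the corpus documents ended up in this topic".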
Result:
> evaluate(crude_doc_ids, lda_crude_doc_ids)
$precision
[1] 0.7272727

$recall
[1] 0.8

> evaluate(acq_doc_ids, lda_acq_doc_ids)
$precision
[1] 0.9166667

$recall
[1] 0.88
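As a sanity check on these numbers, the sketch below rebuilds the confusion counts they imply. It assumes the usual sizes of the tm sample corpora (20 crude and 50 acq documents). The counts 16, 6, 4 and 44 are my own back-calculation from the scores, not output of the run.

```r
# Assumed counts, inferred from the reported precision and recall.
# Rows: topic chosen by LDA. Columns: corpus the document came from.
confusion <- matrix(c(16, 4, 6, 44), nrow = 2,
                    dimnames = list(topic = c("crude", "acq"),
                                    actual = c("crude", "acq")))

# Precision: correct documents divided by the size of the topic (row sum).
precision <- diag(confusion) / rowSums(confusion)
# Recall: correct documents divided by the size of the corpus (column sum).
recall <- diag(confusion) / colSums(confusion)

round(precision, 7)
round(recall, 7)

# Overall share of documents placed in the matching topic.
accuracy <- sum(diag(confusion)) / sum(confusion)
accuracy
```

Under these assumptions, 60 of the 70 documents land in the topic that matches their corpus.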