# Lda Perplexity Evaluation Essay

## A heuristic approach to determine an appropriate number of topics in topic modeling

1Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, U.S. Food and Drug Administration, Jefferson, AR 72079, USA

2College of Information Engineering, Xiangtan University, Xiangtan, Hunan Province, China

Corresponding author.

#### Supplement

Proceedings of the 12th Annual MCBIOS Conference

Jonathan D Wren, Whraddah Thakkar, Ramin Jomayouni, Donald J Johann and Mikhail G Dozmorov

#### Conference

12th Annual MCBIOS Conference

13-14 March 2015

Little Rock, AR, USA

## Abstract

### Background

Topic modelling is an active research field in machine learning. While mainly used to build models from unstructured textual data, it offers an effective means of data mining where samples represent documents, and different biological endpoints or omics data represent words. Latent Dirichlet Allocation (LDA) is the most commonly used topic modelling method across a wide number of technical fields. However, model development can be arduous and tedious, and requires burdensome and systematic sensitivity studies in order to find the best set of model parameters. Often, time-consuming subjective evaluations are needed to compare models. Currently, research has yielded no easy way to choose the proper number of topics in a model beyond a major iterative approach.

### Methods and results

Based on analysis of variation of statistical perplexity during topic modelling, a heuristic approach is proposed in this study to estimate the most appropriate number of topics. Specifically, the rate of perplexity change (RPC) as a function of numbers of topics is proposed as a suitable selector. We test the stability and effectiveness of the proposed method for three markedly different types of grounded-truth datasets: Salmonella next generation sequencing, pharmacological side effects, and textual abstracts on computational biology and bioinformatics (TCBB) from PubMed.

### Conclusion

The proposed RPC-based method is demonstrated to choose the best number of topics in three numerical experiments of widely different data types, and for databases of very different sizes. The work required was markedly less arduous than if full systematic sensitivity studies had been carried out with number of topics as a parameter. We understand that additional investigation is needed to substantiate the method's theoretical basis, and to establish its generalizability in terms of dataset characteristics.

Keywords: Rate of perplexity change (RPC), perplexity, topic number, latent Dirichlet allocation (LDA)

## Background

Topic models are Bayesian statistical models where unstructured data, normally a set of textual documents, are structured in accordance with latent themes called topics that have multinomial distributions on words. Given a collection of unstructured text documents, topic modeling assumes that there are a certain number of latent topics in the collection of documents (corpus) and that each document contains multiple topics in different proportions. Researchers have developed several topic models, including Latent Semantic Indexing (LSA) [1], Probabilistic Latent Semantic Analysis (PLSA) [2,3], and Latent Dirichlet Allocation (LDA) [4]. Topic modeling has wide applications in various fields such as text mining [2-5], image retrieval [6], social network analysis [7] and bioinformatics analysis [8-11].

LDA, an unsupervised generative probabilistic method for modeling a corpus, is the most commonly used topic modeling method. LDA assumes that each document can be represented as a probabilistic distribution over latent topics, and that topic distribution in all documents share a common Dirichlet prior. Each latent topic in the LDA model is also represented as a probabilistic distribution over words and the word distributions of topics share a common Dirichlet prior as well. Given a corpus D consisting of M documents, with document d having Nd words (d ∈{1,..., M}), LDA models D according to the following generative process [4]:

(a) Select a multinomial distribution φt for topic t (t ∈{1,..., T}) from a Dirichlet distribution with parameter β.

(b) Select a multinomial distribution θd for document d (d ∈{1,..., M}) from a Dirichlet distribution with parameter α.

(c) For a word wn (n ∈{1,..., Nd }) in document d,

(i) Select a topic zn from θd.

(ii) Select a word wn from φzn.

In above generative process, words in documents are the only observed variables while others are latent variables (φ and θ) and hyper parameters (α and β). In order to infer the latent variables and hyper parameters, the probability of observed data D is computed and maximized as follows:

(1)

Due to the coupling between θ and φ in the integrand in Eq. (1), exact inference in LDA is intractable. Various approximate algorithms such as variational inference [4,6-8] or Markov chain Monte Carlo (MCMC) [5,9,11] are typically used for inference in LDA.

The effectiveness of LDA to segregate document collections into germane themes has been well demonstrated for document collections such as manually curated scientific literature where the "truth" within documents and the number of relevant themes are known a priori [10]; such sets of already structured documents are hereafter called truth sets. Difficulty arises, however, for unstructured document sets where document-wise content and number of relevant themes are not known a priori. That is, the best number of topics to enable the best topic model is unknown, while different numbers of topics will likely result in very different structuring of the corpus. An insufficient number of topics could render an LDA model that is too coarse to identify accurate classifiers. On the other hand, an excessive number of topics could result in a model that is too complex, making interpretation and subjective validation difficult [10]. We have been unable to identify any current efforts to develop a heuristic from which to evaluate an appropriate number of topics for a previously unseen and modelled, unstructured document set. Lacking such a heuristic to choose the number of topics, researchers have no recourse beyond an informed guess or time-consuming trial and error evaluation. For trial and error evaluation, an iterative approach is typical based on presenting different models with different numbers of topics, normally developed using cross-validation on held-out document sets, and selecting the number of topics for which the model is least perplexed by the test sets. Perplexity is a commonly used measurement in information theory to evaluate how well a statistical model describes a dataset, with lower perplexity denoting a better probabilistic model. Formally, for a test set of M documents, the perplexity is defined as [4]. Using the identified appropriate number of topics, LDA is performed on the whole dataset to obtain the topics for the corpus. We refer to this as the perplexity-based method.

Although the perplexity-based method may generate meaningful results in some cases, it is not stable and the results vary with the selected seeds even for the same dataset. In this study, we propose a new approach in which the rate of perplexity change (RPC) is calculated, and the change point of RPC is determined to be the most appropriate number of topics. The proposed approach is designated as RPC-based change point method (RPC is used hereafter). Three different types of datasets were applied to test the approach and the results validated the stability and effectiveness of the proposed method for selecting the best number of topics for LDA algorithms. The novel method was found to be unique, accurate, easy to use, and applicable to various kinds of datasets with different data types, and therefore, improving the accuracy and efficacy of topic model-based text mining and data mining.

## Materials and methods

### Datasets

In this study, three different types of datasets were utilized to test and evaluate the proposed approach. The first dataset is the whole genome sequences of 119 Salmonella enterica strains. The 119 Salmonella strains belong to Salmonella O antigen group B [12], including 75 S. Agona, 14 S. Heidelberg, 1 S. Paratyphi B, 2 S. Saintpaul, 2 S. Schwarzengrund, 1 S. Stanley, 22 S. Typhimurium, 1 S. Typhimurium var.5- and 1 S. 4,[5],12:i.

The second dataset was retrieved from the publicly available SIDER2 database (http://sideeffects.embl.de) [13]. The dataset includes 996 drugs with 4500 side effects filtered by MedDRA (Medical Dictionary for Regulatory Activities: http://www.meddra.org). The original dataset was represented by a 996 × 4500 drug-side effect matrix, where each entry is either 1 or 0, indicating presence or absence in the drug profile. In data preprocessing, each drug was considered as a document and each existing side effect term in a document was considered as a word in the vocabulary. The Anatomical Therapeutic Chemical (ATC) classification system (http://www.who.int/classifications/atcddd/en/) was applied to classify the 996 drugs in SIDER2 dataset according to their target organs or systems and their therapeutic, pharmacological and chemical properties. The ATC terms were utilized to evaluate the proposed method by calculating the k-means cluster purities.

We created the third dataset by retrieving the abstracts of papers published in the IEEE Transactions on Computational Biology and Bioinformatics (TCBB) from the PubMed database. The dataset was comprised of all the abstracts of 885 papers published in TCBB from 2004 to 2013. The dataset was preprocessed by tokenizing, removing stop words and stemming.

### Developing the heuristic approach to determine the appropriate topic number

Models were built using m-fold cross validation. Data were randomly divided into m subsets denoted as S1, S2, ..., Sm. Candidate numbers of topics t1, t2,..., tr were sorted in increasing order. For each number of topics t, an LDA model was built m times on a training set combining m-1 subsets of the entire dataset. The trained LDA model was then utilized to calculate the perplexity on the held-out testing subset. Thus, each subset Si (i∈{1,...,m}) was included in the training set (m-1) times and tested once. The average of perplexities from m testing sets was taken to be perplexity result for each candidate number of topics. Denoting the average perplexities for r candidate number of topics as P1, P2... Pr, the rate of perplexity change (RPC) for topic number ti (1<ir) was calculated as in Eq. (2).

(2)

The LDA algorithm implemented in Mallet [14] was used in our study, where inference in Mallet was based on Gibbs sampling [5].

### Method evaluation

#### Evaluation of method stability

The whole genome sequence dataset of 119 S. enterica strains was used to evaluate the stability of the proposed RPC-based change point method. The dataset was preprocessed and aligned with the multiple sequence alignment (MSA) algorithm MUSCLE [15]. Nucleotide differences among the sequences of 119 strains were taken to be single nucleotide polymorphisms (SNPs). Each resultant SNP and its corresponding coordinate location in the aligned sequence were encoded as a word.

To evaluate stability, the RPC-based method was compared with the perplexity-based method. The testing topic numbers were selected as 5, 10, and then increments of 10 more up to 100. Model building using cross-validation was repeated 50 times. Each time, a different random seed in Gibbs sampling from Mallet's program was used for each approach, and generated an appropriate topic number for each of the two methods. The frequencies of the obtained appropriate topic numbers were counted, and could be viewed as a probabilistic distribution over tested topic numbers after normalization. Then the entropy of the distribution was calculated to evaluate the stability of the two methods [16]. In information theory, entropy is a measurement to evaluate the uncertainty of a source of information. The Shannon entropy [16] was calculated as in Eq. (3).

(3)

where the distribution P is the normalized frequency of the derived appropriate topic numbers obtained in each of the approaches. The smaller the entropy value, the more stable the method.

#### Evaluation of method efficiency

Cluster analysis was conducted on the output of LDA models with various numbers of topics to evaluate the efficiency of the proposed method. For the sequence dataset of 119 Salmonella strains, leave-one-out (119-fold) cross validation was applied to calculate RPC values on the tested topic numbers 5, 10, plus increments of 10 up to 100. Hierarchical clustering algorithm and k-means algorithm with 10 clusters were conducted on the probabilities of obtained topics for all 119 strains. The purities of the resultant clusters were calculated based on the true labels (real serotypes of the strains). The average purities were considered as the final evaluation values for the LDA models with different number of topics. The running time of LDA models with different number of topics was compared to show the efficiencies of the proposed method.

Five-fold cross validation was applied on the SIDER2 dataset. Clustering analysis using the topic probabilities of the different drugs (documents) was conducted to comparatively evaluate the LDA models with different number of topics. Hierarchical clustering algorithm and k-means algorithm with two different settings of k (i.e., number of clusters) were used. Each cluster was labelled as the dominant ATC code among the drugs in the cluster and the ratio of the ATC code was calculated as purity of the cluster. The average purity of the obtained clusters by k-means method was used to evaluate LDA models with different numbers of topics. The running time of LDA models with different number of topics was also compared.

Five-fold cross validation was utilized to select the most appropriate number of topics for the TCBB dataset for the proposed method. The topic numbers 5, 10, and increments of ten up to 100 were used. A visual representation, word cloud, was created based on the distribution over words, and manually interpreted to evaluate the accuracies of the proposed method [17]. The word cloud generator (http://www.jasondavies.com/wordcloud) was used.

## Results

### Development of RPC-based method

The RPC-based heuristic approach to select an appropriate number of topics for an LDA topic model was applied to three distinctly different datasets with very different data types. After data preprocessing as described in Material and Methods, the Salmonella sequence dataset was transformed into a corpus of 119 documents (corresponding to strains), where each document consisted of the same number of words (i.e., the number of SNPs after MSA). The final corpus had a total of 99,960 occurrences (119x840) in 119 documents that contained 2379 various SNPs. The SIDER2 corpus had a total of 117,329 occurrences in 996 documents and contained 4500 various words (i.e., side effects). The TCBB dataset corpus had a total of 84,646 occurrences in 885 documents (abstracts), and contained 5004 various words. RPC values for the LDA models at the candidate numbers of topics were calculated with m-fold cross validation for each of the three preprocessed datasets using Eq. (2). The results are plotted in Figure 1(a)-(c). Based on our method, the number of topics corresponding to the change of slope for the plot of RPC versus number of topics was deemed to be the most appropriate for a given dataset. That is, the first i that satisfied RPC(i) <RPC(i+1) was chosen as the most appropriate number. According the results in Figure ​1, the best number of topics were 20, 50, and 40 for the Salmonella sequence dataset, SIDER2 dataset, and the TCBB dataset, respectively.

Figure 1

RPC values of LDA models with various testing topic numbers in each of three datasets. (a) Salmonella sequence dataset; (b) SIDER2 dataset; (c) TCBB dataset.

### Evaluation of the proposed RPC-based method

Three different datasets were used in this study to evaluate the stability and efficiency of the approach proposed to choose a best number of topics in LDA topic modelling.

#### Comparison of method stabilities

Both the perplexity-based approach (Perplexity) and the proposed RPC-based approach (RPC) were repeated 50 times with different random seeds to the Salmonella sequence dataset. Figure ​2 plots the frequencies of the calculated most appropriate number of topics. The RPC-based method (green bars) chose 20 topics as most appropriate for 80% of the models, and 10, 30 or 40 topics for the remaining 20%. In contrast, the perplexity-based approach (red bars) appropriate ranged widely from 20 to 90 topics also, while 30 was selected as often most frequently, it was less in only 23 of 50 iterations. Additionally, the mean model entropy for the RPC-based method was 1.0, much lower than the 1.853 for perplexity-based models, further confirming RPC-based selection of numbers of topics to be the more stable approach.

Figure 2

Comparison of frequencies of candidate topic numbers obtained by perplexity-based method and RPC-based method.

#### Comparison of method efficiencies

LDA models were built for each of the three datasets for different numbers of topics, and of course including the selected appropriate numbers of topics. Each model's result provided matrices of topics, topic probability distributions across documents, and the word probability distributions across topics. The efficiencies of the proposed method were evaluated by data mining towards the derived LDA matrices from the three datasets.

Both hierarchical clustering and k-means clustering were performed on the Salmonella strains-topics (i.e, document-topic) LDA probability matrix for the Salmonella sequence dataset. The real serotypes of 119 Salmonella strains were used as the true labels to identify the misclassified strains. The resultant hierarchical cluster dendrogram trees for all numbers of topics considered yielded the highest purity when trees were cut at a height of 0.25. The numbers of misclassified strains from each hierarchical cluster and the LDA computing time for different number of topics are shown in Table ​1. The results of k-means (k = 10) showed that LDA models with 20 or 30 topics gave the best clustering accuracy with all 119 strains correctly identified (Table ​2). Since LDA models normally require more running time to converge with an increasing number of topics, 20 was determined as the most appropriate number of topics for the Salmonella sequence dataset based on both accuracy and efficiency. This result accommodates with that obtained by the proposed RPC-based approach.

Table 1

Hierarchical clustering accuracy and running time of Salmonella sequence dataset

Table 2

K-means clustering accuracy and running time of Salmonella sequence dataset

Hierarchical clustering and k-means algorithm with two different settings (k = 20 and 30) were also both utilized to cluster the drug-topic matrix derived from LDA models for the SIDER2 preprocessed dataset across the different numbers of topics. The 996 drugs in SIDER2 dataset were classified into 14 main groups according to the first level term of the ATC. To evaluate the accuracies of the proposed approach, the misclassified drugs from hierarchical clustering analysis and the purities of the k-means clusters were calculated on the basis of the ATC codes and classifications of the drugs as described in Material and Methods. The resultant hierarchical cluster dendrogram trees cut at a height of 0.6 showed that the least number of drugs (205) as misclassified when the number of topics was 50 (Table ​3). Similar results shown in Table ​4 confirms that the highest purities were obtained when the number of topics was 50 or 60. Because of the lower run time, 50 topics were considered as the most efficient.

Table 3

Hierarchical clustering accuracy and running time on SIDER2 dataset

Table 4

K-means clustering accuracy and running time of SIDER2 dataset

The TCBB dataset that was downloaded from PubMed database consists of 885 abstracts from ten years of publications in the journal IEEE Transactions on Computational Biology and Bioinformatics. Since no truth labels were available to classify them in a manner that would enable a cluster to be built and its purity computed, we used the qualitative approach to assess whether the PRC method could choose the best number of topics. Word clouds were used to represent LDA-derived topic-words matrices, and these matrices were, in turn, subjectively interpreted and evaluated to compare models built with different numbers of topics. Human assessment of topic model validity is a common practice, where topic meaning is subjectively interpreted from the topic-word multinomial distribution. Word clouds are just a way to visualize the distribution where word probabilistic weightings correspond to word graphical (font) sizes. The quality of a model is assessed as higher when its topic themes are more salient and distinguishable than those from other models. The RPC-based method selected 40 as the most appropriate number of topics. We therefore compared the model with 40 topics to the models with 20 and 60 topics. Figure ​3 gives word clouds for eight illustrative topics for the model with 40 topics (Suppl. Figure S1 in Additional file 1 ). Each of the eight topic word clouds in Figure ​3 depict unique and distinguishable theme, which correspond to distinct research fields of computational biology and bioinformatics. Results (Suppl. Figure S1 in Additional file 1) are similar for the remaining 32 topics. Consider Topic 8 (T8 in Figure ​3) for a closer check. Clearly, the salient theme is estimation models, with most words recognizable as pertinent to that field of research. We also located a number of documents in TCBB dataset that had their highest probabilistic association with Topic 8 as listed in Table ​5. Most of these papers were, indeed, subjectively judged to be primarily related to estimation models.

Figure 3

Eight example topics obtained by LDA modeling with 40 topics on TCBB dataset.

Table 5

Abstracts with label T8 (Estimation models)

For the model with 20 topics, some topics were found salient and distinct themes, and some were not, at least in comparison to the model with 40 topics. Some topics were missing, for example, estimation models such as Topic 8 in Figure ​3. Other topics seemed to lump what would preferably be better differentiated themes with 40 topics. For example, the word cloud of T4 shown in Figure 4(a) has at least three themes merged: protein interaction, biomedical task system, and the text extracting. Other topics seemed less specific or too broad as shown in Figure 4(b), compared to those from the model with 40 topics,

Figure 4

Two example topics from an LDA model with 20 topics derived from the TCBB dataset.

In the LDA models with 60 topics, a larger number of topics were judged to be less meaningful in terms of being able to discern a unique and salient theme, compared to the model with 40 topics. Figure ​5 gives word cloud representations of four illustrative topics. In each, a few words are displayed with comparable large front size, indicating that these words have comparable high probabilities within the same topic. Consequently, it is hard to distinguish the theme for each topic.

Figure 5

Four example topics derived by LDA modeling with 60 topics on TCBB dataset.

## Discussion

Topic models can often provide highly effective means for text mining and knowledge discovery, especially in the big data era. They are also agnostic as to data type since, for example, biological samples can be considered documents, and gene, protein, biological pathways and many other independent variables can be considered words. There are a myriad of potential applications.

Topic modelling also has drawbacks. They require skill and experience to successfully apply. With all text mining approaches, validation can be difficult, tedious and subjective, where truth is not known a priori. Finally, determining the "best model" is an iterative process to determine the parameter values that yield the best outcome, among which is the number of topics.

Currently, a set reasonable guesses or perplexity minimization is mostly used to select an appropriate number of topics for LDA modelling. Both of these approaches are reasonable, but carry a high burden of time and work to carry out the needed sensitivity (parameter) studies. A systematic sensitivity study is further complicated by the variation in models with random seed sampling during the generative model building process.

Since the objective function in Eq. (1) is a non-convex function, different initial parameters in approximate algorithms, such as Laplace approximation, variational approximation and MCMC, will lead to distinct local maximums. With different random seeds in MCMC or different initial parameters in variational inference approach, the approximate optimizing solutions to LDA may converge to a different local optimal point for the same dataset. As an example, when we applied the perplexity-based method to the Salmonella sequence dataset three times with different random seeds in MCMC, very different minimum perplexity values of 30, 60 and 90 (Figure 6(a) were obtained; bear in mind that the leave-one-out cross validation process for each number of topics is carried out with the random seed held constant. Figure ​6b shows a plot of perplexity versus number of topics for a wide range of topics up to 400. We can observe the types of variation across number of topics in Figure ​6b: (Left section) perplexity decreases steeply as more topics provide a better fit to predict the hold out data; (Middle section) perplexity fluctuates when small variation indicating good fit; and (Right section) perplexity increases due to over fitting of the training set. However, the main concern is that the flattened middle section spans a three-fold range of numbers of topics from 30 to 90, and as shown in Table ​1 more than 30 topics results in a much poorer model in terms of accurate serotyping.

Figure 6

Two drawbacks of a perplexity-based method in selecting topic numbers.

The new heuristic approach developed in this study attempts to overcome these weaknesses on the selection of an appropriate number of topics in LDA modelling by offering a heuristic alternative to a full-blown sensitivity study. Rather than choosing among several numbers of topics over a potentially large range where perplexity fluctuates (middle stage M in Figure ​6b), the quantity defined as the change-point of rate of perplexity change can be chosen as a putative best number of topics from a heuristic analysis.

We conjecture a theoretical justification for use of RPC-based method on the principle of change-point [18]. For a given series of random variables x1, x2,..., xT, the change-point is distinguished as t if a distribution function F1(x) shared by x1, x2,..., xt is different with F2(x) shared by xt, xt+1,..., xT. Applied on the RPC series with increasing candidate topic numbers T1, T2,..., TK, the first number Ti which satisfies RPC(Ti) <RPC(Ti+1) is considered as the most appropriate topic number for the corresponding dataset.

The results confirm that the proposed RPC-based method is stable, accurate and effective for the three numerical experiments presented, each of which constitutes very different data types. In particular, LDA models using numbers of topics from RPC-based selection yielded the matrices for data mining datasets for genomic sequence, drug pharmacology, and textual documents, demonstrating some generalizability across data types. Choosing the best number of topics is an omnipresent concern in topic modelling, as well as other latent variable methodologies. The comparatively simple RPC-based heuristic we propose could simplify topic model development, generally, for many applications, and offer an easier means to use development time for better fine tuning of models.

## Competing interests

The authors declare that they have no competing interests.

## Authors' contributions

WZ (Zhao) performed all the calculations and data analysis, and wrote the first draft of the manuscript. WZ developed the methods, had the original idea, and guided the data analysis and presentation of results. WZ, WZ (Zhao), RP, ZL, YD, and WG participated the dataset construction and result presentation. JC and WZ managed the research and guided the scientific discussing and editing. All authors contributed to data verification, approach evaluation, and assisted with writing the manuscript. All authors read and approved the final manuscript.

## Acknowledgements

The findings and conclusions in this article have not been formally disseminated by the US Food and Drug Administration (FDA) and should not be construed to represent the FDA determination or policy. This work and the publication were funded by FDA. Dr. Weizhong Zhao acknowledges the support of a fellowship from the Oak Ridge Institute for Science and Education, administered through an interagency agreement between the U.S. Department of Energy and the U.S. Food and Drug Administration. Dr. Weizhong Zhao would also like to thank the support of National Natural Science Foundation of China (No. 61105052, 61202398, 61272295). We are grateful to Ms. Beth Juliar and Mr. Mackean Maisha for critical reading of this manuscript. We specially thank Dr. Ke Yu for his valuable discussions and suggestions.

This article has been published as part of BMC Bioinformatics Volume 16 Supplement 13, 2015: Proceedings of the 12th Annual MCBIOS Conference. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/16/S13.

## References

• Deerwester S, Dumais ST, Furnas GW, Landauer TK, Harshman R. Indexing by Latent Semantic Analysis. J Am Soc Inform Sci. 1990;41(6):391–407. doi: 10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9.[Cross Ref]
• Hofmann T. Unsupervised learning by probabilistic latent semantic analysis. Machine Learning. 2001;42(1-2):177–196.
• Hofmann T. Probabilistic latent semantic indexing. Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval. 1999. pp. 50–57.
• Blei DM, Ng AY, Jordan MI. Latent Dirichlet Allocation. Journal of Machine Learning Research. 2003;3:993–1022.
• Griffiths TL, Steyvers M. Finding scientific topics. Proc Natl Acad Sci U S A. 2004;101(Suppl 1):5228–5235.[PMC free article][PubMed]
• Blei DM, Jordan MI. Modeling annotated data. Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval. 2003. pp. 127–134.
• Airoldi EM, Blei DM, Fienberg SE, Xing EP. Mixed Membership Stochastic Blockmodels. J Mach Learn Res. 2008;9:1981–2014.[PMC free article][PubMed]
• Rogers S, Girolami M, Campbell C, Breitling R. The latent process decomposition of cDNA microarray data sets. IEEE/ACM transactions on computational biology and bioinformatics. 2005;2(2):143–156. doi: 10.1109/TCBB.2005.29.[PubMed][Cross Ref]
• Shivashankar S, Srivathsan S, Ravindran B, Tendulkar AV. Multi-view methods for protein structure comparison using latent dirichlet allocation. Bioinformatics. 2011;27(13):i61–i68. doi: 10.1093/bioinformatics/btr249.[PMC free article][PubMed][Cross Ref]
• Zhao W, Zou W, Chen JJ. Topic modeling for cluster analysis of large biological and medical datasets. BMC Bioinformatics. 2014;15(Suppl 11):S11. doi: 10.1186/1471-2105-15-S11-S11.[PMC free article][PubMed][Cross Ref]
• Coelho LP, Peng T, Murphy RF. Quantifying the distribution of probes between subcellular locations using unsupervised pattern unmixing. Bioinformatics. 2010;26(12):i7–i12. doi: 10.1093/bioinformatics/btq220.[PMC free article][PubMed][Cross Ref]
• Grimont PA, Weill FX. Paris, France: WHO Collaborting Centre for Reference and Research on Salmonella. 9 2007. Antigenic formulae of the Salmonella serovars.
• Kuhn M, Campillos M, Letunic I, Jensen LJ, Bork P. A side effect resource to capture phenotypic effects of drugs. Mol Syst Biol. 2010;6:343.[PMC free article][PubMed]
• McCallun AK. MALLET: A Machine Learning for Language Toolkit. 2002. http://http://mallet.cs.umass.edu/ http://http://mallet.cs.umass.edu/
• Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004;32(5):1792–1797. doi: 10.1093/nar/gkh340.[PMC free article][PubMed][Cross Ref]
• Shannon CE. A Mathematical Theory of Communication. At&T Tech J. 1948;27(3):379–423.
• Halvey MJ, Keane MT. An Assessment of Tag Presentation Techniques. Proceedings of the 16th international conference on World Wide Web. 2007. pp. 1313–1314.
• Giraitis L, Leipus R, Surgailis D. The change-point problem for dependent observations. J Stat Plan Infer. 1996;53(3):297–310. doi: 10.1016/0378-3758(95)00148-4.[Cross Ref]

Articles from BMC Bioinformatics are provided here courtesy of BioMed Central

This content is from the fall 2016 version of this course. Please go here for the most recent version.

## Objectives

• Define topic modeling
• Explain Latent Dirichlet allocation and how this process works
• Demonstrate how to use LDA to recover topic structure from a known set of topics
• Demonstrate how to use LDA to recover topic structure from an unknown set of topics
• Identify methods for selecting the appropriate parameter for $$k$$

## Topic modeling

Typically when we search for information online, there are two primary methods:

1. Keywords - use a search engine and type in words that relate to whatever it is we want to find
2. Links - use the networked structure of the web to travel from page to page. Linked pages are likely to share similar or related content.

An alternative method would be to search and explore documents via themes. For instance, David Blei proposes searching through the complete history of the New York Times. Broad themes may relate to the individual sections in the paper (foreign policy, national affairs, sports) but there might be specific themes within or across these sections (Chinese foreign policy, the conflict in the Middle East, the U.S.’s relationship with Russia). If the documents are grouped by these themes, we could track the evolution of the NYT’s reporting on these issues over time, or examine how discussion of different themes intersects.

In order to do this, we would need detailed information on the theme of every article. Hand-coding this corpus would be exceedingly time-consuming, not to mention would requiring knowing the thematic structure of the documents before one even begins coding. For the vast majority of corpa, this is not a feasible approach.

Instead, we can use probabilistic topic models, statistical algorithms that analyze words in original text documents to uncover the thematic structure of the both the corpus and individual documents themselves. They do not require any hand coding or labeling of the documents prior to analysis - instead, the algorithms emerge from the analysis of the text.

## Latent Dirichlet allocation

LDA assumes that each document in a corpus contains a mix of topics that are found throughout the entire corpus. The topic structure is hidden - we can only observe the documents and words, not the topics themselves. Because the structure is hidden (also known as latent), this method seeks to infer the topic structure given the known words and documents.

## Food and animals

Suppose you have the following set of sentences:

1. I ate a banana and spinach smoothie for breakfast.
2. I like to eat broccoli and bananas.
3. Chinchillas and kittens are cute.
4. My sister adopted a kitten yesterday.
5. Look at this cute hamster munching on a piece of broccoli.

Latent Dirichlet allocation is a way of automatically discovering topics that these sentences contain. For example, given these sentences and asked for 2 topics, LDA might produce something like

• Sentences 1 and 2: 100% Topic A
• Sentences 3 and 4: 100% Topic B
• Sentence 5: 60% Topic A, 40% Topic B

• Topic A: 30% broccoli, 15% bananas, 10% breakfast, 10% munching, …
• Topic B: 20% chinchillas, 20% kittens, 20% cute, 15% hamster, …

You could infer that topic A is a topic about food, and topic B is a topic about cute animals. But LDA does not explicitly identify topics in this manner. All it can do is tell you the probability that specific words are associated with the topic.

## An LDA document structure

LDA represents documents as mixtures of topics that spit out words with certain probabilities. It assumes that documents are produced in the following fashion: when writing each document, you

• Decide on the number of words $$N$$ the document will have
• Choose a topic mixture for the document (according to a Dirichlet probability distribution over a fixed set of $$K$$ topics). For example, assuming that we have the two food and cute animal topics above, you might choose the document to consist of 1/3 food and 2/3 cute animals.
• Generate each word in the document by:
• First picking a topic (according to the distribution that you sampled above; for example, you might pick the food topic with 1/3 probability and the cute animals topic with 2/3 probability).
• Then using the topic to generate the word itself (according to the topic’s multinomial distribution). For instance, the food topic might output the word “broccoli” with 30% probability, “bananas” with 15% probability, and so on.

Assuming this generative model for a collection of documents, LDA then tries to backtrack from the documents to find a set of topics that are likely to have generated the collection.

### Food and animals

How could we have generated the sentences in the previous example? When generating a document $$D$$:

• Decide that $$D$$ will be 1/2 about food and 1/2 about cute animals.
• Pick 5 to be the number of words in $$D$$.
• Pick the first word to come from the food topic, which then gives you the word “broccoli”.
• Pick the second word to come from the cute animals topic, which gives you “panda”.
• Pick the third word to come from the cute animals topic, giving you “adorable”.
• Pick the fourth word to come from the food topic, giving you “cherries”.
• Pick the fifth word to come from the food topic, giving you “eating”.

So the document generated under the LDA model will be “broccoli panda adorable cherries eating” (remember that LDA uses a bag-of-words model).

## Learning topic structure through LDA

Now suppose you have a set of documents. You’ve chosen some fixed number of $$K$$ topics to discover, and want to use LDA to learn the topic representation of each document and the words associated to each topic. How do you do this? One way (known as collapsed Gibbs sampling) is the following:

• Go through each document, and randomly assign each word in the document to one of the $$K$$ topics
• Notice that this random assignment already gives you both topic representations of all the documents and word distributions of all the topics. But because it’s random, this is not a very accurate structure.
• To improve on them, for each document $$d$$:
• Go through each word $$w$$ in $$d$$
• And for each topic $$t$$, compute two things:
1. The proportion of words in document $$d$$ that are currently assigned to topic $$t$$ - $$p(t | d)$$
2. The proportion of assignments to topic $$t$$ over all documents that come from this word $$w$$ - $$p(w | t)$$
• Reassign $$w$$ a new topic, where you choose topic $$t$$ with probability $$p(t|d) \times p(w|t)$$ - this is the probability that topic $$t$$ generated word $$w$$
• In other words, in this step, we’re assuming that all topic assignments except for the current word in question are correct, and then updating the assignment of the current word using our model of how documents are generated.
• After repeating the previous step a large number of times (really large number of times, like a minimum of 10,000), you’ll eventually reach a roughly steady state where your assignments are pretty good
• You can use these assignments to estimate two things:
1. The topic mixtures of each document (by counting the proportion of words assigned to each topic within that document)
2. The words associated to each topic (by counting the proportion of words assigned to each topic overall)

## LDA with a known topic structure

LDA can be useful if the topic structure of a set of documents is known a priori. For instance, suppose you have four books:

• Great Expectations by Charles Dickens
• The War of the Worlds by H.G. Wells
• Twenty Thousand Leagues Under the Sea by Jules Verne
• Pride and Prejudice by Jane Austen

A vandal has broken into your home and torn the books into individual chapters, and left them in one large pile. We can use LDA and topic modeling to discover how the chapters relate to distinct topics (i.e. books).

We’ll retrieve these four books using the package:

As pre-processing, we divide these into chapters, use tidytext’s to separate them into words, then remove . We’re treating every chapter as a separate “document”, each with a name like or .

## Latent Dirichlet allocation with the package

Right now this data frame is in a tidy form, with one-term-per-document-per-row. However, the topicmodels package requires a (from the tm package). We can cast a one-token-per-row table into a with ’s :

Now we are ready to use the package to create a four topic LDA model.

• In this case we know there are four topics because there are four books; this is the value of knowing the latent topic structure
• sets the starting point for the random iteration process. If we don’t set a consistent seed, each time we run the script we may estimate slightly different models

Now gives us the option of returning to a tidy analysis, using the and verbs borrowed from the package. In particular, we start with the verb.

Notice that this has turned the model into a one-topic-per-term-per-row format. For each combination the model has beta ($$\beta$$), the probability of that term being generated from that topic.

We could use ’s to find the top 5 terms within each topic:

This model lends itself to a visualization:

• These topics are pretty clearly associated with the four books
• “nemo”, “sea”, and “nautilus” belongs to Twenty Thousand Leagues Under the Sea
• “jane”, “darcy”, and “elizabeth” belongs to Pride and Prejudice
• “pip” and “joe” from Great Expectations
• “martians”, “black”, and “night” from The War of the Worlds
• Also note that does not assign any label to each topic. They are simply topics 1, 2, 3, and 4. We can infer these are associated with each book, but it is merely our inference.

## Per-document classification

Each chapter was a “document” in this analysis. Thus, we may want to know which topics are associated with each document. Can we put the chapters back together in the correct books?

Setting returns a tidied version with one-document-per-topic-per-row. Now that we have these document classifiations, we can see how well our unsupervised learning did at distinguishing the four books. First we re-separate the document name into title and chapter:

Then we examine what fraction of chapters we got right for each:

We notice that almost all of the chapters from Pride and Prejudice, War of the Worlds, and Twenty Thousand Leagues Under the Sea were uniquely identified as a single topic each.

We can determine this by finding the consensus book for each, which we note is correct based on our earlier visualization:

Then we see which chapters were misidentified:

 Great Expectations Great Expectations 57 Great Expectations Pride and Prejudice 1 Great Expectations The War of the Worlds 1 Pride and Prejudice Pride and Prejudice 61 The War of the Worlds The War of the Worlds 27 Twenty Thousand Leagues under the Sea Twenty Thousand Leagues under the Sea 46

We see that only a few chapters from Great Expectations were misclassified.

## By word assignments:

One important step in the topic modeling expectation-maximization algorithm is assigning each word in each document to a topic. The more words in a document are assigned to that topic, generally, the more weight () will go on that document-topic classification.

We may want to take the original document-word pairs and find which words in each document were assigned to which topic. This is the job of the verb.

We can combine this with the consensus book titles to find which words were incorrectly classified.

We can, for example, create a “confusion matrix” using dplyr’s and tidyr’s :

 Great Expectations 49770 3876 1845 77 Pride and Prejudice 1 37229 7 5 The War of the Worlds 0 0 22561 7 Twenty Thousand Leagues under the Sea 0 5 0 39629

We notice that almost all the words for Pride and Prejudice, Twenty Thousand Leagues Under the Sea, and War of the Worlds were correctly assigned, while Great Expectations had a fair amount of misassignment.

What were the most commonly mistaken words?

Notice the word “flopson” here; these wrong words do not necessarily appear in the novels they were misassigned to. Indeed, we can confirm “flopson” appears only in Great Expectations:

The algorithm is stochastic and iterative, and it can accidentally land on a topic that spans multiple books.

## LDA with an unknown topic structure

Frequently when using LDA, you don’t actually know the underlying topic structure of the documents. Generally that is why you are using LDA to analyze the text in the first place. LDA is still useful in these instances, but we have to perform additional tests and analysis to confirm that the topic structure uncovered by LDA is a good structure.

## Associated Press articles

The package includes a document-term matrix of a sample of articles published by the Associated Press in 1992. Let’s load them into R and convert them to a tidy format.

is originally in a document-term matrix, exactly what we need for topic modeling. Why tidy it first? Because the original dtm contains stop words - we want to remove them before modeling the data. Let’s remove the stop words, then cast the data back into a document-term matrix.

## Selecting $$k$$

Remember that for LDA, you need to specify in advance the number of topics in the underlying topic structure.

### $$k=4$$

Let’s estimate an LDA model for the Associated Press articles, setting $$k=4$$.

What do the top terms for each of these topics look like?

Fair enough. The four topics generally look to describe:

1. American-Soviet relations
2. Crime and education
3. American (domestic) government
4. It’s the economy, stupid

### $$k=12$$

What happens if we set $$k=12$$? How do our results change?

Hmm. Well, these topics appear to be more specific, yet not as easily decodeable.

1. Iraq War (I)
2. Bush’s reelection campaign
3. Federal courts
4. Apartheid and South Africa
5. Crime
6. Economy
7. ???
8. Soviet Union
9. Environment
10. Stock market
11. Wildfires?
12. Bush-Congress relations (maybe domestic policy?)

Alas, this is the problem with LDA. Several different values for $$k$$ may be plausible, but by increasing $$k$$ we sacrifice clarity. Is there any statistical measure which will help us determine the optimal number of topics?

## Perplexity

Well, sort of. Some aspects of LDA are driven by gut-thinking (or perhaps truthiness). However we can have some help. Perplexity is a statistical measure of how well a probability model predicts a sample. As applied to LDA, for a given value of $$k$$, you estimate the LDA model. Then given the theoretical word distributions represented by the topics, compare that to the actual topic mixtures, or distribution of words in your documents.

includes a function which calculates this value for a given model.

However, the statistic is somewhat meaningless on its own. The benefit of this statistic comes in comparing perplexity across different models with varying $$k$$s. The model with the lowest perplexity is generally considered the “best”.

Let’s estimate a series of LDA models on the Associated Press dataset. Here I make use of and the functions to iteratively generate a series of LDA models for the AP corpus, using a different number of topics in each model.1

It looks like the 100-topic model has the lowest perplexity score. What kind of topics does this generate? Let’s look just at the first 12 topics produced by the model ( has difficulty rendering a graph for 100 separate facets):

We are getting even more specific topics now. The question becomes how would we present these results and use them in an informative way? Not to mention perplexity was still dropping at $$k=100$$ - would $$k=200$$ generate an even lower perplexity score?2

Again, this is where your intuition and domain knowledge as a researcher is important. You can use perplexity as one data point in your decision process, but a lot of the time it helps to simply look at the topics themselves and the highest probability words associated with each one to determine if the structure makes sense. If you have a known topic structure you can compare it to (such as the books example above), this can also be useful.