Hidden Topic Markov Models
Hidden Topic Markov Models
Amit Gruber
Michal Rosen-Zvi
Yair Weiss
School of CS and Eng.
IBM Research Laboratory in Haifa
School of CS and Eng.
The Hebrew University
Haifa University, Mount Carmel
The Hebrew University
Jerusalem 91904 Israel
Haifa 31905 Israel
Jerusalem 91904 Israel
amitg@cs.huji.ac.il
rosen@il.ibm.com
yweiss@cs.huji.ac.il
Abstract
represented by a co-occurrence matrix of words and
documents. The probabilistic latent semantic analy-
Algorithms such as Latent Dirichlet Alloca-
sis (PLSA) model [8] is such a model. It employs two
tion (LDA) have achieved significant progress
parameters inferred from the observed words: (a) A
in modeling word document relationships.
global parameter that ties the documents in the cor-
These algorithms assume each word in the
pora, is fixed for the corpora and stands for the prob-
document was generated by a hidden topic
ability of words given topics. (b) A set of parameters
and explicitly model the word distribution of
for each of the documents that stand for the proba-
each topic as well as the prior distribution
bility of topics in a document. The Latent Dirichlet
over topics in the document. Given these pa-
Allocation (LDA) model [3] introduces a more consis-
rameters, the topics of all words in the same
tent probabilistic approach as it ties the parameters
document are assumed to be independent.
of all documents via a hierarchical generative model.
The mixture of topics per document in the LDA model
In this paper, we propose modeling the top-
is generated from a Dirichlet prior mutual to all doc-
ics of words in the document as a Markov
uments in the corpus.
chain. Specifically, we assume that all words
in the same sentence have the same topic, and
These models are not only computationally efficient,
successive sentences are more likely to have
but also seem to capture correlations between words
the same topics. Since the topics are hid-
via the topics. Yet, the assumption that the order of
den, this leads to using the well-known tools
words can be ignored is an unrealistic oversimplifica-
of Hidden Markov Models for learning and
tion. Relaxing this assumption is expected to yield
inference. We show that incorporating this
better models in terms of the latent aspects inferred,
dependency allows us to learn better topics
their performance in word sense disambiguation task
and to disambiguate words that can belong
and the ability of the model to predict previously un-
to different topics. Quantitatively, we show
observed words in trained documents.
that we obtain better perplexity in model-
Markov models such as N-grams and HMMs that cap-
ing documents with only a modest increase
ture local dependencies between words have been em-
in learning and inference complexity.
ployed mainly in part-of-speech tagging [5]. Models
for semantic parsing tasks often use a “shallow” model
with no hidden states [9]. In recent years several prob-
1
Introduction
abilistic models for text that infer topics and incorpo-
rate Markovian relations have been studied. In [7] a
Extensively large text corpora are becoming abundant
model that integrates topics and syntax is introduced.
due to the fast growing storage and processing capa-
It contains a latent variable per each word that stands
bilities of modern computers. This has lead to a resur-
for syntactic classes. The model posits that words are
gence of interest in automated extraction of useful in-
either generated from topics that are randomly drawn
formation from text. Modeling the observed text as
from the topic mixture of the document or from the
generated from latent aspects or topics is a prominent
syntactic classes that are drawn from the previous syn-
approach in machine learning studies of texts (e.g.,
tactic class. Only the latent variables of the syntac-
[8, 14, 3, 4, 6]). In such models, the ”bag-of-words”
tic classes are treated as a sequence with local depen-
assumption is often employed, an assumption that the
dencies while latent assignments of topics are treated
order of words can be ignored and text corpora can be
similar to the LDA model, namely topics extraction
eterized by η.
is not benefited from the additional information con-
In order to make the independence assumptions in
veyed in the structure of words. During the last couple
LDA more explicit, figure 1b shows an alternative
of years, a few models were introduced in which con-
graphical model for the same generative process. Here,
secutive words are modeled by Markovian relations.
we have explicitly denoted the hidden topics z
These are the Bigram topic model [15], the LDA collo-
i and
the observed words w
cation model and the Topical n-grams model [16]. All
i as separate nodes in the graphs
(rather than summarizing them with a plate). From
these models assume that words generation in texts
figure 1b it is evident that conditioned on θ and β, the
depend on a latent topic assignment as well as on the
hidden topics are independent.
n-previous words in the text. This added complexity
seem to provide the models with more predictive power
The Hidden Topic Markov Model drops this indepen-
compared to the LDA model and to capture relations
dence assumption. As shown in figure 1c, the topics
between consecutive words. We follow the same lines,
in a document form a Markov chain with a transition
while we allow Markovian relations between the hid-
probability that depends on θ and a topic transition
den aspects. A somewhat related model is the aspect
variable ψn. When ψn = 1, a new topic is drawn from
HMM model [2], though it models unstructured data
θ. When ψn = 0 the topic of the nth word is identical
that contains stream of words. The model contains
to the previous one. We assume that topic transitions
latent topics that have Markovian relations. In the as-
can only occur between sentences, so that ψn may only
pect HMM model, documents or segments are inferred
be nonzero for the first word in a sentence.
using heuristics that assume that each segment is gen-
Formally the model can be described as:
erated from a unique topic assignment and thus there
is no notion of mixture of topics.
1. for z=1...K,
We strive to extract latent aspects from documents by
Draw βz ∼ Dirichlet(η)
making use of the information conveyed in the division
into documents as well as the particular order of words
2. for d=1...D,
in each document. Following these lines we propose in
Document d is generated as follows:
this paper a novel and consistent probabilistic model
(a) Draw θ ∼ Dirichlet(α)
we call the Hidden Topic Markov Model (HTMM). Our
model is similar to the LDA model in tying together
(b) Set ψ1 = 1
parameters of different documents via a hierarchical
(c) for n=2 . . . Nd
generative model, but unlike the LDA model it does
i. If (begin sentence) draw ψn ∼ Binom(ǫ)
not assume documents are “bags of words”. Rather
else ψn = 0
it assumes that the topics of words in a document
(d) for n=1. . . Nd
form a Markov chain, and that subsequent words are
i. if ψ
more likely to have the same topic. We show that the
n == 0 then zn = zn−1
else z
HTMM model outperforms the LDA model in its pre-
n ∼ multinomial(θ)
dictive performance and can be used for text parsing
ii. Draw wn ∼ multinomial(βz )
n
and word sense disambiguation purposes.
where K is the number of latent topics and Nd is the
length of document d. Note that if we force a topic
2
Hidden Topic Markov Model for
transition between any two words (i.e. set ψn = 1 for
Text Documents
all n) we obtain the LDA model [3]. At the other
extreme, if we do not allow any topic transitions and
set ψ
We start by reviewing the formalism of LDA. Figure 1a
n = 0 between any two words, we obtain the
mixture of unigrams model in which all words in the
shows the graphical model corresponding to the LDA
document are assumed to have the same topic [11].
generative model. To generate a new word w in a doc-
ument, one starts by first sampling a hidden topic z
Unlike the LDA and mixture of unigrams models, the
from a multinomial distribution defined by a vector
HTMM model (which allows infrequent topic transi-
θ corresponding to that document. Given the topic z,
tions within a document) is no longer invariant to a
the distribution over words is multinomial with param-
reshuffling of the words. Documents for which succes-
eters βz. The LDA model ties parameters between dif-
sive words have the same topics are more likely than
ferent documents by drawing θ of all documents from
a random permutation of the same words. The input
a common Dirichlet prior parameterized by α. It also
to the algorithm is the entire document, rather than a
ties parameters between topics by drawing the vector
document-word co-occurrence matrix. This obviously
βz of all topics from a common Dirichlet prior param-
increases the storage requirement for each document,
a
b
c
Figure 1: a. The LDA model. b. The LDA model. The repeated word generation within the document is
explicitly drawn rather than using the plate notation. c. The HTMM model proposed in this paper. The hidden
topics form a Markov chain. The order of words and their proximity plays an important role in the model.
but it allows us to perform inferences that are simply
Pr(zn, ψn|d, w1 . . . wN ; θ, β, ǫ)
is
computed
for
d
impossible in the bag of words models. For example,
each sentence in the document.
We compute this
the topic assignments calculated during HTMM infer-
probability using the forward-backward algorithm
ence (either hard assignments or soft ones) will not
for HMM. The transition matrix is specific per
give the same topic to all appearances of the same
document and depends on the parameters θd and ǫ.
word within the same document. This can be useful
The parameter βz,w induces local probabilities on the
for word sense disambiguation in applications such as
sentences. After computing Pr(zn, ψn|d, w1 . . . wN )
d
automatic translation of texts. Furthermore, the topic
we compute expectations required for the M-step.
assignment will tend to be linearly coherent - subsec-
Specifically, we compute (1) the expected number of
tions of the text will tend to be assigned to the same
topic transitions that ended in the topic z and (2)
topic, as opposed to “bag-of-words” assignments which
The expected number of co-occurrence of a word w
may fluctuate between topics in successive words.
with a topic z.
Let Cd,z denote the number of times that topic z was
3
Approximate Inference
drawn according to θd in document d. Let Cz,w denote
the number of times that word w was drawn from topic
z according to β
Due to the parameter-tying in LDA type models, cal-
z,w .
culating exact posterior probabilities over the parame-
ters β, θ is intractable. In recent years, several alterna-
tives for approximate inference have been suggested:
N
EM [8] or variational EM [3], Expectation propaga-
E(Cd,z)=
Pr(zd,n=z, ψd,n=1|w1 . . . wN )
(1)
d
tion (EP) [10] and Monte-Carlo sampling [13, 7]. In
n=1
D
N
this paper, we take advantage of the fact that con-
E(Cz,w)=
Pr(z
)(2)
ditioned on β and θ the HTMM model is a spe-
d,n=z, wd,n=w|w1 . . . wNd
d=1 n=1
cial type of HMM. This allows us to use the stan-
dard parameter estimation tools of HMMs, namely
Expectation-Maximization and the forward-backward
M-step:
algorithm [12].
The MAP estimators for θ and β are found under
Unlike fully Bayesian inference methods, the standard
the constraint that θd and βz are probability vec-
EM algorithm for HMMs distinguishes between la-
tors. Standard computation using Lagrange multipli-
tent variables and parameters. Applied to the HTMM
ers yields:
model, the latent variables are the topics zn and the
variables ψ
θd,z
∝ E(Cd,z) + α − 1
(3)
n that determine whether the topic n will
be identical to topic n − 1 or will it be drawn accord-
Nd
Pr(ψ
)
ǫ
=
d
n=2
d,n = 1|w1 . . . wNd
(4)
ing to θd. The parameters of the problem are θd , β
(N sen − 1)
d
d
and ǫ. We assume the hyper parameters α and η to
be known.
where θd,z is normalized such that θd is a distribution
and N sen is the number of sentences in the document
E-step:
d
d. After computing θ and ǫ the transition matrix is
In
the
E-step
the
probability
updated accordingly.
Similarly, the probabilities βz,w are set to:
2800
HTMM
VHTMM1
βz,w
∝
E(Cz,w) + η − 1
(5)
2600
LDA
Again, βz,w is normalized such that βz forms a distri-
2400
bution.
EM algorithms have been discussed for word document
Perplexity 2200
models in several recent papers. Hofmann used EM to
estimate the maximum likelihood parameters in the
2000
pLSA model [8]. Here, in contrast, we use EM to esti-
mate the MAP parameters in a hierarchical generative
1800 0
2
4
8
16
32
64 128 256 512
model similar to LDA (note that our M step explicitly
Observed words
takes into account the Dirichlet priors on β, θ). Grif-
fiths et al. [6] also discuss using EM in their HMM
Figure 2: Comparison between the average perplexity
model of topics and syntax and say that Gibbs sam-
of the NIPS corpus of HTMM, LDA and VHTMM1.
pling is preferable since EM can suffer from local min-
HTMM outperforms the bag of words (LDA) and
ima. We should note that in our experiments we found
VHTMM1 models.
the final solution calculated by EM to be stable with
respect to multiple initializations. Again, this may be
sentences are drawn from θ independently. In all the
due to our use of a hierarchical generative model which
experiments the same values of α, η were provided to
reduces somewhat the degrees of freedom available to
all algorithms.
EM.
The perplexity of an unseen test document after ob-
Our EM algorithm assumes the hyper parameters α, η
serving the first N words of the document is defined
are fixed. We set these parameters according to the
to be:
values used in previous papers [13] (α = 1 + 50/K and
log Pr(wN+1 . . . wN
|w
test
1 . . . wN )
η = 1.01).
Perplexity = exp(−
)
Ntest − N
(6)
4
Experiments
where Ntest is the length of the test document. The
perplexity reflects the difficulty of predicting a new
We first demonstrate our results on the NIPS dataset
unseen document after learning from a train set. The
1. The NIPS dataset consists of 1740 documents. The
lower the perplexity, the better the model.
train set consists of 1557 documents and the test set
The perplexity for the HTMM model is computed
consists of the remaining 183. The vocabulary con-
as follows:
First θnew is found from the first N
tains 12113 words. From the raw data we extracted
words of the new document.
θnew is found us-
(only) the words that appear in the vocabulary, pre-
ing EM holding β
=
βtrain and ǫ
=
ǫtrain
serving their order. Stop words do not appear in the
fixed.
Second, Pr(zN+1|w1 . . . wN ) is computed
vocabulary, hence they were discarded from the input.
by inference on the latent variables of the first
We divided the text to sentences according to the de-
N words using the forward backward algorithm
limiters .?!;. This simple preprocessing is very crude
for HMM and then inducing Pr(zN |w1 . . . wN ) on
and noisy since abbreviations such as “Fig. 1” will ter-
Pr(zN+1|w1 . . . wN ) (using the transition matrix).
minate the sentence before the period and will start a
Third, log Pr(wN+1 . . . wN
|w
test
1 . . . wN ) is computed
new one after it. The only exception is that we omitted
using forward backward with θnew, βtrain, ǫtrain and
appearances of “e.g.” and “i.e.” in our preprocessing
Pr(zN+1|w1 . . . wN ) that were previously found.
2.
On this dataset we compared perplexity with HTMM
and LDA with 100 topics. We also considered a vari-
Figure 2 shows the perplexity of HTMM, LDA and
ant of HTMM (which we call VHTMM1) that sets the
VHTMM1. with K = 100 topics as a function of
topic transition probability to be 1 between sentences.
the number of observed words at the beginning of the
This is a “bag of sentence” model (where the order of
test document, N . We see that for a small number of
words within a sentence is ignored): all words in the
words, both HTMM and VHTMM1 have significantly
same sentence share the same topic; topics of different
lower perplexity than LDA. This is due to combining
the information from subsequent words regarding the
1http://www.cs.toronto.edu/∼roweis/data.html
topic of the sentence.
2The
data
we
used
is
available
at:
http://www.cs.huji.ac.il/∼amitg/htmm.html
When the number of observed words increases, HTMM
Abstract We give necessary and sufficient conditions for uniqueness of the support vector solution for the
problems of pattern recognition and regression estimation, for a general class of cost functions. We show that
if the solution is not unique, all support vectors are necessarily at bound, and we give some simple examples
of non-unique solu- tions. We note that uniqueness of the primal (dual) solution does not necessarily imply
uniqueness of the dual (primal) solution. We show how to compute the threshold b when the solution is unique,
but when all support vectors are at bound, in which case the usual method for determining b does not work.
recognition and regression estimation algorithms [12], with arbitrary convex costs, the value of the normal w will
always be unique. Acknowledgments C. Burges wishes to thank W. Keasler, V. Lawrence and C. Nohl of Lucent
Tech- nologies for their support . References [1] R. Fletcher. Practical Methods of Optimization. John Wiley
and Sons, Inc., 2 nd edition, 1987.
Figure 3: An example of segmenting a document according to its semantic and of word sense disambiguation.
Top Beginning of an abstract of a NIPS paper about support vector machines. The entire section is assigned
latent topic 24 that correspond to mathematical terms (see figure 4). The word “support” appears in this section
3 times as a part of the mathematical term “support vector” and is assigned the mathematical topic.
Bottom A section from the end of the same NIPS paper. The end of the discussion was assigned the same
mathematical topic (24) as the abstract, the acknowledgments section was assigned topic 15 and the references
section was assigned topic 9. Here, in the acknowledgements section, the word “support” refers to the help
received. Indeed it is assigned latent topic 15 like the entire acknowledgments section.
Topic #9
Topic #15
Topic #24
Topic #30
springer
0.03741
research
0.04595
function
0.02476
matrix
0.0347
york
0.03626
supported
0.03813
functions
0.0169
function
0.01282
verlag
0.02753
grant
0.03214
set
0.0111
equation
0.01197
wiley
0.02151
work
0.02919
linear
0.01031
eq
0.01186
theory
0.01854
acknowledgements
0.02036
support
0.009224
gradient
0.008769
analysis
0.01124
foundation
0.01732
problem
0.009067
diagonal
0.008462
statistics
0.01109
science
0.01544
space
0.008977
vector
0.007473
berlin
0.009896
national
0.01279
vector
0.00884
form
0.007055
john
0.009662
support
0.01271
case
0.007578
eigenvalues
0.005681
sons
0.008996
acknowledgments
0.01264
theorem
0.007385
algorithm
0.00551
press
0.008952
office
0.01199
solution
0.006966
ai
0.005413
eds
0.008708
discussions
0.01197
kernel
0.006489
learning
0.005249
organization
0.008542
part
0.01188
approximation
0.005983
positive
0.005247
data
0.008175
nsf
0.009757
convex
0.005292
elements
0.005147
vol
0.008063
acknowledgement
0.009486
regression
0.005185
matrices
0.004917
vapnik
0.007849
center
0.008025
ai
0.004909
terms
0.004876
statistical
0.007652
naval
0.007523
class
0.004766
space
0.004797
estimation
0.006906
contract
0.0067
algorithm
0.004686
order
0.0046
learning
0.006597
institute
0.00584
error
0.004642
covariance
0.004551
neural
0.006126
authors
0.005732
defined
0.004062
defined
0.004516
Topic #18
Topic #84
Topic #26
Topic #46
problem
0.07344
classification
0.06586
theorem
0.03941
vector
0.2523
solution
0.05702
classifier
0.05646
set
0.02531
vectors
0.141
optimization
0.04136
class
0.05401
proof
0.02116
dimensional
0.01993
solutions
0.02605
classifiers
0.03968
result
0.01805
number
0.01854
constraints
0.02465
support
0.02245
section
0.01518
distance
0.01715
problems
0.02122
classes
0.0215
case
0.01451
quantization
0.01589
constraint
0.02026
decision
0.02121
defined
0.0139
length
0.01463
set
0.02006
margin
0.01798
results
0.01286
class
0.01351
equation
0.01766
feature
0.01777
finite
0.01274
reference
0.01046
function
0.01746
kernel
0.01715
show
0.01245
lvq
0.009336
method
0.01478
svm
0.01251
bounded
0.01175
section
0.009204
optimal
0.0143
set
0.01218
convergence
0.01152
research
0.009005
find
0.01359
training
0.01201
condition
0.01088
function
0.008343
equations
0.01219
machines
0.01007
class
0.01036
distributed
0.008078
solve
0.01211
boosting
0.01007
define
0.01022
desired
0.007681
linear
0.01183
adaboost
0.009486
denote
0.009954
chosen
0.007284
solving
0.01115
algorithm
0.009237
assume
0.009113
shown
0.007218
quadratic
0.01099
data
0.00874
exists
0.008967
classical
0.007218
minimum
0.01079
examples
0.007622
positive
0.008677
euclidean
0.007151
solved
0.01007
machine
0.007332
analysis
0.008155
normal
0.00682
Figure 4: Twenty most probable words for 4 topics out of 100 found for each model. Top HTMM topics.
Bottom LDA topics. Aside each word is its probability to be generated from the corresponding topic (βz,w).
Abstract We give necessary and sufficient conditions for uniqueness of the support vector solution for the
problems of pattern recognition and regression estimation, for a general class of cost functions. We show that
if the solution is not unique, all support vectors are necessarily at bound, and we give some simple examples
of non-unique solu- tions. We note that uniqueness of the primal (dual) solution does not necessarily imply
uniqueness of the dual (primal) solution. We show how to compute the threshold b when the solution is unique,
but when all support vectors are at bound, in which case the usual method for determining b does not work.
recognition and regression estimation algorithms [12], with arbitrary convex costs, the value of the normal w will
always be unique. Acknowledgments C. Burges wishes to thank W. Keasler, V. Lawrence and C. Nohl of Lucent
Tech- nologies for their support . References [1] R. Fletcher. Practical Methods of Optimization. John Wiley
and Sons, Inc., 2 nd edition, 1987.
Figure 5: Topics assigned by LDA to the words in the same sections as in figure 3. For demonstration, the four
most frequent topics in these sections are displayed in colors: topic 18 in green, topic 84 in blue, topic 46 in
yellow and topic 26 in red. In contrast to figure 3, LDA assigned all the 4 instances of the word “support” the
same topic (84) due to its inability to distinguish between different instances of the same word according to the
context.
and LDA have lower perplexity than VHTMM1.
in the acknowledgments section is assigned the ac-
HTMM is consistently better than LDA but the dif-
knowledgments topic 15. LDA, on the other hand as-
ference in perplexity is significant only for N ≤ 64 ob-
signs all the 4 instances the same topic 83 (see figure
served words (the average length of a document after
4.
our preprocessing was about 1300 words).
Figure 4 shows 4 topics learnt by HTMM and 4 topics
Figure 3 shows the topical segmentation calculated by
learnt by LDA. In each topic the 20 top words (most
HTMM. On top the beginning of the paper is shown.
probable words) are shown. Each word appears with
A single topic is assigned to the entire section. This is
its probability to be generated from the corresponding
topic 24 that corresponds to mathematical terms (re-
topic (βz,w).
lated to support vector machines which is the subject
The 4 topics of each model were chosen for illustra-
of the paper) and is shown in figure 4. On the bot-
tion out of 100 topics because they appear in the
tom we see a section taken from the end of the paper,
text sections in figures 3 and 5. The complete list-
consisting of the end of the discussion, acknowledge-
ing of top words of all the 100 topics found by HTMM
ments and the beginning of the references. The end
and LDA is available at: http://www.cs.huji.ac.il/∼
of the discussion is assigned topic 24 (the same as the
amitg/htmm.html
abstract), as it addresses mathematical issues. Then
the acknowledgments section is assigned topic 15 and
The 4 topics shown in figure 4 for both models are easy
the references section is assigned topic 9.
for interpretation according to the top words (among
the 100 topics some topics are “as nice” and some are
Figure 5 shows the topics assigned to the words in the
not). However, we see that the topics found by the two
same section by LDA. We see that LDA assigns differ-
models are different in nature. HTMM finds topics
ent topics to different words within the same sentence
that consist of words that are consecutive in the doc-
and therefore it is not suitable for topical segmenta-
ument. For example, topic 15 in figure 4 corresponds
tion.
to acknowledgements. Figure 3 shows that indeed the
The HTMM model distinguishes between different in-
entire acknowledgements section (and nothing but it)
stances of the same word according to the context.
was assigned to topic 15. On the other hand no such
Thus we can use it to disambiguate words that have
topic exists among the 100 topics found by LDA. An-
several meanings according to the context. The word
other example for a topic that consists of consecutive
“support” appears several times in the document in
words is topic 9 in figure 4 (that corresponds to ref-
figures 3 and 5. On top, it appears 3 times in the
erences as it consists of names of journals, publishers,
abstract, always as a part of the mathematical term
authors etc.). The value learnt for ǫ by HTMM and
“support vector”. On bottom, it appears once in the
reflects the topical coherence of the NIPS documents
acknowledgments section in the sense of help provided.
was ǫ = 0.376278.
HTMM assigns the 3 instances of the word in the ab-
Figure 7 shows the value of ǫ learnt by HTMM as a
stract the same mathematical topic 24. The instance
function of the number of topics. It shows how the
2800
was 200 words and there were 5 latent topics. The first
HTMM
VHTMM1
data set was generated using HTMM with ǫ = 0.1 and
2600
LDA
the second dataset was created using LDA.
In table 1 we compare LDA and HTMM in terms of
2400
perplexity using the first 100 words (the average num-
ber of words in a document in this dataset is 600),
Perplexity 2200
in terms of estimation error on the parameters β (i.e.
how well the true topics are learnt) and in terms of
2000
erroneous assignments of latent topics to words.
1800 0
20
40
60
80
100
As a sanity check, we find that indeed HTMM recov-
Number of topics
ers the correct parameters and correct ǫ for data gen-
erated by a HTMM. We also see that when the data
Figure 6: Perplexity of the different models as a func-
indeed deviates from the “bag of words” assumption
tion of the number of topics for N = 10 observed
- HTMM significantly outperforms LDA in all three
words.
criteria. However, when the data is generated with
the “bag of words” assumption - the HTMM model
0.5
no longer has better perplexity. Thus the fact that we
obtained better perplexity on the NIPS dataset with
0.4
HTMM may reflect the poor fit of the “bag of words”
assumption to the NIPS documents.
0.3
5
Discussion
Learnt epsilon 0.2
Statistical models for documents have been the focus
of much recent work in machine learning. In this pa-
0.10
20
40
60
80
100
per we have presented a new model for text documents.
Number of topics
We extend the LDA model by considering the under-
lying structure of a document. Rather than assuming
Figure 7: MAP values for ǫ as a function of the number
that the topic distribution within a document is con-
of topics for the NIPS train set. When only few topics
ditionally independent, we explicitly model the topic
are available, learnt topics are general and topic transi-
dynamics with a Markov chain. This leads to a HMM
tions are infrequent. When many topics are available
model of topics and words for which efficient learn-
topics become specific and topic transitions become
ing and inference algorithms exist. We find that this
more frequent.
Markovian structure allows us to learn more coher-
ent topics and to disambiguate the topics of ambigu-
ous words. Quantitatively we show that this leads to
dependency between topics changes with the granu-
a significant improvement in predicting unseen words
larity of topics. When the dataset is learnt with a few
(perplexity).
topics, these topics are general and do not change of-
ten. As more topics are available, the topics become
The incorporation of HMM into the LDA model is or-
more specific and topic transitions are more frequent.
thogonal to other extensions of LDA such as integrat-
This suggests a correlation between the different top-
ing syntax [7] and author [13] information. It would
ics ([1]). Having learnt always values smaller than 1
be interesting to combine these different extensions to-
means that the likelihood of data is higher when con-
gether to form a better document analysis system.
secutive latent topics are dependent.
Acknowledgements
The picture of perplexity we see in figure 2 may sug-
gest that the Markovian structure exists in real docu-
ments. We would like to eliminate the option that the
Support from the Israeli Science Foundation is great-
perplexity of HTMM might be lower than the perplex-
fully acknowledged.
ity of LDA only because it has less degrees of freedom
(due to the dependencies between latent topics). We
References
created two toy datasets with 1000 documents each,
from which 900 formed the train set and the other 100
[1] David M. Blei and John Lafferty. Correlated topic
formed the test set for each case. The vocabulary size
models. In Lawrence K. Saul, Yair Weiss, and
Table 1:
Comparison between LDA and HTMM. The models are compared using toy data generated
from the different models where there is ground truth to compare with. Comparison is made in terms of
errors in topic assignment to words, l1 norm between true and learnt probabilities and perplexity given 100 words.
Data Generator
Performance criterion
HTMM
LDA
HTMM ǫ = 0.1
Percent mistake
0
37.11%
HTMM ǫ = 0.1
βtrue − βrecovered 1
0.15
1.74
HTMM ǫ = 0.1
Perplexity
134.00
174.39
LDA
Percent mistake
73.26%
68.29%
LDA
βtrue − βrecovered 1
3.35
3.05
LDA
Perplexity
184.87
184.64
L´eon Bottou, editors, Advances in Neural Infor-
[9] David J. C. MacKay and Linda C. Peto. A hier-
mation Processing Systems 17, Cambridge, MA,
archical Dirichlet language model. Natural Lan-
2005. MIT Press.
guage Engineering, 1(3):1–19, 1994.
[2] David M. Blei and Pedro J. Moreno. Topic seg-
[10] Thomas Minka and John Lafferty. Expectation-
mentation with an aspect hidden markov model.
propagation for the generative aspect model. In
In Research and Development in Information Re-
Proceedings of the 18th Conference on Uncer-
trieval, pages 343–348, 2001.
tainty in Artificial Intelligence, pages 352–359,
San Francisco, CA, 2002. Morgan Kaufmann Pub-
[3] David M. Blei, Andrew Y. Ng, and Michael I.
lishers.
Jordan. Latent Dirichlet allocation. Journal of
Machine Learning Research, 3:993–1022, 2003.
[11] Kamal Nigam, Andrew Kachites McCallum, Se-
bastian Thrun, and Tom Mitchell. Text classifica-
[4] Wray L. Buntine and Aleks Jakulin. Applying dis-
tion from labeled and unlabeled documents using
crete PCA in data analysis. In Max Chickering
em. Mach. Learn., 39(2-3):103–134, 2000.
and Joseph Halpern, editors, Proceedings of the
20th Conference on Uncertainty in Artificial In-
[12] L.R. Rabiner. A tutorial on hidden Markov mod-
telligence, pages 59–66, San Francisco, CA, 2004.
els and selected applications in speech recogni-
Morgan Kaufmann Publishers.
tion. Proc. IEEE, 77(2):257–286, 1989.
[5] Kenneth Ward Church. A stochastic parts pro-
[13] Michal Rosen-Zvi, Tom Griffith, Mark Steyvers,
gram and noun phrase parser for unrestricted
and Padhraic Smyth. The author-topic model for
text. In Proceedings of the second conference on
authors and documents. In Max Chickering and
Applied natural language processing, pages 136–
Joseph Halpern, editors, Proceedings 20th Con-
143, Morristown, NJ, USA, 1988. Association for
ference on Uncertainty in Artificial Intelligence,
Computational Linguistics.
pages 487–494, San Francisco, CA, 2004. Morgam
Kaufmann.
[6] Thomas L. Griffiths and Mark Steyvers. Find-
[14] N. Slonim and N. Tishby. The power of word clus-
ing scientific topics.
Proceedings of the Na-
ters for text classification. In 23rd European Col-
tional Academy of Sciences of the United States
loquium on Information Retrieval Research, 2001.
of America, 101:5228–5235, 2004.
[15] Hanna M. Wallach. Topic modeling: Beyond bag-
[7] Thomas L. Griffiths, Mark Steyvers, David M.
of-words. In ICML ’06: 23rd International Con-
Blei, and Joshua B. Tenenbaum. Integrating top-
ference on Machine Learning, Pittsburgh, Penn-
ics and syntax. In Lawrence K. Saul, Yair Weiss,
sylvania, USA, June 2006.
and L´eon Bottou, editors, Advances in Neural In-
formation Processing Systems 17, pages 537–544.
[16] Xuerui Wang and Andrew McCallum. A note on
MIT Press, Cambridge, MA, 2005.
topical n-grams. Technical Report UM-CS-071,
Department of Computer Science University of
[8] Thomas Hofmann. Probabilistic latent semantic
Massachusetts Amherst, 2005.
analysis. In Proceedings of the 15th Annual Con-
ference on Uncertainty in Artificial Intelligence
(UAI-99), pages 289–29, San Francisco, CA, 1999.
Morgan Kaufmann.