Automatically Labeling Semantic Classes
Automatically Labeling Semantic Classes
Patrick Pantel and Deepak Ravichandran
Information Sciences Institute
University of Southern California
4676 Admiralty Way
Marina del Rey, CA 90292
{pantel,ravichan}@isi.edu
main specific senses. For example, WordNet misses the
Abstract
user-interface-object sense of the word dialog (as often
used in software manuals). WordNet also contains a
Systems that automatically discover semantic
very poor coverage of proper nouns.
classes have emerged in part to address the
There is a need for (semi-) automatic approaches to
limitations of broad-coverage lexical re-
building and extending ontologies as well as for validat-
sources such as WordNet and Cyc. The cur-
ing the structure and content of existing ones. With the
rent state of the art discovers many semantic
advent of the Web, we have access to enormous
classes but fails to label their concepts. We
amounts of text. The future of ontology growing lies in
propose an algorithm labeling semantic
leveraging this data by harvesting it for concepts and
classes and for leveraging them to extract is-a
semantic relationships. Moreover, once such knowledge
relationships using a top-down approach.
is discovered, mechanisms must be in place to enrich
current ontologies with this new knowledge.
To address some of the coverage and specificity
1 Introduction
problems in WordNet and Cyc, Pantel and Lin (2002)
proposed and algorithm, called CBC, for automatically
The natural language literature is rich in theories of se-
extracting semantic classes. Their classes consist of
mantics (Barwise and Perry 1985; Schank and Abelson clustered instances like the three shown below:
1977). However, WordNet (Miller 1990) and Cyc (Le-
(A)
multiple sclerosis, diabetes,
nat 1995) aside, the community has had little success in
osteoporosis, cardiovascular disease,
actually building large semantic repositories. Such
Parkinson's, rheumatoid arthritis, heart
disease, asthma, cancer, hypertension,
broad-coverage lexical resources are extremely useful in
lupus, high blood pressure, arthritis,
applications such as word sense disambiguation (Lea-
emphysema, epilepsy, cystic fibrosis,
cock, Chodorow and Miller 1998) and question answer-
leukemia, hemophilia, Alzheimer, myeloma,
ing (Pasca and Harabagiu 2001).
glaucoma, schizophrenia, ...
Current manually constructed ontologies such as
(B) Mike Richter, Tommy Salo, John
WordNet and Cyc have important limitations. First, they
Vanbiesbrouck, Curtis Joseph, Chris Osgood,
often contain rare senses. For example, WordNet in-
Steve Shields, Tom Barrasso, Guy Hebert,
cludes a rare sense of computer that means ‘the person
Arturs Irbe, Byron Dafoe, Patrick Roy, Bill
who computes’. Using WordNet to expand queries to an
Ranford, Ed Belfour, Grant Fuhr, Dominik
Hasek, Martin Brodeur, Mike Vernon, Ron
information retrieval system, the expansion of computer
Tugnutt, Sean Burke, Zach Thornton, Jocelyn
will include words like estimator and reckoner. Also,
Thibault, Kevin Hartman, Felix Potvin, ...
the words dog, computer and company all have a sense
that is a hyponym of person. Such rare senses make it
(C) pink, red, turquoise, blue, purple,
difficult for a coreference resolution system to use
green, yellow, beige, orange, taupe, white,
lavender, fuchsia, brown, gray, black,
WordNet to enforce the constraint that personal pro-
mauve, royal blue, violet, chartreuse,
nouns (e.g. he or she) must refer to a person. The second
teal, gold, burgundy, lilac, crimson,
problem with these lexicons is that they miss many do-
garnet, coral, grey, silver, olive green,
cobalt blue, scarlet, tan, amber, ...
A limitation of these concepts is that CBC does not For example, in (D), the color and fruit senses of orange
discover their actual names. That is, CBC discovers a are mixed up.
semantic class of Canadian provinces such as Manitoba,
Lin and Pantel (2001) proposed a clustering algo-
Alberta, and Ontario, but stops short of labeling the rithm, UNICON, which generates similar lists but
concept as Canadian Provinces. Some applications such
discriminates between senses of words. Later, Pantel
as question answering would benefit from class labels. and Lin (2002) improved the precision and recall of
For example, given the concept list (B) and a label UNICON clusters with CBC (Clustering by Commit-
goalie/goaltender, a QA system could look for answers
tee). Using sets of representative elements called com-
to the question “Which goaltender won the most Hart mittees, CBC discovers cluster centroids that
Trophys?” in the concept.
unambiguously describe the members of a possible
In this paper, we propose an algorithm for automati-
class. The algorithm initially discovers committees that
cally inducing names for semantic classes and for find-
are well scattered in the similarity space. It then pro-
ing instance/concept (is-a) relationships. Using concept ceeds by assigning elements to their most similar com-
signatures (templates describing the prototypical syntac-
mittees. After assigning an element to a cluster, CBC
tic behavior of instances of a concept), we extract con-
removes their overlapping features from the element
cept names by searching for simple syntactic patterns before assigning it to another cluster. This allows CBC
such as “concept apposition-of instance”. Searching to discover the less frequent senses of a word and to
concept signatures is more robust than searching the avoid discovering duplicate senses.
syntactic features of individual instances since many
CBC discovered both the color sense of orange, as
instances suffer from sparse features or multiple senses.
shown in list (C) of Section 1, and the fruit sense shown
Once labels are assigned to concepts, we can extract
below:
a hyponym relationship between each instance of a con-
(E) peach, pear, apricot, strawberry, ba-
cept and its label. For example, once our system labels
nana, mango, melon, apple, pineapple,
list (C) as color, we may extract relationships such as:
cherry, plum, lemon, grapefruit, orange,
pink is a color, red is a color, turquoise is a color, etc.
berry, raspberry, blueberry, kiwi, ...
Our results show that of the 159,000 hyponyms we ex-
There have also been several approaches to discov-
tract using this simple method, 68% are correct. Of the ering hyponym (is-a) relationships from text. Hearst
65,000 proper name hyponyms we discover, 81.5% are (1992) used seven lexico-syntactic patterns, for example
correct.
“such NP as {NP,}*{(or|and)} NP” and “NP {, NP}*{,}
The remainder of this paper is organized as follows.
or other NP”. Berland and Charniak (1999) used similar
In the next section, we review previous algorithms for pattern-based techniques and other heuristics to extract
extracting semantic classes and hyponym relationships. meronymy (part-whole) relations. They reported an
Section 3 describes our algorithm for labeling concepts accuracy of about 55% precision on a corpus of 100,000
and for extracting hyponym relationships. Experimental
words. Girju, Badulescu and Moldovan (2003)
results are presented in Section 4 and finally, we con-
improved upon this work by using a machine learning
clude with a discussion and future work.
filter. Mann (2002) and Fleischman et al. (2003) used
part of speech patterns to extract a subset of hyponym
2 Previous Work
relations involing proper nouns.
There have been several approaches to automatically 3 Labeling Classes
discovering lexico-semantic information from text
(Hearst 1992; Riloff and Shepherd 1997; Riloff and The research discussed above on discovering hyponym
Jones 1999; Berland and Charniak 1999; Pantel and Lin
relationships all take a bottom up approach. That is,
2002; Fleischman et al. 2003; Girju et al. 2003). One they use patterns to independently discover semantic
approach constructs automatic thesauri by computing relationships of words. However, for infrequent words,
the similarity between words based on their distribution
these patterns do not match or, worse yet, generate in-
in a corpus (Hindle 1990; Lin 1998). The output of correct relationships.
these programs is a ranked list of similar words to each
Ours is a top down approach. We make use of co-
word. For example, Lin’s approach outputs the follow-
occurrence statistics of semantic classes discovered by
ing top-20 similar words of orange:
algorithms like CBC to label their concepts. Hyponym
(D) peach, grapefruit, yellow, lemon, pink,
relationships may then be extracted easily: one hypo-
avocado, tangerine, banana, purple, Santa
nym per instance/concept label pair. For example, if we
Ana, strawberry, tomato, red, pineapple,
labeled concept (A) from Section 1 with disease, then
pear, Apricot, apple, green, citrus, mango
we could extract is-a relationships such as: diabetes is a
A common problem of such lists is that they do not
disease, cancer is a disease, and lupus is a disease. A
discriminate between the senses of polysemous words. concept instance such as lupus is assigned a hypernym
disease not because it necessarily occurs in any particu-
n
m
lar syntactic relationship with disease, but because it
min ∑ c ,
c
ei ∑ jf
belongs to the class of instances that does.
cef
i 1=
j 1
=
×
(2)
The input to our labeling algorithm is a list of se-
c +1
ef
n
m
mantic classes, in the form of clusters of words, which
min ∑ c , c
ei ∑
+1
jf
may be generated from any source. In our experiments,
i 1=
j 1
=
we used the clustering outputs of CBC (Pantel and Lin
2002). The output of the system is a ranked list of con-
3.2
Phase II
cept names for each semantic class.
In the first phase of the algorithm, we extract feature
Following (Pantel and Lin 2002), we construct a com-
vectors for each word that occurs in a semantic class. mittee for each semantic class. A committee is a set of
Phase II then uses these features to compute grammati-
representative elements that unambiguously describe the
cal signatures of concepts using the CBC algorithm. members of a possible class.
Finally, we use simple syntactic patterns to discover
For each class c, we construct a matrix containing
class names from each class’ signature. Below, we de-
the similarity between each pair of words ei and ej in c
scribe these phases in detail.
using the cosine coefficient of their mutual information
vectors (Salton and McGill 1983):
3.1
Phase I
∑mi ×mi
e f
e f
i
j
We represent each word (concept instance) by a feature
sim(e ,e
(3)
i
j ) =
f
vector. Each feature corresponds to a context in which
∑mi 2 × mi 2
e f
i
∑
the word occurs. For example, “catch __” is a verb-
e f
j
f
f
object context. If the word wave occurred in this con-
text, then the context is a feature of wave.
For each word e, we then cluster its most similar in-
We first construct a frequency count vector C(e) = stances using group-average clustering (Han and Kam-
(c
ber 2001) and we store as a candidate committee the
e1, ce2, …, cem), where m is the total number of features
and c
highest scoring cluster c' according to the following
ef is the frequency count of feature f occurring in
word e. Here, c
metric:
ef is the number of times word e occurred
in a grammatical context f. For example, if the word
| c'| × avgsim(c') (4)
wave occurred 217 times as the object of the verb catch,
then the feature vector for wave will have value 217 for
where |c'| is the number of elements in c' and avgsim(c')
its “object-of catch” feature. In Section 4.1, we describe
is the average pairwise similarity between words in c'.
how we obtain these features.
The assumption is that the best representative for a con-
We then construct a mutual information vector cept is a large set of very similar instances. The commit-
MI(e) = (mie1, mie2, …, miem) for each word e, where mief
tee for class c is then the highest scoring candidate
is the pointwise mutual information between word e and
committee containing only words from c. For example,
feature f, which is defined as:
below are the committee members discovered for the
semantic classes (A), (B), and (C) from Section 1:
cef
N
1) cardiovascular disease, diabetes,
mi = log
ef
n
m
(1) multiple sclerosis, osteoporosis,
∑c
∑c
if
ej
Parkinson's, rheumatoid arthritis
i=1
j=
× 1
N
N
2) Curtis Joseph, John Vanbiesbrouck, Mike
n m
Richter, Tommy Salo
where n is the number of words and N = ∑ ∑ c is the
ij
i=1 j=1
3) blue, pink, red, yellow
total frequency count of all features of all words.
Mutual information is commonly used to measure 3.3 Phase III
the association strength between two words (Church and
Hanks 1989). A well-known problem is that mutual By averaging the feature vectors of the committee
information is biased towards infrequent ele-
members of a particular semantic class, we obtain a
ments/features. We therefore multiply mi
grammatical template, or signature, for that class. For
ef with the fol-
lowing discounting factor:
example, Figure 1 shows an excerpt of the grammatical
signature for concept (B) in Section 1. The vector is
obtained by averaging the feature vectors for the words
Curtis Joseph, John Vanbiesbrouck, Mike Richter, and
Tommy Salo (the committee of this concept). The
{Curtis Joseph, John Vanbiesbrouck,
• Apposition (N:appo:N)
Mike Richter, Tommy Salo}
e.g. ... Oracle, a company known
-N:gen:N
for its progressive employment
pad
57
11.19
policies, ...
backup
29
9.95
• Nominal subject (-N:subj:N)
crease
7
9.69
e.g. ... Apple was a hot young com-
glove
52
9.57
pany, with Steve Jobs in charge.
stick
20
9.15
shutout
17
8.80
• Such as (-N:such as:N)
-N:conj:N
e.g. ... companies such as IBM must
Hasek
15
12.36
be weary ...
Martin
Brodeur
12
12.26
Belfour
13
12.22
• Like (-N:like:N)
Patrick
Roy 10
11.90
e.g. ... companies like Sun Micro-
Dominik
Hasek 7
11.20
systems do no shy away from such
Roy
6
10.01
challenges, ...
-V:subj:N
sprawl
11
6.69
To name a class, we simply search for these syntac-
misplay
6
6.55
tic relationships in the signature of a concept. We sum
smother
10
6.54
up the mutual information scores for each term that oc-
skate
28
6.43
curs in these relationships with a committee of a class.
turn
back
10
6.28
The highest scoring term is the name of the class. For
stop
453
6.19
example, the top-5 scoring terms that occurred in these
N:appo:N
relationships with the signature of the concept repre-
goaltender
449
10.79
sented by the committee {Curtis Joseph, John
goalie
1641
10.76
netminder
57
10.39
Vanbiesbrouck, Mike Richter, Tommy Salo} are:
goalkeeper
487
9.69
1) goalie
40.37
N:conj:N
Martin
Brodeur
11
12.49
2) goaltender 33.64
Dominik
Hasek 11
12.33
Ed
Belfour
10
12.04
3) goalkeeper 19.22
Curtis
Joseph 7
11.46
4) player
14.55
Tom
Barrasso 5
10.85
Byron
Dafoe 5
10.80
5) backup
9.40
Chris
Osgood 4
10.25
Figure 1. Excerpt of the grammatical signature for the
The numbers are the total mutual information scores
goalie/goaltender concept.
of each name in the four syntactic relationships.
4 Evaluation
“-V:subj:N:sprawl” feature indicates a subject-verb re-
In this section, we present an evaluation of the class
lationship between the concept and the verb sprawl
labeling algorithm and of the hyponym relationships
while “N:appo:N:goaltender” indicates an apposition discovered by our system.
relationship between the concept and the noun goal-
tender. The (-) in a relationship means that the right 4.1
Experimental Setup
hand side of the relationship is the head (e.g. sprawl is
the head of the subject-verb relationship). The two col-
We used Minipar (Lin 1994), a broad coverage parser,
umns of numbers indicate the frequency and mutual to parse 3GB of newspaper text from the Aquaint
information score for each feature respectively.
(TREC-9) collection. We collected the frequency counts
In order to discover the characteristics of human of the grammatical relationships (contexts) output by
naming conventions, we manually named 50 concepts Minipar and used them to compute the pointwise mutual
discovered by CBC. For each concept, we extracted the
information vectors described in Section 3.1.
relationships between the concept committee and the
We used the 1432 noun clusters extracted by CBC1
assigned label. We then added the mutual information as the list of concepts to name. For each concept, we
scores for each extracted relationship among the 50 then used our algorithm described in Section 3 to extract
concepts. The top-4 highest scoring relationships are:
the top-20 names for each concept.
1 Available at http://www.isi.edu/~pantel/demos.htm
Table 1. Labels assigned to 10 randomly selected concepts (each represented by three committee members.
CBC CONCEPT
HUMAN LABEL
WORDNET LABELS
SYSTEM LABELS (RANKED)
BMG, EMI, Sony
record label
none
label / company / album /
machine / studio
Preakness Stakes, Preakness, Belmont
horse race
none
race / event / run / victory /
Stakes
start
Olympia Snowe, Susan Collins, James
US senator
none
republican / senator / chair-
Jeffords
man / supporter / conservative
Eldoret, Kisumu, Mombasa
African city
none
city / port / cut off / town /
southeast
Bronze Star, Silver Star, Purple Heart
medal
decoration / laurel
distinction / set / honor / sym-
wreath / medal / medal-
bol
lion / palm
Mike Richter, Tommy Salo, John
NHL goalie
none
goalie / goaltender / goal-
Vanbiesbrouck
keeper / player / backup
Dodoma, Mwanza, Mbeya
African city
none
facilitator / town
fresco, wall painting, Mural
art
painting / picture
painting / world / piece / floor
/ symbol
Qinghua University, Fudan University, university none
university
/ institution / stock-
Beijing University
holder / college / school
Federal Bureau of Investigation, Drug
governmental depart-
law enforcement agency
agency / police / investigation
Enforcement Administration, FBI
ment
/ department / FBI
source of a label nor the system’s ranking of the labels.
4.2
Labeling Precision
For each name, we asked the judges to assign a score of
Out of the 1432 noun concepts, we were unable to name
correct, partially correct, or incorrect. We then com-
21 (1.5%) of them. This occurs when a concept’s com-
puted the mean reciprocal rank (MRR) of the system,
mittee members do not occur in any of the four syntactic
human, and WordNet labels. For each concept, a nam-
relationships described in Section 0. We performed a ing scheme receives a score of 1 / M where M is the
manual evaluation of the remaining 1411 concepts.
rank of the first name judged correct. Table 2 shows the
We randomly selected 125 concepts and their top-5 results. Table 3 shows similar results for a more lenient
highest ranking names according to our algorithm. Ta-
evaluation where M is the rank of the first name judged
ble 1 shows the first 10 randomly selected concepts correct or partially correct.
(each concept is represented by three of its committee
Our system achieved an overall MRR score of
members).
77.1%. We performed much better than the baseline
For each concept, we added to the list of names a WordNet (19.9%) because of the lack of coverage
human generated name (obtained from an annotator (mostly proper nouns) in the hierarchy. For the 33 con-
looking at only the concept instances). We also ap-
cepts that WordNet named, it achieved a score of 75.3%
pended concept names extracted from WordNet. For and a lenient score of 82.7%, which is high considering
each concept that contains at least five instances in the the simple algorithm we used to extract labels using
WordNet hierarchy, we named the concept with the WordNet.
most frequent common ancestor of each pair of in-
The Kappa statistic (Siegel and Castellan Jr. 1988)
stances. Up to five names were generated by WordNet measures the agreements between a set of judges’ as-
for each concept. Because of the low coverage of proper
sessments correcting for chance agreements:
nouns in WordNet, only 33 of the 125 concepts we
P(A)− P(E)
evaluated had WordNet generated labels.
K =
(5)
1− P(E)
We presented to three human judges the 125 ran-
domly selected concepts together with the system, hu-
where P(A) is the probability of agreement between the
man, and WordNet generated names randomly ordered.
judges and P(E) is the probability that the judges agree
That way, there was no way for a judge to know the
Table 2. MRR scores for the human evaluation of naming 125
Table 3. Lenient MRR scores for the human evaluation of
random concepts.
naming 125 random concepts.
JUDGE
HUMAN
WordNet
System
JUDGE
HUMAN
WordNet
System
LABELS
Labels
Labels
LABELS
Labels
Labels
1
100% 18.1% 74.4% 1
100% 22.8% 85.0%
2
91.2% 20.0% 78.1% 2
96.0% 20.8% 86.5%
3
89.6% 21.6% 78.8% 3
92.0% 21.8% 85.2%
Combined
93.6% 19.9% 77.1%
Combined
96.0% 21.8% 85.6%
Table 4. Percentage of concepts with a correct name in the
Table 5. Accuracy of 159,000 extracted hyponyms and a sub-
top-5 ranks returned by our system.
set of 65,000 proper noun hyponyms.
JUDGE
TOP-1 TOP-2 TOP-3 TOP-4 TOP-5
JUDGE
All Nouns
Proper Nouns
1
68.8% 75.2% 78.4% 83.2% 84.0%
Strict Lenient Strict Lenient
2
73.6% 80.0% 81.6% 83.2% 84.8%
1
62.0% 68.0% 79.0% 82.0%
3
73.6% 80.0% 82.4% 84.0% 88.8%
2
74.0% 76.5% 84.0% 85.5%
Combined 72.0% 78.4% 80.8% 83.5% 85.6%
Combined
68.0% 72.2% 81.5% 83.8%
by chance on an assessment. An experiment with K ≥
incorrect. Table 5 shows the results. The strict measure
0.8 is generally viewed as reliable and 0.67 < K < 0.8 counts a score of 1 for each correctly judged instance
allows tentative conclusions. The Kappa statistic for our
and 0 otherwise. The lenient measure also gives a score
experiment is K = 0.72.
of 0.5 for each instance judged partially correct.
The human labeling is at a disadvantage since only
Many of the CBC concepts contain noise. For ex-
one label was generated per concept. Therefore, the ample, the wine cluster:
human scores either 1 or 0 for each concept. Our sys-
Zinfandel, merlot, Pinot noir, Chardonnay,
tem’s highest ranking name was correct 72% of the
Cabernet Sauvignon, cabernet, riesling,
time. Table 4 shows the percentage of semantic classes
Sauvignon blanc, Chenin blanc, sangiovese,
with a correct label in the top 1-5 ranks returned by our
syrah, Grape, Chianti ...
system.
contains some incorrect instances such as grape, appe-
Overall, 41.8% of the top-5 names extracted by our lation, and milk chocolate. Each of these instances will
system were judged correct. The overall accuracy for generate incorrect hyponyms such as grape is wine and
the top-4, top-3, top-2, and top-1 names are 44.4%, milk chocolate is wine. This hyponym extraction task
48.8%, 58.5%, and 72% respectively. Hence, the name would likely serve well for evaluating the accuracy of
ranking of our algorithm is effective.
lists of semantic classes.
Table 5 shows that the hyponyms involving proper
4.3
Hyponym Precision
nouns are much more reliable than common nouns.
The 1432 CBC concepts contain 18,000 unique words. Since WordNet contains poor coverage of proper nouns,
For each concept to which a word belongs, we extracted
these relationships could be useful to enrich it.
up to 3 hyponyms, one for each of the top-3 labels for
the concept. The result was 159,000 hyponym relation-
4.4
Recall
ships. 24 are shown in the Appendix.
Semantic extraction tasks are notoriously difficult to
Two judges annotated two random samples of 100 evaluate for recall. To approximate recall, we conducted
relationships: one from all 159,000 hyponyms and one two question answering (QA) tasks: answering
from the subset of 65,000 proper nouns. For each in-
definition questions and performing QA information
stance, the judges were asked to decide whether the retrieval.
hyponym relationship was correct, partially correct or
Table 6. Percentage of correct answers in the Top-1 and Table 7. Percentage of questions where the passage retrieval
Top-5 returned answers on 50 definition questions.
module returns a correct answer in the Top-1 and Top-100
ranked passages (with and without semantic indexing).
SYSTEM Top-1 Top-5
CORRECT TOP-1 Correct
Top-100
Strict Lenient Strict Lenient
WordNet
38% 38% 38% 38% With semantic
43 / 179
134 / 179
Fleischman
36% 40% 42% 44% indexing
Without semantic
36 / 179
131 / 179
Our System
36% 44% 60% 62% indexing
We compared the passages returned by the passage
Definition Questions
retrieval module with and without the semantic index-
We chose the 50 definition questions that appeared in ing. We counted how many of the 179 questions had a
the QA track of TREC2003 (Voorhees, 2003). For ex-
correct answer returned in the top-1 and top-100 pas-
ample: “Who is Aaron Copland?” and “What is the sages. Table 7 shows the results.
Kama Sutra?” For each question we looked for at most
Our system shows small gains in the performance of
five corresponding concepts in our hyponym list. For the IR output. In the top-1 category, the performance
example, for Aaron Copland, we found the following improved by 20%. This may lead to better answer selec-
hypernyms: composer, music, and gift. We compared tions.
our system with the concepts in WordNet and Fleisch-
man et al.’s instance/concept relations (Fleischman et al.
5 Conclusions and Future Work
2003). Table 6 shows the percentage of correct answers
in the top-1 and top-5 returned answers from each sys-
Current state of the art concept discovery algorithms
tem. All systems seem to have similar performance on generate lists of instances of semantic classes but stop
the top-1 answers, but our system has many more an-
short of labeling the classes with concept names. Class
swers in the top-5. This shows that our system has com-
labels would serve useful in applications such as ques-
paratively higher recall for this task.
tion answering to map a question concept into a seman-
tic class and then search for answers within that class.
Information (Passage) Retrieval
We propose here an algorithm for automatically label-
ing concepts that searches for syntactic patterns within a
Passage retrieval is used in QA to supply relevant in-
grammatical template for a class. Of the 1432 noun con-
formation to an answer pinpointing module. The higher
cepts discovered by CBC, our system labelled 98.5% of
the performance of the passage retrieval module, the them with an MRR score of 77.1% in a human evalua-
higher will be the performance of the answer pinpoint-
tion.
ing module.
Hyponym relationships were then easily extracted,
The passage retrieval module can make use of the one for each instance/concept label pair. We extracted
hyponym relationships that are discovered by our sys-
159,000 hyponyms and achieved a precision of 68%. On
tem. Given a question such as “What color …”, the like-
a subset of 65,000 proper names, our performance was
lihood of a correct answer being present in a retrieved 81.5%.
passage is greatly increased if we know the set of all
This work forms an important attempt to building
possible colors and index them in the document collec-
large-scale semantic knowledge bases. Without being
tion appropriately.
able to automatically name a cluster and extract hypo-
We used the hyponym relations learned by our sys-
nym/hypernym relationships, the utility of automatically
tem to perform semantic indexing on a QA passage re-
generated clusters or manually compiled lists of terms is
trieval task. We selected the 179 questions from the QA
limited. Of course, it is a serious open question how
track of TREC-2003 that had an explicit semantic an-
many names each cluster (concept) should have, and
swer type (e.g. “What band was Jerry Garcia with?”
how good each name is. Our method begins to address
and “What color is the top stripe on the U.S. flag?”).
this thorny issue by quantifying the name assigned to a
For each expected semantic answer type corresponding class and by simultaneously assigning a number that can
to a given question (e.g. band and color), we indexed be interpreted to reflect the strength of membership of
the entire TREC-2002 IR collection with our system’s each element to the class. This is potentially a signifi-
hyponyms.
cant step away from traditional all-or-nothing seman-
tic/ontology representations to a concept representation
Appendix. Sample hyponyms discovered by our system.
Han, J. and Kamber, M. 2001. Data Mining – Concepts and
Techniques. Morgan Kaufmann.
INSTANCE
CONCEPT
INSTANCE
CONCEPT
Hearst, M. 1992. Automatic acquisition of hyponyms from
large text corpora. In COLING-92. pp. 539–545. Nantes,
actor hero price
support
benefit
France.
Hindle, D. 1990. Noun classification from predicate-argument
Ameritrade brokerage republican politician
structures. In Proceedings of ACL-90. pp. 268–275. Pitts-
Arthur
pitcher Royal
Air
force
burgh, PA.
Rhodes
Force
bebop MUSIC
Rwanda
city
Leacock, C.; Chodorow, M.; and Miller; G. A. 1998. Using
corpus statistics and WordNet relations for sense identifica-
Buccaneer team
Santa
Ana city
tion. Computational Linguistics, 24(1):147–165.
Congressional agency shot-blocker
player
Lenat, D. 1995. CYC: A large-scale investment in knowledge
Research
infrastructure. Communications of the ACM, 38(11):33–38.
Service
Cuba country
slavery
issue Lin, D. 1994. Principar - an efficient, broad-coverage, princi-
ple-based parser. Proceedings of COLING-94. pp. 42–48.
Dan Petrescu
midfielder
spa
facility
Kyoto, Japan.
Hercules aircraft taxi
vehicle
Lin, D. 1998. Automatic retrieval and clustering of similar
words. In Proceedings of COLING/ACL-98. pp. 768–774.
Moscow city
Terrence director
Montreal, Canada.
Malick
Nokia COMPANY
verbena
tree
Lin, D. and Pantel, P. 2001. Induction of semantic classes
from natural language text. In Proceedings of SIGKDD-01.
nominee candidate Wagner composer
pp. 317–322. San Francisco, CA.
Mann, G. S. 2002. Fine-Grained Proper Noun Ontologies
for Question Answering. SemaNet’ 02: Building and
scheme that is more nuanced and admits multiple names
Using Semantic Networks, Taipei, Taiwan.
and graded set memberships.
Miller, G. 1990. WordNet: An online lexical database. Inter-
national Journal of Lexicography, 3(4).
Acknowledgements
Pasca, M. and Harabagiu, S. 2001. The informative role of
WordNet in Open-Domain Question Answering. In Pro-
The authors wish to thank the reviewers for their helpful
ceedings of NAACL-01 Workshop on WordNet and Other
comments. This research was partly supported by NSF
Lexical Resources. pp. 138–143. Pittsburgh, PA.
grant #EIA-0205111.
Pantel, P. and Lin, D. 2002. Discovering Word Senses from
Text. In Proceedings of SIGKDD-02. pp. 613–619. Edmon-
References
ton, Canada.
Barwise, J. and Perry, J. 1985. Semantic innocence and un-
Riloff, E. and Shepherd, J. 1997. A corpus-based approach for
compromising situations. In: Martinich, A. P. (ed.) The
building semantic lexicons. In Proceedings of EMNLP-
Philosophy of Language. New York: Oxford University
1997.
Press. pp. 401–413.
Riloff, E. and Jones, R. 1999. Learning dictionaries for infor-
Berland, M. and E. Charniak, 1999. Finding parts in very large
mation extraction by multi-level bootstrapping. In Proceed-
corpora. In ACL-1999. pp. 57–64. College Park, MD.
ings of AAAI-99. pp. 474–479. Orlando, FL.
Church, K. and Hanks, P. 1989. Word association norms, mu-
Salton, G. and McGill, M. J. 1983. Introduction to Modern
tual information, and lexicography. In Proceedings of ACL-
Information Retrieval. McGraw Hill
89. pp. 76–83. Vancouver, Canada.
Schank, R. and Abelson, R. 1977. Scripts, Plans, Goals and
Fleischman, M.; Hovy, E.; and Echihabi, A. 2003. Offline
Understanding: An Inquiry into Human Knowledge Struc-
strategies for online question answering: Answering ques-
tures. Lawrence Erlbaum Associates.
tions before they are asked. In Proceedings of ACL-03. pp.
Siegel, S. and Castellan Jr., N. J. 1988. Nonparametric Statis-
1–7. Sapporo, Japan.
tics for the Behavioral Sciences. McGraw-Hill.
Girju, R.; Badulescu, A.; and Moldovan, D. 2003. Learning
Voorhees, E. 2003. Overview of the question answering track.
semantic constraints for the automatic discovery of part-
To appear in Proceedings of TREC-12 Conference. NIST,
whole relations. In Proceedings of HLT/NAACL-03. pp.
Gaithersburg, MD.
80–87. Edmonton, Canada.