Learning Bilingual Lexicons from Monolingual Corpora
Aria Haghighi, Percy Liang, Taylor Berg-Kirkpatrick and Dan Klein
Computer Science Division, University of California at Berkeley
{ aria42,pliang,tberg,klein }@cs.berkeley.edu
Abstract

We present a method for learning bilingual translation lexicons from monolingual corpora. Word types in each language are characterized by purely monolingual features, such as context counts and orthographic substrings. Translations are induced using a generative model based on canonical correlation analysis, which explains the monolingual lexicons in terms of latent matchings. We show that high-precision lexicons can be learned in a variety of language pairs and from a range of corpus types.

1 Introduction

Current statistical machine translation systems use parallel corpora to induce translation correspondences, whether those correspondences be at the level of phrases (Koehn, 2004), treelets (Galley et al., 2006), or simply single words (Brown et al., 1994). Although parallel text is plentiful for some language pairs such as English-Chinese or English-Arabic, it is scarce or even non-existent for most others, such as English-Hindi or French-Japanese. Moreover, parallel text could be scarce for a language pair even if monolingual data is readily available for both languages.

In this paper, we consider the problem of learning translations from monolingual sources alone. This task, though clearly more difficult than the standard parallel text approach, can operate on language pairs and in domains where standard approaches cannot. We take as input two monolingual corpora and perhaps some seed translations, and we produce as output a bilingual lexicon, defined as a list of word pairs deemed to be word-level translations. Precision and recall are then measured over these bilingual lexicons. This setting has been considered before, most notably in Koehn and Knight (2002) and Fung (1995), but the current paper is the first to use a probabilistic model and present results across a variety of language pairs and data conditions.

In our method, we represent each language as a monolingual lexicon (see figure 2): a list of word types characterized by monolingual feature vectors, such as context counts, orthographic substrings, and so on (section 5). We define a generative model over (1) a source lexicon, (2) a target lexicon, and (3) a matching between them (section 2). Our model is based on canonical correlation analysis (CCA)[1] and explains matched word pairs via vectors in a common latent space. Inference in the model is done using an EM-style algorithm (section 3).

[1] See Hardoon et al. (2003) for an overview.

Somewhat surprisingly, we show that it is possible to learn or extend a translation lexicon using monolingual corpora alone, in a variety of languages and using a variety of corpora, even in the absence of orthographic features. As might be expected, the task is harder when no seed lexicon is provided, when the languages are strongly divergent, or when the monolingual corpora are from different domains. Nonetheless, even in the more difficult cases, a sizable set of high-precision translations can be extracted. As an example of the performance of the system, in English-Spanish induction with our best feature set, using corpora derived from topically similar but non-parallel sources, the system obtains 89.0% precision at 33% recall.
Proceedings of ACL-08: HLT, pages 771–779,
Columbus, Ohio, USA, June 2008. © 2008 Association for Computational Linguistics
2 Bilingual Lexicon Induction

As input, we are given a monolingual corpus $S$ (a sequence of word tokens) in a source language and a monolingual corpus $T$ in a target language. Let $\mathbf{s} = (s_1, \ldots, s_{n_S})$ denote $n_S$ word types appearing in the source language, and $\mathbf{t} = (t_1, \ldots, t_{n_T})$ denote word types in the target language. Based on $S$ and $T$, our goal is to output a matching $\mathbf{m}$ between $\mathbf{s}$ and $\mathbf{t}$. We represent $\mathbf{m}$ as a set of integer pairs so that $(i, j) \in \mathbf{m}$ if and only if $s_i$ is matched with $t_j$.

[Figure 1 graphic: a column of source word types s (state, society, enlargement, ..., control, importance) and a column of target word types t (sociedad, estado, amplificación, ..., importancia, control), joined by the edges of the matching m.]
Figure 1: Bilingual lexicon induction: source word types s are listed on the left and target word types t on the right. Dashed lines between nodes indicate translation pairs which are in the matching m.

2.1 Generative Model

We propose the following generative model over matchings $\mathbf{m}$ and word types $(\mathbf{s}, \mathbf{t})$, which we call matching canonical correlation analysis (MCCA).

MCCA model
  $\mathbf{m} \sim \text{MATCHING-PRIOR}$   [matching m]
  For each matched edge $(i, j) \in \mathbf{m}$:
    − $z_{i,j} \sim \mathcal{N}(0, I_d)$   [latent concept]
    − $f_S(s_i) \sim \mathcal{N}(W_S z_{i,j}, \Psi_S)$   [source features]
    − $f_T(t_j) \sim \mathcal{N}(W_T z_{i,j}, \Psi_T)$   [target features]
  For each unmatched source word type $i$:
    − $f_S(s_i) \sim \mathcal{N}(0, \sigma^2 I_{d_S})$   [source features]
  For each unmatched target word type $j$:
    − $f_T(t_j) \sim \mathcal{N}(0, \sigma^2 I_{d_T})$   [target features]

First, we generate a matching $\mathbf{m} \in \mathcal{M}$, where $\mathcal{M}$ is the set of matchings in which each word type is matched to at most one other word type.[2] We take MATCHING-PRIOR to be uniform over $\mathcal{M}$.[3]

[2] Our choice of $\mathcal{M}$ permits unmatched word types, but does not allow words to have multiple translations. This setting facilitates comparison to previous work and admits simpler models.
[3] However, non-uniform priors could encode useful information, such as rank similarities.

Then, for each matched pair of word types $(i, j) \in \mathbf{m}$, we need to generate the observed feature vectors of the source and target word types, $f_S(s_i) \in \mathbb{R}^{d_S}$ and $f_T(t_j) \in \mathbb{R}^{d_T}$. The feature vector of each word type is computed from the appropriate monolingual corpus and summarizes the word's monolingual characteristics; see section 5 for details and figure 2 for an illustration. Since $s_i$ and $t_j$ are translations of each other, we expect $f_S(s_i)$ and $f_T(t_j)$ to be connected somehow by the generative process. In our model, they are related through a vector $z_{i,j} \in \mathbb{R}^d$ representing the shared, language-independent concept.

Specifically, to generate the feature vectors, we first generate a random concept $z_{i,j} \sim \mathcal{N}(0, I_d)$, where $I_d$ is the $d \times d$ identity matrix. The source feature vector $f_S(s_i)$ is drawn from a multivariate Gaussian with mean $W_S z_{i,j}$ and covariance $\Psi_S$, where $W_S$ is a $d_S \times d$ matrix which transforms the language-independent concept $z_{i,j}$ into a language-dependent vector in the source space. The arbitrary covariance parameter $\Psi_S \succeq 0$ explains the source-specific variations which are not captured by $W_S$; it does not play an explicit role in inference. The target $f_T(t_j)$ is generated analogously using $W_T$ and $\Psi_T$, conditionally independent of the source given $z_{i,j}$ (see figure 2). For each of the remaining unmatched source word types $s_i$ which have not yet been generated, we draw the word type features from a baseline normal distribution with variance $\sigma^2 I_{d_S}$, with hyperparameter $\sigma^2 \gg 0$; unmatched target words are similarly generated.

If two word types are truly translations, it will be better to relate their feature vectors through the latent space than to explain them independently via the baseline distribution. However, if a source word type is not a translation of any of the target word types, we can just generate it independently without requiring it to participate in the matching.
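As an illustration of the generative story in section 2.1, the following is a minimal sketch of sampling feature vectors from the MCCA model with NumPy. It is not the authors' code: the function name `sample_mcca`, the toy dimensions, and the parameter values are assumptions made for the example, and the matching is passed in directly rather than drawn from MATCHING-PRIOR.

```python
import numpy as np


def sample_mcca(matching, n_s, n_t, W_s, W_t, Psi_s, Psi_t, sigma2, rng):
    """Sample source/target feature matrices given a fixed matching.

    matching: list of (i, j) index pairs, each index used at most once.
    Returns F_s (n_s x d_S) and F_t (n_t x d_T).
    """
    d_s, d = W_s.shape
    d_t = W_t.shape[0]
    # Unmatched word types: baseline N(0, sigma2 * I).
    # Rows belonging to matched word types are overwritten below.
    F_s = np.sqrt(sigma2) * rng.standard_normal((n_s, d_s))
    F_t = np.sqrt(sigma2) * rng.standard_normal((n_t, d_t))
    for i, j in matching:
        z = rng.standard_normal(d)                        # latent concept z_{i,j}
        F_s[i] = rng.multivariate_normal(W_s @ z, Psi_s)  # source features
        F_t[j] = rng.multivariate_normal(W_t @ z, Psi_t)  # target features
    return F_s, F_t


# Toy usage with hypothetical sizes: d = 2, d_S = 3, d_T = 4.
rng = np.random.default_rng(0)
W_s = rng.standard_normal((3, 2))
W_t = rng.standard_normal((4, 2))
F_s, F_t = sample_mcca(
    matching=[(0, 1), (2, 0)], n_s=4, n_t=3,
    W_s=W_s, W_t=W_t,
    Psi_s=0.01 * np.eye(3), Psi_t=0.01 * np.eye(4),
    sigma2=100.0, rng=rng,
)
```

Because both members of a matched pair share one draw of z, their feature vectors are correlated through W_S and W_T, whereas unmatched rows are independent draws from the broad baseline.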
[Figure 2 graphic: three panels labeled Source Space $\mathbb{R}^{d_S}$, Canonical Space $\mathbb{R}^d$, and Target Space $\mathbb{R}^{d_T}$. The source vector $f_S(s_i)$ for "time" and the target vector $f_T(t_j)$ for "tiempo" each list orthographic features (e.g. #ti, ime, me# on the source side; #ti, mpo on the target side) and contextual features (e.g. change, dawn, period, necessary on the source side; suficiente, período, mismo, adicional on the target side), both linked to a latent concept z in the canonical space.]
Figure 2: Illustration of our MCCA model. Each latent concept $z_{i,j}$ originates in the canonical space. The observed word vectors in the source and target spaces are generated independently given this concept.
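To make the word-type-to-feature-vector mapping of Figure 2 concrete, here is a small sketch of the two feature families the paper describes: orthographic substrings of length at most 3 and counts of context words in a window of size 4. Several details are assumptions made for the example rather than facts from the paper: the `#` boundary padding is inferred from the features shown in the figure, the window is taken to mean 4 tokens on each side, and the set of noun types is passed in as a plain `vocabulary` argument instead of coming from a POS tagger.

```python
from collections import Counter


def orthographic_features(word, max_len=3):
    """Indicator features for every substring of length <= max_len.

    The word is padded with '#' so that prefixes and suffixes are marked.
    """
    padded = f"#{word}#"
    feats = Counter()
    for n in range(1, max_len + 1):
        for k in range(len(padded) - n + 1):
            gram = padded[k:k + n]
            if gram.strip("#"):          # skip substrings made only of padding
                feats[gram] = 1.0
    return feats


def context_features(tokens, target, vocabulary, window=4):
    """Counts of vocabulary words within `window` tokens of each occurrence of target."""
    feats = Counter()
    for pos, tok in enumerate(tokens):
        if tok != target:
            continue
        lo, hi = max(0, pos - window), min(len(tokens), pos + window + 1)
        for k in range(lo, hi):
            if k != pos and tokens[k] in vocabulary:
                feats[tokens[k]] += 1.0
    return feats


ortho = orthographic_features("time")
ctx = context_features(
    "the time period is a short period".split(), "time", {"period"}
)
```

In this sketch the two dictionaries would be concatenated (over a fixed feature index) to form the vector $f_S(s_i)$; the same functions applied to the target corpus give $f_T(t_j)$.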
3 Inference

Given our probabilistic model, we would like to maximize the log-likelihood of the observed data $(\mathbf{s}, \mathbf{t})$:

$$\ell(\theta) = \log p(\mathbf{s}, \mathbf{t}; \theta) = \log \sum_{\mathbf{m}} p(\mathbf{m}, \mathbf{s}, \mathbf{t}; \theta)$$

with respect to the model parameters $\theta = (W_S, W_T, \Psi_S, \Psi_T)$.

We use the hard (Viterbi) EM algorithm as a starting point, but due to modeling and computational considerations, we make several important modifications, which we describe later. The general form of our algorithm is as follows:

Summary of learning algorithm
  E-step: Find the maximum weighted (partial) bipartite matching $\mathbf{m} \in \mathcal{M}$
  M-step: Find the best parameters $\theta$ by performing canonical correlation analysis (CCA)

M-step  Given a matching $\mathbf{m}$, the M-step optimizes $\log p(\mathbf{m}, \mathbf{s}, \mathbf{t}; \theta)$ with respect to $\theta$, which can be rewritten as

$$\max_{\theta} \sum_{(i,j) \in \mathbf{m}} \log p(s_i, t_j; \theta). \qquad (1)$$

This objective corresponds exactly to maximizing the likelihood of the probabilistic CCA model presented in Bach and Jordan (2006), which proved that the maximum likelihood estimate can be computed by canonical correlation analysis (CCA). Intuitively, CCA finds $d$-dimensional subspaces $U_S \in \mathbb{R}^{d_S \times d}$ of the source and $U_T \in \mathbb{R}^{d_T \times d}$ of the target such that the components of the projections $U_S^\top f_S(s_i)$ and $U_T^\top f_T(t_j)$ are maximally correlated.[4] $U_S$ and $U_T$ can be found by solving an eigenvalue problem (see Hardoon et al. (2003) for details). Then the maximum likelihood estimates are as follows: $W_S = C_{SS} U_S P^{1/2}$, $W_T = C_{TT} U_T P^{1/2}$, $\Psi_S = C_{SS} - W_S W_S^\top$, and $\Psi_T = C_{TT} - W_T W_T^\top$, where $P$ is a $d \times d$ diagonal matrix of the canonical correlations, $C_{SS} = \frac{1}{|\mathbf{m}|} \sum_{(i,j) \in \mathbf{m}} f_S(s_i) f_S(s_i)^\top$ is the empirical covariance matrix in the source domain, and $C_{TT}$ is defined analogously.

[4] Since $d_S$ and $d_T$ can be quite large in practice and often greater than $|\mathbf{m}|$, we use Cholesky decomposition to re-represent the feature vectors as $|\mathbf{m}|$-dimensional vectors with the same dot products, which is all that CCA depends on.

E-step  To perform a conventional E-step, we would need to compute the posterior over all matchings, which is #P-complete (Valiant, 1979). On the other hand, hard EM only requires us to compute the best matching under the current model:[5]

$$\mathbf{m} = \arg\max_{\mathbf{m}'} \log p(\mathbf{m}', \mathbf{s}, \mathbf{t}; \theta). \qquad (2)$$

We cast this optimization as a maximum weighted bipartite matching problem as follows. Define the edge weight between source word type $i$ and target word type $j$ to be

$$w_{i,j} = \log p(s_i, t_j; \theta) - \log p(s_i; \theta) - \log p(t_j; \theta). \qquad (3)$$

[5] If we wanted softer estimates, we could use the agreement-based learning framework of Liang et al. (2008) to combine two tractable models.
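The two steps in the summary of the learning algorithm can be sketched in a few lines of NumPy and SciPy. This is an illustrative sketch under stated assumptions, not the authors' implementation. The M-step obtains the CCA directions from an SVD of the whitened cross-covariance, which is one standard way of solving the eigenvalue problem mentioned above, and adds a small `ridge` term for numerical stability (an assumption; the Cholesky re-representation of footnote 4 is not used). The E-step swaps in the distance-based proxy weights that the paper reports in a footnote, $A - \lVert z_i^* - z_j^* \rVert_2$ with $z_i^* = P^{1/2} U_S^\top f_S(s_i)$, in place of the likelihood weights of equation (3). It clips negative weights to zero, solves the assignment with `scipy.optimize.linear_sum_assignment`, drops zero-weight edges, and keeps only the `max_edges` heaviest ones. All function names, the constant `A_const`, the edge schedule, and the synthetic data are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def inv_sqrt(C):
    """Inverse square root of a symmetric positive-definite matrix."""
    vals, vecs = np.linalg.eigh(C)
    return (vecs / np.sqrt(vals)) @ vecs.T


def m_step(X, Y, d, ridge=1e-6):
    """CCA on matched rows: X[k] = f_S(s_i), Y[k] = f_T(t_j) for the k-th edge."""
    n = X.shape[0]
    C_ss = X.T @ X / n + ridge * np.eye(X.shape[1])   # ridge: added assumption
    C_tt = Y.T @ Y / n + ridge * np.eye(Y.shape[1])
    C_st = X.T @ Y / n
    R_s, R_t = inv_sqrt(C_ss), inv_sqrt(C_tt)
    A, rho, Bt = np.linalg.svd(R_s @ C_st @ R_t)      # rho: canonical correlations
    U_s, U_t = R_s @ A[:, :d], R_t @ Bt[:d].T
    P_half = np.diag(np.sqrt(rho[:d]))
    W_s, W_t = C_ss @ U_s @ P_half, C_tt @ U_t @ P_half
    return {"U_s": U_s, "U_t": U_t, "rho": rho[:d], "W_s": W_s, "W_t": W_t,
            "Psi_s": C_ss - W_s @ W_s.T, "Psi_t": C_tt - W_t @ W_t.T}


def e_step(F_s, F_t, theta, A_const, max_edges):
    """Partial matching from proxy weights A - ||z_i* - z_j*||_2."""
    P_half = np.diag(np.sqrt(theta["rho"]))
    Z_s = F_s @ theta["U_s"] @ P_half                 # row i is z_i*
    Z_t = F_t @ theta["U_t"] @ P_half                 # row j is z_j*
    dist = np.linalg.norm(Z_s[:, None, :] - Z_t[None, :, :], axis=2)
    W = np.maximum(A_const - dist, 0.0)               # negative edges set to zero
    rows, cols = linear_sum_assignment(W, maximize=True)
    edges = sorted(((W[i, j], int(i), int(j))
                    for i, j in zip(rows, cols) if W[i, j] > 0.0), reverse=True)
    return [(i, j) for _, i, j in edges[:max_edges]]  # keep the heaviest edges


# Synthetic data: word type k in the source truly translates to k in the target.
rng = np.random.default_rng(0)
n, d = 60, 2
Z = rng.standard_normal((n, d))
F_s = Z @ rng.standard_normal((d, 4)) + 0.1 * rng.standard_normal((n, 4))
F_t = Z @ rng.standard_normal((d, 4)) + 0.1 * rng.standard_normal((n, 4))

matching = [(k, k) for k in range(20)]                # seed lexicon
for max_edges in (25, 30, 35):                        # retain more edges over time
    if not matching:
        break
    src, tgt = map(list, zip(*matching))
    theta = m_step(F_s[src], F_t[tgt], d)
    matching = e_step(F_s, F_t, theta, A_const=1.0, max_edges=max_edges)
```

The growing `max_edges` schedule is only a stand-in for gradually admitting more edges; the actual schedule and threshold would need tuning on real data.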
The weight $w_{i,j}$ in equation (3) can be loosely viewed as a pointwise mutual information quantity. We can check that the objective $\log p(\mathbf{m}, \mathbf{s}, \mathbf{t}; \theta)$ is equal to the weight of a matching plus some constant $C$:

$$\log p(\mathbf{m}, \mathbf{s}, \mathbf{t}; \theta) = \sum_{(i,j) \in \mathbf{m}} w_{i,j} + C. \qquad (4)$$

To find the optimal partial matching, edges with weight $w_{i,j} < 0$ are set to zero in the graph and the optimal full matching is computed in $O((n_S + n_T)^3)$ time using the Hungarian algorithm (Kuhn, 1955). If a zero edge is present in the solution, we remove the involved word types from the matching.[6]

[6] Empirically, we obtained much better efficiency and even increased accuracy by replacing these marginal likelihood weights with a simple proxy, the distances between the words' mean latent concepts:

$$w_{i,j} = A - \lVert z_i^* - z_j^* \rVert_2, \qquad (5)$$

where $A$ is a thresholding constant, $z_i^* = E(z_{i,j} \mid f_S(s_i)) = P^{1/2} U_S^\top f_S(s_i)$, and $z_j^*$ is defined analogously. The increased accuracy may not be an accident: whether two words are translations is perhaps better characterized directly by how close their latent concepts are, whereas log-probability is more sensitive to perturbations in the source and target spaces.

Bootstrapping  Recall that the E-step produces a partial matching of the word types. If too few word types are matched, learning will not progress quickly; if too many are matched, the model will be swamped with noise. We found that it was helpful to explicitly control the number of edges. Thus, we adopt a bootstrapping-style approach that only permits high confidence edges at first, and then slowly permits more over time. In particular, we compute the optimal full matching, but only retain the highest weighted edges. As we run EM, we gradually increase the number of edges to retain.

In our context, bootstrapping has a similar motivation to the annealing approach of Smith and Eisner (2006), which also tries to alter the space of hidden outputs in the E-step over time to facilitate learning in the M-step, though of course the use of bootstrapping in general is quite widespread (Yarowsky, 1995).

4 Experimental Setup

In section 5, we present developmental experiments in English-Spanish lexicon induction; experiments are presented for other languages in section 6. In this section, we describe the data and experimental methodology used throughout this work.

4.1 Data

Each experiment requires a source and target monolingual corpus. We use the following corpora:

• EN-ES-W: 3,851 Wikipedia articles with both English and Spanish bodies (generally not direct translations).
• EN-ES-P: 1st 100k sentences of text from the parallel English and Spanish Europarl corpus (Koehn, 2005).
• EN-ES(FR)-D: English: 1st 50k sentences of Europarl; Spanish (French): 2nd 50k sentences of Europarl.[7]
• EN-CH-D: English: 1st 50k sentences of Xinhua parallel news corpora;[8] Chinese: 2nd 50k sentences.
• EN-AR-D: English: 1st 50k sentences of 1994 proceedings of UN parallel corpora;[9] Arabic: 2nd 50k sentences.
• EN-ES-G: English: 100k sentences of English Gigaword; Spanish: 100k sentences of Spanish Gigaword.[10]

[7] Note that although the corpora here are derived from a parallel corpus, there are no parallel sentences.
[8] LDC catalog # 2002E18.
[9] LDC catalog # 2004E13.
[10] These corpora contain no parallel sentences.

Note that even when corpora are derived from parallel sources, no explicit use is ever made of document or sentence-level alignments. In particular, our method is robust to permutations of the sentences in the corpora.

4.2 Lexicon

Each experiment requires a lexicon for evaluation. Following Koehn and Knight (2002), we consider lexicons over only noun word types, although this is not a fundamental limitation of our model. We consider a word type to be a noun if its most common tag is a noun in our monolingual corpus.[11]

[11] We use the Tree Tagger (Schmid, 1994) for all POS tagging except for Arabic, where we use the tagger described in Diab et al. (2004).
For all language pairs except English-Arabic, we extract evaluation lexicons from the Wiktionary online dictionary. As we discuss in section 7, our extracted lexicons have low coverage, particularly for proper nouns, and thus all performance measures are (sometimes substantially) pessimistic. For English-Arabic, we extract a lexicon from 100k parallel sentences of UN parallel corpora by running the HMM intersected alignment model (Liang et al., 2008), adding $(s, t)$ to the lexicon if $s$ was aligned to $t$ at least three times and more than any other word.

Also, as in Koehn and Knight (2002), we make use of a seed lexicon, which consists of a small, and perhaps incorrect, set of initial translation pairs. We used two methods to derive a seed lexicon. The first is to use the evaluation lexicon $L_e$ and select the hundred most common noun word types in the source corpus which have translations in $L_e$. The second method is to heuristically induce, where applicable, a seed lexicon using edit distance, as is done in Koehn and Knight (2002). Section 6.2 compares the performance of these two methods.

4.3 Evaluation

We evaluate a proposed lexicon $L_p$ against the evaluation lexicon $L_e$ using the F1 measure in the standard fashion; precision is given by the fraction of proposed translations contained in the evaluation lexicon, and recall is given by the fraction of possible translation pairs proposed.[12] Since our model naturally produces lexicons in which each entry is associated with a weight based on the model, we can give a full precision/recall curve (see figure 3). We summarize these curves with both the best F1 over all possible thresholds and various precisions $p_x$ at recalls $x$. All reported numbers exclude evaluation on the seed lexicon entries, regardless of how those seeds are derived or whether they are correct.

[12] We should note that precision is not penalized for $(s, t)$ if $s$ does not have a translation in $L_e$, and recall is not penalized for failing to recover multiple translations of $s$.

[Figure 3 graphic: precision (vertical axis, 0.6 to 1) plotted against recall (horizontal axis, 0 to 0.8) for the EN-ES-P and EN-ES-W settings.]
Figure 3: Example precision/recall curve of our system on EN-ES-P and EN-ES-W settings. See section 6.1.

In all experiments, unless noted otherwise, we used a seed of size 100 obtained from $L_e$ and considered lexicons between the top $n = 2{,}000$ most frequent source and target noun word types which were not in the seed lexicon; each system proposed an already-ranked one-to-one translation lexicon amongst these $n$ words. Where applicable, we compare against the EDITDIST baseline, which solves a maximum bipartite matching problem where edge weights are normalized edit distances. We will use MCCA (for matching CCA) to denote our model using the optimal feature set (see section 5.3).

5 Features

In this section, we explore feature representations of word types in our model. Recall that $f_S(\cdot)$ and $f_T(\cdot)$ map source and target word types to vectors in $\mathbb{R}^{d_S}$ and $\mathbb{R}^{d_T}$, respectively (see section 2). The features used in each representation are defined identically and derived only from the appropriate monolingual corpora. For a concrete example of a word type to feature vector mapping, see figure 2.

5.1 Orthographic Features

For closely related languages, such as English and Spanish, translation pairs often share many orthographic features. One direct way to capture orthographic similarity between word pairs is edit distance. Running EDITDIST (see section 4.3) on EN-ES-W yielded 61.1 $p_{0.33}$, but precision quickly degrades for higher recall levels (see EDITDIST in table 1). Nevertheless, when available, orthographic clues are strong indicators of translation pairs.

Setting     p0.1   p0.25   p0.33   p0.50   Best-F1
EDITDIST    58.6   62.6    61.1    -       47.4
ORTHO       76.0   81.3    80.1    52.3    55.0
CONTEXT     91.1   81.3    80.2    65.3    58.0
MCCA        87.2   89.7    89.0    89.7    72.0
Table 1: Performance of EDITDIST and our model with various feature sets on EN-ES-W. See section 5.

We can represent orthographic features of a word type $w$ by assigning a feature to each substring of length $\leq 3$. Note that MCCA can learn regular orthographic correspondences between source and target words, which is something edit distance cannot capture (see table 5). Indeed, running our MCCA model with only orthographic features on EN-ES-W, labeled ORTHO in table 1, yielded 80.1 $p_{0.33}$, a 31% error-reduction over EDITDIST in $p_{0.33}$.

5.2 Context Features

While orthographic features are clearly effective for historically related language pairs, they are more limited for other language pairs, where we need to appeal to other clues. One non-orthographic clue that word types $s$ and $t$ form a translation pair is that there is a strong correlation between the source words used with $s$ and the target words used with $t$. To capture this information, we define context features for each word type $w$, consisting of counts of nouns which occur within a window of size 4 around $w$. Consider the translation pair (time, tiempo) illustrated in figure 2. As we become more confident about other translation pairs which have active period and periodico context features, we learn that translation pairs tend to jointly generate these features, which leads us to believe that time and tiempo might be generated by a common underlying concept vector (see section 2).[13]

[13] It is important to emphasize, however, that our current model does not directly relate a word type's role as a participant in the matching to that word's role as a context feature.

Using context features alone on EN-ES-W, our MCCA model (labeled CONTEXT in table 1) yielded 80.2 $p_{0.33}$. It is perhaps surprising that context features alone, without orthographic information, can yield a best-F1 comparable to EDITDIST.

5.3 Combining Features

We can of course combine context and orthographic features. Doing so yielded 89.0 $p_{0.33}$ (labeled MCCA in table 1); this represents a 46.4% error reduction in $p_{0.33}$ over the EDITDIST baseline. For the remainder of this work, we will use MCCA to refer to our model using both orthographic and context features.

6 Experiments

In this section we examine how system performance varies when crucial elements are altered.

(a) Corpus Variation
Setting     p0.1   p0.25   p0.33   p0.50   Best-F1
EN-ES-G     75.0   71.2    68.3    -       49.0
EN-ES-W     87.2   89.7    89.0    89.7    72.0
EN-ES-D     91.4   94.3    92.3    89.7    63.7
EN-ES-P     97.3   94.8    93.8    92.9    77.0

(b) Seed Lexicon Variation
Corpus      p0.1   p0.25   p0.33   p0.50   Best-F1
EDITDIST    58.6   62.6    61.1    -       47.4
MCCA        91.4   94.3    92.3    89.7    63.7
MCCA-AUTO   91.2   90.5    91.8    77.5    61.7

(c) Language Variation
Languages   p0.1   p0.25   p0.33   p0.50   Best-F1
EN-ES       91.4   94.3    92.3    89.7    63.7
EN-FR       94.5   89.1    88.3    78.6    61.9
EN-CH       60.1   39.3    26.8    -       30.8
EN-AR       70.0   50.0    31.1    -       33.1
Table 2: (a) varying type of corpora used on system performance (section 6.1), (b) using a heuristically chosen seed compared to one taken from the evaluation lexicon (section 6.2), (c) a variety of language pairs (see section 6.3).

6.1 Corpus Variation

There are many sources from which we can derive monolingual corpora, and MCCA performance depends on the degree of similarity between corpora. We explored the following levels of relationships between corpora, roughly in order of closest to most distant:

• Same Sentences: EN-ES-P
• Non-Parallel Similar Content: EN-ES-W
• Distinct Sentences, Same Domain: EN-ES-D
• Unrelated Corpora: EN-ES-G

Our results for all conditions are presented in table 2(a). The predominant trend is that system performance degraded when the corpora diverged in
content, presumably due to context features becoming less informative. However, it is notable that even in the most extreme case of disjoint corpora from different time periods and topics (e.g. EN-ES-G), we are still able to recover lexicons of reasonable accuracy.

6.2 Seed Lexicon Variation

All of our experiments so far have exploited a small seed lexicon which has been derived from the evaluation lexicon (see section 4.3). In order to explore system robustness to heuristically chosen seed lexicons, we automatically extracted a seed lexicon similarly to Koehn and Knight (2002): we ran EDITDIST on EN-ES-D and took the top 100 most confident translation pairs. Using this automatically derived seed lexicon, we ran our system on EN-ES-D as before, evaluating on the top 2,000 noun word types not included in the automatic lexicon.[14] Using the automated seed lexicon, and still evaluating against our Wiktionary lexicon, MCCA-AUTO yielded 91.8 $p_{0.33}$ (see table 2(b)), indicating that our system can produce lexicons of comparable accuracy with a heuristically chosen seed. We should note that this performance represents no knowledge given to the system in the form of gold seed lexicon entries.

[14] Note that the 2,000 words evaluated here were not identical to the words tested on when the seed lexicon is derived from the evaluation lexicon.

6.3 Language Variation

We also explored how system performance varies for language pairs other than English-Spanish. On English-French, for the disjoint EN-FR-D corpus (described in section 4.1), MCCA yielded 88.3 $p_{0.33}$ (see table 2(c) for more performance measures). This verified that our model can work for another closely related language pair on which no model development was performed.

One concern is how our system performs on language pairs where orthographic features are less applicable. Results on disjoint English-Chinese and English-Arabic are given as EN-CH-D and EN-AR in table 2(c), both using only context features. In these cases, MCCA yielded much lower precisions of 26.8 and 31.0 $p_{0.33}$, respectively. For both languages, performance degraded compared to EN-ES-D and EN-FR-D, presumably due in part to the lack of orthographic features. However, MCCA still achieved surprising precision at lower recall levels. For instance, at $p_{0.1}$, MCCA yielded 60.1 and 70.0 on Chinese and Arabic, respectively. Table 3 shows the highest-confidence outputs in several languages.

(a) English-Spanish
Rank   Source         Target         Correct
1.     education      educación      Y
2.     pacto          pact           Y
3.     stability      estabilidad    Y
6.     corruption     corrupción     Y
7.     tourism        turismo        Y
9.     organisation   organización   Y
10.    convenience    conveniencia   Y
11.    syria          siria          Y
12.    cooperation    cooperación    Y
14.    culture        cultura        Y
21.    protocol       protocolo      Y
23.    north          norte          Y
24.    health         salud          Y
25.    action         reacción       N

(b) English-French
Rank   Source         Target            Correct
3.     xenophobia     xénophobie        Y
4.     corruption     corruption        Y
5.     subsidiarity   subsidiarité      Y
6.     programme      programme-cadre   N
8.     traceability   traçabilité       Y

(c) English-Chinese (the Chinese target words were lost in text extraction)
Rank   Source       Target      Correct
1.     prices       [garbled]   Y
2.     network      [garbled]   Y
3.     population   [garbled]   Y
4.     reporter     [garbled]   N
5.     oil          [garbled]   Y
Table 3: Sample output from our (a) Spanish, (b) French, and (c) Chinese systems. We present the highest confidence system predictions, where the only editing done is to ignore predictions which consist of identical source and target words.

6.4 Comparison To Previous Work

There has been previous work in extracting translation pairs from non-parallel corpora (Rapp, 1995; Fung, 1995; Koehn and Knight, 2002), but generally not in as extreme a setting as the one considered here. Due to unavailability of data and specificity in experimental conditions and evaluations, it is not possible to perform exact comparisons. However, we attempted to run an experiment as similar as possible in setup to Koehn and Knight (2002), using English Gigaword and German Europarl. In this setting, our MCCA system yielded 61.7% accuracy on the 186 most confident predictions compared to 39% reported in Koehn and Knight (2002).

7 Analysis

We have presented a novel generative model for bilingual lexicon induction and presented results under a variety of data conditions (section 6.1) and languages (section 6.3) showing that our system can produce accurate lexicons even in highly adverse conditions. In this section, we broadly characterize and analyze the behavior of our system.

We manually examined the top 100 errors in the English-Spanish lexicon produced by our system on EN-ES-W. Of the top 100 errors: 21 were correct translations not contained in the Wiktionary lexicon (e.g. pintura to painting), 4 were purely morphological errors (e.g. airport to aeropuertos), 30 were semantically related (e.g. basketball to béisbol), 15 were words with strong orthographic similarities (e.g. coast to costas), and 30 were difficult to categorize and fell into none of these categories. Since many of our 'errors' actually represent valid translation pairs not contained in our extracted dictionary, we supplemented our evaluation lexicon with one automatically derived from 100k sentences of parallel Europarl data. We ran the intersected HMM word-alignment model (Liang et al., 2008) and added $(s, t)$ to the lexicon if $s$ was aligned to $t$ at least three times and more than any other word. Evaluating against the union of these lexicons yielded 98.0 $p_{0.33}$, a significant improvement over the 92.3 using only the Wiktionary lexicon. Of the true errors, the most common arose from semantically related words which had strong context feature correlations (see table 4(b)).

(a) Example Non-Cognate Pairs
health         salud
traceability   rastreabilidad
youth          juventud
report         informe
advantages     ventajas

(b) Interesting Incorrect Pairs
liberal       partido
Kirkhope      Gorsel
action        reacción
Albanians     Bosnia
a.m.          horas
Netherlands   Bretaña
Table 4: System analysis on EN-ES-W: (a) non-cognate pairs proposed by our system, (b) hand-selected representative errors.

We also explored the relationships our model learns between features of different languages. We projected each source and target feature into the shared canonical space, and for each projected source feature we examined the closest projected target features. In table 5(a), we present some of the orthographic feature relationships learned by our system. Many of these relationships correspond to phonological and morphological regularities such as the English suffix ogy mapping to the Spanish suffix gía. In table 5(b), we present context feature correspondences. Here, the broad trend is for words which are either translations or semantically related across languages to be close in canonical space.

(a) Orthographic Feature
Source Feat.   Closest Target Feats.   Example Translation
#st            #es, est                (statue, estatua)
ty#            ad#, d#                 (felicity, felicidad)
ogy            gía, gí                 (geology, geología)

(b) Context Feature
Source Feat.   Closest Context Features
party          partido, izquierda
democrat       socialistas, demócratas
beijing        pekín, kioto
Table 5: Hand selected examples of source and target features which are close in canonical space: (a) orthographic feature correspondences, (b) context features.

8 Conclusion

We have presented a generative model for bilingual lexicon induction based on probabilistic CCA. Our experiments show that high-precision translations can be mined without any access to parallel corpora. It remains to be seen how such lexicons can be best utilized, but they invite new approaches to the statistical translation of resource-poor languages.
References

Francis R. Bach and Michael I. Jordan. 2006. A probabilistic interpretation of canonical correlation analysis. Technical report, University of California, Berkeley.

Peter F. Brown, Stephen Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1994. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311.

Mona Diab, Kadri Hacioglu, and Daniel Jurafsky. 2004. Automatic tagging of Arabic text: From raw text to base phrase chunks. In HLT-NAACL.

Pascale Fung. 1995. Compiling bilingual lexicon entries from a non-parallel English-Chinese corpus. In Third Annual Workshop on Very Large Corpora.

Michel Galley, Jonathan Graehl, Kevin Knight, Daniel Marcu, Steve DeNeefe, Wei Wang, and Ignacio Thayer. 2006. Scalable inference and training of context-rich syntactic translation models. In COLING-ACL.

David R. Hardoon, Sandor Szedmak, and John Shawe-Taylor. 2003. Canonical correlation analysis: An overview with application to learning methods. Technical Report CSD-TR-03-02, Royal Holloway University of London.

Philipp Koehn and Kevin Knight. 2002. Learning a translation lexicon from monolingual corpora. In Proceedings of ACL Workshop on Unsupervised Lexical Acquisition.

Philipp Koehn. 2004. Pharaoh: A beam search decoder for phrase-based statistical machine translation models. In Proceedings of AMTA 2004.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In MT Summit.

H. W. Kuhn. 1955. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly.

P. Liang, D. Klein, and M. I. Jordan. 2008. Agreement-based learning. In NIPS.

Reinhard Rapp. 1995. Identifying word translation in non-parallel texts. In ACL.

Helmut Schmid. 1994. Probabilistic part-of-speech tagging using decision trees. In International Conference on New Methods in Language Processing.

N. Smith and J. Eisner. 2006. Annealing structural bias in multilingual weighted grammar induction. In ACL.

L. G. Valiant. 1979. The complexity of computing the permanent. Theoretical Computer Science, 8:189–201.

D. Yarowsky. 1995. Unsupervised word sense disambiguation rivaling supervised methods. In ACL.