Generalizing Local And Non Local Word Reordering Patterns For ...
Generalizing Local and Non-Local Word-Reordering Patterns for
Syntax-Based Machine Translation
Bing Zhao
Yaser Al-onaizan
IBM T.J. Watson Research
IBM T.J. Watson Research
Yorktown Heights, NY-10598
Yorktown Heights, NY-10598
zhaob@us.ibm.com
onaizan@us.ibm.com
Abstract
The idea of “syntactic cohesion” (Fox, 2002) is
characterized by its simplicity, which has attracted
Syntactic word reordering is essential for
researchers for years. Previous works include sev-
translations across different grammar struc-
eral approaches of incorporating syntactic informa-
tures between syntactically distant language-
pairs.
In this paper, we propose to em-
tion to preprocess the source sentences to make them
bed local and non-local word reordering de-
more like the target language in structure. Xia and
cisions in a synchronous context free gram-
McCord (2004) (Niessen and Ney, 2004; Collins et
mar, and leverages the grammar in a chart-
al., 2005) described approaches applied to language-
based decoder. Local word-reordering is ef-
pairs such as French-English and German-English.
fectively encoded in Hiero-like rules; whereas
Later, Wang et al. (2007) presented specific rules
non-local word-reordering, which allows for
to pre-order long-range movements of words, and
long-range movements of syntactic chunks,
is represented in tree-based reordering rules,
improved the translations for Chinese-to-English.
which contain variables correspond to source-
Overall, these works are similar, in that they design
side syntactic constituents. We demonstrate
a few language-specific and linguistically motivated
how these rules are learned from parallel cor-
reordering rules, which are generally simple. The
pora. Our proposed shallow Tree-to-String
eleven rules described in Wang et al. (2007) are ap-
rules show significant improvements in trans-
pealing, as they have rather simple structure, mod-
lation quality across different test sets.
eling only NP, VP and LCP via one-level sub-tree
structure with two children, in the source parse-tree
1 Introduction
(a special case of ITG (Wu, 1997)). It effectively en-
One of the main issues that a translator (human or
hances the quality of the phrase-based translation of
machine) must address during the translation pro-
Chinese-to-English. One major weakness is that the
cess is how to match the different word orders be-
reordering decisions were done in the preprocessing
tween the source language and the target language.
step, therefore rendering the decoding process un-
Different language-pairs require different levels of
able to recover the reordering errors from the rules if
word reordering. For example, when we translate
incorrectly applied to. Also the reordering decisions
between English and Spanish (or other Romance
are made without the benefits of additional models
languages), most of the word reordering needed
(e.g., the language models) that are typically used
is local because of the shared syntactical features
during decoding.
(e.g., Spanish noun modifier constructs are written
Another method to address the re-ordering prob-
in English as modifier noun). However, for syn-
lem in translation is the Hiero model proposed by
tactically distant language-pairs such as Chinese-
Chiang (2005), in which a probabilistic synchronous
English, long-range reordering is required where
context free grammar (PSCFG) was applied to guide
whole phrases are moved across the sentence.
the decoding. Hiero rules generalize phrase-pairs
572
Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 572–581,
Honolulu, October 2008. c 2008 Association for Computational Linguistics
by introducing a single generic nonterminal (i.e., a
were disambiguated with additional binary feature
variable) [X]. The combination of variables and lex-
functions, with their weights optimized in standard
icalized words in a Hiero rule nicely captures local
MER training. The combinatorial effects of the
word and phrase reordering (modeling an implicit
added feature functions can make the feature se-
reordering window of max-phrase length). These
lection and optimization of the weights rather dif-
rules are then applied in a CYK-style decoder. In
ficult. Since the grammar is essentially the same
Hiero rules, any nested phrase-pair can be general-
as the Hiero ones, a standard CYK decoder can be
ized as variables [X]. This usually leads to too many
simply applied in their work. Word reordering can
redundant translations, which worsens the spurious
also be addressed via distortion models. Work in
ambiguities (Chiang, 2005) problems for both de-
(Al-Onaizan and Kishore, 2006; Xiong et al., 2006;
coding and optimization (i.e., parameter tuning). We
Zens et al., 2004; Kumar and Byrne, 2005; Tillmann
found thatvariables (nonterminal [X]) in Hiero rules
and Zhang, 2005) modeled the limited information
offer a generalization too coarse to improve the ef-
available at phrase-boundaries. Syntax-based ap-
fectiveness of hierarchical models’ performance.
proaches such as (Yamada and Knight, 2001; Graehl
We propose to enrich the variables in Hiero rules
and Knight, 2004; Liu et al., 2006) heavily rely on
with additional source syntactic reordering informa-
the parse-tree to constrain the search space by as-
tion, in the form of shallow Tree-to-String syntactic
suming a strong mapping of structures across distant
structures. The syntactic information is represented
language-pairs. Their algorithms are also subject to
by flat one-level sub-tree structures, with Hiero-like
parsers’ performances to a larger extent, and have
nonterminal variables at the leaf nodes. The syntac-
high complexity and less scalability in reality. In Liu
tic rules, proposed in this paper, are composed of
et al. (2007), multi-level tree-structured rules were
(possibly lexicalized) source treelets and target sur-
designed, which made the decoding process very
face strings, with one or more variables that help
complex, and auxiliary rules have to be designed
capture local-reordering similar to the Hiero rules.
and incorporated to shrink multiple source nonter-
Variables in a given rule are derived not only from
minals into one target nonterminal. From our em-
the embedded aligned blocks (phrase-pairs), but also
pirical observations, most of the time, however, the
from the aligned source syntactic constituents. The
multi-level tree-structure is broken in the translation
aligned constituents, as in our empirical observa-
process, and POS tags are frequently distorted. In-
tions for Chinese-English, tend to move together in
deed, strictly following the source parse tree is usu-
translations. The decoder is guided by these rules to
ally not necessary, and maybe too expensive for the
reduce spurious derivations; the rules also constrain
translation process.
the exploration of the search space toward better
The remainder of this paper is structured as fol-
translation quality and sometime improved speed by
lows: in section § 2, we define the notations in our
breaking long sentences into pieces. Overall, what
synchronous context free grammar, in section § 3,
we want is to enable the long-range reordering deci-
the rule extractions are illustrated in details, in sec-
sions to be local in a chart-based decoder.
tion § 4, the decoding process of applying these rules
To be more specific, we think the simple shal-
is described. Experiments in § 5 were carried out
low syntactic structure is powerful enough for cap-
using GALE Dev07 datasets. Improved translation
turing the major structure-reordering patterns, such
qualities were obtained by applying the proposed
as NP, VP and LCP structures. We also use sim-
Tree-to-String rules. Conclusions and discussions
ple frequency-based feature functions, similar to the
are given in § 6.
blocks used in phrase-based decoder, to further im-
prove the rules’ representation power. Overall, this
2 Shallow Tree-to-String Rules
enables us to avoid either a complex decoding pro-
cess to generate the source parse tree, or difficult
Our proposed rules are in the form of probabilis-
combinatorial optimizations for the feature func-
tic synchronous context free grammar (PSCFG). We
tions associated with rules.
adopt the notations used in (Chiang, 2005). Let N
In Marton and Resnik (2008), hiero variables
be a set of nonterminals, a rule has the following
573
form:
those in Wang et al. (2007). If γ contains vari-
X →< ; γ; α; ∼; ¯
w >,
(1)
ables from POS tags, this essentially produces
a superset of the monolingual side POS-based
where X abstracts nonterminal symbols in N ; γ ∈
reordering rules explored in Tillmann (2008).
[N, VS]+ is a sequence of one or more source 1
words (as in the vocabulary of VS) and nonterminal
We focus on the third category — a syntactic label
symbols in N ; α ∈ [N, VT ]+ is a sequence of one
over the span of γ, indicating the covered source
or more target words (in VT ) and nonterminals in N
words consist of a linguistically well-defined phrase.
. ∼ is the one-to-one alignment of the nonterminals
together with γ define a tree-like structure: the root
between γ and α; ¯
w contains non-negative weights
node is , and the aligned children are nonterminals
associated with each rule; is a label-symbol speci-
in γ. The structure information is encoded in ( ,
fying the root node of the source span covering γ. In
γ) pair-wise connections, and the variables keep the
our grammar, is one of the labels (e.g., NP) defined
generalizations over atomic translation-pairs similar
in the source treebank tagset (in our case UPenn
to Hiero models. When the rule is applied during
Chinese tagset) indicating that the source span γ is
decoding time, the labels, the tree-structure and the
rooted at . Additionally, a NULL tag Ø in denotes
lexical items need to be all matched.
a flat structure of γ, in which no constituent structure
was found to cover the span, and we need to back
3 Learning and Applying Rules
off to the normal Hiero-style rules. Our nonterminal
symbols include the labels and the POS tags in the
A parser is assumed for the source language in the
source parse trees.
parallel data. In our case, a Chinese parser is applied
In the following, we will illustrate the Tree-to-
for training and test data. A word alignment model is
String rules we are proposing. At the same time, we
used to align the source words with the target words.
will describe the extraction algorithm, with which
we derive our rules from the word-aligned source-
3.1 Extractions
parsed parallel text. Our nonterminal set N is a re-
Our rule extraction is a three-step process. First, tra-
duced set of the treebank tagset (Xue et al., 2005). It
ditional blocks (phrase-pairs) extraction is carried
consists of 17 unique labels.
out. Secondly, Tree-to-String rules, are then ex-
The rules we extract belong to one of the follow-
tracted from the aligned blocks, of which the source
ing categories:
side is covered by a complete subtree, with different
• γ contains only words, and is NULL; this cor-
permutations of the embedded aligned constituents,
responds to the general blocks used in phrase-
or partially lexicalized constituents. Otherwise, the
based decoder (Och and Ney, 2004);
Hiero-like rules will be extracted when there is no
sub-tree structure identified, in our final step. Fre-
• γ contains words and variables of [X,0] and
quencies of extracted rules were counted to compute
[X,1], and
is NULL; this corresponds to the
feature functions.
Hiero rules as in Chiang (2005);
Figure 1-(a) shows that a subtree (with root at
VP) is aligned to the English string. Considering the
• γ contains words and variables in the form
huge quantity of all the permutations of the aligned
of [X,TAG2], in which TAG is from the LDC
constituents under the tree, only part of the Tree-to-
tagset; this defines a well formed subtree, in
String rules extracted are shown in Figure 1-(c). The
which at least one child (constituent) is aligned
variables incorporate linguistic information in the
to continuous target ngrams. If γ contains only
assigned tag by the parser. When there is no aligned
variables from LDC tag set, this indicates all
constituent for further generalization, the variables,
the constituents (children) in the subtree are
defined in our grammar, back off to the Hiero-like
aligned. This is a superset of rules generalizing
ones without any label-identity information. One
1we use end-user terminologies for source and target.
such example is in the rule “在 [X,0] 前 [X,VP] →
2we index the tags for multiple occurrences in one rule
[X,VP] before the [X,0]”, in which the Hiero-style
574
VP
before the sunrise
[X,PP] [X,VP]
[X,VP] [X,PP]
PP
VP
March
[X,PP]
March [X,PP]
sunrise
[X,VP]
[X,VP] before the sunrise
March before the sunrise
[X,0]
[X,VP]
[X,VP ] before the [X,0]
March before the sunrise
(a) Parse-Tree Alignment
(b) Blocks Alignment
(c) Tree-to-String rules with root of VP
Figure 1: Example rules extracted. (a) the aligned source parse tree with target string; (b) general blocks alignment;
(c) Tree-to-String rules, with root of VP. The tree structure is aligned with target strings
Translations of “
”:
IP
triggered a huge shock in the local
NP
VP
locally triggered an enormous
shock
PP
VP
DP
NP
P
NP
VV
NP
DT NN
ADJP
NP
NN
JJ
NN
This case
locally
triggered a huge a great shock
The case
in the local
great shocks
This case was
local
a huge shock
This
case
in
local
triggered
an enormous
shock
The
cases
In the
locally
trigger
tremendous
shocked
In the local
a huge
shocks
this
case
in
the locals
triggered
enormous
shock
Figure 2: Subtree of “VP(PP,VP)” triggered a reordering pattern of swapping the order of the two children PP and VP
in the source parse tree. This will move the translation “in the local” after the translation of “triggered a huge shock”,
to form the preferred translation in the highlighted cell: “triggered a huge shock in the local”.
variable [X,0] and the label-based variable [X,VP]
rule will generate a spontaneous word “is” from the
co-exist in our proposed rule.
given subtree structure. Usually, it is very hard to
We illustrate several special cases of our extracted
align the spontaneous word correctly, and the rules
Tree-to-String rules in the following. We index the
we proposed indicate that spontaneous words are
variables with their positions to indicate the align-
generated directly from the source sub-tree struc-
ment ∼, and skip the feature function ¯
w to simplify
ture, and they might not necessarily get aligned to
the notations.
some particular source words.
A second example is shown in Eqn. 3, which is
similar to the Hiero rules:
X
→< [X, IP ]; [X, N P 0] [X, V P 0];
(2)
[X, N P 0] is [X, V P 0] > .
X
→< Ø; [X, 0] zhiyi;
(3)
one of the [X, 0] > .
The rule in Eqn. 2 shows that a source tree rooted
at IP, with two children of NP and VP generalized
The rule in Eqn. 3 shows that when there is
into variables [X,NP] and [X,VP]; they are rewritten
no linguistically-motivated root covering the span,
into “[X,NP] is [X,VP]”, with the spontaneous word
([X,NULL] is then assigned), we simply back
is inserted. Such rules are not allowed in Hiero-style
off to the Hiero rules.
In this case, the source
models, as there is no lexical item between the two
span of [X, 0] zhiyi is rewritten into the target
variables (Chiang, 2005) in the source side. This
“one of the [X, 0]”, without considering the map-
575
ping of the root of the span. In this way, the repre-
4 Chart-based Decoder
sentation power is kept in the variables in our rules,
even if the source subtree is aligned to a discontin-
Given the source sentence, with constituent parse-
uous sequence on the target side. This is important
trees, the decoder is to find the best derivation D∗
for Chinese-to-English, because the grammar struc-
which yield the English string e∗:
ture is so different that more than 40% of the subtree
structures were not kept during the translation in our
study on hand-aligned data. Following strictly the
e∗ = arg max {φ(D)φ(e)φ(f |e)},
(5)
D∗
source side syntax will derail from these informative
translation patterns.
where φ(D) is the cost for each of the derivations
that lead to e from a given source-parsed f ; φ(e)
X
→< [X, N P ]; [X, N N 1][X, N N 2][X, N N 3];
is for cost functions from the standard n-gram lan-
[X, N N 3][X, N N 1][X, N N 2] > .
(4)
guage models; φ(f |e) is the cost for the standard
translation models, including general blocks. We
separate the costs for normal blocks and the general-
Eqn. 4. is a POS-based rule — a special case in
ized rules explicitly here, because the blocks contain
our proposed rules. This rule shows the reorder-
stronger lexical evidences observed directly from
ing patterns for three adjacent NN’s. POS based
data, and we assign them with less cost penalties
rules can be very informative for some language-
via a different weight factor visible for optimization,
pairs such as Arabic-to-English, where the ADJ is
and prefer the lexical match over the derived paths
usually moved before NN during the translations.
during the decoding.
As also shown in Eqn. 4 for POS sequences, in the
UPenn treebank-style parse trees, a root usually have
Our decoder is a chart-based parser with beam-
more than two variables. Our rule set for subtree,
search for each cell in a chart. Because the tree-
therefore, contain more than two variables: “X →<
structure can have more than two children, there-
[X, IP ]; [X, ADV P 0][X, N P 0][X, V P 0]; [X, N P 0] fore, the Tree-to-String rules extracted usually con-
[X, ADV P 0][X, V P 0] >”. A CYK-style decoder
tain more than two variables. Slightly different from
has to rely on binarization to preprocess the
the decoder in (Chiang, 2005), we implemented
grammar as did in (Zhang et al., 2006) to handle
the dotted-rule in Early-style parser to handle rules
multi-nonterminal rules. We adopt the so-called
containing more than two variables.
Our cube-
dotted-rule or dotted-production, similar to the
expansion, implemented the cube-pruning in Chiang
Early-style algorithm (Earley, 1970), to handle the
(2007), and integrated piece-wise cost computations
multi-nonterminal rules in our chart-based decoder.
for language models via LM states. The intermedi-
ate hypotheses were merged (recombined) accord-
3.2 Feature Functions
ing to their LM states and other cost model states.
As used in most of the SMT decoders for a phrase-
We use MER (Och, 2003) to tune the decoder’s pa-
pair, a set of standard feature functions are applied
rameters using a development data set.
in our decoder, including IBM Model-1 like scores
Figure 2 shows an example of a tree-based rule
in both directions, relative frequencies in both direc-
fired at the subtree of VP covering the highlighted
tions. In addition to these features, a counter is as-
cell. When a rule is applied at a certain cell in the
sociated to each rule to collect how many rules were
chart, the covered source ngram should match not
applied so far to generate a hypothesis. The stan-
only the lexical items in the rules, but also the tree-
dard Minimum Error Rate training (Och, 2003) was
structures as well. The two children under the sub-
applied to tune the weights for all feature types.
tree root VP are PP (“在当地”: in the local) and VP
The number of extracted rules from the GALE
(“引发巨大震动”: triggered a huge shock ). This
data is generally large. We pruned the rules accord-
rule triggered a swap of these children to generate
ing to their frequencies, and only keep at most the
the correct word order in the translation: “triggered
top-50 frequent candidates for each source side.
a huge shock in the local”.
576
5 Experiments
The NIST 2003 MT Evaluation (MT03) is used
as our development data set to tune the decoder’s
Our training data consists of two corpora: the GALE
parameters toward better BLEU score. The text part
Chinese-English parallel corpus and the LDC hand-
of GALE 2007 Chinese-to-English Development set
aligned corpus1. The Chinese side of these two cor-
(GALE DEV07) is used as our test set. MT03 con-
pora were parsed using a constituency parser (Luo,
sists of 919 sentences, whereas GALE DEV07 con-
2003). The average labeled F-measure of the parser
sists of 2303 sentences under two genres: NewsWire
is 81.4%.
and WebLog. Both have four human reference trans-
Parallel sentences were first word-aligned using
lations.
a MaxEnt aligner (Ittycheriah and Roukos, 2005).
Then, phrase-pairs that overlap with our develop-
5.2 Details of Extracted Rules
ment and test set were extracted from the word
From the hand-aligned data, the rules we extracted
alignments (from both hand alignments and auto-
fall into three categories: regular blocks (phrase-
matically aligned GALE corpora) based on the pro-
pairs), Hiero-like rules, and Tree-to-String rules.
jection principle (Tillmann, 2003). Besides the regu-
The statistics of the extracted rules are shown in Ta-
lar phrase-pairs, we also extracted the Tree-to-String
ble 2
rules from the two corpora. The detailed statistics
are shown in Table 1. Our re-implementation of Hi-
Table 2: Rules extracted from hand-aligned data
ero system is the baseline. We integrated the eleven
Types
Frequency
reordering rules described in (Wang et al., 2007),
in our chart-based decoder. In addition, we report
Block
846965
the results of using the Tree-to-String rules extracted
Hiero
508999
from the hand-aligned training data and the automat-
Tree-to-String
409767
ically aligned training data. We also report the result
Total
1765731
of our translation quality in terms of both BLEU (Pa-
pineni et al., 2002) and TER (Snover et al., 2006)
We focus on Tree-to-String rules. Table 3 shows
against four human reference translations.
the detailed statistics of the Tree-to-String rules ex-
5.1 The Data
tracted from the Chinese-to-English hand-aligned
training data. The following section provides a de-
Table 1 shows the statistics of our training, develop-
tailed analysis of the most frequent subtrees ob-
ment and test data. As our word aligner (Ittycheriah
served in our training data.
and Roukos, 2005) can introduce errors in extracting
Tree-to-String rules, we use a small hand-aligned
5.2.1 Frequent Subtrees: NP, VP, and DNP
data set “CE16K”, which consists of 16K sentence-
The majority of Tree-to-String rules we extracted
pairs, to get relatively clean rules, free from align-
are rooted at the following labels: NP (46%),
ment errors. A much larger GALE data set, which
VP(22.8%), DNP (2.23%), and QP(2.94%).
consists of 10 million sentence-pairs, is used to in-
Wang et al. (2007) covers only subtrees of NP,
vestigate the scalability of our proposed approach.
VP, and LCP, which are a subset of our proposed
Tree-to-String rules here. They apply these rules as
a pre-processing step to reorder the input sentences
Table 1: Training and Test Data
Train/test
sentences
src words
tgt words
with hard decisions. Our proposed Tree-to-String
rules, on the contrary, are applied during the de-
CE16K
16379
380103
477801
coding process which allows for considering many
GALE
10.5M
274M
310M
possible competing reordering options for the given
MT03
919
24099
-
sentences, and the decoder will choose the best one
Dev07
2303
61881
-
according to the cost functions.
Table 4 shows the statistics of reordering rules
1LDC2006E93
for subtrees rooted at VP. The statistics suggest that
577
Table 5: Hiero, Tree-Based (eleven rules in Wang et al. (2007)), and Tree-to-String Rules with “DE”
Ruleset
Root
Src
Tgt
Frequency
NULL
[X,0] 的 [X,1]
[X,0] ’s [X,1]
347
Hiero
NULL
[X,0] 的 [X,1]
[X,1] of [X,0]
306
NULL
[X,0] 的 [X,1]
[X,0] of [X,1]
174
NP
DNP(NP) NP
NP DNP(NP)
-
Tree-Based
NP
DNP(PP) NP
NP DNP(PP)
-
NP
DNP(LCP) NP
NP DNP(LCP)
-
[X,DNP]
[X,NP] [X,DEG]
[X,NP] [X,DEG]
580
Tree-to-String
[X,DNP]
[X,NP] [X,DEG]
[X,DEG] [X,NP]
2163
[X,DNP]
[X,NP] [X,DEG]
[X,NP] , [X,DEG]
4
curacies, the statistics we collected from the GALE
Table 3: Distributions of the NP, VP, QP, LCP rules
Root
Frequency
Percentage (%)
dataset, containing 10 million sentence-pairs, show
that the children in the subtree VP(PP,VP) is trans-
NP
189616
46.2
lated monotonically 126310 times, while reordered
VP
93535
22.8
of only 22144 times. However, the hand-aligned
IP
68341
16.6
data support the swap for 1245 times, and monotoni-
PP
18519
4.51
cally for only 168 times. Part of this disagreement is
DNP
9141
2.23
due to the word segmentation errors, incorrect word
QP
12064
2.94
alignments and unreliable parsing results.
LCP
4127
1.00
Another observations through our extracted Tree-
CP
2994
0.73
to-String rules is on the controlled insertion of the
PRN
2810
0.68
target spontaneous2 (function) words. Instead of hy-
DP
1415
0.34
pothesizing spontaneous words based only on the
Others
6879
1.67
language model or only on observing in phrase-
Total
409767
-
pairs, we make use of the Tree-to-String rules to get
suggestion on the insertion of spontaneous words.
In this way, we can make sure that the spontaneous
Table 4: Distribution of the reordering rules for subtrees
words are generated from the structure information,
rooted at VP: [X,VP]; [X,PP] [X,VP]; statistics are col-
as opposed to those from a pure hypothesis. The ad-
lected from GALE training data
vantage of this method is shown in Table 4. For in-
Root
Target
Frequency
stance, the word “that” and the punctuation “,” were
[X,PP] [X,VP]
126310
generated in the target side of the rule. This proves
[X,VP] [X,PP]
22144
that our model can provide a more principled way to
VP
[X,PP] , [X,VP]
1524
generate spontaneous words needed for fluent trans-
[X,PP] that [X,VP]
1098
lations.
[X,PP] and [X,VP]
831
5.2.2 DEG and DEC
An interesting linguistic phenomenon that we in-
it is impossible to come up with a reordering rule
vestigated is the Chinese word DE “的”. “的” is an
that is always applicable. For instance, (Wang et
informative lexical clue that indicates the need for
al., 2007) will always swap the children of the sub-
long range phrasal movements. Table 5 shows a few
tree VP(PP,VP). However, the statistics shown in Ta-
2Target spontaneous words are function words that do not
ble 4 suggest that might not be best way. In fact,
have specific lexical source informants and are needed to make
due to parser’s performance and word alignment ac-
the target translation fluent.
578
high-frequent reordering rules that contain the Chi-
larger GALE training set with about ten million
nese word “DE”.
sentence-pairs, we achieved significant improve-
The three type of rules handle “DE” differently. A
ments from both genres (newswire and web data).
major difference is the structure in the source side.
The improvements are significant in both BLEU
Hiero rules do not consider any structure, and ap-
and TER. BLEU improved from 32.44 to 33.51 on
ply the rule of “[X,0] 的 [X,1]”. Tree-based rules,
newswire, and from 25.88 to 27.91 on web data.
as described in Wang et al. (2007) do not handle
Similar improvements were found in TER as shown
的 directly; they are often implicitly taken care of
in the table. The gain came mostly from the richer
when reordering DNPs instead. Our proposed Tree-
extracted rule set, which not only presents robust
to-String rules model 的 directly in a subtree con-
statistics for reordering patterns, but also offers more
taining DEG/DEC, which triggers word reordering
target spontaneous words generated from the syntac-
within the structure. Our rule set includes all the
tic structures. Since the top-frequent rules extracted
above three rule-types with the associated frequen-
are NP, VP, and IP as shown in Table 3, our proposed
cies, this enriched the reordering choices to be cho-
rules will be able to win the correct word order with
sen by the chart-based decoder, guided by the statis-
reliable statistics, as long as the parser shows accept-
tics collected from the data and the language model
able performances on these structures. This is espe-
costs.
cially important for weblog data, where the parser’s
overall accuracy potentially might not be very good.
5.3 Evaluation
Table 7 shows the translations from different
We tuned the decoding parameters using the MT03
grammars for the same source sentence. Both Tree-
data set, and applied the updated parameters to the
based and Tree-to-String methods get the correct re-
GALE evaluation set. The eleven rules of VP, NP,
ordering, while the latter can suggest insertions of
and LCP (tree-based) improved the Hiero baseline3
target spontaneous words like “a” to allow the trans-
from 32.43 to 33.02 on BLEU. The reason, the tree-
lation to run more fluently.
reordering does not gain much over Hiero baseline,
is probably that the reordering patterns covered by
6 Conclusion and Discussions
tree-reordering rules, are potentially handled in the
standard Hiero grammar.
In this paper, we proposed our approach to model
A small but noticeable further improvement over
both local and non-local word-reordering in one
tree-based rules, from 33.02 to 33.26, was ob-
probabilistic synchronous CFG. Our current model
tained on applying Tree-to-String rules extracted
incorporates source-side syntactic information, to
from hand-aligned dataset. We think that the Tree-
model the observations that the source syntactic con-
based rules covers major reordering patterns for
stituent tends to move together during translations.
Chinese-English, and our hand-aligned dataset is
The proposed rule set generalizes over the variables
also too small to capture representative statistics and
in Hiero-rules, and we also showed the special cases
more reordering patterns. A close check at the rules
of the Tree-based rules and the POS-based rules.
we learned from the hand-aligned data shows that
Since the proposed rules has at most one-level tree
the tree-based rules are simply the subset of the
structure, they can be easily applied in a chart-based
rules extracted. The Tree-to-String grammar im-
decoder. We analyzed the statistics of our rules,
proved the Hiero baseline from 32.43 to 33.26 on
qualitatively and quantitatively. Next, we compared
BLEU; considering the effects from the tree-based
our work with other research, especially with the
rules only, the additional information improved the
work in Wang et al. (2007). Finally, we reported
BLEU scores from 33.02 to 33.26. Similar pictures
our empirical results on Chinese-English transla-
of improvements were observed for the two unseen
tions. Our Tree-to-String rules showed significant
tests of newswire and weblog in GALE data.
improvements over the Hiero baseline on the GALE
DEV07 test set.
When applying the rules extracted from the much
Given the low accuracy of the parsers, and the po-
3Hiero results are from our own re-implementation.
tential errors from Chinese word-segmentations, and
579
Table 6: Hiero, Tree-Based (NP, VP, LCP), and Tree-to-String rules extracted from hand-aligned data (H) or from
GALE training data (G)
MT03
GALE07-NewsWire
GALE07-Weblog
Setup
BLEUr4n4
TER
BLEUr4n4
TER
BLEUr4n4
TER
Hiero
32.43
59.75
31.68
61.45
25.99
65.65
Tree-based
33.02
59.84
32.22
61.46
25.67
65.64
Tree-to-String (H)
33.26
61.04
32.44
61.36
25.88
65.54
Tree-to-String (G)
35.51
57.28
33.51
59.71
27.91
62.88
Table 7: Hiero, Tree-Based (NP, VP, LCP), Tree-to-String Translations
Src-Sent
此案在当地引发巨大震动。
Hiero
in this case local triggered shock .
Tree-Based
the case triggered uproar in the local.
Tree-to-String
the case triggered a huge uproar in the local .
word-alignments, our rules learned are still noisy.
David Chiang. 2007. Hierarchical phrase-based transla-
Exploring better cost functions associate each rule
tion. In Computational Linguistics.
might lead to further improvement.
Because of
Michael Collins, Philipp Koehn, and Ivona Kucerova.
the relative high accuracy of English parsers, many
2005.
Clause restructuring for statistical machine
works such as Zollmann and Venugopal (2006) and
translation. In Proceedings of ACL.
Shen et al. (2008) emphasize on using syntax in tar-
Jay Earley. 1970. An efficient context-free parsing al-
get languages, to directly influence the fluency as-
gorithm. In Communications of the ACM., volume 13,
pect of the translation output. In future, we plan to
pages 94–102.
incorporate features from target-side syntactic infor-
Heidi J. Fox. 2002. Phrasal cohesion and statistical
mation, and connect them with the source informa-
machine translation. In Proc. of the Conference on
Empirical Methods in Natural Language Processing,
tion explored in this paper, to model long-distance
pages 304–311, Philadelphia, PA, July 6-7.
reordering for better translation quality.
Jonathan Graehl and Kevin Knight. 2004. Training tree
Acknowledgments
transducers. In Proc. NAACL-HLT.
Abraham Ittycheriah and Salim Roukos. 2005. A maxi-
The authors would like to thank the anonymous
mum entropy word aligner for arabic-english machine
reviewers for their comments to improve this pa-
translation. In HLT/EMNLP.
per. This work was supported by DARPA GALE
Shankar Kumar and William Byrne. 2005. Local phrase
program under the contract number HR0011-06-2-
reordering models for statistical machine translation.
0001.
In HLT/EMNLP 2005, Vancouver, B.C., Canada.
Yang Liu, Qun Liu, and Shouxun Lin. 2006. Tree-to-
string alignment template for statistical machine trans-
References
lation. In ACL-Coling.
Yaser Al-Onaizan and Papineni. Kishore. 2006. Distor-
Yang Liu, Yun Huang, Qun Liu, and Shouxun Lin. 2007.
tion models for statistical machine translation. In Pro-
Forest-to-string statistical translation rules. In 45th
ceedings of ACL-COLING, pages 529–536.
Annual Meeting of the Association for Computational
Linguistics.
David Chiang. 2005. A hierarchical phrase-based model
for statistical machine translation. In Proceedings of
Xiaoqiang Luo. 2003. A maximum entropy chinese
the 43rd Annual Meeting of the Association for Com-
character-based parser. In Proc. of ACL.
putational Linguistics (ACL’05), pages 263–270, Ann
Yuval Marton and Philip Resnik. 2008. Soft syntactic
Arbor, Michigan, June. Association for Computational
constraints for hierarchical phrased-based translation.
Linguistics.
In ACL.
580
Sonja Niessen and Hermann Ney.
2004.
Statistical
Deyi Xiong, Qun Liu, and Shouxun Lin. 2006. Maxi-
machine translation with scarce resources using mor-
mum entropy based phrase reordering model for sta-
phosyntactic information. In Computational Linguis-
tistical machine translation. In ACL-Coling.
tics.
Nianwen Xue, Fei Xia, Fu-Dong Chiou, and Martha
Franz J. Och and Hermann Ney. 2004. The alignment
Palmer. 2005. The penn chinese treebank: Phrase
template approach to statistical machine translation.
structure annotation of a large corpus. In Natural Lan-
In Computational Linguistics, volume 30, pages 417–
guage Engineering, volume 11, pages 207–238.
449.
K. Yamada and Kevin. Knight. 2001. Syntax-based Sta-
Franz J. Och. 2003. Minimum error rate training for
tistical Translation Model. In Proceedings of the Con-
statistical machine translation. In Proc. of the 41st
ference of the Association for Computational Linguis-
Annual Meeting of the Association for Computational
tics (ACL-2001).
Linguistics, Japan, Sapporo, July.
Richard Zens, E. Matusov, and Hermmann Ney. 2004.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-
Improved word alignment using a symmetric lexicon
Jing Zhu. 2002. Bleu: a method for automatic evalu-
model. In Proceedings of the 20th International Con-
ation of machine translation. In Proc. of the 40th An-
ference on Computational Linguistics (CoLing 2004),
nual Conf. of the Association for Computational Lin-
pages 36–42, Geneva, Switzerland, Auguest.
guistics (ACL 02), pages 311–318, Philadelphia, PA,
Hao Zhang, Liang Huang, Daniel Gildea, and Kevin
July.
Knight. 2006. Synchronous binarization for machine
translation. In Proceedings of the HLT-NAACL.
Libin Shen, Jinxi Xu, and Ralph Weischedel. 2008. A
Andreas Zollmann and Ashish Venugopal. 2006. Syn-
new string-to-dependency machine translation algo-
tax augmented machine translation via chart parsing.
rithm with a target dependency language model. In
In Proc. of NAACL 2006 - Workshop on statistical ma-
Proceedings of ACL.
chine translation.
Matthew Snover, Bonnie Dorr, Richard Schwartz, Lin-
nea Micciulla, and John Makhoul. 2006. A study of
translation edit rate with targeted human annotation.
In AMTA.
Christoph Tillmann and Tong Zhang. 2005. A localized
prediction model for statistical machine translation. In
Proceedings of the 43rd Annual Meeting of the Asso-
ciation for Computational Linguistics (ACL’05), pages
557–564, Ann Arbor, Michigan, June. Association for
Computational Linguistics.
Christoph Tillmann. 2003. A projection extension algo-
rithm for statistical machine translation. In Proc. of
the Conference on Empirical Methods in Natural Lan-
guage Processing.
Christoph Tillmann. 2008. A rule-driven dynamic pro-
gramming decoder for statistical mt. In HLT Second
Workshop on Syntax and Structure in Statistical Trans-
lation.
Chao Wang, Michael Collins, and Phillip Koehn. 2007.
Chinese syntactic reordering for statistical machine
translation. In proceedings of EMNLP.
Dekai Wu.
1997.
Stochastic inversion transduction
grammars and bilingual parsing of parallel corpora. In
Computational Linguistics, volume 23(3), pages 377–
403.
Fei Xia and Michael McCord. 2004. Improving a sta-
tistical mt system with automatically learned rewrite
patterns.
In the 20th International Conference on
Computational Linguistics (COLING 2004), Geneva,
Switzerland, Aug 22-29.
581