

Slovene Word Sketches

Simon Krek,* Adam Kilgarriff**

* Faculty of Arts
University of Ljubljana
Ljubljana, Slovenia
simon.krek@guest.arnes.si

** Lexical Computing Ltd
Brighton, United Kingdom
adam@lexmasterclass.com

Abstract

Word sketches are one-page, automatic, corpus-based summaries of a word's grammatical and collocational behaviour. They were first
used in the production of the Macmillan English Dictionary (Rundell 2002). At that point, they existed only for English. Today, the
Sketch Engine is available: a corpus tool which takes as input a corpus of any language and corresponding grammar patterns and
generates word sketches for the words of that language. It also automatically generates a thesaurus and 'sketch differences',
which specify similarities and differences between near-synonyms. The FidaPLUS corpus, a morpho-syntactically tagged corpus of
Slovene, was loaded into the Sketch Engine software. We demonstrate the Slovene word sketches and show how they can be
used in lexicography and for other linguistic purposes. The results show that word sketches could significantly facilitate lexicographic
work in Slovene, as they have for English.

Word Sketches in Slovene

Word sketches are automatic, corpus-based summaries of a word's grammatical and collocational behaviour. They were first
used in compiling the monolingual English dictionary published by Macmillan (Rundell 2002). At that time they existed only for
English. The Sketch Engine software module is now available, a corpus tool which takes as input a corpus of any language together
with its grammar patterns and produces word sketches for the words of that language. It also automatically generates a thesaurus
and "sketch differences", which highlight similarities and differences between near-synonyms. We loaded the FidaPLUS corpus,
a morphosyntactically tagged corpus of Slovene, into the Sketch Engine. We shall present Slovene word sketches and show how
they can be used for lexicographic and other linguistic purposes. The results show that word sketches considerably ease the work
of lexicographers of Slovene, as has proved to be the case for English.

1. Introduction

Word sketches are one-page, automatic, corpus-based summaries of a word's grammatical and collocational behaviour. Their value for lexicographic work in English and other languages, as well as the background to the use of corpora in lexicography, have been described elsewhere (Kilgarriff and Tugwell 2001, Kilgarriff and Rundell 2002, Kilgarriff et al. 2004).

First, we shall introduce corpus query systems and the basic idea of word sketches. Next, we shall concentrate on the application of word sketches to the Slovene language in the Sketch Engine software.

The FidaPLUS corpus of Slovene will also be briefly described, with special attention to the tagging problems which could affect its use within the Sketch Engine.

2. Word sketches

2.1. Corpus query systems

Different corpus query systems (CQSs) have been used to check the corpus evidence since the rise of the first electronic corpora. Ever since the COBUILD project, lexicographers have been using KWIC (keyword-in-context) concordances as their primary tool for finding out how a word behaves. Later, with the growth of corpora, lexical statistics had to be applied to manage the abundant data and highlight the most salient combinations and collocations. Today, state-of-the-art CQSs allow the lexicographer great flexibility in searching for phrases, collocates and grammatical patterns, sorting concordances according to a wide range of criteria, and identifying 'subcorpora' for searching in only spoken text, or only fiction. Available systems include WordSmith, MonoConc and the Stuttgart Workbench, among others.

Specifically for the two large Slovene corpora, there are also two different on-line concordancers available: ASP32 for the FidaPLUS corpus (http://www.fidaplus.net) and NEVA for Nova beseda (http://bos.zrc-sazu.si/s_beseda.html), with a more detailed description available in Krek (2003).

2.2. Sketch Engine

2.2.1. Description

The Sketch Engine is a corpus query system which allows the user to use the familiar CQS functions:

– concordances with lemma, phrase, word form and CQL search, such as word sketches, thesaurally similar words, and also together with the context control filter
– and the usual viewing and sorting options.

However, the features of the Sketch Engine which are of special interest in this article are not part of standard concordancing programs. These features include Word Sketch, Sketch Difference and Thesaurus, which will be described later. All these features are fully integrated with standard concordancing.

2.2.2. Word Sketch

To identify a word's grammatical and collocational behaviour, the Sketch Engine needs to know how to find words connected by a grammatical relation. It allows two possibilities.

In the first, the input corpus has been parsed and the information about which word-instances stand in which grammatical relations with which other word-instances is embedded in the corpus. Currently, dependency-based syntactically annotated corpora are supported. Phrase-structured trees need heads of phrases to be marked.

In the second, the input corpus is loaded into the Sketch Engine POS-tagged but not parsed, and the Sketch Engine supports the process of identifying grammatical relation instances. Each grammatical relation is defined, using the Sketch Engine to test and develop it. When the developer is happy with the definition of each grammatical relation, they save the definitions in a “gramrel” file. The Sketch Engine then compiles this file and finds all instances of all grammatical relations in the corpus. It puts them in a gramrels database, and users then have access to word sketches.

2.2.3. Lemmatization & POS-tagging

The Sketch Engine does not support the process of lemmatization; various tools are available for linguists to develop lemmatizers, and lemmatizers are available for a number of languages. If no lemmatizer is available, it is possible to apply the Sketch Engine to word forms, which, while not optimal, will still be a useful lexicographic tool.

The same applies to part-of-speech (POS) tagging, also known as POS-disambiguation. This is the task of deciding the correct word class for each word in the corpus, for example of determining whether an occurrence of "brez" in Slovene is an occurrence of the noun "breza" in the plural, genitive case, or a preposition. A tagger presupposes a linguistic analysis of the language which has given rise to a set of the syntactic categories of the language, or tagset. Tagsets and taggers exist for a number of languages, and there are assorted well-tried methods for developing taggers. The Sketch Engine assumes tagged input.

As the FidaPLUS corpus is both lemmatized and POS-tagged but not syntactically annotated, Slovene word sketches are based on a lemmatized and POS-tagged corpus, with grammatical relations defined on the basis of POS-tag information.

2.3. Grammatical relations

Grammatical relations are defined as regular expressions over POS-tags. For example, if we wish to include the grammatical relation between a noun and its adjectives in modifying position, we define the head of the noun phrase, a noun ("S" in the FidaPLUS tagset), and one or more preceding adjectives ("P"), with the possibility of allowing the intervening comma and the particles "se" and "si":

    =a_modifier/modifies
    2: [tag="P.*"] [tag="P.*" | word="," | word="se" | word="si"] {0,5} 1: [tag="S.*"]

The first line, following the =, gives two names for the grammatical relation. The first, before the slash, is the name when the arguments are in the one order, and the other is the name when the arguments are in the other.

The 1: and 2: mark the words to be extracted as the first and second arguments. |, ., (), and * are standard regular expression metacharacters. {0,5} indicates that the preceding term occurs between zero and five times.

3. Slovene Word Sketches

3.1. Slovene Corpus

3.1.1. FIDA corpus

The FIDA corpus is the precursor of the FidaPLUS corpus which was used in the Sketch Engine software. It was compiled in a joint project involving four partners, two from the academic/research sphere (the Faculty of Arts, University of Ljubljana, and the Jožef Stefan Institute) and two commercial ones (DZS publishing house and Amebis software company). Corpus compilation started in 1997 and was concluded in 2000. The corpus was just over 100 million words and was a balanced corpus of texts in the Slovene language, mainly from the 1990s.

The corpus was lemmatized and POS-tagged but the process was limited to the lexicon of word forms available at Amebis at the time. The disambiguation of multiple possible morphosyntactic descriptions (MSDs) for ambiguous word forms such as "brez" was not performed, a considerable drawback when using the corpus for automatic linguistic analysis.

3.1.2. FidaPLUS corpus

The problems of lemmatization and POS-tagging, together with the size, balance and up-to-dateness, were addressed in the subsequent project, "Language Resources for Slovene", funded by the Slovene Ministry of Higher Education, Science and Technology and co-funded by DZS and Amebis. Project partners included the Faculty of Arts (University of Ljubljana) as the leading partner, the Faculty of Social Sciences (University of Ljubljana) and the Jožef Stefan Institute. Its aim was a three hundred million word corpus with complete lemmatization and POS-tagging.

The FidaPLUS corpus used for testing in the Sketch Engine is the preliminary result of the project. In terms of size it is similar to the FIDA corpus, but the lemmatization and POS-tagging have been improved. Lemmatization is both lexicon-based and statistical, aiming at lemmatization of all items in the corpus. POS-disambiguation uses the tools developed by Amebis.

3.2. Slovene grammatical relations

The Slovene "gramrel" file was based on the Czech example (Kilgarriff et al. 2004), since Czech, like Slovene but unlike English, is a relatively free word order language.

The grammatical relations in the Slovene gramrel file include three types: symmetric, between two items with equal status; dual, between two items with dependent relations; and trinary, between three dependent items.

3.2.1. Symmetric Example

One example of the symmetric relation is various coordinate structures with conjunctions "and" or "or", as well as two-word coordinate structures such as "niti-niti", "ali-ali".

    =coord
    *SYMMETRIC
    1:[] [word = "in" | word = "ali"] 2:[]
    [word = "niti"] 1:[] [word = "niti"] 2:[]
    [word = "ali"] 1:[] [word = "ali"] 2:[]
    [word = "bodisi"] 1:[] [word = "bodisi"] 2:[]
    [word = "tako"] 1:[] [word = "kakor"] 2:[]
    [word = "tako"] 1:[] [word = "kot"] 2:[]

The result of this grammatical relation can be viewed as part of the word sketch. The result shows that in the FidaPLUS corpus, 7334 instances of this particular grammatical relation can be found for the lemma "čas". Lemmas are ranked according to the salience score (Kilgarriff and Tugwell 2001). The user can click on the number next to a lemma to see the relevant concordance. We used four symmetric relations.

3.2.2. Dual Example

Dual relations are the most common in the gramrel file. There are eleven of them, covering relations expressed by means of grammatical case in Slovene as well as modifying structures as shown before. The corresponding part of the word sketch for the lemma "glava" is shown on the left.

Relations covering grammatical cases are defined in the following fashion:

    =is_obj4_of/has_obj4
    *DUAL
    2:[tag="Gpp.*" & !(lemma = "biti" | lemma = "imeti" | lemma = "hoteti" | lemma = "morati" | lemma = "smeti") ] [tag!="[SGDVLMOZ].*" & tag!=""] {0,5} 1: [tag="S...t.*"]
    2:[tag="G.d.*" & !(lemma = "biti" | lemma = "imeti" | lemma = "hoteti" | lemma = "morati" | lemma = "smeti") ] [tag!="[SGDVLMOZ].*" & tag!=""] {0,5} 1:[tag="S...t.*"]

There are two variants of the particular relation: either a verb has an object in the oblique case or the noun is itself an object in the same case, in relation to a verb. The example on the left shows a list of verbs where the lemma "glava" is predominantly used in the oblique case within a window of five items from a verb. All the verbs from the beginning of the list indicate structures which are lexicographically relevant because of their either central or additional metaphorical meaning. Thus the concordances of the structure "skloniti glavo" show that besides the literal meaning "to bow one's head", there are many examples of the metaphorical extension "to give up" or "to concede defeat". The next one indicates the structure "beliti si glavo", which is thoroughly idiomatic: "to worry about, to agonize over". The same is true for "razbijati si glavo", "tiščati glave (skupaj)", "stakniti glave" etc.
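
The first variant of the is_obj4_of/has_obj4 definition can be read as a small matching procedure over a tagged sentence. The following Python sketch imitates it under several assumptions: the Token type, the find_obj4 helper and all example tags (such as "Gppspe" or "Sozet") are invented for illustration and are not taken from the FidaPLUS tagset documentation, and the patterns are assumed to match the whole tag. It is not the Sketch Engine's own matcher, which the text does not describe.

```python
import re
from typing import List, NamedTuple, Tuple


class Token(NamedTuple):
    word: str
    lemma: str
    tag: str  # hypothetical tags; only the leading letters follow the paper


# Patterns copied from the first variant of the gramrel definition.
VERB_TAG = re.compile(r"Gpp.*")
BLOCKING_TAG = re.compile(r"[SGDVLMOZ].*")
OBJECT_TAG = re.compile(r"S...t.*")
EXCLUDED_LEMMAS = {"biti", "imeti", "hoteti", "morati", "smeti"}
MAX_GAP = 5  # the {0,5} quantifier


def find_obj4(tokens: List[Token]) -> List[Tuple[str, str]]:
    """Return (noun lemma, verb lemma) pairs matched by the pattern."""
    pairs = []
    for i, verb in enumerate(tokens):
        if not VERB_TAG.fullmatch(verb.tag) or verb.lemma in EXCLUDED_LEMMAS:
            continue
        # Look right of the verb, allowing at most MAX_GAP intervening tokens.
        for gap in range(MAX_GAP + 1):
            j = i + 1 + gap
            if j >= len(tokens):
                break
            candidate = tokens[j]
            if OBJECT_TAG.fullmatch(candidate.tag):
                pairs.append((candidate.lemma, verb.lemma))
                break
            # An empty tag or a tag from the blocking classes ends the search.
            if not candidate.tag or BLOCKING_TAG.fullmatch(candidate.tag):
                break
    return pairs


# Hypothetical tagged sentence: verb, adjective, noun.
sentence = [
    Token("Sklonil", "skloniti", "Gppspe"),
    Token("utrujeno", "utrujen", "Ppnzet"),
    Token("glavo", "glava", "Sozet"),
]
print(find_obj4(sentence))  # [('glava', 'skloniti')]
```

Counting the returned pairs over a whole corpus would give the raw frequencies from which a salience ranking could then be computed.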

3.2.3. Trinary Example

Trinary relations indicate the relations between three grammatical categories. In the Slovene gramrel file, they are mainly used to extract prepositional patterns where the grammatical case – in Slovene the instrumental and locative cases – is expressed by means of prepositional phrases.

    *TRINARY
    =prec_%s
    2:[tag="S.*"] 3:[tag="D.*"] [tag="P.*" | word= "," | word = "se" | word = "si"] {0,5} 1:[tag="S.*"]
    2:[tag="G.*"] 3:[tag="D.*"] [tag="P.*" | word= "," | word= "se" | word="si"] {0,5} 1:[tag="S.*"]

In the case shown on the left, the grammatical relation is established between the lemma "glava" preceded by the preposition "po", and the "glava" word sketch indicates salient combinations with verbs on the left. Again, together with the frequent but semantically transparent combinations there are numerous idiomatic expressions such as "rojiti/motati/poditi po glavi" and the more informal "srati po glavi".

3.3. Sketch Differences

The sketch differences feature in the Sketch Engine specifies, for two semantically related words, what behaviour they share and how they differ. Synonymous words tend to share some of their collocates but not all. The sketch differences show the patterns which are shared by both synonyms and also present the information in a colour scheme, so that the user can grasp immediately if and where the lemmas are synonymous. For the Slovene language, this is particularly useful in cases where there are two competing synonyms, one etymologically foreign and the other of Slavic origin. The more normatively-minded usually argue for abolition of the foreign lemma and non-discriminatory use of the Slavic form. The example of "cona" and "območje" in Appendix 1 shows the differences. In the FidaPLUS corpus, only "operativen" is distributed evenly between the two synonyms. A milder bias towards "območje" is indicated in the cases of "demilitariziran" and "turističen" and a stronger one with "zaprt" and "obmejen". The opposite is true with more fixed "erogena cona", "obrtna cona", "industrijska cona" etc. and less fixed "carinska cona / carinsko območje", "tamponska cona / tamponsko območje", also "tamponski", "brezcarinski", "siv" etc.

3.4. Thesaurus

The similarity is based on ‘shared triples’. "Cona" and "območje" both occur as the second term in the triple <modifier, ?, “tamponska”>, and this provides one small piece of evidence that the two words are close in meaning. By simply gathering together all such pieces of evidence (and weighting them according to salience, following the method developed by Lin (1998)), we identify the near neighbours for each. The Sketch Engine does this, and the result for the lemma "kriza" can be seen in Appendix 2.

As there is no thesaurus available for the Slovene language, it is not possible to compare it to the human assessment of the word's synonymic relations, but it is immediately clear that the software shows a number of relevant items such as "konflikt", "spor", "spopad" etc., indicating one semantic direction, "problem", "težava", "zaplet" etc., indicating another, and "stiska", "izguba" indicating a more intimate human sentiment.

One can explore each of the relations with the sketch differences feature.

4. Conclusion and further work

Testing of the 100-million-word FidaPLUS corpus in the Sketch Engine has shown it to be an exceptionally useful tool for exploring typical grammatical and lexical relations in the Slovene language. To be able to take full advantage of the software, it is important to have a corpus which is lemmatized and POS-tagged as accurately as possible, and that is one area where there is room for improvement. We would like to further explore Slovene grammatical relations and their implementation in the gramrel file, and also the possibility of a Slovene dependency parser.

However, even in its present form the Sketch Engine is a valuable tool, particularly for lexicographic use.

5. References

Kilgarriff, A., Tugwell, D. (2001). WORD SKETCH: Extraction and Display of Significant Collocations for Lexicography. Proc. ACL workshop on COLLOCATION: Computational Extraction, Analysis and Exploitation. Toulouse. 32-38.
Kilgarriff, A., Rundell, M. (2002). Lexical profiling software and its lexicographic applications - a case study. Proc. EURALEX. Copenhagen. 807-818.
Kilgarriff, A., Rychly, P., Smrž, P., Tugwell, D. (2004). The Sketch Engine. Proc. EURALEX. Lorient, France. 105-116.
Lin, D. (1998). Automatic retrieval and clustering of similar words. COLING-ACL, Montreal. 768-774.
Krek, S. (2003). Jezikovni priročniki in novi mediji. Jezik in slovstvo, letn. 48, št. 3-4, 29-46.
Rundell, M. (ed) (2001). Macmillan English Dictionary for Advanced Learners. Macmillan Education.

Appendix 1: Sketch difference – lemma_1 “cona”, lemma_2 “območje”

[Figure: sketch difference table for "cona" and "območje"; collocates are shaded on a seven-step colour scale: green, light green, extra light green, white, extra light red, light red, red.]
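
The colour scale of the sketch difference can be illustrated with a minimal Python sketch. Everything numeric here is an assumption: the text does not say how the Sketch Engine assigns colours, so the cut-off values, the use of raw frequency share instead of salience, the assignment of green to the first lemma, and the example counts are all hypothetical. Only the seven colour names and the collocates come from the paper.

```python
from typing import Optional

# Seven-step scale named in the figure; the cut-off values are invented.
BANDS = [
    (0.90, "green"),
    (0.75, "light green"),
    (0.60, "extra light green"),
    (0.40, "white"),
    (0.25, "extra light red"),
    (0.10, "light red"),
]


def colour(count_first: int, count_second: int) -> Optional[str]:
    """Map the first lemma's share of a collocate to a colour name."""
    total = count_first + count_second
    if total == 0:
        return None  # the collocate occurs with neither lemma
    share = count_first / total
    for threshold, name in BANDS:
        if share >= threshold:
            return name
    return "red"


# Hypothetical counts: (with "cona", with "območje").
counts = {"operativen": (20, 20), "erogen": (30, 0), "obmejen": (1, 40)}
for collocate, (with_cona, with_obmocje) in counts.items():
    print(collocate, colour(with_cona, with_obmocje))
```

Under these assumed thresholds an evenly distributed collocate such as "operativen" comes out white, while collocates strongly tied to one lemma fall at either end of the scale.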



Appendix 2: Thesaurus – lemma “kriza”
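
The 'shared triples' idea behind the thesaurus can be sketched in Python as follows. This is a Lin-style measure written under stated assumptions and not the Sketch Engine's implementation: each lemma is represented by a profile of (relation, collocate) features with a salience weight, the weights are supplied by hand, and all weights as well as the "gospodarski" feature are hypothetical. The other collocates are taken from the "cona" / "območje" discussion.

```python
from typing import Dict, List, Tuple

Feature = Tuple[str, str]       # (grammatical relation, collocate lemma)
Profile = Dict[Feature, float]  # feature -> salience weight (hypothetical)


def similarity(a: Profile, b: Profile) -> float:
    """Weight of the shared features divided by the weight of all features."""
    shared = a.keys() & b.keys()
    numerator = sum(a[f] + b[f] for f in shared)
    denominator = sum(a.values()) + sum(b.values())
    return numerator / denominator if denominator else 0.0


def neighbours(target: str, profiles: Dict[str, Profile]) -> List[Tuple[str, float]]:
    """Rank every other lemma by its similarity to the target lemma."""
    scores = [
        (lemma, similarity(profiles[target], profile))
        for lemma, profile in profiles.items()
        if lemma != target
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)


profiles = {
    "cona": {
        ("modifier", "tamponski"): 2.0,
        ("modifier", "carinski"): 2.0,
        ("modifier", "erogen"): 4.0,
    },
    "območje": {
        ("modifier", "tamponski"): 1.0,
        ("modifier", "carinski"): 3.0,
        ("modifier", "obmejen"): 4.0,
    },
    "kriza": {("modifier", "gospodarski"): 5.0},
}
print(neighbours("cona", profiles))  # [('območje', 0.5), ('kriza', 0.0)]
```

Each shared feature adds a little evidence, so lemmas with many salient collocates in common rise to the top of the neighbour list.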