Szeged Treebank 2.0
A Hungarian natural language database with detailed syntactic analysis


Introduction
In Hungarian, as in many other languages, the role of morphemes and syntagmas in sentences and their relation to one another is of key importance. Syntactic analysis and annotation, that is, the marking of different syntactic units (e.g. noun or adjectival phrases, postpositional structures, verbs and their arguments), makes these roles and relations explicit. A treebank representation that describes the syntactic structure of sentences already exists for most Western European languages and for a number of Central and Eastern European languages, so it is time to create a precisely analyzed Hungarian treebank as well.

We relied on known sources and existing theories when building the treebank. After studying and comparing them, our linguistic experts developed a consistent system of syntactic rules. An automatic pre-annotation module marked the defined syntactic units in the texts of Szeged Corpus 2.0; linguistic experts then checked and corrected the marked structures. Szeged Treebank 2.0 is based on the first version of the treebank, so it retains the NP and CP annotations of that version.

The database formed in this way provides a reliable basis for the development of different computer applications. Identifying the marked syntagmas and their relationships supports further linguistic processing, including the semantic analysis of texts. We marked syntactic structures on the 82,000 sentences (1.2 million word entries and 250 thousand punctuation marks) of Szeged Corpus 2.0. Treebank files are stored in XML format; their inner structure is described by a TEI P4 DTD (Document Type Definition).

Texts of Szeged Treebank 2.0

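The text files of Szeged Treebank 2.0 correspond to those of Szeged Corpus 2.0. Texts from six different topics were selected, with each topic containing roughly 200,000 words. The topics are as follows: short essays, legal texts, newspaper articles, business news, fiction, and computer technology.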

Further information on the types and sizes of texts is available in the description of Szeged Corpus 2.0.

The formation process of Szeged Treebank 2.0

1. Preliminaries
Szeged Treebank 2.0 is based on Szeged Corpus 2.0, which describes sentences in the following way. Each sentence is surrounded by <s> and </s> XML tags. Within these tags, the entire text of the sentence comes first; then the words and punctuation marks of the sentence are listed between <w> and </w> and between <c> and </c> tags, respectively. Within a <w> element, the actual word form comes first, followed by all possible morpho-syntactic (MSD) codes of the word form, together with their stems, between <anav> tags. The MSD code selected from the possible codes on the basis of the context is always given, together with its stem, in <ana> tags.
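The layout described above can be read with a few lines of Python. The following is a minimal sketch, not the project's own tooling: the element names <s>, <w>, <c>, <anav> and <ana> are taken from the description, while the inner <msd>/<lemma>/<code> layout, the sample sentence and its MSD strings are assumptions made only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample. <s>, <w>, <c>, <anav> and <ana> come from the description
# above; the inner <msd>/<lemma>/<code> layout and the MSD strings are assumptions.
SAMPLE = """<s>Ági alszik.
<w>Ági
<anav><msd><lemma>Ági</lemma><code>Np-sn</code></msd></anav>
<ana><msd><lemma>Ági</lemma><code>Np-sn</code></msd></ana>
</w>
<w>alszik
<anav><msd><lemma>alszik</lemma><code>Vmip3s---n</code></msd></anav>
<ana><msd><lemma>alszik</lemma><code>Vmip3s---n</code></msd></ana>
</w>
<c>.</c>
</s>"""


def read_sentence(s_elem):
    """Return the raw sentence text and one (kind, form, stem, msd) tuple per token."""
    text = (s_elem.text or "").strip()
    tokens = []
    for child in s_elem:
        form = (child.text or "").strip()
        if child.tag == "w":
            chosen = child.find("ana/msd")  # the analysis selected in context
            tokens.append(("w", form, chosen.findtext("lemma"), chosen.findtext("code")))
        elif child.tag == "c":
            tokens.append(("c", form, None, None))
    return text, tokens


text, tokens = read_sentence(ET.fromstring(SAMPLE))
print(text)
for token in tokens:
    print(token)
```

If the real files nest the stem and the MSD code differently, only the two findtext calls need to change.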
We used the internationally accepted NP (noun phrase) and CP (clausal phrase) tags for the partial syntactic annotation of the texts. It was a natural choice to build the coding of syntactic structure on the strings of <w> and <c> tags, treating them as terminal symbols. The string of <w> and <c> tags within an <s> tag is enclosed between <CP> and </CP> tags, separately from the text of the sentence. Coordinated and subordinated clauses also received a CP label. Within the <CP> tags created in this way, noun phrases had to be marked between <NP> and </NP> tags. During the annotation process, annotators used not only <CP> and <NP> but also <XP> tags, which separate parts of the text that are not closely connected with the main body of the text (interpolations between dashes or parentheses, the resolution of an abbreviation given in parentheses, etc.). This file with NP, CP and XP tags was available at the beginning of syntactic analysis. For further information on the marking of clauses and noun phrases see the description of the 1.0 version of Szeged Treebank.

2. Linguistic considerations

Considerable effort has been made in research on Hungarian sentence structure to form a consistent syntactic rule system; however, such a system is still not available in an implementable form. For this reason, taking existing results and experience into consideration, we tried to construct a system of annotation criteria that is best suited to the principles of computer processing.

2.1. Theory
The broad theory chosen is Hungarian generative syntax. The output of syntactic analysis is one or more syntactic trees. As a first step, rewriting rules and the lexicon create the initial or deep structure of the sentence. From this, the final or so-called surface structure is derived through transformations (movements, deletions). In the deep structure of the Hungarian sentence the verb precedes its complements, which follow in an arbitrary order.

             S

                  QP

                      FP

                         VP

                     V     XP    XP    XP    XP    ...


Denotations:
    S:     sentence
    QP:    quantifier position
    FP:    focus position
    VP:    verb phrase
    V:     verb
    XP:    optional complement phrase:
               NP (noun phrase)
               ADVP (adverbial phrase)
               PP (postpositional phrase)

Free, that is, leafless branches are left blank in the deep structure. They are so-called functional positions, and components can be moved into them from behind the verb. The moved components leave so-called traces in their original places. Traces are co-indexed with the moved components, so movements need not be indicated with arrows. Neither the deep nor the surface structure of the syntactic tree shows which of the verb's arguments are complements; this information can be found in the lexicon. Let us see an example in which some components are moved from behind the verb:

             S

        Ági_i        QP

         minden rokonát_j    FP

             tegnapelőtt_k    VP

                 V        NP          NP    NP    ADVP


                 látta    vendégül    __i   __j   __k

(Ági hosted each of her relatives the day before yesterday.)

The analysis can be continued below sentence level as well: the inner structure of sentence-level components can also be revealed. This, however, neither affects nor overwrites the sentence-level analysis.

2.2. Practical realization
The linguists building Szeged Treebank 2.0 decided not to represent empty categories. Empty categories can be the traces left by moved components, or phonologically empty pronouns (pro, PRO, or phonologically empty deictic words). The theory is not curtailed by this, since traces are recoverable: components preceding the verb leave their traces behind the verb (in an arbitrary order). Phonologically empty pronouns can also be generated from personal suffixes.

Another important difference from the theory is that functional positions (projections) are not represented. In some cases whether they are filled can be inferred from the position of the verb and the verb modifier (verbal prefix or singular common noun without an article); in other cases it depends on the prosodic features of live speech, which are not coded in the written texts of the corpus.

Noun phrases are handled uniformly; definite (DP), indefinite (NUMP), and predicative (predNP) noun phrases are not distinguished.

Syntactic trees in the treebank are not stored as trees; for technical reasons and for the sake of simplicity they are realized as labelled bracketing, expressed in the widely used XML format. Labelled bracketing and tree structure are equivalent:

    <NP>
        <NP>
            Pista
        </NP>
        kabátja
    </NP>

is equivalent to

             NP

         NP

         Pista    kabátja

(Pista's coat)
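The equivalence between the XML form and labelled bracketing can be illustrated with a short Python sketch. The function name to_brackets and the square-bracket output style are choices made for this illustration only; they are not part of the treebank format.

```python
import xml.etree.ElementTree as ET


def to_brackets(elem):
    """Render an XML element as labelled bracketing, e.g. [NP [NP Pista] kabátja]."""
    parts = [elem.tag]
    if elem.text and elem.text.strip():
        parts.append(elem.text.strip())
    for child in elem:
        parts.append(to_brackets(child))
        # Text that follows a child element still belongs to the parent phrase.
        if child.tail and child.tail.strip():
            parts.append(child.tail.strip())
    return "[" + " ".join(parts) + "]"


xml_text = "<NP><NP>Pista</NP>kabátja</NP>"
print(to_brackets(ET.fromstring(xml_text)))
```

Each XML element becomes one labelled bracket pair, so the nesting of elements and the nesting of brackets describe the same tree.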

Inventory and short overview of the use of syntactic labels used in the corpus

ADJP: boundary of attributive adjectives
ADVP: boundary of adverbial phrases; adverbial adjectives (gyorsan [quickly], kétségtelenül [undoubtedly]), postpositional personal pronouns (e.g. mögötte [behind him/her], utánatok [after you]), and tokens not belonging to any other category (szervusz [hi], igen [yes])
c: punctuation mark
C0: conjunction
CP: boundary of clauses; in the corpus it also realizes S, the start symbol of the theory; in the case of subordinate clauses, the deictic word is represented as the verb complement, or the CP itself if the deictic word is missing. (For further information on the marking of CPs see the description of the 1.0 version of Szeged Treebank.)
INF_: boundary of the infinitive and its complement list
NEG: negative particle
NP: boundary of noun phrases; we considered only movable, noun-headed structures as noun phrases; predicative (non-attributive) adjectives and inflected personal pronouns (nekem [for me], tőlünk [from us]) are also marked as noun phrases. (For further information on the marking of NPs see the description of the 1.0 version of Szeged Treebank.)
PA_: boundary of the adverbial participle and its complement list
PP: boundary of postpositional structures
PREVERB: verbal prefix
V_: boundary of the verb and its complement list
XP: an interjected (not organic) part of the sentence, e.g. interjection in parentheses, between dashes.

List of attributes:
id: automatically generated identifier of top-level components
preverb_ref: attribute of verb-like labels (V0, INF0, PA0); its value is the identifier of the inseparable verbal prefix
preverb_body: attribute of verb-like labels (V0, INF0, PA0); its value is the lower-case form of the inseparable verbal prefix
idref: attribute of NODE; its value is the identifier of the complement
type: attribute of NODE; its value is the type of the complement label
role: attribute of NODE; its value is the morpho-syntactic and semantic role of the complement. The role attribute can take the values listed in the table below:
Case, description            MSD  Inflection, example              role
nominative                   n    Ø                                NOM
accusative                   a    -t                               ACC
genitive                     g    Ø, -nak/-nek                     GEN
dative                       d    -nak/-nek                        DAT
instrumental                 i    -val/-vel                        INS
illative                     x    -ba/-be                          ILL
inessive                     2    -ban/-ben                        INE
elative                      e    -ból/-ből                        ELA
allative                     t    -hoz/-hez/-höz                   ALL
adessive                     3    -nál/-nél                        ADE
ablative                     b    -tól/-től                        ABL
sublative                    s    -ra/-re                          SUB
superessive                  p    -n/-on/-en/-ön                   SUP
delative                     h    -ról/-ről                        DEL
terminative                  9    -ig                              TER
essive                       w    -ul/-ül                          ESS
(essive-)formal              f    -ként, -képp(en)                 FOR
temporalis                   m    -kor                             TEM
causalis                     c    -ért                             CAU
sociative                    q    -stul/-stül                      SOC
factive                      y    -vá/-vé                          FAC
distributive                 u    -nként                           DIS
locativus                    l    -tt                              LOC
place: point of location          there; under the tree            LOCY
place: endpoint                   there; under the tree            TO
place: starting point             from there; from under the tree  FROM
time: point of location           today; during the meeting        TLOCY
time: endpoint                    next year; till then             TTO
time: starting point              from that time                   TFROM
predicative                       nominal                          PRED
question word                     whether                          QUE
result                            infinitive                       GOAL
other                             ill, because of him              MODE

Table 1.: Possible values of the role attribute
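The case rows of Table 1 amount to a lookup from a one-character MSD case code to a role value. The sketch below copies that mapping from the table. The helper role_for_noun_msd additionally assumes that the case code is the fifth character of a noun MSD code (as in the illustrative code "Nc-sa"); this position is an assumption that the present description does not state.

```python
# Case code -> role value, copied from the case rows of Table 1.
CASE_TO_ROLE = {
    "n": "NOM", "a": "ACC", "g": "GEN", "d": "DAT", "i": "INS",
    "x": "ILL", "2": "INE", "e": "ELA", "t": "ALL", "3": "ADE",
    "b": "ABL", "s": "SUB", "p": "SUP", "h": "DEL", "9": "TER",
    "w": "ESS", "f": "FOR", "m": "TEM", "c": "CAU", "q": "SOC",
    "y": "FAC", "u": "DIS", "l": "LOC",
}


def role_for_case(code):
    """Return the role value for a one-character MSD case code, or None if unknown."""
    return CASE_TO_ROLE.get(code)


def role_for_noun_msd(msd, case_index=4):
    """Assumption: in a noun MSD code the case code sits at position case_index."""
    if len(msd) <= case_index:
        return None
    return role_for_case(msd[case_index])


print(role_for_case("a"))
print(role_for_noun_msd("Nc-sa"))
```

The remaining role values (LOCY, TO, FROM, TLOCY, TTO, TFROM, PRED, QUE, GOAL, MODE) have no MSD case code in the table, so they cannot be derived by this lookup.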

Thus, the analysis of the sentence above in the treebank looks as follows:

<CP id="fajl.1.1">
    <NP id="fajl.1.2">
        Ági
    </NP>
    <NP id="fajl.1.3">
        <ADJP>
            minden
        </ADJP>
        rokonát
    </NP>
    <ADVP id="fajl.1.4">
        tegnapelőtt
    </ADVP>
    <V_ id="fajl.1.5">
        <V0>
            látta
        </V0>
        <CHILDREN>
            <NODE idref="fajl.1.2" type="NP" role="NOM">
            </NODE>
            <NODE idref="fajl.1.3" type="NP" role="ACC">
            </NODE>
            <NODE idref="fajl.1.4" type="ADVP" role="TLOCY">
            </NODE>
            <NODE idref="fajl.1.6" type="NP" role="ESS">
            </NODE>
        </CHILDREN>
    </V_>
    <NP id="fajl.1.6">
        vendégül
    </NP>
    <c>
        .
    </c>
</CP>

The equivalent tree is:

                                  CP


NP     NP                ADVP           V_                          NP          c


       ADJP                             V0       CHILDREN


                                                 NODE NODE NODE NODE



Ági    minden rokonát    tegnapelőtt    látta                       vendégül    .
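The NODE elements in the analysis above refer to complements by identifier rather than containing them. A minimal Python sketch of resolving these references follows; it works on the simplified sample shown in this section, in which words appear as plain text, and the function name complements is hypothetical.

```python
import xml.etree.ElementTree as ET

# The sample analysis from this section, written compactly.
SAMPLE = """<CP id="fajl.1.1">
<NP id="fajl.1.2">Ági</NP>
<NP id="fajl.1.3"><ADJP>minden</ADJP>rokonát</NP>
<ADVP id="fajl.1.4">tegnapelőtt</ADVP>
<V_ id="fajl.1.5"><V0>látta</V0><CHILDREN>
<NODE idref="fajl.1.2" type="NP" role="NOM"></NODE>
<NODE idref="fajl.1.3" type="NP" role="ACC"></NODE>
<NODE idref="fajl.1.4" type="ADVP" role="TLOCY"></NODE>
<NODE idref="fajl.1.6" type="NP" role="ESS"></NODE>
</CHILDREN></V_>
<NP id="fajl.1.6">vendégül</NP>
<c>.</c>
</CP>"""


def words_of(elem):
    """Join all text fragments found under an element with single spaces."""
    return " ".join(t.strip() for t in elem.itertext() if t.strip())


def complements(cp):
    """Return (verb, role, type, words) for every NODE of every V_ in the clause."""
    by_id = {e.get("id"): e for e in cp.iter() if e.get("id")}
    result = []
    for verb in cp.iter("V_"):
        head = words_of(verb.find("V0"))
        for node in verb.iter("NODE"):
            target = by_id[node.get("idref")]  # the component the NODE points to
            result.append((head, node.get("role"), node.get("type"), words_of(target)))
    return result


for row in complements(ET.fromstring(SAMPLE)):
    print(row)
```

Because the references are resolved through the id attribute, a complement may stand before or after the verb in the text, as vendégül does here.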

2.3. Further possible developments

3. Text processing
3.1. Preliminary analysis of syntactic structures
Since NPs and CPs were already marked in the texts taken over from version 1.0 of Szeged Treebank, only the remaining syntactic structures needed preliminary annotation. The syntactic role of a word could mostly be determined from its MSD code, that is, its morpho-syntactic traits. The use of regular rules defined by experts was not necessary here. For the automatic annotation of syntactic units we used a program developed in-house. Naturally, the program was not expected to identify the structures with one hundred percent precision, so checking and correction by experts could not be omitted in this phase either.

3.2. The process of manual annotation
The next step was the checking and correction of the automatically produced syntactic annotation. To simplify the task, we adapted the program developed earlier for the annotation of clauses and noun phrases. Figure 1 below shows the window in which annotators could see both the XML structure of sentences and their syntactic tree structures.

Figure 1. The editing window of the XML Editor

4. Treebank data in numbers
We computed statistics over the whole treebank in order to examine the distribution of different features of the entire syntax tree. The following two tables summarize the results by topic.

4.1. The depth of the entire syntax tree
The depth of the entire syntax tree is the length of the longest path from a leaf to the root (symbol S), that is, the number of tree levels. The table below covers the whole syntax tree of every sentence in the treebank, summarized by topic. The columns give the frequency of trees of a given depth. Depths up to 5 levels are shown individually; from 6 levels on they are merged into increasingly larger groups. Trees with 4 levels are the most frequent overall.
Syntax tree depth
                          1      2      3      4      5    6-7   8-10  11-20
Short essays            141   2922   7898   8388   3942   1380     62      0
Legal texts               2    110    687   1554   2127   3346   1337    115
Newspaper articles       29    577   1466   2469   2545   2567    534     24
Business news             0     75    864   2396   2844   2933    455     10
Fiction                 493   4649   5230   4170   2373   1495    152      2
Computer technology       9    541   1133   2413   2654   2638    373      7
All                     674   8874  17278  21390  16485  14359   2913    158
Table 2. Distribution of the entire syntax tree depth in treebank sentences
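The depth measure can be sketched in Python as follows. This is an illustration under a stated assumption: the sketch counts nested XML element levels, and the text does not say whether the original measurement counted the terminal words as a separate level, so the numbers it produces need not match the table exactly. The grouping follows the column headers of Table 2.

```python
import xml.etree.ElementTree as ET

# Depth groups taken from the column headers of Table 2.
DEPTH_GROUPS = [(1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 7), (8, 10), (11, 20)]


def tree_depth(elem):
    """Number of element levels on the longest path from elem down to a leaf element."""
    return 1 + max((tree_depth(child) for child in elem), default=0)


def depth_group(depth):
    """Return the Table 2 column label for a depth, or None if it is out of range."""
    for low, high in DEPTH_GROUPS:
        if low <= depth <= high:
            return str(low) if low == high else f"{low}-{high}"
    return None


sample = ET.fromstring("<CP><NP><NP>Pista</NP>kabátja</NP><c>.</c></CP>")
print(tree_depth(sample), depth_group(tree_depth(sample)))
```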

4.2. The width of the entire syntax tree
The width of the whole syntax tree is practically equivalent to the length of the sentence, that is, the number of words and punctuation marks in it. The table below covers the whole syntax tree of every sentence in the treebank, summarized by topic. The columns give the frequency of trees of a given width. Widths up to 5 are shown individually; from 6 on they are merged into increasingly larger groups. Width is spread more widely than depth. Overall, a width of 11-20 is the most common, followed by 21-50; a width of 21-50 is the most common in legal texts, newspaper articles, business news and computer technology texts. Widths above 50 also occur in every topic.
Syntax tree width
                          1      2      3      4      5    6-7   8-10  11-20  21-50    51-
Short essays             25    126    319    578   1109   2811   4738  11309   3667     51
Legal texts              20     56     60     72     48    147    429   2640   5153    653
Newspaper articles        1     83     97    120    156    438   1000   3693   4401    222
Business news             1      0      2     11    158    114    502   3741   5006     42
Fiction                  15    434   1099   1336   1397   2691   3095   5487   2864    146
Computer technology     104    142    108     80    130    266    681   3643   4430    184
All                     166    841   1685   2197   2998   6467  10445  30513  25521   1298
Table 3. Distribution of the entire syntax tree width in treebank sentences
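Since width is defined as the number of words and punctuation marks, it can be obtained by counting the terminal <w> and <c> elements of a sentence. The sketch below does this and assigns the result to the width groups of Table 3; the sample sentence and the label "above 50" for the open-ended last group are choices made for this illustration.

```python
import xml.etree.ElementTree as ET

# Closed width groups taken from the column headers of Table 3.
WIDTH_GROUPS = [(1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 7), (8, 10), (11, 20), (21, 50)]


def tree_width(elem):
    """Count the terminal elements: <w> (words) and <c> (punctuation marks)."""
    return sum(1 for e in elem.iter() if e.tag in ("w", "c"))


def width_group(width):
    """Return the Table 3 group for a width; None for widths below 1."""
    for low, high in WIDTH_GROUPS:
        if low <= width <= high:
            return str(low) if low == high else f"{low}-{high}"
    return "above 50" if width > 50 else None


# Hypothetical three-token sentence with words and punctuation as terminals.
sample = ET.fromstring(
    "<CP><NP><w>Ági</w></NP><V_><V0><w>alszik</w></V0></V_><c>.</c></CP>"
)
print(tree_width(sample), width_group(tree_width(sample)))
```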


Creators of Szeged Treebank 2.0
Consortium partners:
Project Leaders
János Csirik      University of Szeged                e-mail: csirik@inf.u-szeged.hu
Tibor Gyimóthy    University of Szeged                e-mail: gyimothy@inf.u-szeged.hu
Gábor Prószéky    MorphoLogic Ltd.                    e-mail: proszeky@morphologic.hu
Balázs Kis        MorphoLogic Ltd.                    e-mail: kis@morphologic.hu
Tamás Váradi      Research Institute for Linguistics  e-mail: varadi@nytud.hu
Further project members from the University of Szeged
Zoltán Alexin     University of Szeged                e-mail: alexin@inf.u-szeged.hu
Dóra Csendes      University of Szeged                e-mail: dcsendes@inf.u-szeged.hu
Richárd Farkas    University of Szeged                e-mail: rfarkas@inf.u-szeged.hu
András Hócza      University of Szeged                e-mail: hocza@inf.u-szeged.hu
András Kocsor     University of Szeged                e-mail: kocsor@inf.u-szeged.hu
Kornél Kovács     University of Szeged                e-mail: kkornel@inf.u-szeged.hu
György Szarvas    University of Szeged                e-mail: szarvas@inf.u-szeged.hu
Annotators (students of Hungarian linguistics at the University of Szeged)
Anikó Formanek, Kinga Konczer, Ildikó Korpa, Éva Nagy, Krisztián Pálmai, Ágnes Szabó, Bernadett Szőke, Csilla Tóth, Veronika Vincze
Programmers (students of informatics at the University of Szeged)
András Appelshoffer, Tibor Bakota, Csongor Barta, Szabolcs Iván, András Miháczi, Miklós Rácz, György Soponyai
