Szeged Treebank 2.0

A Hungarian natural language database with detailed syntactic analysis

Introduction

In Hungarian, as in many other languages, the role of morphemes and syntagmas in sentences and their relation to one another is of key importance. Syntactic analysis and annotation, that is, the marking of different syntactic units (e.g. noun or adjectival phrases, postpositional structures, verbs and their arguments), makes these roles and relations explicit. A treebank representation that describes the syntactic structure of sentences already exists for most Western European languages and a number of Central and Eastern European languages, so it is time to create a precisely analyzed Hungarian treebank as well.

We relied on known sources and existing theories when building the treebank. After studying and comparing them, our linguistic experts developed a consistent system of syntactic rules. The defined syntactic units were marked by an automatic pre-annotation module on the texts of Szeged Corpus 2.0, then linguistic experts checked and corrected the marked structures. Szeged Treebank 2.0 is based on the first version of the treebank, so it contains its NP and CP annotations.

The database formed in this way provides a reliable basis for the development of different computer applications. The identification of marked syntagmas and their relationships supports further linguistic processing, including the semantic analysis of texts. We marked syntactic structures on 82 000 sentences (1.2 million word entries + 250 thousand punctuation marks) of Szeged Corpus 2.0. Treebank files are stored in XML format; their inner structure is described by the TEI P4 DTD (Document Type Definition) scheme.

Texts of Szeged Treebank 2.0

Text files of Szeged Treebank 1.0 correspond to Szeged Corpus 2.0. Texts from six different topics were selected, with each topic containing roughly 200 000 words. The topics are as follows:

Fiction
Short essays of 14 to 16 year-old students
Newspaper articles: excerpts from Népszabadság, Népszava, Magyar Hírlap and HVG
Texts related to computer science
Legal texts
Economic and financial short news

Further information on the types and sizes of texts is available in the description of Szeged Corpus 2.0.

The formation process of Szeged Treebank 2.0

1. Preliminaries

Szeged Treebank 2.0 is based on Szeged Corpus 2.0, which describes sentences in the following way. Each sentence is surrounded by <s> and </s> XML tags. Within these tags, the entire text of the sentence comes first, then the words and punctuation marks of the sentence are listed between <w> and </w> and <c> and </c> tags, respectively. Within <w> and </w>, the actual word form comes first, followed by all possible POS tags of the word together with their stems: each <w> element contains all possible morpho-syntactic (MSD) codes of the word form, together with the stems, in <anav> tags. The MSD code selected from the possible codes on the basis of the context is always given in <ana> tags together with the stem.
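As a rough illustration of this layout, the following Python sketch reads one sentence with the standard library. The sample markup is an assumption made for the example: the description above only says that <s> holds the sentence text followed by <w> and <c> elements, and that <w> holds the word form, the <anav> candidates and the selected <ana>. How the stem and the MSD code are arranged inside <anav> and <ana> is not specified here, so the sketch treats their content as opaque placeholder strings.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the content of <anav>/<ana> is a placeholder,
# not the real encoding used in the corpus files.
SAMPLE = """<s>Ági látta.
<w>Ági<anav>candidate-1</anav><ana>selected-1</ana></w>
<w>látta<anav>candidate-2</anav><anav>candidate-3</anav><ana>selected-2</ana></w>
<c>.</c>
</s>"""


def read_sentence(xml_text):
    """Return the sentence text and a list of token tuples.

    Each token is (kind, form, candidate_analyses, selected_analysis),
    where kind is "w" for words and "c" for punctuation marks.
    """
    s = ET.fromstring(xml_text)
    text = (s.text or "").strip()          # sentence text precedes the tokens
    tokens = []
    for el in s:
        form = (el.text or "").strip()     # word form comes first inside <w>
        if el.tag == "w":
            candidates = [a.text for a in el.findall("anav")]
            selected = el.findtext("ana")  # analysis chosen from context
            tokens.append(("w", form, candidates, selected))
        elif el.tag == "c":
            tokens.append(("c", form, [], None))
    return text, tokens


text, tokens = read_sentence(SAMPLE)
for token in tokens:
    print(token)
```

Keeping the candidate analyses and the selected analysis apart mirrors the distinction between <anav> and <ana> in the description.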

We used the internationally accepted NP (noun phrase) and CP (clausal phrase) tags to label the texts with syntactic tags. It was natural to build the coding of syntactic structure on the strings of <w> and <c> tags, considering them as terminal symbols. The string of <w> and <c> tags within an <s> tag is placed between <CP> and </CP> tags, separated from the text of the sentence. Coordinated and subordinated clauses also received a CP label. Within the <CP> tags created in this way, noun phrases had to be marked between <NP> and </NP> tags. During the annotation process, annotators used not only <CP> and <NP> but also <XP> tags, which separate parts of the text that are not closely connected to the main body of the text (interpolations between dashes or parentheses, the resolution of an abbreviation given in parentheses, etc.). This file with NP, CP and XP tags was available at the beginning of syntactic analysis. For further information on the marking of clauses and noun phrases see the description of the 1.0 version of Szeged Treebank.

2. Linguistic considerations

Considerable effort has been made in research on Hungarian sentence structure to form a consistent syntactic rule system; however, such a system is still not available in an implementable form. For this reason, taking existing results and experience into consideration, we tried to construct a system of criteria for the annotation that is best suited to the principles of computer processing.

2.1. Theory

The broad theory chosen is Hungarian generative syntax. The output of syntactic analysis is one or more syntactic trees. As a first step, rewriting rules and the lexicon create the initial or deep structure of the sentence. From this, the final or so-called surface structure is derived through transformations (movements, deletions). In the deep structure of the Hungarian sentence the verb precedes its complements, which follow in an optional order.

V XP XP XP XP ...

Denotations:
    S:     sentence
    QP:    quantifier position
    FP:    focus position
    VP:    verb phrase
    V:     verb
    XP:    optional complement phrase:
               NP (noun phrase)
               ADVP (adverbial phrase)
               PP (postpositional phrase)

Free, that is leafless, branches are left blank in the deep structure; these are so-called functional positions, and components can be moved to them from behind the verb. The moved components leave so-called traces in their original places. Traces are co-indexed with the moved components, so movements need not be indicated with arrows. Neither the deep nor the surface structure of the syntactic tree shows which of the verb arguments are complements; this information can be found in the lexicon. Let us see an example in which some components are moved from behind the verb:

S
    Ági_i
    QP
        minden rokonát_j
        FP
            tegnapelőtt_k
            VP
                V       látta
                NP      vendégül
                NP      __i
                NP      __j
                ADVP    __k

(Ági hosted each of her relatives the day before yesterday.)

The analysis can be continued below the sentence level as well: the internal structure of sentence-level components can also be revealed. This, however, neither affects nor overrides the sentence-level analysis.

2.2. Practical realization

The linguists building Szeged Treebank 2.0 decided not to represent empty categories. Empty categories can be the traces left by moved components, or phonologically empty pronouns (pro, PRO, or phonologically empty deictic words). The theory is not curtailed by this, since traces are recoverable: components preceding the verb leave their traces behind the verb (in an arbitrary order). Phonologically empty pronouns can also be generated from personal suffixes.

Another important difference from theory is the omission of the representation of functional positions (projections). In some cases their saturation can be concluded from the position of the verb and the verb modifier (verbal prefix or singular common noun without an article), in other cases it depends on the prosodic features of live speech, which are not coded in the written texts of the corpus.

Noun phrases are handled uniformly; definite (DP), indefinite (NUMP), and predicative (predNP) noun phrases are not distinguished.

For technical reasons and for the sake of simplicity, syntactic trees in the treebank are not stored as trees but as labelled bracketing, realized in the widely used XML format. Labelled bracketing and tree structure are equivalent to each other:

<NP>
    <NP>
        Pista
    </NP>
    kabátja
</NP>

NP
    NP
        Pista
    kabátja

(Pista's coat)
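The equivalence between the XML markup and the tree can be made concrete with a short Python sketch that walks the XML elements and prints the same structure as a bracketed string. It is only an illustration: the function name to_brackets is hypothetical, and words are written as plain text inside the phrase tags, as in the simplified example above, rather than as <w> elements.

```python
import xml.etree.ElementTree as ET


def to_brackets(el):
    """Render an XML element as a labelled bracketing string."""
    parts = [el.tag]
    if el.text and el.text.strip():
        parts.append(el.text.strip())        # words directly inside the phrase
    for child in el:
        parts.append(to_brackets(child))     # nested phrase
        if child.tail and child.tail.strip():
            parts.append(child.tail.strip()) # words following the nested phrase
    return "[" + " ".join(parts) + "]"


xml_text = "<NP><NP>Pista</NP> kabátja</NP>"
print(to_brackets(ET.fromstring(xml_text)))  # [NP [NP Pista] kabátja]
```

Each XML element corresponds to one inner node of the tree, and the text between the tags corresponds to the leaves.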

Inventory and short overview of the use of syntactic labels used in the corpus

ADJP: boundary of attributive adjectives

ADVP: boundary of adverbial phrases; adverbial adjectives (gyorsan [quickly], kétségtelenül [undoubtedly]), postpositional personal pronouns (e.g. mögötte [behind him/her], utánatok [after you]), and tokens not belonging to any other category (szervusz [hi], igen [yes])

c: punctuation mark

C0: conjunction

CP: boundary of clauses; also the realisation of the starting symbol of theory, S in the corpus; in the case of subordinate sentences it is the deictic word that is represented as the verb complement, or the CP in case it is missing. (For further information on the marking of CPs see the description of the 1.0 version of Szeged Treebank.)

INF_: boundary of the infinitive and its complement list

INF0: boundary of the infinitive

CHILDREN: boundary of the complement list

NODE: label of the attributes of a given complement

NEG: negative particle

NP: boundary of noun phrases; we considered only movable, noun-headed constituents as noun phrases; predicative (non-attributive) adjectives and inflected personal pronouns (nekem [for me], tőlünk [from us]) are also marked as noun phrases. (For further information on the marking of NPs see the description of the 1.0 version of Szeged Treebank.)

PA_: boundary of the adverbial participle and its complement list

PA0: boundary of adverbial participle

CHILDREN: boundary of the complement list

NODE: label of the attributes of a given complement

PP: boundary of postpositional structures

PREVERB: verbal prefix

V_: boundary of the verb and its complement list

V0: boundary of the verb; a composite past tense conditional verb form is contained in a single V0

CHILDREN: boundary of the complement list

NODE: label of the attributes of a given complement

XP: an interjected (not organic) part of the sentence, e.g. interjection in parentheses, between dashes.

List of attributes:

id: automatically generated identifier of uppermost level components

preverb_ref: attribute of verb-like labels (V0, INF0, PA0); its value is the identifier of inseparable verbal prefixes

preverb_body: attribute of verb-like labels (V0, INF0, PA0); its value is the lower-case form of inseparable verbal prefixes

idref: attribute of NODE; its value is the identifier of the complement

type: attribute of NODE; its value is the type of the complement label

role: attribute of NODE; its value is the morpho-syntactic and semantic role of the complement. The role attribute can take the values listed in the table below:

Case, description	MSD	Inflection, example	role
nominative	n	Ø	NOM
accusative	a	-t	ACC
genitive	g	Ø, -nak/-nek	GEN
dative	d	-nak/-nek	DAT
instrumental	i	-val/-vel	INS
illative	x	-ba/-be	ILL
inessive	2	-ban/-ben	INE
elative	e	-ból/-ből	ELA
allative	t	-hoz/-hez/-höz	ALL
adessive	3	-nál/-nél	ADE
ablative	b	-tól/-től	ABL
sublative	s	-ra/-re	SUB
superessive	p	-n/-on/-en/-ön	SUP
delative	h	-ról/-ről	DEL
terminative	9	-ig	TER
essive	w	-ul/-ül	ESS
(essive-)formal	f	-ként, -képp(en)	FOR
temporalis	m	-kor	TEM
causalis	c	-ért	CAU
sociative	q	-stul/-stül	SOC
factive	y	-vá/-vé	FAC
distributive	u	-nként	DIS
locativus	l	-tt	LOC
place: point of location		there; under the tree	LOCY
place: endpoint		there; under the tree	TO
place: starting point		from there; from under the tree	FROM
time: point of location		today; during the meeting	TLOCY
time: endpoint		next year; till then	TTO
time: starting point		from that time	TFROM
predicative nominal			PRED
question word		whether	QUE
result		infinitive	GOAL
other		ill, because of him	MODE

Table 1. Possible values of the role attribute
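For processing scripts, the case rows of the table can be captured as a simple lookup from the one-character MSD case code to the role value. The Python sketch below copies the codes and role labels from the table; the names CASE_TO_ROLE, OTHER_ROLES, role_for_case and is_valid_role are hypothetical helpers, and the position of the case code inside a full MSD string is deliberately not assumed here.

```python
# MSD case code -> role value, copied from the case rows of the table.
CASE_TO_ROLE = {
    "n": "NOM", "a": "ACC", "g": "GEN", "d": "DAT", "i": "INS",
    "x": "ILL", "2": "INE", "e": "ELA", "t": "ALL", "3": "ADE",
    "b": "ABL", "s": "SUB", "p": "SUP", "h": "DEL", "9": "TER",
    "w": "ESS", "f": "FOR", "m": "TEM", "c": "CAU", "q": "SOC",
    "y": "FAC", "u": "DIS", "l": "LOC",
}

# Role values that have no MSD case code in the table.
OTHER_ROLES = {
    "LOCY", "TO", "FROM", "TLOCY", "TTO", "TFROM",
    "PRED", "QUE", "GOAL", "MODE",
}


def role_for_case(case_code):
    """Return the role for an MSD case code, or None if the code is unknown."""
    return CASE_TO_ROLE.get(case_code)


def is_valid_role(role):
    """Check whether a string is one of the role values listed in the table."""
    return role in OTHER_ROLES or role in CASE_TO_ROLE.values()


print(role_for_case("a"))      # ACC
print(is_valid_role("TLOCY"))  # True
```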

Thus, the analysis of the sentence above in the treebank looks as follows:

<CP>
    <NP>
        Ági
    </NP>
    <NP>
        <ADJP>
            minden
        </ADJP>
        rokonát
    </NP>
    <ADVP>
        tegnapelőtt
    </ADVP>
    <V_ id="fajl.1.5">
        <V0>
            látta
        </V0>
        <CHILDREN>
            <NODE>
            </NODE>
            <NODE>
            </NODE>
            <NODE>
            </NODE>
            <NODE>
            </NODE>
        </CHILDREN>
    </V_>
    <NP>
        vendégül
    </NP>
    <c>
    </c>
</CP>

The equivalent tree is:

CP
    NP
        Ági
    NP
        ADJP
            minden
        rokonát
    ADVP
        tegnapelőtt
    V_
        V0
            látta
        CHILDREN
            NODE
            NODE
            NODE
            NODE
    NP
        vendégül
    c
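The way NODE elements point at the complements of a verb through their idref attribute can be sketched in Python as follows. Everything concrete in the sample is an assumption made for illustration: the identifiers ("x.1" and so on), the role values assigned to each complement and the self-closing form of NODE are hypothetical and are not taken from the treebank files; only the element and attribute names come from the inventory above.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: ids and role values are invented for the illustration.
SAMPLE = """<CP>
<NP id="x.1">Ági</NP>
<NP id="x.2"><ADJP>minden</ADJP> rokonát</NP>
<ADVP id="x.3">tegnapelőtt</ADVP>
<V_ id="x.4"><V0>látta</V0><CHILDREN>
<NODE idref="x.1" type="NP" role="NOM"/>
<NODE idref="x.2" type="NP" role="ACC"/>
<NODE idref="x.3" type="ADVP" role="TLOCY"/>
<NODE idref="x.5" type="NP" role="ESS"/>
</CHILDREN></V_>
<NP id="x.5">vendégül</NP>
<c>.</c>
</CP>"""


def complements(cp_xml):
    """List (verb, role, complement text) triples for every V_ in a clause."""
    root = ET.fromstring(cp_xml)
    # Index every component that carries an id attribute.
    by_id = {el.get("id"): el for el in root.iter() if el.get("id")}
    result = []
    for v in root.iter("V_"):
        verb = "".join(v.find("V0").itertext()).strip()
        for node in v.iter("NODE"):
            target = by_id.get(node.get("idref"))
            text = None
            if target is not None:
                # Join all words below the complement, normalising whitespace.
                text = " ".join("".join(target.itertext()).split())
            result.append((verb, node.get("role"), text))
    return result


for triple in complements(SAMPLE):
    print(triple)
```

Resolving idref through a dictionary keeps the complement list independent of word order, which matters because complements may stand on either side of the verb.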

2.3. Further possible developments

Indication of the constituents of nonverbal components
Indexing members of possessive structures with the inflection -nak/-nek together
Indexing subordinate clauses and their representative deictic words together
Classification of NPs (DP, NUMP, predNP)

3. Text processing

3.1. Preliminary analysis of syntactic structures

Since NPs and CPs were already marked in the texts deriving from version 1.0 of the Szeged Treebank, only a preliminary annotation of the remaining syntactic structures was necessary. The syntactic role of words could mostly be determined from their MSD codes, that is, their morpho-syntactic traits. The use of regular rules defined by experts was not necessary here. For the automatic annotation of syntactic units, we used a program developed in-house. Naturally, the program was not expected to identify the structures with one hundred percent precision, so checking and correction by experts could not be omitted in this phase either.

3.2. The process of manual annotation

The next step of the process was the checking and correction of the automatically produced syntactic annotation. To simplify the task, we adapted the program developed earlier for the annotation of clauses and noun phrases. The window in which annotators could see both the XML structure of sentences and their syntactic tree structures is shown in Figure 1 below.

Figure 1. The editing window of the XML Editor

4. Treebank data in numbers

We computed statistics on the whole treebank in order to examine the distribution of different features of the entire syntax tree. The following two tables summarize these results according to topics.

4.1. The depth of the entire syntax tree

The depth of the entire syntax tree is the length of the longest path from a leaf to the root (symbol S), that is, the number of tree levels. The table below refers to the whole syntax tree of all the sentences in the treebank, summarized according to topics. The columns of the table give the frequency of trees of a certain depth. Depth data are given per level up to a depth of 5 levels; from level 6 on they are grouped into increasingly larger ranges. Complete syntax trees of depth 4 are the most frequent.

Syntax tree depth
	1	2	3	4	5	6-7	8-10	11-20
Short essays	141	2922	7898	8388	3942	1380	62	0
Legal texts	2	110	687	1554	2127	3346	1337	115
Newspaper articles	29	577	1466	2469	2545	2567	534	24
Business news	0	75	864	2396	2844	2933	455	10
Fiction	493	4649	5230	4170	2373	1495	152	2
Computer technology	9	541	1133	2413	2654	2638	373	7
All	674	8874	17278	21390	16485	14359	2913	158

Table 2. Distribution of the entire syntax tree depth in treebank sentences
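A minimal Python sketch of such a depth measurement is given below. It rests on an assumption: only XML elements are counted as tree levels, with the root as level 1. The exact convention behind the table (for example whether word leaves or the V_, CHILDREN and NODE elements count as levels) is not stated here, so the numbers produced by this sketch need not match the table. The function names depth and depth_bin are hypothetical.

```python
import xml.etree.ElementTree as ET


def depth(el):
    """Number of element levels on the longest path from el down to a leaf."""
    return 1 + max((depth(child) for child in el), default=0)


def depth_bin(d):
    """Map a depth to the column labels used in the table."""
    if d <= 5:
        return str(d)
    if d <= 7:
        return "6-7"
    if d <= 10:
        return "8-10"
    if d <= 20:
        return "11-20"
    return ">20"  # no such trees are reported in the table


tree = ET.fromstring("<CP><NP><NP>Pista</NP> kabátja</NP></CP>")
print(depth(tree), depth_bin(depth(tree)))  # 3 3
```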

4.2. The width of the entire syntax tree

The width of the whole syntax tree is practically equivalent to the length of the sentence, that is, the number of words and punctuation marks in it. The table below refers to the whole syntax tree of all the sentences in the treebank, summarized according to topics. The columns of the table give the frequency of trees of a certain width. Width data are given separately up to 5 words; from 6 words on they are grouped into increasingly larger ranges. The distribution according to width is more spread out than that of depth. Over the whole treebank, a width of 11-20 is the most common, although 21-50 is the most frequent range in four of the six topics (legal texts, newspaper articles, business news and computer technology); widths above 50 also occur in every topic.

Syntax tree width
	1	2	3	4	5	6-7	8-10	11-20	21-50	51-
Short essays	25	126	319	578	1109	2811	4738	11309	3667	51
Legal texts	20	56	60	72	48	147	429	2640	5153	653
Newspaper articles	1	83	97	120	156	438	1000	3693	4401	222
Business news	1	0	2	11	158	114	502	3741	5006	42
Fiction	15	434	1099	1336	1397	2691	3095	5487	2864	146
Computer technology	104	142	108	80	130	266	681	3643	4430	184
All	166	841	1685	2197	2998	6467	10445	30513	25521	1298

Table 3. Distribution of the entire syntax tree width in treebank sentences
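The width measurement can be sketched in Python in the same spirit. The sketch assumes that every word is a <w> element and every punctuation mark a <c> element somewhere below the sentence element, as in the corpus description; the function names width and width_bin and the sample sentence are hypothetical.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def width(sentence_el):
    """Count the words (<w>) and punctuation marks (<c>) of a sentence."""
    return sum(1 for el in sentence_el.iter() if el.tag in ("w", "c"))


def width_bin(n):
    """Map a width to the ranges used in the table."""
    if n <= 5:
        return str(n)
    for upper, label in ((7, "6-7"), (10, "8-10"), (20, "11-20"), (50, "21-50")):
        if n <= upper:
            return label
    return ">50"


sentence = ET.fromstring(
    "<s><CP><NP><w>Pista</w><w>kabátja</w></NP><c>.</c></CP></s>"
)
print(width(sentence))  # 3

# Histogram over a few made-up widths, grouped as in the table.
print(Counter(width_bin(n) for n in [1, 6, 7, 60]))
```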

Creators of Szeged Treebank 2.0

Consortium partners:

University of Szeged, Department of Informatics, HLT Group

MorphoLogic Ltd. Budapest

Research Institute for Linguistics at the Hungarian Academy of Sciences, Department of Corpus Linguistics

Project Leaders

János Csirik	University of Szeged	e-mail: csirik@inf.u-szeged.hu
Tibor Gyimóthy	University of Szeged	e-mail: gyimothy@inf.u-szeged.hu
Gábor Prószéky	MorphoLogic Ltd.	e-mail: proszeky@morphologic.hu
Balázs Kis	MorphoLogic Ltd.	e-mail: kis@morphologic.hu
Tamás Váradi	Research Institute for Linguistics	e-mail: varadi@nytud.hu

Further project members from the University of Szeged

Zoltán Alexin	University of Szeged	e-mail: alexin@inf.u-szeged.hu
Dóra Csendes	University of Szeged	e-mail: dcsendes@inf.u-szeged.hu
Richárd Farkas	University of Szeged	e-mail: rfarkas@inf.u-szeged.hu
András Hócza	University of Szeged	e-mail: hocza@inf.u-szeged.hu
András Kocsor	University of Szeged	e-mail: kocsor@inf.u-szeged.hu
Kornél Kovács	University of Szeged	e-mail: kkornel@inf.u-szeged.hu
György Szarvas	University of Szeged	e-mail: szarvas@inf.u-szeged.hu

Annotators (students of Hungarian linguistics at the University of Szeged)

Anikó Formanek, Kinga Konczer, Ildikó Korpa, Éva Nagy, Krisztián Pálmai, Ágnes Szabó, Bernadett Szőke, Csilla Tóth, Veronika Vincze

Programmers (students of informatics at the University of Szeged)

András Appelshoffer, Tibor Bakota, Csongor Barta, Szabolcs Iván, András Miháczi, Miklós Rácz, György Soponyai
