University of Szeged Natural Language Processing Group Hungarian Academy of Sciences

Szeged Treebank

The Szeged Treebank is the largest fully manually annotated treebank of the Hungarian language. It contains 82,000 sentences, 1.2 million words and 250,000 punctuation marks. Texts were selected from six different domains, approximately 200,000 words from each. The domains are the following:

  • fiction
  • compositions by pupils aged 14 to 16
  • newspaper articles (from the newspapers Népszabadság, Népszava, Magyar Hírlap, HVG)
  • texts in informatics
  • legal texts
  • business and financial news

The treebank exists in three versions:

  • Szeged Treebank 1.0 is annotated for noun phrases and clauses;
  • Szeged Treebank 2.0 contains a deep phrase-structured syntactic analysis for all sentences;
  • Szeged Dependency Treebank contains dependency-style annotation of all sentences.

A morphologically reannotated version of the corpus, Szeged Corpus 2.5, has just been released. In it, participles and causative, frequentative and modal verbs are distinctively marked, unknown or misspelled words have been corrected, and some minor morphological modifications have been made.
If you are interested in Szeged Corpus 2.5, please contact Veronika Vincze.

Baseline experiments

We conducted baseline experiments on the Szeged Dependency Treebank with three state-of-the-art dependency parsers: MALT (Nivre et al. 2004), MST (McDonald et al. 2005) and the Bohnet parser (Bohnet 2010). Results are presented in Farkas et al. (2012). The training/development/test splits and the cross-validation splits can also be accessed by sending a licence agreement (see below). A detailed classification of parsing errors can be downloaded here.

Conversion from constituency to dependency

Because the same set of texts carries manual annotation for both constituency and dependency syntax, it is possible to evaluate the quality of a rule-based automatic conversion from constituency to dependency trees. We automatically converted the constituency treebank into dependency trees following the principles described here. The accuracy of the conversion was 96.51 (ULA) and 93.85 (LAS). For a detailed error analysis, please refer to Simkó et al. (2014).
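
As an illustration of how scores of this kind are usually computed, the following is a minimal sketch of unlabeled (ULA) and labeled (LAS) attachment scoring, comparing converted trees against manually annotated ones. It is not the evaluation script used for the figures above. The function name, the token representation as (head index, label) pairs and the labels in the example are assumptions made for illustration, and whether punctuation tokens are counted is a choice this sketch leaves to the caller.

```python
from typing import List, Tuple

# One token = (head index, dependency label); head index 0 stands for the
# artificial root. This representation is an assumption of the sketch.
Token = Tuple[int, str]


def attachment_scores(gold: List[List[Token]],
                      pred: List[List[Token]]) -> Tuple[float, float]:
    """Return (ULA, LAS) as percentages over all tokens in all sentences."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must contain the same number of sentences")
    total = head_ok = both_ok = 0
    for g_sent, p_sent in zip(gold, pred):
        if len(g_sent) != len(p_sent):
            raise ValueError("sentences must have the same number of tokens")
        for (g_head, g_label), (p_head, p_label) in zip(g_sent, p_sent):
            total += 1
            if g_head == p_head:          # correct head: counts for ULA
                head_ok += 1
                if g_label == p_label:    # correct head and label: counts for LAS
                    both_ok += 1
    if total == 0:
        raise ValueError("no tokens to evaluate")
    return 100.0 * head_ok / total, 100.0 * both_ok / total


# Hypothetical four-token sentence; the labels are illustrative only.
gold = [[(2, "SUBJ"), (0, "ROOT"), (2, "OBJ"), (2, "PUNCT")]]
pred = [[(2, "SUBJ"), (0, "ROOT"), (2, "OBL"), (3, "PUNCT")]]
ula, las = attachment_scores(gold, pred)
print(f"ULA = {ula:.2f}, LAS = {las:.2f}")  # ULA = 75.00, LAS = 50.00
```

In the example, three of the four tokens receive the correct head and two of those also receive the correct label, so LAS can never exceed ULA.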

Coreference-annotated version

A section of the Szeged Treebank has been manually annotated for coreference relations. It is freely available for research and educational purposes. If you are interested in this version, please contact Veronika Vincze.

References

  • Szeged Treebank 1.0:
    Csendes, Dóra; Csirik, János; Gyimóthy, Tibor 2004: The Szeged Corpus: A POS Tagged and Syntactically Annotated Hungarian Natural Language Corpus. In: Proceedings of the 5th International Workshop on Linguistically Interpreted Corpora (LINC 2004) at The 20th International Conference on Computational Linguistics (COLING 2004), Geneva, Switzerland, 23-29 August, pp. 19-23.
  • Szeged Treebank 2.0:
    Csendes, Dóra; Csirik, János; Gyimóthy, Tibor; Kocsor, András 2005: The Szeged Treebank. In: Matoušek, Václav et al. (eds.): Proceedings of the 8th International Conference on Text, Speech and Dialogue (TSD 2005), Karlovy Vary, Czech Republic, September 12-16, 2005, Springer LNAI 3658, pp. 123-131.
  • Szeged Dependency Treebank:
    Vincze, Veronika; Szauter, Dóra; Almási, Attila; Móra, György; Alexin, Zoltán; Csirik, János 2010: Hungarian Dependency Treebank. In: Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC'10), Valletta, Malta.
  • Dependency parsing:
    Farkas, Richárd; Vincze, Veronika; Schmid, Helmut 2012: Dependency Parsing of Hungarian: Baseline Results and Challenges. In: Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2012), pp. 55-65.
  • Conversion from constituency to dependency:
    Simkó, Katalin Ilona; Vincze, Veronika; Szántó, Zsolt; Farkas, Richárd 2014: An Empirical Evaluation of Automatic Conversion from Constituency to Dependency in Hungarian. Accepted to: COLING 2014.

Download

In order to have access to the corpora, a signed licence agreement should be sent to Veronika Vincze (vinczev AT inf.u-szeged.hu or fax number: +3662546737).

Licences:

Szeged Treebank 1.0
Szeged Treebank 2.0
Szeged Dependency Treebank