Szeged Treebank
The Szeged Treebank is the largest fully manually annotated treebank of the Hungarian language. It contains 82,000 sentences, 1.2 million words and 250,000 punctuation marks. Texts were selected from six domains, with roughly 200,000 words taken from each. The domains are the following:
- fiction
- compositions written by pupils aged 14 to 16
- newspaper articles (from the newspapers Népszabadság, Népszava, Magyar Hírlap, HVG)
- texts in informatics
- legal texts
- business and financial news
The treebank exists in three versions:
- Szeged Treebank 1.0 is annotated for noun phrases and clauses;
- Szeged Treebank 2.0 contains a deep phrase-structured syntactic analysis for all sentences;
- Szeged Dependency Treebank contains dependency-style annotation of all sentences.
Baseline experiments
We conducted baseline experiments on the Szeged Dependency Treebank with three state-of-the-art dependency parsers: MALT (Nivre et al. 2004), MST (McDonald et al. 2005) and the Bohnet parser (Bohnet 2010). Results are presented in Farkas et al. (2012). The training/development/test splits and the cross-validation splits can also be obtained by sending a signed licence agreement (see below).
A detailed classification of parsing errors can be downloaded here.
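For readers unfamiliar with cross-validation splits, the following minimal sketch shows how sentence indices can be partitioned into folds so that every sentence is held out exactly once. It is an illustration only: the function name `k_fold_splits`, the number of folds and the round-robin assignment are assumptions made here, not a description of the official splits. Results comparable with Farkas et al. (2012) require the splits distributed with the licence.

```python
from typing import List, Tuple


def k_fold_splits(n_sentences: int, k: int = 10) -> List[Tuple[List[int], List[int]]]:
    """Return k (train_ids, test_ids) pairs over sentence indices 0..n_sentences-1.

    Hypothetical helper: sentences are assigned to folds round-robin,
    which is an assumption and not the scheme used for the official splits.
    """
    if not 1 < k <= n_sentences:
        raise ValueError("k must be between 2 and the number of sentences")

    # Fold i receives sentences i, i + k, i + 2k, ...
    folds = [list(range(i, n_sentences, k)) for i in range(k)]

    splits = []
    for held_out in range(k):
        test_ids = folds[held_out]
        # All remaining folds together form the training set.
        train_ids = sorted(
            idx
            for fold_no, fold in enumerate(folds)
            if fold_no != held_out
            for idx in fold
        )
        splits.append((train_ids, test_ids))
    return splits


if __name__ == "__main__":
    # Toy example with 20 sentences and 5 folds.
    for train_ids, test_ids in k_fold_splits(20, k=5):
        print(len(train_ids), "training sentences;", "held out:", test_ids)
```

Each parser would then be trained on `train_ids` and evaluated on `test_ids` for every fold, with the scores averaged over the folds.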
References
- Szeged Treebank 1.0:
Csendes, Dóra; Csirik, János; Gyimóthy, Tibor 2004: The Szeged Corpus: A POS Tagged and Syntactically Annotated Hungarian Natural Language Corpus. In: Proceedings of the 5th International Workshop on Linguistically Interpreted Corpora (LINC 2004) at The 20th International Conference on Computational Linguistics (COLING 2004), Geneva, Switzerland, 23-29 August, pp. 19-23.
- Szeged Treebank 2.0:
Csendes, Dóra; Csirik, János; Gyimóthy, Tibor; Kocsor, András 2005: The Szeged Treebank. In: Matoušek, Václav et al. (eds.): Proceedings of the 8th International Conference on Text, Speech and Dialogue (TSD 2005), Karlovy Vary, Czech Republic, September 12-16, 2005, Springer LNAI 3658, pp. 123-131.
- Szeged Dependency Treebank:
Vincze, Veronika; Szauter, Dóra; Almási, Attila; Móra, György; Alexin, Zoltán; Csirik, János 2010: Hungarian Dependency Treebank. In: Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC'10), Valletta, Malta.
- Dependency parsing:
Farkas, Richárd; Vincze, Veronika; Schmid, Helmut 2012: Dependency Parsing of Hungarian: Baseline Results and Challenges. In: Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2012), pp. 55-65.
Download
To obtain access to the corpora, send a signed licence agreement to Veronika Vincze (vinczev AT inf.u-szeged.hu, or by fax: +3662546737).
Licences:
Szeged Treebank 1.0
Szeged Treebank 2.0
Szeged Dependency Treebank