--------------------------------------------------------------
--------------------------------------------------------------
Documentation for HunOr, a Hungarian-Russian parallel corpus
--------------------------------------------------------------
--------------------------------------------------------------

----------------------
Introduction
----------------------

The HunOr corpus currently comprises approximately 800 thousand words and is being continuously extended. The texts come from various sources, such as printed editions and electronic publications. All corpus texts are morphologically analyzed; some of them are also manually aligned and annotated for Named Entities.

----------------------
Data
----------------------

The HunOr corpus consists of three subcorpora based on text genre: literary, scientific and official language. A newspaper subcorpus is planned to be added in the near future.

Literary texts

    Boris Akunin - Grigory Chartishvili: Kladbisenskie istorii 'Cemetery Stories'
    Fyodor Mikhaylovich Dostoevsky: Zapiski iz podpolya 'Notes from Underground'
    Ilya Ilf, Yevgeny Petrov: Dvenadtsat stulyev 'The Twelve Chairs'
    Isaak Emmanuilovich Babel: Konarmija 'Red Cavalry'
    Nikolay Vasilyevich Gogol: Zapiski sumasshedshego 'Diary of a Madman'
    Frigyes Karinthy: Tanár úr, kérem 'Please Sir'
    Ferenc Móra: Aranykoporsó 'The Gold Coffin'
    Géza Gárdonyi: Egri csillagok 'Stars of Eger'
    Kálmán Mikszáth: A fekete város 'The Black Town'
    Jenő Rejtő: A tizennégy karátos autó 'The 14-carat roadster'

Scientific texts

    Vitaly Orlov: Hranitel nenuzhnih veshey 'The keeper of needless things'
    Nikolay Berdyaev: O vecno-babyom v russkoy duse 'About the "eternal femininity" in the Russian soul'

Official texts

    A magyar kultúra ezer esztendeje 'One thousand years of Hungarian culture'
    Nemzeti jelképek, nemzeti ünnepek 'National symbols, national days'
    Magyar Nobel-díjasok egy jobb világért 'Nobel laureates from Hungary for a better world'
    Törvény a szomszédos államokban élő magyarokról: érdekek és célok 'Act on Hungarians living in neighbouring countries: interests and goals'

Corpus texts were automatically split into sentences and POS-tagged: see the directories "ru_mor_sb" and "hu_mor_sb". Some texts were manually aligned (directory "parallel"), and Named Entities were marked in a few texts (directories "hu_ne" and "ru_ne"). The trilingual (i.e. English-Russian-Hungarian) part of the corpus can be found in directory "3", together with the morphologically analyzed texts.

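The directory layout described above can be checked with a short script. The following is only a minimal sketch in Python: the directory names are taken from this documentation, while the function name check_layout and the example path "HunOr" are hypothetical. Since the file formats inside the directories are not described here, the sketch only tests whether each directory is present.

```python
from pathlib import Path

# Directory names as given in this documentation. The descriptions
# paraphrase the text above; the files inside are not inspected.
HUNOR_DIRS = {
    "ru_mor_sb": "Russian texts, sentence-split and POS-tagged",
    "hu_mor_sb": "Hungarian texts, sentence-split and POS-tagged",
    "parallel": "manually aligned texts",
    "hu_ne": "Hungarian texts with Named Entity annotation",
    "ru_ne": "Russian texts with Named Entity annotation",
    "3": "trilingual (English-Russian-Hungarian) part",
}


def check_layout(corpus_root):
    """Return {directory name: True/False} for each documented directory."""
    root = Path(corpus_root)
    return {name: (root / name).is_dir() for name in HUNOR_DIRS}


if __name__ == "__main__":
    # "HunOr" is a hypothetical path to the unpacked corpus.
    for name, present in check_layout("HunOr").items():
        status = "found" if present else "missing"
        print(f"{name:10} {status:8} {HUNOR_DIRS[name]}")
```

Run from the directory that contains the unpacked corpus, the script prints one line per documented directory and marks it as found or missing.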
----------------------
Corpus statistics
----------------------

Text genre    Tokens                  Sentences
              Russian    Hungarian    Russian    Hungarian
Literature    789,001    798,641      67,021     61,505
Scientific      6,683      7,228         370        348
Official       14,774     13,522         668        568
Total         810,458    819,391      68,059     62,421

----------------------
Contact
----------------------

For further information please contact Veronika Vincze (vinczev AT inf.u-szeged.hu).

----------------------
References
----------------------

Szabó, Martina Katalin; Vincze, Veronika; Nagy T., István 2012: HunOr: A Hungarian-Russian Parallel Corpus. In: Proceedings of LREC 2012.