University of Szeged Natural Language Processing Group Hungarian Academy of Sciences

A toolkit for linguistic processing of Hungarian

The magyarlanc toolkit performs basic linguistic processing of Hungarian texts. It consists solely of Java modules (there are no wrappers for other programming languages), which ensures platform independence and allows it to be integrated into larger systems (e.g. web servers).

The modules of the toolkit are:

  • Sentence splitter (a modified version of the SentenceSplitter of morphadorner, adapted to Hungarian)
  • Tokenizer (a modified version of the Tokenizer of morphadorner, adapted to Hungarian)
  • POS tagger and lemmatizer
    • A modified version of the Stanford POS tagger, which uses all the possible tags offered by the morphological parser for unknown words
    • The morphological parser is based on the finite-state automata written by György Gyepesi, which were built on the morphdb.hu resource.
    • The result of the morphological parsing (KR code) is converted to MSD code.
    • The model was trained on the Szeged Treebank on a reduced set of MSD codes.
    • Lemmas also contain derivational suffixes (a convention in the MSD system).
  • Stopword filtering
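
The way the tagger handles unknown words can be illustrated with a small sketch: the candidate tags for an unknown word are limited to those offered by the morphological parser, and the best-scoring candidate is chosen. This is only an illustration under stated assumptions. The class name, the method name, the scores and the one-letter tags are hypothetical placeholders; they are not magyarlanc or Stanford POS tagger code, and the tags are not real MSD codes.

```java
import java.util.Map;
import java.util.Set;

/** Illustrative only: none of these names come from magyarlanc or the Stanford tagger. */
final class UnknownWordTagging {

    /**
     * Returns the best-scoring tag among those the morphological parser offers.
     * If the parser offers no tags at all, every scored tag is considered.
     * Returns null when no scored tag is allowed.
     */
    static String chooseTag(Map<String, Double> tagScores, Set<String> parserTags) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> entry : tagScores.entrySet()) {
            boolean allowed = parserTags.isEmpty() || parserTags.contains(entry.getKey());
            if (allowed && entry.getValue() > bestScore) {
                best = entry.getKey();
                bestScore = entry.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical tagger scores for one unknown word (placeholder tags).
        Map<String, Double> scores = Map.of("N", 0.2, "V", 0.5, "A", 0.3);
        // Hypothetical set of analyses returned by the morphological parser.
        Set<String> offered = Set.of("N", "A");
        // "V" scores highest overall but is not offered, so "A" is printed.
        System.out.println(chooseTag(scores, offered));
    }
}
```

In this sketch the parser acts as a filter on the tagger's hypotheses, so a tag that is morphologically impossible for the word form cannot be selected even if the statistical model prefers it.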

UIMA adapters

The UIMA (Unstructured Information Management Architecture) framework supports the development of software systems that process large amounts of unstructured data. Apache UIMA is an open source implementation of the UIMA specification, which is especially tailored to the processing of textual documents.

The UIMA framework is platform independent and relies on standard solutions wherever possible. Its main goals are that each processing module can be easily integrated into processing chains ("just download and use") and that users can easily select the most appropriate component (components fulfilling the same role are interchangeable).

The framework makes it possible to divide a complex problem into several smaller subproblems, such as sentence splitting, tokenization and named entity recognition. Each processing unit implements a specific interface (in Java or C++). The framework supervises the construction and execution of the processing chain, and it is also responsible for the data flow between units, for measuring the performance of the system, etc.
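
The idea of a processing chain can be sketched in plain Java as follows. This is a simplified stand-in and not the Apache UIMA API: the names `ProcessingUnit`, `Document` and `run` are hypothetical, the shared `Document` object only loosely mirrors the role of the data structure that UIMA passes between units, and the naive regex-based splitter and tokenizer are placeholders for the real magyarlanc components.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: a plain-Java stand-in, not the Apache UIMA API. */
final class ChainSketch {

    /** Shared container that carries the data from one unit to the next. */
    static final class Document {
        final String text;
        final List<String> sentences = new ArrayList<>();
        final List<List<String>> tokens = new ArrayList<>();
        Document(String text) { this.text = text; }
    }

    /** The common interface that every processing unit implements. */
    interface ProcessingUnit {
        String name();
        void process(Document doc);
    }

    /** Naive splitter: breaks after '.', '!' or '?' followed by whitespace. */
    static final class NaiveSentenceSplitter implements ProcessingUnit {
        public String name() { return "sentence splitter"; }
        public void process(Document doc) {
            for (String s : doc.text.split("(?<=[.!?])\\s+")) {
                if (!s.isEmpty()) doc.sentences.add(s);
            }
        }
    }

    /** Naive tokenizer: runs of letters/digits, or single punctuation marks. */
    static final class NaiveTokenizer implements ProcessingUnit {
        private static final Pattern TOKEN =
                Pattern.compile("[\\p{L}\\p{N}]+|[^\\s\\p{L}\\p{N}]");
        public String name() { return "tokenizer"; }
        public void process(Document doc) {
            for (String sentence : doc.sentences) {
                List<String> toks = new ArrayList<>();
                Matcher m = TOKEN.matcher(sentence);
                while (m.find()) toks.add(m.group());
                doc.tokens.add(toks);
            }
        }
    }

    /** Runs the units in order and records each unit's running time in nanoseconds. */
    static Map<String, Long> run(List<ProcessingUnit> chain, Document doc) {
        Map<String, Long> timings = new LinkedHashMap<>();
        for (ProcessingUnit unit : chain) {
            long start = System.nanoTime();
            unit.process(doc);
            timings.put(unit.name(), System.nanoTime() - start);
        }
        return timings;
    }

    public static void main(String[] args) {
        Document doc = new Document("Ez egy mondat. Ez is az!");
        Map<String, Long> timings =
                run(List.of(new NaiveSentenceSplitter(), new NaiveTokenizer()), doc);
        System.out.println(doc.sentences);
        System.out.println(doc.tokens);
        System.out.println(timings.keySet());
    }
}
```

Because every unit implements the same interface, a component can be swapped for another one fulfilling the same role without changing the code that runs the chain.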

Download

  • magyarlanc v0.6 (2010.06.24) [jar]
  • morphadorner adapted to Hungarian [jar]
  • magyarlanc in UIMA
    • magyarlanc (sentence splitter, tokenizer, POS tagger) [pear]
    • sentence splitter [pear]
    • tokenizer [pear]

Terms of use

The toolkit can be used free of charge under the Creative Commons Attribution-ShareAlike licence.

When using the toolkit, please refer to the following publications:

Zsibrita, János; Nagy, István; Farkas, Richárd 2009: Magyar nyelvi elemző modulok az UIMA keretrendszerhez. In: Tanács Attila, Szauter Dóra, Vincze Veronika (eds.): VI. Magyar Számítógépes Nyelvészeti Konferencia. Szeged, Szegedi Tudományegyetem, pp. 394-395.

Zsibrita, János; Vincze, Veronika; Farkas, Richárd 2010: Ismeretlen kifejezések és a szófaji egyértelműsítés. In: Tanács Attila, Vincze Veronika (eds.): VII. Magyar Számítógépes Nyelvészeti Konferencia. Szeged, Szegedi Tudományegyetem, pp. 275-283.

For further information please contact Richárd Farkas (rfarkas AT inf.u-szeged.hu).