University of Szeged Natural Language Processing Group Hungarian Academy of Sciences

Multiword expressions

Multiword expressions are lexical units that consist of two or more words (tokens) but exhibit special syntactic, semantic, pragmatic or statistical features. From an NLP point of view, their treatment poses two problems. On the one hand, the system should recognize that they count as one lexical unit (and not as two or more connected words), so it is advisable to store them as one unit in the lexicon. On the other hand, special rules for their treatment should also be included in the system.
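
To illustrate what storing a multiword expression as one unit in the lexicon can mean in practice, here is a minimal sketch of greedy longest-match lookup. The lexicon entries and the function name merge_mwes are hypothetical and chosen only for illustration; this is not the method used to build the corpora described on this page.

```python
# Toy lexicon; the entries are hypothetical and reuse examples from this page.
MWE_LEXICON = {
    ("high", "school"),
    ("give", "a", "lecture"),
    ("put", "off"),
}
MAX_LEN = max(len(entry) for entry in MWE_LEXICON)


def merge_mwes(tokens):
    """Greedy longest-match: join any lexicon entry into a single unit."""
    units = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first so longer entries win over shorter ones.
        for n in range(min(MAX_LEN, len(tokens) - i), 1, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + n])
            if candidate in MWE_LEXICON:
                units.append(" ".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No multiword entry starts here; keep the token as its own unit.
            units.append(tokens[i])
            i += 1
    return units


print(merge_mwes("She will give a lecture at the high school".split()))
# ['She', 'will', 'give a lecture', 'at', 'the', 'high school']
```

A lookup of this kind only covers contiguous, uninflected forms. Inflected verbs (gave a lecture) and discontinuous expressions are exactly the cases for which special treatment rules are needed.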

Identifying multiword expressions is not straightforward, since constructions with similar syntactic structure (e.g. verb + noun combinations) can belong to different subclasses on the productivity scale (i.e. productive combinations, light verb constructions and idioms). That is why well-designed, annotated corpora of multiword expressions are invaluable resources for training and testing algorithms that identify multiword expressions.

A compound is a lexical unit that consists of two or more elements that also exist on their own. Orthographically, a compound may contain spaces (high school), a hyphen (well-known) or neither (headmaster).
Light verb constructions (LVCs) consist of a nominal and a verbal component, where the noun is usually taken in one of its literal senses but the verb usually loses its original sense to some extent, e.g. to give a lecture, to come into bloom, the problem lies (in).
Verb-particle constructions (VPCs, also called phrasal verbs or phrasal-prepositional verbs) are composed of a verb and a particle/preposition, which can be adjacent (as in put off) or separated by an intervening object (turn the light off).
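
The difference between adjacent and separated verb-particle constructions can be sketched with a simple token-window search. The pair list, the MAX_GAP limit and the function name find_vpcs are assumptions made for this illustration; it is a naive string-matching sketch, not the syntax-based approach described in the papers listed under References.

```python
# Hypothetical verb-particle pairs, taken from the examples above.
VPC_PAIRS = {("put", "off"), ("turn", "off")}
MAX_GAP = 3  # assumed limit on the number of intervening object tokens


def find_vpcs(tokens):
    """Return (verb_index, particle_index, kind) for each candidate VPC."""
    lowered = [t.lower() for t in tokens]
    hits = []
    for i, verb in enumerate(lowered):
        # Check the next token (adjacent) and up to MAX_GAP tokens beyond it (split).
        for j in range(i + 1, min(i + 2 + MAX_GAP, len(lowered))):
            if (verb, lowered[j]) in VPC_PAIRS:
                kind = "adjacent" if j == i + 1 else "split"
                hits.append((i, j, kind))
                break
    return hits


print(find_vpcs("put off the meeting".split()))  # [(0, 1, 'adjacent')]
print(find_vpcs("turn the light off".split()))   # [(0, 3, 'split')]
```

Pure string matching of this kind cannot tell a particle from a preposition that merely follows the verb, which is one reason why manually annotated data is needed for training and evaluating real detectors.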

Corpora

We have created several manually annotated corpora, which are listed below.

SzegedParalellFX
The SzegedParalell English-Hungarian parallel corpus forms the basis of SzegedParalellFX, in which light verb constructions are manually annotated. Three novels, texts from magazines and language books, and economic and legal texts were selected for annotation. Light verb constructions are annotated in both languages. The corpus has 14,261 sentence alignment units, which contain 1370 occurrences of light verb constructions.

Szeged Treebank FX
The Szeged Treebank - a database in which words are morphosyntactically tagged and sentences are syntactically parsed - was manually annotated for light verb constructions. The corpus texts cover the following domains: student essays, short business news, newspaper texts, laws, computer texts and literature. This version of the Treebank contains 6734 occurrences of 1215 light verb constructions altogether in 82,099 sentences.

Wiki50
The Wiki50 corpus contains 50 English Wikipedia articles (4350 sentences), in which several types of multiword expressions and four classes of Named Entities were manually annotated by professional linguists. This is the first corpus in which multiword expressions and named entities are annotated at the same time. Corpus data make it possible to investigate the co-occurrences of different types of MWEs and NEs within the same domain and also to train and evaluate MWE detectors and NER applications.

4FX
The 4FX corpus contains English, Spanish, German and Hungarian legislative texts from the JRC-Acquis Multilingual Parallel Corpus, which are manually annotated for light verb constructions, following standardized annotation principles. The corpus contains 673 LVCs in English, 806 in German, 938 in Spanish and 1059 in Hungarian.

CoNLL-2003 dataset annotated for LVCs
The CoNLL-2003 dataset was originally developed for named entity recognition in the short news domain. We randomly selected 500 pieces of short news from the CoNLL-2003 dataset and annotated the LVCs in them. This corpus contains 381 occurrences of manually annotated LVCs in 8,467 sentences.
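
The figures given above allow a rough comparison of how frequent LVCs are in the corpora for which both an LVC count and a sentence count are stated. The sketch below only divides the numbers quoted in the descriptions; the dictionary and function names are hypothetical.

```python
# (LVC occurrences, sentences or alignment units), as stated in the descriptions above.
CORPUS_COUNTS = {
    "SzegedParalellFX": (1370, 14261),
    "Szeged Treebank FX": (6734, 82099),
    "CoNLL-2003 (LVC subset)": (381, 8467),
}


def lvcs_per_100(lvcs, sentences):
    """LVC occurrences per 100 sentences (or alignment units)."""
    return 100.0 * lvcs / sentences


for name, (lvcs, sentences) in CORPUS_COUNTS.items():
    print(f"{name}: {lvcs_per_100(lvcs, sentences):.1f} LVCs per 100 sentences")
```

This gives roughly 9.6, 8.2 and 4.5 LVCs per 100 units, respectively. The values are not directly comparable: SzegedParalellFX counts alignment units with LVCs annotated in two languages, and the corpora differ in language and domain.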

Downloads

Corpora


The resources can be used free of charge under the Creative Commons Attribution-ShareAlike licence.
Corpus               NC    VPC   LVC
Wiki50               EN    EN    EN
SzegedParalellFX     -     -     EN, HU
Szeged Treebank FX   -     -     HU
4FX                  -     -     EN, DE, ES, HU
CoNLL-2003           -     -     EN
Article NC VPC LVC
RANLP 2011 a
LREC 2014
ACM 2013
ACL 2013
IJCNLP 2013 a
IJCNLP 2013 b
LREC 2012
RANLP 2011 b
TSD 2013
RANLP 2013 c
RANLP 2013 d
TSD 2011
MWE 2014

References

Vincze, Veronika; Nagy T., István; Berend, Gábor 2011: Multiword expressions and Named Entities in the Wiki50 corpus. In: Proceedings of RANLP 2011. Hissar, Bulgaria, pp. 289-295.

Light Verb Constructions

Rácz, Anita; Nagy T., István; Vincze, Veronika 2014: 4FX: Light Verb Constructions in a Multilingual Parallel Corpus. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), ELRA, Reykjavik, Iceland, pp. 710-715.

Vincze, Veronika; Nagy T., István; Farkas, Richárd 2013: Identifying English and Hungarian Light Verb Constructions: A Contrastive Approach. In: Proceedings of ACL 2013 (Volume 2: Short Papers), pp. 255-261.

Vincze, Veronika; Nagy T., István; Zsibrita, János 2013: Learning to Detect English and Hungarian Light Verb Constructions. ACM Transactions on Speech and Language Processing (TSLP) - Special issue on multiword expressions: From theory to practice and use. Part 1, 10(2), Article 6.

Nagy T., István; Vincze, Veronika; Farkas, Richárd 2013: Full-coverage Identification of English Light Verb Constructions. In: Proceedings of IJCNLP 2013, pp. 329-337.

Vincze, Veronika; Zsibrita, János; Nagy T., István 2013: Dependency Parsing for Identifying Hungarian Light Verb Constructions. In: Proceedings of IJCNLP 2013, pp. 207-215.

Vincze, Veronika 2012: Light Verb Constructions in the SzegedParalellFX English-Hungarian Parallel Corpus. In: Proceedings of the Eighth Conference on International Language Resources and Evaluation (LREC 2012). Istanbul, Turkey, pp. 2381-2388.

Nagy T., István; Berend, Gábor; Móra, György; Vincze, Veronika 2011: Domain-dependent detection of light verb constructions. In: Proceedings of the Student Research Workshop associated with RANLP 2011. Hissar, Bulgaria, pp. 1-8.

Nominal Compounds

Nagy T., István; Vincze, Veronika 2013: English Noun Compound Detection With Wikipedia-Based Methods. In: Proceedings of TSD 2013.

Nagy T., István; Berend, Gábor; Vincze, Veronika 2011: Noun Compound and Named Entity Recognition and their Usability in Keyphrase Extraction. In: Proceedings of RANLP 2011. Hissar, Bulgaria, pp. 162-169.

Nagy T., István; Vincze, Veronika; Berend, Gábor 2011: Domain-dependent identification of multiword expressions. In: Proceedings of RANLP 2011. Hissar, Bulgaria, pp. 622-627.

Verb Particle Constructions

Nagy T., István; Vincze, Veronika 2011: Identifying verbal collocations in Wikipedia articles. In: Habernal, Ivan; Matoušek, Václav (eds.): Proceedings of the 14th International Conference on Text, Speech and Dialogue (TSD2011). Berlin, Heidelberg, Springer Verlag, LNAI 6836, pp. 179-186.

Nagy T., István; Vincze, Veronika 2014: VPCTagger: Detecting Verb-Particle Constructions With Syntax-Based Methods. In: Proceedings of the 10th Workshop on Multiword Expressions (MWE), ACL, Gothenburg, Sweden, pp. 17-25.

For further information please contact Veronika Vincze (vinczev AT inf.u-szeged.hu).