Multiword expressions are lexical units that consist of two or more words (tokens) but exhibit special syntactic, semantic, pragmatic or statistical features. From an NLP point of view, their treatment poses two problems. On the one hand, a system should recognize that they count as one lexical unit (and not as two or more connected words), so it is advisable to store them as single units in the lexicon. On the other hand, the system should also include special rules for their treatment.
Identifying multiword expressions is not straightforward, since constructions with similar syntactic structure (e.g. verb + noun combinations) can belong to different subclasses on the productivity scale (i.e. productive combinations, light verb constructions and idioms). That is why well-designed and annotated corpora of multiword expressions are invaluable resources for training and testing algorithms that identify multiword expressions.
Our research group developed a version of the Szeged Treebank 2.0 that is annotated for light verb constructions, and light verb constructions are also marked in some parts of the SzegedParalell corpus. In addition, 50 articles from the English Wikipedia (the Wiki50 corpus) were annotated for several types of multiword expressions and named entities (NEs).
We also implemented a rule-based system that identifies noun compounds and light verb constructions in raw texts. The system is described in detail in Nagy T. et al. (2011) and Vincze et al. (2011). It was evaluated on the Wiki50 and SzegedParalellFX corpora. As the annotations of SzegedParalellFX have recently been revised, we hereby update the results obtained on the corpus (originally reported in Nagy T. et al. (2011), Table 3):
| | Wiki50 base | Wiki50 syntax | SzegedParalellFX base | SzegedParalellFX adapted | SzegedParalellFX base syntax | SzegedParalellFX adapted syntax |
|---|---|---|---|---|---|---|
| Suffix AND MFV | 10.05/44.05/16.37 | 9.24/46.58/15.42 | 10.68/39.61/16.82 | 10.42/49.38/17.20 | 10.55/44.51/17.05 | 10.29/62.20/17.65 |
| Suffix OR MFV | 61.41/19.76/29.89 | 57.88/23.99/33.92 | 68.36/19.27/30.06 | 60.29/21.14/31.30 | 65.49/23.14/34.19 | 57.81/26.70/36.53 |
| Suffix AND stem | 11.96/10.35/11.10 | 11.14/12.28/11.68 | 11.20/12.41/11.77 | 11.20/11.75/11.47 | 11.07/15.68/12.98 | 11.07/15.68/12.98 |
| Suffix OR stem | 57.61/8.88/15.38 | 54.35/11.46/18.93 | 64.97/8.86/15.59 | 64.97/8.86/15.59 | 62.37/11.85/19.92 | 62.37/11.85/19.92 |
| MFV AND stem | 36.96/39.53/38.20 | 34.78/46.55/39.81 | 45.44/32.74/38.06 | 41.80/39.29/40.50 | 43.10/37.49/40.10 | 39.84/48.65/43.81 |
| MFV OR stem | 68.75/10.42/18.09 | 64.67/13.36/22.15 | 79.56/9.90/17.61 | 74.87/9.91/17.50 | 76.43/13.22/22.55 | 71.74/13.35/22.51 |
| Suffix AND MFV AND stem | 7.34/47.37/12.71 | 6.79/50.00/11.96 | 7.68/42.14/13.00 | 8.07/50.00/13.90 | 7.55/48.33/13.06 | 7.94/64.89/14.15 |
| Suffix OR MFV OR stem | 72.28/10.16/17.82 | 68.21/13.04/21.89 | 80.47/9.62/17.19 | 76.43/9.65/17.14 | 77.34/12.79/21.95 | 73.31/12.91/21.95 |
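The three slash-separated values in each cell are not labelled here, but the third appears to be the harmonic mean (F-measure) of the first two. The Python sketch below checks this reading on a few cells copied from the table. The names `f_measure`, `parse_cell` and `check_cells`, and the tolerance of 0.02 used to absorb rounding in the published figures, are illustrative assumptions and not part of the system described in the cited papers.

```python
def f_measure(first: float, second: float) -> float:
    """Return the harmonic mean of two scores (0.0 when both are zero)."""
    total = first + second
    if total == 0:
        return 0.0
    return 2 * first * second / total


def parse_cell(cell: str) -> tuple:
    """Split a cell such as '10.05/44.05/16.37' into three floats."""
    first, second, third = (float(part) for part in cell.split("/"))
    return first, second, third


# A few cells copied from the table: (row label, column label) -> cell text.
SAMPLE_CELLS = {
    ("Suffix AND MFV", "Wiki50 base"): "10.05/44.05/16.37",
    ("MFV AND stem", "Wiki50 base"): "36.96/39.53/38.20",
    ("Suffix OR MFV OR stem", "Wiki50 base"): "72.28/10.16/17.82",
    ("MFV AND stem", "SzegedParalellFX adapted syntax"): "39.84/48.65/43.81",
}

# Assumed tolerance: the table values are rounded to two decimals.
TOLERANCE = 0.02


def check_cells(cells: dict) -> dict:
    """Map each cell key to True if its third value matches the harmonic mean."""
    results = {}
    for key, text in cells.items():
        first, second, reported = parse_cell(text)
        results[key] = abs(f_measure(first, second) - reported) < TOLERANCE
    return results


if __name__ == "__main__":
    for (row, column), ok in check_cells(SAMPLE_CELLS).items():
        status = "consistent" if ok else "inconsistent"
        print(f"{row} / {column}: {status}")
```

The same check can be run over every cell of the table to spot transcription errors after the results are updated.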
Nagy T., István; Vincze, Veronika 2011: Identifying verbal collocations in Wikipedia articles. In: Habernal, Ivan; Matoušek, Václav (eds.): Proceedings of the 14th International Conference on Text, Speech and Dialogue (TSD2011). Berlin, Heidelberg, Springer Verlag, LNAI 6836, pp. 179-186.
Nagy T., István; Vincze, Veronika; Berend, Gábor 2011: Domain-dependent identification of multiword expressions. In: Proceedings of RANLP 2011. Hissar, Bulgaria, pp. 622-627.
Vincze, Veronika 2011: Semi-Compositional Noun + Verb Constructions: Theoretical Questions and Computational Linguistic Analyses. PhD thesis, University of Szeged, August 2011.
Vincze, Veronika; Csirik, János 2010: Hungarian Corpus of Light Verb Constructions. In: Proceedings of COLING 2010, Beijing, China, pp. 1110-1118.
Vincze, Veronika; Nagy T., István; Berend, Gábor 2011: Detecting noun compounds and light verb constructions: a contrastive study. In: Proceedings of ACL Workshop on Multiword Expressions: from Parsing and Generation to the Real World. Portland, Oregon, USA, pp. 116-121.
Vincze, Veronika; Nagy T., István; Berend, Gábor 2011: Multiword expressions and Named Entities in the Wiki50 corpus. In: Proceedings of RANLP 2011. Hissar, Bulgaria, pp. 289-295.
Vincze, Veronika 2012: Light Verb Constructions in the SzegedParalellFX English-Hungarian Parallel Corpus. In: Proceedings of the Eighth Conference on International Language Resources and Evaluation (LREC 2012). Istanbul, Turkey, pp. 2381-2388.
For further information please contact Veronika Vincze (vinczev AT inf.u-szeged.hu).